Your first local voice becomes real when you treat it as a release artifact.
If you cannot bundle it, checksum it, run it from a subprocess, observe it, cancel it, and explain why it failed, you have not shipped TTS. You have played with it.
The target is simple:
Take some text, run it through Piper locally, and spit out a WAV file in a way that we can safely ship to users later.
We are not chasing the best voice today. We are building a reliable speech path.
That is how product infrastructure gets strong: one boring, repeatable capability at a time.
Start With Piper Because the CLI Is a Boundary
Piper is a good first engine because it gives you a fast path to local neural speech:
- It was built from the ground up for local synthesis.
- It can be driven easily from the command line.
- It uses standard ONNX voice models.
- It bundles eSpeak NG under the hood to handle tricky pronunciations.
- It has a real, active ecosystem, thanks to projects like Home Assistant.
For product work, a CLI gives you a clean boundary. A subprocess is easy to log, easy to kill, and easy to test before you spend weeks wiring up a deep native library integration.
Install for Playing, Bundle for Shipping
If you just want to poke around, the README gives you a simple path:
python -m pip install piper-tts
That is fine for your laptop and wrong for your users. You cannot rely on users having a working Python environment.
For shipping, you need to think in terms of bundles:
resources/
tts/
bin/
piper
piper.exe
voices/
en_US-lessac-medium.onnx
en_US-lessac-medium.onnx.json
licenses/
piper-GPL-3.0.txt
voice-license.txt
checksums.sha256
catalog.json
The structure earns its keep because local TTS is code plus model files plus licenses plus updates.
When a user inevitably opens a support ticket saying, “the voice stopped working after the update,” you need to be able to answer these questions fast:
- Which specific engine binary did they run?
- Which voice model did they actually have?
- Did the downloaded model file match its checksum?
- Did the config JSON match the ONNX file?
- Did the binary have execution permissions on their OS?
- Did our app pass the text through stdin correctly?
- Did Piper crash, or exit cleanly?
A well-structured bundle is what lets you answer those questions without guessing.
Make the Voice Catalog the Contract
Do not hard-code voice paths inside app logic. Put the bundle contract in a catalog file.
{
"voices": [
{
"id": "en-us-lessac-medium",
"name": "English US - Lessac Medium",
"engine": "piper",
"language": "en-US",
"model": "voices/en_US-lessac-medium.onnx",
"config": "voices/en_US-lessac-medium.onnx.json",
"sample_rate": 22050,
"license": "licenses/voice-license.txt",
"sha256": {
"model": "replace-with-real-checksum",
"config": "replace-with-real-checksum"
}
}
]
}
This JSON file becomes the contract between your app and your voice bundle. It’s also a great place to jot down quality notes later on:
{
"strengths": ["clear short-form UI narration"],
"known_issues": ["reads some acronyms literally"],
"recommended_max_chars_per_chunk": 260
}
It feels fussy until your app supports multiple voices. Then it feels obvious.
Keep the Runtime Path Linear
A useful first runtime does five things in order:
- Normalize the raw text.
- Split it into safe, bite-sized chunks.
- Send each chunk to the Piper subprocess.
- Save the resulting WAV file to a cache.
- Tell the app it’s ready to play.
At the edge of your adapter, you want a clean interface that looks something like this:
type SynthesisRequest = {
text: string;
voiceId: string;
outputPath: string;
};
type SynthesisResult = {
outputPath: string;
engine: "piper";
voiceId: string;
characters: number;
synthesisMs: number;
audioMs?: number;
};
The adapter owns the ONNX file path, Piper arguments, and eSpeak NG details. The app should only see the speech contract.
Spawn the Subprocess Carefully
Every OS has its own way of punishing you for writing sloppy subprocess code.
Always use argument arrays instead of trying to concatenate shell strings:
import { spawn } from "node:child_process";
const child = spawn(piperPath, [
"--model",
modelPath,
"--config",
configPath,
"--output_file",
outputPath
], {
stdio: ["pipe", "pipe", "pipe"],
windowsHide: true
});
child.stdin.end(text);
Argument arrays avoid the usual subprocess mistakes:
- Paths with spaces won’t break.
- User input won’t accidentally trigger a shell injection.
- You can easily capture
stderrto debug issues. - You can cleanly kill the process if the user hits cancel.
- Windows won’t flash an annoying black terminal window every time a word is spoken.
When Piper finishes, log structured telemetry right away:
{
"event": "tts.synthesis.complete",
"engine": "piper",
"voice_id": "en-us-lessac-medium",
"chars": 184,
"synthesis_ms": 612,
"exit_code": 0,
"output_bytes": 216488
}
One rule is non-negotiable: do not log raw text by default. People use local TTS because they want privacy. Respect that in your logs.
Do Not Cache Hope. Cache Meaning.
A cache isn’t strictly required, but it makes your app feel instantly responsive for UI phrases that get repeated a lot.
To do this, hash all the parameters that actually change the audio:
cache_key = sha256(
engine_version +
voice_id +
model_checksum +
normalized_text +
speaking_rate +
pitch +
volume
)
Hash the normalized text, not the raw input. If your normalizer turns “3 files” into “three files”, those should hit the same cache key.
Cross-Platform Bugs Are Usually Boring
If you are shipping to desktop, test this early:
- Windows: Double-check that paths are quoted, hide the subprocess window, put your cache files in
%LOCALAPPDATA%, and test with a Windows username that has spaces or non-ASCII characters. - macOS: Make sure the executable bits are set, plan for Apple’s signing and notarization early, store data in
~/Library/Application Support, and test on both Intel and Apple Silicon hardware. - Linux: Manually set the executable bit, plan around AppImage/Flatpak/Snap sandboxes, use proper XDG base directories, and test audio playback on both PipeWire and PulseAudio.
Don’t just test the easy path. Test the path with spaces:
C:\Users\Mike CK\AppData\Local\Your App\tts\voices\voice.onnx
Test the path with Unicode:
/Users/mike/Library/Application Support/TTS Test - sauti/voices/voice.onnx
And test running your app from a read-only directory to make sure your audio cache still writes to the right place.
Measure on Day One
Do not wait for production to discover where speech is slow.
From the very beginning, log:
- Subprocess startup time
- Total synthesis time
- The output file size
- How long the generated audio is
- Real-time factor (RTF)
- Cache hits vs misses
- Non-zero exit codes
- Snippets of
stderr - How long it takes to cancel a job
RTF is your most important metric:
real_time_factor = synthesis_ms / audio_duration_ms
If it takes 500ms to generate 10 seconds of audio, your RTF is 0.05. If it takes 14 seconds to generate that same 10 seconds, your RTF is 1.4. That might be acceptable for background audiobook rendering. It breaks real-time UI speech.
Write a Simple Smoke Test
Your first automated test does not need to be fancy:
Given voice en-us-lessac-medium
When I synthesize "The backup completed successfully."
Then the process exits with code 0
And the output file exists
And the output file is larger than 44 bytes
And the first four bytes are "RIFF"
And the file can be decoded as WAV
This test proves the plumbing works. That is enough for a baseline CI check.
Then, add a single test sentence that a human actually listens to:
The build failed because the API token expired at 14:05 UTC.
You’ll use that exact sentence to test pacing and numbers more often than you’d expect.
The Common Path Is the Product
Delay optimization until the CLI wrapper is observable, bundle-aware, and repeatable.
Once that is true, you can put a product API in front of it. Piper today, sherpa-onnx tomorrow, eSpeak NG when things go wrong.
That is the line between a local speech demo and local speech you can ship.


