Ship Your First Local Voice with Piper, Then Make It Boring

Your first local voice becomes real when you treat it as a release artifact.

If you cannot bundle it, checksum it, run it from a subprocess, observe it, cancel it, and explain why it failed, you have not shipped TTS. You have played with it.

The target is simple:

Take some text, run it through Piper locally, and write out a WAV file, in a way you can safely ship to users later.

We are not chasing the best voice today. We are building a reliable speech path.

That is how product infrastructure gets strong: one boring, repeatable capability at a time.

Figure: Folder tree showing a Piper runtime bundle with binary, ONNX model, config, license, checksums, and voice catalog. A voice bundle should be simple and predictable so your future self can debug it quickly.

Start With Piper Because the CLI Is a Boundary

Piper is a good first engine because it gives you a fast path to local neural speech:

It was built from the ground up for local synthesis.
It can be driven easily from the command line.
It uses standard ONNX voice models.
It uses eSpeak NG under the hood for phonemization, turning text into the phonemes the voice model consumes.
It has a real, active ecosystem, thanks to projects like Home Assistant.

For product work, a CLI gives you a clean boundary. A subprocess is easy to log, easy to kill, and easy to test before you spend weeks wiring up a deep native library integration.

Install for Playing, Bundle for Shipping

If you just want to poke around, the Piper README gives you a simple path:

python -m pip install piper-tts

That is fine for your laptop and wrong for your users. You cannot rely on users having a working Python environment.

For shipping, you need to think in terms of bundles:

resources/
  tts/
    bin/
      piper
      piper.exe
    voices/
      en_US-lessac-medium.onnx
      en_US-lessac-medium.onnx.json
    licenses/
      piper-GPL-3.0.txt
      voice-license.txt
    checksums.sha256
    catalog.json

The structure earns its keep because local TTS is code plus model files plus licenses plus updates.

When a user inevitably opens a support ticket saying, “the voice stopped working after the update,” you need to be able to answer these questions fast:

Which specific engine binary did they run?
Which voice model did they actually have?
Did the downloaded model file match its checksum?
Did the config JSON match the ONNX file?
Did the binary have execution permissions on their OS?
Did our app pass the text through stdin correctly?
Did Piper crash, or exit cleanly?

A well-structured bundle is what lets you answer those questions without guessing.
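The checksum question in particular is cheap to answer in code. Here is a minimal sketch, assuming checksums.sha256 uses the common sha256sum layout (a hex digest, whitespace, then a path relative to the bundle root). The function names are hypothetical, and reading whole files into memory is a simplification you may want to replace with streaming for large models.

```typescript
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";
import { join } from "node:path";

// Hex-encoded SHA-256 of a string or byte buffer.
export function sha256Hex(data: string | Uint8Array): string {
  return createHash("sha256").update(data).digest("hex");
}

// Parse sha256sum-style lines: "<64 hex chars>  <relative path>".
// Lines that do not match (blank lines, comments) are skipped.
export function parseChecksums(manifest: string): Map<string, string> {
  const entries = new Map<string, string>();
  for (const line of manifest.split(/\r?\n/)) {
    const match = /^([0-9a-fA-F]{64})\s+\*?(.+)$/.exec(line.trim());
    if (match) entries.set(match[2], match[1].toLowerCase());
  }
  return entries;
}

// Returns the relative paths that are missing, unreadable, or do not match.
export function findBadFiles(bundleRoot: string, manifest: string): string[] {
  const bad: string[] = [];
  parseChecksums(manifest).forEach((expected, relPath) => {
    try {
      const actual = sha256Hex(readFileSync(join(bundleRoot, relPath)));
      if (actual !== expected) bad.push(relPath);
    } catch {
      bad.push(relPath); // a file we cannot read counts as a failure
    }
  });
  return bad;
}
```

Running findBadFiles at startup, and again after every update, turns "did the model match its checksum?" into a logged fact instead of a guess.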

Make the Voice Catalog the Contract

Do not hard-code voice paths inside app logic. Put the bundle contract in a catalog file.

{
  "voices": [
    {
      "id": "en-us-lessac-medium",
      "name": "English US - Lessac Medium",
      "engine": "piper",
      "language": "en-US",
      "model": "voices/en_US-lessac-medium.onnx",
      "config": "voices/en_US-lessac-medium.onnx.json",
      "sample_rate": 22050,
      "license": "licenses/voice-license.txt",
      "sha256": {
        "model": "replace-with-real-checksum",
        "config": "replace-with-real-checksum"
      }
    }
  ]
}

This JSON file becomes the contract between your app and your voice bundle. It’s also a good place to record per-voice quality notes later, as extra fields on a voice entry:

{
  "strengths": ["clear short-form UI narration"],
  "known_issues": ["reads some acronyms literally"],
  "recommended_max_chars_per_chunk": 260
}

It feels fussy until your app supports multiple voices. Then it feels obvious.
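To make the contract concrete, here is a minimal sketch of resolving a voice id into absolute paths. The field names mirror the catalog above. The resolveVoice function and the ResolvedVoice type are hypothetical names, not part of Piper.

```typescript
import { join } from "node:path";

// Shape of one catalog entry, limited to the fields this sketch uses.
type VoiceEntry = {
  id: string;
  engine: string;
  model: string;
  config: string;
  sample_rate: number;
};

type VoiceCatalog = { voices: VoiceEntry[] };

type ResolvedVoice = {
  id: string;
  modelPath: string;
  configPath: string;
  sampleRate: number;
};

// Look up a voice by id and turn its bundle-relative paths into full paths.
export function resolveVoice(
  catalog: VoiceCatalog,
  voiceId: string,
  bundleRoot: string
): ResolvedVoice {
  const entry = catalog.voices.find((voice) => voice.id === voiceId);
  if (!entry) throw new Error(`Unknown voice id: ${voiceId}`);
  if (entry.engine !== "piper") {
    throw new Error(`Unsupported engine for ${voiceId}: ${entry.engine}`);
  }
  return {
    id: entry.id,
    modelPath: join(bundleRoot, entry.model),
    configPath: join(bundleRoot, entry.config),
    sampleRate: entry.sample_rate,
  };
}
```

App code then asks for "en-us-lessac-medium" and never sees a file path, which is what keeps voice swaps a catalog change rather than a code change.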

Keep the Runtime Path Linear

Figure: Runtime pipeline from text normalization to sentence splitting, Piper process, WAV cache, and playback. Keep your first runtime path perfectly linear. Make it observable before you try to optimize it.

A useful first runtime does five things in order:

Normalize the raw text.
Split it into safe, bite-sized chunks.
Send each chunk to the Piper subprocess.
Save the resulting WAV file to a cache.
Tell the app it’s ready to play.
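The first two steps can be sketched in a few lines. This is one possible approach, not the only one: it collapses whitespace, splits at sentence-ending punctuation that is followed by a space, and packs sentences greedily under a character limit. The function names are hypothetical, and the 260 default is borrowed from the catalog note above. A real normalizer would also expand numbers and abbreviations.

```typescript
// Collapse runs of whitespace (including newlines) into single spaces.
export function normalizeWhitespace(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// Break one over-long sentence at word boundaries.
// A single word longer than maxChars is kept whole rather than cut.
function splitLong(sentence: string, maxChars: number): string[] {
  if (sentence.length <= maxChars) return sentence ? [sentence] : [];
  const pieces: string[] = [];
  let current = "";
  for (const word of sentence.split(" ")) {
    if (current && current.length + 1 + word.length > maxChars) {
      pieces.push(current);
      current = word;
    } else {
      current = current ? `${current} ${word}` : word;
    }
  }
  if (current) pieces.push(current);
  return pieces;
}

// Split text into chunks of at most maxChars, preferring sentence boundaries.
export function splitIntoChunks(text: string, maxChars = 260): string[] {
  // A sentence ends at . ! or ? followed by a space or end of text,
  // so "3.5" and "14:05" are not split in the middle.
  const sentences =
    normalizeWhitespace(text).match(/\S.*?[.!?]+(?=\s|$)|\S.*$/g) ?? [];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    for (const piece of splitLong(sentence, maxChars)) {
      if (current && current.length + 1 + piece.length > maxChars) {
        chunks.push(current);
        current = piece;
      } else {
        current = current ? `${current} ${piece}` : piece;
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Because every split happens at a single space, joining the chunks with spaces gives back the normalized text, which makes the chunker easy to test.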

At the edge of your adapter, you want a clean interface that looks something like this:

type SynthesisRequest = {
  text: string;
  voiceId: string;
  outputPath: string;
};

type SynthesisResult = {
  outputPath: string;
  engine: "piper";
  voiceId: string;
  characters: number;
  synthesisMs: number;
  audioMs?: number;
};

The adapter owns the ONNX file path, Piper arguments, and eSpeak NG details. The app should only see the speech contract.

Spawn the Subprocess Carefully

Every OS has its own way of punishing you for writing sloppy subprocess code.

Always use argument arrays instead of trying to concatenate shell strings:

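import { spawn } from "node:child_process";

// Each argument is its own array element, so no shell ever parses the paths.
const child = spawn(piperPath, [
  "--model",
  modelPath,
  "--config",
  configPath,
  "--output_file",
  outputPath
], {
  stdio: ["pipe", "pipe", "pipe"], // stdin carries the text; stdout and stderr stay capturable
  windowsHide: true // no console window on Windows
});

// The text goes in through stdin; ending the stream marks the input as complete.
child.stdin.end(text);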

Spawning this way avoids the usual subprocess mistakes:

Paths with spaces won’t break.
User input won’t accidentally trigger a shell injection.
You can easily capture stderr to debug issues.
You can cleanly kill the process if the user hits cancel.
Windows won’t flash an annoying black terminal window every time a word is spoken.
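The points above can be folded into one small wrapper. This is a sketch under assumptions: runWithStdin and RunResult are hypothetical names, stdout is ignored because the audio goes to a file, and only the tail of stderr is kept. It relies on the signal and timeout options of Node's spawn, which exist in current Node releases.

```typescript
import { spawn } from "node:child_process";

type RunResult = {
  exitCode: number | null; // null when the process was killed by a signal
  signal: NodeJS.Signals | null;
  stderr: string; // last 4096 characters only
  durationMs: number;
};

// Run a command, feed it text on stdin, and resolve when it has fully closed.
export function runWithStdin(
  command: string,
  args: string[],
  input: string,
  options: { timeoutMs?: number; signal?: AbortSignal } = {}
): Promise<RunResult> {
  return new Promise((resolve, reject) => {
    const started = Date.now();
    const child = spawn(command, args, {
      stdio: ["pipe", "ignore", "pipe"],
      windowsHide: true,
      signal: options.signal, // aborting kills the child and emits "error"
      timeout: options.timeoutMs, // the child is killed after this many ms
    });

    let stderr = "";
    child.stderr.setEncoding("utf8");
    child.stderr.on("data", (chunk: string) => {
      stderr = (stderr + chunk).slice(-4096);
    });

    // Fires when the binary cannot be started or the AbortSignal triggers.
    child.on("error", reject);
    child.on("close", (exitCode, signal) => {
      resolve({ exitCode, signal, stderr, durationMs: Date.now() - started });
    });

    // Ignore write errors if the child exits before reading all of stdin.
    child.stdin.on("error", () => {});
    child.stdin.end(input);
  });
}
```

A call would look like runWithStdin(piperPath, ["--model", modelPath, "--config", configPath, "--output_file", outputPath], text, { timeoutMs: 30000 }), where the 30 second budget is an arbitrary example. Wiring a cancel button to an AbortController then gives you cancellation without extra process bookkeeping.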

When Piper finishes, log structured telemetry right away:

{
  "event": "tts.synthesis.complete",
  "engine": "piper",
  "voice_id": "en-us-lessac-medium",
  "chars": 184,
  "synthesis_ms": 612,
  "exit_code": 0,
  "output_bytes": 216488
}

One rule is non-negotiable: do not log raw text by default. People use local TTS because they want privacy. Respect that in your logs.
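One way to enforce that rule is to make the event builder the only code that touches the text, and have it keep the length and nothing else. A minimal sketch, with hypothetical type and function names and the same field names as the event above:

```typescript
type SynthesisEvent = {
  event: "tts.synthesis.complete";
  engine: "piper";
  voice_id: string;
  chars: number;
  synthesis_ms: number;
  exit_code: number | null;
  output_bytes: number;
};

// Build the log event. The raw text is reduced to a character count here,
// so nothing downstream of this function can leak it into a log file.
export function buildSynthesisEvent(input: {
  voiceId: string;
  text: string;
  synthesisMs: number;
  exitCode: number | null;
  outputBytes: number;
}): SynthesisEvent {
  return {
    event: "tts.synthesis.complete",
    engine: "piper",
    voice_id: input.voiceId,
    chars: input.text.length,
    synthesis_ms: Math.round(input.synthesisMs),
    exit_code: input.exitCode,
    output_bytes: input.outputBytes,
  };
}
```

If you later need the text for debugging, make it an explicit opt-in setting rather than a field that happens to be there.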

Do Not Cache Hope. Cache Meaning.

A cache isn’t strictly required, but it makes your app feel instantly responsive for UI phrases that get repeated a lot.

To do this, hash all the parameters that actually change the audio:

cache_key = sha256(
  engine_version +
  voice_id +
  model_checksum +
  normalized_text +
  speaking_rate +
  pitch +
  volume
)

Hash the normalized text, not the raw input. If your normalizer turns “3 files” into “three files”, those should hit the same cache key.
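Here is a minimal sketch of that key in TypeScript. One detail is worth making explicit: plain string concatenation is ambiguous, since voice "a" with checksum "bc" and voice "ab" with checksum "c" would concatenate to the same string. Serializing the parts as a JSON array keeps the boundaries intact. The type and function names are hypothetical.

```typescript
import { createHash } from "node:crypto";

type CacheKeyParts = {
  engineVersion: string;
  voiceId: string;
  modelChecksum: string;
  normalizedText: string;
  speakingRate: number;
  pitch: number;
  volume: number;
};

// Hash every parameter that changes the audio.
// A JSON array keeps field boundaries unambiguous, unlike raw concatenation.
export function cacheKey(parts: CacheKeyParts): string {
  const canonical = JSON.stringify([
    parts.engineVersion,
    parts.voiceId,
    parts.modelChecksum,
    parts.normalizedText,
    parts.speakingRate,
    parts.pitch,
    parts.volume,
  ]);
  return createHash("sha256").update(canonical, "utf8").digest("hex");
}
```

The resulting hex string works directly as a file name, for example as the base name of the cached WAV.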

Cross-Platform Bugs Are Usually Boring

Figure: Matrix comparing Windows, macOS, and Linux concerns for TTS binary, audio, storage, and updates. Cross-platform bugs aren’t glamorous. They’re mostly file paths, permissions, and audio APIs.

If you are shipping to desktop, test this early:

Windows: Pass paths as separate arguments rather than quoting them by hand, hide the subprocess window, put your cache files in %LOCALAPPDATA%, and test with a Windows username that has spaces or non-ASCII characters.
macOS: Make sure the executable bits are set, plan for Apple’s signing and notarization early, store data in ~/Library/Application Support, and test on both Intel and Apple Silicon hardware.
Linux: Manually set the executable bit, plan around AppImage/Flatpak/Snap sandboxes, use proper XDG base directories, and test audio playback on both PipeWire and PulseAudio.

Don’t just test the easy path. Test the path with spaces:

C:\Users\Mike CK\AppData\Local\Your App\tts\voices\voice.onnx

Test the path with a localized folder name, and add a variant that contains non-ASCII characters:

/Users/mike/Library/Application Support/TTS Test - sauti/voices/voice.onnx

And test running your app from a read-only directory to make sure your audio cache still writes to the right place.
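The storage locations above can be picked in one pure function, which also makes them testable on any machine. This is a sketch under assumptions: the function name and the tts/cache subfolder are hypothetical, and Linux uses XDG_CACHE_HOME with the ~/.cache fallback from the XDG Base Directory convention.

```typescript
import { posix, win32 } from "node:path";

// Choose a writable, per-user cache directory for generated audio.
// platform follows Node's process.platform values ("win32", "darwin", "linux").
export function cacheDir(
  appName: string,
  platform: string,
  env: Record<string, string | undefined>,
  homeDir: string
): string {
  if (platform === "win32") {
    const base = env.LOCALAPPDATA || win32.join(homeDir, "AppData", "Local");
    return win32.join(base, appName, "tts", "cache");
  }
  if (platform === "darwin") {
    return posix.join(
      homeDir, "Library", "Application Support", appName, "tts", "cache"
    );
  }
  // Linux and other Unix-likes: an unset or empty variable means ~/.cache.
  const base = env.XDG_CACHE_HOME || posix.join(homeDir, ".cache");
  return posix.join(base, appName, "tts", "cache");
}
```

In the app you would call it as cacheDir("Your App", process.platform, process.env, os.homedir()). Because nothing is derived from the install directory, a read-only install location does not matter.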

Measure on Day One

Do not wait for production to discover where speech is slow.

From the very beginning, log:

Subprocess startup time
Total synthesis time
The output file size
How long the generated audio is
Real-time factor (RTF)
Cache hits vs misses
Non-zero exit codes
Snippets of stderr
How long it takes to cancel a job

RTF is your most important metric:

real_time_factor = synthesis_ms / audio_duration_ms

If it takes 500ms to generate 10 seconds of audio, your RTF is 0.05. If it takes 14 seconds to generate that same 10 seconds, your RTF is 1.4. That might be acceptable for background audiobook rendering. It breaks real-time UI speech.
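Both numbers are easy to compute at the point where you log. The sketch below assumes the output is uncompressed 16-bit mono PCM behind a canonical 44-byte WAV header. If your files differ, read the duration from the header or a decoder instead. The function names are hypothetical.

```typescript
// Real-time factor: below 1 means synthesis is faster than playback.
export function realTimeFactor(synthesisMs: number, audioMs: number): number {
  if (audioMs <= 0) throw new RangeError("audioMs must be positive");
  return synthesisMs / audioMs;
}

// Estimate audio duration from the file size alone.
// Assumes a 44-byte header followed by raw PCM samples.
export function estimateAudioMs(
  fileBytes: number,
  sampleRate: number,
  channels = 1,
  bytesPerSample = 2
): number {
  const dataBytes = Math.max(0, fileBytes - 44);
  return (dataBytes / (sampleRate * channels * bytesPerSample)) * 1000;
}
```

With the catalog's 22050 Hz sample rate, each second of audio is 44100 bytes of samples, so the estimate costs one stat call on the output file.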

Write a Simple Smoke Test

Your first automated test does not need to be fancy:

Given voice en-us-lessac-medium
When I synthesize "The backup completed successfully."
Then the process exits with code 0
And the output file exists
And the output file is larger than 44 bytes
And the first four bytes are "RIFF"
And the file can be decoded as WAV

This test proves the plumbing works. That is enough for a baseline CI check.
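The file-level checks in that scenario fit in one small function. This is a sketch with a hypothetical name: it covers the size and the RIFF and WAVE tags only, and does not fully decode the audio, so the last step of the scenario still needs a real decoder.

```typescript
// Read a 4-character ASCII tag from a byte buffer.
function tagAt(bytes: Uint8Array, offset: number): string {
  let tag = "";
  for (let i = offset; i < offset + 4 && i < bytes.length; i++) {
    tag += String.fromCharCode(bytes[i]);
  }
  return tag;
}

// Returns a list of problems; an empty list means the header looks like WAV.
export function checkWavHeader(bytes: Uint8Array): string[] {
  const problems: string[] = [];
  if (bytes.length <= 44) {
    problems.push("file is not larger than 44 bytes");
  }
  if (tagAt(bytes, 0) !== "RIFF") {
    problems.push('first four bytes are not "RIFF"');
  }
  if (tagAt(bytes, 8) !== "WAVE") {
    problems.push('bytes 8 to 11 are not "WAVE"');
  }
  return problems;
}
```

In CI you would pass it the bytes of the synthesized file, for example checkWavHeader(readFileSync(outputPath)), and fail the job if the list is not empty.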

Then, add a single test sentence that a human actually listens to:

The build failed because the API token expired at 14:05 UTC.

You’ll use that exact sentence to test pacing and numbers more often than you’d expect.

The Common Path Is the Product

Delay optimization until the CLI wrapper is observable, bundle-aware, and repeatable.

Once that is true, you can put a product API in front of it. Piper today, sherpa-onnx tomorrow, eSpeak NG when things go wrong.

That is the line between a local speech demo and local speech you can ship.

Part of the "Local Cross-Platform TTS" series
