Local text-to-speech starts as a model choice and quickly becomes a product commitment.
A demo asks one lazy question:
Which local TTS engine sounds the best?
A product asks the question that actually matters:
Which local TTS stack can I ship, update, debug, license, and trust on the machines my users actually own?
That second question is less glamorous. It is also the only one that prevents late rewrites.
A voice can sound beautiful and still be useless if it needs CUDA, drags in a license you cannot ship, bloats the installer, or takes two seconds before saying the first word. Local speech only becomes real when the boring parts are owned.
Local Means You Own the Failure
Local TTS means text becomes audio on the user’s device, or on infrastructure you explicitly control. No hosted synthesis API gets to sit on the critical path.
That buys you real advantages:
- Privacy: Sensitive information never needs to leave the device.
- Resilience: Speech functionality persists regardless of network availability or reliability.
- Cost predictability: You are freed from the recurring burden of per-character billing.
- Latency reduction: Network round-trips are eliminated from the critical path.
- Product continuity: Your application continues to speak even if a cloud vendor alters their pricing, limits, or terms of service.
But independence sends the bill somewhere else:
- Curating and bundling appropriate voice models.
- Managing platform-specific binaries.
- Normalizing raw text prior to synthesis.
- Orchestrating audio playback mechanisms.
- Benchmarking and mitigating latency on constrained hardware.
- Navigating the licensing labyrinth for both engines and voice models.
- Executing model updates without fracturing existing installations.
This trade is worth making when speech is core to the product, when private text must stay private, or when the device has to work off the grid. If speech is a decorative flourish, rent the API and move on.
The 2026 Engine Landscape
As of April 2026, four engines belong on the whiteboard:
- Piper
: A strong first engine for fast local neural speech, with a straightforward CLI and an embedding-friendly architecture. Managed by the Open Home Foundation, it embeds eSpeak NG for phonemization (its latest stable release at the time of writing is
v1.4.2). - sherpa-onnx : The serious choice when you need a broader offline speech toolkit with native and mobile integration. It reaches beyond TTS into automatic speech recognition (ASR) and voice activity detection (VAD).
- Kokoro ONNX : A quality candidate when naturalness matters most and you are willing to validate provenance, packaging, licensing, and runtime behavior yourself.
- eSpeak NG : The fallback that keeps speaking long after prettier options give up. It is compact, reliable, and supports over a hundred languages and accents.
A sane strategy is simple: start with Piper for a working desktop voice, move to sherpa-onnx when native control or mobile breadth matters, keep eSpeak NG wired as the emergency voice, and evaluate Kokoro ONNX when quality is worth the extra validation work.
Five Product Questions Before You Download a Model
The model can wait. These questions decide whether the model can ship.
1. Where will the speech actually run?
“Cross-platform” is a promise you make to Windows, macOS, Linux, and maybe mobile or embedded devices.
For each target platform, you need to know:
- CPU architecture: Are you running on x86_64, Apple Silicon (arm64), or something else?
- OS constraints: What’s the minimum OS version you support?
- Hardware acceleration: Can you rely on a GPU, or do you need to assume CPU-only?
- Resource limits: How much disk space and RAM can you safely eat up?
- Network access: Can your app download a large voice model after installation?
A voice that only runs perfectly on your top-tier dev laptop is a lab result.
2. What’s your latency budget?
Latency is not one number. Measure at least three:
- Time-to-first-audio: How long it takes from the user clicking “Play” to hearing the first sound.
- Real-time factor (RTF): How fast the audio generates compared to how long it takes to play it.
- Cancellation latency: How fast the audio stops when the user hits pause or interrupt.
The metric that matters depends on the product. UI alerts live or die by time-to-first-audio. Audiobook narration cares more about RTF. Screen readers need fast cancellation because speech that refuses to stop is hostile.
3. What kind of text are you feeding it?
TTS engines do not read the sentence you imagined. They read exactly what your app sends them.
Real product text is ugly:
- Raw log files and code snippets
- Markdown or HTML tags
- Dates and localized currency formats
- Internal acronyms and URLs
- Sentences mixing multiple languages
The model is only half the voice. The other half is text normalization: cleaning up the text before the engine gets a chance to embarrass you.
4. What do the licenses allow?
If you’re building a commercial product, you need to check licenses on three different levels:
- The license for the TTS engine itself.
- The license for the specific voice model you downloaded.
- The licenses of any underlying dependencies.
This matters if you are distributing closed-source apps or pushing to an app store. The goal is to know exactly what you are shipping.
5. How will you ship updates?
Local voice models are heavy—often tens or hundreds of megabytes. If you bundle three large voices directly into your main app, your users will feel it every time they download a minor bug fix.
Separate the release pieces early:
- The app code
- The TTS engine runtime
- The voice models and config files
- Your custom pronunciation dictionaries
Then a pronunciation fix can be a tiny dictionary update instead of a 500MB apology.
A Practical Decision Tree
Use this heuristic:
- Use Piper to get your first functional, local neural voice running on desktop.
- Switch to sherpa-onnx if you’re building a native app that needs tighter control than a command-line wrapper provides.
- Keep eSpeak NG wired up as a rock-solid fallback for missing languages or slow hardware.
- Explore Kokoro ONNX when you want the highest possible quality and have the time to validate its architecture.
You do not have to crown one engine. A healthy app usually stacks them:
Product API
-> Primary neural engine
-> Lighter neural fallback
-> eSpeak NG emergency fallback
That is the difference between a speech feature and a speech hope.
Make the Benchmark Boring
Do not wire the engine into a UI first. Create a directory named something like tts-fixtures and fill it with ten sentences your product will actually say.
Consider these examples:
The backup finished at 03:42 UTC.
CPU usage crossed 92 percent for five minutes.
Open /var/log/nginx/error.log and search for upstream timeout.
Invoice INV-2026-0417 is overdue by KES 12,450.
Dr. Njoroge reviewed version 2.1.0-beta.3 yesterday.
Run every candidate engine against the same fixture file. Log the results:
engine, voice, platform, chars, synth_ms, audio_ms, rtf, wav_bytes
piper, en_US-lessac-medium, macos-arm64, 321, 820, 14600, 0.056, 467244
The goal is to see how each engine fails under realistic text.
Be highly attuned to:
- Bizarre numeric pronunciations
- Mangled acronym expansions
- Unnatural pauses or erratic pacing
- Prematurely clipped sentence endings
- Prohibitive startup latency
- Punishing CPU spikes during synthesis
- Wildly inconsistent volume levels
- Silent crashes when processing lengthy text blocks
The fixture file is your referee. Keep it small enough that running it feels automatic.
Your UI Is Not a Speech Engine
The target shape should feel familiar:
App
SpeechService
TextNormalizer
SentenceQueue
VoiceCatalog
EngineAdapter
PiperAdapter
SherpaOnnxAdapter
EspeakFallbackAdapter
AudioCache
Player
The implementation language does not matter. Rust, Go, Python, TypeScript, Swift, Kotlin, and C++ can all work.
The rule matters:
Your app talks to
SpeechService, not to model files, CLIs, or random engine flags.
That boundary is what lets you swap engines later without rewriting every screen that needs speech.
The First Milestone
The first milestone is deliberately small:
Given an arbitrary string of text, produce a localized WAV file repeatedly, using a designated voice and a pinned engine version.
That sounds boring because it is. Boring is the foundation. Everything else in local speech gets easier once that sentence is true.

