How a full translator runs on your phone

No server. No API. After a one-time model download, the entire speech-to-speech pipeline — recognition, translation, synthesis, even voice cloning — executes on the phone in your pocket. Here's the actual engineering, with measured numbers.

The pipeline

1 · Listen: Whisper (OpenAI) via whisper.cpp
Speech recognition runs on the phone GPU (Adreno, via OpenCL). Free tiers use the tiny/base models; Pro uses large-v3-turbo (809M parameters, 452 MB quantized). Silero VAD splits long speech into chunks, so translation can start on early chunks while recognition continues on later ones.
2 · Translate: MADLAD-400-3B, 4-bit quantized
A 3-billion-parameter multilingual model (Google research, Apache-2.0) running through llama.cpp on the CPU, on purpose (see war story #1 below). One model covers every language pair in both directions. ~1.6 GB on disk.
3 · Speak: system TTS or Chatterbox voice cloning
Standard output uses fast local synthesis. Pro clones a reference voice on-device with a multi-model stack (speech-token generator + neural vocoder), so the translation comes back sounding like you.
4 · Read (OCR): PP-OCRv5 + Tesseract
Camera translation detects and reads text on-device with neural OCR; scripts with stacked conjuncts (Tamil, Telugu, and Devanagari for Hindi) route to a Tesseract path that handles them better.
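
The OCR routing in step 4 can be sketched as a script check over Unicode code-point ranges. This is an illustration under assumptions, not the app's actual code: it assumes a text sample is already available (for example from a first OCR pass), and the names choose_ocr_engine, ROUTED_SCRIPTS and the 0.5 threshold are hypothetical. The engine names are plain labels, not calls into PP-OCRv5 or Tesseract.

```python
# Hypothetical routing table: Unicode block ranges for the scripts named above.
ROUTED_SCRIPTS = {
    "Devanagari": (0x0900, 0x097F),  # used for Hindi
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
}


def script_of(ch):
    """Return the routed script name for a character, or None."""
    cp = ord(ch)
    for name, (lo, hi) in ROUTED_SCRIPTS.items():
        if lo <= cp <= hi:
            return name
    return None


def choose_ocr_engine(sample_text, threshold=0.5):
    """Pick "tesseract" when enough of the sample is in a routed script.

    Vowel signs and viramas are combining marks rather than letters,
    so they are counted through script_of() instead of str.isalpha().
    """
    counted = [ch for ch in sample_text if ch.isalpha() or script_of(ch)]
    if not counted:
        return "pp-ocrv5"
    routed = sum(1 for ch in counted if script_of(ch))
    return "tesseract" if routed / len(counted) >= threshold else "pp-ocrv5"


print(choose_ocr_engine("Hello world"))
print(choose_ocr_engine("नमस्ते"))
```

A real app might route on the user's selected source language instead, which avoids needing a text sample at all.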

War story #1: the GPU that lied

The obvious plan was to run everything on the phone's GPU. Speech recognition loved it. Translation didn't: on mobile OpenCL, long autoregressive decodes accumulated numerical error until the output silently corrupted into plausible-looking text that drifted wrong. The fix wasn't more engineering heroics on the GPU path; it was accepting that a well-quantized 3B model on a modern phone CPU is both correct and fast (~1.2 s for a sentence). Speech recognition stays on the GPU, translation runs on the CPU, and the stages are pipelined so they overlap.
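
To make the overlap concrete, here is a minimal sketch of a staged pipeline: one worker thread per stage, connected by FIFO queues, so a later stage can work on chunk 1 while an earlier stage handles chunk 2. Everything here is a stand-in: recognize, translate and speak are hypothetical placeholders that only tag strings, and the app's real scheduling is not described in this article. In Python, true overlap also depends on the stages spending their time in native code that releases the interpreter lock.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the chunk stream


def _worker(fn, inbox, outbox):
    """Apply fn to each item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(fn(item))


def run_pipeline(chunks, stages):
    """Push chunks through the stages; one thread per stage keeps order."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=_worker, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for chunk in chunks:
        queues[0].put(chunk)
    queues[0].put(_DONE)

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Placeholder stages (hypothetical): each one just wraps its input.
def recognize(chunk):
    return f"text({chunk})"


def translate(text):
    return f"translated({text})"


def speak(text):
    return f"audio({text})"


outputs = run_pipeline(["chunk1", "chunk2"], [recognize, translate, speak])
print(outputs)
```

The sketch has no error handling: a stage that raises would leave the pipeline waiting, so production code would need to forward failures as well as results.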

War story #2: the licensing minefield

Shipping open models commercially eliminates most of the leaderboard. Meta's NLLB-200 translates beautifully — and is CC-BY-NC, non-commercial. Popular open TTS engines are GPLv3 (viral for an app) or "research only." Every model in this app was chosen twice: once for quality, once for a license that legally ships — which is how the translation stage ended up on Apache-2.0 MADLAD-400 and the voice-clone stack on permissively licensed models.

Measured on real hardware

Samsung Galaxy S24 Ultra (Snapdragon 8 Gen 3); timings from in-app instrumentation. GPU stages vary ±10% run to run.

Speech recognition (5.9 s clip, base model): ~1.1 s (free tier)
Speech recognition (5.9 s clip, large-v3-turbo): ~9 s (Pro max-accuracy tier)
Translation (short sentence, MADLAD-3B warm): ~1.2 s (all tiers; pre-loaded at app start)
Translation model load from disk: ~0.7 s (happens in the background at launch)
Voice-clone synthesis engine start (warm cache): ~4.3 s first use (then resident)
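
One way to read these timings is as a real-time factor: processing time divided by audio length, where anything below 1.0 is faster than real time. The short calculation below uses only the figures listed above. The back-to-back latency is an assumption (recognition followed by translation on a single utterance, with the 5.9 s clip treated as one short sentence), and synthesis is left out because no per-sentence synthesis time is listed.

```python
CLIP_S = 5.9  # clip length from the measurements above, in seconds


def real_time_factor(processing_s, audio_s):
    """Processing time divided by audio duration; < 1.0 beats real time."""
    return processing_s / audio_s


rtf_base = real_time_factor(1.1, CLIP_S)   # base model
rtf_turbo = real_time_factor(9.0, CLIP_S)  # large-v3-turbo

# Assumption: recognition and translation run back to back, no overlap.
free_tier_text_latency_s = 1.1 + 1.2

print(f"base RTF:  {rtf_base:.2f}")
print(f"turbo RTF: {rtf_turbo:.2f}")
print(f"free-tier speech to translated text: ~{free_tier_text_latency_s:.1f} s")
```

On these numbers the base model runs at roughly 0.19 of real time, while large-v3-turbo takes about 1.5 times the clip length.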

Models aren't bundled in the APK — they download once (encrypted, ~30 MB/s including decryption) and live on the device. After that, airplane mode is a supported configuration: that's the whole point. The privacy property isn't a policy promise — there is simply no server for your conversations to reach.
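
For a rough sense of what that one-time download costs, the arithmetic below combines the two model sizes quoted earlier with the ~30 MB/s figure. It is an estimate under assumptions: the rate is taken as sustained end to end, 1 GB is taken as 1000 MB, and the other models (VAD, OCR, voice cloning) are left out because their sizes are not stated here. The helper name download_seconds is hypothetical.

```python
RATE_MB_S = 30.0  # quoted throughput, including decryption

# Sizes quoted in this article; other models are omitted (sizes not stated).
MODEL_SIZES_MB = {
    "whisper large-v3-turbo (quantized)": 452,
    "MADLAD-400-3B (4-bit)": 1600,  # ~1.6 GB, assuming 1 GB = 1000 MB
}


def download_seconds(size_mb, rate_mb_s=RATE_MB_S):
    """Estimated transfer time for one model at a sustained rate."""
    if rate_mb_s <= 0:
        raise ValueError("rate must be positive")
    return size_mb / rate_mb_s


for name, size_mb in MODEL_SIZES_MB.items():
    print(f"{name}: ~{download_seconds(size_mb):.0f} s")

total_s = sum(download_seconds(s) for s in MODEL_SIZES_MB.values())
print(f"both models: ~{total_s:.0f} s")
```

So the two largest models quoted here would take a little over a minute at the stated rate.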

Try it free — unlimited text translation + 10 spoken translations — then a one-time unlock from $2.99. No subscription.