How a full translator runs on your phone

No server. No API. After a one-time model download, the entire speech-to-speech pipeline — recognition, translation, synthesis, even voice cloning — executes on the phone in your pocket. Here's the actual engineering, with measured numbers.

The pipeline

1 · Listen — Whisper (OpenAI) via whisper.cpp

Speech recognition runs on the phone GPU (Adreno, OpenCL). The free tier uses the tiny and base models; Pro uses large-v3-turbo (809M params, 452 MB quantized). Silero VAD splits long speech into chunks, so translation can start on one chunk while recognition is still working on the next.
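To make the chunking idea concrete, here is a minimal sketch that cuts audio wherever a run of silence appears. It uses a plain energy threshold as a stand-in for Silero VAD, which is a neural model; the function name, frame size, and thresholds are assumptions chosen for illustration, not values from the app.

```python
from typing import List, Optional, Tuple


def split_on_silence(
    samples: List[float],
    frame_len: int = 160,            # assumption: 10 ms frames at 16 kHz
    energy_threshold: float = 1e-3,  # assumption: mean-square energy gate
    min_silence_frames: int = 3,     # assumption: silence needed to close a chunk
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges that contain speech.

    A chunk is closed once `min_silence_frames` consecutive frames fall
    below `energy_threshold`. The energy gate only stands in for a real
    VAD such as Silero, which scores each frame with a neural network.
    """
    chunks: List[Tuple[int, int]] = []
    start: Optional[int] = None   # first sample of the chunk being built
    last_voiced_end = 0           # end of the most recent voiced frame
    silent_run = 0                # consecutive silent frames inside a chunk

    for pos in range(0, len(samples), frame_len):
        frame = samples[pos:pos + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy >= energy_threshold:
            if start is None:
                start = pos
            last_voiced_end = pos + len(frame)
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                chunks.append((start, last_voiced_end))
                start = None
                silent_run = 0

    if start is not None:  # audio ended while still inside a chunk
        chunks.append((start, last_voiced_end))
    return chunks
```

Each returned range can be handed to the recognizer as soon as it closes, which is what lets later stages begin before the speaker has finished.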

2 · Translate — MADLAD-400-3B, 4-bit quantized

A 3-billion-parameter multilingual model (Google research, Apache-2.0) running through llama.cpp — on the CPU, on purpose (see the war story below). One model covers every language pair, both directions. ~1.6 GB on disk.
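The disk footprint follows from bits-per-weight arithmetic. The sketch below assumes a block format in which 32 four-bit weights share one 16-bit scale (the layout of llama.cpp's Q4_0 type); the page says only "4-bit quantized", so the exact format and the round figure of 3.0e9 parameters are both assumptions.

```python
def quantized_size_gb(
    n_params: float,
    bits_per_weight: int = 4,
    block_size: int = 32,   # assumption: weights per quantization block
    scale_bits: int = 16,   # assumption: one fp16 scale per block
) -> float:
    """Approximate size of a block-quantized model in GB (1 GB = 1e9 bytes)."""
    bits_per_block = block_size * bits_per_weight + scale_bits
    effective_bits = bits_per_block / block_size  # 4.5 with the defaults
    return n_params * effective_bits / 8 / 1e9


# Assumption: exactly 3.0e9 parameters, all stored in the same format.
size_gb = quantized_size_gb(3.0e9)
print(f"estimated size: {size_gb} GB")
```

Under these assumptions the estimate comes out at roughly 1.7 GB, the same range as the ~1.6 GB quoted above. A real file can differ because the parameter count is not exactly 3 billion and individual tensors may be stored at other precisions.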

3 · Speak — System TTS or Chatterbox voice cloning

Standard output uses fast local synthesis. Pro clones a reference voice on-device — a multi-model stack (speech-token generator + neural vocoder) so the translation comes back sounding like you.

4 · Read (OCR) — PP-OCRv5 + Tesseract

Camera translation detects and reads text on-device with neural OCR; languages whose scripts rely on conjuncts and combining vowel signs (Tamil, Telugu, Hindi) route to a Tesseract path that handles them better.
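The routing rule can be sketched as a lookup on the source-language code. The engine labels and the function name here are hypothetical; the page names the two OCR engines and the three languages but not how the app selects between them.

```python
# Hypothetical engine labels; the app's real identifiers are not given.
NEURAL_OCR = "pp-ocrv5"
TESSERACT = "tesseract"

# ISO 639-1 codes for the languages the text names as routed to Tesseract.
TESSERACT_LANGS = {"ta", "te", "hi"}


def pick_ocr_engine(lang_code: str) -> str:
    """Choose an OCR engine from the source-language code."""
    return TESSERACT if lang_code.lower() in TESSERACT_LANGS else NEURAL_OCR
```

For example, `pick_ocr_engine("te")` selects the Tesseract path, while `pick_ocr_engine("en")` stays on the neural one.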

War story #1: the GPU that lied

The obvious plan was to run everything on the phone's GPU. Speech recognition loved it. Translation didn't: on mobile OpenCL, long autoregressive decodes accumulated numerical error until the output was silently corrupted, producing plausible-looking text that drifted away from the correct translation. The fix wasn't more engineering heroics on the GPU path; it was accepting that a well-quantized 3B model on a modern phone CPU is both correct and fast (~1.2 s for a sentence). Speech recognition stays on the GPU, translation runs on the CPU, and the stages are pipelined so they overlap.
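The overlap between stages can be sketched with one worker thread per stage, connected by queues. This is an illustration of the general technique, not the app's code: `run_pipeline`, `recognize`, and `translate` are hypothetical names, and the two stage functions are placeholders for the GPU recognizer and the CPU translator.

```python
import queue
import threading
from typing import Callable, Iterable, List

_DONE = object()  # sentinel marking the end of the stream


def run_pipeline(chunks: Iterable[str],
                 stages: List[Callable[[str], str]]) -> List[str]:
    """Run each stage in its own thread, connected by FIFO queues.

    While a later stage works on chunk n, the stage before it can
    already process chunk n+1, which is the overlap described above.
    """
    queues = [queue.Queue() for _ in range(len(stages) + 1)]

    def worker(fn, inbox, outbox):
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)  # pass the end marker downstream
                return
            outbox.put(fn(item))

    threads = [
        threading.Thread(target=worker, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()

    for chunk in chunks:
        queues[0].put(chunk)
    queues[0].put(_DONE)

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Placeholder stages standing in for recognition (GPU) and translation (CPU).
def recognize(audio_chunk: str) -> str:
    return f"text({audio_chunk})"


def translate(text: str) -> str:
    return f"translated({text})"
```

Calling `run_pipeline(["chunk1", "chunk2"], [recognize, translate])` returns the results in input order, because each stage has a single worker reading a FIFO queue. The sketch omits error handling: a stage that raises would leave the consumer waiting.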

War story #2: the licensing minefield

Shipping open models commercially eliminates most of the leaderboard. Meta's NLLB-200 translates beautifully — and is CC-BY-NC, non-commercial. Popular open TTS engines are GPLv3 (viral for an app) or "research only." Every model in this app was chosen twice: once for quality, once for a license that legally ships — which is how the translation stage ended up on Apache-2.0 MADLAD-400 and the voice-clone stack on permissively licensed models.

Measured on real hardware

Samsung Galaxy S24 Ultra (Snapdragon 8 Gen 3); timings from in-app instrumentation. GPU stages vary ±10% run to run.

Stage	Time	Notes
Speech recognition (5.9 s clip, base model)	~1.1 s	free tier
Speech recognition (5.9 s clip, large-v3-turbo)	~9 s	Pro max-accuracy tier
Translation (short sentence, MADLAD-3B warm)	~1.2 s	all tiers; pre-loaded at app start
Translation model load from disk	~0.7 s	happens in the background at launch
Voice-clone synthesis engine start (warm cache)	~4.3 s	first use; then resident
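One way to read the two recognition rows is as a real-time factor: processing time divided by audio duration. The short calculation below uses only the figures from the table; the function name is illustrative.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration; below 1.0 beats real time."""
    return processing_s / audio_s


CLIP_S = 5.9  # clip length from the table
for model, seconds in [("base", 1.1), ("large-v3-turbo", 9.0)]:
    print(f"{model}: RTF ~{real_time_factor(seconds, CLIP_S):.2f}")
```

This gives roughly 0.19 for the base model and 1.53 for large-v3-turbo: the free tier transcribes about five times faster than the speech itself, while the Pro max-accuracy tier takes about one and a half times the clip length.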

Models aren't bundled in the APK — they download once (encrypted, ~30 MB/s including decryption) and live on the device. After that, airplane mode is a supported configuration: that's the whole point. The privacy property isn't a policy promise — there is simply no server for your conversations to reach.
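As a rough check on what the one-time download costs, the quoted throughput can be combined with the two model sizes given earlier. This sketch assumes the ~30 MB/s rate holds for the whole transfer and takes 1.6 GB as 1600 MB; sizes for the other models are not stated here, so they are left out.

```python
def download_seconds(size_mb: float, throughput_mb_s: float = 30.0) -> float:
    """Time to fetch and decrypt a model at a steady throughput."""
    return size_mb / throughput_mb_s


# Sizes quoted in the pipeline section; other models are not listed there.
MODELS_MB = {
    "whisper large-v3-turbo (quantized)": 452,
    "MADLAD-400-3B (4-bit)": 1600,  # assumption: 1.6 GB taken as 1600 MB
}

for name, size_mb in MODELS_MB.items():
    print(f"{name}: ~{download_seconds(size_mb):.0f} s")
print(f"total: ~{download_seconds(sum(MODELS_MB.values())):.0f} s")
```

Under those assumptions the recognition model takes about 15 s and the translation model about 53 s, a little over a minute for the pair.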

Try it free — unlimited text translation + 10 spoken translations — then a one-time unlock from $2.99. No subscription.
