Voice

Speech-to-text and text-to-speech — the push-to-talk composer and spoken replies. Verified against api/routes.py, api/upload.py.


POST /api/transcribe — speech → text (multipart)

The request body is multipart/form-data with a single file field named file (the audio). No session_id is sent: transcription is stateless.

Response (success): { "ok": true, "transcript": "…" }

Failure: { "error": "…" }. The server returns a JSON error body even on non-2xx responses, so a client can decode success and failure uniformly.

Status Meaning
413 audio exceeds the max upload size
400 no file / no filename / transcription failed
503 speech-to-text unavailable — the STT module or provider isn't configured
401 not authenticated
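
A minimal client sketch for this endpoint, using only the Python standard library. It builds the multipart body with the single file field and decodes success and failure through the same JSON path. The names base_url, headers, TranscribeError, and the audio/webm default MIME type are illustrative assumptions; how authentication is carried is not specified here, so the caller passes whatever headers the deployment expects.

```python
import json
import urllib.error
import urllib.request
import uuid


class TranscribeError(Exception):
    """Hypothetical error type wrapping the JSON error body."""

    def __init__(self, status: int, message: str):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


def encode_multipart(filename: str, audio: bytes, mime: str = "audio/webm"):
    """Build a multipart/form-data body with one file field named "file".

    The default MIME type is an assumption; use the recorder's real type.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + audio + tail, f"multipart/form-data; boundary={boundary}"


def parse_transcribe_response(status: int, raw: bytes) -> str:
    """Return the transcript, or raise TranscribeError from the error body."""
    try:
        payload = json.loads(raw.decode("utf-8"))
    except ValueError:  # covers invalid UTF-8 and invalid JSON
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    if 200 <= status < 300 and payload.get("ok"):
        return payload.get("transcript", "")
    raise TranscribeError(status, payload.get("error", "unknown error"))


def transcribe(base_url: str, filename: str, audio: bytes, headers=None) -> str:
    """POST audio to /api/transcribe; `headers` carries any auth (assumed)."""
    body, content_type = encode_multipart(filename, audio)
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/transcribe",
        data=body,
        method="POST",
        headers={"Content-Type": content_type, **(headers or {})},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return parse_transcribe_response(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a JSON error body, so reuse the parser.
        return parse_transcribe_response(err.code, err.read())
```

A caller can branch on TranscribeError.status to tell a retryable condition (503) apart from a bad upload (400 or 413).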

POST /api/tts — text → speech

Body (JSON)

{
  "text": "…",                       // required, ≤ 5000 chars
  "voice": "zh-CN-XiaoxiaoNeural",   // default
  "rate": "…", "pitch": "…",         // string or number
  "engine": "edge"                   // edge (default) | elevenlabs | openai | browser
}

Response (success) — raw audio bytes (fully buffered, with Content-Length). Play them directly.

Status Meaning
400 invalid body / empty text / text too long / invalid rate or pitch
429 rate-limited (a short per-client window)
503 the selected engine's provider key is missing (e.g. an ElevenLabs key)
401 not authenticated
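
The request can be sketched as follows, assuming a Python client and the standard library only. build_tts_payload applies the two checks that can be done locally (non-empty text, at most 5000 characters) before a round-trip; whether the server counts characters the same way as Python's len is an assumption. rate and pitch are passed through untouched because their accepted formats are not given here. TtsError, base_url, and headers are hypothetical names, and the error body is kept as raw text because its shape is not documented for this endpoint.

```python
import json
import urllib.error
import urllib.request

MAX_TTS_CHARS = 5000  # limit stated in the body description


class TtsError(Exception):
    """Hypothetical error type carrying the HTTP status and raw error body."""

    def __init__(self, status: int, body: str):
        super().__init__(f"{status}: {body}")
        self.status = status
        self.body = body


def build_tts_payload(text, voice=None, rate=None, pitch=None, engine=None):
    """Return the JSON body for /api/tts, omitting fields left at defaults."""
    if not isinstance(text, str) or not text:
        raise ValueError("text is required")
    if len(text) > MAX_TTS_CHARS:
        raise ValueError(f"text exceeds {MAX_TTS_CHARS} characters")
    payload = {"text": text}
    optional = (("voice", voice), ("rate", rate), ("pitch", pitch), ("engine", engine))
    for key, value in optional:
        if value is not None:
            payload[key] = value  # rate/pitch may be a string or a number
    return payload


def synthesize(base_url: str, payload: dict, headers=None) -> bytes:
    """POST the payload and return the raw audio bytes."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/tts",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json", **(headers or {})},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()  # fully buffered audio
    except urllib.error.HTTPError as err:
        raise TtsError(err.code, err.read().decode("utf-8", "replace")) from err
```

Omitting voice and engine lets the server apply its own defaults, which avoids hard-coding them in the client.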

Engine choice

engine: "browser" signals the client to use on-device speech synthesis instead of a server round-trip — useful offline or to avoid provider cost. edge is the default server engine and needs no key; elevenlabs/openai need their respective provider keys configured on the server.
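
One way a client might act on this, as a sketch rather than the application's actual logic: pick the engine up front and skip the /api/tts call entirely when the result is browser. The function names and the fallback policy (drop to browser when offline, retry with edge after a 503 from a keyed engine) are illustrative assumptions, not behavior described by the API.

```python
SERVER_ENGINES = {"edge", "elevenlabs", "openai"}
KEYED_ENGINES = {"elevenlabs", "openai"}  # need provider keys on the server


def resolve_engine(preferred: str = "edge", online: bool = True) -> str:
    """Pick the engine to use; fall back to on-device synthesis when offline."""
    if preferred != "browser" and preferred not in SERVER_ENGINES:
        raise ValueError(f"unknown engine: {preferred}")
    if not online:
        return "browser"
    return preferred


def needs_server_call(engine: str) -> bool:
    """Only server engines require a POST to /api/tts."""
    return engine != "browser"


def fallback_after_503(engine: str) -> str:
    """Hypothetical policy: a keyed engine without its key retries on edge,
    which needs no key; anything else falls back to on-device synthesis."""
    return "edge" if engine in KEYED_ENGINES else "browser"
```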