Voice

Speech-to-text and text-to-speech endpoints, which back the push-to-talk composer and spoken replies. Verified against `api/routes.py` and `api/upload.py`.

`POST /api/transcribe`: speech → text (multipart)

The request body is multipart/form-data with a single file field named `file` (the audio). There is no `session_id`: transcription is stateless.

Response (success): `{ "ok": true, "transcript": "…" }`. Failure: `{ "error": "…" }`. The server returns a JSON error body even on non-2xx responses, so a client can decode both cases uniformly.

Status	Meaning
`413`	audio exceeds the max upload size
`400`	no file / no filename / transcription failed
`503`	speech-to-text unavailable — the STT module or provider isn't configured
`401`	not authenticated
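
The request and response shapes above can be sketched as a small client using only the Python standard library. This is an illustration under stated assumptions, not the project's own client: the helper names (`build_multipart`, `decode_transcribe_response`, `transcribe`, `TranscribeError`) and the `base_url` parameter are hypothetical, and authentication is left out because the mechanism behind the `401` is not described here.

```python
import json
import urllib.error
import urllib.request
import uuid


def build_multipart(field_name, filename, data, content_type="application/octet-stream"):
    """Encode one file field as multipart/form-data.

    Returns (body_bytes, content_type_header_value). The filename is assumed
    to contain no double quotes or line breaks.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


class TranscribeError(Exception):
    """Carries the HTTP status and the server's "error" string."""

    def __init__(self, status, message):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


def decode_transcribe_response(status, raw):
    """Return the transcript on success, otherwise raise TranscribeError."""
    try:
        payload = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    if 200 <= status < 300 and payload.get("ok") is True:
        return payload.get("transcript", "")
    raise TranscribeError(status, str(payload.get("error", "unknown error")))


def transcribe(base_url, filename, audio):
    """POST the audio as the single `file` field and decode either outcome."""
    body, ctype = build_multipart("file", filename, audio)
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/transcribe",
        data=body,
        headers={"Content-Type": ctype},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return decode_transcribe_response(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a JSON body, so reuse the same decoder.
        return decode_transcribe_response(err.code, err.read())
```

Because the error body is JSON on every status, one decoder handles both paths; a caller can branch on `TranscribeError.status` to tell a `413` (shrink the recording) from a `503` (speech-to-text is not configured).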

`POST /api/tts`: text → speech

Body (JSON)

{
  "text": "…",                       // required, ≤ 5000 chars
  "voice": "zh-CN-XiaoxiaoNeural",   // default
  "rate": "…", "pitch": "…",         // string or number
  "engine": "edge"                   // edge (default) | elevenlabs | openai | browser
}

Response (success): raw audio bytes, fully buffered and sent with a Content-Length header, so the client can play them directly.

Status	Meaning
`400`	invalid body / empty text / text too long / bad rate·pitch
`429`	rate-limited (a short per-client window)
`503`	the selected engine's provider key is missing (e.g. an ElevenLabs key)
`401`	not authenticated
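
A client can catch some of the `400` cases before making the request. The sketch below builds the JSON body and mirrors the documented limits. The function name `build_tts_body` is hypothetical, and two things are assumptions rather than documented behavior: that omitting an optional field lets the server apply its default, and that the 5000-character limit is counted the way Python's `len` counts characters. The server stays the authority on what is valid, including the rate and pitch formats, which are not checked here.

```python
import json

MAX_TEXT_CHARS = 5000  # documented limit for "text"
ENGINES = {"edge", "elevenlabs", "openai", "browser"}  # documented engine names


def build_tts_body(text, voice=None, rate=None, pitch=None, engine=None):
    """Return the UTF-8 JSON body for POST /api/tts, or raise ValueError."""
    if not isinstance(text, str) or not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    if engine is not None and engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")

    body = {"text": text}
    # Assumption: fields left out of the body fall back to the server defaults.
    for key, value in (("voice", voice), ("rate", rate), ("pitch", pitch), ("engine", engine)):
        if value is not None:
            body[key] = value
    return json.dumps(body, ensure_ascii=False).encode("utf-8")
```

For example, `build_tts_body("hello", engine="edge")` yields a body that can be sent with a `Content-Type: application/json` header; on success the response bytes are the audio itself and can be written to a file or handed to a player unchanged.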

Engine choice

`engine: "browser"` signals the client to use on-device speech synthesis instead of a server round-trip, which is useful offline or to avoid provider cost. `edge` is the default server engine and needs no key; `elevenlabs` and `openai` need their respective provider keys configured on the server.
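
One way a client might apply these rules is sketched below. The function `choose_engine`, its `online` flag, and the `server_keys` set are all hypothetical: this page does not say how a client would learn which provider keys the server has, and falling back to `edge` when a key is missing is an assumed policy, not documented behavior. The `preferred` value is assumed to be one of the four documented engine names.

```python
KEYED_ENGINES = {"elevenlabs", "openai"}  # these need a provider key on the server


def choose_engine(preferred, online, server_keys):
    """Return (engine, needs_server_request) for a spoken reply.

    preferred   -- the engine the user asked for (one of the documented names)
    online      -- hypothetical flag: whether the server is reachable
    server_keys -- hypothetical set of engines whose provider key is configured
    """
    if not online or preferred == "browser":
        # On-device synthesis: no call to /api/tts at all.
        return ("browser", False)
    if preferred in KEYED_ENGINES and preferred not in server_keys:
        # Assumed policy: use the keyless default rather than risk a 503.
        return ("edge", True)
    return (preferred, True)
```

With these assumptions, `choose_engine("elevenlabs", True, set())` returns `("edge", True)`, and any call with `online=False` returns `("browser", False)`.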
