Voice
Speech-to-text and text-to-speech — the push-to-talk composer and spoken replies. Verified against api/routes.py, api/upload.py.
POST /api/transcribe: speech → text (multipart)
Send multipart/form-data with a single file field named file (the audio). Note: there is no session_id; transcription is stateless.
Response (success) — { "ok": true, "transcript": "…" }.
Failure — { "error": "…" } (the server returns a JSON error body even on non-2xx, so a client can decode it uniformly).
| Status | Meaning |
|---|---|
| 413 | audio exceeds the max upload size |
| 400 | no file / no filename / transcription failed |
| 503 | speech-to-text unavailable: the STT module or provider isn't configured |
| 401 | not authenticated |
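A minimal client sketch for this endpoint, using only the Python standard library. It shows one way to encode the single file field and to decode success and failure bodies through the same JSON path. The names build_multipart, parse_transcribe_response, TranscribeError and transcribe are hypothetical, as are the base_url argument, the default audio MIME type, and the headers argument used to carry credentials (the authentication mechanism is not specified here).

```python
import json
import urllib.error
import urllib.request
import uuid
from typing import Dict, Optional, Tuple


def build_multipart(filename: str, audio: bytes,
                    mime: str = "audio/webm") -> Tuple[bytes, str]:
    """Encode one form field named "file"; returns (body, Content-Type).

    The default MIME type is an assumption. The filename is inserted
    as-is, so it must not contain double quotes or line breaks.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + audio + tail, f"multipart/form-data; boundary={boundary}"


class TranscribeError(Exception):
    """Carries the HTTP status and the server's "error" string."""

    def __init__(self, status: int, message: str):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


def parse_transcribe_response(status: int, raw: bytes) -> str:
    """Return the transcript on success, otherwise raise TranscribeError."""
    try:
        payload = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    if 200 <= status < 300 and payload.get("ok") is True:
        return payload.get("transcript", "")
    raise TranscribeError(status, payload.get("error", "unknown error"))


def transcribe(base_url: str, filename: str, audio: bytes,
               headers: Optional[Dict[str, str]] = None) -> str:
    """POST the audio to /api/transcribe and return the transcript."""
    body, content_type = build_multipart(filename, audio)
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/transcribe",
        data=body,
        headers={**(headers or {}), "Content-Type": content_type},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return parse_transcribe_response(response.status, response.read())
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a JSON body, so decode them the same way.
        return parse_transcribe_response(err.code, err.read())
```

A caller can branch on TranscribeError.status to tell a 413 (audio too large) from a 503 (speech-to-text not configured) and surface TranscribeError.message to the user.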
POST /api/tts: text → speech
Body (JSON)
{
"text": "…", // required, ≤ 5000 chars
"voice": "zh-CN-XiaoxiaoNeural", // default
"rate": "…", "pitch": "…", // string or number
"engine": "edge" // edge (default) | elevenlabs | openai | browser
}
Response (success) — raw audio bytes (fully buffered, with Content-Length). Play them directly.
| Status | Meaning |
|---|---|
| 400 | invalid body / empty text / text too long / bad rate·pitch |
| 429 | rate-limited (a short per-client window) |
| 503 | the selected engine's provider key is missing (e.g. an ElevenLabs key) |
| 401 | not authenticated |
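The request body and status codes above could be handled on the client roughly as follows. This is a sketch under assumptions: build_tts_body and describe_tts_failure are hypothetical helper names, whitespace-only text is treated as empty, and the 5000-character limit is checked with Python's len, which may not match how the server counts characters. Checking locally only saves a round-trip that would otherwise end in a 400.

```python
import json
from typing import Optional, Union

MAX_TTS_CHARS = 5000  # limit stated for the "text" field

Number = Union[int, float]


def build_tts_body(text: str,
                   voice: Optional[str] = None,
                   rate: Union[str, Number, None] = None,
                   pitch: Union[str, Number, None] = None,
                   engine: Optional[str] = None) -> bytes:
    """Serialize a JSON body for POST /api/tts, omitting unset optional fields."""
    if not text or not text.strip():
        raise ValueError("text is required")
    if len(text) > MAX_TTS_CHARS:
        raise ValueError(f"text exceeds {MAX_TTS_CHARS} characters")
    body = {"text": text}
    optional = (("voice", voice), ("rate", rate),
                ("pitch", pitch), ("engine", engine))
    for key, value in optional:
        if value is not None:
            body[key] = value  # omitted keys fall back to server defaults
    return json.dumps(body, ensure_ascii=False).encode("utf-8")


# Meanings copied from the status table; the wording is for display only.
TTS_FAILURES = {
    400: "invalid body, empty text, text too long, or bad rate/pitch",
    401: "not authenticated",
    429: "rate-limited; wait briefly before retrying",
    503: "the selected engine's provider key is missing",
}


def describe_tts_failure(status: int) -> str:
    """Map a non-2xx status from /api/tts to a short explanation."""
    return TTS_FAILURES.get(status, f"unexpected status {status}")
```

On success the response body is the audio itself, so a client would hand the bytes to its player rather than parse them as JSON.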
Engine choice
engine: "browser" signals the client to use on-device speech synthesis instead of a server round-trip — useful offline or to avoid provider cost. edge is the default server engine and needs no key; elevenlabs/openai need their respective provider keys configured on the server.
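One possible client-side policy for picking an engine is sketched below. The function names and the fallback order are assumptions rather than documented behavior: the sketch goes on-device when offline, defaults to edge otherwise, and after a 503 steps down from a keyed engine to edge, then to browser.

```python
from typing import Optional

ON_DEVICE = "browser"                  # synthesized by the client, no request sent
KEYLESS_SERVER_ENGINE = "edge"         # default server engine, needs no key
KEYED_ENGINES = ("elevenlabs", "openai")  # need provider keys on the server


def choose_engine(preferred: Optional[str] = None, online: bool = True) -> str:
    """Pick an engine name; offline clients always synthesize on-device."""
    if not online:
        return ON_DEVICE
    return preferred or KEYLESS_SERVER_ENGINE


def needs_server_request(engine: str) -> bool:
    """True when the client should call POST /api/tts for this engine."""
    return engine != ON_DEVICE


def fallback_after_503(engine: str) -> str:
    """Hypothetical fallback order after a 503 (provider key missing)."""
    if engine in KEYED_ENGINES:
        return KEYLESS_SERVER_ENGINE
    return ON_DEVICE
```

For example, a client that prefers openai but receives a 503 would retry with edge, and would only drop to on-device synthesis if that request also failed.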