About this model
xAI Speech to Text v1 is an automatic speech recognition model released on April 18, 2026, alongside the company's companion Text-to-Speech API. It is the inaugural entry in xAI's dedicated speech-to-text family, marking the lab's first move into standalone audio transcription rather than voice features bundled inside Grok.
Because this is a version-one release, there is no same-family predecessor to benchmark against. Instead, it extends xAI's broader voice stack, sitting next to xAI TTS v1 for synthesis and the Grok 4.3 language model for downstream reasoning over transcripts. The model exposes transcription both as a single API call and as a real-time WebSocket stream; in the streaming case, the server waits for an initialization signal before any audio is sent.
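The ordering constraint on the streaming path (initialization first, audio second) can be enforced on the client side with a small guard. The sketch below is illustrative only: `StreamSession`, `mark_initialized`, `send_audio`, and `chunk_audio` are hypothetical names, not part of any xAI SDK, and the actual message format of the initialization signal is not specified here, so the code models only the ordering and leaves the network write as a comment.

```python
from dataclasses import dataclass, field
from typing import List


class NotInitializedError(RuntimeError):
    """Raised when audio is queued before initialization has completed."""


@dataclass
class StreamSession:
    """Hypothetical client-side guard for a streaming transcription session."""

    initialized: bool = False
    sent_chunks: List[bytes] = field(default_factory=list)

    def mark_initialized(self) -> None:
        # Call once the initialization step of the stream has completed.
        self.initialized = True

    def send_audio(self, chunk: bytes) -> None:
        # Refuse to forward audio until initialization has happened.
        if not self.initialized:
            raise NotInitializedError("initialization signal not yet exchanged")
        # A real client would write the chunk to the WebSocket here;
        # this sketch only records it.
        self.sent_chunks.append(chunk)


def chunk_audio(data: bytes, chunk_size: int) -> List[bytes]:
    """Split a byte buffer into fixed-size chunks; the last may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
```

In use, a client would call `mark_initialized()` after the handshake and then loop over `chunk_audio(...)`, passing each chunk to `send_audio`. Keeping the guard separate from the transport makes the ordering rule testable without a live connection.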
On capabilities, xAI's documentation lists support for 12 audio formats, word-level timestamps, multichannel audio, speaker diarization, Smart Turn end-of-turn detection, and 25 languages. Developers can supply a key-term hint and choose automatic language detection or a fixed language, and the response returns the full text plus per-word start and end times.
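Per-word start and end times are most useful once they are grouped into larger units such as caption lines. The following is a minimal sketch under a stated assumption: the response is taken to contain a list of words, each with `word`, `start`, and `end` fields in seconds. Those field names, the `group_words` helper, and the `max_gap` threshold are hypothetical and chosen for illustration; the real response schema should be checked against xAI's documentation.

```python
from typing import Dict, List, Tuple, Union

# Assumed (hypothetical) shape of one timestamped word in the response.
Word = Dict[str, Union[str, float]]
# A segment is (start_seconds, end_seconds, text).
Segment = Tuple[float, float, str]


def _close(words: List[Word]) -> Segment:
    """Collapse a run of words into one (start, end, text) segment."""
    text = " ".join(str(w["word"]) for w in words)
    return (float(words[0]["start"]), float(words[-1]["end"]), text)


def group_words(words: List[Word], max_gap: float = 0.5) -> List[Segment]:
    """Start a new segment whenever the pause between words exceeds max_gap."""
    segments: List[Segment] = []
    current: List[Word] = []
    for w in words:
        pause = float(w["start"]) - float(current[-1]["end"]) if current else 0.0
        if current and pause > max_gap:
            segments.append(_close(current))
            current = []
        current.append(w)
    if current:
        segments.append(_close(current))
    return segments


# Made-up sample data in the assumed shape, for illustration only.
sample = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.45, "end": 0.9},
    {"word": "next", "start": 2.0, "end": 2.3},
    {"word": "turn", "start": 2.35, "end": 2.7},
]
segments = group_words(sample)
```

With this sample, the 1.1-second pause after "world" exceeds the 0.5-second threshold, so the words split into two segments. Splitting on pauses is only one heuristic; a pipeline with diarization enabled might split on speaker changes instead.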
The feature set targets practical voice workloads where streaming latency and structured, timestamped output matter, including transcription pipelines and interactive voice agents. Because this is a first-generation model, it will serve as the natural baseline against which future same-family versions are measured.
This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies; verify critical details against the sources listed below.
Data sources: Venice API · HuggingFace · Wikipedia