xAI Speech to Text v1

Quick reference
xAI Speech to Text v1 — TLDR
  • 🆕 xAI's first standalone speech-to-text API, released April 2026.
  • 🌐 Transcribes 25 languages, with the language either detected automatically or specified by the caller.
  • 🎯 Returns word-level timestamps and speaker diarization, and supports multichannel audio.
  • ⚡ Real-time streaming via WebSocket plus single-call batch transcription.
  • 🔧 Accepts 12 audio formats with custom key-term hinting.
  • 💬 Includes Smart Turn end-of-turn detection for voice agents.
💰 Pricing
$0.113 per audio hour
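
As a rough budgeting aid, the listed rate can be turned into a per-file estimate. The sketch below assumes billing is linear and pro-rated by the second, which the page does not state; `estimate_cost` is a hypothetical helper name, not part of any xAI or Venice SDK.

```python
from decimal import Decimal

# Listed rate from the pricing card, in USD per hour of audio.
RATE_PER_AUDIO_HOUR = Decimal("0.113")


def estimate_cost(audio_seconds: int) -> Decimal:
    """Estimate the USD cost of transcribing `audio_seconds` of audio.

    Assumption: cost scales linearly with duration (no minimum charge,
    no rounding up to whole minutes). Actual billing rules may differ.
    """
    return RATE_PER_AUDIO_HOUR * Decimal(audio_seconds) / Decimal(3600)


# Example: a 90-minute recording.
print(estimate_cost(90 * 60))
```

Using `Decimal` rather than `float` avoids binary rounding noise when many small estimates are summed.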
📅 On Venice since
Apr 18, 2026
47 days ago
Provider

xAI is an American artificial intelligence company and wholly owned subsidiary of SpaceX. The company develops AI systems under the Grok brand, spanning language models, image generation, and video synthesis. xAI has quickly established itself as a multimodal…

18 models on Venice
8 video · 4 text · 2 image · 2 inpaint · 1 tts · 1 asr
Since Jan 29, 2026

About this model

xAI Speech to Text v1 is an automatic speech recognition model released on April 18, 2026, alongside xAI's companion Text-to-Speech API. It is the first entry in xAI's dedicated speech-to-text family and marks the lab's first move into standalone audio transcription, as opposed to voice features bundled inside Grok.

Because this is a version-one release, there is no same-family predecessor to benchmark against. Instead, it extends xAI's broader voice stack, sitting next to xAI TTS v1 for synthesis and the Grok 4.3 language model for downstream reasoning over transcripts. The model exposes transcription both as a single API call and as a real-time WebSocket stream; in streaming mode, the server waits for an initialization signal before audio is sent.
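
One reading of the streaming handshake described above is that the client sends an initialization message first and only then sends audio. The sketch below models just that ordering, without any network code. The message shape (`type`, `language`, `key_terms`) and the function name `build_frames` are hypothetical, chosen for illustration; the actual xAI wire format should be taken from the provider's documentation.

```python
import json
from typing import Iterable, Iterator, Sequence, Union

# A WebSocket frame is either text (str) or binary (bytes).
Frame = Union[str, bytes]


def build_frames(
    audio_chunks: Iterable[bytes],
    language: str = "auto",          # hypothetical field: "auto" or a fixed language code
    key_terms: Sequence[str] = (),   # hypothetical field: key-term hints
) -> Iterator[Frame]:
    """Yield the frames a client would send, in order.

    Assumption: the initialization signal is a JSON text frame sent by the
    client, and audio follows as binary frames. Nothing is sent over a socket.
    """
    init_message = {
        "type": "init",  # hypothetical message type
        "language": language,
        "key_terms": list(key_terms),
    }
    yield json.dumps(init_message)

    # Audio is only emitted after the initialization frame.
    for chunk in audio_chunks:
        yield chunk


frames = list(build_frames([b"\x00\x01", b"\x02\x03"], language="en", key_terms=["Grok"]))
```

Keeping the frame ordering in a plain generator makes it easy to test independently of whichever WebSocket client library is used to send the frames.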

On capabilities, xAI's documentation lists support for 12 audio formats, word-level timestamps, multichannel audio, speaker diarization, Smart Turn end-of-turn detection, and 25 languages. Developers can supply a key-term hint and choose automatic language detection or a fixed language, and the response returns the full text plus per-word start and end times.
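
Per-word start and end times make it straightforward to post-process a transcript, for example by splitting it into caption-style segments wherever the speaker pauses. The sketch below assumes each word is a dictionary with `word`, `start`, and `end` keys (times in seconds); that shape and the `split_on_pauses` helper are hypothetical and may not match the actual response schema.

```python
from typing import Dict, List, Union

Word = Dict[str, Union[str, float]]


def split_on_pauses(words: List[Word], max_gap: float = 0.5) -> List[dict]:
    """Group timestamped words into segments, breaking at long silences.

    A new segment starts whenever the gap between the previous word's end
    and the next word's start exceeds `max_gap` seconds.
    """
    groups: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        if current and word["start"] - current[-1]["end"] > max_gap:
            groups.append(current)
            current = []
        current.append(word)
    if current:
        groups.append(current)

    return [
        {
            "start": group[0]["start"],
            "end": group[-1]["end"],
            "text": " ".join(str(w["word"]) for w in group),
        }
        for group in groups
    ]


# Hypothetical word list in the assumed shape.
sample_words = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 0.9},
    {"word": "again", "start": 2.0, "end": 2.4},
]
segments = split_on_pauses(sample_words)
```

With the sample above, the 1.1-second silence before "again" exceeds the default threshold, so the words fall into two segments.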

The feature set targets practical voice workloads where streaming latency and structured, timestamped output matter, including transcription pipelines and interactive voice agents. Because this is a first-generation model, future versions in the same family will provide the natural baseline for measuring improvements.

This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies, so verify critical details against the sources listed below.

Data sources: Venice API · HuggingFace · Wikipedia — enrichment updated 4d ago