Alibaba · 🔊 Text to Speech

Qwen 3 TTS 0.6B

private
Quick reference
Qwen 3 TTS 0.6B — TLDR
  • 🏢 Alibaba Qwen team's compact open-source text-to-speech model
  • ⚡ End-to-end synthesis latency as low as 97ms
  • 🌐 Supports 10 languages including Chinese, English, Japanese, Korean
  • 🎯 3-second voice cloning and natural-language voice control
  • 📚 Trained on over 5 million hours of speech
  • 🔧 Discrete multi-codebook language-model architecture for end-to-end speech modeling
  • 🆕 Smaller sibling of the 1.7B variant
💰 Pricing
$87.50 per 1M chars
📅 On Venice since
Mar 10, 2026 (85 days ago)
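
For budgeting, the listed rate of $87.50 per 1M characters can be turned into a rough per-request estimate. The sketch below assumes billing is strictly linear in character count and that characters are counted the way Python's `len` counts them (Unicode code points). Neither detail is confirmed on this page, and `estimate_cost_usd` is a hypothetical helper name, not part of any API.

```python
# Listed price on this page: $87.50 per 1,000,000 characters.
PRICE_PER_MILLION_CHARS_USD = 87.50


def estimate_cost_usd(text: str) -> float:
    """Rough cost estimate for synthesizing `text`.

    Assumption: cost scales linearly with len(text); the provider's
    actual character-counting and rounding rules may differ.
    """
    return len(text) * PRICE_PER_MILLION_CHARS_USD / 1_000_000


if __name__ == "__main__":
    sample = "Hello from Qwen 3 TTS."
    print(f"{len(sample)} chars -> ${estimate_cost_usd(sample):.6f}")
```

Under these assumptions, a 1,000-character passage comes to about $0.0875.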
Provider

Alibaba Group is a Chinese multinational technology company founded in 1999 and headquartered in Hangzhou, Zhejiang. Originally built around e-commerce and cloud computing, Alibaba has become one of the most prolific contributors to open-weight AI research,…

46 models on Venice
17 text · 16 video · 5 image · 4 inpaint · 2 embedding · 2 tts
Since Jan 11, 2025

About this model

Qwen 3 TTS 0.6B is the compact member of Alibaba Cloud's open-source Qwen3-TTS family, released by the Qwen team for multilingual, controllable, streaming speech synthesis. It is built on the self-developed Qwen3-TTS-Tokenizer-12Hz, which performs efficient acoustic compression, and adopts a discrete multi-codebook language-model architecture for full-information end-to-end speech modeling. According to the official model card, it was trained on over 5 million hours of speech spanning 10 languages and supports 3-second voice cloning plus description-based control.
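
To give a feel for what a low-frame-rate, multi-codebook tokenizer implies, the sketch below counts the discrete tokens needed to represent a clip. It assumes a nominal 12 frames per second, read off the "12Hz" in the tokenizer name, and treats the number of codebooks as a hypothetical parameter (`n_codebooks=8` is a placeholder, not a figure from the model card). The real tokenizer's exact frame rate and codebook count should be checked against the official documentation.

```python
import math


def codec_token_count(
    duration_s: float,
    frame_rate_hz: float = 12.0,  # assumption: nominal rate implied by "12Hz"
    n_codebooks: int = 8,         # hypothetical placeholder, not from the model card
) -> tuple[int, int]:
    """Return (frames, total_tokens) for a clip of `duration_s` seconds.

    Each frame carries one discrete token per codebook, so the total
    token count is frames * n_codebooks.
    """
    frames = math.ceil(duration_s * frame_rate_hz)
    return frames, frames * n_codebooks


if __name__ == "__main__":
    # A 3-second reference clip, as used for voice cloning.
    frames, tokens = codec_token_count(3.0)
    print(f"frames={frames}, tokens={tokens}")
```

The point of the arithmetic is that a low frame rate keeps the sequence short: under these assumptions, a 3-second reference clip is only a few dozen frames long.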

The headline feature is speed: the provider reports end-to-end synthesis latency as low as 97ms, suiting real-time interactive scenarios. The CustomVoice variant offers 9 premium timbres with fine-grained style control via natural-language instructions across languages including Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish and Italian.

Within the family, the 0.6B sits alongside its larger sibling, Qwen 3 TTS 1.7B. Both share the same 12Hz-tokenizer foundation and capabilities; the 0.6B is the lighter-weight variant, targeting reduced compute and quicker inference.

A Qwen3-TTS Technical Report accompanies the release, and the models are distributed openly on Hugging Face and ModelScope. Its small size and open availability make the 0.6B a practical choice for low-latency, on-device or cost-sensitive speech applications.

This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies, so verify critical details against the sources listed below.

Data sources: Venice API · HuggingFace · Wikipedia — enrichment updated 1d ago