Qwen 3 TTS 0.6B

private

Quick reference

Qwen 3 TTS 0.6B — TLDR

- 🏢 Alibaba Qwen team's compact open-source text-to-speech model
- 📏 0.6B-parameter checkpoint, balancing speed and footprint
- ⚡ End-to-end streaming latency as low as 97 ms
- 🎯 Supports voice cloning from a roughly 3-second reference clip
- 🌐 Trained across multiple languages for multilingual synthesis
- 🔧 Discrete multi-codebook LM with Qwen3-TTS-Tokenizer-12Hz
- 💬 Voice design and natural-language instruction control
- 🔒 Distributed via Hugging Face, GitHub, and ModelScope

💰 Pricing

$87.50

per 1M chars
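
As a rough illustration of what the listed rate means per request, the sketch below converts a character count into an estimated cost. It assumes billing is strictly proportional to the number of characters and that `len()` on a Python string matches how characters are counted for billing; neither assumption is confirmed by this page, and the function name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

# Listed rate on this page: $87.50 per 1,000,000 characters.
PRICE_PER_MILLION_CHARS = Decimal("87.50")


def estimate_cost(text: str) -> Decimal:
    """Estimate the synthesis cost in USD for a piece of text.

    Assumption: cost scales linearly with len(text), i.e. the number of
    Unicode code points. The actual billing unit may differ.
    """
    return Decimal(len(text)) * PRICE_PER_MILLION_CHARS / Decimal(1_000_000)


if __name__ == "__main__":
    sample = "Hello, this is a short test sentence."
    print(f"{len(sample)} chars -> ${estimate_cost(sample)}")
```

Under these assumptions a 1,000-character passage works out to $0.0875. `Decimal` is used instead of floats so that small per-request amounts do not pick up binary rounding noise when summed.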

📅 On Venice since

Mar 10, 2026

132 days ago

Provider

Alibaba

Alibaba Group is a Chinese multinational technology company founded in 1999 and headquartered in Hangzhou, Zhejiang. Originally built around e-commerce and cloud computing, Alibaba has become one of the most prolific contributors to open-weight AI research,…

51 models on Venice

20 video · 18 text · 5 image · 4 inpaint · 2 embedding · 2 tts

Since Jan 11, 2025

About this model

Qwen 3 TTS 0.6B is the lighter member of Alibaba's Qwen3-TTS family, open-sourced by the Qwen team at Alibaba Cloud in 2026 alongside the larger Qwen 3 TTS 1.7B. Both share a discrete multi-codebook language-model architecture: the model converts text into speech tokens, which are then decoded into waveforms. The Qwen team's own Qwen3-TTS-Tokenizer-12Hz handles acoustic compression and semantic modeling. The design is streaming-first, with reported end-to-end synthesis latency as low as 97 milliseconds, which makes it usable for real-time interactive applications.

Compared with its 1.7B sibling, the 0.6B checkpoint trades some fidelity for a smaller footprint and faster generation, making it suitable for resource-constrained environments and low-latency synthesis. The two checkpoints share the same tokenizer and overall pipeline, with the 0.6B variant positioned as the speed-oriented option in the family.

The family supports voice cloning from a roughly 3-second reference clip, description-based voice design, and natural-language instruction control across multiple languages. Venice exposes this compact checkpoint for fast, low-latency speech synthesis, where its smaller size and streaming architecture suit conversational and assistant scenarios. Alibaba Cloud also offers Qwen-TTS speech synthesis through its Model Studio APIs, and the open-weight models and code are distributed on Hugging Face, GitHub, and ModelScope.

Sources

Speech synthesis - Qwen - Alibaba Cloud Model Studio - Alibaba Cloud Documentation Center (alibabacloud.com)

Qwen/Qwen3-TTS-12Hz-0.6B-Base · Hugging Face (huggingface.co)

This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies — verify critical details against the sources listed above.

Data sources: Venice API · HuggingFace · Wikipedia — enrichment updated 5d ago