Alibaba · 🔊 Text to Speech

Qwen 3 TTS 1.7B

private
Quick reference
Qwen 3 TTS 1.7B — TLDR
  • 💬 Alibaba's larger Qwen3-TTS variant for expressive speech generation
  • 📏 1.7B parameters; shares the family architecture with the 0.6B sibling
  • 🌐 Supports voice cloning, voice design, and multilingual synthesis
  • 🎯 Reported WER of 1.24 on test-en after post-training
  • ⚡ Streaming with first-packet latency around 101 ms
  • 📚 Stable long-form synthesis exceeding 10 minutes of speech
  • 🔧 Natural-language instructions steer tone, emotion, and pacing
  • 🏢 Developed by the Qwen team at Alibaba Cloud
💰 Pricing
$113
per 1M chars
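
As a rough illustration of what the listed rate means for a single request, the sketch below converts a character count into dollars. It assumes billing is strictly linear in `len(text)` with no minimum charge or rounding, which this page does not confirm; `estimate_cost_usd` is a hypothetical helper, not part of any Venice SDK.

```python
from decimal import Decimal

# Listed rate from this page: $113 per 1,000,000 characters.
PRICE_PER_MILLION_CHARS = Decimal("113")


def estimate_cost_usd(text: str) -> Decimal:
    """Estimate synthesis cost, ASSUMING strictly per-character billing."""
    return PRICE_PER_MILLION_CHARS * len(text) / Decimal(1_000_000)


# A 5,000-character script (roughly a short article).
script = "a" * 5_000
cost = estimate_cost_usd(script)
print(f"Estimated cost: ${cost}")
```

Under that assumption, a 5,000-character script works out to 5,000 × 113 / 1,000,000 = $0.565.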
📅 On Venice since
Mar 10, 2026
85 days ago
Provider

Alibaba Group is a Chinese multinational technology company founded in 1999 and headquartered in Hangzhou, Zhejiang. Originally built around e-commerce and cloud computing, Alibaba has become one of the most prolific contributors to open-weight AI research,…

46 models on Venice
17 text · 16 video · 5 image · 4 inpaint · 2 embedding · 2 tts
Since Jan 11, 2025

About this model

Qwen 3 TTS 1.7B is the larger member of Alibaba Cloud's Qwen3-TTS speech-generation family, released by the Qwen team alongside its smaller counterpart Qwen 3 TTS 0.6B. According to the Qwen3-TTS technical report, the family supports voice cloning from short reference clips, natural-language voice design, predefined speakers, and low-latency streaming across multiple languages.

Both sizes share the same architecture, but the 1.7B variant uses more VRAM in exchange for greater nuance. The technical report indicates that scaling from 0.6B to 1.7B yields consistent gains, with the larger model reaching a reported word error rate of 1.24 on the test-en set after post-training.
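
For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The sketch below is a generic implementation of that definition, not the evaluation pipeline from the technical report. It assumes the quoted 1.24 is a percentage, as is conventional, and it uses only lowercasing and whitespace splitting, whereas the report's text normalization and recognizer are not described on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substituted word out of four reference words.
wer = word_error_rate("the quick brown fox", "the quick brown box")
print(f"WER: {wer:.2f}%")
```

Read this way, a WER of 1.24 would correspond to roughly one word-level error per 80 reference words.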

Where the 0.6B sibling targets edge and consumer hardware, the 1.7B model targets higher output quality, especially in long-form narration, where it remains stable for over ten minutes of speech. The provider reports first-packet latency of around 101 ms for the 1.7B variant, which makes real-time applications feasible.
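
First-packet latency is the time between issuing a request and receiving the first chunk of audio. The sketch below shows one way to measure it on the client side for any iterable of byte chunks. `fake_stream` is a hypothetical stand-in for a streaming response, since this page does not document a streaming API; a client-side figure would also include network overhead that the provider's 101 ms number may not.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_chunk(chunks: Iterable[bytes]) -> Tuple[float, Iterator[bytes]]:
    """Return (seconds until the first chunk arrives, iterator over all chunks)."""
    start = time.perf_counter()
    it = iter(chunks)
    first = next(it)  # blocks until the first chunk; raises StopIteration if the stream is empty
    latency = time.perf_counter() - start

    def replay() -> Iterator[bytes]:
        # Hand back the first chunk so the caller loses no audio.
        yield first
        yield from it

    return latency, replay()


def fake_stream() -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response."""
    time.sleep(0.05)      # simulated delay before the first packet
    yield b"\x00" * 320
    time.sleep(0.01)
    yield b"\x00" * 320


latency_s, audio = time_to_first_chunk(fake_stream())
total_bytes = sum(len(chunk) for chunk in audio)
print(f"first chunk after {latency_s * 1000:.0f} ms, {total_bytes} bytes total")
```

Returning the replay iterator alongside the latency lets playback start from the very first chunk, so the measurement does not cost any audio.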

Natural-language instructions can steer tone, emotion, and pacing, making the 1.7B model suitable for narration, voice assistants, and accessibility tools where richer prosody matters. Within the Qwen3-TTS line, it sits above the 0.6B configuration as the higher-capacity option.

This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies; verify critical details against the sources listed below.

Data sources: Venice API · HuggingFace · Wikipedia — enrichment updated 1d ago