About this model
Kokoro is an open-weight text-to-speech model created by Hexgrad, an individual open-source developer, and released in its v1.0 form in March 2025. With just 82 million parameters, it takes text as input and produces audio as output, positioning it as a compact alternative to far larger speech systems. Its small footprint is its defining feature, and the catalog describes it as offering efficient inference and natural-sounding voices.
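To put the 82-million-parameter figure in perspective, the short sketch below estimates raw weight storage at a few common numeric precisions. It is back-of-the-envelope arithmetic only: the precisions listed are illustrative assumptions rather than formats the catalog confirms for Kokoro, and the helper name `weight_megabytes` is hypothetical.

```python
# Rough weight-storage estimate for an 82M-parameter model.
# The dtype list is an illustrative assumption, not a statement about
# which precisions the released checkpoints actually use.

PARAM_COUNT = 82_000_000  # "82 million parameters" from the description

# Standard storage size of one value for each dtype, in bytes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_megabytes(param_count: int, bytes_per_param: int) -> float:
    """Return raw weight storage in megabytes (1 MB = 10**6 bytes)."""
    return param_count * bytes_per_param / 1_000_000


estimates = {
    dtype: weight_megabytes(PARAM_COUNT, size)
    for dtype, size in BYTES_PER_PARAM.items()
}

for dtype, megabytes in estimates.items():
    print(f"{dtype}: ~{megabytes:.0f} MB")
```

These figures cover the weights alone. Activations, voice data, and runtime overhead add to the real memory use, so they are best read as a lower bound.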
The weights are distributed under the permissive Apache-2.0 license, making the model freely deployable across both production and hobby settings. A community ONNX port of the v1.0 release is published on Hugging Face, broadening the runtimes and platforms on which the model can operate.
The ecosystem around Kokoro includes Hexgrad's hosted demo Space on Hugging Face, where the model can be tried directly in the browser. As the catalog notes, it remains an efficient, low-cost open-source baseline for developers building voice interfaces, audiobooks, and accessibility applications.
This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies — verify critical details against the sources listed above.
Research & Papers
Two reference papers linked from the Hugging Face model card.
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models (2023)
Yinghao Aaron Li, Cong Han, Vinay S. Raghavan et al.
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random…
iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform (2022)
Takuhiro Kaneko, Kou Tanaka, Hirokazu Kameoka et al.
In recent text-to-speech synthesis and voice conversion systems, a mel-spectrogram is commonly applied as an intermediate representation, and the necessity for a mel-spectrogram vocoder is increasing. A mel-spectrogram vocoder must solve three inverse problems: recovery of the…
Data sources: Venice API · Hugging Face · Wikipedia · arXiv