This is a simplified guide to an AI model called Chatterbox-Turbo maintained by Resemble AI. If you like this kind of analysis, you should join AImodels.fyi or follow us on Twitter.
Model overview
chatterbox-turbo is a 350M parameter text-to-speech model created by Resemble AI that prioritizes speed and efficiency without compromising audio quality. It represents the latest advancement in the chatterbox family, which also includes chatterbox-multilingual for 23+ languages and chatterbox-pro for expressive synthesis. The model reduces computational requirements and VRAM usage while maintaining high-fidelity output. A key engineering achievement involves distilling the speech-token-to-mel decoder, cutting generation steps from 10 to just one, making this model ideal for applications requiring low-latency voice synthesis.
Model inputs and outputs
The model accepts text inputs along with optional reference audio for voice cloning and various generation parameters. It outputs audio files in WAV format. The synthesis process can be controlled through temperature, sampling parameters, and optional seed values for reproducibility. Reference audio clips must exceed 5 seconds for effective voice cloning, or you can select from 20 pre-made voices.
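The reference-audio length requirement can be checked locally before a clip is submitted. The following is a minimal sketch, assuming the clip is an uncompressed PCM WAV file; the helper names (`wav_duration_seconds`, `is_usable_reference`, `make_silent_wav`) are hypothetical and not part of any Chatterbox API.

```python
import io
import wave

# Threshold taken from this guide: reference clips must exceed 5 seconds.
MIN_REFERENCE_SECONDS = 5.0


def wav_duration_seconds(source) -> float:
    """Return the duration of a PCM WAV given a path or a file-like object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(source) -> bool:
    """True if the clip is longer than the 5-second threshold."""
    return wav_duration_seconds(source) > MIN_REFERENCE_SECONDS


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit silent WAV, used only for demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(is_usable_reference(make_silent_wav(6.0)))
    print(is_usable_reference(make_silent_wav(3.0)))
```

Compressed formats such as MP3 would need a different decoder; the standard-library `wave` module only reads uncompressed WAV data.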
Inputs
- Text: The content to synthesize (maximum 500 characters), supporting paralinguistic tags like [cough], [laugh], [chuckle], [clear throat], [sigh], [groan], [sniff], [gasp], and [shush]
- Voice: Pre-made voice selection from options including Andy, Abigail, Aaron, Brian, Chloe, Dylan, and others
- Reference Audio: Optional audio file for voice cloning (the clip must be longer than 5 seconds)
- Temperature: Controls randomness in generation, ranging from 0.05 to 2.0 (default 0.8)
- Top P: Nucleus sampling threshold between 0.5 and 1.0 (default 0.95)
- Top K: Restricts sampling to the K most likely tokens at each step, between 1 and 2000 (default 1000)
- Repetition Penalty: Reduces token repetition with values from 1 to 2 (default 1.2)
- Seed: Optional integer for reproducible results
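The ranges and defaults listed above can be collected into a small request object that validates values before they are sent anywhere. This is a sketch only: the field names (`text`, `voice`, `reference_audio`, `top_p`, and so on) and the `TTSRequest` class are assumptions made for illustration, so check the hosting platform's actual input schema before relying on them.

```python
from dataclasses import asdict, dataclass
from typing import Optional

MAX_TEXT_CHARS = 500  # text limit stated in this guide

# Inclusive (min, max) ranges as listed in this guide.
RANGES = {
    "temperature": (0.05, 2.0),
    "top_p": (0.5, 1.0),
    "top_k": (1, 2000),
    "repetition_penalty": (1.0, 2.0),
}


@dataclass
class TTSRequest:
    """Hypothetical container for the inputs described above."""

    text: str
    voice: Optional[str] = None            # e.g. "Andy"
    reference_audio: Optional[str] = None  # path or URL of a clip for cloning
    temperature: float = 0.8
    top_p: float = 0.95
    top_k: int = 1000
    repetition_penalty: float = 1.2
    seed: Optional[int] = None

    def validate(self) -> None:
        """Raise ValueError if the text or any sampling parameter is out of range."""
        if not self.text:
            raise ValueError("text must not be empty")
        if len(self.text) > MAX_TEXT_CHARS:
            raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
        for name, (low, high) in RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")

    def to_payload(self) -> dict:
        """Validate, then return a dict with unset optional fields dropped."""
        self.validate()
        return {key: value for key, value in asdict(self).items() if value is not None}


if __name__ == "__main__":
    request = TTSRequest(text="Well [chuckle] that went better than expected.", voice="Andy", seed=42)
    print(request.to_payload())
```

Fixing the seed while keeping the other parameters unchanged is what makes a run repeatable; leaving it unset lets each generation vary.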
Outputs
- Audio File: Synthesized speech in WAV format, with an embedded Perth watermark for responsible AI tracking
Capabilities
The model generates natural-sounding s…
