This is a simplified guide to an AI model called Speech-02-Turbo maintained by Minimax.
The speech-02-turbo model from Minimax turns text into expressive speech with customizable voices, emotions, and multilingual support. Its real-time performance and low latency make this text-to-audio system suitable for interactive applications. Its sibling model, speech-02-hd, focuses on high-fidelity output, while this turbo variant prioritizes speed.
Model inputs and outputs
The model takes text input and generates audio output, with extensive configuration options for voice customization. The system supports pauses between words through special markup and offers fine-grained control over speech parameters.
Inputs
- Text: Up to 5000 characters with optional pause control using <#x#> markup
- Voice Selection: 17 distinct voice options, such as Wise_Woman and Friendly_Person
- Speech Parameters: Speed (0.5-2x), volume (0-10), pitch (-12 to +12)
- Emotion: Seven options: neutral, happy, sad, angry, fearful, disgusted, and surprised
- Audio Settings: Configurable bitrate, sample rate, and mono/stereo output
- Language Support: Enhanced recognition for 25 languages and dialects
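To make the limits above concrete, here is a minimal Python sketch that checks a request against them before it is sent. It is an illustration, not the model's official client: the function names `pause` and `build_input` and the dictionary keys (`voice_id`, `speed`, and so on) are hypothetical. It also assumes that the `x` in `<#x#>` is a pause length in seconds and that the markup counts toward the 5000-character limit.

```python
# Limits taken from the input list above.
MAX_CHARS = 5000
EMOTIONS = {"neutral", "happy", "sad", "angry", "fearful", "disgusted", "surprised"}


def pause(seconds: float) -> str:
    """Return pause markup in the <#x#> form.

    Assumption: x is a duration in seconds.
    """
    if seconds <= 0:
        raise ValueError("pause length must be positive")
    return f"<#{seconds:g}#>"


def build_input(text, voice="Wise_Woman", speed=1.0, volume=1.0, pitch=0,
                emotion="neutral"):
    """Validate parameters and return a request payload.

    The key names in the returned dict are hypothetical placeholders.
    """
    # Assumption: pause markup counts toward the character limit.
    if not text or len(text) > MAX_CHARS:
        raise ValueError(f"text must be 1 to {MAX_CHARS} characters")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2")
    if not 0 <= volume <= 10:
        raise ValueError("volume must be between 0 and 10")
    if not -12 <= pitch <= 12:
        raise ValueError("pitch must be between -12 and 12")
    if emotion not in EMOTIONS:
        raise ValueError(f"emotion must be one of {sorted(EMOTIONS)}")
    return {
        "text": text,
        "voice_id": voice,
        "speed": speed,
        "volume": volume,
        "pitch": pitch,
        "emotion": emotion,
    }


# Example: a half-second pause between two sentences.
payload = build_input("Hello there." + pause(0.5) + "How are you?", emotion="happy")
```

Validating locally like this catches out-of-range values before a request is made, which matters most in the interactive settings the turbo variant targets.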
Outputs
- Audio File: URL to the generated speech audio file
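Because the output is a URL rather than raw audio, a caller usually needs one more step to get a local file. The sketch below shows one way to do that with the Python standard library. The function name `save_audio` and the example URL are hypothetical, and the sketch assumes the URL can be fetched without extra authentication.

```python
import shutil
import urllib.request
from pathlib import Path


def save_audio(url: str, dest) -> Path:
    """Download the audio at `url` and write it to `dest`.

    Assumption: the URL is directly fetchable without authentication headers.
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk instead of loading it all into memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Usage (hypothetical URL):
# save_audio("https://example.com/output.mp3", "speech/output.mp3")
```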
Capabilities
The system excels at producing natural-…
