🌬️ Sopro TTS - Zero-Shot Voice Cloning

A lightweight (135M-parameter) text-to-speech model with zero-shot voice cloning, by Samuel Vitorino. Upload a 3 to 12 second audio clip to clone a voice!

Text to Synthesize

Reference Audio (3 to 12 seconds recommended)

Generated Audio

⚠️ Disclaimers

Sopro can be inconsistent. If the output sounds glitchy, try tweaking the Temperature and Style Strength.
Voice cloning quality depends heavily on the microphone quality and the amount of ambient noise in the reference audio.
Generation length is currently capped at ~32 seconds to prevent hallucinations.
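
If you want to check a reference clip before uploading it, the 3 to 12 second recommendation above can be tested locally. The snippet below is a minimal sketch, not part of Sopro: it assumes the clip is an uncompressed WAV file and uses only Python's standard `wave` module. The function names and constants are made up for illustration.

```python
import wave

# Recommended reference-clip length from this page, in seconds.
MIN_SECONDS = 3.0
MAX_SECONDS = 12.0


def clip_duration_seconds(source):
    """Return the duration of a WAV file (path or binary file object) in seconds."""
    with wave.open(source, "rb") as wav:
        # Duration = number of audio frames / frames per second.
        return wav.getnframes() / wav.getframerate()


def is_recommended_length(seconds, low=MIN_SECONDS, high=MAX_SECONDS):
    """True if the duration falls inside the recommended range (inclusive)."""
    return low <= seconds <= high
```

For example, `is_recommended_length(clip_duration_seconds("my_voice.wav"))` tells you whether a hypothetical file `my_voice.wav` is within the recommended range. Other formats such as MP3 would need to be converted to WAV first, or read with a different library.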