Seed-TTS by ByteDance: A Family of High-Quality, Versatile Speech Generation Models

This month, ByteDance released Seed-TTS. So far only the paper is available; the code has not been made public.

Official Demo

  1. Voice Factor Decomposition - Zero-shot Voice Conversion

Source Audio

Timbre Prompt

Converted Audio

  2. Preference Adjustment via Reinforcement Learning - Emotional Control in Zero-shot In-context Learning

Prompt

Angry

Happy

  3. Fully Diffusion-based Speech Generation - Zero-shot TTS

Prompt

Same Language Generation

I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences.

Cross-lingual Generation

Suddenly, the atmosphere became heavy. At first glance, it seemed like all the troubles were surrounding me. I frowned, feeling the pressure, but I knew I couldn't give up, couldn't admit defeat. So, I took a deep breath, and a voice in my heart told me: "No matter what, you have to calm down and start again."

Application scenarios:

  • Audiobooks

"This pill... it can't be something like a sedative or an aphrodisiac, right? Why does the smell seem so similar to what the two sisters mentioned? Hmm, could it be that you... have ill intentions towards me?" Han Li was stunned for quite a while after hearing this. He suddenly felt like spitting out three bowls of blood. This girl's thoughts were too unpredictable. How could she associate Yingxiang Pill with an aphrodisiac? Oh well, Han Li wasn't sure whether he should admire her caution or cry out in protest of being falsely accused. "It seems like you're telling the truth. But I still need to take it to my second sister for inspection before using it. After all, as women, we must be careful." "Cough, cough, uh, do as you please." Han Li was speechless and could only cough a few times to cover up his embarrassment. He now thought that it would be better to keep some distance from this little sprite; otherwise, he might get depressed by her at any time. "Hmph, but if this medicine is really as effective as you say, then you've passed the test! If Master has any difficulties in the Mo residence in the future, you can definitely come to Caihuan for help. As long as I receive a small fee, I will surely solve your problems completely." "Alright, little sister, if Master needs help, I will certainly seek your assistance." Han Li then returned to his normal state, responding with a forced smile, but inside, he was thinking fiercely: "I'd rather not bother with you, you little money-grubber."

  • Cross-lingual content creation

Source Video

Generated Video

Summary

Besides the language-model-based system, Seed-TTS includes a fully diffusion-based variant. It does not rely on pre-estimated phoneme durations and generates speech end to end. The research team showed that this variant performs comparably to the language-model-based variant in both objective and subjective evaluations, and demonstrated its effectiveness in speech editing.

Method

Overview of Seed-TTS Inference Process


  1. Speech tokenizer: Learns speech tokens from the reference audio.
  2. Autoregressive language model: Generates speech tokens conditioned on the input text and the reference speech.
  3. Diffusion transformer: Generates continuous speech representations in a coarse-to-fine manner, given the generated speech tokens.
  4. Acoustic vocoder: Produces high-quality speech from the diffusion outputs.
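
Since the Seed-TTS code has not been released, the following is only a toy sketch of how the four stages listed above could be chained at inference time. Every function name, parameter, and numeric choice here is a hypothetical stand-in invented for illustration: simple arithmetic replaces each learned model, and none of it reflects the actual Seed-TTS implementation.

```python
from typing import List

LEVELS = 8  # assumed size of the toy discrete codebook
FRAME = 4   # assumed number of audio samples per token


def tokenize_speech(audio: List[float]) -> List[int]:
    """Stage 1 stand-in: quantize each frame's mean amplitude to a discrete token."""
    tokens = []
    for start in range(0, len(audio), FRAME):
        chunk = audio[start:start + FRAME]
        mean = sum(chunk) / len(chunk)
        # Map an amplitude in [-1, 1] to an index in [0, LEVELS - 1].
        index = int((mean + 1.0) / 2.0 * (LEVELS - 1) + 0.5)
        tokens.append(max(0, min(LEVELS - 1, index)))
    return tokens


def generate_tokens(text: str, prompt_tokens: List[int]) -> List[int]:
    """Stage 2 stand-in: emit one token per character, each depending on the previous one."""
    previous = prompt_tokens[-1] if prompt_tokens else 0
    generated = []
    for char in text:
        previous = (previous + ord(char)) % LEVELS
        generated.append(previous)
    return generated


def refine_features(tokens: List[int], steps: int = 3) -> List[float]:
    """Stage 3 stand-in: move from a coarse estimate toward the target in a few steps."""
    target = [t / (LEVELS - 1) * 2.0 - 1.0 for t in tokens]
    features = [0.0] * len(tokens)  # coarse starting point
    for _ in range(steps):
        features = [f + 0.5 * (t - f) for f, t in zip(features, target)]
    return features


def vocode(features: List[float]) -> List[float]:
    """Stage 4 stand-in: expand each feature frame into FRAME waveform samples."""
    waveform = []
    for value in features:
        waveform.extend([value] * FRAME)
    return waveform


def synthesize(text: str, reference_audio: List[float]) -> List[float]:
    """Chain the four toy stages in the order described above."""
    prompt_tokens = tokenize_speech(reference_audio)
    speech_tokens = generate_tokens(text, prompt_tokens)
    features = refine_features(speech_tokens)
    return vocode(features)


waveform = synthesize("hello", [0.1] * 16)
print(len(waveform))
```

The point of the sketch is the data flow only: the reference audio becomes discrete tokens, the text and those tokens drive a second discrete sequence, that sequence is refined into continuous values, and the final stage turns those values into samples.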