PromptTTS 2: Describing and Generating Voices with Text Prompt
Original Paper: https://arxiv.org/abs/2309.02285
By: Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li, Sheng Zhao, Tao Qin, Jiang Bian
Abstract:
Speech conveys more information than text, as the same word can be uttered in various voices to convey diverse information.
Compared to traditional text-to-speech (TTS) methods relying on speech prompts (reference speech) for voice variability, using text prompts (descriptions) is more user-friendly since speech prompts can be hard to find or may not exist at all.
TTS approaches based on the text prompt face two main challenges:
1) the one-to-many problem, where not all details about voice variability can be described in the text prompt, and
2) the limited availability of text prompt datasets, where vendors and large cost of data labeling are required to write text prompts for speech.
In this work, we introduce PromptTTS 2 to address these challenges with a variation network to provide variability information of voice not captured by text prompts, and a prompt generation pipeline to utilize the large language models (LLM) to compose high quality text prompts.
Specifically, the variation network predicts the representation extracted from the reference speech (which contains full information about voice variability) based on the text prompt representation.
For the prompt generation pipeline, it generates text prompts for speech with a speech language understanding model to recognize voice attributes (e.g., gender, speed) from speech and a large language model to formulate text prompts based on the recognition results.
Experiments on a large-scale (44K hours) speech dataset demonstrate that compared to the previous works, PromptTTS 2 generates voices more consistent with text prompts and supports the sampling of diverse voice variability, thereby offering users more choices on voice generation.
Additionally, the prompt generation pipeline produces high-quality text prompts, eliminating the large labeling cost. The demo page of PromptTTS 2 is available online.
Summary Notes
Artificial Intelligence (AI) and machine learning have significantly improved text-to-speech (TTS) systems, making them sound more natural.
Yet controlling which voice a system produces remains difficult. Microsoft Research introduces PromptTTS 2, which uses text prompts (written descriptions of a voice) rather than reference recordings to describe and generate voice characteristics.
Key Challenges with Voice Variability in TTS
Creating varied voices in TTS systems faces two main obstacles:
- One-to-Many Issue: A text prompt cannot describe every detail of a voice, so many different voices can match the same prompt.
- Data Gathering Difficulty: Text prompt datasets are scarce, because writing prompts for speech requires vendors and costly manual labeling.
Introducing PromptTTS 2
PromptTTS 2 brings new strategies to overcome these challenges:
- Variation Network: It supplies the voice variability information that the text prompt does not capture. During training, a diffusion model learns to predict the representation extracted from reference speech, conditioned on the text prompt representation. This tackles the one-to-many issue and lets users sample different voices for the same prompt.
- Prompt Generation Pipeline: This automated system writes text prompts in two steps. A speech language understanding (SLU) model recognizes voice attributes such as gender and speed from speech, and a large language model (LLM) composes prompts from the recognized attributes. It reduces the need for manual prompt writing and makes text prompt datasets cheaper to build.
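To make the Variation Network bullet above concrete, here is a minimal sketch of sampling several voice representations for one prompt. It is a toy stand-in, not the paper's trained diffusion model: the function name `toy_variation_sample`, the update rule, the step count, and the assumption that prompt and reference representations share one vector space are all hypothetical choices made for illustration.

```python
import random


def toy_variation_sample(prompt_repr, seed, steps=20, spread=0.5):
    """Sample a toy "reference representation" for a given prompt representation.

    Starts from Gaussian noise and moves step by step toward a
    prompt-dependent target while the injected noise shrinks. This only
    mimics the shape of iterative denoising; nothing here is learned.
    """
    rng = random.Random(seed)
    # Start from pure noise, one value per dimension.
    z = [rng.gauss(0.0, 1.0) for _ in prompt_repr]
    for step in range(steps):
        # Noise scale decays linearly and reaches 0 at the last step.
        noise_scale = spread * (1.0 - (step + 1) / steps)
        z = [
            zi + 0.3 * (pi - zi) + rng.gauss(0.0, noise_scale)
            for zi, pi in zip(z, prompt_repr)
        ]
    return z


# One prompt representation, several seeds: each seed gives a different
# candidate voice representation near the same prompt.
prompt_repr = [0.2, -0.4, 0.9]
candidates = [toy_variation_sample(prompt_repr, seed=s) for s in range(3)]
```

The point of the sketch is the interface rather than the arithmetic: the prompt fixes what the voice should roughly be, and the random seed selects one of the many voices compatible with it.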
How PromptTTS 2 Works
PromptTTS 2's system architecture includes:
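- TTS Module: Synthesizes speech from the input text, conditioned on the representations produced by the style module.
- Style Module: Extracts a prompt representation from the text prompt and a reference representation from the reference speech. At inference, the variation network predicts the reference representation from the prompt representation, so synthesis needs only a text prompt.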
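The split between the two modules can be sketched as a small data flow. This is an illustration under stated assumptions, not the paper's implementation: every function below (`encode_prompt`, `encode_reference`, `variation_network`, `style_representations`, `tts_module`) is a hypothetical placeholder, the encoders are simple arithmetic rather than neural networks, and the "speech" output is just a dictionary.

```python
import hashlib
import random


def encode_prompt(prompt, dim=4):
    """Placeholder prompt encoder: hashes the text into a fixed-size vector."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def encode_reference(reference_speech, dim=4):
    """Placeholder reference encoder: averages raw samples chunk by chunk."""
    step = max(1, len(reference_speech) // dim)
    chunks = [reference_speech[i * step:(i + 1) * step] for i in range(dim)]
    return [sum(c) / len(c) if c else 0.0 for c in chunks]


def variation_network(prompt_repr, seed):
    """Placeholder variation network: perturbs the prompt representation."""
    rng = random.Random(seed)
    return [p + rng.gauss(0.0, 0.1) for p in prompt_repr]


def style_representations(prompt, reference_speech=None, seed=0):
    """Return (prompt_repr, reference_repr) for training or inference."""
    prompt_repr = encode_prompt(prompt)
    if reference_speech is not None:
        # Training path: the reference representation comes from real speech.
        reference_repr = encode_reference(reference_speech)
    else:
        # Inference path: no reference speech, so the variation network
        # supplies the reference representation from the prompt alone.
        reference_repr = variation_network(prompt_repr, seed)
    return prompt_repr, reference_repr


def tts_module(text, prompt_repr, reference_repr):
    """Placeholder TTS module: bundles the text with its style conditioning."""
    return {"text": text, "style": prompt_repr + reference_repr}


prompt_repr, reference_repr = style_representations("a calm, slow voice", seed=1)
utterance = tts_module("Hello there.", prompt_repr, reference_repr)
```

Changing `seed` while keeping the prompt fixed changes only the reference representation, which is how the same description can yield different voices.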
The Role of LLM in Generating Prompts
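The large language model is central to prompt generation, but it does not analyze the audio itself. A speech language understanding model first recognizes attributes such as gender and speed from the speech, and the LLM then turns those recognition results into descriptive sentences.
This yields a wide range of text prompts without manual labeling, which in turn supports training on more varied voice descriptions.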
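As the abstract describes, the pipeline has two stages: attribute recognition from speech, then prompt writing by an LLM. The sketch below shows that hand-off under loud assumptions: `recognize_attributes` replaces the speech language understanding model with arbitrary thresholds on made-up features, `fake_llm` replaces a real LLM call with string formatting, and the instruction wording is invented for illustration.

```python
def recognize_attributes(features):
    """Stand-in for the speech understanding model (thresholds are arbitrary)."""
    gender = "female" if features["female_probability"] >= 0.5 else "male"
    words_per_second = features["words_per_second"]
    if words_per_second < 2.0:
        speed = "slow"
    elif words_per_second > 3.5:
        speed = "fast"
    else:
        speed = "normal"
    return {"gender": gender, "speed": speed}


def build_llm_instruction(attributes):
    """Turn recognized attributes into an instruction for the LLM."""
    listed = ", ".join(f"{key}: {value}" for key, value in sorted(attributes.items()))
    return f"Write one sentence describing a voice with these attributes: {listed}."


def fake_llm(instruction):
    """Placeholder for a real LLM call; it only reformats the attribute list."""
    listed = instruction.split("attributes: ", 1)[1].rstrip(".")
    return f"Please speak in a voice with {listed}."


def generate_text_prompt(features, llm=fake_llm):
    """Run both stages: recognize attributes, then ask the LLM for a prompt."""
    attributes = recognize_attributes(features)
    return llm(build_llm_instruction(attributes))


text_prompt = generate_text_prompt(
    {"female_probability": 0.9, "words_per_second": 4.0}
)
```

Passing the LLM in as a callable keeps the recognition stage testable on its own and lets a real model be swapped in without touching the rest of the pipeline.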
Achievements and Future Possibilities
PromptTTS 2 marks a significant advancement by:
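- Introducing a diffusion-based variation network to model the voice variability that text prompts leave out.
- Creating an automated text prompt generation pipeline, reducing reliance on manually written prompts.
- Generating voices that are more consistent with text prompts than previous work, while supporting sampling of diverse voices, in experiments on a 44K-hour speech dataset.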
Looking ahead, there's potential to extract more attributes and apply this technique to other voice generation areas, opening up exciting opportunities for AI developers.
Conclusion
PromptTTS 2 represents a major step forward in solving voice variability challenges in TTS systems. By combining a variation network with LLM-based prompt generation, it provides a flexible, high-quality approach to voice generation.
This innovation reduces the need for extensive manual work and paves the way for creating more natural and diverse speech outputs in TTS systems.
Further Reading
For those interested in exploring more, reviewing literature on TTS advancements, neural networks, and machine learning applications in speech synthesis is advisable.
This will deepen your understanding of the technologies and innovations driving the future of TTS systems.