PromptTTS 2: Describing and Generating Voices with Text Prompt

Athina AI

12 Oct 2023

Original Paper: https://arxiv.org/abs/2309.02285

By: Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li, Sheng Zhao, Tao Qin, Jiang Bian

Abstract:

Speech conveys more information than text, as the same word can be uttered in various voices to convey diverse information.

Compared to traditional text-to-speech (TTS) methods relying on speech prompts (reference speech) for voice variability, using text prompts (descriptions) is more user-friendly since speech prompts can be hard to find or may not exist at all.

TTS approaches based on the text prompt face two main challenges:

1) the one-to-many problem, where not all details about voice variability can be described in the text prompt, and

2) the limited availability of text prompt datasets, where vendors and large cost of data labeling are required to write text prompts for speech.

In this work, we introduce PromptTTS 2 to address these challenges with a variation network to provide variability information of voice not captured by text prompts, and a prompt generation pipeline to utilize the large language models (LLM) to compose high quality text prompts.

Specifically, the variation network predicts the representation extracted from the reference speech (which contains full information about voice variability) based on the text prompt representation.

For the prompt generation pipeline, it generates text prompts for speech with a speech language understanding model to recognize voice attributes (e.g., gender, speed) from speech and a large language model to formulate text prompts based on the recognition results.

Experiments on a large-scale (44K hours) speech dataset demonstrate that compared to the previous works, PromptTTS 2 generates voices more consistent with text prompts and supports the sampling of diverse voice variability, thereby offering users more choices on voice generation.

Additionally, the prompt generation pipeline produces high-quality text prompts, eliminating the large labeling cost. The demo page of PromptTTS 2 is available online.

Summary Notes

Advances in machine learning have made text-to-speech (TTS) systems sound far more natural.

Yet controlling which voice a system produces remains difficult. PromptTTS 2, from Microsoft Research, lets users describe the voice they want in a text prompt instead of supplying reference speech.

Key Challenges with Voice Variability in TTS

Creating varied voices in TTS systems faces two main obstacles:

One-to-Many Issue: A text prompt cannot describe every detail of a voice, so the same prompt is consistent with many different voices.
Data Gathering Difficulty: Text prompts that describe speech are scarce, and writing them by hand requires vendors and costly labeling.

Introducing PromptTTS 2

PromptTTS 2 brings new strategies to overcome these challenges:

Variation Network: Supplies the voice variability that a text prompt leaves out. It is a diffusion-based model trained to predict, from the text prompt representation, the representation extracted from reference speech. Sampling from it yields different voices that all fit the same prompt, which addresses the one-to-many issue.
Prompt Generation Pipeline: Generates text prompts automatically. A speech language understanding model recognizes voice attributes (e.g., gender, speed) from speech, and a large language model (LLM) turns the recognized attributes into text prompts. This reduces the need for manual prompt writing and makes text prompt datasets cheaper to build.

How PromptTTS 2 Works

PromptTTS 2's system architecture includes:

TTS Module: Responsible for synthesizing speech.
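Style Module: Extracts a prompt representation from the text prompt and a reference representation from the reference speech. Because the variation network predicts the reference representation from the prompt representation, a voice can be synthesized from a text prompt alone.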
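
The inference-time data flow can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: `encode_prompt`, `sample_reference`, and `synthesize` are hypothetical names, the representations are 4-dimensional lists, and the variation network is replaced by seeded Gaussian noise around the prompt representation rather than a diffusion model. It also assumes the TTS module is conditioned on both representations. The only point it makes is that one prompt can yield several sampled voices.

```python
import random
from typing import Dict, List

DIM = 4  # toy size; real representations would be much larger


def encode_prompt(prompt: str) -> List[float]:
    """Hypothetical stand-in for the style module's text prompt encoder."""
    rng = random.Random(prompt)  # a string seed gives a repeatable vector
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def sample_reference(prompt_repr: List[float], seed: int,
                     spread: float = 0.3) -> List[float]:
    """Toy stand-in for the variation network (NOT a diffusion model).

    Draws one candidate reference representation near the prompt
    representation; a different seed gives a different candidate.
    """
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, spread) for x in prompt_repr]


def synthesize(text: str, prompt_repr: List[float],
               ref_repr: List[float]) -> Dict[str, object]:
    """Placeholder for the TTS module: returns its inputs instead of audio."""
    return {"text": text, "prompt_repr": prompt_repr, "ref_repr": ref_repr}


if __name__ == "__main__":
    prompt_repr = encode_prompt("a low-pitched male voice speaking slowly")
    # Same text and prompt, three different sampled voices.
    for seed in (1, 2, 3):
        ref_repr = sample_reference(prompt_repr, seed)
        print(synthesize("Hello there.", prompt_repr, ref_repr)["ref_repr"])
```

Fixing the seed reproduces a voice, while changing it explores other voices that fit the same description.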

The Role of LLM in Generating Prompts

The LLM is the second of two stages in the prompt generation pipeline. A speech language understanding model first recognizes voice attributes such as gender and speed from speech; the LLM then writes descriptive sentences from the recognized attributes.

The result is a large set of high-quality text prompts without the cost of manual labeling.
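
A minimal sketch of this two-stage pipeline follows. Everything in it is a hypothetical stand-in: `recognize_attributes` returns fixed labels instead of running a speech language understanding model, and `write_prompt` is a fixed template rather than a real LLM call. `build_llm_request` only illustrates the kind of instruction an LLM might be given; the wording is an assumption, not taken from the paper.

```python
from typing import Dict


def recognize_attributes(audio_path: str) -> Dict[str, str]:
    """Hypothetical stand-in for the SLU model.

    Returns fixed labels instead of analysing the audio at audio_path.
    """
    return {"gender": "female", "speed": "fast"}


def build_llm_request(attributes: Dict[str, str]) -> str:
    """Format recognized attributes as an instruction an LLM could be given."""
    listed = ", ".join(
        f"{name}={value}" for name, value in sorted(attributes.items())
    )
    return f"Write one sentence describing a voice with these attributes: {listed}"


def write_prompt(attributes: Dict[str, str]) -> str:
    """Template stand-in for the LLM stage (assumption: no real LLM is called)."""
    return f"A {attributes['gender']} voice speaking at a {attributes['speed']} pace."


def generate_text_prompt(audio_path: str) -> str:
    """Run both stages: recognize attributes, then phrase them as a text prompt."""
    return write_prompt(recognize_attributes(audio_path))


if __name__ == "__main__":
    print(build_llm_request(recognize_attributes("clip.wav")))
    print(generate_text_prompt("clip.wav"))
```

Keeping recognition and writing as separate functions mirrors the split described above: the attribute labels are the only thing passed between the two stages.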

Achievements and Future Possibilities

PromptTTS 2 marks a significant advancement by:

Introducing a Diffusion-Based Variation Network to manage voice variability.
Creating an Automated Text Prompt Generation system, reducing reliance on manual prompts.
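Showing, on a large-scale (44K hours) speech dataset, that generated voices are more consistent with text prompts than in previous work and that diverse voices can be sampled for the same prompt.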

Looking ahead, the pipeline could be extended to recognize more voice attributes, and the technique could be applied to other voice generation tasks.

Conclusion

PromptTTS 2 addresses two obstacles to text-prompt-based TTS: voice variability that a prompt cannot express, and the scarcity of text prompt data. Combining a variation network with LLM-based prompt generation gives users a flexible way to generate voices from descriptions.

It also reduces manual labeling work and supports more diverse speech output from TTS systems.
