With the advent of sophisticated natural language processing, text-to-speech (TTS) systems, software designed to verbalize text, have become increasingly capable. Take Google's Tacotron 2, for example, which can build voice models from spectrograms alone.
One downside to these "neural TTS" approaches is that they require more data than conventional methods, but that might not be the case for long. In a new study penned by scientists at Amazon's Alexa division, an AI TTS system trained on voice data from multiple speakers yielded more natural-sounding speech than a single-speaker model trained on a larger number of samples. Furthermore, the team found the former model to be more "stable" overall: it dropped fewer words, "mumbled" less frequently, and avoided repeating single sounds in rapid succession.
The research is scheduled to be presented at the International Conference on Acoustics, Speech, and Signal Processing in Brighton next month.
"[R]ecent [research] suggests that training NTTS systems on examples from several different speakers yields better results with less data," wrote Alexa Speech applied scientist Jakub Lachowicz in a blog post. "[We] present what we believe is the first systematic study of the advantages of training NTTS systems on data from multiple speakers."
As Lachowicz explains, neural TTS models typically consist of two components: one that converts text into mel-spectrograms (50-millisecond snapshots of specific frequency bands) and a second network, a vocoder, that converts the mel-spectrograms into finer-grained audio signals. Lachowicz and colleagues trained one of these systems on data from seven different speakers, using a one-hot vector (a string of 0s with a single "1" among them) to associate individual samples with speakers.
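The paper doesn't publish code, but the one-hot conditioning scheme it describes is straightforward. As a minimal sketch (using NumPy, with hypothetical feature dimensions and function names), each training sample's speaker identity can be encoded as a one-hot vector and appended to every frame of the text encoder's output, so the acoustic model knows which of the seven voices to produce:

```python
import numpy as np

NUM_SPEAKERS = 7  # the study trained on data from seven speakers

def speaker_one_hot(speaker_id: int, num_speakers: int = NUM_SPEAKERS) -> np.ndarray:
    """Return a string of 0s with a single 1 at position speaker_id."""
    vec = np.zeros(num_speakers, dtype=np.float32)
    vec[speaker_id] = 1.0
    return vec

def condition_on_speaker(encoder_out: np.ndarray, speaker_id: int) -> np.ndarray:
    """Tile the speaker's one-hot vector across time and concatenate it
    with the text encoder's per-frame features, so every frame carries
    the speaker identity into the spectrogram predictor."""
    frames = encoder_out.shape[0]
    one_hot = np.tile(speaker_one_hot(speaker_id), (frames, 1))
    return np.concatenate([encoder_out, one_hot], axis=1)

# Hypothetical shapes: 120 encoder frames, 256-dim text features.
encoder_out = np.random.rand(120, 256).astype(np.float32)
conditioned = condition_on_speaker(encoder_out, speaker_id=3)
print(conditioned.shape)  # (120, 263): 256 text dims + 7 speaker dims
```

In a real system the one-hot vector is usually projected through a learned speaker embedding before concatenation, but the one-hot input is what lets a single model share parameters across all speakers while still keeping their voices distinct.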
In experiments that tasked 70 human participants with listening to live recordings of a human speaker and synthetic speech modeled on the same speaker, the neural TTS model trained on multiple speakers fared just as well as the one trained on a single speaker. Perhaps more significantly, the scientists observed "no" statistical difference in "naturalness" between models trained on samples from speakers of different genders and models trained on samples from speakers of the same gender as the target speaker.
[Audio: speech samples generated by the single-gender model and the mixed-gender model]
Lachowicz notes that the multi-speaker model ingested over 5,000 training samples compared with the single-speaker model's 15,000, and that beyond 15,000 utterances, he expects single-speaker NTTS models to outperform multi-speaker models. He and the study's coauthors believe, though, that mixed models could make it easier for developers to get synthetic voices up and running.
"This opens the prospect that voice agents could offer a wide variety of customizable speaker styles, without requiring voice performers to spend days in the recording booth," he said.