Text to speech

A synthetic voice announcing an arriving train in Sweden.


Text to speech (TTS) is the use of software to turn written text into spoken audio. The program that other programs use to convert text on a page into speech is usually called a text-to-speech engine. Blind people, people with low vision, and people with reading disabilities can use good text-to-speech systems to listen to text. TTS engines are also needed to read machine translation results aloud.

Until about 2010, the analytic approach was common. It converts text to speech in multiple steps. Usually, the input text is first transformed into phonetic writing, which records how the words are pronounced rather than how they are spelled. Phonemes can be identified in the phonetic writing. The system then produces speech by joining prerecorded or synthesized diphones (the transitions between two neighboring sounds). One difficulty is making the flow of speech sound natural; linguists call this flow prosody.
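The steps of the analytic approach can be sketched in a few lines of Python. This is only a minimal illustration, not how any real engine is built: the tiny word list `LEXICON`, the ARPAbet-style phoneme symbols, the `_` marker for silence, and the `unit_bank` of recorded samples are all assumptions made for the example.

```python
# Toy pronunciation dictionary (hypothetical entries, not a real lexicon).
LEXICON = {
    "the": ["DH", "AH"],
    "train": ["T", "R", "EY", "N"],
    "arrives": ["AH", "R", "AY", "V", "Z"],
}


def text_to_phonemes(text):
    """Step 1: turn written words into a flat list of phoneme symbols."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        phonemes.extend(LEXICON[word])  # raises KeyError for unknown words
    return phonemes


def phonemes_to_diphones(phonemes):
    """Step 2: pair each sound with its neighbor; "_" stands for silence."""
    padded = ["_"] + phonemes + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def concatenate(diphones, unit_bank):
    """Step 3: join the stored audio samples of each diphone in order."""
    samples = []
    for name in diphones:
        samples.extend(unit_bank[name])
    return samples


phonemes = text_to_phonemes("The train.")
diphones = phonemes_to_diphones(phonemes)
print(diphones)
```

A real system would also have to handle unknown words, numbers, and abbreviations, and would adjust pitch and timing to get natural prosody. None of that is shown here.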

As of 2022, deep learning is used. To get a good result, neural networks are trained with many high-quality speech samples.

Historically, the first systems for speech synthesis used formants. Most industrial systems today use signal processing.

History

Reconstruction of Kempelen's speaking machine

Copy of the machine proposed by Helmholtz

Demonstration of the Voder, by Bell Labs

Long before electronic signal processing existed, people tried to build machines that could produce human speech. Gerbert of Aurillac (the future Pope Sylvester II) is said to have built a talking head in 1003 that could say "yes" and "no". Albertus Magnus (1198–1280) and Roger Bacon (1219–1294) are also said to have built such machines, but no records survive.

Christian Gottlieb Kratzenstein (1723–1795) built a machine in 1779 that could produce the long vowels (a, e, i, o, u). He used modified organ pipes, and the machine was called a speech organ. Wolfgang von Kempelen also developed a speaking machine. In 1791, he published "Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine" (Mechanism of human speech, with a description of his speaking machine). Like Kratzenstein's machine, it used a set of bellows to emulate the lungs. Unlike Kratzenstein, von Kempelen used a mechanism with a single pipe, which is closer to the human vocal tract. Besides vowels, it could produce plosives and some fricatives. A formable leather tube was attached to its "vocal cords". Von Kempelen wrote that one can:

"in einer Zeit von drei Wochen eine bewundernswerte Fertigkeit im Spielen erlangen, besonders wenn man sich auf die lateinische, französische oder italienische Sprache verlegt, denn die deutsche ist [aufgrund der häufigen Konsonantenbündel] um vieles schwerer." (gain admirable skill in playing within three weeks, especially if one concentrates on Latin, French, or Italian, because German is much more difficult [owing to its frequent consonant clusters].)

In 1837, Charles Wheatstone built a speaking machine following this principle; a copy can be found in the Deutsches Museum. In 1857, Joseph Faber built Euphonia, which also uses this principle.

At the end of the 19th century, the focus changed: people no longer wanted to build copies of the vocal system. Instead, they wanted to simulate the sounds it produces. Hermann von Helmholtz (1821–1895) used specially tuned tuning forks to create vowels. The resonant frequencies are called formants. Combining different formants to generate speech was the mainstream approach until the mid-1990s.
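The idea of building a vowel from a few resonant frequencies can be illustrated with a short Python sketch that adds pure tones, loosely in the spirit of the tuning-fork experiments. It is a crude simplification, not a real formant synthesizer, which would filter a voice-like source signal. The sample rate and the formant values (approximate textbook figures for an adult male voice) are assumptions made for the example.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this example)

# Approximate first three formant frequencies in Hz (illustrative values).
VOWEL_FORMANTS = {
    "a": (730, 1090, 2440),
    "i": (270, 2290, 3010),
    "u": (300, 870, 2240),
}


def vowel_tone(vowel, duration=0.5):
    """Return audio samples made by adding one sine wave per formant."""
    freqs = VOWEL_FORMANTS[vowel]
    n = int(SAMPLE_RATE * duration)
    samples = []
    for k in range(n):
        t = k / SAMPLE_RATE
        value = sum(math.sin(2 * math.pi * f * t) for f in freqs)
        samples.append(value / len(freqs))  # keep the result within [-1, 1]
    return samples


tone = vowel_tone("a")
print(len(tone))
```

The resulting list of samples could be written to an audio file to hear the tone. It will sound like a chord rather than a voice, which shows why later systems needed more than a few fixed frequencies.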

In the 1930s, Bell Labs developed the vocoder, a device that analyzed speech and synthesized it again. Building on this work, Homer Dudley developed the Voder, a machine that used a keyboard to synthesize speech, and presented it at the 1939 New York World's Fair. Its speech was reported to be understandable. The Voder used electronic oscillators to create formant frequencies.

In the late 1950s, computers were first used to generate speech. In 1961, John Larry Kelly Jr., who worked at Bell Labs, used an IBM 704 to synthesize speech and made it sing the song Daisy Bell. The demonstration later inspired a scene in the movie 2001: A Space Odyssey. The first complete text-to-speech system using computers was completed in 1968.

Present

Synthesis example

The Chaos (short version) synthesized by VITS, a research deep-learning-based end-to-end text-to-speech method, using the LJ Speech dataset.


Until about 2000, electronic speech synthesis sounded robot-like and was at times hard to understand. Since then, it has become easier to understand: the focus shifted from generating the speech sounds themselves to analyzing the text and joining pre-recorded, high-quality samples.^[1]^[2]^[3]