Speech synthesis (text to speech)

All our services rely on server-based software that performs the speech synthesis, known as text-to-speech software. The voices we use come from different providers, but the techniques behind them have much in common. We would therefore like to give you a brief account of speech synthesis and its history.

Over the last few years the quality of synthetic speech has improved greatly. Many people think that synthetic speech sounds like the robots in old movies. In fact, some voices now sound almost like recorded speech, and as a result the user groups for our services have grown strongly in recent years.
When we invented the talking web in 2001, the target group was people with reading difficulties, but we now see that the user group is much broader.

The history of speech synthesis


What you may not know is that the first synthetic speech was produced as early as the late 18th century. The machine was built of wood and leather, and using it to generate audible speech was very complicated. It was constructed by Wolfgang von Kempelen and was of great importance to the early study of phonetics. The picture to the right shows the original construction as it can be seen at the Deutsches Museum (von Meisterwerken der Naturwissenschaft und Technik) in Munich, Germany.

Here's an audio sample of the synthetic speech the machine produced (WAV file, 776 kB).
(First a human says a sentence, and then the machine tries to say the same. The recording was made with a reconstruction of Kempelen's machine.)

In the early 20th century it became possible to use electricity to create synthetic speech. The first known electric speech synthesizer was the "Voder", which its creator Homer Dudley showed to a broader audience in 1939 at the World's Fair in New York.

Here's an audio sample of the Voder, the first electronic speech synthesizer ever (WAV file, 381 kB).

One of the pioneers of speech synthesis in Sweden was Gunnar Fant. During the 1950s he was responsible for the development of the first Swedish speech synthesizer, OVE (Orator Verbis Electris). At that time only Walter Lawrence's Parametric Artificial Talker (PAT) could compete with OVE in speech quality.

Here's a sample of OVE speech synthesis (WAV-file 77 kB).

And here's a sample of the PAT speech synthesis (WAV file, 117 kB).

OVE and PAT were text-to-speech systems using formant synthesis.

Speech synthesis becomes more human-like

The greatest improvements in naturalness have come during the last 10 years. The first voices we used for ReadSpeaker back in 2001 were produced using diphone synthesis. These voices are sampled from real recorded speech, which is split into diphones: short units that each span the transition from one phoneme (a basic sound of human speech) to the next. This was the first example of concatenative synthesis. However, diphone voices still have an artificial, synthetic sound. We still use them for some smaller languages, and they are widely used to speech-enable handheld computers and mobile phones because they consume little memory and CPU.

It wasn't until the introduction of a technique called unit selection that voices became very natural sounding. This is still concatenative synthesis, but the units used are larger, sometimes a complete sentence. We use different providers for different languages to ensure that we can always offer the best voices available for each language.

Here are the web addresses of some of the speech synthesis providers that we co-operate with in order to offer you services that speak as naturally as possible:


The techniques behind speech synthesis

Articulatory synthesis

In articulatory synthesis, models of the human articulators (tongue, lips, teeth, jaw) and the vocal folds are used to simulate how air flows through the vocal tract and to calculate the resulting sound. Finding good mathematical models is a great challenge, and articulatory synthesis is therefore still at the research stage. The technique is very computation-intensive, but its memory requirements are close to nothing.

Formant

Formant synthesis is a kind of source-filter method based on mathematical models of the human speech organs.
The vocal tract is modelled by a number of resonances that resemble the formants (frequency bands with high energy) of natural speech.
The first electronic voices, the Voder and later OVE and PAT, spoke with entirely synthetic, electronically produced sounds using formant synthesis. As with articulatory synthesis, memory consumption is small but CPU usage is large.
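
The source-filter idea can be illustrated with a minimal Python sketch. It is not the method used by the Voder, OVE, PAT or any provider: the source is a plain impulse train standing in for the vocal folds, and each formant is a standard second-order digital resonator. The function names, the sample rate and the formant values for an "a"-like vowel are assumptions chosen only for illustration.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter a signal through one second-order resonator (one formant)."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalises the gain to 1 at 0 Hz
    out = []
    y1 = y2 = 0.0  # the two previous output samples
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, sample_rate):
    """Source: one unit pulse per pitch period, zeros in between."""
    period = int(round(sample_rate / f0_hz))
    n = int(round(duration_s * sample_rate))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(formants, f0_hz=120.0, duration_s=0.3, sample_rate=16000):
    """Pass the source through a cascade of formant resonators."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]  # scale to the range -1..1


# Assumed (frequency, bandwidth) pairs in Hz, roughly an "a"-like vowel.
VOWEL_A = [(700.0, 80.0), (1200.0, 90.0), (2500.0, 120.0)]
samples = synthesize_vowel(VOWEL_A)
```

The resulting list of samples could be written to a WAV file with Python's standard wave module. Changing only the formant frequencies changes which vowel is heard, which is why such a model needs so little memory.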

Concatenative synthesis

Concatenative synthesis builds speech from recorded pieces of speech (sound clips) that are cut into units and joined together. Depending on the length of the sound clips used, the result is a diphone synthesis or a polyphonic synthesis. A more developed version of the latter is called unit selection synthesis, where the synthesizer has access to both long and short segments of speech and chooses the best segments for the context at hand.

Diphone

In diphone synthesis the units taken from the recorded speech are very small.
The strength of this approach is that almost any sentence or expression can be read. However, pronunciation errors are quite common, and if the prosody model is poor, or prosody is hard to model, the speech may sound somewhat monotonous.
Diphone synthesis does not work well for languages with many inconsistencies in their pronunciation rules (English, Swedish etc.), or in special cases where letters are pronounced differently from the general rule. It works better for languages with highly consistent pronunciation (Spanish, Finnish etc.). Another advantage is that the prosody, that is the intonation, can be described in great detail.
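
As a rough sketch of how diphone units are chained, the Python below turns a phoneme sequence into diphone names and joins the corresponding clips with a short linear crossfade. The naming scheme ("h-e"), the silence symbol "_" and both function names are hypothetical; a real system would also adjust pitch and duration, which is left out here.

```python
def to_diphones(phonemes, silence="_"):
    """Name the diphones needed for a phoneme sequence, padded with silence."""
    padded = [silence] + list(phonemes) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def crossfade_concat(clips, overlap):
    """Join sample lists, blending `overlap` samples at each boundary."""
    out = list(clips[0])
    for clip in clips[1:]:
        n = min(overlap, len(out), len(clip))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming clip
            out[start + i] = (1.0 - w) * out[start + i] + w * clip[i]
        out.extend(clip[n:])
    return out


names = to_diphones(["h", "e", "l", "o"])
# names == ["_-h", "h-e", "e-l", "l-o", "o-_"]

# Two dummy "recordings" stand in for stored diphone clips.
joined = crossfade_concat([[1.0] * 4, [0.0] * 4], overlap=2)
```

Because every sentence is assembled from the same small inventory of clips, the database stays small, at the price of many joins per word.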

Unit selection

The greatest difference between a unit selection voice and a diphone voice is the length of the speech segments used. The unit database stores entire words and phrases. This means that the database for a unit selection voice is many times bigger than for a diphone voice. Memory consumption is therefore huge, while CPU consumption is low.

The most important challenge is still to achieve natural and smooth prosody. This is hard because the units carry both intonation and pronunciation, since entire phrases are used almost directly from the recorded data. Since the first unit selection voice was released over eight years ago, each new voice and each new release has brought considerable improvements. This is by far the most widely used technique among our providers.
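
Choosing "the best segments for the context" is commonly framed as a search that balances how well each unit fits its target against how smoothly neighbouring units join. The Python below is a minimal sketch of that idea under simplifying assumptions: the Unit fields, the pitch-only join cost and the tiny database are invented for illustration and do not describe any provider's system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str
    label: str              # the word or phrase the recording contains
    start_pitch: float      # pitch in Hz at the start of the recording
    end_pitch: float        # pitch in Hz at the end of the recording
    context_penalty: float  # target cost: mismatch with the wanted context


def join_cost(left, right):
    """Assumed join cost: the pitch jump at the boundary."""
    return abs(left.end_pitch - right.start_pitch)


def select_units(targets, database):
    """Find the cheapest unit sequence with dynamic programming."""
    candidates = [[u for u in database if u.label == t] for t in targets]
    if any(not c for c in candidates):
        raise ValueError("no recorded unit for at least one target")
    # best[i][j] = (cheapest total cost ending in candidate j, backpointer)
    best = [[(u.context_penalty, None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + u.context_penalty, back))
        best.append(row)
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):  # follow the backpointers
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


DATABASE = [
    Unit("hello_a", "hello", 120.0, 110.0, 0.0),
    Unit("hello_b", "hello", 120.0, 150.0, 0.0),
    Unit("world_a", "world", 150.0, 100.0, 5.0),
    Unit("world_b", "world", 112.0, 100.0, 1.0),
]
path, total = select_units(["hello", "world"], DATABASE)
```

In this toy database the search picks hello_a followed by world_b: the pitch jump between them is only 2 Hz, which outweighs the perfect join of hello_b and world_a once the target costs are added.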

HMM synthesis

A fairly new technology is speech synthesis based on hidden Markov models (HMM), a mathematical concept. It is a statistical method in which the text-to-speech system is built on a model that is not known beforehand but is refined through continuous training. The technique consumes a lot of CPU resources but very little memory. This approach seems to give better prosody, without glitches, while still producing very natural-sounding, human-like speech. We collaborate with providers offering this technique as well.
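
The statistical idea can be hinted at with a deliberately simplified Python toy. It is not a real HMM synthesizer: it assumes the training frames are already aligned to named states, whereas an actual system has to estimate that hidden alignment during training and also models spectrum and dynamics. The state names, pitch values and function names are invented for illustration.

```python
from statistics import mean


def train_state_models(utterances):
    """Estimate a mean value and a mean duration (in frames) per state."""
    values, durations = {}, {}
    for utterance in utterances:
        for state, frames in utterance:
            values.setdefault(state, []).extend(frames)
            durations.setdefault(state, []).append(len(frames))
    return {s: (mean(values[s]), round(mean(durations[s]))) for s in values}


def generate(states, models):
    """Generate a parameter track by holding each state's mean value."""
    track = []
    for state in states:
        value, duration = models[state]
        track.extend([value] * duration)
    return track


# Assumed training data: pitch frames in Hz, already aligned to states.
TRAINING = [
    [("a", [100.0, 104.0]), ("b", [120.0, 122.0, 124.0, 126.0])],
    [("a", [98.0, 102.0, 96.0, 100.0]), ("b", [118.0, 128.0])],
]
models = train_state_models(TRAINING)
track = generate(["a", "b"], models)
```

Only a few numbers per state are stored instead of the recordings themselves, which matches the low memory use described above, and more training data simply refines those numbers.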

Customizations and improvements

On top of using the best voices available, we add our own layer of improvements, with both general and customer-specific customizations. Our linguists have long experience of speech synthesis and work with transcriptions to tweak the pronunciation and reading of the spoken text. We can therefore offer great help to all customers who want to put some effort into getting the best possible speech quality. Sometimes it is enough to spend a couple of hours on quality control, listening to your website and correcting the errors we find. Sometimes a site contains many brand names and specific words that it is very important to pronounce correctly.

One of the largest customizations we have made so far was for a customer who sent us a list of over 3000 words that had to be quality controlled. Another was for a site with about 200,000 pages, where the same acronym or abbreviation had to be expanded differently depending on the part of the site in which it appeared. Many users wonder why the same voice reads so much better in our services than when the same voice, or text-to-speech system, reads similar or identical content in other software or services. The answer lies in the customizations described above.
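
A section-dependent expansion of this kind could be sketched in Python as follows. This is only an illustration of the principle, not a description of our actual tooling: the rule format, the example paths and the "St" abbreviation are all hypothetical. Each rule applies to pages whose path starts with a given prefix, and the most specific matching prefix wins.

```python
import re

# Hypothetical rules: (path prefix, abbreviation, spoken form).
RULES = [
    ("/", "St", "Saint"),
    ("/maps/", "St", "Street"),
]


def expand(text, page_path, rules=RULES):
    """Expand abbreviations using the most specific rule for this page."""
    chosen = {}  # abbreviation -> (prefix, spoken form)
    for prefix, term, spoken in rules:
        if page_path.startswith(prefix):
            current = chosen.get(term)
            if current is None or len(prefix) > len(current[0]):
                chosen[term] = (prefix, spoken)
    if not chosen:
        return text
    # Match whole words only, trying longer abbreviations first.
    terms = sorted(chosen, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda m: chosen[m.group(1)][1], text)


church = expand("St Mary's Church", "/history/churches.html")
route = expand("Turn left on Main St", "/maps/route.html")
```

With these example rules the first text is read as "Saint Mary's Church" and the second as "Turn left on Main Street", even though both contain the same abbreviation.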

 

Thanks to Professor Hartmut Traunmüller, Dept. of Linguistics at the University of Stockholm for a lot of the facts, the picture and the sound samples on this page.