Speech synthesis is the artificial creation of human speech through a computer or other
device. It is the counterpart of voice recognition technology and can be implemented
in software or hardware products. Speech synthesis is commonly used to convert text-
based information into audio, and it appears in applications such as voice-enabled services
and some mobile apps. A device or software that is used for speech synthesis is known
as a speech synthesizer.
Speech synthesis is used in assistive technology to help individuals who are visually
impaired read text content. It is also used in entertainment productions such as games,
videos, and animations. When combined with speech recognition, speech synthesis
allows for interaction with mobile devices. Speech synthesis has been incorporated in
many computer operating systems as far back as the early 1990s. It is usually generated
by concatenating pieces of recorded speech, which are stored in a database. One common
type of speech synthesis system is the text-to-speech (TTS) system, which converts
natural-language text into speech. Advanced text-to-speech systems mimic human speech
patterns to produce natural-sounding voices.
Other speech synthesis systems convert phonetic transcriptions and other symbolic
linguistic representations into speech. The systems vary based on the size of the speech
units they store. In some specific usage domains, systems can store whole words or
sentences and render high-quality output. The quality of a speech synthesizer is
measured by how similar the output is to a human voice and its ability to be understood
easily. A text-to-speech system has two parts: the front-end and the back-end.
The front-end performs two major functions. The first function is to convert raw text
with symbols such as numbers and abbreviations into what is similar to written-out
words. This is known as text normalization or pre-processing. After this, the front-end
attaches a phonetic transcription to each word, a process known as text-to-phoneme or
grapheme-to-phoneme conversion, and divides and marks the text into prosodic units such as
sentences, clauses, and phrases. Together, these two functions produce the symbolic linguistic representation that the front-end supplies.
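The two front-end functions can be sketched in a few lines of Python. This is a minimal, illustrative sketch only: the abbreviation table, digit names, and the tiny phoneme lexicon below are hypothetical stand-ins for the large lexicons and rule sets real systems use.

```python
# Minimal sketch of a TTS front-end: text normalization expands digits and
# known abbreviations into written-out words, then a toy lexicon maps each
# word to a phoneme string (grapheme-to-phoneme conversion).

ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}          # hypothetical entries
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
LEXICON = {"doctor": "D AA K T ER", "smith": "S M IH TH"}   # toy G2P lexicon

def normalize(text):
    """Text normalization: expand abbreviations, spell out digits."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isdigit():
            words.extend(DIGITS[int(d)] for d in token)
        else:
            words.append(token.strip(".,!?"))
    return words

def to_phonemes(words):
    """Grapheme-to-phoneme: look up each word's phonetic transcription."""
    return [LEXICON.get(w, "<unk>") for w in words]

words = normalize("Dr. Smith")        # -> ["doctor", "smith"]
phonemes = to_phonemes(words)         # -> ["D AA K T ER", "S M IH TH"]
```

The phoneme list produced here is the symbolic linguistic representation that a back-end would turn into sound.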
The job of the back-end is to convert the output of the front-end, the symbolic
linguistic representation, into sound.
The two major technologies used in generating synthetic speech waveforms are formant
synthesis and concatenative synthesis. Each technology is used based on the intended
use of the speech synthesis system. Formant synthesis creates a speech output through
additive synthesis and an acoustic model. The speech generated through formant
synthesis sounds artificial and robotic. The main goal of formant synthesis is not always
to sound natural but to be reliably intelligible. It is commonly used in screen readers.
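The additive-synthesis idea behind formant synthesis can be illustrated by summing sinusoids at a vowel's formant frequencies. The formant values below are rough textbook figures for the vowel /a/, not a calibrated acoustic model, and real formant synthesizers use resonant filters driven by a source signal rather than bare sine waves.

```python
import math

SAMPLE_RATE = 16000
# Approximate formants for /a/: (frequency in Hz, relative amplitude).
FORMANTS = [(730, 1.0), (1090, 0.5), (2440, 0.25)]

def synthesize_vowel(duration=0.5):
    """Crude additive synthesis: sum sinusoids at the formant frequencies."""
    n = int(SAMPLE_RATE * duration)
    total_amp = sum(a for _, a in FORMANTS)
    samples = []
    for i in range(n):
        t = i / SAMPLE_RATE
        s = sum(a * math.sin(2 * math.pi * f * t) for f, a in FORMANTS)
        samples.append(s / total_amp)   # normalize into [-1, 1]
    return samples

wave = synthesize_vowel()               # 0.5 s of a buzzy, robotic /a/
```

The steady, unchanging spectrum is exactly why such output sounds artificial and robotic, yet remains highly intelligible.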
Concatenative synthesis produces natural-sounding speech by stringing together
segments of recorded speech. Other technologies used in generating synthetic speech
include articulatory synthesis, HMM-based synthesis, sinewave synthesis, and deep
learning-based synthesis.
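The stringing-together step of concatenative synthesis can be sketched as follows. The "recorded" units here are synthetic stand-ins for waveforms retrieved from a unit database, and the short linear crossfade at each join is one simple way to soften audible discontinuities; production systems use more sophisticated smoothing.

```python
# Hedged sketch of concatenative synthesis: stored unit waveforms are
# concatenated, with a short linear crossfade at each join.

UNIT_DB = {                 # stand-in for a database of recorded speech units
    "hel": [0.1] * 400,
    "lo":  [0.2] * 400,
}

def concatenate(unit_names, fade=50):
    """Join stored units, crossfading `fade` samples at each boundary."""
    out = []
    for name in unit_names:
        seg = UNIT_DB[name][:]
        if out:
            # Blend the tail of the output with the head of the new segment.
            for i in range(fade):
                w = i / fade
                out[-fade + i] = out[-fade + i] * (1 - w) + seg[i] * w
            seg = seg[fade:]
        out.extend(seg)
    return out

speech = concatenate(["hel", "lo"])   # 400 + 400 - 50 = 750 samples
```

Because each unit is genuine recorded speech, the output inherits natural timbre; the engineering effort goes into selecting units and smoothing the joins.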