Synthesized speech (text-to-speech or TTS) is useful as a placeholder during application development, or when
the data to be spoken is “unbounded” (not known in advance), which
makes it impossible to prerecord.
When deploying your applications, however, you should plan to use professionally
recorded prompts whenever possible. Users expect commercial systems to use
high-quality recorded speech, and only recorded speech can guarantee highly
natural pronunciation and prosody.
Creating recorded prompts
The following guidelines will help you generate high-quality recorded prompts:
- Use professional voice talent, quality recording equipment, and a suitable
recording environment.
- Maintain consistency in microphone placement and recording area.
- If prompts contain long numbers, or if many of the application's
users are not native speakers of the language in which the application speaks,
consider slowing down the speech or exaggerating natural pauses.
- If you plan to disable barge-in, aggressively trim recorded
prompts to remove the beginning and ending silences. If you plan to
enable barge-in, or if another speech segment will follow immediately, trim
the beginning aggressively, but leave silence at the end that is appropriate
for the ending punctuation (500 ms for final punctuation, 250 ms for non-final
punctuation). In all other cases, leave as little silence as possible.
- As a general rule, use only one voice. When using multiple voices, have
a clear design goal (for example, a female voice for introduction and prompts,
and a male voice for menu choices). For a consistent sound, you should record
your own messages to handle the <error>, <cancel> and <help> events.
- If a voice segment will appear in phrases with different intonations,
be sure to record that segment for each intonation. For example, suppose
the system will seek confirmation of a telephone number using the phrase “Was
that four three three <pause> five five six three?” The “three”
that appears before the pause should have a slightly falling pitch, but the “three”
that appears before the question mark should have a rising pitch. The “three”
that appears between two other numbers should have a steady pitch. This suggests
that it will be necessary to obtain one recording for each of the three intonations,
to obtain the highest-quality speech output. Note that the development effort
required for this might not be appropriate for every application.
- Be aware of the appropriate stress to use in each segment that you plan
to record. If the appropriate stress point is not the last open-class item
(typically a noun or a verb) in the sentence, make a note of where the
speaker should place the stress.
- If you are recording segments that the application will play sequentially
(in other words, will splice), be sure to choose the splice points carefully.
If possible, choose splice points at natural pause points. Avoid splice points
that separate articles such as "a," "an," and "the" from the following word
(or any other combination that speakers normally run together).
- If you intend to translate the application into other languages, plan
ahead when defining the audio segments to record. You might need to seek assistance
from a native speaker of the target language. In general, avoid recording
isolated nouns as separate audio segments, because in many languages
the correct form of a determiner (in English, “a”, “an” and “the”)
depends on the noun that follows. Other contextual dependencies might also
be important in the target language; known issues include gender agreement,
the ordering of recorded segments, and plurality.
Good planning early in the definition of audio segments can prevent
unnecessary rework during translation.
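The intonation guideline above can be sketched as a small lookup that picks one of three recordings for each digit. This is a minimal illustration, not part of any product API: the function name digit_recordings and the file-naming scheme (for example, 3_rising.wav) are hypothetical, and the sketch assumes that the first digit of a group takes the steady recording.

```python
def digit_recordings(groups):
    """Pick a recording for each digit of a number read back as a question.

    groups: digit strings, one per spoken group, e.g. ["433", "5563"].
    A pause is assumed between groups; the last group ends the question.
    File names are hypothetical.
    """
    plan = []
    for g_index, group in enumerate(groups):
        last_group = g_index == len(groups) - 1
        for d_index, digit in enumerate(group):
            last_digit = d_index == len(group) - 1
            if last_digit and last_group:
                intonation = "rising"   # digit before the question mark
            elif last_digit:
                intonation = "falling"  # digit before a pause
            else:
                intonation = "steady"   # digit followed by another digit
            plan.append(f"{digit}_{intonation}.wav")
    return plan


# "Was that four three three <pause> five five six three?"
print(digit_recordings(["433", "5563"]))
```

With this scheme, ten digits in three intonations require thirty recordings, which is the development cost that the guideline warns about.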
When using recorded prompts, you can improve system performance by prefetching
and caching the audio files. See Fetching and caching resources for improved performance.
For DTMF prompts (for example, “For checking,
press 1. For savings, press 2. To transfer to an agent, press 3.”), use
the following timing guidelines:
- Use a 500 ms pause between items.
- Use a 250 ms pause before “press”.
- Use no detectable pause after “for”, “to” or “press”.
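For a TTS placeholder during development, the DTMF timing guidelines above can be approximated with SSML <break> elements. The following is a minimal sketch; the function name dtmf_menu and its input format are assumptions made for illustration, and a TTS engine may add its own pauses at punctuation on top of the explicit breaks.

```python
from xml.sax.saxutils import escape


def dtmf_menu(choices):
    """Build a DTMF menu prompt.

    choices: (lead phrase, key) pairs, e.g. ("For checking", "1").
    """
    items = []
    for lead, key in choices:
        # 250 ms pause before "press"; nothing after "For", "To" or "press".
        items.append(
            f'{escape(lead)},<break time="250ms"/> press {escape(str(key))}.'
        )
    # 500 ms pause between items.
    return "<prompt>" + '<break time="500ms"/> '.join(items) + "</prompt>"


print(dtmf_menu([
    ("For checking", "1"),
    ("For savings", "2"),
    ("To transfer to an agent", "3"),
]))
```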
For speech prompts (for example, “Select checking,
savings or transfer to agent.” or “To work with your checking account
say checking.”), use the following timing guidelines:
- Use a 750 ms pause between items when there are more than three items. When
there are only two or three items, do not introduce any exaggerated pauses.
Speak the phrase as a normal sentence.
- Use a 250 ms pause before “say” or “select”.
- Use no detectable pause after “say” or “select”.
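The conditional rule for speech menus (explicit pauses only when there are more than three items) can be sketched in the same way. The function name speech_menu and the optional intro parameter are hypothetical; the sketch inserts the 250 ms pause before “Select” only when an introductory phrase precedes it.

```python
from xml.sax.saxutils import escape


def speech_menu(items, intro=""):
    """Build a 'Select a, b or c.' prompt with guideline pauses."""
    if len(items) < 2:
        raise ValueError("a menu needs at least two items")
    words = [escape(item) for item in items]
    # More than three items: 750 ms between items.
    # Two or three items: speak the phrase as a normal sentence.
    gap = '<break time="750ms"/>' if len(words) > 3 else ""
    listed = f",{gap} ".join(words[:-1]) + f"{gap} or {words[-1]}"
    # 250 ms pause before "Select" when something precedes it.
    lead = f'{escape(intro)}<break time="250ms"/> ' if intro else ""
    return f"<prompt>{lead}Select {listed}.</prompt>"


print(speech_menu(["checking", "savings", "transfer to agent"]))
print(speech_menu(["checking", "savings", "loans", "transfer to agent"]))
```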
For a mixture of both DTMF and speech prompts, use
the following timing guidelines:
- Use a 300 to 500 ms pause after an informational message that precedes
the presentation of a menu.
- For longer messages, use 250 ms for a comma-type pause and 500 ms for
a period-type pause.
Using TTS prompts
Although recorded prompts are best for many applications, it is important
to keep in mind that it is easier to maintain and modify an application that
uses TTS prompts. For this reason, you should typically use TTS prompts during
development.
When you are ready to deploy your application, use recorded prompts when
possible. If part of a sentence requires production via TTS, it is generally
better to generate that entire sentence with TTS to avoid the jarring juxtaposition
of recorded and artificial speech. It is also possible to design sentences
to position the dynamic content at the end, and to play the dynamic content
following a short pause to separate the dynamic TTS content from the static
recorded content. For now, designers should be cautious in using this approach
because it isn't clear whether people would generally prefer hearing
all TTS or this type of combination of recorded and TTS output.
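The combined approach described above (static recorded content first, then a short pause, then dynamic TTS content at the end) might look like the following sketch. The function name, the file name, and the 250 ms default pause are assumptions; the source says only “a short pause”. The text inside the <audio> element is the fallback that a VoiceXML browser renders with TTS if the recording cannot be played, which also makes the same prompt usable as a placeholder during development.

```python
from xml.sax.saxutils import escape, quoteattr


def recorded_then_tts(audio_url, fallback_text, dynamic_text, pause_ms=250):
    """Recorded static phrase, a short pause, then dynamic TTS content."""
    return (
        "<prompt>"
        # Recorded content, with TTS fallback text inside the element.
        f"<audio src={quoteattr(audio_url)}>{escape(fallback_text)}</audio>"
        # Assumed pause length separating recorded and synthesized speech.
        f'<break time="{int(pause_ms)}ms"/>'
        # Dynamic content, spoken by the TTS engine.
        f"{escape(dynamic_text)}"
        "</prompt>"
    )


print(recorded_then_tts("next_title.wav", "The next movie is", "Tom & Jerry"))
```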
Handling unbounded data:
If the information that the application needs to speak is unbounded,
you will need to use TTS. Examples of unbounded information include:
- Telephone directories
- E-mail messages
- Frequently updated lists of employee or customer names, movie titles,
or other proper nouns
- Up-to-the-minute news stories
Improving TTS output:
You can improve the quality of synthesized speech output by using SSML to provide
additional information in the input text. For example:
- You can improve the TTS engine's processing of numerical constructs
by using the <say-as> element to specify the desired
pronunciation.
- You can improve the TTS engine's processing of uncommon names by
using the <phoneme> element.
- For synthesized speech, a speed of 150-180 words per minute is generally
appropriate for native speakers. You can use the <prosody> element to slow down the speed of TTS output on a prompt-by-prompt basis
for long numbers, or if many of the application's users are not native
speakers of the language in which the application speaks.
- You can further improve the prosody of the TTS output by using the <break> and <emphasis> elements.
- You can change the gender and age characteristics of the TTS voice by
using the <voice> element. As with recorded prompts,
however, it is generally a good idea to use a single voice throughout your
application unless there is a clear design goal that requires multiple voices.
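As an illustration of the elements listed above, the following sketch wraps text in <say-as>, <phoneme>, <prosody> and <break> markup. The helper names are hypothetical. The interpret-as value "telephone" is an assumption, because the set of supported values depends on the TTS engine, and the IPA transcription is one common pronunciation of the example name.

```python
from xml.sax.saxutils import escape, quoteattr


def say_as(text, interpret_as):
    """Tell the engine how to read a numerical construct."""
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(text)}</say-as>"


def phoneme(text, ipa):
    """Supply an explicit IPA pronunciation for an uncommon name."""
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(text)}</phoneme>'


def slow(inner_markup):
    """Slow down one prompt, for example for a long number."""
    return f'<prosody rate="slow">{inner_markup}</prosody>'


prompt = (
    "<prompt>"
    + phoneme("Siobhan", "ʃɪˈvɔːn")
    + ' can be reached at<break time="250ms"/> '
    + slow(say_as("4335563", "telephone"))  # "telephone" is an assumed value
    + ".</prompt>"
)
print(prompt)
```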