Selecting recorded prompts or synthesized speech

Synthesized speech (text-to-speech, or TTS) is useful as a placeholder during application development, or when the data to be spoken is “unbounded” (not known in advance) and therefore impossible to prerecord.

When deploying your applications, however, you should plan to use professionally recorded prompts whenever possible. Users expect commercial systems to use high-quality recorded speech, and only recorded speech can guarantee highly natural pronunciation and prosody.

Creating recorded prompts

The following guidelines will help you generate high-quality recorded prompts:

Use professional voice talent, quality recording equipment, and a suitable recording environment.
Maintain consistency in microphone placement and recording area.
If prompts contain long numbers, or if many of the application's users are not native speakers of the language in which the application speaks, consider slowing down the speech or exaggerating natural pauses.
If you plan to disable barge-in, aggressively trim recorded prompts to remove the beginning and ending silences. If you plan to enable barge-in, or if another speech segment will follow immediately, trim the beginning aggressively but leave silence at the end that is appropriate for the ending punctuation (500 ms for final punctuation, 250 ms for non-final punctuation). Otherwise, leave as little silence as possible.
As a general rule, use only one voice. When using multiple voices, have a clear design goal (for example, a female voice for introduction and prompts, and a male voice for menu choices). For a consistent sound, you should record your own messages to handle the <error>, <cancel> and <help> events.
If a voice segment will appear in phrases with different intonations, be sure to record that segment for each intonation. For example, suppose the system will seek confirmation of a telephone number using the phrase “Was that four three three <pause> five five six three?” The “three” that appears before the pause should have a slightly falling pitch, but the “three” that appears before the question mark should have a rising pitch. The “three” that appears between two other numbers should have a steady pitch. To obtain the highest-quality speech output, you therefore need one recording of each digit for each of the three intonations. Note that the development effort required for this might not be appropriate for every application.
Be aware of the appropriate stress to use in each segment that you plan to record. If the appropriate stress point is not the last open-class item (which is either a noun or a verb) in the sentence, make a note of where the speaker should place the stress.
If you are recording segments that the application will play sequentially (that is, splice together), choose the splice points carefully. If possible, place splice points at natural pauses. Avoid splice points that separate articles such as “a”, “an” and “the” from the following word (or that split any other combination that speakers normally run together).
If you intend to translate the application into other languages, plan ahead when defining the audio segments to record. You might need to seek assistance from a native speaker of the target language. In general, avoid defining isolated nouns as audio segments, because in many languages the correct form of a determiner (in English, “a”, “an” and “the”) depends on the noun that follows it. Be aware that other contextual dependencies might also matter in the target language. Known issues include gender sensitivity, the ordering of recorded segments, and plurality. Good planning early in the definition of audio segments can prevent unnecessary rework during translation.
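
The advice above about recording your own messages for the error, cancel and help events can be sketched in VoiceXML 2.0 as follows. This is a minimal illustration, not a definitive implementation: the audio file names and the wording of the fallback text are assumptions, and the fallback text inside each <audio> element is spoken by TTS only if the recording cannot be played.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <!-- Document-level handlers that play the application's own recordings
       in place of the platform's default messages.
       All file names below are hypothetical. -->
  <help>
    <audio src="audio/help_main.wav">You can say checking or savings.</audio>
    <reprompt/>
  </help>

  <catch event="cancel">
    <audio src="audio/cancelled.wav">Cancelled.</audio>
  </catch>

  <error>
    <audio src="audio/error_general.wav">Sorry, there was a problem.</audio>
    <exit/>
  </error>

  <form id="main">
    <block>
      <audio src="audio/welcome.wav">Welcome.</audio>
    </block>
  </form>
</vxml>
```

Recording these messages with the same voice talent as the other prompts keeps the sound consistent even when the caller strays from the expected path.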

When using recorded prompts, you can improve system performance by prefetching and caching the audio files. See Fetching and caching resources for improved performance.
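
As a rough sketch of what prefetching and caching can look like in standard VoiceXML 2.0, the fragment below sets application-wide defaults with properties and overrides them on one prompt. The file names and the specific age values are assumptions chosen for illustration; your platform may provide additional or different controls.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <!-- Defaults for every audio fetch in this document. -->
  <property name="audiofetchhint" value="prefetch"/>
  <!-- Accept cached copies up to one day old (value is in seconds). -->
  <property name="audiomaxage" value="86400"/>

  <form id="greeting">
    <block>
      <prompt>
        <!-- Per-element override: this file changes more often, so accept
             a cached copy only if it is at most ten minutes old. -->
        <audio src="audio/daily_notice.wav" fetchhint="prefetch" maxage="600">
          There are no notices today.
        </audio>
      </prompt>
    </block>
  </form>
</vxml>
```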

For DTMF prompts (for example, “For checking, press 1. For savings, press 2. To transfer to an agent, press 3.”), use the following timing guidelines:

Use a 500 ms pause between items.
Use a 250 ms pause before “press”.
Use no detectable pause after “for”, “to” or “press”.
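
While recordings are not yet available, the same timings can be approximated in a TTS placeholder with the SSML <break> element. The fragment below is a sketch intended to sit inside a <menu> or <field>; the exact silence produced depends on the TTS engine, which may also insert its own pauses at punctuation.

```xml
<prompt>
  <!-- 250 ms before "press", 500 ms between items,
       and no break after "for", "to" or "press". -->
  For checking, <break time="250ms"/> press 1.
  <break time="500ms"/>
  For savings, <break time="250ms"/> press 2.
  <break time="500ms"/>
  To transfer to an agent, <break time="250ms"/> press 3.
</prompt>
```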

For speech prompts (for example, “Select checking, savings or transfer to agent.” or “To work with your checking account say checking.”), use the following timing guidelines:

Use a 750 ms pause between items when there are more than 3 items. When there are only two or three items, do not introduce any exaggerated pauses. Speak the phrase as a normal sentence.
Use a 250 ms pause before “say” or “select”.
Use no detectable pause after “say” or “select”.
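
For a menu with more than three items, a TTS placeholder might apply these timings as shown below. The four menu items are hypothetical, and the fragment assumes it is placed inside a <menu> or <field> whose grammar accepts them.

```xml
<prompt>
  <!-- Four items, so use 750 ms between items and 250 ms before "say". -->
  To work with your checking account, <break time="250ms"/> say checking.
  <break time="750ms"/>
  To work with your savings account, <break time="250ms"/> say savings.
  <break time="750ms"/>
  To work with your credit card, <break time="250ms"/> say credit card.
  <break time="750ms"/>
  To speak with an agent, <break time="250ms"/> say agent.
</prompt>
```

With only two or three items, the explicit breaks between items would be dropped and the prompt written as one ordinary sentence.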

For prompts that mix DTMF and speech, use the following timing guidelines:

Use a 300 to 500 ms pause after an informational message that precedes the presentation of a menu.
For longer messages, use 250 ms for a comma-type pause and 500 ms for a period-type pause.

Using TTS prompts

Although recorded prompts are best for many applications, it is important to keep in mind that it is easier to maintain and modify an application that uses TTS prompts. For this reason, you should typically use TTS prompts during development.

When you are ready to deploy your application, use recorded prompts when possible. If part of a sentence requires production via TTS, it is generally better to generate that entire sentence with TTS to avoid the jarring juxtaposition of recorded and artificial speech. It is also possible to design sentences to position the dynamic content at the end, and to play the dynamic content following a short pause to separate the dynamic TTS content from the static recorded content. For now, designers should be cautious in using this approach because it isn't clear whether people would generally prefer hearing all TTS or this type of combination of recorded and TTS output.
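
The second approach, with the dynamic content placed at the end of the sentence after a short pause, could look like the following sketch. The file name, the variable and its value, and the 250 ms pause length are all assumptions; the source does not specify how long the pause should be.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="last_payment">
    <!-- Hypothetical value; a real application would obtain it at run time. -->
    <var name="payee" expr="'Acme Hardware'"/>
    <block>
      <prompt>
        <!-- Static part of the sentence: recorded audio with a TTS fallback. -->
        <audio src="audio/last_payment_to.wav">Your last payment was to</audio>
        <!-- Short pause separating recorded speech from synthesized speech. -->
        <break time="250ms"/>
        <!-- Dynamic part of the sentence: spoken by the TTS engine. -->
        <value expr="payee"/>
      </prompt>
    </block>
  </form>
</vxml>
```

Given the caution above, it is worth comparing this against a version in which the whole sentence is synthesized before settling on either design.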

Handling unbounded data:

If the information that the application needs to speak is unbounded, you will need to use TTS. Examples of unbounded information include:

Telephone directories
E-mail messages
Frequently updated lists of employee or customer names, movie titles, or other proper nouns
Up-to-the-minute news stories

Improving TTS output:

You can improve the quality of synthesized speech output by using SSML to provide additional information in the input text. For example:

You can improve the TTS engine's processing of numerical constructs by using the <say-as> element to specify the desired pronunciation.
You can improve the TTS engine's processing of uncommon names by using the <phoneme> element.
For synthesized speech, a speed of 150-180 words per minute is generally appropriate for native speakers. You can use the <prosody> element to slow down the speed of TTS output on a prompt-by-prompt basis for long numbers, or if many of the application's users are not native speakers of the language in which the application speaks.
You can further improve the prosody of the TTS output by using the <break> and <emphasis> elements.
You can change the gender and age characteristics of the TTS voice by using the <voice> element. As with recorded prompts, however, it is generally a good idea to use a single voice throughout your application unless there is a clear design goal that requires multiple voices.
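
The elements listed above can be combined in a single prompt, as in this sketch. Several details are assumptions: SSML 1.0 does not standardize the values of interpret-as, so “digits” may need to be replaced with a value your engine supports; support for the IPA alphabet in <phoneme> also varies by engine; and the name, its transcription and the number are illustrative only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="confirm">
    <block>
      <prompt>
        <!-- One voice for the whole prompt, in line with the single-voice advice. -->
        <voice gender="female">
          <!-- Uncommon name: supply the pronunciation explicitly. -->
          Thank you, <phoneme alphabet="ipa" ph="ʃɪˈvɔːn">Siobhan</phoneme>.
          <break time="250ms"/>
          Your confirmation number is
          <!-- Long number: read it digit by digit, at a slower rate. -->
          <prosody rate="slow">
            <say-as interpret-as="digits">4335563</say-as>.
          </prosody>
          <break time="500ms"/>
          Please <emphasis level="strong">write it down</emphasis> before you hang up.
        </voice>
      </prompt>
    </block>
  </form>
</vxml>
```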