Selecting recorded prompts or synthesized speech

Synthesized speech (text-to-speech or TTS) is useful as a placeholder during application development, or when the data to be spoken is “unbounded” (not known in advance), which makes it impossible to prerecord.

When deploying your applications, however, you should plan to use professionally recorded prompts whenever possible. Users expect commercial systems to use high-quality recorded speech, and only recorded speech can guarantee highly natural pronunciation and prosody.

Creating recorded prompts

The following guidelines will help you generate high-quality recorded prompts:

When using recorded prompts, you can improve system performance by prefetching and caching the audio files. See Fetching and caching resources for improved performance.

For DTMF prompts (for example, “For checking, press 1. For savings, press 2. To transfer to an agent, press 3.”) use the following timing guidelines:
  • Use a 500 ms pause between items.
  • Use a 250 ms pause before “press”.
  • Use no detectable pause after “for”, “to”, or “press”.
For speech prompts (for example, “Select checking, savings or transfer to agent.” or “To work with your checking account say checking.”) use the following timing guidelines:
  • Use a 750 ms pause between items when there are more than three items. When there are only two or three items, do not introduce any exaggerated pauses; speak the phrase as a normal sentence.
  • Use a 250 ms pause before “say” or “select”.
  • Use no detectable pause after “say” or “select”.
For a mixture of both DTMF and speech prompts use the following timing guidelines:
  • Use a 300 to 500 ms pause after an informational message that precedes the presentation of a menu.
  • For longer messages, use a 250 ms pause where a comma would appear and a 500 ms pause where a period would appear.
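
The timing guidelines above can be turned into a timing sheet for a recording session. The following Python sketch is illustrative only: the function names, the constant names, and the (text, pause) representation are assumptions made for this example and are not part of any voice platform API. Only the pause lengths come from the guidelines.

```python
from typing import List, Tuple

# Pause lengths in milliseconds, taken from the timing guidelines.
DTMF_PAUSE_BEFORE_PRESS_MS = 250
DTMF_PAUSE_BETWEEN_ITEMS_MS = 500
SPEECH_PAUSE_BETWEEN_ITEMS_MS = 750

# A segment is (text to speak, silence in ms that follows the text).
Segment = Tuple[str, int]


def dtmf_menu_script(items: List[Tuple[str, str]]) -> List[Segment]:
    """Build a timing sheet for a DTMF menu from (lead-in, key) pairs."""
    script: List[Segment] = []
    for index, (lead_in, key) in enumerate(items):
        is_last = index == len(items) - 1
        # 250 ms pause before "press"; nothing is inserted after "press".
        script.append((f"{lead_in},", DTMF_PAUSE_BEFORE_PRESS_MS))
        # 500 ms pause between items; no trailing pause after the last item.
        trailing = 0 if is_last else DTMF_PAUSE_BETWEEN_ITEMS_MS
        script.append((f"press {key}.", trailing))
    return script


def speech_menu_pause_ms(item_count: int) -> int:
    """Pause between spoken menu items; 0 means normal sentence rhythm."""
    return SPEECH_PAUSE_BETWEEN_ITEMS_MS if item_count > 3 else 0


menu = [
    ("For checking", "1"),
    ("For savings", "2"),
    ("To transfer to an agent", "3"),
]
for text, pause_ms in dtmf_menu_script(menu):
    print(f"{text:<28} then {pause_ms} ms of silence")
```

A sheet like this can be handed to the voice talent or used to check the silences in edited audio files.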

Using TTS prompts

Although recorded prompts are best for many applications, an application that uses TTS prompts is easier to maintain and modify. For this reason, you should typically use TTS prompts during development.

When you are ready to deploy your application, use recorded prompts when possible. If part of a sentence must be produced by TTS, it is generally better to generate the entire sentence with TTS to avoid the jarring juxtaposition of recorded and synthesized speech. An alternative is to design the sentence so that the dynamic content comes at the end, and to play it after a short pause that separates the dynamic TTS content from the static recorded content. Use this alternative with caution, however, because it is not yet clear whether users generally prefer hearing all TTS or this type of combination of recorded and TTS output.
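
The rule of not mixing recorded and synthesized speech within one sentence can be expressed as a simple check. The following Python sketch is illustrative only; the Part class and the choose_rendering function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Part:
    text: str
    dynamic: bool  # True when the text is known only at run time


def choose_rendering(sentence: List[Part]) -> str:
    """Return "recorded" only when every part of the sentence is static.

    If any part is dynamic, the whole sentence is rendered with TTS so that
    recorded and synthesized speech are not juxtaposed within it.
    """
    return "tts" if any(part.dynamic for part in sentence) else "recorded"


greeting = [Part("Welcome to the bank.", dynamic=False)]
balance = [
    Part("Your balance is", dynamic=False),
    Part("412 dollars", dynamic=True),
]
print(choose_rendering(greeting))  # all parts static
print(choose_rendering(balance))   # contains run-time data
```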

Handling unbounded data:

If the information that the application needs to speak is unbounded, you will need to use TTS. Examples of unbounded information include:

  • Telephone directories
  • E-mail messages
  • Frequently updated lists of employee or customer names, movie titles, or other proper nouns
  • Up-to-the-minute news stories

Improving TTS output:

You can improve the quality of synthesized speech output by using SSML to provide additional information in the input text. For example:

  • You can improve the TTS engine's processing of numerical constructs by using the <say-as> element to specify the desired pronunciation.
  • You can improve the TTS engine's processing of uncommon names by using the <phoneme> element.
  • For synthesized speech, a rate of 150 to 180 words per minute is generally appropriate for native speakers. You can use the <prosody> element to slow the TTS output on a prompt-by-prompt basis for long numbers, or if many of the application's users are not native speakers of the language in which the application speaks.
  • You can further improve the prosody of the TTS output by using the <break> and <emphasis> elements.
  • You can change the gender and age characteristics of the TTS voice by using the <voice> element. As with recorded prompts, however, it is generally a good idea to use a single voice throughout your application unless there is a clear design goal that requires multiple voices.
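
As a minimal sketch of how several of these elements combine, the following Python code assembles a short SSML prompt and checks that it is well-formed XML. The helper function names are hypothetical and exist only for this example. The interpret-as value "digits" is an assumption: the values that <say-as> accepts depend on the TTS engine, so check the documentation for your engine before relying on it.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def say_as(text: str, interpret_as: str) -> str:
    """Wrap text in <say-as>; supported interpret-as values are engine-specific."""
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(text)}</say-as>"


def pause(ms: int) -> str:
    """Insert a silence of the given length with <break>."""
    return f'<break time="{int(ms)}ms"/>'


def slow(inner: str, rate: str = "slow") -> str:
    """Slow down already-built SSML content with <prosody rate="...">."""
    return f"<prosody rate={quoteattr(rate)}>{inner}</prosody>"


def emphasize(text: str, level: str = "strong") -> str:
    """Stress a word or phrase with <emphasis>."""
    return f"<emphasis level={quoteattr(level)}>{escape(text)}</emphasis>"


def speak(inner: str, lang: str = "en-US") -> str:
    """Wrap content in the SSML root element."""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f"xml:lang={quoteattr(lang)}>{inner}</speak>"
    )


# A long number is read digit by digit, slowly, after a short pause.
ssml = speak(
    "Your confirmation number is"
    + pause(250)
    + slow(say_as("40213", "digits"))
    + ". Please "
    + emphasize("write it down")
    + "."
)

ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

Building the markup through small helpers keeps escaping in one place, which matters when the spoken text comes from unbounded data that might contain characters such as "&" or "<".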