Constructing appropriate menus and prompts

A characteristic of many voice applications is that they have little or no external documentation. Often, these applications must support both novice and expert users. Part of the challenge of designing a good voice application is providing just enough information at just the right time. In general, you don't want to force users to hear more than they need to hear, and you don't want to require them to say more than they need to say. Adhering to the following guidelines can help you achieve this:

Limiting menu length
Grouping menu items, prompts, and other information
Controlling prompt length
Avoiding DTMF-style prompts
Using the right words
Providing instructional information

Limiting menu length

When designing menus, follow the guidelines in Table 1.

Table 1. Recommended maximum number of menu items
Application	Maximum number of menu items
Barge-in enabled	12
Barge-in disabled	5
Any application in which the menu items are long phrases	3

For cases in which you cannot stay within these limits, see Managing audio lists.
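
The limits in Table 1 can be captured in a small lookup. The following Python sketch is illustrative only: the function names `max_menu_items` and `within_limit` are hypothetical, and it assumes that the long-phrase limit applies regardless of the barge-in setting, as the wording "any application" in the table suggests.

```python
def max_menu_items(barge_in_enabled, long_phrases=False):
    """Return the recommended maximum number of menu items from Table 1."""
    if long_phrases:
        # Assumption: long-phrase menus are capped at 3 whatever the barge-in setting.
        return 3
    return 12 if barge_in_enabled else 5


def within_limit(items, barge_in_enabled, long_phrases=False):
    """Report whether a menu stays within the recommended maximum."""
    return len(items) <= max_menu_items(barge_in_enabled, long_phrases)
```

A design-time check such as `within_limit(menu, barge_in_enabled=False)` could flag menus that need to be split or handled as audio lists.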

A deep menu structure is one in which there are few choices available at any given level in the structure but there are many levels. A flat menu structure is one in which there are many choices available at any given level in the structure but there are few levels. In the flattest possible structure, there is only one level, which contains all the choices. A terminal node in a menu structure is a choice that does not lead to any additional sets of choices.

The conditions that favor deep menu structures are:

Barge-in disabled.
Terminal nodes have approximately equal frequency of selection.
Menu items require many words or long duration.
Menu items fall into easy-to-label, clearly defined categories.
There are dependencies among the menu options.

The conditions that favor the use of flat menu structures are:

Barge-in enabled.
Need to surface frequently used terminal nodes.
Menu items have few words per item or short duration.
Difficult to develop clear categories for menu items.

For most applications, the conditions will favor the use of flat rather than deep menu structures.

Older guidelines for DTMF user interfaces (for example, Marics & Engelbeck) strongly advised against exceeding four options per menu. This guideline is inappropriate for speech user interfaces (SUIs) because options in speech menus typically have far fewer words than DTMF options. Recent human factors research has also challenged the applicability of this guideline for DTMF menus.

Some practitioners have suggested 7 ±2 menu items for speech menus. This suggestion assumes that users are trying to memorize each option as they hear it, but task analysis of selection from auditory menus does not support this assumption. A user does not need to memorize all of the items in a speech menu; the user only needs to remember the one that is the current best match to the desired function. If the user hears an excellent match, then he or she can barge in to select it (self-terminating search). Otherwise, the user continues to listen to options until hearing a better match (discarding the old and remembering the new) or until there are no more options (exhaustive search).
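
The search process described above can be sketched as a short Python model. This is a simplification under stated assumptions: `match_score` is a hypothetical function that rates how well an option matches the user's goal on a scale from 0 to 1, and the `excellent` threshold of 0.9 is an arbitrary illustrative value, not a measured one.

```python
def select_from_menu(options, match_score, excellent=0.9):
    """Model a listener who remembers only the best match heard so far.

    Returns (choice, options_heard, search_type).
    """
    best = None
    best_score = float("-inf")
    for heard, option in enumerate(options, start=1):
        score = match_score(option)
        if score > best_score:
            # Discard the old candidate and remember the new one.
            best, best_score = option, score
        if score >= excellent:
            # Excellent match: the user barges in without hearing the rest.
            return best, heard, "self-terminating"
    # No excellent match: the user hears every option, then picks the best.
    return best, len(options), "exhaustive"
```

At no point does the model hold more than one option in memory, which is why the length of the menu does not translate directly into memory load.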

Grouping menu items, prompts, and other information

There are a number of strategies that you should consider when deciding how to group information.

Separating introductory/instructive text from prompt text:

When appropriate, you should consider separating any introductory or instructive text from the prompt text; this allows you to reprompt without repeating the introductory text. For example, you might create one audio file that says, “Welcome to the WebVoice demo” and a separate audio file that says, “Select Library, Banking, or Calendar.” The first time through the sequence, the application could play the files in succession. If the user returns to this main menu later in the application session, the application could play only the second audio file. For example, the first time through, the user hears:

System:

Welcome to the WebVoice demo. Select Library, Banking, or Calendar.

On the second and any subsequent times, the user hears only:

System:

Select Library, Banking, or Calendar.
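
A minimal Python sketch of this separation follows. The audio file names and the `MainMenu` class are hypothetical; the only point illustrated is that the introduction is queued on the first visit and omitted on later visits.

```python
from typing import List

INTRO_AUDIO = "welcome.wav"    # hypothetical file: "Welcome to the WebVoice demo."
MENU_AUDIO = "main_menu.wav"   # hypothetical file: "Select Library, Banking, or Calendar."


class MainMenu:
    """Tracks whether the caller has already heard the introduction."""

    def __init__(self) -> None:
        self.visited = False

    def audio_queue(self) -> List[str]:
        """Return the audio files to play for this visit to the main menu."""
        queue = [MENU_AUDIO] if self.visited else [INTRO_AUDIO, MENU_AUDIO]
        self.visited = True
        return queue
```

Calling `audio_queue()` twice on the same `MainMenu` instance yields both files the first time and only the menu file afterward.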

Separating text for each menu item:

If appropriate for your application design, you might even create separate audio files for each menu choice.

Ordering menu items:

When presenting menu items, consider putting the most common choices first so that most users don't have to listen to the remaining options. For example:

System:

Select List Specials, Place an Order, Check Order Status
or Get Mailing Address.

A possible exception to this guideline is when the most common choice is a more general case of another choice. In this example, “Other Loan” is presented last, regardless of its relative frequency of use:

System:

Loan type?
 <3 second pause>
Select Car, Personal or Other Loan.
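
The ordering rule and its exception can be sketched in Python as follows. The function name `order_menu`, the `usage_counts` mapping, and the sample counts in the usage note are all hypothetical; a real application would take usage data from its own logs.

```python
def order_menu(items, usage_counts, catch_all=None):
    """Order menu items by descending usage, keeping a general item last.

    items        -- menu item labels
    usage_counts -- dict mapping a label to how often it is selected
    catch_all    -- optional label of the general-case item (for example,
                    "Other Loan") that stays last regardless of frequency
    """
    specific = [item for item in items if item != catch_all]
    # sorted() is stable, so items with equal counts keep their original order.
    ordered = sorted(specific, key=lambda item: usage_counts.get(item, 0), reverse=True)
    if catch_all in items:
        ordered.append(catch_all)
    return ordered
```

For example, with invented counts of 50 for Car, 20 for Personal, and 70 for Other Loan, `order_menu(["Personal", "Other Loan", "Car"], counts, catch_all="Other Loan")` returns Car, Personal, Other Loan.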

Controlling prompt length

Applications are generally most usable when system prompts are as short as possible (minimizing users' need to interrupt prompts) and user responses are relatively short (minimizing the likelihood of Lombard speech and the stuttering effect). See Controlling Lombard speech and the stuttering effect.

Note: Especially for applications using hotword (recognition) barge-in detection, try to keep required user responses to no more than two or three syllables. If this is not possible, you may want to consider using speech-based barge-in.

Effectively worded shorter prompts are generally better than longer prompts, due to the time-bound nature of speech interfaces and the limitations of users' short-term memory. A reasonable goal might be to strive for initial prompt lengths of no more than 3 seconds, and to try to keep the greeting and opening menu to less than 20 seconds. If prompt lengths must consistently exceed 3 seconds, the application should permit barge-in. For planning purposes, assume that each syllable in a prompt or message lasts 150-200 ms.
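
The planning figure of 150-200 ms per syllable implies that a 3-second prompt holds roughly 15 to 20 syllables (3000 / 200 = 15 and 3000 / 150 = 20). The following Python sketch applies that arithmetic; the function names are hypothetical, and the syllable count is supplied by the designer rather than computed from text.

```python
MS_PER_SYLLABLE = (150, 200)  # planning range given in the text


def estimate_duration_seconds(syllables):
    """Return the (shortest, longest) estimated prompt duration in seconds."""
    low_ms, high_ms = MS_PER_SYLLABLE
    return syllables * low_ms / 1000, syllables * high_ms / 1000


def may_exceed_limit(syllables, limit_seconds=3.0):
    """True if the slower estimate runs past the limit (3 seconds by default)."""
    _, longest = estimate_duration_seconds(syllables)
    return longest > limit_seconds
```

A prompt of 10 syllables is estimated at 1.5 to 2 seconds; a prompt of 20 syllables is estimated at 3 to 4 seconds and is therefore a candidate for shortening or for barge-in support.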

Do not overuse the words “please”, “thanks” and “sorry”. Use them only when they serve a clearly defined purpose, not automatically or thoughtlessly.

If you can remove a word without changing the meaning, then consider removing it (while keeping in mind that a certain amount of structural variation in a group of prompts increases the naturalness of the dialog). Strive to use clear and unambiguous terms.

In general, if you have a choice between long and short words that mean the same thing, choose the short word. In most cases, the short word will be more common than the long word and users will hear and process it more quickly. For example, "use" is a better choice than "utilize."

Consolidate common words and phrases. For example, you could combine “Are you calling about buying a fax machine for your home?” and “Are you calling about buying a fax machine for your business?” into “Are you calling about buying a fax machine for your home or business?”.

In general, use the active voice rather than the passive voice. People process the active voice faster and more accurately. This is partly because phrases using the active voice (for example, “Say the book's title”) tend to be shorter than those using the passive voice (for example, “The book's title must be spoken now”). In some cases a sentence will sound best in passive voice, but you should use passive voice only if attempts to rewrite the sentence in active voice don't sound natural.

Good prompts do not necessarily have good grammar. As in normal conversation, many natural phrases do not abide by the rules of grammar.

Avoiding DTMF-style prompts

Prompts in an application with a DTMF interface typically take the form “For option, do action.” With a SUI, the option is the action. In general, you should avoid prompts that mimic DTMF-style prompts; these types of prompts are longer than they need to be for most types of menu selections. For example, use:

System:

Select Marketing, Finance, Human Resources, Accounting, or Research

rather than:

System:

For the Marketing department, say 1
For Finance, say 2
For Human Resources, say 3
For Accounting, say 4
For Research, say 5

or worse:

System:

For the Marketing department, say Marketing
For Finance, say Finance
For Human Resources, say Human Resources
For Accounting, say Accounting
For Research, say Research

If the menu items are difficult for a user to remember (for example, if they are long or contain unusual terms or acronyms), you might choose to mimic DTMF prompts. This style can also work well for first-level help messages because it slows things down, giving users more time to process their options.

Choosing a complex alternative

It isn't always possible to have a simple label for a choice. Consider the following prompt:

System:

Would you like to hear your account balances at the beginning of every
call, or just at the beginning of the first call of the day?

Try to imagine how many different ways a user might respond to this question. One way to deal with this situation is to change the prompt to a yes or no question that has the form, "You can choose A or B. Would you like A'?" where A and B are complex choices and A' is a short version of A. For example:

System:

You can hear your account balances at the beginning of every call, or
just at the beginning of the first call of the day. Would you like to hear
them in every call?

Using the right words

There are many aspects to consider when deciding how to word your application's prompts and menus. The choices you make will have a significant impact on the types of responses your users provide, and therefore on what you will need to code in your grammars. Some of the issues you may need to address include the following:

Adopting user vocabulary
Being concise
Using pronouns
Mixing menu choices and form data in a single prompt
Avoiding synonyms in prompts
Promoting valid user input
Tips for voice spelling

Adopting user vocabulary:

Applications that require users to learn new commands are inherently more difficult to use. During the Design phase of your application development process, make note of the words and phrases that your users typically use to describe common tasks and items; these are the words and phrases to incorporate into your prompts and grammars. See Design Phase.

Being concise:

Regardless of whether your general style is terse or verbose, try to phrase prompts in a way that conveys the maximum amount of information in the minimum amount of time. For example, use:

System:

Say the author's last name, followed optionally by the author's first name.

rather than:

System:

Say the author's last name. If you also know the author's first name, state the author's last name and first name.

Avoid lengthy lead-in phrases to a set of options. Begin your prompts with words like “Select”, “Choose” or, in some cases, “Say”. For example, use:

System:

Select Checking, Savings or Money Market.

rather than:

System:

Please make one of the following choices: Checking, Savings or Money Market.

Sometimes, the most effective way to prompt the user is to word the prompt as a question. For example, use:

System:

Savings or checking?

rather than:

System:

Please choose from the inquiry menu:
Savings
Checking

Question prompts are especially useful when the user can make a choice by repeating one of the options verbatim. You can also use question prompts to collect information, as long as the question restricts likely user input to something the application can understand. For example:

System:

Transfer how much?

Using pronouns:

Use pronouns such as "it" and "one" to avoid stilted repetition of words. For example:

System:

You have four new messages. The first one is from... The second one is from... The third one is from... The last one is from...

rather than:

System:

You have four new messages. The first message is from... The second message is from... The third message is from... The last message is from...

Mixing menu choices and form data in a single prompt:

When mixing menu choices and form filling in the same list, you will want to word the prompt to clearly indicate what the user can say. For example, use a prompt such as:

System:

Please state the author's last name, or say List Best Sellers.

Avoiding synonyms in prompts:

You should avoid using synonyms in prompts; these might mislead the user regarding valid input. For example, use:

System:

To search the database, say Search by Author.

rather than:

System:

To query the database, say Search by Author.

because the latter might cause the user to think that “query” is a valid response.

Promoting valid user input:

Whenever possible, the prompt text should guide the user to the proper word choice. For example:

System:

If this is correct, say Yes.

Try to phrase prompts in a way that minimizes the likelihood of the user inserting extraneous words in the response. For example, if the system cannot interpret dates embedded in sentences, use:

System:

Please state the year you were born.

or:

System:

Birth year?

rather than:

System:

When were you born?

because the latter is likely to elicit a response such as:

User:

I was born on November 3rd, 1954.

Tips for voice spelling:

In general, if you can avoid voice spelling, you should. The letters of the alphabet are notoriously difficult for computers (and humans) to recognize.

If a user must perform voice spelling, then you can take advantage of the fact that some recognition errors are more likely than others. Table 2 shows the results of an experiment conducted to investigate patterns of recognition errors for the letters of the English alphabet. In the table, uppercase letters indicate substitution probabilities that exceeded 10%. Lowercase letters indicate substitutions that occurred during the study, but had substitution probabilities less than 10%.

For example, if a user rejects a returned K in a voice-spelling application, then the letter most likely to have actually been spoken is A. Thirteen of the letters in the table have only one substitution for which the probability of substitution exceeded 10%. Eleven of the letters don't have any substitutes for which the probability of substitution exceeded 10%, and six of those (H, I, U, W, X, and Y) didn't have any substitutions at all. This means that whenever the system returns one of these six letters, you can have very high confidence that the speaker actually said that letter. Only two letters (T and G) had two substitutes for which the probability of substitution exceeded 10%. Note that the substitution probabilities are not usually symmetrical (F/S and V/Z are exceptions). If the system returned a D but the user indicated that it was not correct, the letter the user said was most likely an E. If the system returned an E but the user indicated that it was not correct, the letter the user said was most likely a V.

Table 2. Recognition errors when spelling
Letter returned	Most likely substituted for
A	I
B	d
C	V t z
D	E b v
E	V
F	S
G	P T v
H
I
J	a
K	A
L	m
M	N
N	x
O	L r u
P	d e
Q	U p t
R	I f y
S	F h j
T	D E g p v
U
V	Z
W
X
Y
Z	V
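
Table 2 can be encoded directly as a lookup. The Python sketch below copies the table rows as strings, keeping the uppercase/lowercase distinction; the names `SUBSTITUTIONS`, `likely_alternatives`, and `all_alternatives` are hypothetical. The table does not rank letters within the same case, so the order inside each group is simply the order in which the table lists them.

```python
# Letter returned -> letters it was substituted for, as listed in Table 2.
# Uppercase: substitution probability above 10%. Lowercase: observed, below 10%.
SUBSTITUTIONS = {
    "A": "I", "B": "d", "C": "V t z", "D": "E b v", "E": "V", "F": "S",
    "G": "P T v", "H": "", "I": "", "J": "a", "K": "A", "L": "m", "M": "N",
    "N": "x", "O": "L r u", "P": "d e", "Q": "U p t", "R": "I f y",
    "S": "F h j", "T": "D E g p v", "U": "", "V": "Z", "W": "", "X": "",
    "Y": "", "Z": "V",
}


def likely_alternatives(returned):
    """Letters with a substitution probability above 10% for a returned letter."""
    entries = SUBSTITUTIONS[returned.upper()].split()
    return [entry for entry in entries if entry.isupper()]


def all_alternatives(returned):
    """All observed substitutions: the likely ones first, then the rarer ones."""
    entries = SUBSTITUTIONS[returned.upper()].split()
    rare = [entry.upper() for entry in entries if entry.islower()]
    return likely_alternatives(returned) + rare
```

For a rejected K, `likely_alternatives("K")` returns `["A"]`; for a rejected H it returns an empty list, which signals that the recognizer's result is unlikely to be a substitution error.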

There are several ways that you can use this information to improve the user experience when voice spelling. If the information that the user is spelling is unbounded (such as a user's last name), then you can offer alternatives in the order of their likelihood. For example:

`System:`	`Please spell your last name.`
`User:`	`L e w i s`
`System:`	`L e w i f?`
`User:`	`No.`
`System:`	`First letter was L as in Lucy?`
`User:`	`Yes.`
`System:`	`Then E as in elephant?`
`User:`	`Yes.`
`System:`	`Then W as in walrus?`
`User:`	`Yes.`
`System:`	`Then I as in Iris?`
`User:`	`Yes.`
`System:`	`Then F as in Frank?`
`User:`	`No.`
`System:`	`Was it S as in Sally?`
`User:`	`Yes.`
`System:`	`L e w i s?`
`User:`	`Yes.`
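
The letter-by-letter confirmation in this dialog can be sketched in Python as shown below. The sketch is not a dialog implementation: `ask` is a hypothetical callback that stands in for a yes/no confirmation prompt (such as "Then F as in Frank?"), and `alternatives` is assumed to hold substitutes ordered by likelihood, for example taken from Table 2.

```python
def confirm_spelling(returned, alternatives, ask):
    """Confirm each returned letter, offering substitutes after a rejection.

    returned     -- letters the recognizer returned, for example "LEWIF"
    alternatives -- dict mapping a returned letter to substitutes, most likely first
    ask          -- callback ask(position, letter) -> bool (hypothetical)
    """
    confirmed = []
    for position, letter in enumerate(returned):
        for candidate in [letter] + list(alternatives.get(letter, [])):
            if ask(position, candidate):
                confirmed.append(candidate)
                break
        else:
            # Every candidate was rejected; the caller must reprompt for this letter.
            confirmed.append(None)
    return confirmed


# Simulated caller who actually spelled "LEWIS".
spoken = "LEWIS"
asked = []


def simulated_user(position, letter):
    asked.append(letter)
    return spoken[position] == letter


result = confirm_spelling("LEWIF", {"F": ["S"], "S": ["F"]}, simulated_user)
```

In this simulation the system asks about L, E, W, I, and F, and after the rejection of F it offers S, which mirrors the sample dialog.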

You can also use n-best decoding to find likely substitutes. See Refining confirmation and error correction with confidence levels and n-best lists for details.

If the information that the user is spelling is bounded (such as a part number, or a list of users' last names in a database), then you can often detect and correct voice spelling recognition errors without involving the user. For example, suppose the user is ordering a part by its code:

`System:`	`Part code?`
`User:`	`S X H U F`
`System:`	<Returns F X H U F> <Checks part number database> <Doesn't find this number> <Checks the following possibilities: F X H U S, S X H U S, S X H U F> <Only the third one is in the database>
`System:`	`S as in Sam, X, H, U, F as in Frank?`
`User:`	`Yes.`

In the example above, the system created the different possible part numbers by using the information from the table of substitutions. Because there were no likely substitutes for X, H, or U, the system left them alone, and systematically changed F to S and vice versa.
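
A minimal Python sketch of this bounded correction follows. The `LIKELY` mapping contains only the Table 2 substitutions whose probability exceeded 10%; the function names and the sample database in the usage note are hypothetical.

```python
from itertools import product

# Letter returned -> letters most likely to have been spoken (Table 2, above 10%).
LIKELY = {
    "A": ["I"], "C": ["V"], "D": ["E"], "E": ["V"], "F": ["S"],
    "G": ["P", "T"], "K": ["A"], "M": ["N"], "O": ["L"], "Q": ["U"],
    "R": ["I"], "S": ["F"], "T": ["D", "E"], "V": ["Z"], "Z": ["V"],
}


def candidate_codes(returned):
    """Generate the returned code plus every variant built from likely substitutes."""
    choices = [[letter] + LIKELY.get(letter, []) for letter in returned]
    return ["".join(combo) for combo in product(*choices)]


def match_in_database(returned, database):
    """Return the candidate codes that exist in the database."""
    return [code for code in candidate_codes(returned) if code in database]
```

For the returned code `"FXHUF"`, `candidate_codes` yields FXHUF, FXHUS, SXHUF, and SXHUS. If the database contains only SXHUF among these, `match_in_database` returns that single code, which the system can then confirm with the user. If more than one candidate matches, the application needs to ask the user to choose between them.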

Providing instructional information

In certain situations, you may need to provide instructional information to the users.

“Feeding-forward” information as confirmation:

Where applicable, you may want to word prompts in a way that “feeds the result forward” (that is, incorporates the user response) into the next prompt. For example:

`System:`	`Say Phone or Fax.`
`User:`	`Phone`
`System:`	`Which phone action? Leave message, Camp, or Forward call.`

This technique provides feedback that the system correctly understood the response and also reinforces the user's mental model of the dialog state. Using this technique eliminates the need for cumbersome confirmation of every user input; however, you should still confirm user requests for actions that cannot be undone.
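
As a rough illustration, a feed-forward prompt can be assembled from the recognized response and the next set of options. The helper names below are hypothetical, and the wording template ("Which ... action?") is taken from the example dialog rather than from any general rule.

```python
def spoken_list(options):
    """Join options the way a prompt would read them: "A, B, or C"."""
    if len(options) <= 2:
        return " or ".join(options)
    return ", ".join(options[:-1]) + ", or " + options[-1]


def feed_forward_prompt(recognized, next_options):
    """Build the next prompt so that it echoes what the system recognized."""
    return "Which {} action? {}.".format(recognized.lower(), spoken_list(next_options))
```

If the recognizer returned "Fax" when the user said "Phone", the next prompt would begin "Which fax action?", which gives the user an immediate cue that a correction is needed.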

Recovering from errors:

This section deals with error recovery only as it relates to prompt wording. See Error recovery and confirming user input for additional information about error recovery.

To promote faster error recovery, prompts should focus on keeping the dialog moving rather than on any mistakes. For example, if you are using self-revealing help, you should avoid having the system say:

System:

Sorry, I don't understand what you said.

because there is no evidence that this helps the user understand what to do next. See Implementing self-revealing help for guidance on how to keep the dialog moving forward with self-revealing help.

Similarly, it is best to avoid claiming that the user “said” a particular response, since the information you present is actually just a reflection of how the speech recognition performed. For example, use:

System:

Was that 113?

instead of:

System:

You said 113. This is not a valid quantity.
Please restate your quantity.