A characteristic of many voice applications is that they have little or no external documentation. Often, these applications must support both novice and expert users. Part of the challenge of designing a good voice application is providing just enough information at just the right time. In general, you don't want to force users to hear more than they need to hear, and you don't want to require them to say more than they need to say. Adhering to the following guidelines can help you achieve this:
When designing menus, you should follow the guidelines in Table 1.
Application | Maximum number of menu items |
---|---|
Barge-in enabled | 12 |
Barge-in disabled | 5 |
Any application in which the menu items are long phrases | 3 |
For cases in which you cannot stay within these limits, see Managing audio lists.
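The limits in Table 1 can be expressed as a small lookup that a design review script might run over each menu. This is only a sketch: the function names and parameters are hypothetical and are not part of any voice platform API.

```python
def max_menu_items(barge_in_enabled, long_phrase_items=False):
    """Return the Table 1 limit on the number of menu items."""
    # Long-phrase menus are capped at 3 in any application,
    # regardless of the barge-in setting.
    if long_phrase_items:
        return 3
    return 12 if barge_in_enabled else 5


def menu_within_limit(items, barge_in_enabled, long_phrase_items=False):
    """Check whether a list of menu items stays within the Table 1 limit."""
    return len(items) <= max_menu_items(barge_in_enabled, long_phrase_items)
```

For example, `menu_within_limit(["Library", "Banking", "Calendar"], barge_in_enabled=False)` is true, because three items fit within the limit of five.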
A deep menu structure is one in which there are few choices available at any given level in the structure but there are many levels. A flat menu structure is one in which there are many choices available at any given level in the structure but there are few levels. In the flattest possible structure, there is only one level which contains all the choices. A terminal node in a menu structure is a choice that does not lead to any additional sets of choices.
For most applications, the conditions will favor the use of flat rather than deep menu structures.
Older guidelines for DTMF user interfaces (for example, Marics & Engelbeck) strongly advised against exceeding four options per menu. This guideline is inappropriate for SUIs because options in speech menus typically have far fewer words than DTMF options. Recent human factors research has also challenged the applicability of this guideline for DTMF menus.
Some practitioners have suggested 7 ±2 menu items for speech menus. This suggestion assumes that users are trying to memorize each option as they hear it, but task analysis of selection from auditory menus does not support this assumption. A user does not need to memorize all of the items in a speech menu; users only need to remember the one that is the current best match to the desired function. If the user hears an excellent match, then he or she can barge in to select it (self-terminating search). Otherwise, the user continues to listen to options until hearing a better match (discarding the old and remembering the new) or there are no more options (exhaustive search).
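The selection process described above can be sketched as a short simulation. The listener holds only one item in memory (the best match so far) and stops early on an excellent match. The scoring function and the `excellent` threshold are hypothetical stand-ins for the user's own judgment; they are assumptions made for illustration, not measured values.

```python
def select_from_menu(options, match_score, excellent=0.9):
    """Simulate a listener who remembers only the best match heard so far.

    Returns (chosen option, number of options heard).
    """
    best, best_score = None, float("-inf")
    for heard, option in enumerate(options, start=1):
        score = match_score(option)
        if score > best_score:
            # Discard the old candidate and remember the new one.
            best, best_score = option, score
        if score >= excellent:
            # Self-terminating search: barge in on an excellent match.
            return best, heard
    # Exhaustive search: the menu ended, so take the best match heard.
    return best, len(options)
```

Note that memory load stays constant at one item however long the menu is, which is why the 7 ±2 limit does not follow from this model.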
There are a number of strategies that you should consider when deciding how to group information.
When appropriate, you should consider separating any introductory or instructive text from the prompt text; this allows you to reprompt without repeating the introductory text. For example, you might create one audio file that says, “Welcome to the WebVoice demo” and a separate audio file that says, “Say one of the following options: Library, Banking, Calendar.” The first time through the sequence, the application could play the files in succession. If the user returns to this main menu later in the application session, the application could play only the second audio file. For example, the first time the user hears:
“Welcome to the WebVoice demo. Say one of the following options: Library, Banking, Calendar.”
On the second and any subsequent times, the user hears only:
“Say one of the following options: Library, Banking, Calendar.”
If appropriate for your application design, you might even create separate audio files for each menu choice.
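A minimal sketch of this separation follows, assuming the introduction and the menu prompt are stored as two audio files. The file names are hypothetical, and the function only chooses what to play; actual playback depends on your platform.

```python
INTRO_AUDIO = "welcome.wav"    # hypothetical file: "Welcome to the WebVoice demo"
MENU_AUDIO = "main_menu.wav"   # hypothetical file: "Say one of the following options: ..."


def main_menu_audio(first_visit):
    """Return the audio files to play, in order, for the main menu."""
    if first_visit:
        # First time through: introduction followed by the menu prompt.
        return [INTRO_AUDIO, MENU_AUDIO]
    # Later visits in the same session: menu prompt only.
    return [MENU_AUDIO]
```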
When presenting menu items, consider putting the most common choices first so that most users don't have to listen to the remaining options. For example:
|
A possible exception to this guideline is when the most common choice is a more general case of another choice. In this example, “Other loans” is presented last, regardless of its relative frequency of use:
|
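The ordering rule, including the exception for a general catch-all choice, could be sketched as follows. The usage counts in the example are made up for illustration, and `order_menu` is a hypothetical helper rather than part of any toolkit.

```python
def order_menu(usage_counts, catch_all=None):
    """Order menu items by descending usage, keeping a catch-all item last.

    usage_counts maps each menu item to how often callers choose it.
    """
    ordered = sorted(
        (item for item in usage_counts if item != catch_all),
        key=lambda item: usage_counts[item],
        reverse=True,
    )
    if catch_all in usage_counts:
        # The general case goes last regardless of its frequency.
        ordered.append(catch_all)
    return ordered


# Illustrative counts only.
menu = order_menu(
    {"Other loans": 500, "Car loans": 300, "Home loans": 120},
    catch_all="Other loans",
)
```

Here `menu` is `["Car loans", "Home loans", "Other loans"]`, even though “Other loans” has the highest count.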
Applications are generally most usable when system prompts are as short as possible (minimizing users' need to interrupt prompts) and user responses are relatively short (minimizing the likelihood of Lombard speech and the stuttering effect; see Controlling Lombard speech and the stuttering effect).
Effectively worded shorter prompts are generally better than longer prompts, due to the time-bound nature of speech interfaces and the limitations of users' short-term memory. A reasonable goal might be to strive for initial prompt lengths of no more than 3 seconds, and to try to keep the greeting and opening menu to less than 20 seconds. If prompt lengths must consistently exceed 3 seconds, the application should permit barge-in. For planning purposes, assume that each syllable in a prompt or message lasts 150-200 ms.
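The planning figure of 150-200 ms per syllable makes it easy to estimate prompt length before recording. The sketch below works in whole milliseconds and takes the syllable count as input; counting syllables automatically is outside its scope. The function names are hypothetical.

```python
MIN_MS_PER_SYLLABLE = 150
MAX_MS_PER_SYLLABLE = 200


def prompt_duration_ms(syllables):
    """Return the (shortest, longest) estimated prompt duration in ms."""
    return syllables * MIN_MS_PER_SYLLABLE, syllables * MAX_MS_PER_SYLLABLE


def fits_budget(syllables, budget_ms=3000):
    """Check the slower estimate against a time budget (default 3 seconds)."""
    return prompt_duration_ms(syllables)[1] <= budget_ms
```

Under these assumptions, a 12-syllable prompt lasts roughly 1.8 to 2.4 seconds, and a prompt stays within the 3-second goal at the slower rate only if it has 15 syllables or fewer.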
Do not overuse the words “please”, “thanks” and “sorry”. You can use them, but don't use them automatically or thoughtlessly; use them only when they serve a clearly defined purpose.
If you can remove a word without changing the meaning, then consider removing it (while keeping in mind that a certain amount of structural variation in a group of prompts increases the naturalness of the dialog). Strive to use clear and unambiguous terms.
In general, if you have a choice between long and short words that mean the same thing, choose the short word. In most cases, the short word will be more common than the long word and users will hear and process it more quickly. For example, "use" is a better choice than "utilize."
Consolidate common words and phrases. For example, you could combine “Are you calling about buying a fax machine for your home?” and “Are you calling about buying a fax machine for your business?” into “Are you calling about buying a fax machine for your home or business?”.
In general, use the active voice rather than the passive voice. People process the active voice faster and more accurately. This is partly because phrases using the active voice (for example, “Say the book's title”) tend to be shorter than those using the passive voice (for example, “The book's title must be spoken now”). In some cases a sentence will sound best in passive voice, but you should use passive voice only if attempts to rewrite the sentence in active voice don't sound natural.
Good prompts do not necessarily have good grammar. As in normal conversation, many natural phrases do not abide by the rules of grammar.
Prompts in an application with a DTMF interface typically take the form “For option, do action.” With a SUI, the option is the action. In general, you should avoid prompts that mimic DTMF-style prompts; these types of prompts are longer than they need to be for most types of menu selections. For example, use:
|
rather than:
|
or worse:
|
If the menu items are difficult for a user to remember (for example, if they are long or contain unusual terms or acronyms), you might choose to mimic DTMF prompts. This style can also work well for first-level help messages because it slows things down, giving users more time to process their options.
It isn't always possible to have a simple label for a choice. Consider the following prompt:
|
Try to imagine how many different ways a user might respond to this question. One way to deal with this situation is to change the prompt to a yes or no question that has the form, "You can choose A or B. Would you like A'?" where A and B are complex choices and A' is a short version of A. For example:
|
There are many aspects to consider when deciding how to word your application's prompts and menus. The choices you make will have a significant impact on the types of responses your users provide, and therefore on what you will need to code in your grammars. Some of the issues you may need to address include the following:
Applications that require users to learn new commands are inherently more difficult to use. During the Design phase of your application development process (see Design Phase), you will want to make note of the words and phrases that your users typically use to describe common tasks and items. These are the words and phrases that you will want to incorporate into your prompts and grammars.
Regardless of whether your general style is terse or verbose, try to phrase prompts in a way that conveys the maximum amount of information in the minimum amount of time. For example, use:
|
rather than:
|
|
|
Sometimes, the most effective way to prompt the user is to word the prompt as a question. For example, use:
|
rather than:
|
Question prompts are especially useful when the user can make a choice by repeating one of the options verbatim. You can also use question prompts to collect information, as long as the question restricts likely user input to something the application can understand. For example:
|
Use pronouns such as "it" and "one" to avoid stilted repetition of words. For example:
|
rather than:
|
When mixing menu choices and form filling in the same list, you will want to word the prompt to clearly indicate what the user can say. For example, use a prompt such as:
|
You should avoid using synonyms in prompts; these might mislead the user regarding valid input. For example, use:
|
rather than:
|
because the latter might cause the user to think that “query” is a valid response.
Whenever possible, the prompt text should guide the user to the proper word choice. For example:
|
Try to phrase prompts in a way that minimizes the likelihood of the user inserting extraneous words in the response. For example, if the system cannot interpret dates embedded in sentences, use:
|
or:
|
rather than:
|
because the latter is likely to elicit a response such as:
|
In general, if you can avoid voice spelling, you should. The letters of the alphabet are notoriously difficult for computers (and humans) to recognize.
If a user must perform voice spelling, then you can take advantage of the fact that some recognition errors are more likely than others. Table 2 shows the results of an experiment conducted to investigate patterns of recognition errors for the letters of the English alphabet. In the table, uppercase letters indicate substitution probabilities that exceeded 10%. Lowercase letters indicate substitutions that occurred during the study, but had substitution probabilities less than 10%.
For example, if a user rejects a returned K in a voice-spelling application, then the letter most likely to have actually been spoken is A. Thirteen of the letters in the table have only one substitute for which the probability of substitution exceeded 10%. Eleven of the letters don't have any substitutes for which the probability of substitution exceeded 10%, and six of those didn't have any substitutions at all. This means that whenever the system returns these letters (H, I, U, W, X, and Y), you can have very high confidence that the speaker actually said that letter. Only two letters (T and G) had two substitutes for which the probability of substitution exceeded 10%. Note that the substitution probabilities are not usually symmetrical (F/S and V/Z are exceptions). If the system returned a D but the user indicated that it was not correct, the letter the user said was most likely an E. If the system returned an E but the user indicated that it was not correct, the letter the user said was most likely a V.
Letter returned | Most likely substituted for |
---|---|
A | I |
B | d |
C | V t z |
D | E b v |
E | V |
F | S |
G | P T v |
H | |
I | |
J | a |
K | A |
L | m |
M | N |
N | x |
O | L r u |
P | d e |
Q | U p t |
R | I f y |
S | F h j |
T | D E g p v |
U | |
V | Z |
W | |
X | |
Y | |
Z | V |
There are several ways that you can use this information to improve the user experience when voice spelling. If the information that the user is spelling is unbounded (such as a user's last name), then you can offer alternatives in the order of their likelihood. For example:
|
You can also use n-best decoding to find likely substitutes. See Refining confirmation and error correction with confidence levels and n-best lists for details.
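For the unbounded case, one way to offer alternatives is to encode Table 2 as a lookup keyed by the returned letter. The sketch below copies the table rows as strings, preserving the uppercase (above 10%) and lowercase (below 10%) convention. The table does not give exact probabilities, so the order within each group is simply the order printed in the table, which is an assumption. The function names are hypothetical.

```python
# Table 2, keyed by the letter the recognizer returned.
# Uppercase: substitution probability above 10%; lowercase: below 10%.
SUBSTITUTIONS = {
    "A": "I", "B": "d", "C": "Vtz", "D": "Ebv", "E": "V", "F": "S",
    "G": "PTv", "H": "", "I": "", "J": "a", "K": "A", "L": "m",
    "M": "N", "N": "x", "O": "Lru", "P": "de", "Q": "Upt", "R": "Ify",
    "S": "Fhj", "T": "DEgpv", "U": "", "V": "Z", "W": "", "X": "",
    "Y": "", "Z": "V",
}


def likely_alternatives(returned):
    """Letters with a substitution probability above 10%."""
    return [c for c in SUBSTITUTIONS[returned.upper()] if c.isupper()]


def all_alternatives(returned):
    """All observed substitutes, with the above-10% letters first."""
    row = SUBSTITUTIONS[returned.upper()]
    # sorted() is stable, so table order is kept within each group.
    return [c.upper() for c in sorted(row, key=str.islower)]
```

If the user rejects a returned D, `all_alternatives("D")` gives `["E", "B", "V"]`, so the application would offer E first.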
If the information that the user is spelling is bounded (such as a part number, or a list of users' last names in a database), then you can often detect and correct voice spelling recognition errors without involving the user. For example, suppose the user is ordering a part by its code:
|
In the example above, the system created the different possible part numbers by using the information from the table of substitutions. Because there were no likely substitutes for X, H, or U, the system left them alone, and systematically changed F to S and vice versa.
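The bounded case can be sketched as candidate generation followed by a lookup in the set of valid values. The sketch below uses only the above-10% substitutes from Table 2, leaves characters with no likely substitute (including digits) unchanged, and tries every combination. The part codes shown are invented for illustration, and the function names are hypothetical.

```python
from itertools import product

# Above-10% substitutes from Table 2, keyed by the returned letter.
LIKELY = {
    "A": "I", "C": "V", "D": "E", "E": "V", "F": "S", "G": "PT",
    "K": "A", "M": "N", "O": "L", "Q": "U", "R": "I", "S": "F",
    "T": "DE", "V": "Z", "Z": "V",
}


def candidate_codes(recognized):
    """Return the recognized code plus every likely-substitute variant."""
    # For each position: the returned character, then its likely substitutes.
    choices = [ch + LIKELY.get(ch, "") for ch in recognized.upper()]
    return ["".join(combo) for combo in product(*choices)]


def resolve(recognized, valid_codes):
    """Return the candidates that exist in the set of valid codes."""
    return [code for code in candidate_codes(recognized) if code in valid_codes]


# Invented part codes, for illustration only.
matches = resolve("XFHU", {"XSHU", "XTHU"})
```

Here `candidate_codes("XFHU")` yields `["XFHU", "XSHU"]`, and only `"XSHU"` is a valid code, so the system can correct the error without asking the user. If more than one candidate is valid, the application would still need to ask the user to choose.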
In certain situations, you may need to provide instructional information to the users.
Where applicable, you may want to word prompts in a way that “feeds the result forward” (that is, incorporates the user response) into the next prompt. For example:
|
This technique provides feedback that the system correctly understood the response and also reinforces the user's mental model of the dialog state. Using this technique eliminates the need for cumbersome confirmation of every user input; however, you should still confirm user requests for actions that cannot be undone.
This section deals with error recovery only as it relates to prompt wording. See Error recovery and confirming user input for additional information about error recovery.
To promote faster error recovery, prompts should focus on keeping the dialog moving rather than on any mistakes. For example, if you are using self-revealing help, you should avoid having the system say:
|
because there is no evidence that this helps the user understand what to do next. See Implementing self-revealing help for guidance on how to keep the dialog moving forward with self-revealing help.
Similarly, it is best to avoid claiming that the user “said” a particular response, since the information you present is actually just a reflection of how the speech recognition performed. For example, use:
|
instead of:
|