Designing the dialog

Customer satisfaction with your voice response service will depend on a number of factors, including the voice you choose, the information you provide, and the ease with which callers can get the information they want. The sound and feel of the service is every bit as important as the look and feel of a visual computer application. With careful design, the caller’s experience will be pleasurable; with lack of attention to dialog design, the caller will get frustrated and hang up. At best this will result in your people handling just as many calls as they ever did; at worst it may mean loss of business.

You’re already considering a voice response service, so you’re aware of the strengths of voice response: telephones are ubiquitous, familiar, and easy to use, and an automated system can provide more speed, privacy, efficiency, availability, and accuracy than some human operators, at lower cost.

Nevertheless, when designing the dialog, you need to be aware of some of the limitations of the medium. Compared with people, automated systems are inflexible, demanding, and intolerant of deviations from the expected conversation. They are best at handling very routine calls. Almost every voice response service should offer help in some way, for example by offering to transfer the caller to an operator (if call transfer is provided by the switch) or by giving another number to call to speak to an operator. Where call transfer is available, the application can also transfer the caller to a human agent automatically in some circumstances.

You should give human factors and usability a high priority when designing voice response services, in addition to testing the usability of the services later.

Design considerations

Compared with today’s multiwindowed screen-based computer applications, voice response applications have a number of potential limitations. With careful design, you can overcome the limitations and make your application a delight to use:

Auditory output is sequential, and hard to keep in short-term memory: you have to take great care wording the prompts.
Auditory output can be slow. Someone (either you or the caller) is paying for the call and wants the whole transaction to take as little time as possible, so the prompts need to be worded carefully to convey the greatest amount of information, unambiguously, in the shortest time.
The very ubiquity and availability of telephones means that callers don’t have access to manuals and other supporting materials: the whole point is that the voice response service can be used from anywhere, so the application itself must be completely self-documenting.
Voice data takes up a lot of storage space (one second of voice occupies 8000 bytes of storage). This can be reduced by a factor of 5 by compression, or even further by generating synthesized speech from text.
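
The storage figures above lend themselves to a quick estimate. The following is a minimal sketch in Python that uses only the two numbers quoted in the text (8000 bytes per second, and roughly a factor of 5 from compression). The function name and the workload of 200 six-second prompts are assumptions made up for illustration.

```python
BYTES_PER_SECOND = 8000  # uncompressed voice, as quoted in the text
COMPRESSION_FACTOR = 5   # approximate reduction quoted in the text


def prompt_storage_bytes(seconds: float, compressed: bool = False) -> int:
    """Estimate the storage needed for `seconds` of recorded voice."""
    raw = seconds * BYTES_PER_SECOND
    return round(raw / COMPRESSION_FACTOR) if compressed else round(raw)


# Hypothetical workload: 200 prompts averaging 6 seconds each.
total_seconds = 200 * 6
print(prompt_storage_bytes(total_seconds))                   # 9600000
print(prompt_storage_bytes(total_seconds, compressed=True))  # 1920000
```

On these assumptions, the prompt set needs about 9.6 MB uncompressed or about 1.9 MB compressed.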

On the input side, from the caller’s point of view, the telephone is an inferior device compared with the computer keyboard and mouse:

There are two ways of communicating with a Blueworx Voice Response application: by key or by voice. If keys are used, they must transmit dual-tone multifrequency (DTMF) tones. If the caller is able to transmit these tones, keys are the most accurate and efficient method of input.
Bear in mind that many callers may be using phones with an integral keypad and handset: they will take longer to press the keys than callers using traditional phones with a separate keypad.
Compared with the 100 or so keys on a computer keyboard, the standard telephone has twelve keys (Blueworx Voice Response supports up to sixteen: 0 through 9, *, #, A, B, C, and D).
The standard keypad is fine for numeric input, making simple choices, and answering yes-no questions, but what about alphabetic input? In some countries, there are no alphabetic characters on the keys at all; even in countries where telephones still have the alphabetic characters on the number keys, people don’t use them enough to be familiar with their positions. (And note that different countries use different layouts anyway.)
Many telephones cannot transmit DTMF tones reliably or at all: rotary dial, hybrid tone, and dial-pulse phones are still in use; cellular and portable phones often suffer from noise interference; some equipment sends tones of fixed duration or cannot send tones at all during a call. In these cases, the only reasonable input device is voice, and for that, you require access to speech recognition hardware or software.
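
Each of the sixteen keys mentioned above sends a pair of tones, one identifying its keypad row and one its column. The sketch below is written in Python and uses the standard DTMF frequency assignments, which are general telephony facts rather than anything taken from the Blueworx Voice Response documentation. The function names are hypothetical.

```python
from typing import Tuple

# Standard DTMF tone frequencies in Hz: one low tone per row, one high tone per column.
ROW_HZ = (697, 770, 852, 941)
COL_HZ = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")  # the fourth column holds A to D


def dtmf_pair(key: str) -> Tuple[int, int]:
    """Return the (row tone, column tone) pair sent by a single key."""
    if len(key) != 1:
        raise ValueError(f"expected a single key, got {key!r}")
    for row, keys in enumerate(KEYPAD):
        col = keys.find(key.upper())
        if col != -1:
            return ROW_HZ[row], COL_HZ[col]
    raise ValueError(f"not a DTMF key: {key!r}")


def on_standard_keypad(key: str) -> bool:
    """True for the twelve keys found on a standard telephone."""
    return dtmf_pair(key)[1] != 1633


print(dtmf_pair("5"))           # (770, 1336)
print(on_standard_keypad("A"))  # False
```

An application intended for callers on ordinary telephones would normally restrict its key assignments to the twelve keys for which `on_standard_keypad` returns true.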

Faced with all these challenges, you have a number of design decisions to make. First you need to determine the right dialog style or styles and then you need to attend to the detail of each interaction, or task, within the application. If you design your application well, callers will prefer it to the human agent, and the reward will be savings in business costs.

Choosing a dialog style

No single dialog style is best for all applications, for all tasks within an application, or for all callers:

Callers may be familiar or unfamiliar with the mechanics of driving the application.
Callers may be familiar or unfamiliar with other voice response services.
Callers may use the service frequently (and therefore acquire expertise with the mechanics, or the content, over time) or infrequently (never acquiring expertise).
The content of the dialog may be familiar and predictable (for example, days of the week) or relatively unfamiliar and unpredictable (for example, toppings available for pizzas).
It may be appropriate to provide documentation or training sessions, for example, if callers are employees or students, but it is unreasonable to base your design on the assumption that documentation will be read or training sessions attended.

There is a far greater choice of dialog designs for voice response services than you might imagine. Some less common designs overcome the problems encountered by callers in using traditional voice response services. Different styles of dialog may be appropriate in different parts of the service, depending on the task being performed: after all, these days one expects to see pull-down menus, radio buttons, check boxes, and so on, used in visual applications, so it’s not surprising that there might be a variety of techniques you can use in voice response dialogs.

There are three basic styles of voice response dialog, suited to different types of task:

Menu:: suited to selecting one of a small number of options
List:: suited to choosing multiple items, perhaps from a large number
Form:: suited to providing input such as addresses and telephone numbers

Within these basic styles, there are numerous variations. It’s hard to provide rules for good dialog design, but you need to classify your application and your callers in these terms before considering the following questions:

Composite or separate actions? Composite actions may be simpler for the caller, but separate actions give more flexibility. If your callers are unlikely to use the application often, give them composite actions that accomplish the most with the fewest keystrokes. If callers are likely to use the application often, however, they may want to combine actions in different ways.
Keys or speech recognition? The application is using speech to communicate with the caller, but should the caller be using speech or keys? Remember it takes longer to speak a command and have it recognized than it does to press a key. Also, people are better at pressing keys accurately than speaking clearly enough to be recognized accurately.
A mixture of key and speech input? You can mix key input and speech input in a single application (though allowing both at the same time is not recommended). For example, entering a Personal Identification Number (PIN) is easier using keys, whereas recording a street address is easier with speech.
Command-driven or prompted? You need to decide whether to let callers interrupt the prompts. In general, you should let them interrupt. Once they learn the choices at each point, they can key ahead (or speak ahead) without waiting for the prompt to finish. There may, however, be some prompts that you want to force to play to the end; there may even be whole applications in which you want all prompts to be force played.
The type of dialog that allows key-ahead or speak-ahead is sometimes known as a command-driven dialog, but you still need to play the prompts in a voice application, because of the absence of documentation.
A single selection key or option-specific selection keys? With a single selection key, you play each choice to the caller and give them a few seconds to press the nominated key (for example, 1); if they do not press a key, you play the next choice, and so on. The caller always presses 1 to select the current option. There is less for callers to learn if, at any point, they have only two choices: to select the current option or to proceed to the next. This can be useful for long lists of options (for example, film titles). On the other hand, this kind of design tends to restrict the caller’s ability to key ahead and bypass menus.
With option-specific selection keys, you allocate a different key to each option. This has the advantage of allowing callers to key ahead, but limits the options available at any one time. It can also be difficult to ensure consistency of key-allocation throughout the application (for example, on one menu, 3 may be “Delete message” while on another menu, it is something less destructive). Again, infrequent use may indicate the single selection key and frequent use the option-specific selection keys.
Passive or active advance through menus? Should each option “drop through” to the next option, or should the caller have to press a key to proceed? This becomes an issue if you provide a single selection key. Again, passive advance is probably most suitable for callers who use the application infrequently. To avoid the problem of callers selecting an option just after the menu has moved on to the next one, you need to include a few seconds of silence after each option.
Passive advance might seem easiest for the caller, but only at first. Later, callers may become frustrated at having to sit through whole menus. Providing a “Next” key would seem to prevent this, but the “drop-through” behavior often results in callers never learning about the “Next” key, unless the menus mention it.
Long or short prompts? The more information you provide, the more likely the caller is to learn how to drive the application. However, long prompts slow down callers, particularly experienced ones. A choice of novice and expert prompts can help. The difference between them is not necessarily verbose versus terse: you could actually leave some information out of the expert prompts altogether.
Always provide essential information before information that is merely helpful. Provide “Next” and “Previous” keys, so that callers can skip over information they don’t want to hear.
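
The single selection key and passive advance choices described above can be sketched as a simple loop. This is an illustrative Python sketch, not Blueworx Voice Response code: the key assignments, the `get_key` callback (assumed to play a prompt, wait a few seconds, and return the key pressed, or `None` on timeout), and the function name are all hypothetical. Each prompt mentions the “Next” key so that callers can learn about it.

```python
from typing import Callable, Optional, Sequence

SELECT_KEY = "1"  # hypothetical single selection key
NEXT_KEY = "9"    # hypothetical "Next" key for active advance


def run_single_key_menu(
    options: Sequence[str],
    get_key: Callable[[str], Optional[str]],
    max_passes: int = 2,
) -> Optional[str]:
    """Play each option in turn; return the one selected, or None."""
    for _ in range(max_passes):
        for option in options:
            prompt = (f"For {option}, press {SELECT_KEY}. "
                      f"For the next choice, press {NEXT_KEY}.")
            key = get_key(prompt)
            if key == SELECT_KEY:
                return option
            # NEXT_KEY (active advance) and a timeout (passive advance)
            # both move on; this sketch treats any other key the same way.
    return None


# Scripted caller: waits through the first title, skips the second,
# and selects the third.
scripted = iter([None, NEXT_KEY, SELECT_KEY])
choice = run_single_key_menu(
    ["Casablanca", "Vertigo", "Metropolis"],
    lambda prompt: next(scripted),
)
print(choice)  # Metropolis
```

Because the caller only ever chooses between selecting and moving on, the loop stays the same however long the list is, but there is no way to key ahead to a particular option.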
