Designing and using grammars

Designing good grammars is as much art as science. Iterative prototyping is crucial to grammar design. See Design methodology.

Since only words, phrases, and DTMF key sequences from active grammars are possible speech recognition candidates, what you choose to put in a grammar and when you choose to make each grammar active have a major impact on speech recognition accuracy. In general, you should only enable a grammar when it is appropriate for a user to say something matching that grammar. When appropriate, you should reuse grammars to promote application consistency.

Managing trade-offs

There are many trade-offs that you will want to consider in deciding what words and phrases to include in your grammars and when to make each grammar active. Some of the major trade-offs are:

Word and phrase length:

One of the first trade-offs you are likely to encounter is how long users' responses should be. Table 1 compares longer and shorter responses.

Table 1. Grammar word/phrase length trade-offs

| Longer words and phrases | Shorter words and phrases |
| --- | --- |
| Multisyllabic words and phrases generally have greater recognition accuracy because there is greater differentiation among valid utterances. Individual word choice is still important in longer phrases because of the VoiceXML browser's ability to match a menu choice based on a user utterance of one or more significant words. | Shorter words and phrases are more likely to be misrecognized; when a grammar permits many short user utterances, it is important to minimize acoustic confusability by making them as acoustically distinct as possible. Monosyllabic words and short words with unstressed vowels are especially prone to be recognized as each other, even though they may look and sound different to a human ear. |
| Dialogs may be slower. | Dialogs progress faster: choices are read faster, and user responses tend to be shorter. |
| Users may have difficulty remembering long phrases. | Easier for users to remember. |
| For applications with hotword (recognition) barge-in detection, longer words and phrases may induce stuttering and Lombard effects. See Choosing the barge-in style. | |

Vocabulary robustness and grammar complexity:

A related issue is how robust and complex your grammars should be, as illustrated in Table 2.

Table 2. Vocabulary robustness and grammar complexity trade-offs

| Robust grammar | Simple grammar |
| --- | --- |
| Inclusion of synonyms and alternative phrases gives users greater freedom of word choice; however, users may incorrectly assume that they can say virtually anything, leading to a large number of out-of-grammar errors. | Narrow list of valid utterances places more constraints on user input. |
| Grammar files are larger and load more slowly. | Grammar files are smaller and load more quickly. |
| Increased chance of recognition errors. | Simple grammars generally have better recognition accuracy. |

Number of active grammars:

Finally, you will want to consider when each grammar should be active, as presented in Table 3.

Table 3. Number of active grammars trade-offs

| More active grammars | Fewer active grammars |
| --- | --- |
| May improve usability, such as by allowing anytime access to items on main menu. | |
| Increased chance of recognition conflicts. | Less chance of misrecognitions due to recognition conflicts. |
| Performance can degrade. | Better performance. |

You can limit the active grammars to just the ones specified by the current field by setting the <field> element's modal attribute to true.
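
As a minimal sketch of this, the following VoiceXML 2.0 document defines a document-level link whose grammar is normally active everywhere, plus a field that sets modal to true so that only its own grammar is active while it listens. The form ids, the "main menu" phrase, and the prompts are hypothetical and chosen only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">

  <!-- Document-level link: its grammar is normally active in every
       dialog of this document, giving anytime access to the main menu. -->
  <link next="#main_menu">
    <grammar version="1.0" root="menu_cmd" mode="voice">
      <rule id="menu_cmd">main menu</rule>
    </grammar>
  </link>

  <form id="main_menu">
    <block>
      <prompt>Main menu.</prompt>
      <goto next="#transfer"/>
    </block>
  </form>

  <form id="transfer">
    <!-- modal="true": while this field is listening, only the field's
         own grammar (here the built-in boolean grammar) is active, so
         the link grammar above is temporarily disabled. -->
    <field name="approve" type="boolean" modal="true">
      <prompt>Do you approve this transaction?</prompt>
      <filled>
        <if cond="approve">
          <prompt>Approved.</prompt>
        <else/>
          <prompt>Cancelled.</prompt>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

The trade-off is the one shown in Table 3: the modal field is less exposed to recognition conflicts, but the user cannot jump to the main menu while it is active.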

Improving recognition accuracy

In general, you can improve recognition accuracy by:

Using Boolean and yes/no grammars

General strategy:

For the first presentation of a prompt with an expected answer of Yes or No, we generally recommend using the built-in boolean grammar. This grammar provides more flexibility in accepting user input than does a simple Yes/No grammar. For example:

System: Do you want more information? (boolean grammar active)
User: Okay
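
A minimal sketch of such a field in VoiceXML 2.0 might look like the following. The form ids and prompts are hypothetical. The built-in boolean type fills the field with an ECMAScript true or false; which affirmative and negative phrases (such as "okay") it accepts depends on the platform's built-in grammar.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="ask">
    <!-- type="boolean" activates the platform's built-in boolean grammar. -->
    <field name="more_info" type="boolean">
      <prompt>Do you want more information?</prompt>
      <filled>
        <!-- more_info is true for an affirmative answer, false otherwise. -->
        <if cond="more_info">
          <goto next="#details"/>
        <else/>
          <goto next="#goodbye"/>
        </if>
      </filled>
    </field>
  </form>

  <form id="details">
    <block><prompt>Here are the details.</prompt></block>
  </form>

  <form id="goodbye">
    <block><prompt>Goodbye.</prompt></block>
  </form>
</vxml>
```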

Recovering from a noinput event:

If the system returns a noinput event in response to the initial prompt, we recommend that you attempt to recover by switching to a simple Yes/No grammar and a prompt that clearly directs the user to say “Yes” or “No”, as shown here:

System: Do you want more information? (boolean grammar active)
User: (no response)
System: Please say Yes, No, or Repeat. (Yes/No grammar active, Repeat always available)
User: Yes.
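
One way to sketch this recovery in VoiceXML 2.0 is to catch noinput in the boolean field and hand off to a second form that has a narrow inline grammar and a directive prompt. The form ids, field names, and inline grammar below are hypothetical. The sketch also assumes that, with no semantic tags in the grammar, the platform fills the field with the recognized word; that behavior is platform-dependent. For simplicity, "repeat" is placed in the field grammar here rather than in an always-active grammar.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="ask">
    <field name="more_info" type="boolean">
      <prompt>Do you want more information?</prompt>
      <!-- On silence, switch to the directive prompt and simple grammar. -->
      <noinput>
        <goto next="#ask_directed"/>
      </noinput>
      <filled>
        <if cond="more_info">
          <goto next="#details"/>
        <else/>
          <goto next="#goodbye"/>
        </if>
      </filled>
    </field>
  </form>

  <form id="ask_directed">
    <field name="answer">
      <!-- Simple Yes/No grammar, plus "repeat" for this sketch. -->
      <grammar version="1.0" root="yn" mode="voice">
        <rule id="yn">
          <one-of>
            <item>yes</item>
            <item>no</item>
            <item>repeat</item>
          </one-of>
        </rule>
      </grammar>
      <prompt>Please say Yes, No, or Repeat.</prompt>
      <filled>
        <!-- Assumes the field value is the recognized word. -->
        <if cond="answer == 'yes'">
          <goto next="#details"/>
        <elseif cond="answer == 'no'"/>
          <goto next="#goodbye"/>
        <else/>
          <!-- "repeat": ask the original question again. -->
          <goto next="#ask"/>
        </if>
      </filled>
    </field>
  </form>

  <form id="details">
    <block><prompt>Here are the details.</prompt></block>
  </form>

  <form id="goodbye">
    <block><prompt>Goodbye.</prompt></block>
  </form>
</vxml>
```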

Recovering from a nomatch event:

If the system returns a nomatch event based on the user input for the initial prompt, we recommend that you switch to a Yes/No grammar and use a prompt that attempts to confirm whether the user intended to provide a positive or negative response. In their book, How to Build a Speech Recognition Application: A Style Guide for Telephony Dialogues, Bruce Balentine and David P. Morgan recommend the phrase, “Was that a Yes?” for this purpose. For example:

System: Do you want more information? (boolean grammar active)
User: (unintelligible response)
System: Was that a Yes? (Yes/No grammar active)
User: Yes.

This design minimizes disruptions to the dialog flow because the user's response to the subdialog prompt is the same as the intended response to the prompt that generated the out-of-grammar error.
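
The nomatch recovery described above could be sketched in VoiceXML 2.0 as follows. The form ids, field names, and inline Yes/No grammar are hypothetical, and the sketch assumes the platform fills the field with the recognized word when the grammar carries no semantic tags. Because "yes" and "no" to the confirmation prompt mean the same thing as they would to the original question, the confirmation form can branch exactly as the original field does.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="ask">
    <field name="more_info" type="boolean">
      <prompt>Do you want more information?</prompt>
      <!-- The user said something the boolean grammar did not match. -->
      <nomatch>
        <goto next="#was_that_yes"/>
      </nomatch>
      <filled>
        <if cond="more_info">
          <goto next="#details"/>
        <else/>
          <goto next="#goodbye"/>
        </if>
      </filled>
    </field>
  </form>

  <form id="was_that_yes">
    <field name="confirmed">
      <!-- Simple Yes/No grammar for the confirmation prompt. -->
      <grammar version="1.0" root="yn" mode="voice">
        <rule id="yn">
          <one-of>
            <item>yes</item>
            <item>no</item>
          </one-of>
        </rule>
      </grammar>
      <prompt>Was that a Yes?</prompt>
      <filled>
        <!-- Same branching as the original question. -->
        <if cond="confirmed == 'yes'">
          <goto next="#details"/>
        <else/>
          <goto next="#goodbye"/>
        </if>
      </filled>
    </field>
  </form>

  <form id="details">
    <block><prompt>Here are the details.</prompt></block>
  </form>

  <form id="goodbye">
    <block><prompt>Goodbye.</prompt></block>
  </form>
</vxml>
```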

When initial accuracy is paramount:

If it is more important to get extremely high accuracy on the first presentation of the Yes/No question than it is to accept a broader range of user responses, you could write the initial prompt so that it explicitly directs the user to say Yes or No, and use a Yes/No grammar. For example:

System: Please say Yes or No. Do you approve this transaction? (Yes/No grammar active)
User: Yes.

Using the built-in phone grammar

Some people tend to pause at the logical grouping points when speaking telephone numbers, especially if they are having difficulty remembering the number. For example, when speaking a USA telephone number, a user might pause after the 3-digit area code, and again after the 3-digit exchange. If the pauses are long enough, they might inadvertently trigger endpoint detection before the user has finished speaking the 10-digit telephone number.

If your application requires users to enter telephone numbers, you will want to take special care to thoroughly test your telephone number collection dialogs; if users experience difficulties, you may want to employ some of the following techniques to ensure that the data is being captured correctly:

Note: These guidelines apply to any long alphanumeric string with a predictable format.
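
One adjustment that is often tried for the pausing problem described above is to lengthen the VoiceXML 2.0 incompletetimeout property for the field, so that a longer silence is tolerated while the utterance is still an incomplete match. The sketch below is illustrative only: the timeout values are assumptions to be tuned through testing, and whether a partially spoken number counts as an incomplete match depends on the platform's built-in phone grammar.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="get_phone">
    <field name="phone" type="phone">
      <!-- Assumed values: tolerate a longer pause after a partial number,
           but finalize promptly once a complete number has been matched. -->
      <property name="incompletetimeout" value="2s"/>
      <property name="completetimeout" value="1s"/>
      <prompt>Please say your telephone number, including the area code.</prompt>
      <filled>
        <!-- The built-in phone type fills the field with a digit string. -->
        <prompt>I heard <value expr="phone"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```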

Testing grammars

When testing your grammars, you should test words and phrases that are out of your grammars as well as words and phrases that are in. (The purpose of testing “out-of-grammar” words and phrases is to ensure that the speech recognition engine is rejecting these utterances; erroneously accepting these utterances could cause unintended dialog transitions to occur.)

If your grammar tools allow you to list (enumerate) commands that the grammar can recognize, you can examine these lists of commands for phrases that you do not want to include in the grammar. For example, if a list of commands contains the sentence "Play the next next message," you can modify the grammar to prevent inappropriate duplication of words.
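
To illustrate the kind of fix involved, here is a hypothetical SRGS XML grammar for the "play ... message" command. The rule names and phrases are invented for this sketch. An earlier version that wrapped the direction word in an item with repeat="1-" would enumerate "play the next next message"; requiring exactly one choice removes the duplication.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" mode="voice" root="play">
  <rule id="play" scope="public">
    play the
    <!-- Flawed version: <item repeat="1-">next</item> allowed the word
         to repeat without limit. Exactly one alternative is required here. -->
    <one-of>
      <item>next</item>
      <item>previous</item>
    </one-of>
    message
  </rule>
</grammar>
```

After a change like this, enumerating the grammar again should list only "play the next message" and "play the previous message".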

If your application has more than one grammar active concurrently, you should test each grammar separately, and then test them together.

To identify words that are consistently misrecognized, you should test your grammar with a group of test subjects that is representative of the demographics and environments of your users. For example, you might want to vary the ambient noise level, gender, age, accent, and level of fluency during desktop testing. When you are ready to deploy your application, you may want to perform additional testing while varying the type of telephone (standard, cordless, cellular, speakerphone, and so on).

If you discover words or phrases that are consistently problematic, you might need to rephrase some entries or add multiple pronunciations.

Remember that testing your grammar is an iterative process. As you make changes, you should go back and retest to verify that all of the valid words and phrases can still be recognized.