Choosing the barge-in style

Enabling barge-in allows the computer and the user to speak at the same time, permitting the user's speech to interrupt system prompts as the machine plays them.

On the surface, it might seem that enabling barge-in is always preferable to disabling it. It is easy to imagine experienced users wanting to interrupt prompts (especially lengthy ones) when they know what to say. There are situations, however, in which a system with barge-in disabled is as easy to use, or easier.

Table 1 compares implementations with barge-in enabled or disabled:

Table 1. Barge-in Enabled versus Disabled
Style: Barge-in enabled
  Advantages:
  • Experienced users can interrupt system prompts to speed up the interaction.
  • Users can say “Quiet” to stop the prompt. Note: For commands in languages other than US English, see the appropriate appendixes.
  Disadvantages:
  • Inexperienced users may inadvertently interrupt the prompt before hearing enough to form an acceptable response. You can minimize this problem by keeping system prompts short, to lessen the user's need to barge in; if your prompts are long, you should try to present key information early in the prompt.
  • When using hotword barge-in (see Table 2), Lombard speech and the stuttering effect can be problematic. To minimize this problem, you should keep required user inputs very short. See Controlling Lombard speech and the stuttering effect for more information.

Style: Barge-in disabled
  Advantages:
  • Guarantees that the entire prompt text plays. This may be especially useful for applications with lots of legal notices, advertisements, or other information that you want to make sure always gets presented to the user.
  • Creates a “my turn-your turn” rhythm for the dialog.
  Disadvantages:
  • Experienced users cannot interrupt prompts; however, if the prompts are short enough, users should not need to interrupt.
  • Users may experience turn-taking errors. Keeping prompts short helps minimize this.

If you enable barge-in, play an initial prompt of 3 seconds or longer with barge-in disabled to give the system time to calibrate echo cancellation.
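
The following VoiceXML sketch illustrates one way to arrange this. It assumes a VoiceXML 2.0 interpreter; the form names, the prompt wording, and the grammar file menu.grxml are hypothetical. You would also need to confirm that the introductory prompt really plays for at least 3 seconds on your system.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <!-- Introductory prompt: barge-in is disabled so that the whole prompt
       plays, giving echo cancellation time to calibrate. -->
  <form id="welcome">
    <block>
      <prompt bargein="false">
        Welcome to the example banking service.
        You can interrupt me at any time after this message.
      </prompt>
      <goto next="#main_menu"/>
    </block>
  </form>

  <!-- Subsequent prompts: barge-in is enabled. -->
  <form id="main_menu">
    <field name="choice">
      <prompt bargein="true">Say balance, transfer, or agent.</prompt>
      <!-- Hypothetical grammar file -->
      <grammar src="menu.grxml" type="application/srgs+xml"/>
    </field>
  </form>

</vxml>
```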

Comparing barge-in detection methods

To use barge-in effectively, it is important to understand how the system determines when to stop an interrupted prompt. For WebSphere Voice Server, the default barge-in detection method is speech.

Table 2 compares the available barge-in detection methods.

Table 2. Barge-in detection methods
Barge-in detection method: hotword
  Description: Audio output stops as soon as the speech recognition engine returns a match for a word, phrase, DTMF key, or key sequence found in a currently active grammar. For voice input, hotword barge-in is available for all call types. DTMF hotword barge-in is supported only for VoIP/SIP calls. To use the DTMF hotword barge-in method:
    1. Blueworx Voice Response must be configured to use DTNA and VoIP/SIP.
    2. External DTMF detection must be enabled. See Remote DTMF grammars for details of how to configure Blueworx Voice Response to do this.
  Advantages:
  • Resistant to accidental interruptions, such as those caused by coughing, muttering, or using the system in an environment with loud ambient conversation.
  Disadvantages:
  • Increased incidence of Lombard speech and the “stuttering effect” (see next section); however, you can control this somewhat by making required user responses as short as possible.
  • The time required to recognize spoken input can cause slower system response times.

Barge-in detection method: speech
  Description: Audio output stops as soon as the speech recognition engine detects sound. This behavior is more typical of conversation between two humans.
  Advantages:
  • Minimizes Lombard speech, the stuttering effect, and the distortion to the first syllable of user speech that often occurs when users barge in.
  Disadvantages:
  • Susceptible to accidental interruption due to background noise, non-speech vocalizations, and speech not intended for the system.

Barge-in detection method: dtmf_only
  Description: Audio output stops only when the user has pressed a DTMF key. Blueworx Voice Response supports this ‘proprietary’ barge-in detection method only when the default Blueworx Voice Response DTMF detection method is used. It is not supported when Blueworx Voice Response is configured to use external DTMF detection on a remote speech server.
  Advantages: Can be used where speech recognition is an option but noisy environmental factors may cause audio interruptions. This has two advantages:
  • It ensures audio playback whilst offering DTMF barge-in during the audio playback.
  • It supports both DTMF and speech recognition after the prompt and before the speech timeout period.
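
As an illustration, the sketch below selects a detection method using the standard VoiceXML 2.0 bargeintype property and the bargeintype attribute of the prompt element, which accept the values speech and hotword. The form name, prompt wording, and grammar file menu.grxml are hypothetical. The proprietary dtmf_only method is not shown, because how it is selected is not described here; check the product documentation before relying on it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <!-- Document-wide default: prompts stop only when the input
       matches something in an active grammar. -->
  <property name="bargein" value="true"/>
  <property name="bargeintype" value="hotword"/>

  <form id="menu">
    <field name="choice">
      <!-- Per-prompt override: this prompt stops as soon as
           the recognizer detects sound. -->
      <prompt bargeintype="speech">Say balance, transfer, or agent.</prompt>
      <!-- Hypothetical grammar file -->
      <grammar src="menu.grxml" type="application/srgs+xml"/>
    </field>
  </form>

</vxml>
```

Setting the property at document level and overriding it on individual prompts keeps the default in one place while still letting you treat an unusually noisy or unusually long prompt differently.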
 

Controlling Lombard speech and the stuttering effect

When speaking in noisy environments, people tend to exaggerate their speech or raise their voices so that others can hear them over the noise. This distorted speech pattern is known as Lombard speech (named for the researcher E. Lombard, who in 1911 was the first to report such an effect). It can occur even when the only noise is the voice of another participant in the conversation: for example, when one person tries to interrupt another or, in the case of a voice application, when the user tries to barge in while the computer is speaking.

The “stuttering effect” may occur when a prompt keeps playing for more than about 300 ms after the user begins speaking. Unless users have undergone training with the system, they may interpret the continued playing of the prompt as evidence that the system did not hear them. In response, some users may stop what they were saying and begin speaking again – causing a stuttering effect. This stuttering makes it virtually impossible for the system to match the utterance to anything in an active grammar, so the system generally treats the input as an “out-of-grammar” utterance, even if what the user intended to say was actually in one of the active grammars.

To control Lombard speech and the stuttering effect when using hotword barge-in detection, the prompt should stop within about 300 ms after the user begins talking. Because the average time required to produce a syllable of speech is about 150-200 ms, the system design should promote short user responses (ideally no more than two or three syllables) when using hotword barge-in detection. You should also try to keep prompts as short as possible to minimize the likelihood that users will want to interrupt the prompt. If this is not possible, consider switching to speech barge-in detection or, in extreme cases, disabling barge-in.
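
A hedged sketch of a field designed for hotword barge-in follows: the prompt is short, and every expected response is two syllables long. The field name, prompt wording, and grammar contents are hypothetical, and the inline grammar follows the SRGS XML form used in VoiceXML 2.0.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="account_type">
    <field name="account">
      <!-- Hotword detection applies to this field only. -->
      <property name="bargeintype" value="hotword"/>

      <!-- Short prompt: less reason for the user to interrupt. -->
      <prompt>Checking or savings?</prompt>

      <!-- Short responses: each choice is two syllables, so the
           recognizer can return a match soon after the user starts talking. -->
      <grammar mode="voice" xml:lang="en-US" version="1.0" root="account">
        <rule id="account">
          <one-of>
            <item>checking</item>
            <item>savings</item>
          </one-of>
        </rule>
      </grammar>
    </field>
  </form>
</vxml>
```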

Weighing user and environmental characteristics

When deciding whether to use barge-in and which type of barge-in detection is most appropriate, you should consider how frequently users will use your application (expert users are more likely to barge in), and in what environment (quality of the telephone connection, general noise level, etc.).

In general, you should enable barge-in for deployed applications. However, if echo cancellation on your telephony equipment is not good enough, it might be necessary to disable barge-in.

Minimizing the need to barge in

Even when the system permits barge-in, many users do not like to interrupt the system. To minimize the user's need to barge in, you might consider placing short pauses (around 0.75 second) at logical points during and between prompts, such as at the end of a sentence or after each menu item. These brief pauses will give users the opportunity to begin talking without actively interrupting the system. In systems for which barge-in has been disabled, you can simulate barge-in by enabling recognition during these pauses. Be sure not to produce a turn-taking tone at the end of these “recognition windows” because speech at these times is optional, not required.
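
For example, the sketch below inserts 0.75-second pauses after each menu item using the SSML break element, which VoiceXML 2.0 allows inside a prompt. The menu wording and the grammar file menu.grxml are hypothetical, and the sketch assumes barge-in is enabled so that recognition is active during the pauses.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="menu">
    <field name="choice">
      <prompt bargein="true">
        For your balance, say balance.
        <break time="750ms"/>
        To move money, say transfer.
        <break time="750ms"/>
        To speak to someone, say agent.
      </prompt>
      <!-- Hypothetical grammar file -->
      <grammar src="menu.grxml" type="application/srgs+xml"/>
    </field>
  </form>
</vxml>
```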

Using audio formatting

If you need to temporarily disable barge-in (using <prompt bargein="false">), such as while the system reads legal notices or advertisements, you may want to use a unique background sound, tone, or prompt as an indicator. For guidance, see Applying audio formatting.
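
One possible arrangement is sketched below: a short audio marker introduces a notice that cannot be interrupted, and barge-in is available again on the next prompt. The audio file notice_marker.wav, the notice wording, and the grammar file menu.grxml are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="notice_then_menu">
    <block>
      <!-- Barge-in is disabled for the notice only. The marker sound
           signals that this part of the dialog cannot be interrupted. -->
      <prompt bargein="false">
        <audio src="notice_marker.wav"/>
        This call may be recorded for quality and training purposes.
      </prompt>
    </block>
    <field name="choice">
      <!-- Barge-in is enabled again for the ordinary prompt. -->
      <prompt bargein="true">Say balance, transfer, or agent.</prompt>
      <grammar src="menu.grxml" type="application/srgs+xml"/>
    </field>
  </form>
</vxml>
```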

If you disable barge-in, consider playing a tone to signal the user when it is time to speak. The introductory message should explicitly tell users to speak only after this “turn-taking” tone.

Note: The use of tones to signal user input is somewhat controversial, with some designers avoiding tones based on a belief that tones are unnatural in speech and annoying to users. Others contend that effective computer speech interfaces need not perfectly mimic human conversation, and that a well-designed tone can promote clear and efficient turn-taking without annoyance. For guidance in creating an effective turn-taking tone, see Designing audio tones.
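
If you do decide to use a turn-taking tone, a minimal sketch might look like the following. The introductory message tells users to wait for the tone, and the tone plays at the end of each prompt that requires a response. The audio file turn_tone.wav, the prompt wording, and the grammar file cities.grxml are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Barge-in is disabled for the whole document. -->
  <property name="bargein" value="false"/>

  <form id="get_city">
    <block>
      <prompt>After each question, please wait for the tone before you speak.</prompt>
    </block>
    <field name="city">
      <prompt>
        Which city are you flying to?
        <!-- Turn-taking tone: signals that it is the user's turn. -->
        <audio src="turn_tone.wav"/>
      </prompt>
      <grammar src="cities.grxml" type="application/srgs+xml"/>
    </field>
  </form>
</vxml>
```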

Wording prompts

For systems without barge-in, make prompts as concise as possible. If a prompt must be relatively long, place the key information toward the end of the prompt to discourage users from speaking before their turn.

You can do the same for systems with barge-in, assuming your prompts are relatively short; if the prompts are long, you may decide to move the key information to the beginning of the prompt so users know what input to provide if they interrupt the prompt.
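
The two placements can be contrasted in a short sketch. The prompt wording and the grammar file account.grxml are hypothetical; only the position of the key information (the valid responses) differs between the two forms.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <!-- Barge-in disabled: the valid responses come last, so users
       are not tempted to speak before the prompt ends. -->
  <form id="no_bargein">
    <field name="account">
      <prompt bargein="false">
        To hear recent activity, choose an account. Say checking or savings.
      </prompt>
      <grammar src="account.grxml" type="application/srgs+xml"/>
    </field>
  </form>

  <!-- Barge-in enabled, long prompt: the valid responses come first,
       so users who interrupt already know what to say. -->
  <form id="with_bargein">
    <field name="account">
      <prompt bargein="true">
        Say checking or savings. Checking shows your last ten transactions,
        and savings shows your current interest rate and recent deposits.
      </prompt>
      <grammar src="account.grxml" type="application/srgs+xml"/>
    </field>
  </form>

</vxml>
```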