Speech Recognition

Speech recognition enables people to use their voices, instead of the telephone keypad, to drive voice applications, and it offers a number of advantages over DTMF key input.

The simplest speech recognition applications enable callers who do not have push-button (DTMF) telephones to use applications that would otherwise require key input. The technology can also be used to provide more sophisticated applications that use large-vocabulary technologies. For telephone applications, speech recognition must be speaker-independent. Voice applications can be designed with or without barge-in (also known as cut-through) capability, which allows callers to interrupt a prompt by speaking over it.
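
In VoiceXML, for example, barge-in can be controlled on an individual prompt or set globally with a property. The following fragment is a minimal sketch that assumes a standard VoiceXML 2.0 interpreter; the element and attribute names come from the VoiceXML 2.0 specification rather than from any particular product.

  <?xml version="1.0" encoding="UTF-8"?>
  <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
    <!-- Disable barge-in for all prompts in this document... -->
    <property name="bargein" value="false"/>
    <form id="welcome">
      <block>
        <!-- ...but allow this particular prompt to be interrupted -->
        <prompt bargein="true">
          Welcome. You can speak at any time during this announcement.
        </prompt>
      </block>
    </form>
  </vxml>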

Speech recognition is particularly useful if a large number of your callers do not have DTMF telephones, if the application does not require extensive data input, or if you want to offer callers the choice of using speech or pressing keys.
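
To offer callers that choice in a VoiceXML application, a single field can reference both a voice grammar and a DTMF grammar, so the caller can either speak or press a key. The fragment below is an illustrative sketch using standard VoiceXML 2.0 and SRGS markup; the field name and phrases are invented for this example and are not taken from any product.

  <field name="transfer">
    <prompt>
      Do you want to speak to an agent?
      Say yes or no, or press 1 for yes or 2 for no.
    </prompt>
    <!-- Spoken input -->
    <grammar mode="voice" version="1.0" root="yesno"
             xmlns="http://www.w3.org/2001/06/grammar">
      <rule id="yesno">
        <one-of>
          <item>yes</item>
          <item>no</item>
        </one-of>
      </rule>
    </grammar>
    <!-- Keypad input: 1 or 2 -->
    <grammar mode="dtmf" version="1.0" root="keys"
             xmlns="http://www.w3.org/2001/06/grammar">
      <rule id="keys">
        <one-of>
          <item>1</item>
          <item>2</item>
        </one-of>
      </rule>
    </grammar>
  </field>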

Figure 1 shows a typical speech recognition setup. To handle the processing that speech recognition demands, the example uses multiple LAN-connected server machines.

Figure 1. A speech recognition environment
The figure shows a caller using the LAN-based speech recognition environment.
When speech recognition technology is used in a telephony environment, you need to consider the characteristics described in the following sections.

Speech recognition using an MRCP Version 1-compliant speech recognition product

Speech recognition by MRCP Version 1-compliant speech recognition products is supported in both the VoiceXML and Java programming environments. The runtime components of WebSphere Voice Server, or of another MRCP Version 1-compliant speech recognition product such as Nuance Speech Server Version 5.1.2 (Recognizer 9.0.13) or Loquendo Speech Server V7, work with the speech recognition engine to perform the recognition. Recognition engines can be distributed across multiple systems, so that resources can be shared and redundancy is provided.

To allow highly accurate recognition of continuous speech in this environment, the speech must follow a format that is defined by the application developers. These formats (or grammars) can allow many different ways of speaking a request, and many tens of thousands of words can be recognized. Although WebSphere Voice Server provides the means for you to support large-vocabulary speech recognition, the system does not support dictation.

The WebSphere Voice Toolkit enables an application developer to create the grammars used for recognition. Grammars can be created in multiple national languages.
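
As an illustration, the following Speech Recognition Grammar Specification (SRGS) XML fragment is a minimal sketch of a grammar that accepts several phrasings of the same request; the xml:lang attribute identifies the national language. The rule name and phrases are invented for this example.

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar version="1.0" xml:lang="en-US" root="balance"
           xmlns="http://www.w3.org/2001/06/grammar">
    <!-- Accept several ways of asking for an account balance -->
    <rule id="balance" scope="public">
      <item repeat="0-1">
        <one-of>
          <item>what is</item>
          <item>tell me</item>
          <item>I would like</item>
        </one-of>
      </item>
      <one-of>
        <item>my balance</item>
        <item>my account balance</item>
        <item>the balance on my account</item>
      </one-of>
    </rule>
  </grammar>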

For detailed information about the speech recognition functions provided in WebSphere Voice Server, refer to the WebSphere Voice Server infocenter at:
http://publib.boulder.ibm.com/infocenter/pvcvoice/51x/index.jsp

Planning a speech recognition system

In planning a voice system that uses MRCP Version 1-compliant speech technologies, you must determine:
  • The number of gateway systems you require to connect to the speech server
  • The number of automatic speech recognition (ASR) engines, and how many will be active at any one time
  • The number of text-to-speech (TTS) engines, and how many will be active at any one time
  • The number of machines you require, as a function of the number and speed of the processors available in each machine
  • The type of local area network necessary
Note: This documentation cannot give definitive information about exactly what size or how many machines you will need for your system. Only approximate guidelines can be provided, and it is essential that any implementation is tested with realistic call volumes before it is put into production. For guidance on capacity planning for your specific configuration, contact the vendor of your MRCP Version 1-compliant speech product.

Application load

To estimate the application load on the system, you need to know how often, and for how long, speech recognition and text-to-speech are active during a typical call; the following topic describes when engines are allocated and freed.

Allocating speech recognition and TTS engines

The following figures illustrate the sequence of events in different types of voice application using speech recognition and text-to-speech:

With a barge-in application, speech recognition engines and text-to-speech engines are allocated for the duration of a call.
Figure 2. Sequence of events in a barge-in application using speech recognition and text-to-speech
Image describing the sequence of events in a barge-in application using speech recognition and text-to-speech. Speech recognition engines and text-to-speech engines are allocated for the duration of a call.
The allocation duty cycle is much greater than the active duty cycle, and this also allows some adaptation to the caller’s voice over the duration of the call. If a system is under-specified, an engine might not be available at the start of a call.

With a non-barge-in application, text-to-speech engines are allocated for the duration of a call, and speech recognition engines are allocated from slightly before the first period of speech recognition until the end of the call.

Figure 3. Sequence of events in a non-barge-in application using speech recognition and text-to-speech
Image describing the sequence of events in a non-barge-in application using speech recognition and text-to-speech. Text-to-speech engines are allocated for the duration of a call, and speech recognition engines are allocated from slightly before the first period of speech recognition until the end of the call.

Depending on the application load on the system and the number of engines in your speech recognition system, you may choose to optimize the use of speech recognition and TTS engines by allocating engines dynamically rather than for the duration of a call. With dynamic engine allocation, speech recognition and TTS engines are allocated slightly before each period of speech recognition or text-to-speech begins, and the appropriate engines are freed slightly after each period ends. The disadvantage of this approach is that an engine might not be available when a voice application requires it. Also, the vendor of your speech product might not support dynamic engine allocation.
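
As a rough illustration only (the figures are invented for this example): if speech recognition is active for a total of about 30 seconds of a typical three-minute call, an engine allocated for the duration of the call is held roughly six times longer than it is actually used, so 100 simultaneous calls hold 100 recognition engines. With dynamic allocation, the same traffic keeps only about 15 to 20 engines busy at any moment, at the cost that a caller might occasionally find no engine free. Estimates of this kind must always be validated with your speech vendor and with load testing.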

Figure 4. Sequence of events in speech recognition and text-to-speech using dynamic engine allocation
Image describing the sequence of events in speech recognition and text-to-speech using dynamic engine allocation. Speech recognition and TTS engines are allocated slightly before each period of speech recognition or text-to-speech begins, and the appropriate engines are freed slightly after each period ends.

For information on configuring speech recognition or text-to-speech to use dynamic engine allocation, see InitTechnologyString.

To optimize the use of speech recognition or TTS engines, you can also close a speech recognition or text-to-speech session directly from a VoiceXML document by using the <object> element.

See Closing a speech recognition or TTS session from VoiceXML
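
The fragment below is only a sketch of how such an <object> call might appear in a VoiceXML document. The structure (<object> with child <param> elements) is standard VoiceXML 2.0, but the classid value and parameter name shown here are placeholders, not actual values; use the values given in the topic referenced above and in your product documentation.

  <form id="endRecognition">
    <block>
      <!-- Placeholder classid: substitute the value documented for your platform -->
      <object name="closeReco" classid="PLATFORM_SPECIFIC_CLASSID">
        <!-- Hypothetical parameter indicating which session type to close -->
        <param name="session" value="asr"/>
      </object>
    </block>
  </form>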