Glossary

This glossary defines terms and abbreviations used in this publication. If you do not find the term you are looking for here, refer to The IBM Dictionary of Computing, SC20-1699, New York: McGraw-Hill, copyright 1994 by International Business Machines Corporation. Copies may be purchased from McGraw-Hill or in bookstores.

A

active grammar: A speech grammar that the speech recognition engine is currently listening for. One or more grammars can be active at any time, and the content of the active grammar(s) defines the user utterances that are valid in a given context.
A-law: The compression and expansion algorithm used primarily in Europe when converting from analog to digital speech data.
ANI: Automatic Number Identification. A service offered by commercial telephone networks, which provides the directory billing number associated with a calling party. This is the originating telephone number of the incoming call, which can be used for call set-up or passed by the switch to the Voice Server, which can then use it to retrieve data from business databases. Often used as a synonym for calling number.
application: A set of related VoiceXML documents that share the same application root document.
application root document: A document that is loaded when any documents in its application are loaded, and unloaded whenever the dialog transitions to a document in a different application. The application root document may contain grammars that can be active for the duration of the application, and variables that can be accessed by any document in the application.
ASP: Microsoft Active Server Pages. One of many server-side mechanisms for generating dynamic Web content by transmitting data between an HTTP server and an external program. ASPs can be written in various scripting languages, including VBScript (based on Microsoft Visual Basic), JScript (based on JavaScript), and PerlScript (based on Perl).

B

bail out: The termination of a sequence of self-revealing help prompts, if the user repeatedly fails to provide an appropriate response. This is generally a transfer to a human operator (if available), or an exit.
barge-in: A feature of full-duplex environments that allows the user to interrupt computer speech output (audio file and text-to-speech). See also “full-duplex.”

C

CGI: Common Gateway Interface. One of many server-side mechanisms for generating dynamic Web content by transmitting data between an HTTP server and an external program. CGI scripts are typically written in Perl, although they can be written in other programming languages.
called number: The number dialed by callers to reach the voice application, or the number dialed when making a call. Often used as a synonym for DNIS.
calling number: The number from which a call is made. Often used as a synonym for ANI.
continuous speech recognition: The WebSphere Voice Server supports “continuous speech recognition,” in which users can speak a string of words at a natural pace, without the need to pause after each word. Contrast with “discrete speech recognition.”
cookie: Information that a Web server stores on a user's computer when the user browses a particular Web site. This information helps the Web server track such things as user preferences and data that the user may submit while browsing the site. For example, a cookie may include information about the purchases that the user makes (if the Web site is a shopping site). The use of cookies enables a Web site to become more interactive with its users, especially on future visits.
cut-thru word recognition: See “barge-in.”

D

data prompts: Prompts where the user must supply information to fill in the field of a form. Contrast with “verbatim prompts.”
dialog: The main building block for interaction between the user and the application. VoiceXML supports two types of dialogs: “form” and “menu.”
discrete speech recognition: Users must pause briefly after speaking each word, to allow the system to process and recognize the input. Contrast with “continuous speech recognition.”
DNIS: Dialed Number Identification Service. A service supplied by the public telephone network to identify the number actually dialed. For example, calls placed to two or more 1-800 numbers will arrive at the same call center switch. Upon arrival, DNIS tells the switch which one of the 1-800 numbers was actually dialed. DNIS can be used by the Voice Server to automatically select between several voice applications. Often used as a synonym for “called number.”
DTMF: Dual-Tone Multi-Frequency. The tones generated by pressing keys on a telephone's keypad.
DTMF Simulator: A GUI tool that enables you to simulate DTMF input when testing your VoiceXML application on your desktop workstation. The VoiceXML browser communicates with the DTMF Simulator to accept DTMF input, and uses that input to fill in forms or select menu items within the VoiceXML application.
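As an illustration of the DTMF entry above, the following minimal sketch maps each keypad key to the pair of tones that represents it. The frequency values are the standard telephone keypad assignments and are not taken from this publication; the function name dtmf_frequencies is hypothetical.

```python
# Standard DTMF keypad layout: each key is signalled by one low-group (row)
# tone plus one high-group (column) tone, both in Hz.
ROW_HZ = (697, 770, 852, 941)
COL_HZ = (1209, 1336, 1477, 1633)
KEYPAD = (
    "123A",
    "456B",
    "789C",
    "*0#D",
)


def dtmf_frequencies(key):
    """Return the (low, high) tone pair in Hz for a single keypad key."""
    if len(key) != 1:
        raise ValueError(f"expected exactly one key, got {key!r}")
    for row, keys in enumerate(KEYPAD):
        col = keys.find(key.upper())
        if col != -1:
            return ROW_HZ[row], COL_HZ[col]
    raise ValueError(f"not a DTMF key: {key!r}")
```

A tool such as the DTMF Simulator only needs to pass the key itself to the browser; the tone pair matters when the input arrives as audio over a telephone line.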

E

echo cancellation: Technology that removes echo sounds from the input data stream before passing what's left (that is, user speech) to the speech recognition engine. In a connection environment, this is configured on the telephony hardware; if echo cancellation is poor, you may need to turn off barge-in or switch to “half-duplex.”
ECMAScript: An object-oriented programming language adopted by the European Computer Manufacturers Association as a standard for performing computations in Web applications. ECMAScript is the official client-side scripting language of VoiceXML. Refer to the ECMAScript Language Specification, available at http://www.ecma.ch/ecma1/stand/ECMA-262.htm.
event: The VoiceXML browser throws an event when it encounters a <throw> element or when certain specified conditions occur. Events are caught by <catch> elements that can be specified within other VoiceXML elements in which the event can occur, or inherited from higher-level elements. The VoiceXML browser supports a number of predefined events and default event handlers; you can also define your own events and event handlers.
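The way a thrown event reaches a handler that is "inherited from higher-level elements" can be pictured as a walk up a chain of enclosing scopes. The sketch below is a simplified model, not the browser's implementation: the Scope class and its methods are hypothetical, and it matches event names exactly, without modelling any other matching rules the VoiceXML specification defines.

```python
class Scope:
    """Hypothetical stand-in for a VoiceXML element that can hold <catch> handlers."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # enclosing element, or None at the top level
        self.handlers = {}        # event name -> handler function

    def catch(self, event, handler):
        """Register a handler for an event at this level."""
        self.handlers[event] = handler

    def throw(self, event):
        """Run the nearest enclosing handler, else fall back to a default."""
        scope = self
        while scope is not None:
            if event in scope.handlers:
                return scope.handlers[event](event)
            scope = scope.parent
        return f"default handler for {event}"


# A field nested in a form nested in a document.
document = Scope("document")
form = Scope("form", parent=document)
field = Scope("field", parent=form)

document.catch("help", lambda event: "document-level help")
field.catch("nomatch", lambda event: "field-level nomatch")
```

Throwing "help" from the field finds no handler on the field or the form, so the document-level handler runs; throwing an event that nobody catches falls through to the default handler.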

F

form: One of two basic types of VoiceXML dialogs. Forms allow the user to provide voice or DTMF input by responding to one or more <field> elements. See also “menu.”
full-duplex: Applications in which the user and computer can speak concurrently. Full-duplex applications use echo cancellation to subtract computer output from the incoming data to determine what was user speech. See also “barge-in.” Contrast with “half-duplex.”

G

grammar: A collection of rules that define the set of all user utterances that can be recognized by the speech recognition engine at a given point in time. The VoiceXML browser makes different grammars active at different points in the dialog, thereby controlling the set of valid utterances that the speech recognition engine is listening for. Grammars support word substitution and word repetition.
GSM: Global System for Mobile Communications. A digital cellular telephone network standard.
GUI: Graphical User Interface. A type of computer interface consisting of visual images and printed text. Users can access and manipulate information using a pointing device and keyboard. Contrast with “speech user interface.”
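The relationship between a grammar, the set of active grammars, and an out-of-grammar utterance can be sketched at the text level. This is an illustration under stated assumptions only: a real speech recognition engine decodes audio rather than comparing strings, the rule format and the names expand, recognize, and BALANCE_RULE are hypothetical, and word repetition is not modelled.

```python
from itertools import product


def expand(rule):
    """Expand a rule into every phrase it accepts.

    A rule is a sequence of slots; each slot is a tuple of interchangeable
    words (word substitution). An empty string marks an optional word.
    """
    return {" ".join(word for word in words if word) for words in product(*rule)}


def recognize(utterance, active_grammars):
    """Return the name of the active grammar that accepts the utterance, or None."""
    phrase = " ".join(utterance.lower().split())
    for name, phrases in active_grammars.items():
        if phrase in phrases:
            return name
    return None  # out-of-grammar utterance


# Hypothetical rule: optional "please", then one word from each remaining slot.
BALANCE_RULE = [("", "please"), ("check", "show"), ("my",), ("balance", "account")]

active = {
    "balance": expand(BALANCE_RULE),
    "yesno": {"yes", "no"},
}
```

Changing which entries are in the active dictionary models the browser activating different grammars at different points in the dialog: the same utterance can be valid in one context and out-of-grammar in another.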

H

H.323: Audio communications protocol used by WebSphere Voice Server.
half-duplex: Applications in which the user should not speak while the computer is speaking because the speech recognition engine does not receive audio when sending audio to the user. Turn-taking problems can occur when the user speaks before the computer has finished speaking; using a unique tone to indicate the end of computer output can minimize these problems by informing users when they can speak. Contrast with “full-duplex.”
help mode: A technique for providing general help information explicitly, in a separate dialog. Contrast with “self-revealing help.” See Choosing help mode or self-revealing help.

I

II digits: Information Indicator Digits. A telephony service that provides information about the caller's line (for example, cellular service, pay telephone, etc.).

J

JMF: Java Media Framework.
JSP: JavaServer Pages. One of many server-side mechanisms for generating dynamic Web content by transmitting data between an HTTP server and an external program. JSPs call Java programs, which are executed by the HTTP server.
JVM: Java Virtual Machine.

L

Lombard speech: The tendency of people to raise their voices in noisy environments, so that they can be heard over the noise.

M

machine directed: A dialog in which the computer controls interactions. Grammars are only active within their own dialogs. Contrast with “mixed initiative.”
menu: One of two basic types of VoiceXML dialogs. Menus allow the user to provide voice or DTMF input by selecting one menu choice. See also “form.”
menu flattening: A feature of natural command grammars that enables the system to parse user input and extract multiple tokens. Mixed initiative dialogs can provide similar benefits.
mixed initiative: A dialog in which either the user or the computer can initiate interactions. You can use form-level grammars to allow the user to fill in multiple fields from a single utterance, or document-level grammars to allow the form's grammars to be active in any dialog in the same VoiceXML document; if the user utterance matches an active grammar outside of the current dialog, the application transitions to that other dialog. Contrast with “machine directed.”
mixed-mode applications: Applications that mix speech and DTMF input.
μ-law: The compression and expansion algorithm used primarily in North America and Japan when converting from analog to digital speech data.
multi-modal application: An application that has both a speech and a visual interface.
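To make the "compression and expansion" in the μ-law entry above concrete, here is a minimal sketch of the continuous μ-law companding curve with μ = 255. It is the textbook formula, not the segmented 8-bit encoding that telephony hardware applies, and the function names are hypothetical. A-law follows the same idea with a differently shaped curve.

```python
import math

MU = 255.0  # value conventionally used for 8-bit mu-law telephony


def mu_law_compress(x, mu=MU):
    """Map a linear sample in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Invert mu_law_compress, recovering the linear sample."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)
```

Compression boosts quiet samples relative to loud ones, so that a fixed number of quantization levels covers the dynamic range of speech more evenly; expansion reverses the mapping on playback.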

N

natural command grammar: A complex grammar that approaches natural language understanding in its lexical and syntactic flexibility, but unambiguously specifies all acceptable user utterances. Contrast with “natural language understanding (NLU).”
natural language understanding (NLU): A statistical technique for processing natural language, using text that is representative of expected utterances to create a dictation-like model. NLU does not use grammars; instead, it uses statistical information to tag and analyze key words in an utterance. Contrast with “natural command grammar.”

O

out-of-grammar (OOG) utterance: The user input was not in any of the active grammars.

P

persistence: A property of visual user interfaces is that information is persistent; that is, information remains visible until the user moves to a new visual page or the information changes. Contrast with “transience.”
phone: The actual pronunciation of a sound. Phones have a variable duration of up to several seconds. Multiple phones can be categorized as the same phoneme. For example, the vowels in the words “bean” and “bead” are classified as the same phoneme; however, if you carefully monitor the shape of your lips and the position of your tongue and jaw when saying the two words, you can see that the actual sound of the vowel is different (that is, they are different phones). Contrast with “phoneme.”
phoneme: A perceived pronunciation or category of pronunciation for a distinctive sound segment of a language. A change in the phoneme changes the meaning of a word. For example, “zip” and “sip” differ by only the initial sound, but are completely different words. Contrast with “phone.”
prompt: Computer spoken output, often directing the user to speak.
pronunciation: A possible phonetic representation of a word that is stored in the speech recognition engine and referenced by one or more words in a grammar. A pronunciation is a string of sounds that represents how a given word is pronounced. A word may have several pronunciations; for example, the word “tomato” may have pronunciations “toe-MAH-toe” and “toe-MAY-toe.”
prosody: The rhythm and pitch of speech, including phrasing, meter, stress, and speech rate.
PSTN: Public Switched Telephone Network.

R

recognition: When utterances are known and accepted by the speech recognition engine. Only words, phrases, and DTMF key sequences in active grammars can be recognized.
recognition window: The period of time during which the system is listening for user input. In a full-duplex implementation, the system is always listening for input; in a half-duplex implementation or when barge-in is temporarily disabled, a recognition window occurs only when the dialog is in a state where it is ready to accept user input.

S

self-revealing help: A technique for providing context-sensitive help implicitly, rather than providing general help using an explicit help mode. Contrast with “help mode.”
servlet: One of many server-side mechanisms for generating dynamic Web content by transmitting data between an HTTP server and an external program. Servlets are dynamically loaded Java-based programs defined by the Java Servlet API (http://java.sun.com/products/servlet/). Servlets run inside a JVM on a Java-enabled server.
session: A session consists of all interactions between the VoiceXML browser, the user, and the document server. The session starts when the VoiceXML browser starts, continues through dialogs and the associated document transitions, and ends when the VoiceXML browser exits.
speech browser: See “VoiceXML browser.”
speech recognition: The process by which the computer decodes human speech and converts it to text.
speech recognition engine: Decodes the audio stream based on the current active grammar(s) and returns the recognition results to the VoiceXML browser, which uses the results to fill in forms or select menu choices or options.
speech user interface (SUI): A type of computer interface consisting of spoken text and other audible sounds. Users can access and manipulate information using spoken commands and DTMF. Contrast with “GUI.” See “DTMF.”
spoke too soon (STS) incident: A recognition error that occurs when the user in a half-duplex application begins speaking before the turn-taking tone sounds and continues speaking over the tone and into the speech recognition window.
spoke way too soon (SWTS) error: A recognition error that occurs when the user in a half-duplex application finishes speaking before the turn-taking tone sounds.
stuttering effect: When a prompt in a full-duplex application keeps playing for more than 300 ms after the user begins speaking, users may interpret this to mean that the system didn't hear their input. As a result, the users stop what they were saying and start over again. This “stuttering” type of speech makes it difficult for the speech recognition engine to correctly decipher user input.
subdialog: Roughly the equivalent of function or method calls. Subdialogs can be used to provide a disambiguation or confirmation dialog, or to create reusable dialog components.
SUI: See “speech user interface (SUI).”

T

text-to-speech (TTS) engine: Generates computer synthesized speech output from text input.
token: The smallest unit of meaningful linguistic input. A simple grammar processes one token at a time; contrast with “menu flattening,” “natural command grammar,” and “natural language understanding (NLU).”
transience: A property of speech user interfaces is that information is transient; that is, information is presented sequentially and is quickly replaced by subsequent information. This places a greater mental burden on the user, who must remember more information than they need to when using a visual interface. Contrast with “persistence.”
turn-taking: The process of alternating who is performing the next action: the user or the computer.

U

URI: Uniform Resource Identifier. The address of a resource on the World Wide Web. For example: http://www.ibm.com.
URL: Uniform Resource Locator. A subset of URI.
User to User Information: ISDN service that provides call set-up information about the calling party.
utterance: Any stream of speech, DTMF input, or extraneous noise between two periods of silence.

V

verbatim prompts: Menu choices that the user can select by repeating what the system said. Contrast with “data prompts.”
voice application: An application that accepts spoken input and responds with spoken output.
VoiceXML: Voice eXtensible Markup Language. An XML-based markup language for creating distributed voice applications. Refer to the VoiceXML Forum Web site at http://www.voicexml.org.
VoiceXML browser: The “interpreter context” as defined in VoiceXML 2.0. The VoiceXML browser fetches and processes VoiceXML documents and manages the dialog between the application and the user.

W

“Wizard of Oz” testing: A testing technique that allows you to use a prototype paper script and two people (a user and a human “wizard” who plays the role of the computer system) to test the dialog and task flow before coding your application. See Prototype phase (“Wizard of Oz” testing).

X

XML: eXtensible Markup Language. A standard metalanguage for defining markup languages. XML is being developed under the auspices of the World Wide Web Consortium (W3C).