This glossary defines terms and abbreviations used in this publication.
If you do not find the term you are looking for here, refer to the
IBM Dictionary of Computing, SC20-1699, New York: McGraw-Hill, copyright
1994 by International Business Machines Corporation. Copies may be
purchased from McGraw-Hill or in bookstores.
A
- active grammar
- A speech grammar that the speech recognition engine
is currently listening for. One or more grammars can be active at
any time, and the content of the active grammar(s) defines the user
utterances that are valid in a given context.
- A-law
- The compression and expansion algorithm used primarily in Europe
when converting from analog to digital speech data.
- ANI
- Automatic Number Identification. A service offered by commercial
telephone networks, which provides the directory billing number associated
with a calling party. This is the originating telephone number of
the incoming call, which can be used for call set-up or passed by
the switch to the Voice Server, which can then use it to retrieve
data from business databases. Often used as a synonym for calling number.
- application
- A set of related VoiceXML documents that share the same application
root document.
- application root document
- A document that is loaded when any documents in its application
are loaded, and unloaded whenever the dialog transitions to a document
in a different application. The application root document may contain
grammars that can be active for the duration of the application, and
variables that can be accessed by any document in the application.
- ASP
- Microsoft Active Server Pages. One of many server-side mechanisms
for generating dynamic Web content by transmitting data between an
HTTP server and an external program. ASPs can be written in various
scripting languages, including VBScript (based on Microsoft Visual
Basic), JScript (based on JavaScript), and PerlScript (based on Perl).
B
- bail out
- The termination of a sequence of self-revealing help prompts,
if the user repeatedly fails to provide an appropriate response. This
is generally a transfer to a human operator (if available), or an
exit.
- barge-in
- A feature of full-duplex environments that allows the user to
interrupt computer speech output (audio file and text-to-speech).
See also “full-duplex.”
C
- CGI
- Common Gateway Interface. One of many server-side mechanisms
for generating dynamic Web content by transmitting data between an
HTTP server and an external program. CGI scripts are typically written
in Perl, although they can be written in other programming languages.
- called number
- The number dialed by callers to reach the voice application, or
the number dialed when making a call. Often used as a synonym for DNIS.
- calling number
- The number from which a call is made. Often used as a synonym
for ANI.
- continuous speech recognition
- The WebSphere Voice Server supports “continuous speech recognition,”
in which users can speak a string of words at a natural pace, without
the need to pause after each word. Contrast with “discrete speech recognition.”
- cookie
- Information that a Web server stores on a user's computer when
the user browses a particular Web site. This information helps the
Web server track such things as user preferences and data that the
user may submit while browsing the site. For example, a cookie may
include information about the purchases that the user makes (if the
Web site is a shopping site). The use of cookies enables a Web site
to become more interactive with its users, especially on future visits.
- cut-thru word recognition
- See “barge-in.”
D
- data prompts
- Prompts where the user must supply information to fill in the
field of a form. Contrast with “verbatim
prompts.”
- dialog
- The main building block for interaction between the user and the
application. VoiceXML supports two types of dialogs: “form” and “menu.”
- discrete speech recognition
- Users must pause briefly after speaking each word, to allow the
system to process and recognize the input. Contrast with “continuous speech recognition.”
- DNIS
- Dialed Number Identification Service. A service supplied by the
public telephone network to identify the number actually dialed. For
example, calls placed to two or more 1-800 numbers will arrive at
the same call center switch. Upon arrival, DNIS tells the switch which
one of the 1-800 numbers was actually dialed. DNIS can be used by
the Voice Server to automatically select between several voice applications.
Often used as a synonym for “called number.”
- DTMF
- Dual-Tone Multi-Frequency. The tones generated by pressing
keys on a telephone's keypad.
- DTMF Simulator
- A GUI tool that enables you to simulate DTMF input when testing
your VoiceXML application on your desktop workstation. The VoiceXML
browser communicates with the DTMF Simulator to accept DTMF input,
and uses that input to fill in forms or select menu items within the
VoiceXML application.
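
The DTMF entry above describes the tones generated by the keys on a telephone keypad. As a rough illustration, each key is signaled by a pair of frequencies, one from a low group (keypad row) and one from a high group (keypad column), as standardized in ITU-T Q.23. The sketch below maps a key to its pair. The name dtmf_frequencies is hypothetical; it is not part of the Voice Server or the DTMF Simulator.

```python
# Standard DTMF frequency groups in Hz (ITU-T Q.23).
LOW_GROUP = (697, 770, 852, 941)        # one frequency per keypad row
HIGH_GROUP = (1209, 1336, 1477, 1633)   # one frequency per keypad column

# Keypad layout; the fourth column (A-D) is absent on most telephones.
KEYPAD = (
    "123A",
    "456B",
    "789C",
    "*0#D",
)


def dtmf_frequencies(key):
    """Return the (low, high) frequency pair in Hz for a single keypad key.

    Hypothetical helper for illustration only.
    """
    if len(key) == 1:
        for row, keys in enumerate(KEYPAD):
            col = keys.find(key)
            if col != -1:
                return LOW_GROUP[row], HIGH_GROUP[col]
    raise ValueError(f"not a DTMF key: {key!r}")
```

For example, dtmf_frequencies("5") returns the row-2 and column-2 frequencies for the "5" key.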
E
- echo cancellation
- Technology that removes echo sounds from the input data stream
before passing what's left (that is, user speech) to the speech recognition
engine. In a connection environment, this is configured on the telephony
hardware; if echo cancellation is poor, you may need to turn off barge-in or switch to “half-duplex.”
- ECMAScript
- An object-oriented programming language adopted by the European
Computer Manufacturers Association as a standard for performing computations
in Web applications. ECMAScript is the official client-side scripting
language of VoiceXML. Refer to the ECMAScript Language Specification,
available at http://www.ecma.ch/ecma1/stand/ECMA-262.htm.
- event
- The VoiceXML browser throws an event when it encounters a <throw> element
or when certain specified conditions occur. Events are caught by <catch> elements
that can be specified within other VoiceXML elements in which the
event can occur, or inherited from higher-level elements. The VoiceXML
browser supports a number of predefined events and default event handlers;
you can also define your own events and event handlers.
F
- form
- One of two basic types of VoiceXML dialogs. Forms allow the user
to provide voice or DTMF input by responding to one or more <field> elements.
See also “menu.”
- full-duplex
- Applications in which the user and computer can speak concurrently.
Full-duplex applications use echo cancellation to
subtract computer output from the incoming data to determine what
was user speech. See also “barge-in.”
Contrast with “half-duplex.”
G
- grammar
- A collection of rules that define the set of all user utterances
that can be recognized by the speech recognition engine at a given
point in time. The VoiceXML browser makes different grammars active
at different points in the dialog, thereby controlling the set of
valid utterances that the speech recognition engine is listening for.
Grammars support word substitution and word repetition.
- GSM
- Global System for Mobile Communications. A digital cellular telephone
network standard.
- GUI
- Graphical User Interface. A type of computer interface consisting
of visual images and printed text. Users can access and manipulate
information using a pointing device and keyboard. Contrast with “speech user interface.”
H
- H.323
- Audio communications protocol used by WebSphere Voice Server.
- half-duplex
- Applications in which the user should not speak while the computer
is speaking because the speech recognition engine does not receive
audio when sending audio to the user. Turn-taking problems can occur
when the user speaks before the computer has finished speaking; using
a unique tone to indicate the end of computer output can minimize
these problems by informing users when they can speak. Contrast with
“full-duplex.”
- help mode
- A technique for providing general help information explicitly,
in a separate dialog. Contrast with “self-revealing
help.” See Choosing help mode or self-revealing help.
I
- II digits
- Information Indicator Digits. A telephony service that provides
information about the caller's line (for example, cellular service,
pay telephone, etc.).
J
- JMF
- Java Media Framework.
- JSP
- JavaServer Pages. One of many server-side mechanisms for generating
dynamic Web content by transmitting data between an HTTP server and
an external program. JSPs call Java programs, which are executed by
the HTTP server.
- JVM
- Java Virtual Machine.
L
- Lombard speech
- The tendency of people to raise their voices in noisy environments,
so that they can be heard over the noise.
M
- machine directed
- A dialog in which the computer controls interactions. Grammars
are only active within their own dialogs. Contrast with “mixed initiative.”
- menu
- One of two basic types of VoiceXML dialogs. Menus allow the user
to provide voice or DTMF input by selecting one menu choice. See also “form.”
- menu flattening
- A feature of natural command grammars that enables the system
to parse user input and extract multiple tokens. Mixed initiative
dialogs can provide similar benefits.
- mixed initiative
- A dialog in which either the user or the computer can initiate
interactions. You can use form-level grammars to allow the user to
fill in multiple fields from a single utterance, or document-level
grammars to allow the form's grammars to be active in any dialog in
the same VoiceXML document; if the user utterance matches an active
grammar outside of the current dialog, the application transitions
to that other dialog. Contrast with “machine
directed.”
- mixed-mode applications
- Applications that mix speech and DTMF input.
- μ-law
- The compression and expansion algorithm used primarily in North
America and Japan when converting from analog to digital speech data.
- multi-modal application
- An application that has both a speech and a visual interface.
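
The μ-law entry above refers to a compression and expansion (companding) algorithm. A minimal sketch of the continuous μ-law curve follows, assuming samples normalized to the range -1.0 to 1.0 and μ = 255. The function names are hypothetical, and this is the textbook formula rather than the segmented 8-bit encoding that telephony hardware actually uses.

```python
import math

MU = 255  # assumed value; the usual choice for 8-bit telephony


def mu_law_compress(x, mu=MU):
    """Compress a sample in [-1.0, 1.0] with the continuous mu-law curve.

    F(x) = sgn(x) * ln(1 + mu*|x|) / ln(1 + mu)
    """
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Invert mu_law_compress: x = sgn(y) * ((1 + mu)**|y| - 1) / mu."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)
```

Quiet samples are boosted more than loud ones (a sample of 0.01 compresses to a value well above 0.01), which is why companding preserves low-level speech detail when the result is quantized to a small number of bits.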
N
- natural command grammar
- A complex grammar that approaches natural language understanding
in its lexical and syntactic flexibility, but unambiguously specifies
all acceptable user utterances. Contrast with “natural language understanding (NLU).”
- natural language understanding (NLU)
- A statistical technique for processing natural language, using
text that is representative of expected utterances to create a dictation-like
model. NLU does not use grammars; instead, it uses statistical information
to tag and analyze key words in an utterance. Contrast with “natural command grammar.”
O
- out-of-grammar (OOG) utterance
- The user input was not in any of the active grammars.
P
- persistence
- The property of visual user interfaces whereby information is persistent;
that is, information remains visible until the user moves to a new
visual page or the information changes. Contrast with “transience.”
- phone
- The actual pronunciation of a sound. Phones have a variable duration
of up to several seconds. Multiple phones can be categorized as the
same phoneme. For example, the vowels in the words “bean” and “bead”
are classified as the same phoneme; however, if you carefully monitor
the shape of your lips and the position of your tongue and jaw when
saying the two words, you can see that the actual sound of the vowel
is different (that is, they are different phones). Contrast with “phoneme.”
- phoneme
- A perceived pronunciation or category of pronunciation for a distinctive
sound segment of a language. A change in the phoneme changes the meaning
of a word. For example, “zip” and “sip” differ by only the initial
sound, but are completely different words. Contrast with “phone.”
- prompt
- Computer spoken output, often directing the user to speak.
- pronunciation
- A possible phonetic representation of a word that is stored in
the speech recognition engine and referenced by one or more words
in a grammar. A pronunciation is a string of sounds that represents
how a given word is pronounced. A word may have several pronunciations;
for example, the word “tomato” may have pronunciations “toe-MAH-toe”
and “toe-MAY-toe.”
- prosody
- The rhythm and pitch of speech, including phrasing, meter, stress,
and speech rate.
- PSTN
- Public Switched Telephone Network.
R
- recognition
- When utterances are known and accepted by the speech recognition
engine. Only words, phrases, and DTMF key sequences in active grammars
can be recognized.
- recognition window
- The period of time during which the system is listening for user
input. In a full-duplex implementation, the system is always listening
for input; in a half-duplex implementation or when barge-in is temporarily
disabled, a recognition window occurs only when the dialog is in a
state where it is ready to accept user input.
S
- self-revealing help
- A technique for providing context-sensitive help implicitly, rather
than providing general help using an explicit help mode. Contrast
with “help mode.”
- servlet
- One of many server-side mechanisms for generating dynamic Web
content by transmitting data between an HTTP server and an external
program. Servlets are dynamically loaded Java-based programs defined
by the Java Servlet API (http://java.sun.com/products/servlet/).
Servlets run inside a JVM on a Java-enabled server.
- session
- A session consists of all interactions between the VoiceXML browser,
the user, and the document server. The session starts when the VoiceXML
browser starts, continues through dialogs and the associated document
transitions, and ends when the VoiceXML browser exits.
- speech browser
- See “VoiceXML browser.”
- speech recognition
- The process by which the computer decodes human speech and converts
it to text.
- speech recognition engine
- Decodes the audio stream based on the current active grammar(s)
and returns the recognition results to the VoiceXML browser, which
uses the results to fill in forms or select menu choices or options.
- speech user interface (SUI)
- A type of computer interface consisting of spoken text and other
audible sounds. Users can access and manipulate information using
spoken commands and DTMF. Contrast with “GUI.”
See “DTMF.”
- spoke too soon (STS) incident
- A recognition error that occurs when the user in a half-duplex
application begins speaking before the turn-taking tone sounds and
continues speaking over the tone and into the speech recognition window.
- spoke way too soon (SWTS) error
- A recognition error that occurs when the user in a half-duplex
application finishes speaking before the turn-taking tone sounds.
- stuttering effect
- When a prompt in a full-duplex application keeps playing for more
than 300 ms after the user begins speaking, users may interpret this
to mean that the system didn't hear their input. As a result, the
users stop what they were saying and start over again. This “stuttering”
type of speech makes it difficult for the speech recognition engine
to correctly decipher user input.
- subdialog
- Roughly the equivalent of function or method calls. Subdialogs
can be used to provide a disambiguation or confirmation dialog, or
to create reusable dialog components.
- SUI
- See “speech user interface (SUI).”
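
The “spoke too soon” and “spoke way too soon” entries above differ only in where the user's speech falls relative to the turn-taking tone. The sketch below restates those two definitions as a timing check. It assumes that the recognition window opens when the tone ends and that all times are in seconds from the start of the prompt; the function name and the "in-window" and "other" labels are hypothetical.

```python
def classify_turn(speech_start, speech_end, tone_start, tone_end):
    """Classify user speech timing in a half-duplex application.

    Hypothetical helper; assumes the recognition window opens at tone_end.
    """
    if speech_end < tone_start:
        # User finished speaking before the tone sounded.
        return "SWTS"
    if speech_start < tone_start and speech_end > tone_end:
        # User began before the tone and continued into the window.
        return "STS"
    if speech_start >= tone_end:
        # User waited for the tone, as intended.
        return "in-window"
    # Any other overlap with the tone is not covered by the two definitions.
    return "other"
```

A logging or test harness could use a check like this to count STS and SWTS cases separately, since the first loses only the beginning of the utterance while the second loses all of it.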
T
- text-to-speech (TTS) engine
- Generates computer synthesized speech output from text input.
- token
- The smallest unit of meaningful linguistic input. A simple grammar
processes one token at a time; contrast with “menu flattening,” “natural command grammar,” and “natural language understanding (NLU).”
- transience
- The property of speech user interfaces whereby information is transient;
that is, information is presented sequentially and is quickly replaced
by subsequent information. This places a greater mental burden on
the user, who must remember more information than they need to when
using a visual interface. Contrast with “persistence.”
- turn-taking
- The process of alternating who is performing the next action:
the user or the computer.
U
- URI
- Uniform Resource Identifier. The address of a resource on the
World Wide Web. For example: http://www.ibm.com.
- URL
- Uniform Resource Locator. A subset of URI.
- User to User Information
- ISDN service that provides call set-up information about the calling
party.
- utterance
- Any stream of speech, DTMF input, or extraneous noise between
two periods of silence.
V
- verbatim prompts
- Menu choices that the user can select by repeating what the system
said. Contrast with “data prompts.”
- voice application
- An application that accepts spoken input and responds with spoken
output.
- VoiceXML
- Voice eXtensible Markup Language. An XML-based markup language
for creating distributed voice applications. Refer to the VoiceXML
Forum Web site at http://www.voicexml.org.
- VoiceXML browser
- The “interpreter context” as defined in VoiceXML 2.0. The VoiceXML
browser fetches and processes VoiceXML documents and manages the dialog
between the application and the user.
W
- “Wizard of Oz” testing
- A testing technique that allows you to use a prototype paper script
and two people (a user and a human “wizard” who plays the role of
the computer system) to test the dialog and task flow before coding
your application. See Prototype phase (“Wizard of Oz” testing).
X
- XML
- eXtensible Markup Language. A standard metalanguage for defining
markup languages. XML is being developed under the auspices of the
World Wide Web Consortium (W3C).