Using Speech To Text (STT) engines


BVR allows a VoiceXML application to connect to Speech To Text (STT) engines as an alternative to traditional MRCP ASR speech providers. This is achieved using standard VoiceXML form processing and grammar tags with a custom grammar format.

To use an STT engine, a VoiceXML application must have an STT Call Feature attached to it that matches the locale of the application. For further information on STT Call Features, refer to BAM Command Line Utility Call Features Panel.

An STT interaction is triggered by defining a grammar tag in a field with a mode of "voice" and a type of "application/x-blueworx-stt+json". This is a custom Blueworx grammar format specifically for interacting with STT engines. The grammar is supplied as a JSON string, either inline or using the srcexpr attribute of the grammar tag.

When the field is filled, responses from the STT engine are returned to the VoiceXML application using standard VoiceXML form item variables.

The contents and format of the custom grammar and the format of the response data are described in the following sections, which also include examples.

Data Structures

Format of data structures passed to and from STT engines

An object structure is used to send data to and receive data from STT engines. Data for the STT engine is sent in the grammar, and the response is returned via the interpretation property of the shadow variables. The normalised data structure is converted to the format that the attached STT engine expects before it is sent, and the STT engine's response is normalised to the Blueworx format before it is returned in the filled block.

Note that all of these parameters are optional.

| Variable | Type | Description |
| --- | --- | --- |
| input_detection_mode | string | Input detection mode. This must be one of "NOISE", "FIRST_INTERIM_RESULT" or "FIRST_FINAL_RESULT". The "NOISE" detection mode is implemented by BVR and is triggered when any sound is heard on the call. The other detection modes are triggered by results coming back from the STT engine. |
| stt_parms | JSON object | A set of key/value pairs that are sent directly to the STT engine to change its behaviour. |

Setting custom STT parameters

It is possible to set parameters for an STT engine by using the stt_parms parameter of the options structure. As an example, if using the IBM Cloud STT engine, all parameters available in the WebSockets API can be set in stt_parms and will be sent to the STT engine. Note that URL parameters and message body parameters can both be used in the stt_parms structure for IBM STT. If the model parameter is set in stt_parms it will override the model in the call feature definition.

The supported Google STT engine parameters are documented at Google Speech To Text (STT) supported parameters.

The supported IBM Cloud STT engine parameters are documented at IBM Cloud Speech To Text (STT) supported parameters.
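
As an illustration of the model override described in this section, the following inline grammar sets only the model parameter in stt_parms. This is a sketch rather than a tested configuration: the model name en-US_Telephony is an assumed example of an IBM Cloud model identifier and does not come from this documentation, so substitute a model that is available to your own service instance.

```xml
<!-- Sketch only: "en-US_Telephony" is an assumed example model name -->
<grammar mode="voice" version="1.0" type="application/x-blueworx-stt+json">
    {
        "stt_parms": {
            "model": "en-US_Telephony"
        }
    }
</grammar>
```

With a grammar like this, the model named in stt_parms would be expected to take precedence over the model configured in the call feature definition for this interaction.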

Using Inline Grammars

To use inline grammars, supply the STT data structure format as JSON in the grammar body.


The following example sends a request to the IBM Cloud STT engine, setting the input detection mode and some IBM Cloud specific parameters:

<grammar mode="voice" version="1.0" type="application/x-blueworx-stt+json">
    {
        "input_detection_mode": "NOISE",
        "stt_parms": {
            "speech_detector_sensitivity": "0.5",
            "background_audio_suppression": "0.5",
            "end_of_phrase_silence_time": "2.0"
        }
    }
</grammar>

Note that using an inline grammar is the simplest way to send a request to an STT engine without any parameters. The body of the grammar can be an empty JSON structure:

<grammar mode="voice" version="1.0" type="application/x-blueworx-stt+json">{}</grammar>
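
To show where an inline STT grammar sits in an application, here is a minimal sketch of a complete VoiceXML document that uses an empty JSON grammar. The form id, the field name caller_speech and the prompt wording are hypothetical names chosen for illustration; only the grammar mode and type come from this documentation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="stt_inline_form">
    <!-- "caller_speech" is an illustrative field name -->
    <field name="caller_speech">
      <prompt>How can I help you today?</prompt>

      <!-- Empty JSON body: no STT parameters are sent with the request -->
      <grammar mode="voice" version="1.0" type="application/x-blueworx-stt+json">{}</grammar>

      <filled>
        <!-- Play back the text returned in the standard shadow variable -->
        <prompt>You said <value expr="caller_speech$.utterance"/></prompt>
      </filled>
    </field>
  </form>
</vxml>
```

The application must still have an STT Call Feature attached that matches its locale for the grammar to be routed to an STT engine.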

Using srcexpr

The data structure for an STT engine can also be supplied through the srcexpr attribute of the grammar tag. The value of srcexpr must evaluate to a JSON representation of the STT data structure. To make the JSON easier to generate, BVR includes a function, objectToJson, that converts an Object structure to JSON. It is therefore possible to build up the STT data structure as an Object in the VXML document and then pass it to this function to get the JSON representation.

Example usage:

1. Define the variable as a new Object

    <var name="stt_request_structure" expr="new Object()"/>

2. Add any variables for the request. For example, the following builds the same request as the previous inline grammar example.

    <assign name="stt_request_structure.input_detection_mode" expr="'NOISE'" />
    <assign name="stt_request_structure.stt_parms" expr="new Object()" />
    <assign name="stt_request_structure.stt_parms.speech_detector_sensitivity" expr="0.5" />
    <assign name="stt_request_structure.stt_parms.background_audio_suppression" expr="0.5" />
    <assign name="stt_request_structure.stt_parms.end_of_phrase_silence_time" expr="2.0" />

3. Convert the structure to JSON and assign it to the srcexpr attribute of the grammar tag

    <grammar mode="voice" version="1.0" type="application/x-blueworx-stt+json" srcexpr="objectToJson(stt_request_structure)"/>
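
The three steps above can be combined into one document, sketched below. The variable name stt_request_structure and the objectToJson function are taken from the steps; the form id, field name and prompt text are hypothetical. The assign elements are placed in a block ahead of the field so that the structure is populated before the grammar's srcexpr is evaluated, which is an assumption about how you would typically order the form items.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- Step 1: declare the request structure as an empty Object -->
  <var name="stt_request_structure" expr="new Object()"/>

  <form id="stt_srcexpr_form">
    <!-- Step 2: populate the structure before the field is visited -->
    <block>
      <assign name="stt_request_structure.input_detection_mode" expr="'NOISE'"/>
      <assign name="stt_request_structure.stt_parms" expr="new Object()"/>
      <assign name="stt_request_structure.stt_parms.end_of_phrase_silence_time" expr="2.0"/>
    </block>

    <!-- "caller_speech" is an illustrative field name -->
    <field name="caller_speech">
      <prompt>Please describe your problem.</prompt>

      <!-- Step 3: convert the Object to JSON when the grammar is activated -->
      <grammar mode="voice" version="1.0" type="application/x-blueworx-stt+json"
               srcexpr="objectToJson(stt_request_structure)"/>

      <filled>
        <log>Heard: <value expr="caller_speech$.utterance"/></log>
      </filled>
    </field>
  </form>
</vxml>
```

Building the request as an Object is most useful when parameter values are only known at run time, for example when they depend on an earlier answer in the call.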

Handling data returned from an STT engine

If no input was detected, a "noinput" event is thrown. If no intents or entities were detected in the response, a "nomatch" event is thrown; however, it is still possible to retrieve the STT engine's response using the application.lastresult$ shadow variable.
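
A minimal sketch of handling both events with the standard VoiceXML noinput and nomatch elements follows. The field name and prompt wording are hypothetical; the nomatch handler reads the first entry of application.lastresult$ to log what the STT engine returned.

```xml
<!-- "caller_speech" is an illustrative field name -->
<field name="caller_speech">
  <prompt>Please tell me what you need.</prompt>
  <grammar mode="voice" version="1.0" type="application/x-blueworx-stt+json">{}</grammar>

  <!-- Thrown when no input was detected -->
  <noinput>
    <prompt>Sorry, I did not hear anything.</prompt>
    <reprompt/>
  </noinput>

  <!-- Thrown on a nomatch; the engine's response is still available -->
  <nomatch>
    <log>Unmatched utterance: <value expr="application.lastresult$[0].utterance"/></log>
    <prompt>Sorry, I did not understand that.</prompt>
    <reprompt/>
  </nomatch>
</field>
```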

The shadow variables in the filled block will contain the usual fields.

| Field | Description |
| --- | --- |
| utterance | This is what the user said to the STT engine. |
| confidence | The highest confidence value returned for all intents and entities in the response. |
| interpretation | This is the object that the STT data structure is mapped to. It contains the entirety of the normalised data structure in the Object key/value pair format. |
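
The fields listed above can be read in a filled block, as in the following sketch. The field name caller_speech is hypothetical. Because the keys inside interpretation are not listed here, the example logs the whole object; calling objectToJson inside a log expression is an assumption that the function is available wherever ECMAScript expressions are evaluated.

```xml
<!-- "caller_speech" is an illustrative field name -->
<field name="caller_speech">
  <prompt>Please tell me what you need.</prompt>
  <grammar mode="voice" version="1.0" type="application/x-blueworx-stt+json">{}</grammar>

  <filled>
    <!-- Text of what the user said -->
    <log>Utterance: <value expr="caller_speech$.utterance"/></log>

    <!-- Confidence value returned with the result -->
    <log>Confidence: <value expr="caller_speech$.confidence"/></log>

    <!-- Assumption: objectToJson can serialise the interpretation object for logging -->
    <log>Interpretation: <value expr="objectToJson(caller_speech$.interpretation)"/></log>
  </filled>
</field>
```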

The confidencelevel VXML property

The confidencelevel VXML property sets the confidence threshold below which a result is treated as a nomatch. If the confidence returned by the STT engine is below this threshold, a nomatch is returned.
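
As a brief sketch, the threshold can be set with the standard VoiceXML property element. The value 0.6 and the field name are illustrative choices only; the property can also be set at form or document level so that it applies more widely.

```xml
<!-- "caller_speech" is an illustrative field name -->
<field name="caller_speech">
  <!-- Results with a confidence below 0.6 are treated as a nomatch -->
  <property name="confidencelevel" value="0.6"/>

  <prompt>Please tell me what you need.</prompt>
  <grammar mode="voice" version="1.0" type="application/x-blueworx-stt+json">{}</grammar>

  <nomatch>
    <prompt>Sorry, I am not sure I understood that.</prompt>
    <reprompt/>
  </nomatch>
</field>
```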