bvi_seg: Batch Voice segmentation utility

Purpose

Find the start and end of each voice segment in continuous audio input and record these positions in an index file. The success of the utility depends on the silence gaps between segments being longer than any silence gaps within segments.

Description

The diagram shows the bvi_seg utility taking data from the flat audio file created by the bvi_rec utility, determining the position of each voice segment within it, and outputting the results to an ASCII index file. It also uses control parameters from the bvi.control file; these are explained in the text that follows.

Control parameters

The bvi_seg utility uses the following control parameters from the bvi.control file:

VOICE_FILE_NAME (supplied value: bvi.voice)

Specifies the name of the voice file created by bvi_rec and used by the other utilities. You can specify any valid voice file. If you do not specify a path, the utility expects the file to be in the directory $CUR_DIR/ca/BVI_dir.

The file format is 8 kHz, 16-bit linear, big-endian (that is, the most significant byte for each sample is written before the least significant byte).
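As an illustration of the format described above, the following minimal Python sketch decodes raw voice data into sample values. The function names are hypothetical, and the sketch assumes that "16-bit linear" means signed two's-complement samples, which the text does not state explicitly.

```python
import struct

SAMPLE_RATE_HZ = 8000    # 8 kHz, from the format description
BYTES_PER_SAMPLE = 2     # 16-bit linear


def decode_samples(raw: bytes) -> list:
    """Decode big-endian 16-bit samples (assumed signed).

    A trailing odd byte, if any, is ignored.
    """
    count = len(raw) // BYTES_PER_SAMPLE
    return list(struct.unpack(f">{count}h", raw[:count * BYTES_PER_SAMPLE]))


def duration_ms(raw: bytes) -> float:
    """Length of the audio in milliseconds."""
    return 1000.0 * (len(raw) // BYTES_PER_SAMPLE) / SAMPLE_RATE_HZ
```

At this sample rate and width, one millisecond of audio occupies 16 bytes, so one second occupies 16000 bytes.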

INDEX_FILE_NAME (supplied value: bvi.index)

Specifies the name of the index file. You can specify any valid AIX file name. If you do not specify a path, the index file is created in the BVI custom server directory ($CUR_DIR/ca/BVI_dir). If an index file of this name already exists, bvi_seg overwrites it.

Each line in the index file refers to one voice segment and contains two ASCII-format numbers, the start and the end of the located segment. The numbers are byte offsets from the start of the voice file.
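The following Python sketch shows one way to read such an index and convert the byte offsets to durations. It assumes that the two numbers on each line are separated by white space, which the description does not specify; the function names are hypothetical.

```python
# 8000 samples/s * 2 bytes/sample / 1000 ms/s (from the voice file format)
BYTES_PER_MS = 16


def parse_index(text: str) -> list:
    """Return (start, end) byte offsets, one pair per index line.

    Assumes white-space-separated decimal numbers; lines with fewer
    than two fields are skipped.
    """
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        segments.append((int(fields[0]), int(fields[1])))
    return segments


def segment_ms(start: int, end: int) -> float:
    """Length of a segment in milliseconds, given its byte offsets."""
    return (end - start) / BYTES_PER_MS
```

For example, a line containing 0 and 16000 would describe a segment that covers the first second of the voice file.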

THRESHOLD_LEVEL (supplied value: -40)

Specifies the level in dBm that bvi_seg uses to decide whether an incoming signal is silence or voice. The utility calculates the level of the audio signal (taken from the voice file) for each 5 millisecond block, and compares that level with THRESHOLD_LEVEL. bvi_seg decides that it has voice activity if the level is above THRESHOLD_LEVEL; otherwise it considers the current block as containing silence.
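The per-block decision can be sketched in Python as follows. This is an illustration only: the text does not say how bvi_seg maps sample values to dBm, so the sketch measures the RMS level in decibels relative to full scale instead, and all names in it are hypothetical.

```python
import math

BLOCK_SAMPLES = 40      # 5 ms at 8 kHz
FULL_SCALE = 32768.0    # assumed reference level (16-bit full scale)


def block_level_db(samples) -> float:
    """RMS level of one block in dB relative to full scale.

    Assumption: the real utility reports dBm; its reference level is
    not documented, so full scale is used here instead.
    """
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 10.0 * math.log10(mean_square / FULL_SCALE ** 2)


def classify_blocks(samples, threshold_db: float) -> list:
    """Return True for each complete 5 ms block whose level is above the threshold."""
    flags = []
    for i in range(0, len(samples) - BLOCK_SAMPLES + 1, BLOCK_SAMPLES):
        block = samples[i:i + BLOCK_SAMPLES]
        flags.append(block_level_db(block) > threshold_db)
    return flags
```

A block is treated as voice only when its level is strictly above the threshold, matching the rule described above.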

The value to assign to THRESHOLD_LEVEL depends on the level of the incoming signal, but should be between -40 dBm and -25 dBm. Use a value nearer -25 when the incoming signal has a higher level (that is, it is louder) than a signal for which -40 is most appropriate.

Determine the best value of THRESHOLD_LEVEL as follows:

Try the IBM-supplied value.
If the utility is not finding the gaps between segments (that is, it reports one long voice segment or fewer voice segments than you would expect), increase THRESHOLD_LEVEL by 5 dBm and try again.
If you find that the utility is having difficulty finding voice in the input (that is, it only finds one or two short segments), decrease THRESHOLD_LEVEL by 5 dBm and try again.
You will probably be able to find a value in the range of -25 to -40 to suit your input data. You can fine-tune the value by 1 or 2 dBm, but it will probably not be necessary.
bvi_rec reports the maximum and minimum level for the recorded voice. Set THRESHOLD_LEVEL above the minimum level by about 25% of the difference between the maximum and the minimum. For example, if the minimum level is -50 dBm and the maximum level is -10 dBm, set THRESHOLD_LEVEL to -40.
If you find that you cannot achieve good segmentation by setting THRESHOLD_LEVEL in the range -25 to -40, your input data is probably too loud or too quiet.
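The 25% rule described above is simple arithmetic, shown here as a short Python sketch. The function name and the fraction parameter are hypothetical; only the rule itself comes from the text.

```python
def suggested_threshold(min_level_dbm: float, max_level_dbm: float,
                        fraction: float = 0.25) -> float:
    """Threshold placed a given fraction of the way from minimum to maximum level."""
    return min_level_dbm + fraction * (max_level_dbm - min_level_dbm)


def in_recommended_range(threshold_dbm: float) -> bool:
    """True if the value lies in the recommended range of -40 to -25."""
    return -40 <= threshold_dbm <= -25
```

With the levels from the example (minimum -50 dBm, maximum -10 dBm), the difference is 40, a quarter of that is 10, and the result is -40.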

MARGIN_TIME (supplied value: 20)

Specifies the time (in milliseconds) that bvi_seg adds before the detected start and after the detected end of a segment. Because the level detection algorithm (using THRESHOLD_LEVEL) only cuts in at a certain level, setting MARGIN_TIME to a nonzero value ensures that quieter voice activity immediately before the detected start of the segment is captured.

END_MARGIN_TIME (supplied value: 20)

Specifies the silence time (in milliseconds) that bvi_seg puts after each voice segment. If this parameter is omitted, MARGIN_TIME is used for both the leading and the trailing margin.
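The effect of the two margin parameters on a segment's byte offsets can be sketched as follows. This Python fragment is illustrative only: the function name is hypothetical, and clamping the result to the bounds of the voice file is an assumption, not documented behavior.

```python
BYTES_PER_MS = 16   # 8 kHz, 16-bit samples


def apply_margins(start: int, end: int, file_len: int,
                  margin_ms: int = 20, end_margin_ms=None):
    """Widen a detected segment by the leading and trailing margins.

    If end_margin_ms is None (parameter omitted), margin_ms is used
    for the trailing margin as well. Clamping to [0, file_len] is an
    assumption made for this sketch.
    """
    if end_margin_ms is None:
        end_margin_ms = margin_ms
    new_start = max(0, start - margin_ms * BYTES_PER_MS)
    new_end = min(file_len, end + end_margin_ms * BYTES_PER_MS)
    return new_start, new_end
```

With the supplied values, each margin of 20 ms corresponds to 320 bytes of audio.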

START_TIME (supplied value: 50)

Specifies the minimum length (in milliseconds) of voice activity for bvi_seg to decide that a segment has started. Voice activity is defined as an audible signal above the value specified for THRESHOLD_LEVEL.

Set START_TIME to a value less than the length of the shortest utterance at the start of a segment. If you find that the start of one or more segments is being ignored (that is, segments start too late), decrease START_TIME. If, however, you find that bvi_seg is recognizing short periods of background noise as segments, increasing START_TIME will probably fix the problem.

STOP_TIME (supplied value: 1500)

Specifies the minimum length (in milliseconds) of silence for bvi_seg to decide that a segment has ended. Silence is defined as a signal that is not above the value specified for THRESHOLD_LEVEL.

STOP_TIME must be greater than the largest silence gap that can naturally occur within a segment, and less than the silence delimiter gap between segments. We recommend a 5-second silence gap between segments with a STOP_TIME of 1.5 seconds (1500 ms). This prevents natural intra-segment gaps of up to 1.5 seconds from being picked up as inter-segment gaps.

GLITCH_TIME_1 (supplied value: 25)

Specifies the maximum length (in milliseconds) of silence that can be tolerated within START_TIME before bvi_seg decides that what it has heard so far is not the start of a valid voice segment. This prevents inter-word or inter-syllable gaps at the start of a segment from fooling bvi_seg into thinking that it has not seen the start of a segment.

GLITCH_TIME_2 (supplied value: 250)

Specifies the maximum length (in milliseconds) of voice activity that can be tolerated within STOP_TIME before bvi_seg decides that what it has heard so far is not the end of a valid voice segment. This prevents noise pulses in the inter-segment silence gaps from fooling bvi_seg into thinking that it has not seen the end of a segment.
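The interaction of START_TIME, STOP_TIME, GLITCH_TIME_1, and GLITCH_TIME_2 can be pictured as a small state machine over the 5 ms voice/silence decisions. The following Python sketch is one possible reading of the descriptions above, not the actual bvi_seg algorithm: the state names, the way elapsed time is counted, and the handling of input that ends in the middle of a segment are all assumptions.

```python
BLOCK_MS = 5   # each decision covers one 5 ms block


def find_segments(voiced, start_time=50, stop_time=1500,
                  glitch_1=25, glitch_2=250):
    """Return (start_block, end_block) pairs; end_block is exclusive.

    voiced is a sequence of booleans, one per 5 ms block.
    """
    segments = []
    state = "idle"
    seg_start = cand_end = 0
    elapsed = run = 0   # ms since the candidate edge; ms of the contrary run
    for i, is_voice in enumerate(voiced):
        if state == "idle":
            if is_voice:                      # candidate start of a segment
                state, seg_start, elapsed, run = "starting", i, BLOCK_MS, 0
        elif state == "starting":
            elapsed += BLOCK_MS
            run = 0 if is_voice else run + BLOCK_MS
            if run > glitch_1:                # too much silence: not a segment
                state = "idle"
            elif elapsed >= start_time:       # enough activity: segment started
                state = "voice"
        elif state == "voice":
            if not is_voice:                  # candidate end of the segment
                state, cand_end, elapsed, run = "stopping", i, BLOCK_MS, 0
        else:                                 # state == "stopping"
            elapsed += BLOCK_MS
            run = run + BLOCK_MS if is_voice else 0
            if run > glitch_2:                # too much voice: segment continues
                state = "voice"
            elif elapsed >= stop_time:        # enough silence: segment ended
                segments.append((seg_start, cand_end))
                state = "idle"
    # Assumption: a segment still open at end of input is kept.
    if state == "voice":
        segments.append((seg_start, len(voiced)))
    elif state == "stopping":
        segments.append((seg_start, cand_end))
    return segments
```

In this sketch, block numbers convert to byte offsets in the voice file by multiplying by 80 (40 samples of 2 bytes each), before any margins are applied.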

TRACE (supplied value: 0)

Specifies whether tracing is on or off. When tracing is on, trace information is written to the screen. For normal operation, set TRACE to 0. If you need to analyze the segmentation activity on a 5 ms block-by-block basis, set TRACE to 1. Internal information, such as the level of each block and state transition information, is then displayed for each block.

Procedure

Note: Although there is no theoretical limit on the number of segments that can be processed in one batch, we recommend an upper limit of 5 minutes and 200 segments.

After opening the BVI Custom Server Import window (see Starting the BVI custom server), start segmentation by typing bvi_seg on the command line and pressing Enter. The utility:
1. Reads the control parameters from the bvi.control file.
2. Reads the voice file (in 5 ms blocks).
3. Locates the voice segments by performing signal processing on the voice data.
4. Generates the index file.

You can now use bvi_desc to add information about the voice segments to the description file.