Purpose
Find the start and end of each voice segment in continuous audio input
and record these positions in an index file. The success of the utility depends
on the silence gaps between segments being longer than any silence gaps within
segments.
Control parameters
The bvi_seg utility uses the following control parameters from the bvi.control
file:
- VOICE_FILE_NAME (supplied value: bvi.voice)
- Specifies the name of the voice file created by bvi_rec and used by
the other utilities. You can specify any valid voice file. If you do not specify
a path, the utility expects the file to be in the directory $CUR_DIR/ca/BVI_dir.
The file format is 8 kHz, 16-bit linear, big-endian (that is, the most significant
byte for each sample is written before the least significant byte).
- INDEX_FILE_NAME (supplied value: bvi.index)
- Specifies the name of the index file. You can specify any valid AIX
file name. If you do not specify a path, the index file is created in the
BVI custom server directory ($CUR_DIR/ca/BVI_dir). If an index
file of this name already exists, bvi_seg overwrites it.
Each line in the index file refers to one voice segment and contains two
ASCII-format numbers, the start and the end of the located segment. The numbers
are byte offsets from the start of the voice file.
- THRESHOLD_LEVEL (supplied value: -40)
- Specifies the level in dBm that bvi_seg uses to decide whether an incoming
signal is silence or voice. The utility calculates the level of the audio
signal (taken from the voice file) for each 5 millisecond block, and compares
that level with THRESHOLD_LEVEL. bvi_seg decides that it has voice activity
if the level is above THRESHOLD_LEVEL; otherwise it considers the current
block as containing silence.
The value to be assigned to THRESHOLD_LEVEL depends on the level of the
incoming signal, but should be between -40 dBm and -25 dBm. Use a value nearer
-25 when the incoming signal has a higher level (that is, it is louder) than a
signal for which -40 is most appropriate.
Determine the best value of THRESHOLD_LEVEL as follows:
- Try the IBM-supplied value.
- If the utility is not finding the gaps between segments (that is, it reports
one long voice segment or fewer voice segments than you would expect), increase
THRESHOLD_LEVEL by 5 dB and try again.
- If the utility is having difficulty finding voice in the input (that is, it
finds only one or two short segments), decrease THRESHOLD_LEVEL by 5 dB and
try again.
- You will probably be able to find a value in the range -40 to -25 to
suit your input data. You can fine-tune the value by 1 or 2 dB, but this will
probably not be necessary.
- bvi_rec reports the maximum and minimum level for the recorded voice.
Set THRESHOLD_LEVEL above the minimum level by about 25% of the difference
between the maximum and the minimum. For example, if the minimum level is
-50 dBm and the maximum level is -10 dBm, the difference is 40 dB, and 25% of
that (10 dB) above the minimum gives a THRESHOLD_LEVEL of -40.
- If you cannot achieve good segmentation by setting THRESHOLD_LEVEL in the
range -40 to -25, your input data is probably too loud or too quiet.
- MARGIN_TIME (supplied value: 20)
- Specifies the time (in milliseconds) that bvi_seg adds before the detected
start and after the detected end of a segment. Because the level detection
algorithm (using THRESHOLD_LEVEL) only cuts in at a certain level, setting
MARGIN_TIME to a non-zero value ensures that quieter voice activity immediately
before the point of detection is captured.
- END_MARGIN_TIME (supplied value: 20)
- Specifies the silence time (in milliseconds) that bvi_seg adds after
each voice segment. If this parameter is omitted, MARGIN_TIME is used for
both the leading and the trailing margin.
- START_TIME (supplied value: 50)
- Specifies the minimum length (in milliseconds) of voice activity for
bvi_seg to decide that a segment has started. Voice activity is defined as
an audible signal above the value specified for THRESHOLD_LEVEL.
Set START_TIME to a value that is less than the length of the shortest
utterance at the start of a segment. If you find that the start of one or
more segments is being ignored (that is, segments start too late), decrease
START_TIME. If, however, you find that bvi_seg is recognizing short
periods of background noise as segments, increasing START_TIME will probably
fix the problem.
- STOP_TIME (supplied value: 1500)
- Specifies the minimum length (in milliseconds) of silence for bvi_seg
to decide that a segment has ended. Silence is defined as a signal whose level
is not above the value specified for THRESHOLD_LEVEL.
STOP_TIME must be greater than the largest silence gap that can naturally
occur within a segment, and less than the silence delimiter gap between
segments. It is recommended that a 5 second silence gap between segments be
used with a STOP_TIME of 1.5 seconds (1500 ms). This prevents natural
intra-segment gaps of up to 1.5 seconds from being picked up as inter-segment
gaps.
- GLITCH_TIME_1 (supplied value: 25)
- Specifies the maximum length (in milliseconds) of silence that can be
tolerated within START_TIME before bvi_seg decides that what it has
heard so far is not the start of a valid voice segment. This prevents inter-word
or inter-syllable gaps at the start of a segment from fooling bvi_seg into
thinking that it has not seen the start of a segment.
- GLITCH_TIME_2 (supplied value: 250)
- Specifies the maximum length (in milliseconds) of voice activity that
can be tolerated within STOP_TIME before bvi_seg decides that what it has
heard so far is not the end of a valid voice segment. This prevents noise
pulses in the inter-segment silence gaps from fooling bvi_seg into thinking
that it has not seen the end of a segment.
- TRACE (supplied value: 0)
- Specifies whether tracing is on or off. When tracing is on, trace
information is written to the screen. For normal operation, set TRACE to 0. If
you need to analyze the segmentation activity on a 5 ms block-by-block basis,
set TRACE to 1. Internal information, such as the level of each block and
state transition information, is then displayed for each block.
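The way THRESHOLD_LEVEL, START_TIME, STOP_TIME, GLITCH_TIME_1, GLITCH_TIME_2,
and the margin parameters interact can be illustrated with a small state
machine. The following Python sketch is not the bvi_seg implementation, which
is not documented here. It is one plausible reading of the parameter
descriptions above, and it makes several assumptions: the function name
find_segments and the state names are hypothetical, tolerated glitches count
toward START_TIME and STOP_TIME, and a segment still open at the end of the
input is closed there.

```python
BLOCK_MS = 5  # the signal level is evaluated in 5 ms blocks


def find_segments(voiced, start_time=50, stop_time=1500,
                  glitch_time_1=25, glitch_time_2=250,
                  margin_time=20, end_margin_time=None):
    """Return (first_block, end_block) pairs; end_block is exclusive.

    voiced holds one boolean per 5 ms block, True where the block level
    is above THRESHOLD_LEVEL. All times are in milliseconds.
    """
    if end_margin_time is None:        # END_MARGIN_TIME omitted
        end_margin_time = margin_time
    start_n = start_time // BLOCK_MS
    stop_n = stop_time // BLOCK_MS
    glitch1_n = glitch_time_1 // BLOCK_MS
    glitch2_n = glitch_time_2 // BLOCK_MS
    lead = margin_time // BLOCK_MS
    tail = end_margin_time // BLOCK_MS
    total = len(voiced)

    segments = []
    state = "idle"      # hypothetical states: idle, starting, voice, stopping
    seg_start = 0       # first voiced block of the candidate segment
    seg_end = 0         # first silent block of the candidate end
    glitch = 0          # length of the current run that contradicts the state

    for i, v in enumerate(voiced):
        if state == "idle" and v:
            state, seg_start, glitch = "starting", i, 0
        if state == "starting":
            if v:
                glitch = 0
                if i - seg_start + 1 >= start_n:
                    state = "voice"            # START_TIME satisfied
            else:
                glitch += 1
                if glitch > glitch1_n:
                    state = "idle"             # silence exceeded GLITCH_TIME_1
        if state == "voice" and not v:
            state, seg_end, glitch = "stopping", i, 0
        if state == "stopping":
            if v:
                glitch += 1
                if glitch > glitch2_n:
                    state = "voice"            # voice exceeded GLITCH_TIME_2
            else:
                glitch = 0
                if i - seg_end + 1 >= stop_n:  # STOP_TIME satisfied
                    segments.append((max(0, seg_start - lead),
                                     min(total, seg_end + tail)))
                    state = "idle"

    # Assumption: a segment still open at end of input is closed there.
    if state in ("voice", "stopping"):
        end = seg_end if state == "stopping" else total
        segments.append((max(0, seg_start - lead), min(total, end + tail)))
    return segments
```

With the voice file format described under VOICE_FILE_NAME (8 kHz, 16-bit
samples), one 5 ms block is 80 bytes, so a block number from this sketch
corresponds to a byte offset of 80 times that number.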
Procedure
Note: Although there is no theoretical limit on the number of segments
that can be processed in one batch, we recommend an upper limit of 5 minutes
of audio and 200 segments.
- After opening the BVI Custom Server Import window (see Starting the BVI custom server), start segmentation by typing bvi_seg on the command
line and pressing Enter. The utility:
- Reads the control parameters from the bvi.control file.
- Reads the voice file (in 5 ms blocks).
- Locates the voice segments by performing signal processing on the voice
data.
- Generates the index file.
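Reading the voice file in 5 ms blocks and calculating a level for each block
can be sketched in Python as follows. This is an illustration only. The
sample format (8 kHz, 16-bit linear, big-endian) is taken from the
VOICE_FILE_NAME description, but the calibration that bvi_seg uses to express
levels in dBm is not documented here, so the sketch assumes RMS levels in dB
relative to full scale instead. The names block_levels and suggest_threshold
are hypothetical. suggest_threshold applies the 25% rule of thumb given under
THRESHOLD_LEVEL.

```python
import math
import struct

SAMPLE_RATE = 8000                                   # Hz
BYTES_PER_SAMPLE = 2                                 # 16-bit linear
BLOCK_MS = 5
SAMPLES_PER_BLOCK = SAMPLE_RATE * BLOCK_MS // 1000   # 40 samples
BLOCK_BYTES = SAMPLES_PER_BLOCK * BYTES_PER_SAMPLE   # 80 bytes


def block_levels(data):
    """Yield the RMS level of each complete 5 ms block of raw voice data.

    Levels are in dB relative to full scale (an assumption; the dBm
    calibration used by bvi_seg is not known). A block of digital
    silence yields -inf.
    """
    fmt = ">%dh" % SAMPLES_PER_BLOCK     # big-endian signed 16-bit samples
    for off in range(0, len(data) - BLOCK_BYTES + 1, BLOCK_BYTES):
        samples = struct.unpack(fmt, data[off:off + BLOCK_BYTES])
        rms = math.sqrt(sum(s * s for s in samples) / SAMPLES_PER_BLOCK)
        yield 20 * math.log10(rms / 32768) if rms > 0 else float("-inf")


def suggest_threshold(min_level, max_level):
    """Return a level 25% of the way from min_level up to max_level."""
    return min_level + 0.25 * (max_level - min_level)
```

For example, suggest_threshold(-50, -10) gives -40.0, which matches the
example in the THRESHOLD_LEVEL description.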
You can now use bvi_desc to add information about the voice segments
to the description file.
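To check the result of a segmentation run, you can convert the byte offsets
in the index file into times. The following Python sketch assumes that the two
numbers on each index line are decimal and separated by white space; the
description of INDEX_FILE_NAME states only that they are ASCII-format byte
offsets, so adjust the parsing if your index file differs. The function name
parse_index is hypothetical.

```python
BYTES_PER_SECOND = 8000 * 2   # 8 kHz, 16-bit samples


def parse_index(lines):
    """Return (start_byte, end_byte, duration_seconds) for each index line.

    Assumes two whitespace-separated decimal numbers per line; lines that
    do not have exactly two fields are skipped.
    """
    segments = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue
        start, end = int(fields[0]), int(fields[1])
        segments.append((start, end, (end - start) / BYTES_PER_SECOND))
    return segments
```

You can pass an open file directly, for example parse_index(open("bvi.index")),
because a text file iterates line by line.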