High-quality voice data

Sampling rate

All voice segments stored in the Blueworx Voice Response database use an 8 kHz sampling rate, consistent with standards used for telephony transmission. The Voice Segment window lets you digitally input data from other sources, but converts it to 8 kHz if necessary. There is no advantage to using sampling rates other than 8 kHz when recording new voice segments using the Voice Segment window. Similarly, the command line utilities, bvi_aiff and bvi_wav, convert any sampling rate greater than 8 kHz to the required 8 kHz rate.

Source format

Use the best-quality source available for your voice segments, and import them into Blueworx Voice Response in 16-bit PCM (linear) format at an 8 kHz sampling rate. One way to do this is to play studio-quality DAT tape through the line-in of the Ultimedia adapter, with the Ultimedia format set to 16-bit PCM. Alternatively, you may already have 16-bit PCM voice segments as files that can be imported directly into the Voice Segment Editor. The editor can change sampling rates as required, although a change in sampling rate usually introduces slight distortion. Whenever possible, therefore, use an 8 kHz sampling rate for imported voice data.
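As an illustration, the format of a WAV file can be checked before import to see whether a sampling-rate conversion would be triggered. The following is a minimal sketch that uses only the Python standard library. The function names describe_wav and needs_conversion are hypothetical and are not part of Blueworx Voice Response.

```python
import wave

TARGET_RATE_HZ = 8000      # sampling rate stated in the text
TARGET_SAMPLE_WIDTH = 2    # bytes per sample, that is, 16-bit PCM


def describe_wav(path):
    """Return (sample_rate_hz, sample_width_bytes, channels) for a WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth(), wav.getnchannels()


def needs_conversion(path):
    """Return True if the file is not already 16-bit PCM at 8 kHz."""
    rate, width, _channels = describe_wav(path)
    return rate != TARGET_RATE_HZ or width != TARGET_SAMPLE_WIDTH
```

A file for which needs_conversion returns True would have its sampling rate or sample format changed on import, with the slight distortion described above.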

Dynamic range

When you use the voice segment editor or the batch voice input utility to record voice segments via the Ultimedia adapter, with an audio source connected to its line input, you may find that the audio signal is small relative to the available ‘dynamic range’. 16-bit PCM allows signal levels of up to 32K, whereas typical input signals from the Ultimedia adapter may have an amplitude of around 2K. When using 5:1 compression, the best quality is obtained if the input signal occupies as much of the 32K range as possible without signal peaks exceeding the available limits. You can achieve this with an external preamplifier, or by using the MAXIMIZE option of the voice segment editor or batch voice input utility, which digitally scales the input signal to occupy 90% of the full range.

Note that the maximize button of the voice segment editor is only enabled when operating in 16-bit PCM mode.
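The scaling described for the MAXIMIZE option can be illustrated with a short sketch. This is not the product's implementation. It assumes simple peak normalization, in which every sample is multiplied by one gain value chosen so that the largest peak reaches 90% of the 16-bit range. The function name maximize is hypothetical.

```python
FULL_SCALE = 32767        # largest positive 16-bit PCM sample value
TARGET_FRACTION = 0.90    # the 90% figure quoted in the text


def maximize(samples, fraction=TARGET_FRACTION):
    """Scale 16-bit PCM samples so the largest peak sits at `fraction` of full scale."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)          # silence: nothing to scale
    gain = fraction * FULL_SCALE / peak
    return [int(round(s * gain)) for s in samples]


# Input with peaks near the 2K level mentioned in the text.
quiet = [0, 1000, -2000, 1500]
loud = maximize(quiet)
print(max(abs(s) for s in loud))
```

Because a single gain is applied to all samples, the relative levels within the segment are preserved and only the overall amplitude changes.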

Filters

When you record a high-quality input signal for use over the telephone, all frequencies above 4 kHz must be filtered out to allow transmission at the digital 8 kHz rate. (The voice segment editor does this automatically when it stores the segment in the database.) Loss of these high frequencies can make the signal sound relatively dull. You can improve this by using the Boost button of the voice segment editor before saving the recorded segment. This increases the volume of frequencies in the range 1.5 kHz to 4 kHz by 2 dB, and decreases the volume of frequencies in the range 500 Hz to 1.5 kHz by 2 dB. The same effect can be achieved with the “Boost” option of the batch voice input utility, where the boost amount can be set to any value.

Note that the boost button of the voice segment editor is only enabled when operating in 16-bit PCM mode.

Recording directly using a microphone

A direct microphone input can provide excellent quality input. However, the pSeries computer must be within 10-15 feet (maximum) of the microphone in order to minimize electrical noise pick-up. This may be difficult to achieve in a studio environment because fan and disk noise prohibit the pSeries computer from being in the same room as the microphone.

Using a recording studio

For the best results when recording voice segments, keep to the following rules:

Use a professional recording studio with an anechoic chamber if you want segments of the highest possible quality. A good acoustic ambience is important (a normal office has too much reverberation).
Keep background noise to an absolute minimum. Avoid even low-level noise, such as that generated by cooling fans in personal computers and similar machines.
If you are editing segments in the studio, do not put absolute silence between segments, as this sounds unnatural. Instead, insert room-tone silence breaks (background studio ambient sound).
Half a second of silence at the beginning and end of each segment is recommended.
Record segments as a continuous stream of audio with a silence gap between consecutive segments. The recommended silence gap is five seconds, because this allows the batch voice import utility to distinguish the silence gaps between segments from the natural gaps that occur within segments.
If a mistake is made during the recording of a segment, just stop, wait for five seconds (or whatever inter-segment gap you have decided to use) and then re-record the segment. Bad segments can be removed by the voice segment editor or the batch voice import utility.
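The reason for a long inter-segment gap can be shown with a small sketch of gap-based splitting. This is not the algorithm used by the batch voice import utility, which is not described here. It assumes a simple amplitude threshold, and the function name split_on_silence and the threshold value are hypothetical.

```python
SAMPLE_RATE_HZ = 8000


def split_on_silence(samples, gap_seconds=5.0, threshold=200, rate=SAMPLE_RATE_HZ):
    """Split samples into segments separated by quiet runs of at least gap_seconds.

    Returns a list of (start, end) index pairs, with end exclusive.
    Quiet runs shorter than the gap are treated as natural pauses
    inside a segment and do not cause a split.
    """
    min_gap = int(gap_seconds * rate)
    segments = []
    start = None       # index where the current segment began
    last_loud = None   # index of the most recent sample above the threshold
    for i, s in enumerate(samples):
        if abs(s) > threshold:
            if start is None:
                start = i
            elif i - last_loud > min_gap:
                # The quiet run was long enough: close the previous segment.
                segments.append((start, last_loud + 1))
                start = i
            last_loud = i
    if start is not None:
        segments.append((start, last_loud + 1))
    return segments
```

With a five-second gap, a pause of a second or two within a spoken phrase stays inside one segment, while the deliberate gaps between segments produce separate entries. This sketch trims each segment to its first and last loud sample, so the recommended half second of leading and trailing silence would need to be added back when the segments are extracted.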

If you are working with a studio that has reasonably sophisticated audio processing capabilities, it is wise to apply the audio boost function at source rather than with the batch voice import utility. The best frequency-shaping function to apply is defined in the ITU P-Series Blue Book (Volume 5 1988) in Supplement No. 10 (P332). This is the preferred response for a telephone microphone as determined by user trials. It can be applied to flat-spectrum audio to achieve the same result as if the voice were being spoken through a telephone.

The frequency-shaping function recommended by the ITU boosts the treble and cuts the bass in a signal. This restores some of the brightness lost when a full-bandwidth audio signal is low-pass filtered at 3400 Hz prior to sampling at 8 kHz. It is similar to the BOOST option of the voice segment editor or the batch voice import utility. Make sure that the shaping is not applied twice, once in the studio and again by one of Blueworx Voice Response’s voice utilities.

The ITU-recommended frequency response characteristic is as follows:

0 dB reference at 1 kHz
Below 1 kHz, 4 dB/octave roll-off to 200 Hz
Below 200 Hz, 8 dB/octave roll-off
Above 1 kHz, smooth increase to +7 dB peak at 2600 Hz
Sharp cutoff at 3.4 kHz

Responses for spot frequencies are shown in Table 1.

Table 1. Responses for spot frequencies
Frequency	Response attenuated or amplified by
50 Hz	-20 dB
100 Hz	-12 dB
200 Hz	-4.5 dB
400 Hz	-2 dB
800 Hz	-1 dB
1000 Hz	0 dB
1500 Hz	+2.5 dB
2000 Hz	+6 dB
2500 Hz	+7 dB
3000 Hz	+6 dB
3400 Hz	0 dB
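The spot values in Table 1 can be used to estimate the response at intermediate frequencies. The following sketch interpolates linearly in dB on a logarithmic frequency axis. That interpolation method is an assumption made for illustration and is not specified by the ITU supplement or by this document. The function names response_db and gain_factor are hypothetical.

```python
import math

# Spot frequencies (Hz) and responses (dB) copied from Table 1.
SPOT_RESPONSE_DB = [
    (50, -20.0), (100, -12.0), (200, -4.5), (400, -2.0), (800, -1.0),
    (1000, 0.0), (1500, 2.5), (2000, 6.0), (2500, 7.0), (3000, 6.0),
    (3400, 0.0),
]


def response_db(freq_hz):
    """Estimate the response in dB by interpolating between table entries."""
    lowest, highest = SPOT_RESPONSE_DB[0][0], SPOT_RESPONSE_DB[-1][0]
    if not lowest <= freq_hz <= highest:
        raise ValueError("frequency is outside the range covered by Table 1")
    for (f0, g0), (f1, g1) in zip(SPOT_RESPONSE_DB, SPOT_RESPONSE_DB[1:]):
        if f0 <= freq_hz <= f1:
            # Position between the two spot frequencies on a log axis (0 to 1).
            t = math.log(freq_hz / f0) / math.log(f1 / f0)
            return g0 + t * (g1 - g0)


def gain_factor(freq_hz):
    """Convert the dB response to a linear amplitude multiplier."""
    return 10 ** (response_db(freq_hz) / 20)
```

For example, gain_factor(1000) gives a multiplier of 1.0 at the 0 dB reference, and values above 1 kHz give multipliers greater than 1 up to the peak. The table does not reproduce the exact +7 dB peak at 2600 Hz, so estimates near the peak are approximate.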

To get the best results when recording data for use as background music:

Don’t use the BOOST option of the voice segment editor or the batch voice import utility.
Filter the signal using a graphic equalizer before it reaches the Ultimedia adapter.