Overview of voice signal processing

Normally, voice is transmitted to the human ear by means of an acoustic wave travelling through the air at the speed of sound. A conventional analog telephone transmits sound through a wire as an electrical signal which travels at close to the speed of light. To do this, the acoustic signal generated by the human vocal cords must first be converted to an electrical signal, and then converted back to an acoustic form before it can be heard by the human ear. These two conversions are done by the telephone mouthpiece and earpiece respectively.

The electrical signal sent over the telephone wire by a conventional telephone is analog. That is, it is represented as a voltage which varies continuously within a given range (for example, 0 to 1 volt), where the louder the signal, the higher the voltage. The signal is described as analog because the voltage can take any value in the possible range, that is, an infinite number of possible values. As well as varying continuously between the voltage limits, an analog signal also varies continuously over time, with no requirement to change only at fixed time intervals.

Although analog signals are the easiest to handle in a simple telephone system, they give rise to a number of problems if they are to be stored or processed by computer, or if they are to be sent over a long distance. Sending an analog signal over a long distance rapidly reduces the signal strength and can increase the background noise level, both of which lead to severe quality degradation. For these reasons, almost all modern telephone systems are based on the concept of digital processing of voice, where the signal is converted to a form which can be handled by standard digital computers as a sequence of numbers. This means that voice can be stored on a standard computer as a set of numerical values, just like a spreadsheet, and an operation such as increasing the volume of a segment is equivalent to multiplying every number in the spreadsheet by a certain value.
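For example, the following is a minimal sketch in Python of how increasing the volume of a stored segment amounts to multiplying every sample by the same number. The sample values and the gain factor are invented for the illustration:

    # A digitized voice segment is just a sequence of numbers.
    # The sample values and gain used here are invented for the illustration.
    samples = [0, 1200, 2400, 1800, -600, -2200, -900, 0]

    def amplify(samples, gain):
        # Increasing the volume is just multiplying every number by the same factor.
        # A real implementation would also limit the result to the valid sample range.
        return [int(s * gain) for s in samples]

    louder = amplify(samples, 1.5)   # the same segment, roughly 1.5 times louder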

To convert an analog signal into a digital form, two steps are needed:

First, the analog signal is sampled at a fixed rate to break it into a sequence of analog samples which can be handled individually. For the highest possible audio quality (such as CD audio), the sampling rate is very high, typically 44,100 times per second (44.1 kHz), whereas for the telephone, where a much lower voice quality is acceptable, the sampling rate is only 8,000 times per second (8 kHz). This 8 kHz rate is the fixed sampling rate now used by telephone systems throughout the world.

Note: The sampling rate is one factor limiting the voice quality that can be achieved over a telephone link, because it limits the frequency response (the highest audio frequency that can be carried) to one half of the sampling rate, that is, 4 kHz. The human ear can detect frequencies up to about 18 kHz; dogs and bats can detect even higher frequencies.

Second, each analog sample is converted to a number to allow it to be handled by the digital computer. For example, if the input signal has a range of 0 to 1 volt and 16-bit numbers are used to represent the digital form of each sample, the digital value 0 would represent 0 volts and the digital value 65535 would represent 1 volt, with a linear scale for intermediate values (for example, 32767 = 0.5 volt).

Note: Analog voltages are more usually transmitted with a center value of zero and, say, maximum and minimum values of +0.5 volt and -0.5 volt respectively. This corresponds to a two's complement digital numbering system which, for 16-bit values, ranges from +32767 down to -32768 with a center value of zero.
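As a minimal sketch of both steps, the following Python fragment samples a generated 1 kHz test tone at 8 kHz and converts each sample to a 16-bit two's complement value, using the signed (plus or minus 0.5 volt) convention from the note above. The test tone, the 10 ms duration, and the function names are invented for the illustration:

    import math

    SAMPLE_RATE = 8000        # telephone sampling rate, samples per second

    def sample_tone(frequency_hz, duration_s):
        # Step 1: take one analog value every 1/8000 of a second.
        # A sine wave stands in for the analog voltage (-0.5 V to +0.5 V).
        count = int(SAMPLE_RATE * duration_s)
        return [0.5 * math.sin(2 * math.pi * frequency_hz * n / SAMPLE_RATE)
                for n in range(count)]

    def quantize(voltage):
        # Step 2: map an analog value in the range -0.5 V..+0.5 V onto a
        # 16-bit two's complement integer (-32768..+32767).
        code = round(voltage * 65535)
        return max(-32768, min(32767, code))

    analog = sample_tone(1000, 0.01)            # 10 ms of a 1 kHz tone = 80 samples
    digital = [quantize(v) for v in analog]     # 80 sixteen-bit numbers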

A special technique known as companding is used to reduce the number of bits for each voice sample from 16 to 8. This halves the amount of data to be processed and stored. Companding applies a logarithmic conversion to each sample, resulting in a signal format known as μ-law (used in North America, Japan, and some other countries) or A-law (used in Europe, Latin America, and some other countries). These 8-bit samples can then be stored, transmitted, and processed, and a reverse (anti-log) conversion is applied at the receiver to reproduce the original signal with very little loss of quality.
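The logarithmic conversion can be sketched as follows, using the continuous form of the μ-law curve (with μ = 255) on samples normalized to the range -1.0 to +1.0. This illustrates the principle only; the μ-law and A-law formats standardized for telephony use a segmented approximation of the curve and a specific 8-bit layout, which this sketch does not reproduce:

    import math

    MU = 255.0   # the "mu" parameter of the mu-law curve

    def mu_law_compress(x):
        # Map a linear sample in -1.0..+1.0 onto the logarithmic mu-law scale,
        # then quantize to 8 bits (256 levels). Quiet samples get many levels,
        # loud samples get few, which preserves detail in quiet speech.
        y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
        return int(round((y + 1.0) / 2.0 * 255))        # one byte, 0..255

    def mu_law_expand(code):
        # The reverse (anti-log) conversion applied at the receiver.
        y = (code / 255.0) * 2.0 - 1.0
        return math.copysign((math.exp(abs(y) * math.log1p(MU)) - 1.0) / MU, y)

    byte = mu_law_compress(0.25)      # the 8-bit value that would be transmitted
    restored = mu_law_expand(byte)    # close to 0.25, with a small quantization error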

Note that, almost without exception, T1 digital trunks are encoded as μ-law and E1 trunks as A-law. Also note that μ-law and A-law signals are not compatible; a signal must be converted to move from one format to the other.
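A conversion between the two formats goes through a linear intermediate value: decode each μ-law byte back to a linear sample, then re-encode it with the A-law curve (A = 87.6). The sketch below reuses mu_law_expand and the math import from the previous fragment and, as before, uses the continuous form of the curve rather than the standardized segmented layout:

    A = 87.6   # the "A" parameter of the A-law curve

    def a_law_compress(x):
        # Map a linear sample in -1.0..+1.0 onto the A-law scale (8 bits).
        ax = abs(x)
        if ax < 1.0 / A:
            y = A * ax / (1.0 + math.log(A))
        else:
            y = (1.0 + math.log(A * ax)) / (1.0 + math.log(A))
        return int(round((math.copysign(y, x) + 1.0) / 2.0 * 255))

    def mu_to_a(mu_byte):
        # Transcode one sample: mu-law byte -> linear sample -> A-law byte.
        return a_law_compress(mu_law_expand(mu_byte))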

When Blueworx Voice Response plays voice to, or records voice from, the telephone line, it is at the standard 8 kHz, 8-bit rate (μ-law for T1, A-law for E1). When the data is stored on disk, it can be either uncompressed (always 8 kHz, 8-bit μ-law or A-law) or compressed. Blueworx Voice Response applies a compression algorithm to the signal to reduce its size by a factor of five. When compressed voice is played to the line, Blueworx Voice Response decompresses it to reproduce the original 8 kHz, 8-bit signal.

Blueworx Voice Response uses a compression algorithm known as GSM (used in the digital mobile phone system of the same name). This gives very good quality at a compression ratio of five to one; that is, the data rate is reduced from 8,000 to 1,600 bytes per second. Other compression techniques, such as ADPCM, are also used in the voice processing industry to reduce the size of voice data. Blueworx Voice Response uses only the five-to-one GSM compression algorithm, which is supplied as part of Blueworx Voice Response.
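These figures follow directly from the rates described earlier: 8,000 samples per second at one byte per sample is 8,000 bytes per second uncompressed, and five-to-one compression reduces this to 1,600 bytes per second. As a quick sketch of what this means for storage (the one-minute message length is just an example):

    UNCOMPRESSED_RATE = 8000 * 1      # 8,000 samples/second x 1 byte per sample
    COMPRESSION_RATIO = 5             # the GSM five-to-one compression described above
    COMPRESSED_RATE = UNCOMPRESSED_RATE // COMPRESSION_RATIO   # 1,600 bytes/second

    message_seconds = 60              # example: a one-minute recorded message
    print(message_seconds * UNCOMPRESSED_RATE)   # 480000 bytes uncompressed
    print(message_seconds * COMPRESSED_RATE)     # 96000 bytes compressed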

The advantages of using compressed voice are that you use less disk storage, less system memory, less processing time, and less bus bandwidth. The disadvantage is that voice quality is slightly reduced. Depending on your application, this may or may not be a problem, although you can take steps to ensure that the quality of compressed voice is as high as possible (see the Blueworx Voice Response for AIX: Application Development using State Tables information).