Tutorial Introduction

This primer is intended to take the mystery out of the process of reducing noise in an audio signal. This process is known as “audio enhancement.” The information is presented in terms that are easy to understand and is intended as an introduction for those unfamiliar with audio enhancement.

The current state of audio enhancement involves multiple microprocessors performing millions, even billions, of mathematical operations a second. These operations implement algorithms that perform sophisticated digital analyses of the audio and then modify the signal to reduce the components that the algorithms determine to be noise. DAC equipment helps to insulate you from the complexity of this process. Through the use of simplified controls and software interfaces, you do not have to be an electrical engineer to achieve optimum voice enhancement results.

After reading this primer, we trust you will have a basic understanding of the principles of sound and what we can accomplish at DAC with our audio enhancement products. For more thorough training in audio enhancement theory and application, DAC offers the “DAC School of Forensic Audio.” Please contact DAC for more information regarding dates and availability.



Sound

If a tree falls in the forest and there is no one there to hear it, does it make a sound? As a matter of fact, it does.

Sound is simply pressure waves in the air that are caused by some disturbance (such as a tree falling, a car crash, a jackhammer, etc.). In other words, something happens that causes the air to respond in a certain way that affects our eardrums so we sense sound. At 70° F, these sound waves travel through the air at about 1130 feet/second or about 770 miles/hour. That's fast!

Table 1. Typical Sound Pressure Levels

Level (dBA SPL)   Example
135               Threshold of pain
125               Jackhammer
115               Car horn
95                Subway train
75                Street traffic
65                Conversation
55                Business office
45                Living room
35                Library reading room
25                Bedroom at night
15                Broadcast studio
5                 Threshold of hearing

Sound has two characteristics that are important to us when we are enhancing audio: energy and frequency. Energy is the strength of the signal, measured in decibels (dB). Many times this is characterized as “volume.” High dB is loud. Low dB is quiet. Table 1 lists some common sounds and their dB levels.
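Because dB is a logarithmic scale, a difference in level corresponds to a ratio of sound intensity. The short Python sketch below is only an illustration. The function name is ours, and it assumes the usual convention that every 10 dB represents a tenfold change in intensity.

```python
def intensity_ratio(db_difference):
    """Convert a level difference in dB to a sound-intensity ratio (assumes 10 dB = 10x)."""
    return 10 ** (db_difference / 10)

conversation_db = 65   # from Table 1
library_db = 35        # from Table 1

# A conversation is 30 dB above a library reading room,
# which works out to roughly a thousand times the intensity.
ratio = intensity_ratio(conversation_db - library_db)
print(f"{conversation_db - library_db} dB difference = about {ratio:.0f}x the intensity")
```

So the modest-looking gap between two rows of the table can hide a very large difference in actual sound energy.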

Frequency describes how rapidly the disturbance repeats over time. Its unit used to be called cycles per second (cps), but now we call it Hertz (Hz). Frequency affects our ability to hear the sound. Remember, our eardrums respond to this disturbance in the air. A good ear typically responds to frequencies between 20 and 20,000 Hz. As we age, or if we have damaged eardrums, the higher frequencies usually become less audible. Law enforcement officers who years ago took target practice without good quality hearing protection frequently demonstrate loss of high frequency hearing. (This is not all bad, as we will explain in the chapter on voice.)

In later chapters we will discuss what we can do with these characteristics to concentrate on voice enhancement.

Key Points to Remember:

  • Sound is a function of pressure waves moving through the air as a result of some disturbance.
  • Sound has two defining characteristics: energy and frequency.
  • The human ear typically responds to frequencies between 20 and 20,000 Hz.

Noise

Now that we understand how sound is created, let's talk about a special kind of sound: noise. For our purposes, anything but the voice signal in which we are interested is classified as noise. We are surrounded by noise every day - in the house, at the office, while driving in the car, even on the golf course. In some places they even add noise so that you don't notice other noises! Anyway, we are constantly exposed to noise. Most of the time, it just makes our good tape recordings a little bit harder to understand. But sometimes we wonder if there are actually voices on the tape because the noises are so loud!

For this primer, we are going to classify noises into three categories: additive, convolutional, and distortion. The sketch below shows a typical recording scenario and some of these noises. We will discuss each of the three categories separately.

Figure 1. Noise Model

Additive

The most common of the noises we encounter are additive. These noises come from anything that generates an audible sound. The list is endless: engines, motors, fans, fluorescent light buzz, radios, TV, jukeboxes, bands, glass breaking, paper rustling, wind, and so on. For audio enhancement purposes, these additive noises can be classified into two general categories: random and time-correlated. Certainly you know what random noise is like. Some forms of music come to mind. So do sounds such as a nylon jacket rubbing against a “concealed” microphone. Sound familiar? Many noises we encounter are classified as random. Time-correlated noises, though, have a repeatable, predictable “pattern.” Examples are tones, power line hum, and fluorescent light buzz. These noises can be effectively reduced using one-channel (1CH) adaptive filtering, and often this processing alone is enough to make a recording intelligible, even in the presence of random noises that cannot be reduced by the process.
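One way to see the difference between “time-correlated” and “random” is to compare a signal with a delayed copy of itself. The Python sketch below is a minimal illustration under our own assumptions: the sample rate, the function name, and the use of computer-generated random numbers as a stand-in for random noise are all hypothetical choices, not part of any DAC product.

```python
import math
import random

SAMPLE_RATE = 6000   # samples per second (assumed for this sketch)

def normalized_autocorrelation(signal, lag):
    """Compare a signal with itself `lag` samples later; 1.0 means a perfect repeat."""
    n = len(signal) - lag
    energy = sum(s * s for s in signal[:n])
    return sum(signal[i] * signal[i + lag] for i in range(n)) / energy

# One second of 60 Hz power line hum and one second of random noise.
hum = [math.sin(2 * math.pi * 60 * i / SAMPLE_RATE) for i in range(SAMPLE_RATE)]
random.seed(0)
hiss = [random.gauss(0.0, 1.0) for _ in range(SAMPLE_RATE)]

one_hum_cycle = SAMPLE_RATE // 60   # 100 samples
hum_score = normalized_autocorrelation(hum, one_hum_cycle)
hiss_score = normalized_autocorrelation(hiss, one_hum_cycle)
print(f"hum: {hum_score:.3f}   random noise: {hiss_score:.3f}")
```

The hum scores very close to 1 because it repeats itself exactly one cycle later, while the random noise scores near 0. That predictability is what a filter can exploit.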

Convolutional

The second type of noise is convolutional. These noises are the result of room acoustics and only exist if there is a sound source. In a room with no sound source, there is no convolutional noise. If a person speaks or some additive noise source is present, depending on the room acoustics, there may be an echo or reverberation. This echo is a convolutional noise. Sometimes it is so strong that it interferes with hearing the desired audio signal. It is also a repeatable, predictable signal (as described above) and thus a time-correlated noise. In “hard rooms” - for example, rooms with sheetrock walls and ceiling, tile floor, no soft furniture, and nothing to absorb sound waves - a sensitive microphone will pick up echoes that our ears might ignore, making the recording much worse than expected. Jail cells and interview rooms are good examples of hard rooms, and are the source of many bad recordings encountered by law enforcement. But remember, echoes are time-correlated, and 1CH adaptive filtering can reduce time-correlated noises. Again, this processing is often sufficient to allow an otherwise unintelligible recording to be understood.
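The name “convolutional” comes from the mathematical operation that describes what a room does to a sound. The Python sketch below is a simplified illustration: the “room” is a hypothetical one with a single echo, and the function and variable names are ours.

```python
def convolve(signal, impulse_response):
    """Pass a signal through a room described by its impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

# Hypothetical hard room: the direct sound, then one echo
# three samples later at half the strength.
room = [1.0, 0.0, 0.0, 0.5]

click = [1.0, 0.0, 0.0, 0.0, 0.0]     # a single short sound
silence = [0.0, 0.0, 0.0, 0.0, 0.0]   # no sound source at all

recorded_click = convolve(click, room)
recorded_silence = convolve(silence, room)
print(recorded_click)     # the click followed by its echo
print(recorded_silence)   # all zeros: no source, no echo
```

The echo appears only when there is a sound to echo, and it is always the same delayed, scaled copy of the source. That fixed relationship is why echoes count as time-correlated noise.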

Distortion

The third type of noise is distortion. The equipment we use to make the recording introduces this noise. Inexpensive microphones, bad cabling, poor quality recorders, cheap tapes, weak batteries - all contribute to distortion. Distortion, though, cannot be reduced without physically altering the recorded sounds. This could raise questions of “doctoring” if we tried to use the altered recording in court. We are lucky, though, because distortion typically has only a minimal effect on voice intelligibility; the main impact is on voice quality. Since we cannot do much with it, we will not address distortion any further, except to say that we should pay attention to all of the components of our system. Remember the old cliché about the weakest link; it applies here, too.

As with sound, “noise” can be defined in terms of energy level and frequency or frequency band (bandwidth). We will discover that some noises are easily reduced while others are not affected by current audio enhancement techniques.


Key Points to Remember:

  • Noise is divided into three major categories: additive, convolutional, and distortion.
  • Noise can be characterized as either “time correlated” (predictable) or “random.”
  • Like sound, noise has two defining characteristics: energy and frequency.

Voice

For the purpose of this primer, we are going to concentrate on trying to reduce noise to hear the voice. In some applications, there may be other compelling reasons to analyze the entire audio signal; but, for our purposes, we will just try to make the voices more intelligible. We are not interested in analyzing the voice, only in determining what is said.

The human voice communicates an abundance of information. Words are but a small part. The emotional state of the speaker, the person's gender, age, level of education, dialectical influences, and physical attributes are all communicated in the acoustic waveform. In order to preserve all of this information, the voice should be recorded and processed with as much fidelity as possible.

If you recall, we said that many people could hear audio signals up to 20,000 Hz. That's great! But three-fourths of these signals are classified, according to our definition, as noise. Pay attention here: the most important voice information is located between 200 and 5000 Hz! In most cases, audio signals outside this range will be considered “noise.” Sometimes a voice will have frequency characteristics outside this range so it may be necessary to adjust our processing frequency band. Remember these numbers, as they will continually come into play when we begin our process of reducing background noise. So as a rule of thumb, we can establish the voice frequency range as 200 to 5000 Hz.

Vowels typically have high energy concentrated below 3000 Hz. Consonants have lower energy and typically are distributed above 3000 Hz. Both are very important for word discrimination. Voice is also classified as a random audio signal because its components are relatively unpredictable.

The figure below shows a typical voice spectrum.

Figure 2. Typical Averaged Voice Spectrum

Note from the figure that there is voice information out to 10,000 Hz and beyond. However, what we find is that when we expand the upper frequency limit, the signal contains substantially more noise than voice information. There is a term that we use called signal-to-noise ratio, S/N or SNR. This is defined as:

S/N = signal (voice) energy / noise energy

If we only marginally increase the signal (voice) energy by expanding the bandwidth and substantially increase the noise level, the S/N ratio decreases, thus making it harder to understand the voices. We want the highest signal to noise ratio possible when we are trying to understand what is being said. We have found that voice quality and intelligibility are acceptable for a voice frequency band of 200 to 5000 Hz. Expanding beyond these limits has marginal value.
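To put numbers on this, the Python sketch below computes the signal-to-noise ratio as signal energy divided by noise energy and expresses it in dB, which is a common convention. The energy values are made up purely for illustration, and the function name is ours.

```python
import math

def snr_db(signal_energy, noise_energy):
    """Signal-to-noise ratio, expressed in dB."""
    return 10 * math.log10(signal_energy / noise_energy)

# Hypothetical energies in arbitrary units.
narrow_band = snr_db(signal_energy=100.0, noise_energy=50.0)    # 200 to 5000 Hz
wide_band = snr_db(signal_energy=105.0, noise_energy=200.0)     # expanded bandwidth

print(f"200 to 5000 Hz band: {narrow_band:.1f} dB")
print(f"expanded band:       {wide_band:.1f} dB")
```

In this made-up case, widening the band adds only 5% more voice energy but four times the noise, so the ratio drops from about +3 dB to about -3 dB and the voices become harder to understand.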

Key Points to Remember:

  • In addition to words, the human voice communicates an abundance of information including the speaker's emotions, gender, age, and dialectical influences.
  • The most important voice information is located in the frequency range between 200 and 5000 Hz. For forensic applications, anything outside this range is typically treated as noise.
  • The higher the signal-to-noise ratio (SNR), the higher the quality of the voice information available in a recording.

Monitoring and Recording

This chapter provides a brief discussion on recorders and how they impact audio enhancement. We will cover three types of recording devices that are commonly used in law enforcement: microcassette, cassette, and digital recorders.

Typical recording setups are shown in the block diagrams below:

The quality of each component is critical to the quality of the recording. Surely this is no surprise. It is like your stereo system at home. If you have the best amp, best CD player, and cheap speakers, what kind of sound can you expect? So, if you want a good recording, you have to pick up the sound and transmit it with good quality equipment, and then have a good quality recorder.

Microcassette

The microcassette is commonly used in law enforcement primarily because of its size. There are two major drawbacks to obtaining a good quality recording with this device: 1) the quality of the tape is generally poor, and 2) these units are typically bandlimited to around 3000 Hz, resulting in reduced frequency response. Remember, there is important voice information up to 5000 Hz. As a result, the recorded voice may not sound exactly like the person speaking. If you use microcassettes, it is recommended that you get a high quality microcassette playback machine that has pitch control, to account for varying tape speeds, and azimuth adjustment, to match the playback head to the recorded tracks. It is also recommended that you not use the built-in microphone, as these are often poor quality and typically not located at an optimum position for recording purposes.

Cassette

Cassette recorders are still the mainstay of law enforcement audio recording. There are many good quality units available. These recorders have bandwidths up to 10,000 Hz, which is more than adequate for recording voice. The highest recordable frequency is determined by two principal factors: tape speed and head gap. The faster the speed and the narrower the gap, the higher the recordable bandwidth. Many recorders have been modified to reduce the speed by one-half in order to allow twice as much time to be recorded onto tape. Unfortunately, changing design speeds can sometimes reduce the highest recordable frequency to below the desirable 5000 Hz.

Another problem with magnetic tape players comes in the form of wow and flutter. These are dynamic frequency variations in an analog recording. These effects are due to tape not passing over the record/playback head at a constant speed. Motor speed regulation, varying tape tension, and irregularities in pinch and backup roller shape all contribute. Wow is a low frequency variation (a few Hz), while flutter is a high frequency variation (up to hundreds of Hz). Wow and flutter can be introduced during both the recording and the playback process. The overall effect of the fluctuations is to produce an undesired “modulation” effect on the recorded audio. Substantial wow and flutter are audible as a vibrating, or nervous, overtone to the voice accompanied by a loss of audio crispness. Even modest levels of wow and flutter can impair enhancement and noise cancellation because the signal processor is forced to “chase” these dynamic noises.

Although it is not always possible, playback should be done on the same machine that did the recording. This can help minimize some of the specific equipment recording deficiencies.

Digital

Digital recorders (e.g., DATs, MiniDiscs, recordable CDs, FBIRDs, and SSABRs) are becoming more popular as the price decreases. These recording units offer exceptional quality for audio reproducibility. The incoming analog signal is converted into a sequence of numbers in an analog-to-digital (A/D) converter. The audio is sampled at rates up to 48,000 samples per second to define the signal precisely, with no introduction of wow or flutter distortion. These samples are stored as binary (0 or 1) numbers. The data can be recorded onto magnetic tapes, floppy disks, hard disks, or flash memory chips. Using a laser, the data can also be recorded optically. Even if the recording medium itself is noisy, it is still possible to distinguish between 1 and 0 to achieve a good quality playback.

A word of caution is warranted when using digital recorders. Some digital recorders may employ audio compression. For example, if a chip can store 1,000,000 bits of data, and the recorder produces 10,000 bits of audio data per minute of real time, the recorder can store 100 minutes of audio. Lossy compression schemes work by throwing away audio data that is deemed by the algorithm to be unimportant, allowing the chip to record more than 100 minutes of audio. Unless the compression is lossless, our ability to subsequently reduce or cancel noise from the recorded signal may be affected. Therefore, unless the compression algorithm is known to be lossless, it is recommended that any audio compression feature be disabled when making digital recordings that might require subsequent processing.
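The storage arithmetic in this example can be written out as a short Python sketch. The 4:1 compression ratio is a hypothetical figure chosen only to show the effect, and the function name is ours.

```python
def recording_minutes(capacity_bits, bits_per_minute, compression_ratio=1.0):
    """Minutes of audio that fit on a chip; a ratio above 1 means compression is on."""
    return capacity_bits * compression_ratio / bits_per_minute

uncompressed = recording_minutes(1_000_000, 10_000)
compressed = recording_minutes(1_000_000, 10_000, compression_ratio=4.0)  # hypothetical 4:1

print(f"no compression:  {uncompressed:.0f} minutes")
print(f"4:1 compression: {compressed:.0f} minutes")
```

The extra recording time is real, but with a lossy scheme it is paid for with discarded audio data that later processing can never get back.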

General

Remember that the recording, or even live monitoring, is only as good as the weakest component. In addition, the equipment should be cleaned, examined for damage, and tested prior to each use. Always use fresh batteries and a new, good quality tape. Whenever possible, we should make stereo recordings. The spacing between microphones should be about the size of your fist with the thumb extended. There are processing techniques that can use such stereo recordings to further reduce background interference.

Microphone placement is critical to making a good recording. You want as clear an air path to the mike as possible. Also remember that the audio level is inversely proportional to the square of the distance to the microphone. In other words, if the distance to the microphone is doubled, the audio level will be cut to one-fourth the original level. Suppose a person being recorded is five feet from the microphone and has a sound level of 60 dB. If he moves away to 10 feet from the microphone, the intensity of the sound drops to one-fourth of its previous level, and the sound level decreases by 6 dB to 54 dB. (Remember that dB is a logarithmic scale.) If the speaker moves to a distance of 20 feet from the mike, the intensity of the sound drops to one-sixteenth of the original level, and the sound level decreases to 48 dB. A man's voice at a long distance from the microphone becomes a low-frequency, muddled tone that blends into the background.
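The distance arithmetic above can be checked with a few lines of Python. This is a minimal sketch of the inverse-square rule only; the function name is ours, and it ignores room reflections and other real-world effects.

```python
import math

def level_at_distance(level_db, reference_distance, new_distance):
    """Sound level after moving from reference_distance to new_distance (inverse-square rule)."""
    intensity_ratio = (reference_distance / new_distance) ** 2
    return level_db + 10 * math.log10(intensity_ratio)

# A speaker measured at 60 dB when five feet from the microphone.
for feet in (5, 10, 20):
    print(f"{feet:>2} feet: {level_at_distance(60, 5, feet):.1f} dB")
```

Each doubling of the distance costs about 6 dB, which matches the 54 dB and 48 dB figures in the text.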

For body wires, always fasten the microphone to the outermost piece of clothing, fastening in such a way to prevent the clothing from rubbing against the mike. Place room mikes close to where the people are actually talking and away from strong noise sources. Good recordings are hard enough to obtain without our making matters worse by failing to take these reasonable steps before the recording is made. By taking these steps, we ensure that we will get the best possible results when we go back and process the recorded audio to reduce noises.

Key Points to Remember:

  • The quality of the voice recording depends directly on the quality of the recording equipment used. The best results are obtained with the highest quality recorders, microphones, and recording media.
  • The native limitations of a particular recording device must be considered to ensure maximum opportunity for subsequent voice enhancement processing.

Enhancement

Despite all of our efforts to have good equipment and properly plan for the recording of the conversations, background noise (over which we have little control) often makes the voices difficult to hear and/or understand. In this chapter, we will discuss processing the audio from tape recordings to reduce this intrusive noise. It is also important to note that these techniques and technology apply equally well to live monitoring situations since DAC equipment operates in real time.

There are a variety of audio filters that we can use to reduce noise. Remember, our principal goal is to understand what is being said. But if we also plan to use this tape in any legal proceedings, we must be careful that the voice sounds like the person recorded. This means that we must be careful what we do in the voice frequency range.

Set bandwidth

The first step in our enhancement process is to establish the frequency bandwidth. All DAC equipment has either adjustable bandwidth or is preset at the voice frequency range (200 to 5000 Hz). Since microcassette recorders cannot record audio above about 3000 Hz, processing to 5000 Hz is unnecessary. DAT recorders can record up to 20,000 Hz, but important voice information is 5000 Hz or less. There is no reason to process “excess” signals as they are mostly noise. So, first determine the highest voice frequency recorded and then set the equipment bandwidth accordingly.

Since we are discussing bandwidth, let's discuss a special type of law enforcement recording where the signal is automatically bandlimited. A recording of a telephone conversation obtained by wiretap has the signal bandlimited to about 3200 Hz by the equipment at the telephone company. Higher frequencies, which might make the voices more distinguishable, and easier to understand, simply are not there. Perhaps you have noticed that it is difficult to recognize the voices of certain people when they call you on the phone; the loss of the frequency information beyond 3200 Hz is the primary reason. Thus, when we process a recording of a telephone conversation, we are forced to set the upper frequency limit close to 3200 Hz so as not to process excess noise.
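The rule of thumb for setting bandwidth can be summarized in a short Python sketch. The function name and constants are ours, and the source limits are the approximate figures quoted in this primer, not specifications for any particular recorder.

```python
VOICE_LOW_HZ = 200
VOICE_HIGH_HZ = 5000

def processing_band(source_limit_hz):
    """Pick the processing band: the voice range, capped by what the source could record."""
    return VOICE_LOW_HZ, min(VOICE_HIGH_HZ, source_limit_hz)

# Approximate upper limits quoted in the text.
sources = {"microcassette": 3000, "telephone wiretap": 3200, "DAT": 20000}
for name, limit_hz in sources.items():
    low, high = processing_band(limit_hz)
    print(f"{name}: process {low} to {high} Hz")
```

Anything above the chosen upper limit is either not voice or not on the recording at all, so there is nothing to gain by processing it.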

Apply audio filters

After setting the bandwidth as close as possible to the voice frequency range (200 to 5000 Hz) or to the equipment limited bandwidth (~3200 Hz), we are ready to apply additional audio filters to reduce more background noises. The most common audio filters are the highpass, lowpass, notch, and comb filters. The most important filter we will use will be the adaptive predictive deconvolver, or the one-channel adaptive filter. A two-channel adaptive filter, or reference noise canceller, is discussed in the next chapter. We will also cover the use of a 20-band digital graphic equalizer.

Highpass filter

The highpass filter, sometimes called a rumble filter, reduces noise below a specified cutoff frequency. If we set the filter's cutoff frequency at 300 Hz, all signal energy below 300 Hz will be reduced according to the specified stopband attenuation, making these frequencies much less audible.

Lowpass filter

The lowpass filter, sometimes referred to as a hiss filter, reduces noise above a specified cutoff frequency. If we set the filter's cutoff frequency at 4000 Hz, all signal energy above 4000 Hz will be reduced according to the specified stopband attenuation, making these frequencies much less audible.

To illustrate these two filters, look at the figures below:

Figure 3. Input Signal

Figure 4. Processed Signal
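For readers who like to experiment, the Python sketch below builds a simple highpass and a simple lowpass filter and reports how much each one reduces a few test frequencies. It is an illustration only: it uses widely published “cookbook” formulas for second-order digital filters, the sample rate and names are our assumptions, and the gentle slope of such a filter is far from the sharp cutoff and selectable stopband attenuation described above.

```python
import cmath
import math

SAMPLE_RATE = 12000   # samples per second (assumed for this sketch)

def second_order_coefficients(kind, cutoff_hz, q=0.7071):
    """Coefficients of a basic second-order highpass or lowpass filter."""
    w0 = 2 * math.pi * cutoff_hz / SAMPLE_RATE
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / (2 * q)
    if kind == "highpass":
        b = [(1 + cos_w0) / 2, -(1 + cos_w0), (1 + cos_w0) / 2]
    elif kind == "lowpass":
        b = [(1 - cos_w0) / 2, 1 - cos_w0, (1 - cos_w0) / 2]
    else:
        raise ValueError(f"unknown filter kind: {kind}")
    a = [1 + alpha, -2 * cos_w0, 1 - alpha]
    return b, a

def gain_db(b, a, freq_hz):
    """How much the filter raises (+) or lowers (-) a tone at freq_hz, in dB."""
    z = cmath.exp(-1j * 2 * math.pi * freq_hz / SAMPLE_RATE)   # one-sample delay
    h = (b[0] + b[1] * z + b[2] * z * z) / (a[0] + a[1] * z + a[2] * z * z)
    return 20 * math.log10(abs(h))

highpass = second_order_coefficients("highpass", 300)    # rumble filter
lowpass = second_order_coefficients("lowpass", 4000)     # hiss filter

for freq in (60, 1000, 5500):
    print(f"{freq:>5} Hz: highpass {gain_db(*highpass, freq):6.1f} dB, "
          f"lowpass {gain_db(*lowpass, freq):6.1f} dB")
```

The 60 Hz rumble is cut sharply by the highpass filter, the 5500 Hz hiss is cut by the lowpass filter, and a 1000 Hz tone in the middle of the voice range passes through both almost untouched.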

Notch filter

If we are able to identify a narrow frequency band in which a noise exists, we can apply a notch filter that will affect only that signal. For example, if we set the notch frequency at 1000 Hz and the notch width at 200 Hz, the narrow band of energy between 900 and 1100 Hz can be reduced according to the specified notch depth.

Figure 5. Notch Filter

We must be careful, though, when using a notch filter. If the notch width is set too wide, or the depth is set too deep, the quality of the voice will be affected. It may not “sound” like the person speaking.
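The 1000 Hz example can be tried out with the Python sketch below. It is a minimal illustration under our own assumptions: the sample rate and names are ours, the filter is a textbook second-order notch, and unlike the adjustable filter described above it has no depth setting (it removes the center frequency as completely as it can).

```python
import math

SAMPLE_RATE = 12000   # samples per second (assumed for this sketch)

def notch_filter(samples, center_hz, width_hz):
    """Second-order notch: removes a narrow band around center_hz."""
    w0 = 2 * math.pi * center_hz / SAMPLE_RATE
    alpha = math.sin(w0) / (2 * center_hz / width_hz)
    b0, b1, b2 = 1.0, -2 * math.cos(w0), 1.0
    a0, a1, a2 = 1 + alpha, -2 * math.cos(w0), 1 - alpha
    out = []
    x1 = x2 = y1 = y2 = 0.0          # the two previous inputs and outputs
    for x in samples:
        y = (b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def tone(freq_hz, seconds=0.5):
    count = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(count)]

def surviving_fraction(freq_hz):
    """Fraction of a tone's level left after the 1000 Hz notch has settled."""
    original = tone(freq_hz)
    filtered = notch_filter(original, center_hz=1000, width_hz=200)
    half = len(original) // 2        # skip the start-up transient
    return rms(filtered[half:]) / rms(original[half:])

for freq in (500, 1000, 2000):
    print(f"{freq} Hz: {surviving_fraction(freq):.3f} of the original level remains")
```

The tone at 1000 Hz all but disappears, while tones at 500 Hz and 2000 Hz come through nearly unchanged. Widening the notch would take more of the neighboring voice energy with it, which is exactly the caution raised above.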

Comb filter

The comb filter is designed specifically to reduce 60 Hz power line hum (50 Hz overseas) and its harmonics. Harmonics are simply multiples of the fundamental frequency, such as 120 Hz, 180 Hz, 240 Hz, etc. Sometimes when our equipment is not grounded properly, or is located too close to an AC power line, a “hum” occurs in the audio signal. A comb filter acts similarly to a notch filter by reducing energy at both the fundamental frequency and its harmonics. The depth of a comb filter is constant for all harmonics. Some people prefer to use multiple notch filters where the depth can be adjusted at each harmonic.

Figure 6. Comb Filter
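A comb filter can be sketched in a few lines of Python by subtracting what the signal was doing exactly one hum cycle earlier. This is one textbook way to build a comb, shown under our own assumptions (sample rate, feedback value, and names); it is not a description of how any DAC product implements it.

```python
import math

SAMPLE_RATE = 12000               # assumed, so one 60 Hz cycle is exactly 200 samples
HUM_HZ = 60
PERIOD = SAMPLE_RATE // HUM_HZ    # samples in one hum cycle

def comb_filter(samples, feedback=0.8):
    """Null HUM_HZ and all its harmonics; feedback keeps each null narrow."""
    gain = (1 + feedback) / 2     # keeps the level between harmonics close to 1
    out = []
    for n, x in enumerate(samples):
        old_in = samples[n - PERIOD] if n >= PERIOD else 0.0
        old_out = out[n - PERIOD] if n >= PERIOD else 0.0
        out.append(gain * (x - old_in) + feedback * old_out)
    return out

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def tone(freq_hz, amplitude=1.0, seconds=1.0):
    count = int(SAMPLE_RATE * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(count)]

# Hum made of the 60 Hz fundamental plus its 120 Hz and 180 Hz harmonics.
hum = [a + b + c for a, b, c in zip(tone(60), tone(120, 0.5), tone(180, 0.25))]
test_tone = tone(1000)            # energy that does not sit on a harmonic

tail = SAMPLE_RATE // 2           # compare levels after the filter has settled
hum_left = rms(comb_filter(hum)[tail:]) / rms(hum[tail:])
tone_left = rms(comb_filter(test_tone)[tail:]) / rms(test_tone[tail:])
print(f"hum remaining: {hum_left:.4f}   1000 Hz tone remaining: {tone_left:.3f}")
```

Because the hum and every one of its harmonics repeat exactly once per cycle, a single delay removes them all together, while a tone between the “teeth” of the comb is left almost untouched.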

Other bandlimiting filters

There are three other lesser-used audio filters that are available in some DAC products for special application as described below:
a. Bandpass filter - a combination of a highpass and lowpass filter where the lower and upper cutoff frequencies can be specified.
b. Bandstop filter - the opposite of the bandpass filter that passes audio below the lower cutoff and above the upper cutoff frequencies.
c. Slot filter - the opposite of the notch filter where only the audio within the slot width is passed.

Adaptive filter

Okay, now we are getting to the heart of the matter, or at least the heart of all DAC filter products. The one-channel (1CH) adaptive filter is sometimes referred to as an automatic noise reduction filter. The other filters we have discussed are what we call fixed filters; that is, when we set the parameters, they do not change unless we change them. The adaptive filters, on the other hand, use multiple microprocessors to constantly analyze the incoming audio signal and automatically make adjustments when they detect changes in the noise. No operator action is required.

Now, it would be nice if this filter reduced all background noises. Unfortunately, this is not the case. The filter only reduces repeatable or predictable signals. In Chapter 2, we referred to these noises as time-correlated signals. Hums, tones, echoes, and reverberations are examples of time-correlated signals. The assumption is that any time a signal is time-correlated (I've heard it before and expect to hear it again), it is not voice. Remember, voice is a relatively random audio signal. So, reducing all time-correlated signals has little impact on the voice, and usually does a very good job of making the voice more intelligible. Unfortunately, though, many noises are also random in nature (water/shower running, wind noise, etc.); these signals are unaffected by the 1CH adaptive filter.

Despite this, the adaptive filter has two major advantages: 1) it automatically adjusts itself when the noise changes, and 2) it can operate in the voice frequency bandwidth to reduce noise and not affect the voice. This makes the adaptive filter ideal for use in unattended voice surveillance activities.
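The idea of “predict what repeats, keep what does not” can be demonstrated with the Python sketch below. Please treat it as a teaching aid only: this primer does not describe the algorithm inside DAC equipment, so we have substituted a well-known textbook method (a normalized LMS predictor). The tap count, step size, sample rate, and the use of random numbers as a stand-in for voice are all our assumptions.

```python
import math
import random

def adaptive_predictor(samples, taps=32, step=0.05):
    """Predict each sample from the previous `taps` samples; return what could NOT be predicted."""
    weights = [0.0] * taps
    past = [0.0] * taps                      # past[0] is the previous sample
    residual = []
    for x in samples:
        predicted = sum(w * p for w, p in zip(weights, past))
        error = x - predicted                # the unpredictable part of this sample
        power = sum(p * p for p in past) + 1e-9
        scale = step * error / power         # normalized LMS weight adjustment
        weights = [w + scale * p for w, p in zip(weights, past)]
        residual.append(error)
        past = [x] + past[:-1]
    return residual

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

SAMPLE_RATE = 8000                           # samples per second (assumed)
random.seed(1)
count = SAMPLE_RATE                          # one second of audio
tone = [math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE) for i in range(count)]
voice = [random.gauss(0.0, 0.1) for _ in range(count)]   # random stand-in for voice
mixture = [t + v for t, v in zip(tone, voice)]

residual = adaptive_predictor(mixture)
tail = count // 2                            # measure after the filter has adapted
level_before = rms(mixture[tail:])
level_after = rms(residual[tail:])
difference = rms([r - v for r, v in zip(residual[tail:], voice[tail:])])
print(f"level before: {level_before:.3f}   after: {level_after:.3f}")
print(f"difference from the clean stand-in voice: {difference:.3f}")
```

Nobody told the filter that the interference was a 1000 Hz tone; it worked that out from the audio itself, which is the sense in which no operator action is required. The loud tone is largely removed, and what remains is close to the random stand-in for the voice.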

Equalizer

An equalizer is also a common type of audio processor. Many home stereos even include an equalizer. DAC includes a 20-band digital graphic equalizer in several of its products. Simplistically, this equalizer consists of 20 notch filters, each with a notch width of 1/20 of the bandwidth. As the slide control is moved, the energy level in that frequency band is reduced or increased. This can be used to reduce noise levels, but is more often used to reshape the voice spectrum to improve voice quality. Think back to Chapter 3. After processing, the resultant spectrum may be more flat. By using the equalizer as shown below, the voice can be returned to a more typical spectrum, thus improving voice quality.

Figure 7. 20-band Digital Graphic Equalizer

Audio enhancement

The primary purpose of reducing background noise is to be able to hear what is being said - that is, voice intelligibility. If we plan to use the enhanced tape in court or other legal proceedings, the voice on the tape also has to “sound” like the person. This could limit what we can do to process the audio. Remember, the adaptive filter can reduce time-correlated noise within the voice frequency bandwidth and not affect the voice. But what if there are random noises still present which mask the conversation? If these noise signals extend across the entire bandwidth (broadband noise), there is little we can do. However, if they are limited in their frequency range(s), we might be able to do something.

Now let's suppose there is some random noise between 3000-3500 Hz that is so loud we cannot hear the conversation. What can we do? Well, we could use either a notch filter or a bandstop filter to reduce the loud noise in this frequency band. The result may be that the noise level is reduced and we can hear what is being said. However, the voice energy would be modified within this frequency band so it probably would not “sound” the same as the person normally does. Thus, in this case we would have to find the proper balance when adjusting the filters so that we achieve both maximum intelligibility and maximum voice quality.

Key Points to Remember:

  • Despite all of our efforts to have good equipment and properly plan for the recording of conversations, background noise often makes the voices difficult to hear and/or understand.
  • When working in the voice frequency range, audio enhancement techniques must be carefully applied to achieve maximum intelligibility and voice quality.
  • Based on the limitations of the recording, set your frequency bandwidth first, then begin applying one or more audio filters to reduce the noise.

Summary

This booklet was written to introduce audio enhancement to the beginner; to keep it simple and understandable. There are other filters and many techniques and processes that were not covered in order not to confuse at this introductory level.

To learn more details on audio enhancement, you are encouraged to attend DAC School. This 5-day school is held twice per year, once in the Fall and once in the Spring. It begins with some elementary discussions on sound, noise, and voice, and progresses up through spectrum analysis and audio enhancement techniques. It is structured to include 50% classroom instruction and 50% hands-on experimentation. Contact DAC for more information.

If you have any comments regarding this publication, please feel free to contact DAC at any time.