Chapter 1 - Sounds
Before diving into the details of mixing, we need to look at some properties of sounds in general. This section is background information, but it is necessary to understand its contents in order to grasp a lot of the basic principles of mixing. A sound is a pressure wave traveling through the air. Any action which puts air into motion will create a sound. Our auditory system systematically groups the pressure waves that hit our ears into distinct sounds for ease of processing, much as our vision groups the photons that hit our eyes into objects.
But, just like our vision can divide visual objects into smaller objects (a "person" can be divided into "arms," "legs," a "head," etc.), our brains can analytically divide sounds into smaller sounds (for instance the spoken word "cat" can be divided into a consonant 'k', a vowel 'ahh', and another consonant 't'). Similarly, just as our vision can group collections of small objects into larger objects (a collection of "persons" becomes a "crowd"), our brains can group collections of sounds into larger sounds (a collection of "handclaps" becomes "applause").
1.1 Frequency Domain
If you continue to subdivide physical objects into smaller and smaller pieces, you will eventually arrive at atoms, which cannot be further subdivided. There is a similarly indivisible unit of sound, and that is the "frequency". All sounds can ultimately be reduced to a bunch of frequencies. The difference is that, where an object may be composed of billions of atoms, a sound typically consists of no more than thousands of frequencies. So, frequencies are a very practical way of analyzing sounds in the everyday context of electronic music.
What is a frequency, anyway? A frequency is simply a sine-wave shaped disturbance in the air; an oscillation, in other words. They are typically considered in terms of the rate at which they oscillate, measured in cycles per second (Hz). Science tells us that the human ear can hear frequencies in the approximate range of 20Hz to 20,000Hz, though many people seem to be able to hear somewhat further in both directions. In any case, this range of 20Hz-20,000Hz comfortably encompasses all of the frequencies that we commonly deal with in our day to day lives.
Unsurprisingly, different frequencies sound different, and have different effects on the human psyche. There is a continuum of changing "flavor" as you go across the frequency range. 60Hz and 61Hz have more or less the same flavor, but by the time you get up to 200Hz, you are in quite different territory indeed.
It is worth noting that we perceive frequencies logarithmically. In other words, the difference between 40Hz and 80Hz is comparable to the difference between 2,000Hz and 4,000Hz. This power-of-two difference is called an "octave." Humans can hear a frequency range of approximately ten octaves.
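The logarithmic relationship is easy to check with a little arithmetic. The following Python sketch counts the octaves between two frequencies as the base-2 logarithm of their ratio. The function name `octaves_between` is only an illustrative choice.

```python
import math

def octaves_between(low_hz, high_hz):
    """Number of octaves (doublings) from low_hz up to high_hz."""
    return math.log2(high_hz / low_hz)

print(octaves_between(40, 80))        # one octave
print(octaves_between(2000, 4000))    # also one octave
print(octaves_between(20, 20000))     # just under ten octaves
```

Both pairs of frequencies are exactly one doubling apart, which is why they sound like the same size of step even though one gap is 40Hz wide and the other is 2,000Hz wide.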
I will now attempt to describe the various flavors of the different frequency ranges. As I do, bear in mind that words are highly inadequate for this job. First, because we do not have words to refer to the flavors of sounds, so I must simply attempt to describe them and hope that you get my drift. Second, because, as I have said previously, all of these flavors blend into each other; there are no sharp divisions between them.[1] With all that in mind, here we go.
20Hz-40Hz "subsonics": These frequencies, residing at the extremes of human hearing, are almost never found in music, because they require extremely high volume levels to be heard, particularly if there are other sounds playing at the same time. Even then, they are more felt than heard. Most speakers can't reproduce them.
That said, subsonics can have very powerful mental and physical effects on people. Even if the listener isn't aware that they're being subjected to them, they can experience feelings of unease, nausea, and pressure on the chest. Subsonics can move air in and out of the lungs at a very rapid rate, which can lead to shortness of breath. At 15Hz, which is the resonant frequency of the eyeball, people can start hallucinating. It is suspected that frequencies in this range may be present at many allegedly "haunted" locales, since they create feelings of unease. Furthermore, frequencies around 15Hz may be responsible for many "ghost" sightings. Incidentally, many horror movies use subsonics to create feelings of fear and disorientation in the audience.
40Hz-100Hz "sub-bass": This relatively narrow frequency range marks the beginning of musical sound, and it is what most people think of when they think of "bass." It accounts for the deep booms of hip-hop and the hefty power of a kick drum. These frequencies are a full-body experience, and carry the weight of the music. Music lacking in sub-bass will feel lean and wimpy. Music with an excess of sub-bass will feel bloated and bulky.[2]
100Hz-300Hz "bass": Still carrying a hint of the feeling of the sub-bass range, this frequency range evokes feelings of warmth and fullness. It is body, stability, and comfort. It is also the source of the impact of drums. An absence of these frequencies makes music feel cold and uneasy. An excess of these frequencies makes music feel muddy and indistinct.
300Hz-1,000Hz "lower midrange": This frequency range is rather neutral in character. It serves to anchor and stabilize the other frequency ranges; without it, the music will feel pinched and unbalanced.
1,000Hz-8,000Hz "upper midrange": These frequencies attract attention. The human ear is quite sensitive in this range, and so it is likely to pay attention to whatever you put in it. These frequencies are presence, clarity, and punch. An absence of upper midrange makes music feel dull and lifeless. An excess of upper midrange makes music feel piercing, overbearing, and tiring.
8,000Hz-20,000Hz "treble": Another extreme in the human hearing range. These frequencies are detail, sparkle, and sizzle. An absence of treble makes music feel muffled and boring. An excess of treble makes music harsh and uncomfortable to listen to.
These frequencies, by their presence or absence, make music exciting or relaxing. Music that is meant to be exciting, such as dance music, contains large amounts of treble; music that is meant to be relaxing contains low amounts of treble. As people age, they gradually lose their ability to hear frequencies in this range.
So now we understand the effects of individual frequencies on the human psyche. But sounds rarely consist of single frequencies; they are composed of multitudes of frequencies, and the way in which said frequencies are organized also has an effect on the human psyche.
When multiple frequencies occur simultaneously in the same frequency range, their conflicting wavelengths cause periodic oscillations in volume known as "beating." Beating is more noticeable in lower frequencies than in higher frequencies. In the sub-bass range, any beating at all becomes quite dominating and often disturbing, while in the treble range, frequencies are typically quite densely packed to no ill effect.
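To make beating a little more concrete, here is a small Python sketch. It assumes the simplest possible case, two sine waves of equal volume, for which the sum can be rewritten as a fast oscillation at the average frequency multiplied by a slow "envelope." The volume then swells and fades at a rate equal to the difference between the two frequencies. The function names and the example frequencies are illustrative choices, not anything standard.

```python
import math

def beat_rate_hz(f1_hz, f2_hz):
    """Rate at which the combined volume rises and falls, in beats per second."""
    return abs(f1_hz - f2_hz)

def two_sines(f1_hz, f2_hz, t):
    """Instantaneous amplitude of two equal-volume sine waves added together."""
    return math.sin(2 * math.pi * f1_hz * t) + math.sin(2 * math.pi * f2_hz * t)

def envelope_form(f1_hz, f2_hz, t):
    """The same signal written as a slow envelope times a fast oscillation."""
    envelope = 2 * math.cos(math.pi * (f1_hz - f2_hz) * t)
    carrier = math.sin(math.pi * (f1_hz + f2_hz) * t)
    return envelope * carrier

print(beat_rate_hz(60, 61))  # 1 beat per second
```

The two functions `two_sines` and `envelope_form` return the same value at any time `t`; the second form simply makes the slow volume oscillation visible.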
Beating is also the underlying principle of the formation of musical chords. Combinations of tones which produce subtle beating are considered "consonant," while combinations of tones which produce pronounced beating are considered "dissonant." When considering chords in terms of beating, it is important to note that beating occurs not only between the fundamental frequencies of the tones involved, but also their harmonics. Thus, for instance, while two individual frequencies a major ninth apart will not produce beating, two tones a major ninth apart will, because their harmonics will produce beating.
Beating also contributes to the character of many non-tonal sounds. For instance, the sound of a cymbal is partially due to the beating of the countless frequencies which it contains. Similarly, the "thumpy" sound of the body of an acoustic kick drum is partially due to the beating of bass frequencies.
1.2 Patterns of Frequency Distribution
Having considered in general the psychological effects of individual frequencies and combinations of frequencies, let us now examine the specific frequency distribution patterns of common sounds. Obviously, it would be impossible to describe the frequency distribution patterns of every possible sound. Indeed, every frequency distribution describes one sound or another. So, in this section, we will simply examine the frequency distribution patterns of the sounds most commonly found in music. We will only examine four categories of sounds, but they cover a surprisingly large amount of ground; with them, we will be able to account for the majority of sounds found in most music.
1.2.1 Tones
The simplest frequency organization structure is the tone. Tones are very common in nature, and our brains are specially built to perceive them. A tone is a series of frequencies arranged in a particular, mathematically simple, pattern. The lowest frequency in the tone is called the fundamental and the frequencies above it are called harmonics. The first harmonic is twice the frequency of the fundamental; the second harmonic is three times the frequency; and so forth. This extension could theoretically go on to infinity, but because the harmonics of a tone typically steadily fall in volume with increasing frequency, in practice they peter out eventually.
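As a rough illustration of this pattern, the Python sketch below lists the frequencies in a tone. It follows this chapter's numbering, in which the first harmonic is twice the fundamental. The function name and the 110Hz example are assumptions made purely for illustration.

```python
def tone_frequencies(fundamental_hz, harmonic_count):
    """Return the fundamental followed by its harmonics.

    Uses this chapter's numbering: harmonic n sits at (n + 1) times
    the fundamental, so the first harmonic is twice the fundamental.
    """
    return [fundamental_hz * (n + 1) for n in range(harmonic_count + 1)]

print(tone_frequencies(110, 4))  # [110, 220, 330, 440, 550]
```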
The character of a particular tone, often called its "timbre," is partially determined by the relative volumes of the harmonics; these differences are a big part of what differentiates a clarinet from a violin, for instance. The reedy, hollow tone of a clarinet is partially due to a higher emphasis on the odd-numbered harmonics, while a violin tone gets its character from a more even distribution of harmonics. The bright tone of a trumpet is due to the high volume of its treble-range upper harmonics, while the mellower tone of a French horn has much more subdued upper harmonics.
Tones are the bread and butter of much music. All musical instruments, except for percussion instruments, primarily produce tones. Synthesizers also mostly produce tones.
1.2.2 The Human Voice
The human voice produces tones, and thus could justifiably be lumped into the previous section. But there is a lot more to it than that, and since the human voice is such an important class of sound, central to so much music, it is worth examining more closely.
The human voice can make a huge variety of sounds, but the most important sounds for music are those that are used in speech and singing: specifically, vowels and consonants.
A vowel is a tone. The specific vowel that is intoned is defined by the relative volumes of the different harmonics; the difference between an 'ehh' and an 'ahh' is a matter of harmonic balance. In speech, vowel tones rarely stay on one pitch; they slide up and down. This is why speech does not sound "tonal" to us, though it technically is. Singing is conceptually the same as speaking, with the difference being that the vowels are held out at constant pitches.
A consonant is a short, non-tonal noise, such as 't', 's', 'd', or 'k.' They are found in the upper midrange. The fact that consonants carry most of the information content of human speech may well account for the human brain-ear's bias towards the upper midrange.
So, we can see that the human voice, as it is used in speech and singing, is composed of two parts: tonal vowels, and non-tonal consonants. That said, the human voice is very versatile, and many of its possible modes of expression are not covered by these two categories of sound. Whispering, for instance, replaces the tones of vowels with breathy, non-tonal noise, with consonants produced in the normal manner. Furthermore, many of the noises that are made, for instance, by beatboxers, defy analysis in terms of vowels and consonants.
1.2.3 Drums
So far we have examined tones and the human voice. The human voice is quite tonal in nature, so in a certain sense we are still looking at tones. Now we will look at drum sounds, which, though not technically tones, are still somewhat tonal in nature.
A "drum" consists of a membrane of some sort stretched across a resonating body. It produces sound when the membrane is struck. A drum produces a complex sound, the bulk of which resides in the bass and the lower midrange. This lower component of the sound, which I call the "body," does not technically fit the frequency arrangement of a tone, but usually bears a greater or lesser resemblance to such an arrangement, and thus the sound of a drum is somewhat tonal.
In addition to the body component of the sound, which is created by the vibration of the membrane, part of the sound of a drum is created by the impact between the membrane and the striking object. This part of the sound, which I will refer to as the "beater sound," has energy across the frequency spectrum, but is usually centered in the upper midrange and the treble.
1.2.4 Cymbals
Now, having examined tones in general, the human voice, and drums, we come to the first (and only) completely non-tonal sounds that we will examine: cymbals. Cymbals are thin metal plates that are struck, like drums, with beaters. The vibrations of the struck plates create extremely complex patterns of frequencies, hence the non-tonal nature of cymbals.
Cymbals have energy throughout the entire frequency spectrum, but the bulk of said energy is typically in the treble range, or in the midrange in the case of large cymbals such as gongs. There is also reason to believe that cymbals have significant sonic energy above the range of human hearing, since their energy shows no signs of petering out near 20kHz. In any case, because cymbals have so much treble energy, they are a very exciting type of sound.
1.3 Time Domain
Thus far we have analyzed sounds in terms of frequencies, and indeed this type of analysis, called "frequency domain" analysis, is a very useful way to analyze them. But there is another way to analyze sounds that is important to understand for the purposes of mixing, which is in terms of their waveforms. This type of waveform-based analysis is called "time domain" analysis.
Time domain analysis essentially means looking at a sound not in terms of the sine waves that make it up, but in terms of the patterns of disturbance that it causes in whatever medium it is traveling through: air molecules, a human eardrum, a speaker cone, or the electrical signal in an audio cable, for instance. The intensity of the disturbance that the sound causes at any given instant is called its amplitude. The character of a sound is determined by its patterns of changing amplitude; its waveform, in other words.
When you combine two sounds (i.e., play them simultaneously through the same medium), their time-domain disturbances are added together; the instantaneous amplitude of the resulting sound at any given time is a simple mathematical sum of the instantaneous amplitudes of the separate sounds. This is why the final stage of mixing (i.e., combining the separate mixer tracks into one "master" track) is sometimes called "summing." It literally is just a matter of taking the sum of everything.
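Summing really is as simple as it sounds. The Python sketch below mixes tracks by adding their samples position by position. The track names and sample values are made up for illustration, and the tracks are assumed to be the same length.

```python
def sum_tracks(*tracks):
    """Mix equal-length tracks by adding their samples position by position."""
    return [sum(samples) for samples in zip(*tracks)]

kick = [0.5, 0.0, -0.5, 0.0]
bass = [0.25, 0.25, -0.25, -0.25]
print(sum_tracks(kick, bass))  # [0.75, 0.25, -0.75, -0.25]
```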
It is important to understand that any sound can be analyzed both in the frequency domain and the time domain. You can look at a sound as a collection of sine waves, or you can look at it as a pattern of disturbance in a medium. Both perspectives are useful for different things.
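The two perspectives can be seen side by side in a short Python sketch. It builds a waveform in the time domain and then measures how strongly each sine-wave component is present, using a naive discrete Fourier transform. This is only a teaching sketch under simple assumptions (a single sine wave that fits a whole number of cycles into the analyzed window); real audio software uses a much faster algorithm to do the same job, and the function names here are illustrative.

```python
import cmath
import math

def sine_samples(cycles, sample_count):
    """Time-domain view: samples of a sine completing `cycles` cycles."""
    return [math.sin(2 * math.pi * cycles * n / sample_count)
            for n in range(sample_count)]

def dft_magnitudes(samples):
    """Frequency-domain view: strength of each sine component (naive DFT)."""
    count = len(samples)
    magnitudes = []
    for k in range(count // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * n / count)
                    for n, x in enumerate(samples))
        magnitudes.append(abs(total))
    return magnitudes

waveform = sine_samples(cycles=5, sample_count=64)
spectrum = dft_magnitudes(waveform)
strongest = max(range(len(spectrum)), key=lambda k: spectrum[k])
print(strongest)  # 5
```

The list `waveform` is the pattern of disturbance over time; the list `spectrum` describes the same sound as a set of sine-wave strengths, with nearly all of the energy in component number 5.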
1.4 Loudness Perception
Since loudness is such an important topic in mixing, it seems appropriate at this point to talk about the perception of loudness in general.
Loudness is measured in decibels (dB). Decibels are a relative, logarithmic measurement.
Decibels are a logarithmic measurement in that amplitude increases exponentially with decibel value. Specifically, every 20dB increase or decrease of decibel value corresponds to a factor of ten increase or decrease in amplitude. In other words, increasing a sound's level by 20dB multiplies its amplitude by ten. Increasing a sound's level by 40dB multiplies its amplitude by a hundred. Decreasing a sound's level by 60dB multiplies its amplitude by one thousandth. And so forth.
Decibels are a relative measurement in that a measurement of decibels does not tell you precisely how loud a sound is; it can only tell you how loud it is relative to some reference amount, usually designated as 0dB. So, for instance, a level of 3dB is three decibels louder than the reference level, and a level of -3dB is three decibels quieter than the reference level.
When discussing real-world sounds traveling through the air, loudness is most often measured in dBSPL, or "decibels of sound pressure level." This is a unit of measure based on the decibel, with the reference level of 0dBSPL being the quietest sound that is audible by a young adult with undamaged hearing.[3] The threshold of pain is generally placed around 120dBSPL. This range of 0dBSPL to 120dBSPL gives us the practical dynamic range[4] of human hearing. 50dBSPL is a good listening level for music.
Loudness can be measured in two ways: it can be measured in terms of peak loudness, or in terms of average loudness. Peak loudness measures the amplitude of the highest instantaneous peaks in the sound. Average loudness measures the overall average amplitude level, taking into account all of the loud peaks and the quiet in-between spaces.[5] Peak loudness is good to know because peaks that are too loud will often cause audio equipment to overload. Average loudness is good to know because it reflects, more accurately than peak loudness, the human ear's actual perception of loudness. The level meters on most audio mixers measure peak loudness.
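The difference between the two measurements shows up clearly in a small Python sketch. It assumes that "average" means the root-mean-square (RMS) of the samples, which is one common choice rather than the only one; the function names and the two example signals are invented for illustration.

```python
import math

def peak_level(samples):
    """Largest instantaneous amplitude, ignoring sign."""
    return max(abs(s) for s in samples)

def rms_level(samples):
    """Root-mean-square amplitude, used here as the 'average' level."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

click = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
drone = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
print(peak_level(click), rms_level(click))
print(peak_level(drone), rms_level(drone))
```

The click has the higher peak, but the drone has the higher average, because it stays at a moderate level the whole time instead of spiking once.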
Average loudness, when measured as described above, will still not be a terribly accurate measurement of human loudness perception. Loudness perception is complicated by the fact that the ear has a bias towards certain frequency ranges and away from others. The ear is most insensitive in the subsonic range, and becomes progressively more sensitive into the upper midrange, after which its sensitivity rapidly rolls off. The sensitivity also varies with volume, with the ear being less sensitive to bass and treble at lower volumes. The precise sensitivity curves are given in Figure 1.1.

Figure 1.1: Sensitivity of the human ear across the audible frequency range.
1.5 Digital Audio
Thus far we have only looked at how sounds work in the "real world;" we've looked at sounds in the form of pressure waves in the air, and in the form of analog electrical signals. We have not yet looked at how sounds are represented in the computer, in their digital, numerical representation. Digital sound behaves in more or less the same way as real-world, "analog" sound, but there are still a number of special considerations that apply, so it is worth examining the basic ideas behind it.

Figure 1.2: Analog to digital conversion.

Figure 1.3: Digital clipping.
The defining characteristic of any kind of digital data, be it text, pictures, or movies, is that it is made of a bunch of numbers. Numbers are all that computers know how to work with. When computers work with audio, the situation is no different: they must figure out how to take the continuous time-domain waveform of a sound and reduce it to a series of numbers.
They accomplish this by "sampling" the waveform. What this means is that, when you record an audio signal into your computer, it captures it by measuring the instantaneous amplitude of the waveform at regular intervals. These individual measurements are called "samples." This process of sampling turns the continuous, analog waveform into a numeric, "digital" approximation that looks a lot like a staircase. Figure 1.2 illustrates the effect.
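Sampling can be sketched in a few lines of Python. The example below measures a sine wave at regular intervals, with the interval set by the number of samples per second. The function name and the particular numbers (a 1,000Hz wave measured 8,000 times per second) are assumptions chosen to keep the output short.

```python
import math

def sample_sine(frequency_hz, sample_rate_hz, sample_count):
    """Measure a sine wave's amplitude at regular intervals of 1/sample_rate."""
    return [math.sin(2 * math.pi * frequency_hz * n / sample_rate_hz)
            for n in range(sample_count)]

samples = sample_sine(frequency_hz=1000, sample_rate_hz=8000, sample_count=8)
print([round(s, 3) for s in samples])
```

Each number in the list is one sample: a snapshot of the waveform's amplitude at one instant.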
1.5.1 Clipping
The numeric value of a sample represents its amplitude. One of the limitations of digital systems is that they have a sharp, absolute limit on the maximum amplitude of the signals that can be represented; the computer will only count so high. Any amplitudes that are higher than the maximum countable amplitude will simply be "clipped" off, as shown in Figure 1.3.
As you might guess, digital clipping generally sounds quite bad, and it is to be avoided in most circumstances.[6] Whenever you are working with digital audio, you must make sure that it never exceeds the maximum digital amplitude.
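The effect of clipping on the numbers themselves can be shown with a minimal Python sketch. It assumes samples are stored as values where 1.0 is the maximum representable amplitude; the function name and sample values are illustrative.

```python
def clip(samples, max_amplitude=1.0):
    """Flatten any sample beyond the maximum representable amplitude."""
    return [max(-max_amplitude, min(max_amplitude, s)) for s in samples]

too_loud = [0.2, 0.9, 1.4, 0.7, -1.3, -0.4]
print(clip(too_loud))  # [0.2, 0.9, 1.0, 0.7, -1.0, -0.4]
```

The samples that were within range pass through untouched; the ones that overshoot are flattened to the limit, which changes the shape of the waveform.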
1.5.2 Sampling Resolution
Besides clipping, the process of analog to digital conversion can have a number of other detrimental effects on the quality of audio. Furthermore, processing audio when it is in digital form can further degrade the quality, due to rounding errors in the numerical digital processing algorithms.
There are two attributes of a digital audio system that determine its fidelity: sampling rate[7] and sampling resolution. If both of these attributes are sufficiently good, then digital recording and processing will create little or no audible degradation of the sound quality.
The sampling resolution of a system is the numeric accuracy of the individual samples. The more possible numeric values for a sample, the higher the sampling resolution is. Because computers work in binary, sampling resolution is typically described in terms of "bits." A 4-bit digital system has 16 possible numeric values for each sample.[8] An 8-bit system has 256 possible values. A 16-bit system has 65,536 possible values, and a 24-bit system has 16,777,216 possible values. In general, an n-bit system has 2^n possible numeric values for each sample.
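These figures are just powers of two, as a quick Python check shows. The function name is an illustrative choice.

```python
def possible_values(bits):
    """Number of distinct sample values an n-bit system can represent."""
    return 2 ** bits

for bits in (4, 8, 16, 24):
    print(bits, possible_values(bits))
```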
A low sampling resolution will degrade the quality of the audio by introducing "quantization noise." Quantization noise is the audible artifact that results from the "rounding errors" inherent in analog to digital conversion, as seen in Figure 1.2. It usually[9] manifests in the form of a low-volume hissing sound, somewhat similar to the sound heard in quiet sections on analog tapes and vinyl. This sound will mask subtle details in the sound and make sufficiently quiet sounds inaudible.
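The rounding errors behind quantization noise can be seen in a simplified Python sketch. It assumes samples lie between -1.0 and 1.0 and divides that range into 2^n equal steps, ignoring the details of how real converters handle the exact ends of the range. The function name and the test value 0.3 are assumptions for illustration.

```python
def quantize(sample, bits):
    """Round a sample to a grid whose spacing matches an n-bit system.

    Simplified model: samples are assumed to lie between -1.0 and 1.0,
    and that range is divided into 2**bits equal steps.
    """
    step = 2.0 / (2 ** bits)
    return round(sample / step) * step

original = 0.3
for bits in (4, 8, 16):
    rounded = quantize(original, bits)
    print(bits, rounded, abs(rounded - original))
```

The rounding error shrinks as the number of bits grows, which is why higher sampling resolutions produce quieter quantization noise.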
1.5.3 Dynamic Range
The higher the bit resolution of a digital system is, the quieter the quantization noise is. The level of the quantization noise is what determines the system's total "dynamic range;" that is, the ratio between the quietest possible sound and the loudest possible sound. The quietest possible sound is restricted by the level of the quantization noise, and the loudest possible sound is restricted by the threshold for clipping.
A digital system has a dynamic range of 6dB times the bit resolution. In other words, each bit of sampling resolution adds roughly 6dB of dynamic range. Thus, the dynamic range of a 16-bit system is about 96dB. The dynamic range of a 24-bit system is about 144dB, larger than the dynamic range of human hearing.
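The "6dB per bit" rule of thumb can be reproduced with a short Python calculation. This sketch assumes the usual convention of expressing an amplitude ratio in decibels as 20 times its base-10 logarithm, under which each doubling of the number of possible values adds about 6.02dB. The function name is illustrative.

```python
import math

def dynamic_range_db(bits):
    """Approximate dynamic range of an n-bit fixed-point system in decibels."""
    return 20 * math.log10(2 ** bits)

print(round(dynamic_range_db(16), 1))  # about 96.3
print(round(dynamic_range_db(24), 1))  # about 144.5
```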
Volume levels in the digital world are measured in "full-scale decibels," or dBFS. The digital full-scale measurement system measures peak volume, not average volume. The 0dB reference point is set at the highest representable amplitude; in other words, 0dBFS is the loudness of the loudest possible sound. All other volume levels are negative; a sound with a level of -6dBFS has a peak level 6dB below the digital maximum, for instance.
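A peak level in dBFS can be computed from raw sample values along the following lines. This Python sketch assumes samples are stored so that 1.0 is the highest representable amplitude, and it uses the common convention of 20 times the base-10 logarithm of an amplitude ratio. The function name and parameters are illustrative.

```python
import math

def peak_dbfs(samples, full_scale=1.0):
    """Peak level relative to the largest representable amplitude."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")  # digital silence has no finite level
    return 20 * math.log10(peak / full_scale)

print(peak_dbfs([0.1, -1.0, 0.3]))         # 0.0, right at the ceiling
print(round(peak_dbfs([0.25, -0.5]), 2))   # about -6.02
```

A signal whose highest peak is half of full scale comes out at roughly -6dBFS.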
1.5.4 Standard Sampling Resolutions
There are two commonly used sampling resolutions: 16-bit and 24-bit. 16-bit is the resolution of audio CDs and most MP3s. It is typically used for the distribution of mixed-down music. Its dynamic range is sufficient for the vast majority of music.
In the actual mixing process, it is preferable to use 24-bit. 24-bit has more dynamic range than 16-bit. While the difference doesn't matter much for finished mixdowns, it can make a difference in the mixing process, because the extra dynamic range gives some "slop room," allowing for the rounding errors introduced by digital processing to occur without significant audible effects.
Some DAWs also have a "32-bit" resolution. This usually refers to the so-called "floating point" representation of digital audio, as opposed to the usual "fixed-point" representation, which is what we have discussed so far.
32-bit floating point and 24-bit fixed point are, in a certain sense, the same thing. Without going into the technical differences between the two, 32-bit floating point audio has the same dynamic range as 24-bit fixed point audio, with the added advantage that audio above the 0dBFS threshold will not clip. Instead, the computer will effectively take bits from the bottom and add them to the top. This raises the quantization noise, but also raises the maximum representable amplitude, resulting in a net effect of the same amount of dynamic range.
It is generally not a good idea to take advantage of floating point's ability to exceed the 0dBFS ceiling, because even in DAWs that fully support floating point, many plugins will convert their input audio to fixed point internally; when they do this, the audio will clip. So, even if you are working in floating point, it is best to act as if you were not, and keep all levels below 0dBFS at all times.
1.5.5 Sampling Rate
The sampling rate of a digital system is the number of samples per second that it uses to represent the audio. For instance, audio CDs use 44,100 samples per second. Sampling rates are measured in hertz (Hz), just like frequencies. Thus, the audio CD sampling rate might be written as 44,100Hz, or 44.1kHz.
Intuitively, you might expect that a higher sampling rate would yield higher quality audio, and this intuition is correct. Specifically, sampling rate affects the "frequency response" of the digital system; that is, the range of frequencies that it can represent.
Digital systems have no minimum representable frequency; they can go all the way down to 0Hz. They do, however, have a maximum representable frequency, and it is determined by the sampling rate. Specifically, the maximum representable frequency is half of the sampling rate. Thus, with a sampling rate of 44.1kHz, the maximum representable frequency is 22.05kHz. This maximum frequency is referred to as the "Nyquist frequency."
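The relationship is simple enough to express in a couple of lines of Python. The function name and the particular example rates are illustrative.

```python
def nyquist_frequency_hz(sample_rate_hz):
    """Highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2

for rate in (44100, 48000, 96000, 192000):
    print(rate, nyquist_frequency_hz(rate))
```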
The most common sampling rates are 44.1kHz, 48kHz, 96kHz, and 192kHz. The lowest of these, 44.1kHz, is typically used for distributing finished mixes. Since this sampling rate can represent all audible frequencies, you might wonder why anyone would ever use a higher sampling rate.
The answer is that, besides allowing higher frequencies to be represented, higher sampling rates can also make certain audio processes sound better, with fewer sonic artifacts. Such processes include equalization[10] and compression[11], certain aspects of synthesis, such as filtering and waveform synthesis, and certain aspects of sampling, such as repitching.
The drawback of higher sampling rates is that they imply higher CPU usage. For instance, going from 48kHz to 96kHz, you can expect most processes to use twice as much CPU, because they are processing twice as many samples in the same amount of time.
[1] This also implies that the precise frequency ranges given for each flavor are highly inexact and really somewhat arbitrary.
[2] It is a common beginner mistake to mix with far too much sub-bass. To do so may produce a pleasing effect in the short term, but in the long term it will become apparent that the excess of sub-bass is hurting the music by destroying its sense of balance and making it tiring to listen to.
[3] Because human hearing sensitivity varies with frequency, this "quietest audible sound" metric is measured at a frequency of 1kHz, where human hearing is most sensitive.
[4] The "dynamic range" of a system is the ratio between the quietest sound it can handle, and the loudest sound it can handle.
[5] Average loudness is essentially
$$\sqrt{\frac{1}{T}\int_{0}^{T} a(t)^{2}\,dt}$$
where a(t) is the instantaneous amplitude of the sound over time and T is the length of the time interval being measured.
[6] Digital clipping may, in certain circumstances and styles, be considered aesthetically desirable, but in the vast majority of cases it is considered an artifact.
[7] See Section 1.5.5 for a discussion of sampling rates.
[8] Figure 1.2 shows 4-bit sampling.
[9] With particularly simple signals, particularly quiet signals, and particularly low sampling resolutions, the quantization noise may manifest quite differently, and usually in a more disturbing way.
[10] See Section 4.
[11] See Section 5.