Algorithms for determining the key of an audio sample

Question

I am interested in determining the musical key of an audio sample. How would (or could) an algorithm go about trying to approximate the key of a musical audio sample?

Antares Autotune and Melodyne are two pieces of software that do this sort of thing.

Can anyone give a bit of a layman's explanation of how this would work, that is, how to mathematically deduce the key of a song by analysing the frequency spectrum for chord progressions etc.?

This topic interests me a lot!

Edit - brilliant sources and a wealth of information to be found from everyone who contributed to this question.

Especially from: the_mandrill and Daniel Brückner.

@Moron - thanks. Be pleased to see if any SO bods can give a good answer :) — Alex, Jun 29, 2010 at 15:07
For what it's worth, Antares Autotune doesn't do key detection, simply pitch correction to bend notes towards certain semitones that you specify. Check out its Wikipedia article for a screenshot of the interface. It probably does some form of pitch detection, which when dealing with a monophonic vocal track, isn't terribly difficult. I think the question of how to do key detection is an interesting one though! :) — Colin Barrett, Jun 29, 2010 at 18:21
This won't answer your question for an algorithm, but if you're interested in the cutting edge of music processing and are open to using an external API, you could check out The Echo Nest — Justin L., Jun 29, 2010 at 21:29

the_mandrill · Accepted Answer · 2010-06-29 20:59:34Z

It's worth being aware that this is a very tricky problem and if you don't have a background in signal processing (or an interest in learning about it) then you have a very frustrating time ahead of you. If you're expecting to throw a couple of FFTs at the problem then you won't get very far. I hope you do have the interest as it is a really fascinating area.

Initially there is the problem of pitch recognition, which is reasonably easy to do for simple monophonic instruments (e.g. voice) using a method such as autocorrelation or harmonic sum spectrum (e.g. see Paul R's link). However, this often gives the wrong result: you get half or double the pitch that you were expecting. This is called pitch period doubling, or octave error, and it occurs essentially because the FFT or autocorrelation assumes that the data has constant characteristics over time. If you have an instrument played by a human there will always be some variation.
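
To make the autocorrelation idea concrete, here is a minimal sketch in Python with NumPy. It is only an illustration for a clean monophonic frame, not a robust detector: the function name `autocorr_pitch`, the `fmin`/`fmax` search limits and the synthetic test tone are all assumptions made for this example.

```python
import numpy as np

def autocorr_pitch(signal, sample_rate, fmin=50.0, fmax=1000.0):
    """Estimate the fundamental frequency (Hz) of one monophonic frame."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    # Full autocorrelation; keep only the non-negative lags.
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    # Only search lags that correspond to plausible pitches.
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), len(ac) - 1)
    best_lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return sample_rate / best_lag

# Synthetic 200 Hz tone with two overtones, sampled at 8 kHz.
sr = 8000
t = np.arange(2048) / sr
tone = (np.sin(2 * np.pi * 200 * t)
        + 0.5 * np.sin(2 * np.pi * 400 * t)
        + 0.25 * np.sin(2 * np.pi * 600 * t))
estimate = autocorr_pitch(tone, sr)
```

The autocorrelation also peaks at twice, three times, ... the true period, which is one way the octave errors mentioned above creep in once the signal is less tidy than this test tone.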

Some people approach the problem of key recognition as being a matter of doing the pitch recognition first and then finding the key from the sequence of pitches. This is incredibly difficult if you have anything other than a monophonic sequence of pitches. Even if you do have a monophonic sequence of pitches, there's still no clear-cut method of determining the key: how do you deal with chromatic notes, for instance, or determine whether it's major or minor? So you'd need to use a method similar to Krumhansl's key finding algorithm.
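
For a feel of what a Krumhansl-style key finder does, here is a rough sketch of the usual profile-correlation formulation (often called Krumhansl-Schmuckler). The profile numbers are the commonly quoted Krumhansl-Kessler probe-tone ratings. The function name `estimate_key` and the input format (twelve weights, one per pitch class starting at C, e.g. total note durations) are assumptions for this example.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Commonly quoted Krumhansl-Kessler key profiles, tonic first.
MAJOR_PROFILE = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                          2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR_PROFILE = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                          2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def estimate_key(pitch_class_weights):
    """Return (tonic, mode) whose profile correlates best with the input."""
    weights = np.asarray(pitch_class_weights, dtype=float)
    best = None
    for tonic in range(12):
        for mode, profile in (("major", MAJOR_PROFILE), ("minor", MINOR_PROFILE)):
            # Rotate the profile so that its tonic weight sits at index `tonic`.
            score = np.corrcoef(weights, np.roll(profile, tonic))[0, 1]
            if best is None or score > best[0]:
                best = (score, NOTE_NAMES[tonic], mode)
    return best[1], best[2]

# Hypothetical note durations per pitch class (C, C#, D, ... B).
durations = [4, 0, 2, 0, 3, 2, 0, 3, 0, 2, 0, 1]
key = estimate_key(durations)
```

Chromatic notes simply contribute small weights here, and major versus minor is decided by whichever profile fits better, which is why this tends to be more forgiving than matching against a strict set of scale notes.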

So, given the complexity of this approach, an alternative is to look at all the notes being played at the same time. If you have chords, or more than one instrument, then you're going to have a rich spectral soup of many sinusoids playing at once. Each individual note is comprised of multiple harmonics of a fundamental frequency, so A (at 440Hz) will be comprised of sinusoids at 440, 880, 1320... Furthermore, if you play an E (see this diagram for pitches) then that is 659.25Hz which is almost one and a half times that of A (actually 1.498). This means that every 3rd harmonic of A very nearly coincides with every 2nd harmonic of E. This is the reason that chords sound pleasant: they share harmonics. (As an aside, the whole reason that western harmony works is due to the quirk of fate that the twelfth root of 2 to the power 7 is nearly 1.5.)
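
The arithmetic behind that aside is easy to check. This small snippet assumes equal temperament with A = 440 Hz; the variable names are just for illustration.

```python
# Seven equal-tempered semitones make a perfect fifth.
fifth_ratio = 2 ** (7 / 12)          # about 1.4983, close to 3/2

a4 = 440.0
e5 = a4 * fifth_ratio                # about 659.26 Hz

third_harmonic_of_a = 3 * a4         # 1320 Hz
second_harmonic_of_e = 2 * e5        # about 1318.5 Hz
mismatch_hz = third_harmonic_of_a - second_harmonic_of_e
```

The two harmonics land about 1.5 Hz apart, which is close enough to be heard as (nearly) shared.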

If you looked beyond this interval of a 5th to major, minor and other chords then you'll find other ratios. I think that many key finding techniques will enumerate these ratios and then fill a histogram for each spectral peak in the signal. So in the case of detecting the chord A5 you would expect to find peaks at 440, 880, 659, 1320, 1760, 1977. For B5 it'll be 494, 988, 741, etc. So create a frequency histogram and for every sinusoidal peak in the signal (e.g. from the FFT power spectrum) increment the histogram entry. Then for each key A-G tally up the bins in your histogram, and the one with the most entries is most likely to be your key.

That's just a very simple approach but may be enough to find the key of a strummed or sustained chord. You'd also have to chop the signal into small intervals (eg 20ms) and analyse each one to build up a more robust estimate.
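
Here is one way the histogram idea might be sketched. It is a simplification, not the method described above in full: peaks are folded into 12 pitch-class bins rather than matched against enumerated chord ratios. The 4096-sample frame (much longer than 20ms, to get usable frequency resolution), the 1% power threshold and the 50-5000 Hz range are arbitrary choices for this example.

```python
import numpy as np

def pitch_class_histogram(signal, sample_rate, frame_size=4096):
    """Count spectral peaks per pitch class (index 0 = C ... 11 = B)."""
    signal = np.asarray(signal, dtype=float)
    hist = np.zeros(12)
    window = np.hanning(frame_size)
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / sample_rate)
    # Half-overlapping frames across the whole signal.
    for start in range(0, len(signal) - frame_size + 1, frame_size // 2):
        frame = signal[start:start + frame_size] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        # Local maxima of the power spectrum.
        is_peak = (power[1:-1] > power[:-2]) & (power[1:-1] > power[2:])
        peak_bins = np.nonzero(is_peak)[0] + 1
        # Ignore weak peaks (window sidelobes, noise floor).
        peak_bins = peak_bins[power[peak_bins] >= 0.01 * power.max()]
        for b in peak_bins:
            f = freqs[b]
            if 50.0 <= f <= 5000.0:
                midi = int(round(69 + 12 * np.log2(f / 440.0)))
                hist[midi % 12] += 1
    return hist

# One second of a synthetic A + E dyad.
sr = 22050
t = np.arange(sr) / sr
chord = np.sin(2 * np.pi * 440.0 * t) + 0.8 * np.sin(2 * np.pi * 659.26 * t)
hist = pitch_class_histogram(chord, sr)
```

On real recordings the overtones of each note add counts to other pitch classes too, so the histogram is only a rough starting point for a tally per key.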

EDIT:
If you want to experiment then I'd suggest downloading a package like Octave or CLAM which makes it easier to visualise audio data and run FFTs and other operations.

Other useful links:

My PhD thesis on some aspects of pitch recognition -- the maths is a bit heavy going but chapter 2 is (I hope) quite an accessible introduction to the different approaches of modelling musical audio
http://en.wikipedia.org/wiki/Auditory_scene_analysis -- Bregman's Auditory Scene analysis which though not talking about music has some fascinating findings about how we perceive complex scenes
Dan Ellis has done some great papers in this and similar areas
Keith Martin has some interesting approaches

A good pitch detection algorithm should not be detecting chords or determining "major or minor". It should be detecting individual notes. This is how ear with absolute pitch ability works (I do have the ability + musical education) - I do not hear "C major chord". I hear C+E+G and then determine that it is, indeed C major chord. Even if you sit on the piano keyboard or press combo of random keys (like C+Cis+D+Fis+G+Bes+B), I still will be able to name every note, although it will not be a "chord". This is because (my) ear does not operate on chords or tonalities. It operates on notes. — SigTerm, Jun 29, 2010 at 18:04
@SigTerm: the problem isn't as clear cut as you make out. When there are multiple instruments playing (and in particular for orchestral scores) it's simply not possible to hear every single note, but yet it can be simple to hear the chord. From a signal processing point of view the problem is ambiguous since you have several instruments playing the same pitch, or at (almost) integer multiples thereof. Therefore the signal from each instrument isn't orthogonal. I think it was one of Tangian's papers who showed that a complex tone can be indistinguishable from a chord. (see above for link) — the_mandrill, Jun 29, 2010 at 20:48
@the_mandrill: With complex-sounding harmonics, when you're recognizing pitch by ear (and when you can't name all notes instantly), it goes like this: You concentrate on sound of one instrument, then for all currently "active" sounds of every instrument, you concentrate on individual notes and "name" them. Recognition of one note (ear) is instant. Not sure how brain does it, "concentrating" is probably equivalent to setting up filter sensitivity, and picking up individual notes probably equals to histogram scanning. Also, don't forget that it may be possible to use trained neural networks. — SigTerm, Jun 30, 2010 at 7:30
@SigTerm: It's not always possible (or necessary) to hear every single note. A chord composed of C4+C5 may be indistinguishable from a complex tone at C4. The only reason you may be able to hear it as two notes is that you have a prior expectation of the harmonic structure of that particular instrument. If you construct it out of sine waves (which intrinsically is what you're detecting) then it can be impossible to detect. Similarly C4+C5+G5 sounds just like a complex tone at C4. So the whole problem of chord recognition is ambiguous. See Terhardt's virtual pitch theory for more. — the_mandrill, Jun 30, 2010 at 8:35
+1 for the 2^(7/12) ~= 1.5 bit. I've been wondering about that for some time. — Tomer Vromen, Jul 1, 2010 at 8:49

Daniel Brückner · Accepted Answer · 2010-06-29 18:08:59Z

I have worked at the problem of transcribing polyphonic CD recordings into scores for more than two years at university. The problem is notoriously hard. The first scientific papers related to the problem date back to the 1940s and up to today there are no robust solutions for the general case.

All the basic assumptions you usually read are not exactly right, and most of them are wrong enough that they become unusable for everything but very simple scenarios.

The frequencies of overtones are not exact multiples of the fundamental frequency - there are non-linear effects so that the high partials drift away from the expected frequency - and not by only a few Hertz; it is not unusual to find the 7th partial where you expected the 6th.

Fourier transformations do not play nice with audio analysis because the frequencies one is interested in are spaced logarithmically while the Fourier transformation yields linearly spaced frequencies. At low frequencies you need high frequency resolution to separate neighboring pitches - but this yields bad time resolution and you lose the ability to separate notes played in quick succession.
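
A back-of-the-envelope calculation shows how severe this trade-off is. The sketch assumes the rule of thumb that a window of T seconds gives FFT bins spaced 1/T Hz apart, and that one bin per semitone is the bare minimum; real windows need more. The helper name `min_window_seconds` is made up for this example.

```python
def min_window_seconds(freq_hz):
    """Shortest window whose bin spacing equals the semitone gap at freq_hz."""
    semitone_gap_hz = freq_hz * (2 ** (1 / 12) - 1)
    return 1.0 / semitone_gap_hz

low_a = min_window_seconds(55.0)    # A1: about 0.31 s
high_a = min_window_seconds(880.0)  # A5: about 0.019 s
```

So separating neighbouring bass notes needs a window roughly 16 times longer than separating the same notes four octaves higher, and a 0.3 s window easily spans several quick notes.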

An audio recording does (probably) not contain all the information needed to reconstruct the score. A large part of our music perception happens in our ears and brain. That is why some of the most successful systems are expert systems with large knowledge repositories about the structure of (western) music that only rely to a small portion on signal processing to extract information from the audio recording.

When I am back home I will look through the papers I have read and pick the 20 or 30 most relevant ones and add them here. I really suggest reading them before you decide to implement something - as stated before, most common assumptions are somewhat incorrect and you really don't want to rediscover all the things that have been found and analyzed over more than 50 years while implementing and testing.

It's a hard problem, but it's much fun, too. I would really like to hear what you tried and how well it worked.

For now you may have a look at the Constant Q transform, Cepstrum and Wigner(–Ville) distribution. There are also some good papers on how to extract the frequency from shifts in the phase of short time Fourier spectra - this allows the use of very short window sizes (for high time resolution) because the frequency can be determined with a precision several thousand times finer than the frequency resolution of the underlying Fourier transformation.
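
As an illustration of the phase-shift idea, here is a minimal sketch of the standard phase-vocoder style estimate, which may or may not match what any particular paper does. It compares the phase of the strongest bin in two frames taken `hop` samples apart and converts the deviation from the bin's expected phase advance into a frequency correction. The function name and the default `frame_size`/`hop` values are assumptions for this example; it also assumes a single stable sinusoid dominates the bin.

```python
import numpy as np

def phase_refined_frequency(signal, sample_rate, frame_size=1024, hop=64):
    """Refine the strongest FFT bin's frequency using inter-frame phase."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_size)
    first = np.fft.rfft(signal[:frame_size] * window)
    second = np.fft.rfft(signal[hop:hop + frame_size] * window)
    k = int(np.argmax(np.abs(first)))              # strongest bin
    # Phase advance expected if the tone sat exactly on the bin centre.
    expected = 2 * np.pi * k * hop / frame_size
    deviation = np.angle(second[k]) - np.angle(first[k]) - expected
    deviation = (deviation + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi)
    return (k / frame_size + deviation / (2 * np.pi * hop)) * sample_rate

# A 443.2 Hz tone at 8 kHz: the FFT bins are 7.8 Hz apart.
sr = 8000
t = np.arange(2048) / sr
refined = phase_refined_frequency(np.sin(2 * np.pi * 443.2 * t), sr)
```

For this clean synthetic tone the refined value lands far closer to 443.2 Hz than the bin spacing would suggest; real recordings with vibrato, noise or overlapping partials will be less obliging.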

All these transformations fit the problem of audio processing much better than ordinary Fourier transformations. For improving the results of basic transformations have a look at the concept of energy reassignment.

+1. For me though as things stand, I do not have the mathematical knowledge to fully comprehend Constant Q transform like you. What I can do though is try and think of practical solutions based on my not particularly extensive knowledge of computing and programming. — Alex, Jun 30, 2010 at 0:54
+1 for mentioning inharmonicity of the overtone series. I can actually see this in some of my instruments using multiple simultaneous strobe tuners tuned to an overtone series. Notes can also "bend" in frequency as they evolve in time. — hotpaw2, Dec 2, 2010 at 23:46
+1 for mentioning Wigner-Ville. If I was looking at this problem again now then I would certainly be looking at Time-Frequency methods that can trade off time against space. This is also a better model of how we perceive pitch. — the_mandrill, Jan 8, 2011 at 0:08
any examples on "good papers on how to extract the frequency from shifts in the phase of short time Fourier spectra"? Not sure what to google search here — woojoo666, Feb 18, 2017 at 1:14
Would love to see some specific papers you suggest reading to get started! — Meekohi, Dec 20, 2017 at 16:36

bta · Accepted Answer · 2010-06-29 15:36:14Z

You can use the Fourier Transform to calculate the frequency spectrum from an audio sample. From this output, you can use the frequency values for particular notes to turn this into a list of notes heard during the sample. Choosing the strongest notes heard per sample over a series of samples should give you a decent map of the different notes used, which you can compare to the different musical scales to get a list of the possible scales that contain that combination of notes.

To help decide which particular scale is being used, make a note (no pun intended) of the most frequently heard notes. In Western music, the root of the scale is typically the most common note heard, followed by the fifth, and then the fourth. You can also look for patterns such as common chords, arpeggios, or progressions.
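
A toy version of the "compare against scales, then prefer the most common root" idea might look like this. It only considers major scales (each relative minor uses the same notes), and it expects the notes as pitch-class numbers 0-11 with C = 0; both are simplifications chosen for this example, as are the function names.

```python
from collections import Counter

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_SCALE_STEPS = (0, 2, 4, 5, 7, 9, 11)

def candidate_major_keys(pitch_classes):
    """Names of all major scales that contain every heard pitch class."""
    heard = set(pitch_classes)
    matches = []
    for tonic in range(12):
        scale = {(tonic + step) % 12 for step in MAJOR_SCALE_STEPS}
        if heard <= scale:
            matches.append(NOTE_NAMES[tonic])
    return matches

def rank_candidates(note_sequence):
    """Order the candidate keys by how often their root note was heard."""
    counts = Counter(note_sequence)
    candidates = candidate_major_keys(counts.keys())
    return sorted(candidates, key=lambda name: -counts[NOTE_NAMES.index(name)])

# C, E, G, C, A, D, C, G as pitch-class numbers.
melody = [0, 4, 7, 0, 9, 2, 0, 7]
ranking = rank_candidates(melody)   # ['C', 'G', 'F']
```

A single wrongly detected or chromatic note empties the candidate list, so in practice some tolerance (or a weighted profile match) would be needed.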

Sample size will probably be important here. Ideally, each sample will be a single note (so that you don't get two chords in one sample). If you filter out and concentrate on the low frequencies, you may be able to use the volume spikes ("clicks") normally associated with percussion instruments in order to determine the song's tempo and "lock" your algorithm to the beat of the music. Start with samples that are a half-beat in length and adjust from there. Be prepared to throw out some samples that don't have a lot of useful data (such as a sample taken in the middle of a slide).

It's not that easy to extract pitch from a power spectrum - there are much better pitch detection algorithms. – Paul R, Jun 29, 2010 at 15:31
The whole process is a complex one. But very interesting. Chords I think create much complexity as they generate their own resonance and harmonic frequency that must be very difficult to account for in an algorithm! – Alex, Jun 29, 2010 at 15:35
@AlexW- Yes harmonic resonance is present, but it appears at a much lower magnitude than the chord itself. If you know the chord, you can predict the harmonics that might be heard and filter them out of your results accordingly. – bta, Jun 29, 2010 at 15:37
@bta yes that's true. Going by the material generated from this page, it's an all round tricky task. Maybe if you can strip away unnecessary artefacts from music, it would be easier to determine the key (to add a bandpass filter first to get rid of high- and low-frequencies). – Alex, Jun 29, 2010 at 15:41
@AlexW- I would recommend starting with something recorded as a series of electronic tones (from an electronic keyboard, perhaps). Simple tones are much easier to work with, and once you get the hang of that you can slowly move to more and more complex sounds. Real-world instruments (and to a greater degree, voices) are a complex combination of sounds and are much tougher to crack; if you are targeting a specific instrument, it helps if you can filter out anything outside that instrument's range. – bta, Jun 29, 2010 at 15:52

JAB · Accepted Answer · 2010-06-29 15:16:21Z

As far as I can tell from this article, various keys each have their own common frequencies, so it likely analyzes the audio sample to detect what the most common notes and chords are. After all, you can have multiple keys that have the same configuration of sharps and flats, with the difference being the note that the key starts on and thus the chords that such keys use, so it seems that how often the significant notes and chords appear would be the only real way you could figure that sort of thing out. I don't really think you can get a layman's explanation of the actual mathematical formulas without leaving out a lot of information.

Do note that this is coming from somebody who has absolutely no experience in this area, with his first exposure being the article linked in this answer.

MRalwasser · Accepted Answer · 2010-06-29 15:21:42Z

It's a complex topic, but a simple algorithm for determining a single key (single note) would look like this:

Do a Fourier transform on, let's say, 4096 samples (the exact size depends on your resolution demands) of a part of the sample which contains the note. Determine the power peak in the spectrum - this is the frequency of the note.
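
A minimal sketch of that step with NumPy is shown here; the Hann window and the function name `strongest_frequency` are additions for this example. Note that the result can only be as precise as the bin spacing (about 10.8 Hz for 4096 samples at 44.1 kHz), and that the strongest peak is not always the perceived pitch, as the comments below point out.

```python
import numpy as np

def strongest_frequency(samples, sample_rate):
    """Frequency (Hz) of the largest peak in the power spectrum."""
    samples = np.asarray(samples, dtype=float)
    windowed = samples * np.hanning(len(samples))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return freqs[np.argmax(power)]

# 4096 samples of a 440 Hz sine at 44.1 kHz.
sr = 44100
t = np.arange(4096) / sr
peak = strongest_frequency(np.sin(2 * np.pi * 440.0 * t), sr)
```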

Things get harder if you have a chord, different "instruments/effects" or a non-homophonic music pattern.

Yes I think you'd need a fairly clean sample to work with. Plus one that fits with Western tonal structures too of course. Good answer, many thanks. – Alex, Jun 29, 2010 at 15:30
Frequency of a peak != pitch, at least for musical instruments. Better to use one of the popular pitch detection algorithms. – Paul R, Jun 29, 2010 at 15:33
@Paul R - yes I've seen that the perception of volume of a pitch is determined by its frequency, not by some other measure. This also confuses me a bit though. – Alex, Jun 29, 2010 at 15:44
@AlexW: pitch is a percept rather than an actual physical quantity, but it's usually quite close to the fundamental frequency of the note being played. In some instruments though the fundamental frequency may be of quite low amplitude, or even missing altogether, hence the need to use a proper pitch detection algorithm rather than a power spectrum. – Paul R, Jun 29, 2010 at 18:24

Paul R · Accepted Answer · 2010-06-29 18:43:45Z

First you need a pitch detection algorithm (e.g. autocorrelation).

You can then use your pitch detection algorithm to extract the pitch over a number of short time windows. After that you would need to see which musical key the sampled pitches fit best with.
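
The glue between the two steps is turning each detected frequency into a note name so the pitches can be tallied. A small sketch, assuming equal temperament with A4 = 440 Hz and the MIDI numbering convention (A4 = 69); the function names are made up for this example.

```python
import math
from collections import Counter

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def frequency_to_note(freq_hz, a4_hz=440.0):
    """Map a frequency to the nearest (note name, octave)."""
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return NOTE_NAMES[midi % 12], midi // 12 - 1

def pitch_class_counts(frequencies):
    """Tally how often each note name occurs, ignoring the octave."""
    return Counter(frequency_to_note(f)[0] for f in frequencies)

# Hypothetical per-window pitch estimates in Hz.
detected = [261.6, 262.0, 329.6, 392.0, 261.0]
counts = pitch_class_counts(detected)   # C: 3, E: 1, G: 1
```

The resulting counts are what you would then compare against scales or key profiles to pick the best-fitting key.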

I'm not sure this could work on chords as you are hearing many pitches at once. – Alex, Jun 29, 2010 at 15:37
@AlexW: yes, chords are going to be tricky - you are going to want to sample the more melodic and monophonic parts of the music. – Paul R, Jun 29, 2010 at 18:26
At the moment this is to be honest a rather vague notion. The maths is intimidating, but it's important to remember that tools exist to help with 'boilerplate' Fourier transformations. It's a case of understanding this data and experimenting with algorithms. – Alex, Jun 29, 2010 at 23:51

Nathan · Accepted Answer · 2010-06-29 21:06:13Z

If you need to classify a bunch of songs right now, then crowd-source the problem with something like Mechanical Turk.

That would be a 'Mechanical Turk With Pitch Perfect Musical Understanding'... good luck finding your source! – Alex, Jun 29, 2010 at 23:02

Musicologist · Accepted Answer · 2015-07-22 18:03:34Z

Analysing the key is not the same thing as analysing the pitches. Unfortunately the entire concept of key is somewhat ambiguous; the different definitions typically share only the concept of a tonic, i.e. a central pitch/chord. Even if a good system for automatic transcription existed, there is no reliable algorithm for determining key.
