This page describes the process and results of applying audio fingerprinting to find snippets of audio in a large database of live concert recordings from a single artist.
My goal was to search a database of 4,240 hours of concert recordings (roughly every circulating show by the band Phish) for a given "query" audio snippet. The queries are different encodings, often from different sources, and of varying quality (e.g. many are different audience recordings of the same concert).
By extracting fingerprint features and performing a very fast search, we can locate the correct date and track (and time offset within the track) with 96.6% accuracy. Details on all of the test set can be viewed on a separate page.
Note: this page doesn't contain code or a way to run this remotely. This is simply a write-up for anyone interested in the technical details and viewing some interesting examples of successful and unsuccessful queries.
The data consists of two sets: the "database" (like a training set), which is all the audio that will be searched through for a given query, and the "queries" (like a test set).
The "database" audio is a collection of live recordings from the band Phish, consisting of 4,240 hours taken from 1,689 concerts over the years 1983 to 2021. The source is "The Spreadsheet" – a fan-created spreadsheet of nearly every Phish concert with download links (legally shared, following the band's taping policy).
The sources are either "soundboard" (SBD) recordings (~20%) or audience recordings (AUD) (~80%). The band has always allowed audience taping and trading of their live shows, leading to multiple sources of recordings (for example, this page shows the multiple known sources for each show in 1995). Tapers typically trade lossless formats (FLAC or SHN), but all the files in the spreadsheet have been encoded to MP3.
Database stats:
The test set "queries" come from "Mystery Jam Monday" (MJM), a recurring blog post on the fan site phish.net. The organizer posts one or more mystery clips and users must guess the song and date for a small prize. MJM has been active for nearly 12 years. At the time of this experiment, there were 497 instances of MJM, containing 891 mystery clips. The organizers maintain a detailed history in an online spreadsheet. [Update: since doing this experiment, I have learned that MJM has gone on hiatus as of April 2022.]
Importantly for this experiment, the clips are often from a different source than those in the database, i.e. they come from different recordings of the same concert. I don't have the source information, but it appears MJM uses soundboard (SBD) recordings more often than the spreadsheet does; for example, it will use official releases that aren't freely traded. Also, even when the same source recording is used, there are audio differences because the clip is a different MP3 encoding of the originally traded lossless source.
I parsed the MJM spreadsheet and programmatically downloaded all the audio referenced in the blog posts. 41% of the MJM audio files contain more than one clip and thus had to be segmented. Most contain 1, 2, or 3 clips, but the number is as high as 14.
#clips | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 293 | 93 | 78 | 19 | 7 | 1 | 1 | 0 | 1 | 2 | 1 | 0 | 0 | 1 |
I was able to split most of these automatically by detecting fade-out/fade-in and knowing how many clip boundaries were needed in each audio file. These could be quickly verified visually. In other cases, I needed to split them manually (sometimes finding the boundary between clips was not easy because they were blended together; for example, see this one).
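For illustration, here is a hedged sketch of the kind of fade-detection heuristic I mean (not the exact code I used): compute a short-term RMS envelope and take the quietest interior windows as candidate boundaries, using the known number of boundaries per file.

```python
import numpy as np

def find_clip_boundaries(mono, sr, n_boundaries, win_sec=0.5):
    """Guess boundary positions (in seconds) between concatenated clips by
    finding the quietest points of a short-term RMS envelope. Assumes the
    clips were joined with a fade-out/fade-in, so boundaries are near-silent."""
    win = int(win_sec * sr)
    n_windows = len(mono) // win
    frames = mono[:n_windows * win].reshape(n_windows, win)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    # the very start and end of the file are not boundaries
    rms[0] = rms[-1] = np.inf
    # take the n quietest windows as boundary candidates; in practice,
    # adjacent minima from the same fade should be merged and the result
    # verified visually
    candidates = np.argsort(rms)[:n_boundaries]
    return sorted((c + 0.5) * win_sec for c in candidates)
```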
With the audio segmented, we have a "test set" of queries: 891 audio snippets, labeled with date and track name, taken from the MJM spreadsheet.
Audio fingerprints are extracted using a custom implementation of the Chromaprint open-source library. I was expecting to experiment with variations on these features, but it worked quite well out-of-the-box, so for now, I'm using a representation that is essentially equivalent to Chromaprint, re-implemented in Python/numpy.
It converts an input waveform to mono and resamples it to 11,025 Hz. Then, 32 binary features are computed every ~123 ms (~8 frames per second) by the following stages:
Visualized:
Note that for the database of 4,240 hours of music, the total size of computed features is only about 470 MiB (~32 bytes/sec), a roughly 900x compression from the MP3 data.
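For concreteness, here is a rough numpy sketch of a Chromaprint-style extractor. The constants (11,025 Hz, 4096-sample FFT, 1365-sample hop ≈ 123 ms) follow Chromaprint's defaults as I understand them; the specific comparisons used to form the 32 bits are stand-ins of my own (Chromaprint derives its bits from a fixed set of filters over a sliding chroma image), so treat this as illustrative rather than the exact features used here.

```python
import numpy as np

SAMPLE_RATE = 11025   # 44.1 kHz / 4
FRAME_SIZE = 4096     # FFT window
HOP = 1365            # ~123 ms hop -> ~8 frames per second

def chroma(mono: np.ndarray) -> np.ndarray:
    """STFT of a mono signal, with each magnitude spectrum folded into
    12 pitch-class (chroma) bins. Returns shape (n_frames, 12)."""
    n_frames = 1 + (len(mono) - FRAME_SIZE) // HOP
    window = np.hanning(FRAME_SIZE)
    freqs = np.fft.rfftfreq(FRAME_SIZE, 1.0 / SAMPLE_RATE)
    usable = freqs > 27.5  # ignore DC / sub-audio bins
    pitch_class = (np.round(12 * np.log2(freqs[usable] / 440.0)) % 12).astype(int)
    out = np.zeros((n_frames, 12))
    for i in range(n_frames):
        frame = window * mono[i * HOP:i * HOP + FRAME_SIZE]
        mag = np.abs(np.fft.rfft(frame))
        np.add.at(out[i], pitch_class, mag[usable])
    return out

def fingerprint(mono: np.ndarray) -> np.ndarray:
    """Pack 32 binary features per frame into one uint32 per frame.
    The comparisons below (across time and pitch class) are placeholders
    for Chromaprint's actual filter bank."""
    c = chroma(mono)
    bits = np.concatenate(
        [
            c[2:] > c[1:-1],                    # 12 bits: louder than previous frame?
            c[2:] > np.roll(c[2:], 1, axis=1),  # 12 bits: louder than neighbor pitch class?
            c[2:, :8] > c[:-2, :8],             #  8 bits: louder than two frames ago?
        ],
        axis=1,
    )
    weights = (1 << np.arange(32)).astype(np.uint32)
    return (bits.astype(np.uint32) * weights).sum(axis=1).astype(np.uint32)
```

Sanity check on the footprint: 4 bytes per frame at ~8.1 frames per second is ~32.3 bytes per second, and 4,240 hours × 3,600 s/h × ~32.3 B/s ≈ 493 MB ≈ 470 MiB, matching the figure above.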
Given a query of Tq frames, we can do a brute-force search of all length-Tq regions within all fingerprints in the database. For a database track of length Td (Td ≥ Tq), there are K = Td - Tq + 1 possible offsets, and we check all of them.
A brute-force search sounds expensive, but with a highly-optimized implementation, it is surprisingly fast. Since our features are binary, the error between two feature vectors is simply the count of bit differences. This can be expressed as popcnt(xor(F1, F2)), where xor gives 1 if and only if the two inputs are different, and popcnt ("population count") counts the number of set bits.
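Here is a simplified numpy sketch of that search over a single track; the real implementation replaces the bit counting with libpopcnt (described below), but the logic is the same.

```python
import numpy as np

def search_track(query_fp: np.ndarray, track_fp: np.ndarray):
    """Slide a query fingerprint (Tq uint32 frames) over a track fingerprint
    (Td uint32 frames, Td >= Tq) and return (best_offset, best_bit_error_rate)."""
    Tq = len(query_fp)
    # all K = Td - Tq + 1 length-Tq windows of the track, as views (no copy)
    windows = np.lib.stride_tricks.sliding_window_view(track_fp, Tq)
    diff = np.bitwise_xor(windows, query_fp)          # bits that disagree
    # popcnt done naively here by unpacking each uint32 into 32 bits;
    # an optimized implementation counts bits directly instead
    errors = np.unpackbits(diff.view(np.uint8), axis=1).sum(axis=1)
    best = int(errors.argmin())
    return best, errors[best] / (32 * Tq)             # normalized bit error
```

Searching the whole database is then just a loop over per-track fingerprints keeping the global minimum, and because tracks are independent, it parallelizes trivially across cores.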
We use the libpopcnt library, which makes use of AVX2 instructions on Linux (using the algorithms described here). We are using a 4 core processor (4.2GHz Intel Core i7-7700K), and thus do 4x multiprocessing. The result is that we can search the entire database for a 30-second clip in about 2.2 seconds. Details of the speed are shown in the figure below.
It is a testament to modern hardware how much computation can be done in just over 2 seconds. For a 30 second query clip, there are:
For all 861 MJM clips, we find the track+offset with the minimum bit-error in the database. This matches the ground-truth track 96.6% (832 / 861) of the time. Details and audio for every test query can be reviewed on the results page.
It is instructive to look at the bit-error on the correct track vs. all the incorrect tracks:
Notice that the "target track" distribution (blue) is bi-modal: a large peak at less than 5% error and a wider peak around 27% error. I suspect this represents clips that do and do not use the same source recording as the database, respectively. With the same source (first peak, near 5%), there is only a small error due to different MP3 encodings and minor differences in window alignment. Different sources (second peak, near 27%) give significantly higher errors with a wider spread, due to the varying quality of audience recordings.
For audio-fingerprinting tasks, the queries used here are fairly long (the mode is around 30 seconds). How well would this work with shorter clips? Re-running the experiment while truncating the queries to different maximum durations gives the following:
So, we can get to nearly 50% accuracy with only the first 3 seconds of a query. But for longer clips (1.5+ minutes), the search clearly needs all of that time to reach full accuracy. This points to a weakness of the current approach.
Looking through the results individually shows some interesting cases of when the algorithm failed or succeeded.
MJM 339#3 (You Enjoy Myself, 1991/03/16) is an interesting failure. Not only does the search not find the correct song, but even when it is constrained to the correct song, it does not find the correct time offset. Upon manual inspection, the database version of this track is at a different speed than the query:
Snippet from Query | Snippet from Database |
The matching algorithm simply overlays the entirety of the query over prospective matches (i.e. there is no dynamic time-warping), so consistent timing over the course of the clip is very important. In addition, simply speeding up the audio shifts the pitch, which has a strong impact on the chroma features.
The following figure shows a spectral slice extracted from these two snippets, centered on one of the strong guitar notes. The pitch is off by about 1.5 half-steps. I believe the database version is the one that is out of tune, i.e. the database copy was somehow incorrectly sped up at some step between recording and distribution.
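For reference, if that pitch difference comes entirely from a speed change, the implied speed ratio is 2^(1.5/12) ≈ 1.09, i.e. the database copy plays roughly 9% fast, which is more than enough timing drift to ruin a rigid frame-by-frame overlay over a clip of this length.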
MJM 173 (Harry Hood, 1992/05/17) is another failure that is due to a mismatch in audio speed:
Snippet from Query | Snippet from Database |
I've found that at least one error is due to the difference between query and database in where the audio is split into tracks. Phish will often perform a seamless segue between two songs, making the exact time for splitting a concert recording into separate tracks somewhat arbitrary. In MJM 30 (Split Open and Melt > Catapult, 1999/12/31), the query clip spans two database tracks and thus the search algorithm as-is will not correctly locate it.
In this example, I confirmed manually that if the query is run again with these two neighboring tracks merged as one, it does correctly locate it. The whole experiment could be run such that whole concerts (or sets) are treated as continuous tracks, at the cost of some processing time.
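A sketch of what that merging could look like (hypothetical helper names, not the actual code): concatenate consecutive per-track fingerprints and keep the track start offsets, so a match in the merged array can be mapped back to a track and a time within it.

```python
import numpy as np

def merge_fingerprints(track_fps):
    """Concatenate consecutive track fingerprints into one searchable array,
    remembering where each track starts."""
    starts = np.cumsum([0] + [len(fp) for fp in track_fps[:-1]])
    merged = np.concatenate(track_fps)
    return merged, starts

def locate(offset, starts):
    """Map an offset in the merged fingerprint back to (track index, offset)."""
    track = int(np.searchsorted(starts, offset, side="right")) - 1
    return track, offset - starts[track]
```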
Most of the rest of the errors I would subjectively classify as having "low musical content". There is not very clear melodic or rhythmic content in the query; rather, it is dominated by things like soundscapes, feedback, and crowd noise. You can hear those on the results page.
What I've described here is a first-pass solution to this problem. It could be improved in both accuracy and efficiency.
The work of Haitsma and Kalker [2] also uses a 32-bit-per-frame fingerprint (although their bits come from taking diffs in something like a mel-spectrogram). They suggest a lookup table (or hash function) on 32-bit values under the assumption that at least one frame in the query clip will be identical to the matching clip in the database. Then, the search can be limited to those songs that share at least one frame and the time offsets can be limited to keeping the matching frames aligned.
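A minimal sketch of that lookup-table idea, as I understand it from [2] (not code from the paper): build an inverted index from each 32-bit frame value to the places it occurs, then only consider alignments that keep at least one exact frame match.

```python
from collections import defaultdict
import numpy as np

def build_index(track_fps):
    """Map each 32-bit frame value to every (track, offset) where it occurs."""
    index = defaultdict(list)
    for track_id, fp in enumerate(track_fps):
        for offset, value in enumerate(fp):
            index[int(value)].append((track_id, offset))
    return index

def candidate_alignments(query_fp, index):
    """Candidate (track, start offset) pairs that share at least one exact
    frame with the query, with the shared frames kept aligned."""
    candidates = set()
    for q_offset, value in enumerate(query_fp):
        for track_id, d_offset in index.get(int(value), []):
            start = d_offset - q_offset
            if start >= 0:
                candidates.add((track_id, start))
    return candidates
```

Only the surviving (track, offset) candidates would then need the full bit-error computation.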
Is that applicable here? The plot below suggests that it probably is not.
This shows the percentage of frames (32-bit fingerprints) that are equal within the best-matching database clip. The first bar shows that for 41% of all queries, there are zero matches. So, for all of these, the match won't be found by checking for identical frames.
This fingerprinting approach works pretty well, but it clearly makes a few odd mistakes, requires fairly long query snippets, and requires a brute-force search. A better approach would be to use time-frequency landmarks, as supposedly used by Shazam [4] (also described by Ellis [3]). I would like to implement something like this and compare the performance.