This page describes the process and results of applying audio fingerprinting to find snippets of audio in a large database of live concert recordings from a single artist.
My goal was to search a database of 4,240 hours of concert recordings (roughly every circulating show by the band Phish) for a given "query" audio snippet. The queries are different encodings, often from different sources, and of varying quality (e.g. many are different audience recordings of the same concert).
By extracting fingerprint features and performing a very fast search, we can locate the correct date and track (and time offset within the track) with 96.6% accuracy. Details on all of the test set can be viewed on a separate page.
Note: this page doesn't contain code or a way to run this remotely. This is simply a write-up for anyone interested in the technical details and viewing some interesting examples of successful and unsuccessful queries.
The data consists of two sets: the "database" (like a training set), which is all the audio that will be searched through for a given query, and the "queries" (like a test set).
The "database" audio is a collection of live recordings from the band Phish, consisting of 4,240 hours taken from 1,689 concerts over the years 1983 to 2021. The source is "The Spreadsheet" – a fan-created spreadsheet of nearly every Phish concert with download links (legally shared, following the band's taping policy).
The sources are either "soundboard" (SBD) recordings (~20%) or audience recordings (AUD) (~80%). The band has always allowed audience taping and trading of their live shows, leading to multiple sources of recordings (for example, this page shows the multiple known sources for each show in 1995). Tapers typically trade lossless formats (FLAC or SHN), but all the files in the spreadsheet have been encoded to MP3.
Database stats:
The test set "queries" come from "Mystery Jam Monday" (MJM), a recurring blog post on the fan site phish.net. The organizer posts one or more mystery clips and users must guess the song and date for a small prize. MJM has been active for nearly 12 years. At the time of this experiment, there were 497 instances of MJM, containing 891 mystery clips. The organizers maintain a detailed history in an online spreadsheet. [Update: since doing this experiment, I have learned that MJM has gone on hiatus as of April 2022.]
Importantly for this experiment, the clips are often from a different source than those in the database, i.e. they come from different recordings of the same concert. I don't have the source information, but it appears MJM uses soundboard (SBD) recordings more often than the spreadsheet does; for example, it will use official releases that aren't freely traded. Also, even when the same source recording is used, there are audio differences because the clip is a different MP3 encoding of the originally traded lossless source.
I parsed the MJM spreadsheet and programmatically downloaded all the audio referenced in the blog posts. 41% of the MJM audio files contain more than one clip and thus had to be segmented. Most contain 1, 2, or 3 clips, but the number is as high as 14.
#clips | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 293 | 93 | 78 | 19 | 7 | 1 | 1 | 0 | 1 | 2 | 1 | 0 | 0 | 1 |
I was able to split most of these automatically by detecting fade-out/fade-in and knowing how many clip boundaries were needed in each audio file. These could be quickly verified visually. In other cases, I needed to split them manually (sometimes finding the boundary between clips was not easy because they were blended together; for example, see this one).
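For illustration, here is a hedged sketch of the kind of fade-detection heuristic I mean (not the exact code I used): compute a short-term RMS envelope and take the quietest interior windows as candidate boundaries, using the known number of boundaries per file.

```python
import numpy as np

def find_clip_boundaries(mono, sr, n_boundaries, win_sec=0.5):
    """Guess boundary positions (in seconds) between concatenated clips by
    finding the quietest points of a short-term RMS envelope. Assumes the
    clips were joined with a fade-out/fade-in, so boundaries are near-silent."""
    win = int(win_sec * sr)
    n_windows = len(mono) // win
    frames = mono[:n_windows * win].reshape(n_windows, win)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    # the very start and end of the file are not boundaries
    rms[0] = rms[-1] = np.inf
    # take the n quietest windows as boundary candidates; in practice,
    # adjacent minima from the same fade should be merged and the result
    # verified visually
    candidates = np.argsort(rms)[:n_boundaries]
    return sorted((c + 0.5) * win_sec for c in candidates)
```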
With the audio segmented, we have a "test set" of queries: 891 audio snippets, labeled with date and track name, taken from the MJM spreadsheet.
Audio fingerprints are extracted using a custom implementation of the Chromaprint open-source library. I was expecting to experiment with variations on these features, but it worked quite well out-of-the-box, so for now, I'm using a representation that is essentially equivalent to Chromaprint, re-implemented in Python/numpy.
It converts an input waveform to mono and resamples it to 11,025 Hz. Then, 32 binary features are computed every ~123 ms (~8 frames per second) by the following stages:
Visualized:
Note that for the database of 4,240 hours of music, the total size of computed features is only about 470 MiB (~32 bytes/sec), a roughly 900x compression from the MP3 data.
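For concreteness, here is a rough numpy sketch of a Chromaprint-style extractor. The constants (11,025 Hz, 4096-sample FFT, 1365-sample hop ≈ 123 ms) follow Chromaprint's defaults as I understand them; the specific comparisons used to form the 32 bits are stand-ins of my own (Chromaprint derives its bits from a fixed set of filters over a sliding chroma image), so treat this as illustrative rather than the exact features used here.

```python
import numpy as np

SAMPLE_RATE = 11025   # 44.1 kHz / 4
FRAME_SIZE = 4096     # FFT window
HOP = 1365            # ~123 ms hop -> ~8 frames per second

def chroma(mono: np.ndarray) -> np.ndarray:
    """STFT of a mono signal, with each magnitude spectrum folded into
    12 pitch-class (chroma) bins. Returns shape (n_frames, 12)."""
    n_frames = 1 + (len(mono) - FRAME_SIZE) // HOP
    window = np.hanning(FRAME_SIZE)
    freqs = np.fft.rfftfreq(FRAME_SIZE, 1.0 / SAMPLE_RATE)
    usable = freqs > 27.5  # ignore DC / sub-audio bins
    pitch_class = (np.round(12 * np.log2(freqs[usable] / 440.0)) % 12).astype(int)
    out = np.zeros((n_frames, 12))
    for i in range(n_frames):
        frame = window * mono[i * HOP:i * HOP + FRAME_SIZE]
        mag = np.abs(np.fft.rfft(frame))
        np.add.at(out[i], pitch_class, mag[usable])
    return out

def fingerprint(mono: np.ndarray) -> np.ndarray:
    """Pack 32 binary features per frame into one uint32 per frame.
    The comparisons below (across time and pitch class) are placeholders
    for Chromaprint's actual filter bank."""
    c = chroma(mono)
    bits = np.concatenate(
        [
            c[2:] > c[1:-1],                    # 12 bits: louder than previous frame?
            c[2:] > np.roll(c[2:], 1, axis=1),  # 12 bits: louder than neighbor pitch class?
            c[2:, :8] > c[:-2, :8],             #  8 bits: louder than two frames ago?
        ],
        axis=1,
    )
    weights = (1 << np.arange(32)).astype(np.uint32)
    return (bits.astype(np.uint32) * weights).sum(axis=1).astype(np.uint32)
```

Sanity check on the footprint: 4 bytes per frame at ~8.1 frames per second is ~32.3 bytes per second, and 4,240 hours × 3,600 s/h × ~32.3 B/s ≈ 493 MB ≈ 470 MiB, matching the figure above.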
Given a query of Tq frames, we can do a brute-force search of all length-Tq regions within all fingerprints in the database. For a database track of length Td (Td ≥ Tq), there are K = Td - Tq + 1 possible offsets, and we check all of them.
A brute-force search sounds expensive, but with a highly-optimized implementation, it is surprisingly fast. Since our features are binary, the error between two feature vectors is simply the count of bit differences. This can be expressed as popcnt(xor(F1, F2)), where xor gives 1 if and only if the two inputs are different, and popcnt ("population count") counts the number of set bits.
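Here is a simplified numpy sketch of that search over a single track; the real implementation replaces the bit counting with libpopcnt (described below), but the logic is the same.

```python
import numpy as np

def search_track(query_fp: np.ndarray, track_fp: np.ndarray):
    """Slide a query fingerprint (Tq uint32 frames) over a track fingerprint
    (Td uint32 frames, Td >= Tq) and return (best_offset, best_bit_error_rate)."""
    Tq = len(query_fp)
    # all K = Td - Tq + 1 length-Tq windows of the track, as views (no copy)
    windows = np.lib.stride_tricks.sliding_window_view(track_fp, Tq)
    diff = np.bitwise_xor(windows, query_fp)          # bits that disagree
    # popcnt done naively here by unpacking each uint32 into 32 bits;
    # an optimized implementation counts bits directly instead
    errors = np.unpackbits(diff.view(np.uint8), axis=1).sum(axis=1)
    best = int(errors.argmin())
    return best, errors[best] / (32 * Tq)             # normalized bit error
```

Searching the whole database is then just a loop over per-track fingerprints keeping the global minimum, and because tracks are independent, it parallelizes trivially across cores.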
We use the libpopcnt library, which makes use of AVX2 instructions on Linux (using the algorithms described here). We are using a 4 core processor (4.2GHz Intel Core i7-7700K), and thus do 4x multiprocessing. The result is that we can search the entire database for a 30-second clip in about 2.2 seconds. Details of the speed are shown in the figure below.
It is a testament to modern hardware how much computation can be done in just over 2 seconds. For a 30 second query clip, there are:
For all 861 MJM clips, we find the track+offset with the minimum bit-error in the database. This matches the ground-truth track 96.6% (832 / 861) of the time. Details and audio for every test query can be reviewed on the results page.
It is instructive to look at the bit-error on the correct track vs. all the incorrect tracks:
Notice that the "target track" distribution (blue) is bi-modal: a large peak at less than 5% error and a wider peak around 27% error. I suspect this represents clips that do and do not use the same source recording as the database, respectively. With the same source (first peak, near 5%), there is only a small error due to different MP3 encodings and minor differences in window alignment. Different sources (second peak, near 27%) give significantly higher errors with a wider spread, due to the varying quality of audience recordings.
For audio-fingerprinting tasks, the queries used here are fairly long (the mode is around 30 seconds). How well would this work with shorter clips? Re-running the experiment while truncating the queries to different maximum durations gives the following:
So, we can get to nearly 50% accuracy with only the first 3 seconds of a query. But for longer clips (1.5+ minutes), the search clearly needs all of that time to reach full accuracy. This points to a weakness of the current approach.
Looking through the results individually shows some interesting cases of when the algorithm failed or succeeded.
MJM 339#3 (You Enjoy Myself, 1991/03/16) is an interesting failure. Not only does the search not find the correct song, but even when it is constrained to the correct song, it does not find the correct time offset. Upon manual inspection, the database version of this track is at a different speed than the query:
Snippet from Query | Snippet from Database |
The matching algorithm simply overlays the entirety of the query over prospective matches (i.e. there is no dynamic time-warping), so consistent timing over the course of the clip is very important. In addition, simply speeding up the audio shifts the pitch, which has a strong impact on the chroma features.
The following figure shows a spectral slice extracted from these two snippets, centered on one of the strong guitar notes. The pitch is off by about 1.5 half-steps. I believe the database version is the one that is out of tune, i.e. the database copy was somehow incorrectly sped up at some step between recording and distribution.
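For reference, if that pitch difference comes entirely from a speed change, the implied speed ratio is 2^(1.5/12) ≈ 1.09, i.e. the database copy plays roughly 9% fast, which is more than enough timing drift to ruin a rigid frame-by-frame overlay over a clip of this length.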
MJM 173 (Harry Hood, 1992/05/17) is another failure that is due to a mismatch in audio speed:
Snippet from Query | Snippet from Database |
I've found that at least one error is due to the difference between query and database in where the audio is split into tracks. Phish will often perform a seamless segue between two songs, making the exact time for splitting a concert recording into separate tracks somewhat arbitrary. In MJM 30 (Split Open and Melt > Catapult, 1999/12/31), the query clip spans two database tracks and thus the search algorithm as-is will not correctly locate it.
In this example, I confirmed manually that if the query is run again with these two neighboring tracks merged as one, it does correctly locate it. The whole experiment could be run such that whole concerts (or sets) are treated as continuous tracks, at the cost of some processing time.
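A sketch of what that merging could look like (hypothetical helper names, not the actual code): concatenate consecutive per-track fingerprints and keep the track start offsets, so a match in the merged array can be mapped back to a track and a time within it.

```python
import numpy as np

def merge_fingerprints(track_fps):
    """Concatenate consecutive track fingerprints into one searchable array,
    remembering where each track starts."""
    starts = np.cumsum([0] + [len(fp) for fp in track_fps[:-1]])
    merged = np.concatenate(track_fps)
    return merged, starts

def locate(offset, starts):
    """Map an offset in the merged fingerprint back to (track index, offset)."""
    track = int(np.searchsorted(starts, offset, side="right")) - 1
    return track, offset - starts[track]
```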
Most of the rest of the errors I would subjectively classify as having "low musical content". There is not very clear melodic or rhythmic content in the query; rather, it is dominated by things like soundscapes, feedback, and crowd noise. You can hear those on the results page.
What I've described here is a first-pass solution to this problem. It could be improved in both accuracy and efficiency.
The work of Haitsma and Kalker [2] also uses a 32-bit-per-frame fingerprint (although their bits come from taking diffs in something like a mel-spectrogram). They suggest a lookup table (or hash function) on 32-bit values under the assumption that at least one frame in the query clip will be identical to the matching clip in the database. Then, the search can be limited to those songs that share at least one frame and the time offsets can be limited to keeping the matching frames aligned.
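A minimal sketch of that lookup-table idea, as I understand it from [2] (not code from the paper): build an inverted index from each 32-bit frame value to the places it occurs, then only consider alignments that keep at least one exact frame match.

```python
from collections import defaultdict
import numpy as np

def build_index(track_fps):
    """Map each 32-bit frame value to every (track, offset) where it occurs."""
    index = defaultdict(list)
    for track_id, fp in enumerate(track_fps):
        for offset, value in enumerate(fp):
            index[int(value)].append((track_id, offset))
    return index

def candidate_alignments(query_fp, index):
    """Candidate (track, start offset) pairs that share at least one exact
    frame with the query, with the shared frames kept aligned."""
    candidates = set()
    for q_offset, value in enumerate(query_fp):
        for track_id, d_offset in index.get(int(value), []):
            start = d_offset - q_offset
            if start >= 0:
                candidates.add((track_id, start))
    return candidates
```

Only the surviving (track, offset) candidates would then need the full bit-error computation.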
Is that applicable here? The plot below suggests that it probably is not.
This shows the percentage of frames (32-bit fingerprints) that are equal within the best-matching database clip. The first bar shows that for 41% of all queries, there are zero matches. So, for all of these, the match won't be found by checking for identical frames.
This fingerprinting approach works pretty well, but it clearly makes a few odd mistakes, requires fairly long query snippets, and requires a brute-force search. A better approach would be to use time-frequency landmarks, as supposedly used by Shazam [4] (also described by Ellis [3]). I would like to implement something like this and compare the performance.