Extraction of musical structure

I think my next big project will involve automatically extracting structure from music. Mike and I had some discussions about doing this with machine learning / evolutionary algorithms, which produced some interesting ideas. For now I'm implementing some of the more traditional signal-processing techniques. There's an overview of the literature in this paper.

What I have to show so far is this:

This (ignoring the added colors) is a representation of the autocorrelation of a piece of music ("Starlight" by Muse). Each pixel of distance along either axis represents one second of time, and the darkness of the pixel at (x, y) is proportional to the difference in average intensity between those two points in time. Thus, light squares on the diagonal represent parts of the song that are homogeneous with respect to energy.
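The computation behind the plot boils down to something like this (a minimal sketch: a synthetic two-section "song" stands in for real audio, and the one-second window and RMS energy measure are my choices here):

```python
import numpy as np

def energy_similarity(samples, rate):
    """One RMS-energy value per second, then a matrix whose entry (x, y)
    is -|energy[x] - energy[y]|: lighter (closer to zero) = more similar."""
    win = rate  # one-second analysis windows
    n = len(samples) // win
    energy = np.array([np.sqrt(np.mean(samples[i*win:(i+1)*win] ** 2))
                       for i in range(n)])
    return -np.abs(energy[:, None] - energy[None, :])

# Synthetic stand-in for a song: five quiet seconds, then five loud ones
rate = 8000
t = np.arange(10 * rate) / rate
samples = np.where(t < 5, 0.1, 1.0) * np.sin(2 * np.pi * 440 * t)
S = energy_similarity(samples, rate)
# The two 5x5 blocks on the diagonal come out light (near zero);
# the off-diagonal blocks come out dark (large energy difference).
```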

The colored boxes were added by hand, and represent the musical structure (mostly, which instruments are active). So it's clear that the autocorrelation plot does express structure, although at this crude level it's probably not good enough for extracting this structure automatically. (For some songs, it would be; for example, this algorithm is very good at distinguishing "guitar" from "guitar with screaming" in "Smells Like Teen Spirit" by Nirvana.) An important idea here is that the plot can show not only where the boundaries between musical sections are, but also which sections are similar (see for example the two cyan boxes above).

The next step will be to compare power spectra obtained via FFT, rather than a one-dimensional average power. This should help distinguish sections that have similar energy but use different instruments. The paper referenced above also used global beat detection to lock the analysis frames to beats (and to measures, by assuming 4/4 time). That's fine for DDR music (J-Pop and terrible house remixes of '80s music), but maybe we should be a bit more general. On the other hand, this approach is likely to improve quality when the assumptions of constant meter and tempo are met.
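The spectral version might look roughly like this (a sketch, not the final implementation; one-second frames and cosine similarity between power spectra are my assumptions):

```python
import numpy as np

def spectral_similarity(samples, rate):
    """Power spectrum of each one-second frame, compared by cosine
    similarity so spectral shape matters more than overall loudness."""
    win = rate
    n = len(samples) // win
    spectra = np.array([np.abs(np.fft.rfft(samples[i*win:(i+1)*win])) ** 2
                        for i in range(n)])
    unit = spectra / np.linalg.norm(spectra, axis=1, keepdims=True)
    return unit @ unit.T  # entry (x, y) near 1 means near-identical spectra

# Equal loudness, different "instruments": 440 Hz, then 880 Hz
rate = 8000
t = np.arange(6 * rate) / rate
samples = np.where(t < 3, np.sin(2*np.pi*440*t), np.sin(2*np.pi*880*t))
sim = spectral_similarity(samples, rate)
# A plain energy comparison can't tell these two sections apart; this can.
```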

On the output side, I'm thinking of using this to control the generation of flam3 animations. The effect would basically be Electric Sheep synced up with music of your choice, including smooth transitions between sheep at musical section boundaries. The sheep could be automatically chosen, or selected from the online flock in an interactive editor, which could also provide options to modify the extracted structure (associate/dissociate sections, merge sections, break a section into an integral number of equal parts, etc.). For a physical installation, add a beefy compute cluster (for realtime preview), an iPod dock / USB port (so participants can provide their own music), a snazzy touchscreen interface, and a DVD burner to take home your creations.


  1. Anonymous (29.7.07)

    could you do something with diffusion of parameters along correlated samples...

    I'm not sure what I meant by that

  2. what you might do is set up a manual input mode to fly or dance the visualiser along with the music, then use the covariance matrix to recognize familiar regions and replay the manual input. If you make the manual control general enough to allow interpolating visualiser states, you might be able to train the music -> visualiser mapping fairly quickly and well.

  3. Since you seem to already have some code that computes some similarity metric between different sections of music,

    could you use this code to turn a "pile" of music sample frames into a linear piece where the next frame is chosen with a probability proportional to its similarity to the previous frame?
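This resequencing idea could be sketched like so (a hypothetical `resequence` function; it assumes a non-negative similarity matrix has already been computed somehow, which is not quite what the plot above gives you directly):

```python
import numpy as np

def resequence(similarity, length, rng=None):
    """Walk through frames, picking each next frame with probability
    proportional to its similarity to the current one.
    `similarity` is an (n, n) matrix of non-negative scores."""
    rng = np.random.default_rng(rng)
    n = similarity.shape[0]
    order = [rng.integers(n)]
    for _ in range(length - 1):
        weights = similarity[order[-1]].astype(float).copy()
        weights[order[-1]] = 0.0  # don't repeat the current frame
        order.append(rng.choice(n, p=weights / weights.sum()))
    return order

# Toy run: four frames, all equally similar to each other
sim = np.ones((4, 4))
order = resequence(sim, 10, rng=0)
```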

  4. hmm... I'm trying to do something similar for time series data on neurons at the moment. I could use this global average power thing if I were interested in repeated oscillation motifs (which may itself be interesting, but I don't think I have any interesting structure like that in my current models).

    no, what I'm trying to do is look for specific sequences of neuron firing that happen to repeat (without any restriction on what I expect those sequences to be).

    so, my best guess is to just compare the whole population vector over 100ms between all time offsets. This is computationally annoying.
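For what it's worth, that brute-force comparison could be sketched like this (assumptions: activity is already binned into a (timesteps, neurons) array, and correlation between flattened window vectors is the metric):

```python
import numpy as np

def motif_similarity(rates, win):
    """rates: (timesteps, neurons) array of binned activity.  Flatten each
    `win`-step window into one population vector and correlate every window
    against every other (the brute-force O(T^2) comparison)."""
    T, _ = rates.shape
    windows = np.array([rates[t:t + win].ravel() for t in range(T - win + 1)])
    windows = windows - windows.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(windows, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # silent windows stay at zero correlation
    unit = windows / norms
    return unit @ unit.T  # entry (x, y): correlation of windows at x and y

# Toy example: the same 5-step firing pattern repeated at t = 0 and t = 20
rng = np.random.default_rng(0)
pattern = rng.random((5, 3))
rates = np.zeros((30, 3))
rates[0:5] = pattern
rates[20:25] = pattern
sim = motif_similarity(rates, win=5)
# Repeats show up as high off-diagonal entries, e.g. at (0, 20).
```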