Cleaning two decades of family music
This is a bit of a shaggy-dog story, but it’s about something I put off doing for years before tackling it this weekend. We’ll need to go into a bit of background, including how early podcasts were handled, and a parent with a penchant for collecting things. Strap in!
How you end up with 800 GiB of audio
My parents took music seriously, and it was a big part of my sister’s and my childhood. They had dozens of shelves bursting with LPs and CDs; so much so that there wasn’t ever space to show all of them. This came to a head during one of our many moves where we were going from a larger house in Malaysia back to a Singaporean apartment.
For my dad’s birthday one year, my sister and I decided to take all his CDs and sleeves out of their jewel cases, and organise them based on title and artist into folders that he’d always be able to access and find. All told, there were more than 500 discs.
While my sister did that, I got to work ripping each one. I went to Sim Lim Square and bought a giant new hard drive, then babysat iTunes for a few weeks while I fed it a constant stream of holographic discs. I’d have it spinning away in the background while I did other stuff, then I’d hear the clunk zzzip of the CD-ROM tray popping out, and I’d put the next one in.
We didn’t get them all done in time, but by then my dad was happy to help with the endeavour :).
Ripping CDs en masse like this worked out great for my dad, who was able to sync a random selection to his iPod every few weeks and take it in the car and on business trips. Over the years the library was slowly added to, and that iPod became an iPhone, but he’s used it as the basis of his music collection since.
The early days of podcasts
Around the same time we were ripping these CDs, I started getting into audio magazines, New Time Radio, and downloadable spoken word shows, which were later dubbed podcasts. Before Apple included official support in iTunes in the mid-2000s, you’d download them through a dedicated “podcatcher” like iPodder. This would run in the background and periodically poll your shows’ RSS feeds, then download new enclosures and transfer them over to iTunes automatically.
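For the curious, the core of what a podcatcher does can be sketched in a few lines of Python. This is only an illustration of the idea, not how iPodder was actually written: the sample feed, its URL, and the find_enclosures function are all made up here, and a real podcatcher would fetch the feed over HTTP on a timer and then download each file.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed, invented for illustration only.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Show</title>
    <item>
      <title>Episode 1</title>
      <enclosure url="http://example.com/ep1.mp3" length="12345" type="audio/mpeg"/>
    </item>
    <item>
      <title>Show notes only</title>
    </item>
  </channel>
</rss>"""


def find_enclosures(feed_xml, already_seen):
    """Return (title, url) for every enclosure we have not fetched before."""
    root = ET.fromstring(feed_xml)
    new_episodes = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is None:
            # Plain blog-style items carry no audio attachment.
            continue
        url = enclosure.get("url")
        if url and url not in already_seen:
            new_episodes.append((item.findtext("title", default=""), url))
    return new_episodes


if __name__ == "__main__":
    for title, url in find_enclosures(SAMPLE_FEED, already_seen=set()):
        print(title, url)
```

Keeping a set of already-seen URLs is what stops the same episode from being downloaded on every poll.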
This also worked exceedingly well. I listened to The Overnightscape, Whole Wheat Radio shows, Crap and Stuff, Dave Winer’s BloggerCon, and IT Conversations on regular rotation, by syncing the custom playlist to my iPod each morning before school. Standing waiting for the MRT with a third-gen iPod listening to Frank, Jim, Esther, Israel, and Doug feels like an age ago now.
I eventually splintered off into my own music library which I maintain to this day, albeit using other software now. But I still keep current backups of all my dad’s music, and it was this nearly 800 GiB directory of stuff that I tackled this weekend.
Duplicates and mixed content
People with experience ripping CDs might have already spotted the problem with our process above! When you’re ripping a couple of CDs, you can check to make sure the metadata is correct, that you haven’t ripped it before, and that the rip was successful (CDs have robust error correction, but even that could fail on badly scratched or handled discs). When you’re ripping hundreds of CDs, you slip up a bit more.
While my dad was still happy with all his ripped music and custom mixes of random songs on his trips, the reality was the iTunes library was a complete shambles!
There were a bunch of reasons. We’d all clearly bought the same CD multiple times over the years without checking, so there were hundreds of duplicated tracks. When you move between countries the same album might have been released under a different name, or have a different track order, or have ID3 metadata with different information (such as having or missing translations). Compilation albums also tend to pull from the same pool of songs, so Super Essential 1960s has the same stuff as Max Also 60’s Flower Power Also Can.
It was also a lesson that the CDDB could be great, but the community-contributed data wasn’t always accurate. My dad loves what we in the West euphemistically refer to as “world music” (aka: not English!), and these albums were especially prone to misspellings, sometimes even of the artist’s name within the same album. Albums would be erroneously listed as compilations, and vice versa. When Clara and I moved in together, a bunch of Cantonese and Japanese music was added that had similar metadata issues.
Mixed into this hodgepodge were a few years of those early podcast episodes, which iTunes had dutifully filed by artist as if they were songs. IT Conversations shows were especially prone to this, because Doug would attribute his guest over himself (which I used to think was quite noble). Some other early shows, shall we say, played fast and loose with their metadata, so the same show might have episodes strewn across hundreds of different folders.
The results
I started cleaning these folder by folder, but before I even got to B I wrote a few horrible Perl scripts to create a tree of albums and attempt to fuzzy match them based on artist name or album. Then I built a dictionary of every conceivable name and host of a podcast I would have previously listened to, and used the_silver_searcher to try and find them.
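As a rough idea of what fuzzy matching on names looks like, here is a minimal sketch in Python rather than Perl. It is not the script described above: the function names, the 0.85 threshold, and the sample folder names are assumptions for illustration, and it leans on difflib from the standard library.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalise(name):
    """Lowercase, turn underscores into spaces, and collapse whitespace."""
    return " ".join(name.lower().replace("_", " ").split())


def similarity(a, b):
    """Score two names from 0.0 (nothing alike) to 1.0 (identical)."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def likely_duplicates(names, threshold=0.85):
    """Return pairs of names whose similarity meets an arbitrary threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


if __name__ == "__main__":
    # Hypothetical folder names.
    albums = ["Super Essential 1960s", "Super_Essential_1960s", "Evening Ragas"]
    for a, b, score in likely_duplicates(albums):
        print(f"{score}: {a!r} ~ {b!r}")
```

Comparing every pair is quadratic, which is fine for a few thousand albums but would want bucketing (say, by first letter) for anything much larger. The matches are only candidates; each still needs a human look before anything gets deleted.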
We had entire duplicate compilation music sets differentiated by a single underscore, or multiple versions of a song that differed only by bitrate. By the time I deleted these duplicates, sorted the remaining tracks, and pulled out all the podcasts into their own archive folder, I’d saved about 40% of the space!
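Duplicates that differ only by an underscore or a bitrate are the easy case, because they collapse to the same key once the names are cleaned up. A minimal sketch in Python, assuming the artist, title, and bitrate have already been read from each file's tags; the Track type and all the sample data are hypothetical:

```python
from collections import defaultdict
from typing import NamedTuple


class Track(NamedTuple):
    path: str
    artist: str
    title: str
    bitrate_kbps: int


def key_for(track):
    """Build a comparison key that ignores case, underscores, and spacing."""
    def clean(text):
        return " ".join(text.lower().replace("_", " ").split())
    return (clean(track.artist), clean(track.title))


def split_keep_and_remove(tracks):
    """Keep the highest-bitrate copy of each song; list the rest for removal."""
    groups = defaultdict(list)
    for track in tracks:
        groups[key_for(track)].append(track)
    keep, remove = [], []
    for group in groups.values():
        ranked = sorted(group, key=lambda t: t.bitrate_kbps, reverse=True)
        keep.append(ranked[0])
        remove.extend(ranked[1:])
    return keep, remove


if __name__ == "__main__":
    library = [
        Track("a/song_one.mp3", "Some_Artist", "Song One", 128),
        Track("b/song one.mp3", "Some Artist", "Song_One", 256),
        Track("c/other.mp3", "Some Artist", "Other Song", 192),
    ]
    keep, remove = split_keep_and_remove(library)
    print("keep:", [t.path for t in keep])
    print("remove:", [t.path for t in remove])
```

Preferring the higher bitrate is a judgment call: it assumes both copies come from the same source and that a bigger number means a better rip, which won't always hold.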
A day of digital scrubbing later, and I now have two folders where I once had one. Before:
# du -sh /zmedia/audio/music
==> 813G
And after:
# ls -l /zmedia/audio
==> 279G music
==> 213G shows
Many of those podcasts no longer exist, and are fun snapshots in time. My plan is to archive them to cold storage, either a few M-Disc Blu-rays, or an LTO cart. This post is already long enough, so I’ll save that discussion for another time.
As for the music, this is now a far more manageable pool that I can send/receive with OpenZFS between our home server and the box at my dad’s place.
The final task is to take the remaining music and make sure all the metadata is correct. There are a few tools that promise to do this, but I think it’ll have to mostly be done manually. That’s a task for another day :).