Duplicate mp3s in fma_full? #23

Open
ejhumphrey opened this issue Mar 20, 2018 · 3 comments

@ejhumphrey

After some digging, I'm reasonably confident that there are a fair number of files that have at least one exact duplicate in the fma_full zipfile. This came up when I was trouble-shooting some weird behavior, and noticed that the ID3 metadata associated with a track didn't match the CSV file of track metadata, but did match a different row.

Metadata matching is at best a wicked pain, so instead I took a look at which files match based on a hash of the bytestream:

import glob
import hashlib
import os

from joblib import Parallel, delayed


def hash_one(fname):
    # SHA-384 of the raw file bytes; identical hashes mean identical files.
    with open(fname, 'rb') as f:
        return hashlib.sha384(f.read()).hexdigest()


pool = Parallel(n_jobs=-2, verbose=20)  # n_jobs=-2: use all cores but one
dfx = delayed(hash_one)
fnames = glob.glob('fma_full/*/*mp3')
fhashes = pool(dfx(fn) for fn in fnames)  # takes approx 20min w/64 cores :oD

# Map each hash to the track IDs (basenames without extension) that share it.
groups = dict()
for fh, fn in zip(fhashes, fnames):
    if fh not in groups:
        groups[fh] = []
    groups[fh].append(os.path.splitext(os.path.basename(fn))[0])

This produces 105637 unique file hashes from 106574 files, with 105042 hashes pointing to a single file.
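
For anyone who wants to derive these counts from a mapping like `groups`, here is a minimal sketch. The function name `summarize_groups` and the toy hashes and track IDs are made up for illustration; none of them come from the dataset.

```python
from collections import Counter


def summarize_groups(groups):
    """Summarize a {hash: [track_id, ...]} mapping."""
    # How many hashes have 1 file, 2 files, 3 files, ...
    sizes = Counter(len(ids) for ids in groups.values())
    n_files = sum(size * count for size, count in sizes.items())
    n_hashes = len(groups)
    n_singletons = sizes.get(1, 0)
    return {
        'files': n_files,
        'unique_hashes': n_hashes,
        'singleton_hashes': n_singletons,
        'duplicated_hashes': n_hashes - n_singletons,
        'files_in_duplicate_groups': n_files - n_singletons,
    }


# Toy data (hypothetical hashes and track IDs, not taken from fma_full).
toy_groups = {
    'aa': ['000001'],
    'bb': ['000002', '000007'],
    'cc': ['000003', '000004', '000009'],
}
summary = summarize_groups(toy_groups)
```

With the counts reported above (106574 files, 105637 hashes, 105042 singletons), the last two entries would work out to 595 duplicated hashes covering 1532 files.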

I've reproduced this twice by decompressing the zipfile, so I'm pretty sure it's nothing I did. That said, I also downloaded the dataset a long time ago (last summer, maybe?), and I'm curious if it's been updated at all?

I'm curious what might have caused this, and wonder if the 105k tracks without duplicates map to accurate metadata in the raw_tracks.csv file? I haven't had a chance to check the ID3 tag coverage yet, but that should be an easy thing to look into.

For what it's worth, I also haven't looked at the smaller partitions, so I'm not sure if / how this might affect other uses of the dataset. Will follow up later if / when I learn more.

@ejhumphrey
Author

okay, tiny update: 56421 / 106574 (≈53%) do not have ID3 tags, so that's probably not a great avenue for sanity checking the track metadata CSV files.
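
The thread doesn't say how tag coverage was measured, so the following is only a sketch of one possible stdlib-only check. It assumes "has ID3 tags" means either an ID3v2 header at the start of the file (marker `ID3`) or an ID3v1 trailer in the last 128 bytes (marker `TAG`). The helper name `has_id3` and the synthetic files are hypothetical, and the check ignores rarer layouts such as an ID3v2 tag appended at the end of the file.

```python
import os
import tempfile


def has_id3(path):
    """Return True if the file has an ID3v2 header or an ID3v1 trailer."""
    with open(path, 'rb') as f:
        # ID3v2 tags start at byte 0 with the marker b'ID3'.
        if f.read(3) == b'ID3':
            return True
        # ID3v1 tags occupy the last 128 bytes and start with b'TAG'.
        if os.path.getsize(path) >= 128:
            f.seek(-128, os.SEEK_END)
            return f.read(3) == b'TAG'
    return False


# Demo on synthetic files (made-up contents, not real MP3s).
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        'v2.mp3': b'ID3' + b'\x00' * 200,
        'v1.mp3': b'\xff' * 200 + b'TAG' + b'\x00' * 125,
        'bare.mp3': b'\xff' * 200,
    }
    results = {}
    for name, payload in samples.items():
        path = os.path.join(tmp, name)
        with open(path, 'wb') as f:
            f.write(payload)
        results[name] = has_id3(path)
```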

@mdeff
Owner

mdeff commented Jul 20, 2020

Thanks for the investigation @ejhumphrey. If I understood correctly, it means that 105042 MP3s are unique, and that the remaining 106574-105042=1532 MP3s share 105637-105042=595 unique bytestreams.

Are we sure this method catches every exact duplicate? I guess tracks could have the same audio with different metadata, or even be encoded differently. Should we try to identify near duplicates as well?
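
One way to probe the "same audio, different metadata" case is to hash only the bytes between the tags. This is a rough sketch, not something anyone in this thread has run: it assumes metadata lives only in a leading ID3v2 tag (10-byte header with a synchsafe size field) and a trailing 128-byte ID3v1 tag, and the names `strip_id3` and `hash_audio` are hypothetical. It would not catch differently encoded copies, which would need audio fingerprinting rather than byte hashing.

```python
import hashlib


def synchsafe_to_int(b):
    """Decode a 4-byte synchsafe integer (7 significant bits per byte)."""
    return ((b[0] & 0x7F) << 21 | (b[1] & 0x7F) << 14
            | (b[2] & 0x7F) << 7 | (b[3] & 0x7F))


def strip_id3(data):
    """Return data without a leading ID3v2 tag and a trailing ID3v1 tag."""
    start, end = 0, len(data)
    if len(data) >= 10 and data[:3] == b'ID3':
        size = synchsafe_to_int(data[6:10])
        footer = 10 if data[5] & 0x10 else 0  # optional 10-byte footer flag
        start = min(10 + size + footer, end)
    if end - start >= 128 and data[end - 128:end - 125] == b'TAG':
        end -= 128
    return data[start:end]


def hash_audio(data):
    """SHA-384 of the bytes left after removing ID3 tags."""
    return hashlib.sha384(strip_id3(data)).hexdigest()


# Synthetic example: the same fake payload, with and without tags.
audio = b'\xff\xfb\x90\x00' + bytes(range(256))
id3v2 = b'ID3\x04\x00\x00' + bytes([0, 0, 0, 5]) + b'TITLE'
id3v1 = b'TAG' + b'\x00' * 125
tagged = id3v2 + audio + id3v1
same_audio = hash_audio(tagged) == hash_audio(audio)
```

Here `same_audio` is `True` even though a hash of the full bytestreams would differ.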

The dataset was last updated in May 2017 (updates are recorded in the "History" section of the README).

Duplicates mean there were duplicate uploads on https://freemusicarchive.org (maybe by the same artist in different albums?). The raw_tracks.csv file contains metadata acquired through the https://freemusicarchive.org API. The metadata do correspond to the track ID in the .mp3, but could be wrong (like the technical metadata identified in #4 (comment)).

For tracks where ID3 metadata didn't match the raw_tracks.csv file, do you know which is right? I can imagine artists editing the metadata on https://freemusicarchive.org while not updating the ID3 tags.

@mdeff
Owner

mdeff commented Jul 20, 2020

I've collected known issues (with workarounds and fixes) in #41 and the wiki.
