Machine learning algorithm for data classification
Posted by twk on Stack Overflow
Published on 2010-06-03T15:49:33Z
Indexed on 2010/06/05 17:22 UTC
machine-learning | classification
Hi all,
I'm looking for some guidance about which techniques/algorithms I should research to solve the following problem. I currently have an algorithm that clusters similar-sounding mp3s using acoustic fingerprinting. Each cluster contains the metadata (song/artist/album) from every file in it. For each cluster, I'd like to pick the "best" song/artist/album metadata and match it to an existing row in my database or, if there is no good match, insert a new row.
For a cluster, there is generally some correct metadata, but individual files have many types of problems:
- the artist/song is completely misnamed, or just slightly misspelled
- the artist/song/album is missing, but the rest of the information is there
- the song is actually a live recording, but only some of the files in the cluster are labeled as such.
- there may be very little metadata, in some cases just the file name, which might be artist - song.mp3, or artist - album - song.mp3, or another variation
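To make the last case concrete, here is a minimal sketch of how the file-name variations could be split into fields and normalized before comparison. The function names (`normalize`, `parse_filename`) and the " - " separator are assumptions for illustration only; real file names will have more variations than these two patterns.

```python
import re
import unicodedata

def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def parse_filename(filename):
    """Guess fields from 'artist - song.mp3' or 'artist - album - song.mp3'.

    Hypothetical helper: anything that does not fit either pattern is
    returned as a bare song title with the other fields left empty.
    """
    stem = re.sub(r"\.mp3$", "", filename, flags=re.IGNORECASE)
    parts = [p.strip() for p in stem.split(" - ")]
    if len(parts) == 2:
        return {"artist": parts[0], "album": None, "song": parts[1]}
    if len(parts) == 3:
        return {"artist": parts[0], "album": parts[1], "song": parts[2]}
    return {"artist": None, "album": None, "song": stem.strip()}
```

Normalizing before comparison means that differences in case, accents, or punctuation alone do not split otherwise identical metadata into separate candidates.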
A simple voting algorithm works fairly well, but I'd like to have something I can train on a large set of data that might pick up more nuances than what I've got right now. Any links to papers or similar projects would be greatly appreciated.
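For reference, the kind of voting baseline I mean looks roughly like the sketch below. It is not my actual code: the function names and both thresholds are made up for illustration, and string similarity is taken from the standard library's `difflib.SequenceMatcher`. Near-identical spellings are grouped, the largest group wins, and the winner is then either matched to an existing database value or flagged for insertion.

```python
from collections import Counter
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def vote(values, threshold=0.85):
    """Group near-identical spellings and return the most common
    spelling from the largest group (threshold is an assumed value)."""
    groups = []  # each group is a list of raw values
    for value in values:
        if not value:
            continue  # skip missing metadata
        for group in groups:
            if similarity(value, group[0]) >= threshold:
                group.append(value)
                break
        else:
            groups.append([value])
    if not groups:
        return None
    best = max(groups, key=len)
    return Counter(best).most_common(1)[0][0]

def match_or_insert(candidate, known_values, threshold=0.9):
    """Return ('match', existing_value) or ('insert', candidate)."""
    if known_values:
        best = max(known_values, key=lambda v: similarity(candidate, v))
        if similarity(candidate, best) >= threshold:
            return ("match", best)
    return ("insert", candidate)
```

The weakness is that every file gets an equal vote and both thresholds are hand-tuned, which is exactly the part I would like to replace with something trained on data.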
Thanks!
© Stack Overflow or respective owner