[mb-devel] WebService usage

Discussion:

Vidar Wahlberg

2006-07-27 07:35:03 UTC

Hi there, I got some questions regarding the WebService as I'm getting a
bit unexpected results.

Okay, so I got an ogg with lots of metadata, but I'm missing musicbrainz
id's and other handy metadata so I try to look it up using WebService.
For simplicity, let's say I got this metadata:
Artist : Europe
Release: 1982-1992
Track : Seven Doors Hotel

Now first I look up the artist running this query:
http://musicbrainz.org/ws/1/artist/?type=xml&limit=5&name=Europe
The result is exactly what I want it to be in a handy format, no
problems here.

Further I wish to lookup the release:
http://musicbrainz.org/ws/1/release/?type=xml&limit=5&title=1982-1992&artistid=ccfe7a3c-1a45-4984-a8d0-644549cefe61
Here's where the unexpected values show. I've submitted the id of the
artist and the title of the release, but yet I get results from other
artists (with a different artist id) who've released something with a
similar title.
Why is this?
Shouldn't the artist id I submit cause the result I get to only contain
releases from the given artist?

And then at last I want to look up the track:
http://musicbrainz.org/ws/1/track/?type=xml&limit=5&title=Seven%20Doors%20Hotel&artistid=ccfe7a3c-1a45-4984-a8d0-644549cefe61&releaseid=ea6be413-fbd7-49f2-8546-e162d481e857
Again a bit unexpected values. Even though I've submitted both artist id
and release id, I get results from completely different artists and
releases.

Now someone might ask, "Why can't you just look the track directly up?"
Well, the query would be something like this:
http://musicbrainz.org/ws/1/track/?type=xml&limit=5&title=Seven%20Doors%20Hotel&artist=Europe&release=1982-1992
Yes, that would give me a result with only 1 query, but then I won't get
metadata such as release type, asin, release date or artist sortname.
That's metadata I can't go without (especially not the sortname, I sort
my archive on that).
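The three search queries above follow one pattern, which can be sketched as a small URL builder. This is only an illustration: the helper name `search_url` is made up here, and the parameter names are simply the ones that appear in the URLs in this mail, not a complete list of what the WebService accepts.

```python
from urllib.parse import quote, urlencode

# Base URL exactly as it appears in the queries quoted in this thread.
WS_BASE = "http://musicbrainz.org/ws/1"


def search_url(resource, limit=5, **params):
    """Build a search URL for 'artist', 'release' or 'track' (hypothetical helper)."""
    query = {"type": "xml", "limit": limit}
    query.update(params)
    # quote_via=quote encodes spaces as %20, matching the URLs in the mail.
    return f"{WS_BASE}/{resource}/?{urlencode(query, quote_via=quote)}"


artist_query = search_url("artist", name="Europe")
track_query = search_url(
    "track", title="Seven Doors Hotel", artist="Europe", release="1982-1992"
)
print(artist_query)
print(track_query)
```

Nothing is sent over the network here; the strings can be handed to whatever HTTP client the tagging script already uses.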

The documentation of the WebService is somewhat short, and doing 3
queries to look up 1 track doesn't seem right. Maybe there's a more sane
way to look up a track with WebService?

Also, why is it so that you need to log in to do a track query while
it's not required to log in to do an artist or a release query?

--
Regards,
Vidar Wahlberg

Robert Kaye

2006-08-03 02:25:55 UTC

Post by Vidar Wahlberg
http://musicbrainz.org/ws/1/release/?type=xml&limit=5&title=1982-1992&artistid=ccfe7a3c-1a45-4984-a8d0-644549cefe61
Here's where the unexpected values show. I've submitted the id of the
artist and the title of the release, but yet I get results from other
artists (with a different artist id) who've released something with a
similar title.
Why is this?

This query is being handled by Lucene and not the database. Lucene
may return results that do not match that artist, but match other
criteria if not many search results are returned. I may be able to
restrict lucene to never return any hits that do not match a given
artist. Please file a bug report and assign it to me.

Post by Vidar Wahlberg
http://musicbrainz.org/ws/1/track/?type=xml&limit=5&title=Seven%20Doors%20Hotel&artistid=ccfe7a3c-1a45-4984-a8d0-644549cefe61&releaseid=ea6be413-fbd7-49f2-8546-e162d481e857
Again a bit unexpected values. Even though I've submitted both artist id
and release id, I get results from completely different artists and
releases.

I see tracks limited to only one artist. But I do see releases that
are not limited to the specified releaseid. I think this is the same
issue as above.

Post by Vidar Wahlberg
Now someone might ask, "Why can't you just look the track directly up?"
http://musicbrainz.org/ws/1/track/?type=xml&limit=5&title=Seven%20Doors%20Hotel&artist=Europe&release=1982-1992
Yes, that would give me a result with only 1 query, but then I won't get
metadata such as release type, asin, release date or artist sortname.
That's metadata I can't go without (especially not the sortname, I sort
my archive on that).

Again, you are doing a Lucene search and it doesn't make sense to
stuff everything into Lucene. When you find a match you like, take
the track id and then request the track resource separately. So, that
means doing two queries, not three.
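The second of those two queries, fetching a single resource by its id, can be sketched like this. The helper name `resource_url` is an assumption made for illustration; the URL shape and the `inc` values (`artist`, `tracks`) are the ones that show up in URLs elsewhere in this thread, and the track id below is copied from one of them.

```python
from urllib.parse import urlencode

WS_BASE = "http://musicbrainz.org/ws/1"


def resource_url(resource, mbid, inc=None):
    """URL for fetching one resource by id (hypothetical helper, not an official client)."""
    query = {"type": "xml"}
    if inc:
        # 'inc' asks the service to include related data in the response.
        query["inc"] = inc
    return f"{WS_BASE}/{resource}/{mbid}?{urlencode(query)}"


# Step 1 would be a track search; suppose the hit you liked had this id.
track_id = "342d9aae-3208-4e46-b63d-b610e96bc334"

# Step 2: request that track resource separately.
lookup = resource_url("track", track_id, inc="artist")
print(lookup)
```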

Post by Vidar Wahlberg
Also, why is it so that you need to log in to do a track query while
it's not required to log in to do an artist or a release query?

I don't need to log in to do this query:

http://musicbrainz.org/ws/1/track/7cf9251e-1b63-405f-a5ac-2512bb1d3cc2?type=xml&inc=tracks

Are you using GET or POST to talk to the service? If POST, it
requires authentication because the only POST action on track is
submitting a PUID, which requires authentication. Do a GET and all
should be well.
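One way to end up with a POST by accident, shown here with Python's standard library purely as an example (the thread does not say which client was used): `urllib.request.Request` switches from GET to POST as soon as a `data` argument is passed. The request objects below are only constructed, never sent.

```python
from urllib.request import Request

URL = "http://musicbrainz.org/ws/1/track/?type=xml&limit=5&title=Seven%20Doors%20Hotel"

# Parameters in the URL only: this is a GET.
get_request = Request(URL)

# Passing a body via 'data' makes urllib treat the request as a POST.
post_request = Request(URL, data=b"title=Seven+Doors+Hotel")

print(get_request.get_method())
print(post_request.get_method())
```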

--

--ruaok Somewhere in Texas a village is *still* missing its idiot.

Robert Kaye -- ***@eorbit.net -- http://mayhem-chaos.net

Vidar Wahlberg

2006-09-24 06:19:10 UTC

Post by Robert Kaye

I see tracks limited to only one artist. But I do see releases that
are not limited to the specified releaseid. I think this is the same
issue as above.

if you increase the limit from 5 to say 50, then you'll get different
artists as well.
for me this isn't a problem anymore, since i already know the id's i can
simply filter them out in my script. i don't know how this lucene works,
but maybe it is possible to increase the performance if lucene would
filter this out?
still want this feature/bug reported and assigned to you? :)
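The client-side filtering mentioned above could look roughly like this. It is a sketch under assumptions: the hits are taken to be already parsed into dicts, and the keys `artist_id` and `release_id` are hypothetical names, not fields defined by the WebService XML.

```python
def filter_hits(hits, artist_id=None, release_id=None):
    """Keep only hits that match the ids we already know (hypothetical dict keys)."""
    kept = []
    for hit in hits:
        if artist_id is not None and hit.get("artist_id") != artist_id:
            continue
        if release_id is not None and hit.get("release_id") != release_id:
            continue
        kept.append(hit)
    return kept


EUROPE = "ccfe7a3c-1a45-4984-a8d0-644549cefe61"
hits = [
    {"title": "Seven Doors Hotel", "artist_id": EUROPE},
    {"title": "Seven Doors Hotel", "artist_id": "some-other-artist-id"},
]
print(filter_hits(hits, artist_id=EUROPE))
```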

Post by Robert Kaye
Again, you are doing a Lucene search and it doesn't make sense to
stuff everything into Lucene. When you find a match you like, take
the track id and then request the track resource separately. So, that
means doing two queries, not three.

fair 'nuff, i figured out that looking up a release by submitting a
releaseid (which i can get from a track lookup) will give me all the
data i need, so i can fetch all the data i need with 2 queries.

further i have a question about lucene and the "score" you get from a
lookup:
i have a mp3 named "dj tiesto - honey.mp3", and i presume "dj tiesto" is
the artist and "honey" is the title of the song. so i try the following
lookup to see if i can find some more info about this song which lacks
a lot of metadata:
http://musicbrainz.org/ws/1/track/?type=xml&title=honey&artist=dj%20tiesto

the very first hit got a score of 100, and that track is called "Honey
Honey" and the artist is "DJ Lucky Johal".
that i don't get the result i'm looking for (which would be this:
http://musicbrainz.org/ws/1/track/342d9aae-3208-4e46-b63d-b610e96bc334?type=xml&inc=artist)
is quite okay, because the artist of that track isn't "dj tiesto", the
title isn't just "honey" and i've generally supplied very little
metadata.
what does it mean that the first hit got a score of 100?
or rather, what does "score" mean?

--
Regards,
Vidar Wahlberg

Robert Kaye

2006-09-26 00:32:44 UTC

Post by Vidar Wahlberg
what does it mean that the first hit got a score of 100?
or rather, what does "score" mean?

Here is more than you ever want to know about Lucene scoring:

http://lucene.apache.org/java/docs/scoring.html

--

--ruaok Somewhere in Texas a village is *still* missing its idiot.

Robert Kaye -- ***@eorbit.net -- http://mayhem-chaos.net

flabbergasted at gmx.de ()

2006-09-27 16:13:30 UTC

Hi,

a while ago I filed a bug about the missing type attribute in the
releases tag when doing XML queries. See

http://bugs.musicbrainz.org/ticket/2118

But it got assigned and then the owner got deleted and now nothing is
happening anymore.

I don't want to bug you too much but this seems to be an easy fix but
it's an extremely bad bug for me. At the moment, for each track I'm
looking for I have to do 7 queries to your server in the worst case
instead of 1.

So if you're working at it, I'll be patiently waiting. I just want to
know if there's something wrong with the filed bug or anything else that
keeps you from working at it.

Thanks again!

Tobias

Steve Wyles

2006-09-27 16:34:38 UTC

Post by flabbergasted at gmx.de ()
a while ago I filed a bug about the missing type attribute in the
releases tag when doing XML queries. See
http://bugs.musicbrainz.org/ticket/2118
But it got assigned and then the owner got deleted and now nothing is
happening anymore.

Don't worry about it not having an owner, it doesn't mean that it won't
get looked at.

Post by flabbergasted at gmx.de ()
So if you're working at it, I'll be patiently waiting. I just want to
know if there's something wrong with the filed bug or anything else that
keeps you from working at it.

The development team is currently comprised of unpaid volunteers and as
the amount of time they can dedicate to the project is variable, they have
to prioritise the bugs according to severity etc.

You also need to understand a little of the way the development/bug
fixing process works.

Generally, unless the bug is critical, such as causing db corruption or
impacting a large number of users, the fix won't be applied directly to
the live server. Instead, it will be worked on and incorporated into a
future server code release.

The time to resolution can be reduced by assisting the development team,
for instance if you understand the bug and know what is causing it,
patches are greatly appreciated.

Steve

Robert Kaye

2006-09-27 20:31:18 UTC

Post by flabbergasted at gmx.de ()
Hi,
a while ago I filed a bug about the missing type attribute in the
releases tag when doing XML queries. See
http://bugs.musicbrainz.org/ticket/2118

I'm planning on fixing this bug with the next release, which I
officially started on yesterday.

Sit tight.

--

--ruaok Somewhere in Texas a village is *still* missing its idiot.

Robert Kaye -- ***@eorbit.net -- http://mayhem-chaos.net

Vidar Wahlberg

2006-09-29 20:24:04 UTC

shamelessly continuing this thread as this is related to the issue.
sorry 'bout the long mail.

last few days i didn't have anything useful to do, so i decided to look
a bit on this lucene thingy and see if i could learn something new.
and well, i'd claim i did.

when i tag my gross amounts of music files i don't want to sit there and
tag them more or less manually. i want to start a program, let it run
for as long as it would like and when i come back i want all my files to
be properly tagged. currently i'm having slight difficulties with
getting this to work as well as i wish it would.
in my perfect world, i send all the metadata i got about a song to a
server and the server will tell me exactly which song i got. of course,
this is impossible, but i believe it's possible to come quite close
without too much fuss.

so, i downloaded lucene (the java one), spent some hours trying to get a
fair understanding of how the thing works and went on to build myself an
index. the index i'm stuck with now is not tuned a lot, it's indexing
stuff i don't even search for and on top of that, it's 2.4gb and it took
more than 7 hours to build.
further on i made myself a simple program to search the index and fed
this program with 976 filenames (stripped the extension) to see if the
results it spat out would somehow match the filenames.

before running the search i thought to myself "no chance in hell this is
gonna be fast with a 2.4gb index", but quite surprisingly i was very
wrong.
i could do 10 queries per second on average with the filenames i fed it,
and just as important the metadata i was looking for was very frequently
returned within the first 10 hits.

so what's my philosophy about this?
put all unique metadata (albumartist, album, tracknum, track,
trackartist, ...) in a single field in a document.
simply search that field with all the metadata you got on a song.
it is fast, you don't have to worry about some moron putting both artist
and track in the track-tag (seen it lots of times in id3-tags) and you
actually don't have to do multiple queries to try different combinations
of filenames ("artist - album - track", "artist - track", "track -
artist", ...).
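To make the single-field idea concrete, here is a toy in-memory stand-in. It is not Lucene and does not use Lucene's scoring; it just counts shared words, and the class and method names are invented for this sketch. The point it illustrates is that the order of artist, album and track in the query stops mattering once everything lives in one field.

```python
import re
from collections import defaultdict


def tokens(text):
    """Lower-case word tokens; punctuation such as ' - ' is dropped."""
    return re.findall(r"\w+", text.lower())


class SingleFieldIndex:
    """Toy index: every metadata value of a track goes into one bag of words."""

    def __init__(self):
        self.docs = []
        self.postings = defaultdict(set)  # token -> ids of docs containing it

    def add(self, **metadata):
        doc_id = len(self.docs)
        self.docs.append(metadata)
        combined = " ".join(str(value) for value in metadata.values())
        for tok in set(tokens(combined)):
            self.postings[tok].add(doc_id)
        return doc_id

    def search(self, query, limit=50):
        # Score = number of distinct query words found in the document.
        scores = defaultdict(int)
        for tok in set(tokens(query)):
            for doc_id in self.postings.get(tok, ()):
                scores[doc_id] += 1
        ranked = sorted(scores, key=lambda d: (-scores[d], d))
        return [self.docs[d] for d in ranked[:limit]]


index = SingleFieldIndex()
index.add(artist="Europe", album="1982-1992", track="Seven Doors Hotel")
index.add(artist="Europe", album="The Final Countdown", track="The Final Countdown")

# Both filename layouts find the same track first.
print(index.search("europe - seven doors hotel")[0])
print(index.search("seven doors hotel - europe")[0])
```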

but enough fuss. i came up with the idea of putting all the metadata in
a single field and just searching that last night, and the results i get
are very good. if i set up some server which spits out the 50 best
matches i can make the client do a more thorough comparison of the
metadata i got from the server and the metadata i got from the file
(thus making the client do the hard work).

the code i've made can be found here: http://exent.net/~canidae/lousy/
it's only testing and nothing else, so yes it's messy, don't bother me
about the ugly code :p
you'll also find two text files there, "songs.txt" and "results.txt".
"songs.txt" is the filename of the songs i matched, only stripped of
their extension.
"results.txt" contains up to 10 hits from the search, but the "score"
given doesn't reflect how likely the hit is correct. rather think of it
how well the words matched, scoring must be done client side.
keep in mind that there are lots of norwegian titles, using our unusual
characters like "æ", "ø" and "å" which apparently were stored in
something else than utf-8, causing the characters to be displayed wrong.
this does affect the search as "sj??l" is not "sj?l". i've also _only_
searched using filenames and not taken anything from the tags, but the
idea is to just add the tags to the string you search for.
several of the songs can't be found in the database either, so they're
bound not to give a decent result (especially those songs with
"barne-tv" ("childrens tv" or something translated) in their title).

anyways, i'd like some thoughts about this.
i would love to see a similar feature over at musicbrainz.org where i
can just supply as much metadata as i got and get 50 or so results back
which i then process further client side.
in fact, i'd like to help creating something like this as the music
archive i'm supposed to maintain gets new tracks way faster than i can
tag them manually, so i need something automated with high precision.
although, python is not my favourite language, so if someone who's
played with this pylucene could pinpoint some docs and files i should
pay attention to, that would save me a lot of hassle.

just for reference, here are the specs of the computer i tested on:
ibm r40 type 2681 (laptop)
2.0ghz pentium 4
512mb ram
using java 1.5

--
Regards,
Vidar Wahlberg

Robert Kaye

2006-09-29 21:55:42 UTC

Post by Vidar Wahlberg
anyways, i'd like some thoughts about this.

This project does essentially what you've started:

http://bugs.musicbrainz.org/browser/pimpmytunes

However, it's working OK, but needs more tuning and a lot more bug
fixing. And it's written in Python...

But as far as your approach is concerned, big thumbs up. This is a
great way of doing mass tagging.

--

--ruaok Somewhere in Texas a village is *still* missing its idiot.

Robert Kaye -- ***@eorbit.net -- http://mayhem-chaos.net

Vidar Wahlberg

2006-10-17 01:36:28 UTC

Post by Robert Kaye
http://bugs.musicbrainz.org/browser/pimpmytunes
However, it's working OK, but needs more tuning and a lot more bug
fixing. And it's written in Python...
But as far as your approach is concerned, big thumbs up. This is a
great way of doing mass tagging.

prepare for a long mail, this is gonna take some time to write :)
keep in mind that it probably took me 42 times as long to write it than
for you to read it :)

first i'd like to send a big thanks to Robert & Lukáš for their much
appreciated help, i'd be bald by now if it wasn't for their help with
python :)

this mail is mostly intended as an example on a different approach of
tagging music. if you intend on using this software to actually tag your
music then you do so on your own risk :)

okay, a bit too long summary of my idea:
as i've mentioned before, i want a simple, automated way of tagging as
much as possible of my music without me lifting a finger.
first i tried using the existing webservice, but i never managed to get
the tracks i wanted and more often than not i had to do several requests
to the webserver to receive all the data i needed. i felt it was
unnecessary to do more than 1 request per track so i decided to look a
bit on this lucene and see if i could come up with an idea to reduce the
amount of requests.
and i came up with this simple idea: put all metadata in 1 field in a
document. then the user submits filename and metadata for a track and
the webservice simply search that single field, not caring about what
text is the artist, what text is the track and so on. and well, it
worked surprisingly well. very often the song i was looking for was the
first hit, and almost always the song was within the first 30 songs
returned. and the great thing about this: the lucene search is damn
fast, virtually the only load you get is when you return the result to
the user.
so now i got a webservice which causes ~no load on the server and gives
me exactly what i'm looking for: a list of the songs i'm most likely
looking for. this was great, now it was up to the client to determine
which of the returned songs is the right one, thus putting all the load
on the client rather than the server.
frankly i'd love to stop here, show it to you and claim "this is the
real deal, i tell you!", but something told me i'd never convince a lot
of people if i didn't actually give you some "proof" that this is a
"sane" solution. well, the level of sanity can be questioned i guess,
but to move on:
so with the help of a friend, Thomas Adamcik, we continued the work on a
perl tagger we made about a year ago. we got "decent" success with that
tagger back then, but we still had a lot of untagged songs. this perl
tagger was easy to modify to do requests to our new webservice instead
of using the musicbrainz library, but our matching "algorithm" we pretty
much had to write all over. i honestly thought writing this matching
"algorithm" would be a piece of cake, well, i was wrong. even so, i
decided to push on until i got something working, i'd come way too far
to give up. i certainly don't regret spending several days on this.

right, let's move on to how it works.

the webservice, called "lousy":
why did i call it "lousy"? that doesn't sound promising...
i'm not sure why, it just popped into my head, think it's "lucene" &
"lossy" combined or something, that along with the frustration of coding
in python.
"lousy" is in fact a very simple piece of python code. all it does is
receive data from the user, search for the data in the lucene index and
return the x (1 to 100, default 50) best tracks to the user.
i did cheat a bit and made the index using a small java program instead
of a python script, but that doesn't matter. it shouldn't be a problem
making the index with a python script.
i've set up this webservice and you can access it here:
http://mb.samfundet.no/
go there and you'll get a quick intro on how it works.
further you can get the code i've used for lousy here:
http://home.samfundet.no/~canidae/lousy/
this is a bzr archive so you can branch it (bzr branch
http://home.samfundet.no/~canidae/lousy/).

the client side tagger script:
keep in mind that Thomas has primarily worked on this, i don't know it
thoroughly, but enough to get you started.
this script is more advanced than lousy. it reads metadata from
mp3/ogg files, sends the metadata along with the filename to lousy and
matches the result returned against the filename/metadata.
instead of babbling any more about it i'll tell you where to fetch it:
http://home.samfundet.no/~canidae/tagger/
it's a bzr archive: bzr branch http://home.samfundet.no/~canidae/tagger/
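To give a feel for the client-side matching step, here is a rough sketch in Python. It is not how tagger's match.pm works (tagger is Perl and lists String::Approx among its modules); Python's `difflib` is swapped in here, and the candidate dict keys, the "artist - title" label and the 0.8 threshold are all assumptions picked for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(filename, candidates, threshold=0.8):
    """Pick the candidate whose 'artist - title' is closest to the filename.

    Returns None when nothing is similar enough, so the file stays untagged
    rather than being tagged wrong.
    """
    best, best_score = None, 0.0
    for cand in candidates:
        label = f"{cand['artist']} - {cand['title']}"
        score = similarity(filename, label)
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else None


candidates = [
    {"artist": "The Beach Boys", "title": "Misirlou"},
    {"artist": "The Beach Boys", "title": "Surfin' U.S.A."},
]
print(best_match("Beach Boys - Surfin' Usa", candidates))
```

A real matcher would also weigh tags, track length and album, which is where most of the work goes.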

how to use this stuff:

lousy:
when i coded this i did at some point want to code it fairly similar to
the existing webservice, but it didn't turn out as nice as i wanted it.
for now you don't need to worry about setting this up, feel free to use
mb.samfundet.no, although keep in mind that i may decide to remove it at
any time, it's only there for testing purposes.

tagger:
this one requires some modules, a quick grep:
Class::Accessor
Data::Dumper
Data::Dumper::Simple
File::Basename
File::Copy
File::Find::Rule
File::Spec
MP3::Info
MP3::Tag
Ogg::Vorbis::Header
String::Approx
XML::Smart

frankly, this module thingy is not my table, but i've not had trouble
getting the script to work without doing any "ugly hacks" (iirc all of
these are in the debian repository (sarge)).
Thomas does however suggest getting a newer version of
MP3::Info/MP3::Tag than the ones in the debian archive. The case may be
the same for Ogg::Vorbis::Header, i know for sure that we've had some
issues saving ogg tags (see patch/libogg-vorbis-header-perl.patch).

ok, let's hope that's settled and move on to how to tag:
1. cd tagger
2. ./bin/fresh.pl -v <path to untagged files>
3. watch

don't worry, this is a "dry run", your files won't be tagged and moved,
it just shows what it would do. fresh.pl is made for testing, we're
working on making it better.
if you add "--save <path>" it should tag, rename and move your files,
but i do _not_ guarantee that it will work (if you got ogg files then
make sure you read the patch for Ogg::Vorbis::Header) so i recommend you
don't do this on files you value.

results from tests i've done:
i've primarily used this tagger on a set of 986 files, all mp3's iirc.
the files got _horrible_ metadata (if any at all), limited data in the
filename and on top of that, the filenames have been encoded between
utf-8/mac/latin1 and who knows what else.
in other words, this selection is about the worst selection you can come
across.
how long did it take to check 986 songs?
roughly 36 mins on a dual amd mp 2200+ running both lousy and tagger.
how many songs did "tagger" recognize?
532, or roughly 54%.
how many of these songs were tagged wrong?
4:
1. Alf Proysen - Tango for to.mp3
was recognized as "Alf Prøysen - Tango for TV".
reason: "Alf Prøysen - Tango for to" doesn't exist in my db, and
the titles are strikingly similar.
2. Andrew Lloyd Webber - Phantom Of The Opera.mp3
was recognized as "Andrew Lloyd Webber - Overture".
reason: this is actually not a wrong match, the filename is wrong
3. Beach Boys - Surfin' Usa.mp3
was recognized as "The Beach Boys - Misirlou".
reason: the mp3 is cut off, 30 secs are missing, which happen to
match another song on the album "Surfin' U.S.A."
4. vestlandsfanden - For Livets Glade Gutter.mp3
was recognized as "Vestlandsfanden - Alvefolket".
reason: there are 3 tracks in the db that should match, however
their track length don't match (> +/-5 secs).
on the other hand they also got an album named the same as
the track, which confuses the current matching algorithm.

how many songs were "partially" tagged wrong?
to explain what i mean with "partially tagged wrong":
consider you got a mp3 named "europe - the final countdown.mp3" with no
metadata.
how many times has this track been released?
how on earth can you possibly know which album this mp3 comes from?
you can't, not even humans can. still, i want it tagged, because the mp3
is perfectly fine (well, except for the fact that it's a mp3), i don't
care which album "tagger" thinks it's connected with, as long as it's
able to recognize the song.
and currently, tagger does. it will tag the song, but it's highly random
which album it gets connected to. this is what i mean with "partially
tagged wrong".
if you only got a single mp3 this usually won't affect you much, but
let's say i got an entire album of europe, where all tracks got no
metadata and simply are named "<artist> - <track>.mp3" (the agony!).
that's bound to make the songs get tagged on different albums, which
obviously sucks.
still, in my view, this is better than nothing.
since the songs i'm tagging come from someone else (who clearly doesn't
love his/her songs as much as we do) i don't know which album these
songs come from, and it's impossible for me to determine how many of
them got connected to the wrong album.
if both trackname and albumname are given, then it's a lot less common
that this happens.

results from a test with more sane tags (and more files):
due to the huge collection of songs, i've not checked the entire log, it
would take days.
songs: 7986
tagged: 6989 (87%)

don't know how many are tagged [partially] wrong, but a very brief
search did not give me the impression that it's any worse than the small
test (that is, i didn't find a single one tagged wrong, but i didn't
look very well either).

for your amusement i've put out the logs from these two tests:
http://home.samfundet.no/~canidae/scan.log (986 songs, 561k)
http://home.samfundet.no/~canidae/scan2.log (7986 songs, 4.7m)

todo (yes, i'm soon done with this mail):
lousy can be improved. since there's a minor bug with pylucene (or
python or whatever it is) i can't access files greater than 2g. the
first index i built was 2.4g, but instead of indexing every field in the
documents i just indexed the field i search in and managed to push it
just below 2g.
lousy could be improved by:
- not returning hits that don't match the given tracknum (if any)
- only return tracks with a length +/-10 secs from the given length
- use mod_python instead of cgi so it won't open the index each search
- <add your suggestion here>
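The first two suggestions in the list above amount to a simple pre-filter on the hits. A sketch of that check, with assumed names: hits are plain dicts, the keys `tracknum` and `length` (in seconds) are hypothetical, and this is not code from lousy itself.

```python
def plausible(hit, tracknum=None, length=None, tolerance=10):
    """True if the hit agrees with the track number and length we were given.

    Anything the client did not supply (None) is not checked.
    """
    if tracknum is not None and hit.get("tracknum") != tracknum:
        return False
    if length is not None and abs(hit["length"] - length) > tolerance:
        return False
    return True


hits = [
    {"title": "Alvefolket", "tracknum": 3, "length": 215},
    {"title": "Alvefolket", "tracknum": 3, "length": 260},
    {"title": "Alvefolket", "tracknum": 7, "length": 214},
]

# File says: track 3, about 212 seconds long.
kept = [hit for hit in hits if plausible(hit, tracknum=3, length=212)]
print(kept)
```

Doing this on the server would shrink the result list before it is sent; doing it on the client keeps the server trivial. Either way the check itself is the same.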

tagger could be improved by:
- making a more sane "match.pm"
- making a decent interface
- <your suggestion right here>

phew, my head is about toast now, so i'll stop here. i hope you get the
general idea and take the time to look at this despite the clumsy setup.
i wanted to present something for you a couple weeks ago, but it turned
out this was much harder than i anticipated.
since this "documentation" is very rough, not very well formatted and
probably not very helpful to many then do feel very free to ask about
stuff that's left unclear.
do feel even more free to play around with both lousy/tagger and improve
them, just send patches :)
i would however prefer if you mail to this list instead of directly to
me (unless the list admins disagree) as it may be someone else with a
similar request.
as some of you know i'm usually around on irc, so you can give me a
hilight there as well :)

right, sorry 'bout the long mail. if it helps, it hurts me more than it
hurts you =)

--
Regards,
Vidar Wahlberg

Vidar Wahlberg

2006-10-17 01:53:02 UTC

Post by Vidar Wahlberg
- making a more sane "match.pm"
- making a decent interface
- <your suggestion right here>

one more important note on tagger i forgot to mention:
Thomas intends to make this a CPAN module and maintain it, but he
agreed to let me give it to you. so keep that in mind if you plan to
work on it.

--
Regards,
Vidar Wahlberg

Robert Kaye

2006-10-25 00:00:40 UTC

Post by Vidar Wahlberg
so now i got a webservice which causes ~no load on the server and gives
me exactly what i'm looking for: a list of the songs i'm most likely
looking for. this was great, now it was up to the client to determine
which of the returned songs is the right one, thus putting all the load
on the client rather than the server.
frankly i'd love to stop here, show it to you and claim "this is the
real deal, i tell you!", but something told me i'd never convince a lot
of people if i didn't actually give you some "proof" that this is a
"sane" solution. well, the level of sanity can be questioned i guess,

I really like what you are doing! This is/was the inspiration behind
me writing pimpmytunes, which I haven't gotten around to finishing
yet. I firmly believe that lucene is an awesome tool for mass tagging
tracks that have a reasonable amount of metadata.

I would very much like to encourage you to keep hacking on this and
bring it to a state where people can download it easily without
having to follow a lot of instructions. That's always the hard part
(and exactly where I left off working on pimpmytunes).

Press on!!

--

--ruaok Somewhere in Texas a village is *still* missing its idiot.

Robert Kaye -- ***@eorbit.net -- http://mayhem-chaos.net