On Amarok database use.

Tue Oct 21, 2008 9:30 pm

Hello guys.
I must say that Amarok is the first music player that I really like. It's very cool. It makes windows jealous, lol. Using Amarok is like, "duh, why was I using something else before ?"

Anyway, I'd like to help the project a bit. Since I only use very low-level languages like C or C++ to write drivers or embedded applications, I won't propose to help with the code, but I can help with the database.

I was wondering why Amarok was taking forever to update my collection. The database is extremely small (*), about 80k records in the tables. I have about 10k music files, mostly FLAC that I ripped from my huge CD collection (thanks to Amarok I saved lots of wall space for CD shelves, lol).

So, why is mySQL taking 90% CPU during the collection update ? I had to download the source and look to find out. The answer was pretty obvious, as expected, and as usual : database noobiness from the developers.

This is not a shame, I mean, the UI design in Amarok shows there were some very smart people involved, and I mean it, good user interfaces are really a work of art. But the database sucks, really, hard, and that's why I'm writing this message.

I remember this website that was dying because of too many members. I talked to the webmaster of this website. He had just got a new quad Xeon to put his website on. I told him, your website should run on a stinky Celeron. After a few hours of SQL hacking the load average did drop from 20 to 0.1. So he got his Celeron, but he kept the donations page "for a new server" though, lol. He's still a database noob though.

Anyway. So, if I wanted to modify the Amarok code, I'd need lots of time to learn its structure. And the devs would have to waste their time reviewing my buggy patches. So I won't do that.

Instead I propose to teach someone in Amarok Team how to use a database properly (whoever wants to be that guy, come in).

I'll start on the collection building process. I see in the source code that there is a separate process that gathers information about the files (amarokcollectionscanner), which is good. I also see in the source code that for each file, several queries are being made, which is very bad.

Newbies tend to think of a database as a simple, dumb storage medium, like a hashtable for instance, where if you want to check something or change something, even just a row, you make a query. Therefore, noobs put all their logic in the application.

WRONG.

In an interpreted language like Python a hashtable access like this : storage[ key ] = value
takes much less than 1 microsecond. So you can use it liberally. A database query, on the other hand, is extremely slow. It can take up to 100 us to do a simple SELECT query. A slightly complex query can take one millisecond.

This is OK in a webpage that will make 5 queries totalling 1-2 milliseconds. It is not OK when you read 10.000 files from disk, and for each file, execute a few tens of queries.

So, here is how to make a fast collection scanner for Amarok.

Recursively scan directories. Filter on file extension, keeping only music file types. stat() files. Build an in-memory list of (file path, size, mtime). insert this list in a mysql temporary table (path, mtime, size). Of course, only issue an INSERT when you have something worth the hassle to insert, ie. at least 1000 rows.

CREATE INDEX on the path column of this temp table.

Let's suppose "files" is the table where the music files are permanently stored.

Now you can issue a SQL query which will process many rows efficiently.

SELECT FROM files LEFT JOIN temptable ON path WHERE temptable.path IS NULL gives you the deleted/moved files
SELECT FROM temptable LEFT JOIN files ON path WHERE (files.path IS NULL) gives you the new files
SELECT FROM temptable JOIN files ON path WHERE (files.mtime != temptable.mtime OR files.size != temptable.size) gives you the modified files

From those queries you can obtain a list of files that you are going to read the tags from. Read all the tags and insert them into a temp table, in batches. Then, issue a few SQL queries to update everything that needs to be updated (I can write those if needed).

I guess there is a at least a 10x speed improvement waiting to be found in the collection build/update process.

See ya

(*) database terminology for noobs :

- very small database = what the noob considers big
- small database = smaller than your RAM
- large database = larger than your RAM, ie. a few gigabytes
- very large database = a few tens of terabyres
- huge database = when you need an extra room for your harddisks

On Amarok database use.

Page 1 of 1 (2 posts)

On Amarok database use.

Re: On Amarok database use.

Bookmarks

Who is online