This forum has been archived. All content is frozen. Please use KDE Discuss instead.

PostgreSQL database design

Tags: None
(comma "," separated)
Jean Pégésql (PostgreSQL fan)
Karma
0

PostgreSQL database design

Thu Mar 30, 2006 9:23 am
Dear all,

I discovered Amarok recently. With Amarok, I can listen to my CDs again. Thanks for this great software.

Now approximatively 100 CDs have been ripped in ogg, it seems that Amarok is becoming slow. I suspect a database problem. I use PostgreSQL 8.1 as database server and recommand pgadmin3 graphical interface (part of Debian: apt-get install pgadmin3).

Here are some remarks.
I hope that we can discuss these issues and go further.

1) Tables without OIDs must carry a PRIMARY KEY

Tables WITHOUT OIDs should always carry a PRIMARY KEY.

Why ? Simply because PostgreSQL includes a powerfull query planner, which tries to find the best solutions to execute queries.

Without primary keys, the query planner is blind. And large joins cannot be optimized automatically by PostgreSQL.

As a result, Amarok PostgreSQL database is likely to be slow, especially on tables \"tags\" and \"directories\". Dammed !!!

Solution : either re-create all tables WITH OIDS (but it may be confusing for existing users to migrate their data) or create a primary key on each table.

Second solution is recommended.

Let\'s go for it:

2) Adding Primary keys

Now that we assume that all tables shoud carry a primary key, one remark : a primary key is considered by PostgreSQL planner as a unique index. Therefore, there is NO NEED to create a unique index when a primary key already exist.

It results in the following commands:

-- table tags
-- We are creating a primary key and dropping the uneccessary stuff.
DROP INDEX url_tag;
ALTER TABLE tags
ADD CONSTRAINT tags_pkey PRIMARY KEY(url);


-- table stats
-- We are creating a real primary key and dropping the uneccessary stuff.
DROP INDEX url_stats;
ALTER TABLE \"statistics\" DROP CONSTRAINT statistics_url_key;
ALTER TABLE \"statistics\" ADD CONSTRAINT statistics_pkey PRIMARY KEY(url);


-- table directories
-- We are creating a real primary key and dropping the uneccessary stuff.
DROP INDEX directories_dir;
ALTER TABLE directories DROP CONSTRAINT directories_dir_key;
ALTER TABLE directories
ADD CONSTRAINT directories_pkey PRIMARY KEY(dir);

-- table amazon
-- We are creating a real primary key on filename and adding an index on asin.
ALTER TABLE amazon
ADD CONSTRAINT amazon_pkey PRIMARY KEY(filename);
CREATE INDEX amazon_asin_idx
ON amazon
USING btree
(asin);

3) Duplicate entires in the table \"Image\"

The table image has duplicate entries. It is a problem for assigning a primary key.

For example, when a CD includes two albums, and one cover has been saved in a directory, the cover manager creates two entries in the Image table. There are several solutions to assign this problem:

Solution one: create an id and use a sequence to pullulate it. The SQL is :

-- create a sequence
CREATE SEQUENCE images_seq
INCREMENT 1
MINVALUE 1
MAXVALUE 9223372036854775807
START 110
CACHE 1;

-- adding an id and a primary key
ALTER TABLE album ADD COLUMN id int4;
ALTER TABLE album ALTER COLUMN id SET STORAGE PLAIN;
ALTER TABLE album ALTER COLUMN id SET NOT NULL;
ALTER TABLE album ALTER COLUMN id SET DEFAULT nextval(\'album_seq\'::regclass);
ALTER TABLE images
ADD CONSTRAINT image_pkey PRIMARY KEY(id);

This solution is the most trivial and can be applied easily.

Solution two : Find duplicate entries in the table image, based on \"path\", remove these entries and then create a primary key on \"path\". If you are interested in this clean solution, I can provide an SQL script. But it will destroy some SQL entries in the table image, therefore I do not know if you are interested.

4) Date and time in PostgreSQL

Several date fields in the PostgreSQL database are coded \"à la MySQL way\", which is completely non standard and does not comply with SQL99.

In PostgreSQL, all date fields should be \"abstime\" type and not \"int4\". Whenever int4 are needed by Amarok, we should cast the values to int4 using PostgreSQL build in casts.

Example: Let us suppose that the table tags includes one field \"createdate\" of type \"abstime\".

-- example of SELECT statement
SELECT createdate::int4 FROM tags WHERE clause blablabal;

-- example of UPDATE statement
UPDATE tags SET (createdate) VALUES (3746362849957326 as abstime);

If you are interested, I can provide an SQL script for migrating all table values from int4 to abstime. But you will need to cast values in Amarok source code. Tell me if you are interested.

5) Using views

Using views is very important for speed, because postgreSQL query planner analyses each view and remembers the best query plans.

Whenever you run a large join directly (not using views), the query analyser will not remember anything. This is why using views is always recommended.

But for the query planner to work properly, we need primary keys everywhere. Usually, the speed of an application can be multiplied by 10 or more using a nice set of views.

I wait for primary keys and will then provide a set of views.

Kind regards,
Jean
Jean Pégésql
Karma
0

Re:PostgreSQL database design

Thu Mar 30, 2006 10:50 am
Dear friends,

I turned SQL debugging on in PostgreSQL to analyse various queries. I copy and paste queries in pgAdmin3 graphical interface. it includes a graphical explain tool. A screen shot :

[img size=150]http://www.pgadmin.org/images/screenshots/pgadmin3_explain.png[/img]

Here are my first remarks:

Query ONE:

SELECT DISTINCT tags.url,
LOWER( genre.name) as __discard,
LOWER( artist.name) as __discard ,
LOWER( album.name) as __discard ,
LOWER( tags.discnumber) as __discard ,
tags.track as __discard,
LOWER( tags.url) as __discard
FROM tags INNER JOIN album ON album.id=tags.album
INNER JOIN artist ON artist.id=tags.artist
INNER JOIN genre ON genre.id=tags.genre
WHERE \'t\' AND
( \'f\' OR genre.name ILIKE \'Chanson française\' ESCAPE \'/\' )
AND ( \'f\' OR artist.name ILIKE \'Serge Gainsbourg\' ESCAPE \'/\' )
ORDER BY LOWER( genre.name ) ,LOWER( artist.name ) ,LOWER( album.name ) ,LOWER( tags.discnumber ) ,tags.track,LOWER( tags.url ) ;

i would be in favour of writing a small trigger to write lower values permanently in PostgreSQL. It will not be a real overhead in database size (and who cares, there are 100Gb disks).

Also, please note that this query should be a LEFT JOIN, starting with table tags. The query planner does not know how to optimize INNER JOINS very well, because it does not know what is the main table (here: \"tags\"). This does not happen with a LEFT JOIN:

SELECT DISTINCT tags.url,
LOWER( genre.name) as __discard,
LOWER( artist.name) as __discard ,
LOWER( album.name) as __discard ,
LOWER( tags.discnumber) as __discard ,
tags.track as __discard,
LOWER( tags.url) as __discard
FROM tags LEFT JOIN album ON album.id=tags.album
LEFT JOIN artist ON artist.id=tags.artist
LEFT JOIN genre ON genre.id=tags.genre
WHERE \'t\' AND
( \'f\' OR genre.name ILIKE \'Chanson française\' ESCAPE \'/\' )
AND ( \'f\' OR artist.name ILIKE \'Serge Gainsbourg\' ESCAPE \'/\' )
ORDER BY LOWER( genre.name ) ,LOWER( artist.name ) ,LOWER( album.name ) ,LOWER( tags.discnumber ) ,tags.track,LOWER( tags.url ) ;

In this case, the nested loop only happens in the end, which speeds up everything. Also, if the URL is the primary key like proposed, there is no need to query unique results, as it will always be unique. Also a real gain.

Query Two;

SELECT album.name, artist.name, genre.name, tags.title, year.name, tags.comment, tags.discnumber, tags.composer, tags.track, tags.bitrate, tags.length, tags.samplerate, tags.filesize, tags.filetype
FROM tags, album, artist, genre, year
WHERE
album.id = tags.album
AND artist.id = tags.artist
AND genre.id = tags.genre
AND year.id = tags.year
AND tags.url = \'/media/musique/G/Serge Gainsbourg/1967 - Anna/01 - Sous le soleil exactement (or .mp3\';

These queries happen when using drag and drop for an album. It is a real flaw in design, as ONE query is issued for EACH track. On the converse, there should be only ONE query for all tracks.

Also, please note that the lack of primary keys lead to complex queries:

AND artist.id = tags.artist
AND genre.id = tags.genre
AND year.id = tags.year
AND tags.url = \'/media/musique/G/Serge Gainsbourg/1967 - Anna/01 - Sous le soleil exactement (or .mp3\';

will only make sure that the tag is unique. This query is really stupid.

Also, there should be only ONE large query for each drag&drop.

Maybe we should establish a meeting on IRQ to discuss these database issues. I recommend that Amarok developer install and use PostgreSQL 8.1 + pgAdmin3. It will allow them to debug queries graphicaly. It can speed Amarok by 100, and will be easier to implement that using complex memory debugging tools.

Anyway, optimizing queris is really needed.
Can we establish a meeting on IRQ ?
Jean pégésql
Karma
0

Re:PostgreSQL database design

Thu Mar 30, 2006 11:27 am
Query Three:

SELECT tags.title,tags.url,tags.track,year.name,tags.length,tags.discnumber,LOWER( tags.discnumber) as __discard ,tags.track as __discard
FROM tags
INNER JOIN year ON year.id=tags.year
WHERE \'t\' AND ( \'f\' OR tags.album ILIKE \'3807\' ESCAPE \'/\' )
AND ( \'f\' OR tags.artist ILIKE \'2422\' ESCAPE \'/\' )
AND tags.sampler = \'f\'
ORDER BY LOWER( tags.discnumber ) ,tags.track;

Same remarks:
- Using a LEFT JOIN is recommended,
- Do not use ILIKE for int4 values,
- Do not sort by LOWER(), because it will create unecessary disc access instead of using a hash join. Lower values should be stored server-side, once for all. I will provide the needed triggers.
- tags.discnumber is an int4. No need to lower it.
- In the end, this kind of JOIN should be transformed into a VIEW, so that the query planner understands what is going on.

A minimum query would be:


SELECT tags.title,tags.url,tags.track,year.name,tags.length,tags.discnumber,LOWER( tags.discnumber) as __discard ,tags.track as __discard
FROM tags
LEFT JOIN year ON year.id=tags.year
WHERE tags.album = 3807
AND tags.artist =2422
AND tags.sampler = \'f\'
ORDER BY tags.discnumber , tags.track;
User avatar
sebr
Moderator
Posts
301
Karma
0

Re:PostgreSQL database design

Thu Mar 30, 2006 11:31 am
jean - thank you very much for this marvelous work. Unfortunately, none of the developers use postgres and we have no active maintainer.

If you are interested in joining our team to do this work, we would be very grateful! Join us in #amarok @ irc.freenode.net to chat to us.
User avatar
markey
KDE Developer
Posts
2286
Karma
3
OS

Re:PostgreSQL database design

Thu Mar 30, 2006 11:35 am
Might some of these optimizations also be applicable to our other database backends, in particular SQLite (which has performance problems)?


--
Mark Kretschmann - Amarok Developer
Jean Pégésql
Karma
0

Re:PostgreSQL database design

Thu Mar 30, 2006 11:54 am
Dear friends,

Might some of these optimizations also be applicable to our other database backends, in particular SQLite (which has performance problems)?


- The constant use of lower() function in SQL scripts forces PostgreSQL to access data on disc, instead of using Hash joins from the beginning to the end of processing. In the end of processing, SORT BY LOWER forces another disc access. Might well be the same for SQLite.

- The lack of PRIMARY keys on large tables forces disc access too. Also, it limits the planner capacity to optimize queries. Might well be the same for SQLite.

- Because of INNER JOINS, the planner is "blind" and does not know which is the main table. It can discover the main table, if all tables have PRIMARY KEYS, which is not the case. Might well be the same for SQLite.

None of the developers use PostgreSQL

Well, PostgreSQL is a reference implementation, like xorg implements X or GNU/Linux implements a reference Unix. PostgreSQL complies with SQL99 and is a reference for it.

pgAdmin3 includes a graphical query debugger. The debugger explains how the database searches for information (and displays a graphic) and how long it takes to proceed. When using MySQL, you cannot analyse these problems, simply because MySQL is ages behind MySQL.

IMHO, if all developpers used pgAdmin3, there would not be any long-lasting queries. Most problems today are linked to database design problems.

Anyway, I am interested in joining the team to help.
Will contact you on IRQ today.

Thanks for the invitation.
Peter Ndikuwera
Karma
0

Re:PostgreSQL database design

Thu Mar 30, 2006 12:55 pm
Jean,

Your emails were very detailed and quite interesting.

Couple of questions/clarifications.

1. OIDs
The amarok tables all have OIDs. By default, most tools don\'t display this, but if you do:

[code:1]SELECT OID,* FROM tags[/code:1], you\'ll see them

2. Using a trigger to write lower values to the tables.
The items being \"lowered\" in the query include fields such as \"artistname\" and \"albumname\". These fields can\'t be entered into the database as lowercase because they\'re used throughout amarok. Personally, I prefer to see artist as \"Paul McCartney\" rather than \"paul mccartney\"!

3. Inner Joins
Agree 100%

Overall, I think your optimisations can help not just PostgreSQL but MySQL and SQLite.
Peter Ndikuwera
Karma
0

Re:PostgreSQL database design

Thu Mar 30, 2006 1:02 pm
To answer myself above. What we can do is add indexes that index on the lowercased string, e.g.:

CREATE INDEX album_idx2 ON album( lower(name) );

I\'ll need to check if this works with MySQL and SQLite.
Jean Pégésql
Karma
0

Re:PostgreSQL database design

Thu Mar 30, 2006 1:02 pm
Hello Peter,

1. Your table has OIDs, probably because your PostgreSQL installation creates tables with OIDs by default. My PostgreSQL 8.1 server does not create table with OIDs by default.

If I run SELECT OID,* FROM tags, an error is returned.

Anyway, the lack of primary keys is a problem. Let us take the example of the \"tags\" table. Obviously, a file on disc is unique, therefore the url should be the primary key.

2. I don\'t prpose to lower all values, but to add other fields that store lowered values. This will allow most database systems to run queries in memory, without accessing disc.

3. Okay.

I registered the dev mailing list and will join you there to discuss.
Peter Ndikuwera
Karma
0

Re:PostgreSQL database design

Thu Mar 30, 2006 1:04 pm
To answer myself above. What we can do is add indexes that index on the lowercased string, e.g.:

CREATE INDEX album_idx2 ON album( lower(name) );

I\'ll need to check if this works with MySQL and SQLite.
Jean Pégésql
Karma
0

Re:PostgreSQL database design

Thu Mar 30, 2006 1:14 pm
Dear Peter,

This is a neat solution. If lower values are only needed for sorting, this is the preferred solution.

I posted on the dev mailing list. Can we continue there?

Kind regards,
Jean-Michel
Mattias Fliesberg
Registered Member
Posts
1
Karma
0

Re:PostgreSQL database design

Thu Mar 30, 2006 1:41 pm
I\'ve actually noticed this problem with MySQL too.. I fairly recently cleared my old MySQL database and started a new, it went without lag for a couple of weeks, but now it lags every so often.. (my collection contains >24000 tracks).
UsagiJer
Karma
0

Re:database performance

Fri Apr 21, 2006 5:50 pm
I\'m running Amarok 1.4-beta SVN with xine-engine and Mysql backend.

I found I could speed up my startup and query times by modifying
collectiondb::likeCondition so that if it doesn\'t have any wildcards, and doesn\'t have any escaped characters, it should use = instead of LIKE.
there is also a slight performance hit putting quotes around a int value, as MySql needs to convert the string back to int to do its compares.
UsagiJer
Karma
0

Re:database performance

Mon Apr 24, 2006 3:19 pm
Also, using the \'describe\' command , you can see what keys are being used for a query.

mysql> describe SELECT tags.url,tags.title,artist.name FROM tags LEFT JOIN artist ON artist.id=tags.artist LEFT JOIN statistics ON tags.url = statistics.url where statistics.url IS null AND ( 0 OR artist.name = \'The Rolling Stones\' ) ORDER BY RAND() LIMIT 4 OFFSET 0 ;
+------------+--------+---------------+---------+---------+-------------+--------+---------------------------------+
| table | type | possible_keys | key | key_len | ref | rows | Extra |
+------------+--------+---------------+---------+---------+-------------+--------+---------------------------------+
| tags | ALL | NULL | NULL | NULL | NULL | 414944 | Using temporary; Using filesort |
| artist | eq_ref | PRIMARY | PRIMARY | 4 | tags.artist | 1 | Using where |
| statistics | ref | url,url_stats | url | 256 | tags.url | 1 | Using where; Using index |
+------------+--------+---------------+---------+---------+-------------+--------+---------------------------------+


Tags is going to be searched top-to-bottom . slow slow slow.

if we switch the order of \'tags\' and \'artist\' in that first LEFT JOIN, we get:

mysql> describe SELECT tags.url,tags.title,artist.name FROM artist LEFT JOIN tags ON artist.id=tags.artist LEFT JOIN statistics ON tags.url = statistics.url where statistics.url IS null AND ( 0 OR artist.name = \'The Rolling Stones\' ) ORDER BY RAND() LIMIT 4 OFFSET 0 ;
+------------+------+---------------+------------+---------+-----------+------+----------------------------------------------+
| table | type | possible_keys | key | key_len | ref | rows | Extra |
+------------+------+---------------+------------+---------+-----------+------+----------------------------------------------+
| artist | ref | artist_idx | artist_idx | 256 | const | 1 | Using where; Using temporary; Using filesort |
| tags | ref | artist_tag | artist_tag | 5 | artist.id | 11 | |
| statistics | ref | url,url_stats | url | 256 | tags.url | 1 | Using where; Using index |
+------------+------+---------------+------------+---------+-----------+------+----------------------------------------------+
3 rows in set (0.00 sec)

All keyed lookups. much quicker.
OmniDux
Registered Member
Posts
1
Karma
0

Re: PostgreSQL database design

Thu Sep 04, 2008 2:17 pm
I realize this thread is a bit old now, but I was just wondering if anyone has any idea of the current status on the suggestions made here?

And slightly off topic, but in the case of tuning Amarok for large collections it would be useful (in particular on a slow computer) to have an option to select how long Amarok waits from the last keystroke until it queries the database.


Bookmarks



Who is online

Registered users: bancha, Bing [Bot], daret, Evergrowing, Google [Bot], lockheed, sandyvee, Sogou [Bot]