I have built a small inventory system using postgresql and psycopg2. Everything works great, except, when I want to create aggregated summaries/reports of the content, I get really bad performance due to count()'ing and sorting.
The DB schema is as follows:
CREATE TABLE hosts
(
id SERIAL PRIMARY KEY,
name VARCHAR(255)
);
CREATE TABLE items
(
id SERIAL PRIMARY KEY,
description TEXT
);
CREATE TABLE host_item
(
id SERIAL PRIMARY KEY,
host INTEGER REFERENCES hosts(id) ON DELETE CASCADE ON UPDATE CASCADE,
item INTEGER REFERENCES items(id) ON DELETE CASCADE ON UPDATE CASCADE
);
There are some other fields as well, but those are not relevant.
I want to extract 2 different reports:
- List of all hosts with the number of items per, ordered from highest
to lowest count
- List of all items with the number of hosts per, ordered from highest to lowest count
I have used 2 queries for the purpose:
Items with host count:
SELECT i.id, i.description, COUNT(hi.id) AS count
FROM items AS i
LEFT JOIN host_item AS hi
ON (i.id=hi.item)
GROUP BY i.id
ORDER BY count DESC
LIMIT 10;
Hosts with item count:
SELECT h.id, h.name, COUNT(hi.id) AS count
FROM hosts AS h
LEFT JOIN host_item AS hi
ON (h.id=hi.host)
GROUP BY h.id
ORDER BY count DESC
LIMIT 10;
Problem is: the queries runs for 5-6 seconds before returning any data. As this is a web based application, 6 seconds are just not acceptable. The database is heavily populated with approximately 50k hosts, 1000 items and 400 000 host/items relations, and will likely increase significantly when (or perhaps if) the application will be used.
After playing around, I found that by removing the "ORDER BY count DESC" part, both queries would execute instantly without any delay whatsoever (less than 20ms to finish the queries).
Is there any way I can optimize these queries so that I can get the result sorted without the delay? I was trying different indexes, but seeing as the count is computed it is possible to utilize an index for this. I have read that count()'ing in postgresql is slow, but its the sorting that are causing me problems...
My current workaround is to run the queries above as an hourly job, putting the result into a new table with an index on the count column for quick lookup.
I use Postgresql 9.2.