Optimize date query for large child tables: GiST or GIN?
- by Dave Jarvis
Problem
72 child tables, each having a year index and a station index, are defined as follows:
CREATE TABLE climate.measurement_12_013
(
-- Inherited from table climate.measurement_12_013: id bigint NOT NULL DEFAULT nextval('climate.measurement_id_seq'::regclass),
-- Inherited from table climate.measurement_12_013: station_id integer NOT NULL,
-- Inherited from table climate.measurement_12_013: taken date NOT NULL,
-- Inherited from table climate.measurement_12_013: amount numeric(8,2) NOT NULL,
-- Inherited from table climate.measurement_12_013: category_id smallint NOT NULL,
-- Inherited from table climate.measurement_12_013: flag character varying(1) NOT NULL DEFAULT ' '::character varying,
CONSTRAINT measurement_12_013_category_id_check CHECK (category_id = 7),
CONSTRAINT measurement_12_013_taken_check CHECK (date_part('month'::text, taken)::integer = 12)
)
INHERITS (climate.measurement)
CREATE INDEX measurement_12_013_s_idx
ON climate.measurement_12_013
USING btree
(station_id);
CREATE INDEX measurement_12_013_y_idx
ON climate.measurement_12_013
USING btree
(date_part('year'::text, taken));
(Foreign key constraints to be added later.)
The following query runs abysmally slow due to a full table scan:
SELECT
count(1) AS measurements,
avg(m.amount) AS amount
FROM
climate.measurement m
WHERE
m.station_id IN (
SELECT
s.id
FROM
climate.station s,
climate.city c
WHERE
-- For one city ...
--
c.id = 5182 AND
-- Where stations are within an elevation range ...
--
s.elevation BETWEEN 0 AND 3000 AND
6371.009 * SQRT(
POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) +
(COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) *
POW(RADIANS(c.longitude_decimal - s.longitude_decimal), 2))
) <= 50
) AND
--
-- Begin extracting the data from the database.
--
-- The data before 1900 is shaky; insufficient after 2009.
--
extract( YEAR FROM m.taken ) BETWEEN 1900 AND 2009 AND
-- Whittled down by category ...
--
m.category_id = 1 AND
m.taken BETWEEN
-- Start date.
(extract( YEAR FROM m.taken )||'-01-01')::date AND
-- End date. Calculated by checking to see if the end date wraps
-- into the next year. If it does, then add 1 to the current year.
--
(cast(extract( YEAR FROM m.taken ) + greatest( -1 *
sign(
(extract( YEAR FROM m.taken )||'-12-31')::date -
(extract( YEAR FROM m.taken )||'-01-01')::date ), 0
) AS text)||'-12-31')::date
GROUP BY
extract( YEAR FROM m.taken )
The sluggishness comes from this part of the query:
m.taken BETWEEN
/* Start date. */
(extract( YEAR FROM m.taken )||'-01-01')::date AND
/* End date. Calculated by checking to see if the end date wraps
into the next year. If it does, then add 1 to the current year.
*/
(cast(extract( YEAR FROM m.taken ) + greatest( -1 *
sign(
(extract( YEAR FROM m.taken )||'-12-31')::date -
(extract( YEAR FROM m.taken )||'-01-01')::date ), 0
) AS text)||'-12-31')::date
The HashAggregate from the plan shows a cost of 10006220141.11, which is, I suspect, on the astronomically huge side.
There is a full table scan on the measurement table (itself having neither data nor indexes) being performed. The table aggregates 237 million rows from its child tables.
Question
What is the proper way to index the dates to avoid full table scans?
Options I have considered:
GIN
GiST
Rewrite the WHERE clause
Separate year_taken, month_taken, and day_taken columns to the tables
What are your thoughts?
Thank you!