Bulk Insert of hundreds of millions of records
- by Dave Jarvis
What is the fastest way to insert 237 million records into a table that has rules (for distributing the data across 84 child tables)?
First I tried inserts. No go.
Then I tried inserts with BEGIN/COMMIT. Not nearly fast enough.
Next, I tried COPY FROM, but then noticed the documentation states that the rules are ignored. (And it was having difficulties with the column order and date format -- it said that '1984-07-1' was not a valid integer; true, but a bit unexpected.)
Some example data:
station_id,taken,amount,category_id,flag
1,'1984-07-1',0,4,
1,'1984-07-2',0,4,
1,'1984-07-3',0,4,
1,'1984-07-4',0,4,T
Here is the table structure (with one rule included):
CREATE TABLE climate.measurement
(
id bigserial NOT NULL,
station_id integer NOT NULL,
taken date NOT NULL,
amount numeric(8,2) NOT NULL,
category_id smallint NOT NULL,
flag character varying(1) NOT NULL DEFAULT ' '::character varying
)
WITH (
OIDS=FALSE
);
ALTER TABLE climate.measurement OWNER TO postgres;
CREATE OR REPLACE RULE i_measurement_01_001 AS
ON INSERT TO climate.measurement
WHERE date_part('month'::text, new.taken)::integer = 1 AND new.category_id = 1 DO INSTEAD INSERT INTO climate.measurement_01_001 (id, station_id, taken, amount, category_id, flag)
VALUES (new.id, new.station_id, new.taken, new.amount, new.category_id, new.flag);
I can generate the data into any format.
Am looking for something that won't take four days.
I originally had the data in MySQL (still do), but am hoping to get a performance increase by switching to PostgreSQL and am eager to use its PL/R extensions for stats.
I was also thinking about using:
http://pgbulkload.projects.postgresql.org/
Any help, tips, or guidance would be greatly appreciated.
Thank you!