I have a working MySQL data warehouse that is organized as a star schema and I am using Talend Open Studio for Data Integration 5.1 to create the ETL process. I would like this process to run once per day. I have estimated that one of the dimension tables (dimUser) will have approximately 2 million records and 23 columns.
I created a small test ETL process in Talend that worked, but given the amount of data that may need to be updated daily, the current performance will not cut it. It takes the ETL process four minutes to UPDATE or INSERT 100 records to dimUser. If I assumed a linear relationship between the count of records and the amount of time to UPDATE or INSERT, then there is no way the ETL can finish in 3-4 hours (my hope), let alone one day.
Since I'm unfamiliar with Java, I wrote the ETL as a Python script and ran into the same problem. Although, I did discover that if I did only INSERT, the process went much faster. I am pretty sure that the bottleneck is caused by the UPDATE statements.
The primary key in dimUser is an auto-increment integer. My friend suggested that I scrap this primary key and replace it with a multi-field primary key (in my case, 2-3 fields).
Before I rip the test data out of my warehouse and change the schema, can anyone provide suggestions or guidelines related to
the design of the data warehouse
the ETL process
how realistic it is to have an ETL process INSERT or UPDATE a few million records each day
will my friend's suggestion significantly help
If you need any further information, just let me know and I'll post it.
UPDATE - additional information:
mysql> describe dimUser;
Field Type Null Key Default Extra
user_key int(10) unsigned NO PRI NULL auto_increment
id_A int(10) unsigned NO NULL
id_B int(10) unsigned NO NULL
field_4 tinyint(4) unsigned NO 0
field_5 varchar(50) YES NULL
city varchar(50) YES NULL
state varchar(2) YES NULL
country varchar(50) YES NULL
zip_code varchar(10) NO 99999
field_10 tinyint(1) NO 0
field_11 tinyint(1) NO 0
field_12 tinyint(1) NO 0
field_13 tinyint(1) NO 1
field_14 tinyint(1) NO 0
field_15 tinyint(1) NO 0
field_16 tinyint(1) NO 0
field_17 tinyint(1) NO 1
field_18 tinyint(1) NO 0
field_19 tinyint(1) NO 0
field_20 tinyint(1) NO 0
create_date datetime NO 2012-01-01 00:00:00
last_update datetime NO 2012-01-01 00:00:00
run_id int(10) unsigned NO 999
I used a surrogate key because I had read that it was good practice. Since, from a business perspective, I want to keep aware of potential fraudulent activity (say for 200 days a user is associated with state X and then the next day they are associated with state Y - they could have moved or their account could have been compromised), so that is why geographic data is kept. The field id_B may have a few distinct values of id_A associated with it, but I am interested in knowing distinct (id_A, id_B) tuples. In the context of this information, my friend suggested that something like (id_A, id_B, zip_code) be the primary key.
For the large majority of daily ETL processes (80%), I only expect the following fields to be updated for existing records: field_10 - field_14, last_update, and run_id (this field is a foreign key to my etlLog table and is used for ETL auditing purposes).