Optimising speeds in HDF5 using PyTables
Posted by Sree Aurovindh on Stack Overflow, published 2012-03-19.
The problem concerns the write speed on the rented machines (10 × 32-bit machines) and PostgreSQL query performance. I will explain the scenario in detail.
I have about 80 GB of data (with appropriate database indexes in place). I am reading it from a PostgreSQL database and writing it into HDF5 using PyTables. Each HDF5 file holds 1 table and 5 variable arrays. The HDF5 implementation is not multithreaded and not enabled for symmetric multiprocessing. To speed up my data handling, I have rented about 10 computers for a day and am splitting the writes across them.
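For concreteness, each file's layout is created along these lines (a minimal sketch; the column names, types, and file name are placeholders rather than my real schema):

    import tables

    # Placeholder row description; the real table has different columns.
    class TrainRow(tables.IsDescription):
        tr_id = tables.Int64Col()
        value = tables.Float64Col()

    h5 = tables.open_file("part01.h5", mode="w")
    train = h5.create_table("/", "train", TrainRow, expectedrows=7185000)
    # Five extendable arrays alongside the table.
    arrays = [h5.create_earray("/", "arr%d" % i,
                               atom=tables.Float64Atom(),
                               shape=(0,),           # growable first dimension
                               expectedrows=7185000)
              for i in range(5)]
    h5.close()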
As far as the PostgreSQL schema is concerned, the overall record count is 140 million, and there are 5 tables referring to the main table through primary/foreign keys. I am not using joins, as they do not scale for this workload.
So for a single record I do 6 lookups without joins and write the results into HDF5; each record thus triggers 6 inserts, one into the table and one into each of the corresponding arrays.
The queries are really simple:
    select * from x.train where tr_id=1  -- tr_id is the primary key and indexed
    select q_t from x.qt where q_id=2    -- q_id is not a primary key but is indexed

(and four more queries of the same form)
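For illustration, a batched version of this fetch-and-write loop might look roughly like the sketch below (assuming psycopg2; the connection string, file names, and ID ranges are placeholders):

    import psycopg2
    import tables

    conn = psycopg2.connect("dbname=mydb")   # placeholder connection string
    cur = conn.cursor()

    h5 = tables.open_file("part01.h5", mode="a")
    train_tbl = h5.root.train                 # the Table created earlier

    BATCH = 10000
    for start in range(1, 7185001, BATCH):    # this file's slice of tr_id values
        # One range scan over the indexed primary key replaces BATCH
        # single-row lookups, amortising the per-query round trip.
        cur.execute("select * from x.train where tr_id >= %s and tr_id < %s"
                    " order by tr_id", (start, start + BATCH))
        train_tbl.append(cur.fetchall())       # bulk append instead of row-by-row
    train_tbl.flush()

    h5.close()
    conn.close()

The same batching would apply to the five array queries; fewer, larger appends is where PyTables tends to gain the most.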
Each computer writes two HDF5 files, so the total comes to around 20 files.
Some calculations and statistics:
Total number of records: 143,700,000
Total number of records per file: 143,700,000 / 20 = 7,185,000
Total number of entries in each file: 7,185,000 * 5 = 35,925,000
Current PostgreSQL database config:
My current machine: 8 GB RAM with a 2nd-generation i7 processor.
I made the following changes to the PostgreSQL configuration file: shared_buffers = 2 GB, effective_cache_size = 4 GB.
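For reference, those two settings look like this in postgresql.conf (real parameter names; the comments are just my reading of them):

    shared_buffers = 2GB          # memory for PostgreSQL's own buffer cache
    effective_cache_size = 4GB    # planner hint for total cache (Postgres + OS)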
Note on current performance:
I have run it for about ten hours, and the total number of entries written to each file so far is about 621,000 * 5 = 3,105,000.
The bottleneck is that I can only rent the machines for 10 hours per day (overnight), and at this speed the job will take about 11 more days (35,925,000 / 3,105,000 ≈ 11.6 ten-hour runs), which is too long for my experiments.
Please suggest how I could improve this. Questions:
1. Should I use symmetric multiprocessing on those desktops (each has 2 cores with about 2 GB of RAM)? In that case, what is suggested or preferable? A rough sketch of what I have in mind follows this list.
2. If I change my PostgreSQL configuration file and increase the RAM, will it enhance the process?
3. Should I use multithreading? In that case, any links or pointers would be of great help.
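For question 1, what I have in mind is roughly the following (a minimal sketch: one worker process per HDF5 file, since the HDF5 library is not thread-safe; fetch_and_write stands in for the per-file loop sketched earlier):

    from multiprocessing import Process

    def fetch_and_write(filename, start_id, stop_id):
        # Placeholder: open a private DB connection and a private HDF5 file,
        # then copy records with start_id <= tr_id < stop_id as sketched above.
        ...

    if __name__ == "__main__":
        # Two workers per 2-core machine, each owning one HDF5 file.
        jobs = [
            Process(target=fetch_and_write, args=("part01.h5", 1, 7185001)),
            Process(target=fetch_and_write, args=("part02.h5", 7185001, 14370001)),
        ]
        for p in jobs:
            p.start()
        for p in jobs:
            p.join()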
Thanks, Sree Aurovindh V