I apologise in advance for the length of this post, but I want to paint an accurate picture of the problems my app is facing before posing some questions below.
I am trying to address some self-inflicted design pain that is now causing my application to crash with out-of-memory errors.
An abridged description of the problem domain is as follows:
The application takes in a “dataset” that consists of numerous text files containing related data.
An individual text file within the dataset usually contains around 20 “headers” holding metadata about the data in that file. It also contains a large tab-delimited section whose data relates to data in one of the other text files in the dataset. The number of columns per file is highly variable, ranging from 2 to 256+.
The original application was written to allow users to load a dataset, map certain columns of each file (essentially indicating the key information that shows how the files relate to one another) and identify a few expected column names. Once this is done, a validation process takes place to enforce various rules and to ensure that all the relationships between the files are valid. Once that is done, the data is imported into a SQL Server database. The database design is an EAV (Entity-Attribute-Value) model, used to cater for the variable columns per file. I know EAV has its detractors, but in this case I feel it was a reasonable choice given the disparate data and the variable number of columns submitted in each dataset.
The memory problem
Given that the combined size of all the text files was at most about 5 megs, and in an effort to reduce the database transaction time, it was decided to read ALL the data from the files into memory and then do the following (a stripped-down sketch of this flow appears after the list):
1. Perform all the validation whilst the data is in memory.
2. Relate the files to one another using an object model.
3. Start a DB transaction and write the key columns row by row, noting the Id of each written row (all tables in the database utilise identity columns); that Id is then applied to all of the row’s related data.
4. Once all related data has been updated with the key information to which it relates, write those records using SqlBulkCopy. Due to our EAV model, we essentially have x columns by y rows to write, where x can be 256+ and y often runs into the tens of thousands.
5. Once all the data has been written without error (which can take several minutes for large datasets), commit the transaction.
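In C# terms, the current flow is roughly the sketch below. It is deliberately stripped down, and the table, column and class names are invented stand-ins for the real schema:

```
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Invented stand-in for the real per-file object model.
class DatasetFile
{
    public string Name;
    public DataTable EavRows;   // the tab-delimited section, already fully in memory
}

class CurrentImportSketch
{
    // Everything below happens inside one long-lived transaction.
    public static void Import(string connectionString, int datasetId, IEnumerable<DatasetFile> files)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var transaction = connection.BeginTransaction())
            {
                foreach (var file in files)
                {
                    // Write the key row and capture its identity value.
                    var cmd = new SqlCommand(
                        "INSERT INTO FileEntity (DatasetId, FileName) VALUES (@d, @f); " +
                        "SELECT CAST(SCOPE_IDENTITY() AS int);",
                        connection, transaction);
                    cmd.Parameters.AddWithValue("@d", datasetId);
                    cmd.Parameters.AddWithValue("@f", file.Name);
                    int fileEntityId = (int)cmd.ExecuteScalar();

                    // Apply that id to every related EAV row held in memory...
                    foreach (DataRow row in file.EavRows.Rows)
                        row["FileEntityId"] = fileEntityId;

                    // ...then bulk copy the whole (potentially huge) EAV section.
                    using (var bulk = new SqlBulkCopy(connection, SqlBulkCopyOptions.Default, transaction))
                    {
                        bulk.DestinationTableName = "AttributeValue";
                        bulk.WriteToServer(file.EavRows);
                    }
                }

                // Nothing is committed until every file has been written without error.
                transaction.Commit();
            }
        }
    }
}
```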
The problem now comes from the fact that we are receiving individual files containing over 30 megs of data. In a dataset, we can receive any number of files; we’ve started seeing datasets of around 100 megs coming in, and I expect they are only going to get bigger from here on in. With files of this size, the data can’t even be read into memory without the app falling over, let alone be validated and imported. I anticipate having to modify large chunks of the code to allow validation to occur by parsing the files line by line, and I haven’t yet decided how to handle the import and transactions.
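For the validation side, the direction I am leaning towards is something like this: stream each file with a StreamReader and validate line by line, so that only a single row is ever held in memory. ParseHeader and ValidateDataLine below are placeholders for my real rules.

```
using System.IO;

class StreamingValidationSketch
{
    public static void ValidateFile(string path, int headerLineCount)
    {
        using (var reader = new StreamReader(path))
        {
            string line;
            int lineNumber = 0;
            while ((line = reader.ReadLine()) != null)
            {
                lineNumber++;
                if (lineNumber <= headerLineCount)
                {
                    ParseHeader(line);                 // metadata lines at the top of the file
                    continue;
                }

                string[] fields = line.Split('\t');    // tab-delimited data section
                ValidateDataLine(lineNumber, fields);  // only this one line is in memory
            }
        }
    }

    // Placeholders for the real validation rules.
    static void ParseHeader(string line) { }
    static void ValidateDataLine(int lineNumber, string[] fields) { }
}
```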
Potential improvements
I’ve wondered about using GUIDs to relate the data rather than relying on identity fields. This would allow the data to be related prior to writing it to the database, though it would certainly increase the storage required, especially in an EAV design. Do you think this is a reasonable thing to try, or should I simply persist with identity fields? (Natural keys can’t be trusted to be unique across all submitters.)
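To make the GUID idea concrete, this is the kind of thing I mean (the row classes are invented purely to show the shape):

```
using System;

// Invented row classes, just to illustrate the shape of the idea.
class FileEntityRow { public Guid Id; public string FileName; }
class AttributeValueRow { public Guid FileEntityId; public string Attribute; public string Value; }

class GuidKeySketch
{
    public static void Example()
    {
        // The key is generated client-side, so parent and child rows can be
        // related in memory (or streamed straight out to the database) without
        // a round trip to discover an identity value first.
        var fileKey = Guid.NewGuid();

        var fileRow = new FileEntityRow { Id = fileKey, FileName = "results.txt" };
        var valueRow = new AttributeValueRow
        {
            FileEntityId = fileKey,   // relationship is fixed before any INSERT happens
            Attribute = "SampleName",
            Value = "ABC123"
        };

        Console.WriteLine($"{valueRow.Attribute} belongs to file {fileRow.FileName}");
    }
}
```

The storage cost is what gives me pause: a uniqueidentifier is 16 bytes against 4 for an int, and in an EAV value table that difference is multiplied across every row.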
The other idea is to use staging tables to get the data into the database, and to only run a transaction for the final copy from the staging area to the actual destination tables.
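Something like the following is what I have in mind for the staging approach (table names are invented): the bulk copy into the staging table happens outside the main transaction, and only the final set-based copy is transacted.

```
using System.Data;
using System.Data.SqlClient;

class StagingTableSketch
{
    public static void LoadViaStaging(string connectionString, DataTable eavRows)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // Phase 1: push the raw rows into a staging table, outside the main transaction.
            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "AttributeValue_Staging";
                bulk.BatchSize = 10000;    // rows are sent to the server in chunks of 10,000
                bulk.WriteToServer(eavRows);
            }

            // Phase 2: one short transaction to move the rows into their real home.
            using (var transaction = connection.BeginTransaction())
            {
                var move = new SqlCommand(
                    "INSERT INTO AttributeValue (FileEntityId, Attribute, Value) " +
                    "SELECT FileEntityId, Attribute, Value FROM AttributeValue_Staging; " +
                    "TRUNCATE TABLE AttributeValue_Staging;",
                    connection, transaction);
                move.ExecuteNonQuery();
                transaction.Commit();
            }
        }
    }
}
```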
Questions
For systems like this that import large quantities of data, how do you go about keeping transactions small? I’ve kept them as small as possible in the current design, but they are still active for several minutes and write hundreds of thousands of records in one transaction. Is there a better solution?
The tab-delimited data section is read into a DataTable so that it can be viewed in a grid. I don’t need the full functionality of a DataTable, so I suspect it is overkill. Is there any way to turn off various features of DataTables to make them more lightweight?
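For context, the sort of lighter structure I picture replacing the DataTable with is little more than the column names plus the raw field arrays, with none of the change tracking, constraint or DataView machinery (this assumes the first line of the data section holds the column names):

```
using System.Collections.Generic;
using System.IO;

// A bare-bones container for the tab-delimited section, intended purely for display.
class TabularSection
{
    public string[] Columns;
    public List<string[]> Rows = new List<string[]>();
}

class LightweightGridSketch
{
    public static TabularSection ReadDataSection(TextReader reader)
    {
        var section = new TabularSection();
        section.Columns = reader.ReadLine().Split('\t');   // column names for the grid

        string line;
        while ((line = reader.ReadLine()) != null)
            section.Rows.Add(line.Split('\t'));            // one string[] per data row

        return section;
    }
}
```

The grid would presumably then need to be populated manually (or run in virtual mode) rather than being data-bound, which is part of what I am unsure about.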
Are there any other obvious things you would do in this situation to minimise the memory footprint of the application described above?
Thanks for your kind attention.