Which is faster: Appropriate data input or appropriate data structure?
- by Anon
I have a dataset whose columns look like this:
Consumer ID | Product ID | Time Period | Product Score
1 | 1 | 1 | 2
2 | 1 | 2 | 3
and so on.
As part of a program (written in C), I need to process the product scores given by all consumers for each combination of product and time period. For example, with 3 products and 2 time periods, I need to process the scores for each of the following 6 combinations:
Product ID | Time Period
1 | 1
1 | 2
2 | 1
2 | 2
3 | 1
3 | 2
I will need to process the data along the above lines many times (10k), and the dataset is fairly large (e.g., 48k consumers, 100 products, 24 time periods), so speed is an issue.
I came up with two ways to process the data and am wondering which is faster, or whether it does not matter much (speed matters, but not at the expense of maintainability or readability):
1. Sort the data on product id and time period, and then loop through the data to extract the rows for each combination.
2. Store the consumer ids of all consumers who provided product scores for a particular combination of product id and time period, and process the data accordingly.
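To make the sorting option concrete, here is a minimal sketch of what I have in mind. The names (Row, sort_rows, for_each_group, collect_stat) and the per-group work (summing the scores) are placeholders I made up for illustration; the real processing would go in the callback. The data is sorted once with qsort, and each later pass is a single linear scan in which every combination shows up as one contiguous run of rows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One row of the dataset (field names are illustrative). */
typedef struct {
    int consumer_id;
    int product_id;
    int time_period;
    double score;
} Row;

/* qsort comparator: order by product id, then by time period. */
int cmp_product_period(const void *a, const void *b)
{
    const Row *x = a;
    const Row *y = b;
    if (x->product_id != y->product_id)
        return (x->product_id > y->product_id) - (x->product_id < y->product_id);
    return (x->time_period > y->time_period) - (x->time_period < y->time_period);
}

/* Sort once up front; the sorted array can be reused for every pass. */
void sort_rows(Row *rows, size_t n)
{
    qsort(rows, n, sizeof rows[0], cmp_product_period);
}

/* Called once per (product, period) group with a contiguous slice of rows. */
typedef void (*GroupFn)(int product_id, int time_period,
                        const Row *rows, size_t count, void *ctx);

/* Walk sorted rows and hand each run of equal keys to fn.
   Returns the number of non-empty groups found. */
size_t for_each_group(const Row *rows, size_t n, GroupFn fn, void *ctx)
{
    size_t groups = 0;
    size_t start = 0;
    while (start < n) {
        size_t end = start + 1;
        while (end < n &&
               rows[end].product_id == rows[start].product_id &&
               rows[end].time_period == rows[start].time_period)
            end++;
        fn(rows[start].product_id, rows[start].time_period,
           rows + start, end - start, ctx);
        groups++;
        start = end;
    }
    return groups;
}

/* Placeholder processing step: record the count and score sum per group. */
typedef struct {
    int product_id;
    int time_period;
    size_t count;
    double sum;
} GroupStat;

typedef struct {
    GroupStat *out;   /* caller-provided array, large enough for all groups */
    size_t used;
} StatList;

void collect_stat(int product_id, int time_period,
                  const Row *rows, size_t count, void *ctx)
{
    StatList *list = ctx;
    GroupStat s = { product_id, time_period, count, 0.0 };
    assert(count > 0);            /* a run is never empty */
    for (size_t i = 0; i < count; i++)
        s.sum += rows[i].score;
    list->out[list->used++] = s;
}
```

With this layout, combinations for which nobody gave a score simply never appear in the scan, and the rows of each group sit next to each other in memory.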
Any thoughts? Is there any other way to speed up the processing? Thanks.
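For reference, a rough sketch of the other option, where the members of each combination are stored up front. It is a variant rather than exactly what I described: it stores row positions instead of consumer ids, and it assumes product ids run 1..n_products and time periods run 1..n_periods with no gaps (sparse ids would need a remapping step first). The names ComboIndex, combo_key, combo_index_build, and combo_sum are made up for illustration. The index is built in two counting passes, without sorting the rows themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One row of the dataset (field names are illustrative). */
typedef struct {
    int consumer_id;
    int product_id;
    int time_period;
    double score;
} Row;

/* items[start[k]] .. items[start[k + 1] - 1] are the row positions
   that belong to combination k. */
typedef struct {
    int n_products;
    int n_periods;
    size_t *start;   /* n_products * n_periods + 1 offsets */
    size_t *items;   /* row positions grouped by combination */
} ComboIndex;

/* Map a 1-based (product, period) pair to a dense 0-based key. */
size_t combo_key(const ComboIndex *ix, int product_id, int time_period)
{
    assert(product_id >= 1 && product_id <= ix->n_products);
    assert(time_period >= 1 && time_period <= ix->n_periods);
    return (size_t)(product_id - 1) * (size_t)ix->n_periods
         + (size_t)(time_period - 1);
}

void combo_index_free(ComboIndex *ix)
{
    free(ix->start);
    free(ix->items);
    ix->start = NULL;
    ix->items = NULL;
}

/* Build the index with a counting sort. Returns 0 on success, -1 on
   allocation failure. */
int combo_index_build(ComboIndex *ix, const Row *rows, size_t n,
                      int n_products, int n_periods)
{
    size_t n_keys = (size_t)n_products * (size_t)n_periods;
    size_t *next = malloc((n_keys ? n_keys : 1) * sizeof *next);

    ix->n_products = n_products;
    ix->n_periods = n_periods;
    ix->start = calloc(n_keys + 1, sizeof *ix->start);
    ix->items = malloc((n ? n : 1) * sizeof *ix->items);
    if (!next || !ix->start || !ix->items) {
        free(next);
        combo_index_free(ix);
        return -1;
    }

    /* Pass 1: count rows per combination, stored one slot ahead. */
    for (size_t i = 0; i < n; i++)
        ix->start[combo_key(ix, rows[i].product_id, rows[i].time_period) + 1]++;

    /* A prefix sum turns the counts into start offsets. */
    for (size_t k = 0; k < n_keys; k++)
        ix->start[k + 1] += ix->start[k];

    /* Pass 2: drop each row position into the next free slot of its bucket. */
    for (size_t k = 0; k < n_keys; k++)
        next[k] = ix->start[k];
    for (size_t i = 0; i < n; i++)
        ix->items[next[combo_key(ix, rows[i].product_id, rows[i].time_period)]++] = i;

    free(next);
    return 0;
}

/* Placeholder processing step: sum the scores of one combination and
   report how many rows it has through count_out. */
double combo_sum(const ComboIndex *ix, const Row *rows,
                 int product_id, int time_period, size_t *count_out)
{
    size_t k = combo_key(ix, product_id, time_period);
    double sum = 0.0;
    for (size_t j = ix->start[k]; j < ix->start[k + 1]; j++)
        sum += rows[ix->items[j]].score;
    *count_out = ix->start[k + 1] - ix->start[k];
    return sum;
}
```

Compared with sorting, this leaves the original row order untouched and gives direct access to any single combination, but each lookup goes through items, so the rows of one group are not necessarily adjacent in memory.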