Logic: Best way to sample & count bytes of a 100MB+ file
- by Jami
Let's say I have a 170 MB file (roughly 180 million bytes). What I need to do is create a table that lists:
all 4096-byte combinations found [column 'bytes'], and
the number of times each combination appears in the file [column 'occurrences'].
Assume two things:
I can save data very quickly, but
I can only update saved data very slowly.
How should I sample the file and save the needed information?
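For concreteness, here's a minimal Python sketch of the enumeration; I'm treating "combinations" as overlapping sliding windows here, though the same idea covers non-overlapping blocks with step=4096. The windows() helper and file path are just placeholder names:

```python
def windows(path, size=4096, step=1):
    """Yield every size-byte window of the file at path.

    step=1 yields overlapping sliding windows; step=size
    yields non-overlapping 4096-byte blocks instead.
    """
    with open(path, "rb") as f:
        data = f.read()  # ~170 MB, fits in memory comfortably
    for i in range(0, len(data) - size + 1, step):
        yield data[i:i + size]
```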
Here are some approaches I've tried that turned out to be (extremely) slow:
Go through each 4096-byte combination in the file and save it, but first search the table for an existing entry and update its count when there is one. This is unbelievably slow (first sketch below).
Go through each 4096-byte combination in the file and save the rows into a temporary table until it holds 1 million of them. Then go through that table, combine repeating byte combinations into single rows, and copy the result into the big table. Repeat with the next 1 million rows. This is a bit faster, but still unbelievably slow (second sketch below).
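Here's a minimal sketch of the first approach, assuming SQLite as the storage backend (the actual engine isn't important; stats.db and bigfile.bin are placeholder names) and reusing the windows() helper from the sketch above. The lookup before every single write is what makes it crawl:

```python
import sqlite3

conn = sqlite3.connect("stats.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS stats ("
    "bytes BLOB PRIMARY KEY, occurrences INTEGER NOT NULL)")

for w in windows("bigfile.bin"):
    # Search the table first, then insert or update: one
    # lookup per window, on the order of 180 million times.
    row = conn.execute(
        "SELECT occurrences FROM stats WHERE bytes = ?", (w,)
    ).fetchone()
    if row is None:
        conn.execute("INSERT INTO stats VALUES (?, 1)", (w,))
    else:
        conn.execute(
            "UPDATE stats SET occurrences = occurrences + 1 "
            "WHERE bytes = ?", (w,))
conn.commit()
```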
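And a sketch of the second approach under the same assumptions: append rows into a temporary table as fast as possible, then collapse duplicates and merge them into the big table one batch at a time (the upsert syntax requires SQLite 3.24 or newer):

```python
import sqlite3
from itertools import islice

conn = sqlite3.connect("stats.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS stats ("
    "bytes BLOB PRIMARY KEY, occurrences INTEGER NOT NULL)")
conn.execute("CREATE TEMP TABLE batch (bytes BLOB)")

BATCH = 1_000_000
it = windows("bigfile.bin")  # helper from the first sketch
while True:
    chunk = list(islice(it, BATCH))
    if not chunk:
        break
    # Fast part: append raw rows with no lookups at all.
    conn.executemany(
        "INSERT INTO batch (bytes) VALUES (?)",
        ((w,) for w in chunk))
    # Slow part: collapse duplicates inside the batch, then
    # merge the counts into the big table in one statement.
    # (WHERE true works around SQLite's parsing ambiguity
    # between a join's ON and the upsert's ON CONFLICT.)
    conn.execute(
        "INSERT INTO stats (bytes, occurrences) "
        "SELECT bytes, COUNT(*) FROM batch WHERE true "
        "GROUP BY bytes "
        "ON CONFLICT(bytes) DO UPDATE SET "
        "occurrences = occurrences + excluded.occurrences")
    conn.execute("DELETE FROM batch")
conn.commit()
```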
This is kind of like taking the statistics of the file.
NOTE:
I know that sampling the file can generate tons of data (around 22 GB from experience), and I know that any solution posted will take a fair amount of time to finish. What I need is the most efficient saving process.