tahabi - Developer IT

What is the right way to process inconsistent data files?

- by Tahabi

I'm working at a company that uses Excel files to store product data, specifically, test results from products before they are shipped out. There are a few thousand spreadsheets with anywhere from 50-100 relevant data points per file. Over the years, the schema for the spreadsheets has changed significantly, but not unidirectionally - in the sense that, changes often get reverted and then re-added in the space of a few dozen to few hundred files. My project is to convert about 8000 of these spreadsheets into a database that can be queried. I'm using MongoDB to deal with the inconsistency in the data, and Python. My question is, what is the "right" or canonical way to deal with the huge variance in my source files? I've written a data structure which stores the data I want for the latest template, which will be the final template used going forward, but that only helps for a few hundred files historically. Brute-forcing a solution would mean writing similar data structures for each version/template - which means potentially writing hundreds of schemas with dozens of fields each. This seems very inefficient, especially when sometimes a change in the template is as little as moving a single line of data one row down or splitting what used to be one data field into two data fields. A slightly more elegant solution I have in mind would be writing schemas for all the variants I can find for pre-defined groups in the source files, and then writing a function to match a particular series of files with a series of variants that matches that set of files. This is because, more often that not, most of the file will remain consistent over a long period, only marred by one or two errant sections, but inside the period, which section is inconsistent, is inconsistent. For example, say a file has four sections with three data fields, which is represented by four Python dictionaries with three keys each. For files 7000-7250, sections 1-3 will be consistent, but section 4 will be shifted one row down. For files 7251-7500, 1-3 are consistent, section 4 is one row down, but a section five appears. For files 7501-7635, sections 1 and 3 will be consistent, but section 2 will have five data fields instead of three, section five disappears, and section 4 is still shifted down one row. For files 7636-7800, section 1 is consistent, section 4 gets shifted back up, section 2 returns to three cells, but section 3 is removed entirely. Files 7800-8000 have everything in order. The proposed function would take the file number and match it to a dictionary representing the data mappings for different variants of each section. For example, a section_four_variants dictionary might have two members, one for the shifted-down version, and one for the normal version, a section_two_variants might have three and five field members, etc. The script would then read the matchings, load the correct mapping, extract the data, and insert it into the database. Is this an accepted/right way to go about solving this problem? Should I structure things differently? I don't know what to search Google for either to see what other solutions might be, though I believe the problem lies in the domain of ETL processing. I also have no formal CS training aside from what I've taught myself over the years. If this is not the right forum for this question, please tell me where to move it, if at all. Any help is most appreciated. Thank you.

Developer IT