Classifying captured data in unknown format?
- by monch1962
I've got a large set of captured data (potentially hundreds of thousands of records), and I need to be able to break it down so I can both classify it and also produce "typical" data myself. Let me explain further...
If I have the following strings of data:
132T339G1P112S
164T897F5A498S
144T989B9B223T
155T928X9Z554T
...
you might start to infer the following:
possibly all strings are 14 characters long
the 4th, 8th, 10th and 14th characters may always be alphas, while the rest are numeric
the first character may always be a '1'
the 4th character may always be the letter 'T'
the 14th character may be limited to only being 'S' or 'T'
and so on...
As you get more and more samples of real data, some of these "rules" might disappear; if you see a 15 character long string, then you have evidence that the 1st "rule" is incorrect. However, given a sufficiently large sample of strings that are exactly 14 characters long, you can start to assume that "all strings are 14 characters long" and assign a numeric figure to your degree of confidence (with an appropriate set of assumptions around the fact that you're seeing a suitably random set of all possible captured data).
As you can probably tell, a human can do a lot of this classification by eye, but I'm not aware of libraries or algorithms that would allow a computer to do it.
Given a set of captured data (significantly more complex than the above...), are there libraries that I can apply in my code to do this sort of classification for me, that will identify "rules" with a given degree of confidence?
As a next step, I need to be able to take those rules, and use them to create my own data that conforms to these rules. I assume this is a significantly easier step than the classification, but I've never had to perform a task like this before so I'm really not sure how complex it is.
At a guess, Python or Java (or possibly Perl or R) are possibly the "common" languages most likely to have these sorts of libraries, and maybe some of the bioinformatic libraries do this sort of thing. I really don't care which language I have to use; I need to solve the problem in whatever way I can.
Any sort of pointer to information would be very useful. As you can probably tell, I'm struggling to describe this problem clearly, and there may be a set of appropriate keywords I can plug into Google that will point me towards the solution.
Thanks in advance