Create a term-document matrix from files
- by Joe
I have a set of files from example001.txt to example100.txt. Each file contains a list of keywords from a superset (the superset is available if we want it).
So example001.txt might contain
apple
banana
...
otherfruit
I'd like to process these files and produce a matrix with the example files along the top row, the fruit down the side, and a '1' in a cell if the fruit is in that file (a '0' otherwise).
An example might be...
x example1 example2 example3
Apple 1 1 0
Banana 0 1 0
Coconut 0 1 1
Any idea how I might build some sort of command-line magic to put this together? I'm on OS X and happy with Perl or Python...
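For illustration, one possible approach in Python is sketched below. It is a minimal sketch, not a definitive solution, and it makes several assumptions that are not stated in the question: each file holds one keyword per line, blank lines can be ignored, keywords are compared exactly as written (so "apple" and "Apple" would count as different rows), and the rows are the union of keywords found in the files rather than the full superset. The function names (read_keywords, build_matrix, format_matrix) are made up for this example.

```python
import glob
import os


def read_keywords(path):
    """Return the set of non-empty, whitespace-stripped lines in one file."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def build_matrix(paths):
    """Return (names, terms, rows) for a 0/1 term-document matrix.

    names: one column label per file (file name without extension)
    terms: sorted union of all keywords seen (assumption: the superset
           file is not used, so unseen keywords get no row)
    rows:  rows[i][j] is 1 if terms[i] appears in file j, else 0
    """
    names = [os.path.splitext(os.path.basename(p))[0] for p in paths]
    keyword_sets = [read_keywords(p) for p in paths]
    terms = sorted(set().union(*keyword_sets))
    rows = [[1 if term in kws else 0 for kws in keyword_sets] for term in terms]
    return names, terms, rows


def format_matrix(names, terms, rows):
    """Render the matrix as tab-separated text with an 'x' corner cell."""
    lines = ["\t".join(["x"] + names)]
    for term, row in zip(terms, rows):
        lines.append("\t".join([term] + [str(value) for value in row]))
    return "\n".join(lines)


if __name__ == "__main__":
    # Assumption: the script is run from the directory holding the files.
    paths = sorted(glob.glob("example*.txt"))
    print(format_matrix(*build_matrix(paths)))
```

Saved under a hypothetical name such as build_matrix.py and run from the directory containing the files, this prints tab-separated text that can be redirected to a file and opened in a spreadsheet. If rows for every keyword in the superset are wanted, including ones that appear in no file, the terms list could be read from the superset file instead of being computed from the union.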