Starting out NLP - Python + large data set
- by pencilNero
Hi,
I've been wanting to learn python and do some NLP, so have finally gotten round to starting. Downloaded the english wikipedia mirror for a nice chunky dataset to start on, and have been playing around a bit, at this stage just getting some of it into a sqlite db (havent worked with dbs in the past unfort).
But I'm guessing sqlite is not the way to go for a full blown nlp project(/experiment :) - what would be the sort of things I should look at ? HBase (.. and hadoop) seem interesting, i guess i could run then im java, prototype in python and maybe migrate the really slow bits to java... alternatively just run Mysql.. but the dataset is 12gb, i wonder if that will be a problem? Also looked at lucene, but not sure how (other than breaking the wiki articles into chunks) i'd get that to work..
What comes to mind for a really flexible NLP platform (i dont really know at this stage WHAT i want to do.. just want to learn large scale lang analysis tbh) ?
Many thanks.