Java or Python distributed compute job (on a student budget)?

Posted by midget_sadhu on Stack Overflow
Published on 2010-05-16T14:28:34Z Indexed on 2010/05/16 16:20 UTC


I have a large dataset (c. 40 GB) that I want to use for some NLP work (largely embarrassingly parallel) across a couple of computers in the lab, on which I do not have root access and have only 1 GB of user space. I experimented with Hadoop, but of course this was dead in the water: the data is stored on an external USB hard drive, and I can't load it onto the DFS because of the 1 GB user space cap. I have been looking into a couple of Python-based options (I'd rather use NLTK than Java's LingPipe if I can help it), and the distributed compute options seem to be:

  • IPython
  • Disco

After my Hadoop experience, I want to make an informed choice. Any help on which option might be more appropriate would be greatly appreciated.

Amazon's EC2 etc. is not really an option, as I have next to no budget.
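
To make the workload concrete, here is a minimal sketch of the shape of the job on a single machine, using only the Python standard library. It is not a distributed solution and is not how IPython or Disco would do it; it only shows the embarrassingly parallel pattern of streaming batches straight from the external drive and keeping nothing but small results, so the 1 GB user space cap is not touched. The names `read_batches`, `process_batch`, and `count_tokens` are made up for illustration, and the whitespace split is a stand-in for a real NLTK step.

```python
import itertools
import multiprocessing as mp
from collections import Counter


def read_batches(path, batch_size=10000):
    """Yield lists of lines lazily instead of loading the whole file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        while True:
            batch = list(itertools.islice(handle, batch_size))
            if not batch:
                return
            yield batch


def process_batch(lines):
    """Stand-in for the real NLP step (assumption: a plain whitespace split
    here; an NLTK tokenizer or tagger would go in its place)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def count_tokens(path, workers=2, batch_size=10000):
    """Fan batches out to worker processes and merge the partial counts."""
    total = Counter()
    with mp.Pool(processes=workers) as pool:
        # Batches are independent, so completion order does not matter.
        for partial in pool.imap_unordered(process_batch,
                                           read_batches(path, batch_size)):
            total.update(partial)
    return total
```

Something like `count_tokens("/path/to/corpus.txt", workers=4)` (a hypothetical path on the USB drive), called from under an `if __name__ == "__main__":` guard, would run this on one machine. Spreading the same batches over several lab machines is the part a framework would have to provide.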

© Stack Overflow or respective owner
