Performing an SVD on tweets: memory problem

Posted by plotti on Stack Overflow
Published on 2010-05-12T12:23:12Z Indexed on 2010/05/12 12:24 UTC

I have generated a huge CSV file as the output of my POS tagging and stemming. It looks like this:

        word1, word2, word3, ..., word14400
person1   1      2      0            1
person2   0      0      1            0
...
person650

It contains the word counts for each person, so each row gives me a characteristic vector for that person.
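Since most entries in a count matrix like this are likely to be zero, one option is to keep only the non-zero counts while reading the file row by row. The following is a minimal sketch using only the standard library. It assumes a comma-separated file whose header row starts with a label cell and whose first column is the person identifier; the real file layout may differ, and `read_sparse_rows` and the sample data are hypothetical names made up for illustration.

```python
import csv
import io


def read_sparse_rows(handle):
    """Return (vocab, rows), where rows maps each person to a dict of
    {column_index: count} holding only the non-zero counts."""
    reader = csv.reader(handle, skipinitialspace=True)
    header = next(reader)      # assumed layout: label cell, then one cell per word
    vocab = header[1:]
    rows = {}
    for record in reader:
        if not record:         # skip blank lines
            continue
        person, counts = record[0], record[1:]
        rows[person] = {
            j: int(c) for j, c in enumerate(counts) if int(c) != 0
        }
    return vocab, rows


# Tiny made-up sample in the assumed layout (not the real data).
sample = io.StringIO(
    "person,word1,word2,word3,word4\n"
    "person1,1,2,0,1\n"
    "person2,0,0,1,0\n"
)
vocab, rows = read_sparse_rows(sample)
```

Because the reader consumes one line at a time, the full text of the file never has to sit in memory at once; only the non-zero counts are retained.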

I want to run an SVD on this beast, but the matrix seems to be too big to hold in memory for the operation. My questions are:

  • Should I reduce the number of columns by removing words whose column sum is, for example, 1, meaning they have been used only once? Would this bias the data too much?

  • I tried the RapidMiner approach: loading the CSV into a database and then reading it back sequentially in batches for processing, as RapidMiner proposes. But MySQL can't store that many columns in a table. If I transpose the data and then re-transpose it on import, it also takes ages.
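The column pruning described in the first point can be sketched as follows. This only shows the mechanics; whether dropping rare words biases the result depends on what the vectors are used for afterwards. The function name `prune_rare_columns`, the `min_total` threshold, and the sample counts are all hypothetical.

```python
def prune_rare_columns(matrix, vocab, min_total=2):
    """Drop every column whose total count across all rows is below min_total."""
    totals = [sum(col) for col in zip(*matrix)]          # per-word totals
    keep = [j for j, total in enumerate(totals) if total >= min_total]
    pruned = [[row[j] for j in keep] for row in matrix]
    return pruned, [vocab[j] for j in keep]


# Made-up counts: rows are persons, columns are words.
counts = [
    [1, 2, 0, 1],
    [0, 0, 1, 0],
    [3, 0, 0, 0],
]
vocab = ["word1", "word2", "word3", "word4"]

# word3 and word4 each occur only once in total, so they are removed.
pruned, kept_vocab = prune_rare_columns(counts, vocab, min_total=2)
```

Comparing the number of columns before and after pruning gives a quick sense of how much of the vocabulary consists of words used only once.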

So in general I am asking for advice on how to perform an SVD on such a corpus.
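For scale: 650 × 14,400 is 9,360,000 entries, which is roughly 75 MB as 64-bit floats, so a dense numeric array of this shape may well fit in memory even if the CSV text or a full decomposition does not. A full SVD would also build a 14,400 × 14,400 right-hand factor (about 1.7 GB as 64-bit floats), whereas the economy-size form keeps it at 650 × 14,400. Below is a minimal sketch with NumPy, assuming NumPy is available; it uses a smaller synthetic matrix as a stand-in for the real data, and `k = 10` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the person-by-word count matrix (not the real data).
n_people, n_words = 65, 1440
A = rng.poisson(0.05, size=(n_people, n_words)).astype(np.float64)

# Economy-size SVD: with r = min(n_people, n_words),
# U is (n_people, r), s is (r,), Vt is (r, n_words).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k largest singular values for a rank-k approximation.
k = 10
person_vectors = U[:, :k] * s[:k]      # one k-dimensional vector per person
A_k = person_vectors @ Vt[:k, :]       # rank-k approximation of A
```

The rows of `person_vectors` are reduced representations of each person; how many components to keep is a modelling decision that this sketch does not settle.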

© Stack Overflow or respective owner
