Performing an SVD on tweets: memory problem
- by plotti
I have generated a huge CSV file as the output of my POS tagging and stemming. It looks like this:
word1, word2, word3, ..., word14400
person1 1 2 0 1
person2 0 0 1 0
...
person650
It contains the word counts for each person, so each row gives me a characteristic vector for that person.
I want to run an SVD on this beast, but the matrix seems too big to hold in memory for the operation. My question is:
Should I reduce the number of columns by removing words whose column sum is, for example, 1, meaning they were used only once? Would that bias the data too much?
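To make the pruning idea concrete, here is a minimal sketch. It assumes Python with NumPy, which is not the toolchain described in this post, and a tiny made-up count matrix; the names `drop_rare_columns` and `min_total` are hypothetical.

```python
import numpy as np

def drop_rare_columns(counts, min_total=2):
    """Keep only the columns (words) whose total count is at least min_total."""
    col_sums = counts.sum(axis=0)   # total occurrences of each word
    keep = col_sums >= min_total    # boolean mask over the columns
    return counts[:, keep], keep

# Toy stand-in for the person-by-word count matrix (3 people, 5 words).
counts = np.array([
    [1, 2, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [3, 0, 0, 1, 0],
])

pruned, keep = drop_rare_columns(counts)
```

With the default `min_total=2`, words used only once (or never) are dropped. The mask `keep` records which original columns survived, so the remaining columns can still be mapped back to their words.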
I tried the RapidMiner approach: load the CSV into a database and then read it back sequentially in batches for processing, as RapidMiner proposes. But MySQL can't store that many columns in a table. If I transpose the data and then transpose it back on import, it also takes ages.
-- So in general I am asking for advice on how to perform an SVD on such a corpus.
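To show the kind of operation I mean, here is a minimal sketch of a truncated SVD on a sparse matrix. It assumes Python with NumPy and SciPy rather than the tools mentioned above, and it uses random stand-in data of the same shape (650 x 14400) instead of my real counts; the rank `k = 20` and the density of 1% are arbitrary choices for illustration.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Random stand-in for the person-by-word matrix: 650 people, 14400 words,
# roughly 1% of the entries non-zero. These are NOT real word counts.
X = sparse_random(650, 14400, density=0.01, format="csr",
                  random_state=0, dtype=np.float64)

k = 20  # number of singular triplets to compute (arbitrary)

# svds computes only k singular values/vectors and never builds a dense copy.
U, s, Vt = svds(X, k=k)

# The order returned by svds is not guaranteed, so sort largest-first.
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]

# One k-dimensional vector per person in the reduced space.
person_vectors = U * s
```

Only the non-zero entries are stored, and only the leading `k` components are computed, which is usually what matters for word-count data where most entries are zero.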