K-means on MapReduce with Python
Posted by user3616059 on Stack Overflow
Published on 2014-06-10T09:15:25Z
Indexed on 2014/06/10 9:24 UTC
I am going to write a mapper and a reducer for the k-means algorithm. I think the best approach is to put the distance calculation in the mapper and send each row to the reducer with the cluster id as the key and the row's coordinates as the value. The reducer would then update the centroids. I am writing this in Python.
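To make the plan concrete, here is a minimal sketch of what I have in mind. It assumes rows are comma-separated coordinates and that the current centroids are already available to the mapper; how they get there in a real job is not shown. The names parse_point, nearest, mapper and reducer are my own and are not part of Hadoop streaming.

```python
import math
from itertools import groupby


def parse_point(text):
    """Turn a comma-separated string such as '1.0,2.0' into a list of floats."""
    return [float(x) for x in text.split(",")]


def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mapper(lines, centroids):
    """Emit 'cluster_id<TAB>coordinates' for every non-empty input row."""
    for line in lines:
        line = line.strip()
        if line:
            yield f"{nearest(parse_point(line), centroids)}\t{line}"


def reducer(lines):
    """Average the rows of each cluster; the input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for cluster_id, group in groupby(pairs, key=lambda kv: kv[0]):
        points = [parse_point(value) for _, value in group]
        mean = [sum(col) / len(points) for col in zip(*points)]
        yield cluster_id + "\t" + ",".join(f"{m:.4f}" for m in mean)


if __name__ == "__main__":
    rows = ["1.0,1.0", "1.0,2.0", "9.0,9.0", "10.0,9.0"]
    start = [[0.0, 0.0], [10.0, 10.0]]  # made-up initial centroids
    # sorted() stands in for the sort step between the map and reduce phases.
    for out in reducer(sorted(mapper(rows, start))):
        print(out)
```

In an actual streaming job the two functions would live in separate scripts that read sys.stdin and print each yielded line; here they are chained in one process only so the sketch can be run locally.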
As you know, I have to use Hadoop streaming to transfer data between STDIN and STDOUT. As far as I know, when the mapper prints (key + "\t" + value), that line is sent to the reducer. The reducer receives the data and calculates the new centroids, but when it prints the new centroids, I think they are not sent back to the mapper to calculate new clusters; they are just written to STDOUT. And as you know, k-means is an iterative algorithm. So my question is whether Hadoop streaming is poorly suited to iterative programs, and whether we should use MRJOB for iterative programs instead.
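To show what I mean by iterative, this is roughly the loop I would need around the job. It is only a local sketch in plain Python: one_pass stands in for one complete map and reduce round, and the names one_pass, run_kmeans, max_iter and tol are my own assumptions, not anything provided by Hadoop streaming or MRJOB.

```python
import math


def one_pass(points, centroids):
    """One map + reduce round: assign each point, then average each cluster."""
    clusters = {}
    for p in points:
        cid = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters.setdefault(cid, []).append(p)
    new = list(centroids)  # a cluster with no points keeps its old centroid
    for cid, members in clusters.items():
        new[cid] = [sum(col) / len(members) for col in zip(*members)]
    return new


def run_kmeans(points, centroids, max_iter=20, tol=1e-6):
    """Repeat one_pass until the centroids stop moving or max_iter is reached."""
    for iteration in range(1, max_iter + 1):
        updated = one_pass(points, centroids)
        # Largest distance any centroid moved in this round.
        shift = max(math.dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift < tol:
            break
    return centroids, iteration


if __name__ == "__main__":
    data = [[1.0, 1.0], [1.0, 2.0], [9.0, 9.0], [10.0, 9.0]]
    print(run_kmeans(data, [[0.0, 0.0], [10.0, 10.0]]))
```

The part I do not know how to express with streaming alone is the step where the output of one round becomes the centroids for the next.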
© Stack Overflow or respective owner