What's the best way to count unique visitors with Hadoop?
Posted
by beagleguy
on Stack Overflow
Published on 2010-05-21T20:37:52Z
Hey all, I'm just getting started with Hadoop and am curious what the best way in MapReduce would be to count unique visitors if your log files looked like this:
DATE siteID action username
05-05-2010 siteA pageview jim
05-05-2010 siteB pageview tom
05-05-2010 siteA pageview jim
05-05-2010 siteB pageview bob
05-05-2010 siteA pageview mike
and you wanted to find the number of unique visitors for each site?
I was thinking the mapper would emit siteID \t username, and the reducer would keep a set() of the unique usernames per key and then emit the length of that set. However, that could mean storing millions of usernames in memory for a single site, which doesn't seem right. Does anyone have a better way?
I'm using Hadoop Streaming with Python, by the way.
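Roughly what I have in mind, as a minimal sketch only. The function names `map_lines` and `reduce_lines` are placeholders I made up, and it assumes whitespace-separated fields, a header row starting with DATE, and that streaming hands the reducer its input sorted by key:

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'siteID<TAB>username' for each log record."""
    for line in lines:
        fields = line.split()
        # Skip blank or malformed rows and the header row (assumed to start with DATE).
        if len(fields) != 4 or fields[0] == "DATE":
            continue
        _date, site, _action, user = fields
        yield f"{site}\t{user}"


def reduce_lines(lines):
    """Reducer: expects input sorted by key, so each site's lines are contiguous."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for site, group in groupby(pairs, key=lambda pair: pair[0]):
        # This set is the part that worries me: it holds every distinct
        # username for the site in memory at once.
        users = {user for _site, user in group}
        yield f"{site}\t{len(users)}"

# In a real streaming job each function would live in its own script,
# be fed sys.stdin, and have its output printed line by line.
```

The set in `reduce_lines` grows with the number of distinct users per site, which is exactly the memory problem I'm asking about.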
thanks
© Stack Overflow or respective owner