How to prune the MovieLens tag data set?
- by sakura90
The MovieLens data set provides a table with columns:
userid | movieid | tag | timestamp
I am having trouble reproducing how the MovieLens data set was pruned in this paper:
http://www.cse.ust.hk/~yzhen/papers/tagicofi-recsys09-zhen.pdf
Section 4.1 (Data Set) of that paper says:
"For the tagging information, we only keep those tags which are added
on at least 3 distinct movies. As for the users, we only
keep those users who used at least 3 distinct tags in their
tagging history. For movies, we only keep those movies that
are annotated by at least 3 distinct tags."
I tried to query the database:
select TMP.userid, count(*) as tagnum
from (select distinct T.userid as userid, T.tag as tag from tags T) AS TMP
group by TMP.userid
having tagnum = 3;
This returns 1760 users who each used exactly 3 distinct tags. However, some of
those tags are not added on at least 3 distinct movies, so the paper's tag condition is not met.
Any help is appreciated.
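
One possible reading, as a rough sketch rather than the paper's confirmed procedure: the three conditions affect each other (dropping a rare tag can push a user or a movie below 3 distinct tags), so the filters may need to be repeated until nothing more is removed. The paper does not say whether the filters were applied once or repeatedly, so the loop is an assumption. The names `prune` and `RULES` are made up for illustration, and the code assumes a table `tags(userid, movieid, tag, timestamp)` in SQLite, accessed through Python's `sqlite3` module. Since the paper says "at least 3", each rule keeps counts >= 3 rather than = 3.

```python
import sqlite3

# Each rule: (column being filtered, column that needs >= k distinct values).
# These mirror the three sentences quoted from the paper.
RULES = [
    ("tag", "movieid"),    # tags added on at least k distinct movies
    ("userid", "tag"),     # users who used at least k distinct tags
    ("movieid", "tag"),    # movies annotated by at least k distinct tags
]


def prune(conn, k=3):
    """Delete violating rows in place until a full pass deletes nothing.

    Assumption: the filters are iterated to a fixed point. The column
    names come from the constant RULES list, never from user input.
    """
    while True:
        deleted = 0
        for key, counted in RULES:
            cur = conn.execute(
                f"DELETE FROM tags WHERE {key} IN ("
                f" SELECT {key} FROM tags GROUP BY {key}"
                f" HAVING COUNT(DISTINCT {counted}) < ?)",
                (k,),
            )
            deleted += cur.rowcount  # rows removed by this rule
        if deleted == 0:
            return
```

After `prune(conn)` returns, `SELECT COUNT(DISTINCT userid) FROM tags` (and the same for `movieid` and `tag`) gives the sizes of the pruned data set, which can be compared against the numbers reported in the paper to check whether this reading matches.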