how to find maximum frequent item sets from large transactional data file
Posted
by ANIL MANE
on Stack Overflow
See other posts from Stack Overflow
or by ANIL MANE
Published on 2010-04-24T11:45:14Z
Indexed on
2010/04/24
11:53 UTC
Read the original article
Hit count: 421
Hi,
I have the input file contains large amount of transactions like
Transaction ID Items
T1 Bread, milk, coffee, juice
T2 Juice, milk, coffee
T3 Bread, juice
T4 Coffee, milk
T5 Bread, Milk
T6 Coffee, Bread
T7 Coffee, Bread, Juice
T8 Bread, Milk, Juice
T9 Milk, Bread, Coffee,
T10 Bread
T11 Milk
T12 Milk, Coffee, Bread, Juice
i want the occurrence of every unique item like
Item Name Count Bread 9 Milk 8 Coffee 7 Juice 6
and from that i want an a fp-tree now by traversing this tree i want the maximal frequent itemsets as follows
The basic idea of method is to dispose nodes in each “layer” from bottom to up. The concept of “layer” is different to the common concept of layer in a tree. Nodes in a “layer” mean the nodes correspond to the same item and be in a linked list from the “Head Table”. For nodes in a “layer” NBN method will be used to dispose the nodes from left to right along the linked list. To use NBN method, two extra fields will be added to each node in the ordered FP-Tree. The field tag of node N stores the information of whether N is maximal frequent itemset, and the field count’ stores the support count information in the nodes at left.
In Figure, the first node to be disposed is “juice: 2”. If the min_sup is equal to or less than 2 then “bread, milk, coffee, juice” is a maximal frequent itemset. Firstly output juice:2 and set the field tag of “coffee:3” as “false” (the field tag of each node is “true” initially ). Next check whether the right four itemsets juice:1 be the subset of juice:2. If the itemset one node “juice:1” corresponding to is the subset of juice:2 set the field tag of the node “false”. In the following process when the field tag of the disposed node is FALSE we can omit the node after the same tagging. If the min_sup is more than 2 then check whether the right four juice:1 is the subset of juice:2. If the itemset one node “juice:1” corresponding to is the subset of juice:2 then set the field count’ of the node with the sum of the former count’ and 2 After all the nodes “juice” disposed ,begin to dispose the node “coffee:3”.
Any suggestions or available source code, welcome.
thanks in advance
© Stack Overflow or respective owner