Hi all,
I've been reading about Random Forests (1, 2) because I think it'd be really cool to classify a set of 1,000 sentences into pre-defined categories. I'm wondering if someone can explain the algorithm to me; I find the papers a bit dense. Here's the gist from (1):
Overview

We assume that the user knows about the construction of single classification trees. Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).

Each tree is grown as follows:

1. If the number of cases in the training set is N, sample N cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree.
2. If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
3. Each tree is grown to the largest extent possible. There is no pruning.
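To make sure I follow the mechanics, here's a rough from-scratch sketch in Python of how I picture the quoted procedure (assuming binary 0/1 features like my word vectors; all function names are just mine for illustration, and this is a sketch, not production code):

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(X, y, m):
    """Grow one unpruned tree; at each node, only m randomly chosen
    features are considered as split candidates (step 2 of the quote)."""
    if len(set(y)) == 1:                  # pure node: stop splitting
        return y[0]
    best = None
    for f in random.sample(range(len(X[0])), m):
        left = [i for i, row in enumerate(X) if row[f] == 0]
        right = [i for i, row in enumerate(X) if row[f] == 1]
        if not left or not right:
            continue                      # this feature doesn't split the node
        score = (len(left) * gini([y[i] for i in left]) +
                 len(right) * gini([y[i] for i in right])) / len(y)
        if best is None or score < best[0]:
            best = (score, f, left, right)
    if best is None:                      # none of the m features split anything
        return Counter(y).most_common(1)[0][0]
    _, f, left, right = best
    return (f,
            grow_tree([X[i] for i in left], [y[i] for i in left], m),
            grow_tree([X[i] for i in right], [y[i] for i in right], m))

def predict_tree(tree, row):
    while isinstance(tree, tuple):        # internal node: (feature, left, right)
        f, left_child, right_child = tree
        tree = right_child if row[f] == 1 else left_child
    return tree                           # leaf: a class label

def grow_forest(X, y, n_trees, m):
    """Step 1 of the quote: each tree gets its own bootstrap sample."""
    N = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [random.randrange(N) for _ in range(N)]   # sample with replacement
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], m))
    return forest

def predict_forest(forest, row):
    """Each tree votes; the forest returns the majority class."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# tiny demo: 6 cases, 4 binary features, m = 2, 25 trees
X = [[1,0,1,0], [1,1,0,0], [0,0,1,1], [0,1,0,1], [1,0,0,0], [0,0,1,0]]
y = ["a", "a", "b", "b", "a", "b"]
forest = grow_forest(X, y, n_trees=25, m=2)
print(predict_forest(forest, [1,0,1,0]))  # expect "a" (feature 0 is decisive)
```

Note that I wrote the per-node feature selection the way I read step 2, i.e. a fresh random subset at each node, but that's exactly the part I'm unsure about (see my question below).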
So, does the following mapping look right for my case?
I'd have N = 1,000 training cases (sentences), and
M = 100 input variables (say there are only 100 unique words across all sentences), so each sentence's input vector is a bit vector of length 100 indicating which of those words it contains.
For each tree, I sample N = 1,000 cases at random (with replacement) from the original data to serve as that tree's training set.
I pick some small number of input variables m << M, say m = 10 (I notice sqrt(100) = 10, and I gather m ≈ sqrt(M) is a common choice), to split on.
When I grow a tree, do I re-select the m variables at random at every node, or pick them once per tree and use all m at each split? And how many classification trees do I build?
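To make my setup concrete (and in case it helps to answer in these terms), here's roughly how I was planning to try it with scikit-learn, which I gather implements this whole pipeline; the sentences and labels below are made-up placeholders:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the cat sat on the mat", "stocks fell sharply today",
             "the dog chased the cat", "markets rallied after the news"]
labels = ["animals", "finance", "animals", "finance"]

# binary=True gives the 0/1 "which words appear" vectors I described above
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(sentences)   # N x M bit matrix

# n_estimators = number of trees; max_features="sqrt" considers m = sqrt(M)
# candidate variables per split, which I believe matches the quoted algorithm
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["the cat barked at the dog"])))
```

Does that parameter mapping (n_estimators for the number of trees, max_features for m) sound right?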
Thanks for the help!