Random forests for short texts
Posted
by Jasie
on Stack Overflow
See other posts from Stack Overflow
or by Jasie
Published on 2010-05-02T23:45:12Z
Indexed on
2010/05/02
23:47 UTC
Read the original article
Hit count: 242
random-forests
|machine-learning
Hi all,
I've been reading about Random Forests (1,2) because I think it'd be really cool to be able to classify a set of 1,000 sentences into pre-defined categories. I'm wondering if someone can explain to me the algorithm better, I think the papers are a bit dense. Here's the gist from 1:
Overview
We assume that the user knows about the construction of single classification trees. Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
Each tree is grown as follows:
- If the number of cases in the training set is N, sample N cases at random - but with replacement, from the original data. This sample will be the training set for growing the tree.
- If there are M input variables, a number m « M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
- Each tree is grown to the largest extent possible. There is no pruning.
So, does this look right?
- I'd have N = 1,000 training cases (sentences),
- M = 100 variables (let's say, there are only 100 unique words across all sentences), so the input vector is a bit vector of length 100 corresponding to each word.
- I randomly sample N = 1000 cases at random (with replacement) to build trees from.
- I pick some small number of input variables m « M, let's say 10, to build a tree off of.
- Do I build tree nodes randomly, using all m input variables? How many classification trees do I build?
Thanks for the help!
© Stack Overflow or respective owner