Naive Bayesian classification (spam filtering) - Doubt in one calculation? Which one is right?
- by Microkernel
Hi guys,
I am implementing a Naive Bayesian classifier for spam filtering, and I have a doubt about one of the calculations. Please help me clarify it. Here is my question.
In this method, you have to calculate:
P(S|W) - Probability that a message is spam, given that word W occurs in it.
P(W|S) - Probability that word W occurs in a spam message.
P(W|H) - Probability that word W occurs in a Ham message.
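For reference, the three quantities are tied together by Bayes' theorem. The priors P(S) and P(H) (the overall probability that a message is spam or ham) are not defined above; they are extra symbols introduced here for the formula.

```latex
P(S \mid W) = \frac{P(W \mid S)\,P(S)}{P(W \mid S)\,P(S) + P(W \mid H)\,P(H)}
```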
So to calculate P(W|S), should I do
(1) (Number of times W occurs in spam)/(Total number of times W occurs in all the messages)
OR
(2) (Number of times W occurs in spam)/(Total number of words in the spam messages)
So, which is it, (1) or (2)? (I think it is (2), but I am not sure, so please clarify.)
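To make the two candidates concrete, here is a minimal sketch that computes both from tokenized training messages. The function name `candidate_estimates` and the toy messages are made up for illustration, and it assumes the denominator in (2) means the word total over all spam training messages. Note that (1) divides by occurrences of W across both classes, so it measures what share of W's occurrences fall in spam, while (2) divides by the spam word total only.

```python
from collections import Counter

def candidate_estimates(spam_msgs, ham_msgs, word):
    """Return (candidate 1, candidate 2) for `word`.

    spam_msgs and ham_msgs are lists of messages, each a list of tokens.
    """
    spam_counts = Counter(tok for msg in spam_msgs for tok in msg)
    ham_counts = Counter(tok for msg in ham_msgs for tok in msg)

    w_in_spam = spam_counts[word]
    w_in_all = spam_counts[word] + ham_counts[word]
    total_spam_words = sum(spam_counts.values())

    # (1): occurrences of W in spam / occurrences of W in all messages
    cand1 = w_in_spam / w_in_all if w_in_all else 0.0
    # (2): occurrences of W in spam / total words in all spam messages
    cand2 = w_in_spam / total_spam_words if total_spam_words else 0.0
    return cand1, cand2

# Hypothetical toy data, not from a real corpus.
spam = [["buy", "cheap", "pills", "buy"], ["cheap", "offer"]]
ham = [["meeting", "at", "noon"], ["buy", "milk"]]
c1, c2 = candidate_estimates(spam, ham, "buy")
print(c1, c2)
```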
By the way, I am referring to http://en.wikipedia.org/wiki/Bayesian_spam_filtering for the info.
I have to complete the implementation by this weekend :(
Thanks and regards,
MicroKernel :)
@sth:
Hmm... Shouldn't repeated occurrences of word W increase a message's spam score? In your approach they wouldn't, right?
Let's take a scenario and discuss...
Let's say we have 100 training messages, of which 50 are spam and 50 are ham, and say the word count of each message is 100.
And let's say word W occurs 5 times in each spam message and 1 time in each ham message.
So the total number of times W occurs in all the spam messages = 5*50 = 250 times.
And the total number of times W occurs in all the ham messages = 1*50 = 50 times.
Total occurrences of W in all of the training messages = (250+50) = 300 times.
So, in this scenario, how do you calculate P(W|S) and P(W|H)?
Naturally we should expect P(W|S) > P(W|H), right?
Please share your thought...
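The scenario can be worked through numerically. This is only a sketch: the variable names are made up, and the "presence" estimate (fraction of messages containing W at least once) is an assumed reading of the approach being questioned, not something stated in this thread. The last step plugs candidate (2) into Bayes' theorem with equal priors, which is what the 50/50 split implies.

```python
# Numbers taken from the scenario.
n_spam_msgs = 50
n_ham_msgs = 50
words_per_msg = 100
w_per_spam_msg = 5
w_per_ham_msg = 1

w_in_spam = w_per_spam_msg * n_spam_msgs   # 250
w_in_ham = w_per_ham_msg * n_ham_msgs      # 50
w_total = w_in_spam + w_in_ham             # 300

# Candidate (1): share of W's occurrences that fall in each class.
cand1_spam = w_in_spam / w_total           # 250/300
cand1_ham = w_in_ham / w_total             # 50/300

# Candidate (2): occurrences of W / total words in that class.
cand2_spam = w_in_spam / (n_spam_msgs * words_per_msg)  # 250/5000
cand2_ham = w_in_ham / (n_ham_msgs * words_per_msg)     # 50/5000

# Assumed presence-based estimate: fraction of messages containing W.
# W appears in every spam and every ham message, so both are 1.0 and
# the repeated occurrences in spam make no difference.
presence_spam = n_spam_msgs / n_spam_msgs
presence_ham = n_ham_msgs / n_ham_msgs

# Bayes' theorem with candidate (2) and equal priors P(S) = P(H) = 0.5.
p_spam = n_spam_msgs / (n_spam_msgs + n_ham_msgs)
p_ham = 1 - p_spam
spam_given_w = (cand2_spam * p_spam) / (cand2_spam * p_spam + cand2_ham * p_ham)

print(cand1_spam, cand1_ham)
print(cand2_spam, cand2_ham)
print(presence_spam, presence_ham)
print(spam_given_w)
```

With these numbers, candidate (2) gives P(W|S) = 0.05 and P(W|H) = 0.01, so P(W|S) > P(W|H) as expected. Candidate (1) gives 250/300, which is the same value Bayes' theorem returns for P(S|W) here, so in this equal-sized setup (1) behaves like the posterior rather than like P(W|S).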