A good machine learning technique to weed out good URLs from bad
- by git-noob
Hi,
I have an application that needs to discriminate between good HTTP GET requests and bad ones.
For example:
http://somesite.com?passes=dodgy+parameter # BAD
http://anothersite.com?passes=a+good+parameter # GOOD
My system can make a binary decision about whether a URL is good or bad, but ideally I would like it to predict whether a previously unseen URL is good or bad.
http://some-new-site.com?passes=a+really+dodgy+parameter # BAD
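To make the "raw URLs, no explicit features" idea concrete, here is a sketch of the kind of representation I have in mind (character trigrams are my assumption, not something my system does today):

```python
from collections import Counter

def char_ngrams(url, n=3):
    """Count the overlapping character n-grams in a raw URL string.

    No hand-crafted features: every URL becomes a bag of short
    character sequences, so 'dodgy' and 'dodgy+parameter' share mass.
    """
    return Counter(url[i:i + n] for i in range(len(url) - n + 1))

feats = char_ngrams("http://somesite.com?passes=dodgy+parameter")
print(feats["dod"])  # prints 1
```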
I feel the need for a support vector machine (SVM) ... but I need to learn machine learning. Some questions:
1) Is an SVM appropriate for this task?
2) Can I train it with the raw URLs? - without explicitly specifying 'features'
3) How many URLs will I need for it to be good at predictions?
4) What kind of SVM kernel should I use?
5) After I train it, how do I keep it up to date?
6) How do I run unseen URLs against the trained SVM to decide whether they're good or bad?
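To show the whole pipeline I'm imagining (questions 2 and 6), here is a toy sketch. It uses a simple linear perceptron over character trigrams rather than a real SVM (an SVM would learn a similar linear boundary, just with a max-margin objective), and the training set is just the two example URLs above, so treat it purely as an illustration:

```python
from collections import Counter, defaultdict

def char_ngrams(url, n=3):
    """Bag of overlapping character n-grams from a raw URL string."""
    return Counter(url[i:i + n] for i in range(len(url) - n + 1))

def train_perceptron(data, epochs=10):
    """Train a linear classifier; data is [(url, label)], label +1/-1."""
    w = defaultdict(float)
    for _ in range(epochs):
        for url, label in data:
            feats = char_ngrams(url)
            score = sum(w[f] * c for f, c in feats.items())
            if label * score <= 0:          # misclassified: nudge weights
                for f, c in feats.items():
                    w[f] += label * c
    return w

def predict(w, url):
    """Score an unseen URL with the learned weights."""
    score = sum(w[f] * c for f, c in char_ngrams(url).items())
    return "GOOD" if score > 0 else "BAD"

train = [
    ("http://somesite.com?passes=dodgy+parameter", -1),      # BAD
    ("http://anothersite.com?passes=a+good+parameter", +1),  # GOOD
]
w = train_perceptron(train)

# The unseen URL shares 'dodgy'-ish trigrams with the bad example:
print(predict(w, "http://some-new-site.com?passes=a+really+dodgy+parameter"))
# prints "BAD"
```

In practice I assume I'd need far more than two training URLs, and an SVM library would replace `train_perceptron`, but this is the train-then-score loop I'm asking about.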