Where would you start if you were trying to solve this PDF classification problem?
- by burtonic
We are crawling and downloading lots of companies' PDFs and trying to pick out the ones that are Annual Reports. Such reports can be downloaded from most companies' investor-relations pages.
Each PDF is scanned and the database is populated with, among other things, its:
Title
Contents (full text)
Page count
Word count
Orientation
First line
Using this data, we check for obvious phrases such as:
Annual report
Financial statement
Quarterly report
Interim report
We then record the frequency of these phrases and others. So far we have around 350,000 PDFs to scan and a training set of 4,000 documents that have been manually classified as either a report or not.
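To make the phrase-counting step concrete, here is a minimal sketch in Ruby. It assumes the extracted full text is available as a string. The names `PHRASES`, `phrase_counts`, and `phrase_rates` are hypothetical, and normalizing to a rate per 1,000 words is an added assumption rather than something described above.

```ruby
# Phrases from the list above; extend as needed (hypothetical constant name).
PHRASES = ["annual report", "financial statement",
           "quarterly report", "interim report"].freeze

# Count case-insensitive occurrences of each phrase in a document's text.
# Whitespace runs (including line breaks left by PDF extraction) are
# collapsed first so a phrase split across lines still matches.
# Only a leading word boundary is used, so plurals such as
# "annual reports" are counted too.
def phrase_counts(text, phrases = PHRASES)
  normalized = text.downcase.gsub(/\s+/, " ")
  phrases.each_with_object({}) do |phrase, counts|
    counts[phrase] = normalized.scan(/\b#{Regexp.escape(phrase)}/).length
  end
end

# Assumption: express counts per 1,000 words so that long and short
# PDFs are comparable. word_count is the value stored in the database.
def phrase_rates(text, word_count, phrases = PHRASES)
  denom = [word_count, 1].max
  phrase_counts(text, phrases).transform_values { |c| c * 1000.0 / denom }
end
```

The resulting hash can be stored alongside the other fields (page count, orientation, and so on) and used as numeric features by whichever classifier is chosen.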
We are experimenting with a number of different approaches including Bayesian classifiers and weighting the different factors available. We are building the classifier in Ruby. My question is: if you were thinking about this problem, where would you start?
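For reference, this is roughly the kind of Bayesian classifier meant here: a minimal multinomial naive Bayes sketch with add-one (Laplace) smoothing over word tokens. It is an illustration under assumptions, not our production code. The class name `NaiveBayes`, the labels `:report` and `:other`, and the simple letters-only tokenizer are all hypothetical choices.

```ruby
# Minimal multinomial naive Bayes with add-one smoothing (illustrative sketch).
class NaiveBayes
  def initialize
    @doc_counts  = Hash.new(0)                             # label => training docs seen
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) }  # label => { token => count }
    @vocab       = {}                                      # every token seen in training
  end

  # Assumption: lowercase runs of ASCII letters are good enough as tokens.
  def tokenize(text)
    text.downcase.scan(/[a-z]+/)
  end

  # Add one manually classified document to the model.
  def train(text, label)
    @doc_counts[label] += 1
    tokenize(text).each do |tok|
      @word_counts[label][tok] += 1
      @vocab[tok] = true
    end
  end

  # Log-probability score per label: log prior plus the sum of smoothed
  # log likelihoods of each token in the document.
  def scores(text)
    total_docs = @doc_counts.values.sum.to_f
    tokens = tokenize(text)
    @doc_counts.each_with_object({}) do |(label, n), out|
      denom = @word_counts[label].values.sum + @vocab.size
      log_prob = Math.log(n / total_docs)
      tokens.each do |tok|
        log_prob += Math.log((@word_counts[label][tok] + 1.0) / denom)
      end
      out[label] = log_prob
    end
  end

  # Returns the highest-scoring label; expects at least one trained document.
  def classify(text)
    scores(text).max_by { |_, score| score }.first
  end
end
```

With the 4,000 manually classified documents, usage would look like calling `train(contents, :report)` or `train(contents, :other)` for each one, then `classify(contents)` on the unlabeled PDFs. Working in log space avoids numeric underflow on long documents.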