Where would you start if you were trying to solve this PDF classification problem?
Posted by burtonic on Programmers
Published on 2012-06-24T17:55:53Z
Tags: algorithms, ruby
We are crawling and downloading lots of companies' PDFs and trying to pick out the ones that are Annual Reports. Such reports can be downloaded from most companies' investor-relations pages.
Each PDF is scanned, and the database is populated with fields including:
- Title
- Contents (full text)
- Page count
- Word count
- Orientation
- First line
Using this data, we check for obvious phrases such as:
- Annual report
- Financial statement
- Quarterly report
- Interim report
We then record the frequency of these phrases and others. So far we have around 350,000 PDFs to scan and a training set of 4,000 documents that have been manually classified as either a report or not.
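A minimal sketch of this kind of phrase counting might look like the following. The `PHRASES` constant, the `phrase_counts` and `phrase_features` helpers, the document hash keys, and the per-1,000-words normalization are all hypothetical choices made for illustration; only the phrases themselves come from the list above.

```ruby
# Hypothetical phrase list, taken from the examples in the post.
PHRASES = [
  "annual report",
  "financial statement",
  "quarterly report",
  "interim report"
].freeze

# Count case-insensitive occurrences of each phrase in a document's full text.
# Whitespace runs (including line breaks from PDF extraction) are collapsed
# first so that a phrase split across lines still matches.
def phrase_counts(text)
  normalized = text.downcase.gsub(/\s+/, " ")
  PHRASES.to_h { |phrase| [phrase, normalized.scan(phrase).length] }
end

# Turn raw counts into rates per 1,000 words so that long and short
# documents are comparable. `doc` is an assumed hash with :contents and
# :word_count keys, mirroring the database fields described above.
def phrase_features(doc)
  words = [doc[:word_count].to_i, 1].max # guard against division by zero
  phrase_counts(doc[:contents]).transform_values { |count| count * 1000.0 / words }
end
```

Normalizing by word count is one possible design choice: a 200-page prospectus that mentions "annual report" twice should probably not score the same as a 2-page notice that does.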
We are experimenting with a number of different approaches including Bayesian classifiers and weighting the different factors available. We are building the classifier in Ruby. My question is: if you were thinking about this problem, where would you start?
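For concreteness, the Bayesian approach mentioned above could be sketched roughly as follows. This is a generic multinomial naive Bayes classifier with add-one (Laplace) smoothing over word tokens, not the classifier actually being built; the `NaiveBayes` class, its method names, and the simple letters-only tokenizer are assumptions made for illustration.

```ruby
# A minimal multinomial naive Bayes text classifier (illustrative sketch).
class NaiveBayes
  def initialize
    @doc_counts   = Hash.new(0)                              # label => number of training docs
    @token_counts = Hash.new { |h, k| h[k] = Hash.new(0) }   # label => token => count
    @token_totals = Hash.new(0)                              # label => total tokens seen
    @vocabulary   = {}                                       # set of all tokens seen
  end

  # Record one manually classified document.
  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |token|
      @token_counts[label][token] += 1
      @token_totals[label] += 1
      @vocabulary[token] = true
    end
  end

  # Return the label with the highest log posterior for the given text.
  def classify(text)
    tokens = tokenize(text)
    total_docs = @doc_counts.values.sum
    @doc_counts.keys.max_by do |label|
      log_prior = Math.log(@doc_counts[label].to_f / total_docs)
      # Add-one smoothing keeps unseen tokens from zeroing out the score.
      denominator = @token_totals[label] + @vocabulary.size
      log_likelihood = tokens.sum do |token|
        Math.log((@token_counts[label][token] + 1.0) / denominator)
      end
      log_prior + log_likelihood
    end
  end

  private

  # Assumed tokenizer: lowercase runs of ASCII letters only.
  def tokenize(text)
    text.downcase.scan(/[a-z]+/)
  end
end

classifier = NaiveBayes.new
classifier.train(:report, "annual report and financial statement for shareholders")
classifier.train(:report, "annual report with consolidated financial statement")
classifier.train(:other,  "product brochure and pricing sheet")
classifier.train(:other,  "user manual for the product")
label = classifier.classify("consolidated annual report")
```

With a 4,000-document training set, a natural first step would be to hold out part of it and measure how often a baseline like this agrees with the manual labels before adding the other factors (title, page count, orientation).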