Where would you start if you were trying to solve this PDF classification problem?

Posted by burtonic on Programmers
Published on 2012-06-24T17:55:53Z Indexed on 2012/06/24 21:22 UTC

Filed under: algorithms | ruby

We are crawling and downloading lots of companies' PDFs and trying to pick out the ones that are Annual Reports. Such reports can be downloaded from most companies' investor-relations pages.

Each PDF is scanned, and the database is populated with, among other things, its:

Title
Contents (full text)
Page count
Word count
Orientation
First line

Using this data, we check for obvious phrases such as:

Annual report
Financial statement
Quarterly report
Interim report

We then record the frequency of these phrases and others. So far we have around 350,000 PDFs to scan and a training set of 4,000 documents that have been manually classified as either a report or not.
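To make the phrase counting concrete, here is a minimal sketch of how it could be done in Ruby. The names `PHRASES`, `phrase_counts` and `phrase_rates` are hypothetical, and the choices to match case-insensitively, tolerate a trailing "s", and normalise by word count are assumptions for illustration, not a description of our actual code.

```ruby
# Hypothetical phrase list, mirroring the examples above.
PHRASES = ["annual report", "financial statement",
           "quarterly report", "interim report"].freeze

# Count case-insensitive occurrences of each phrase in the full text.
def phrase_counts(text, phrases = PHRASES)
  # Collapse whitespace so phrases broken across lines still match.
  normalized = text.downcase.gsub(/\s+/, " ")
  phrases.each_with_object({}) do |phrase, counts|
    # An optional trailing "s" catches simple plurals (assumption).
    pattern = /\b#{Regexp.escape(phrase)}s?\b/
    counts[phrase] = normalized.scan(pattern).length
  end
end

# Occurrences per 1,000 words, so long documents do not dominate.
def phrase_rates(text, word_count, phrases = PHRASES)
  return phrases.map { |p| [p, 0.0] }.to_h if word_count.to_i <= 0
  phrase_counts(text, phrases).transform_values { |n| n * 1000.0 / word_count }
end
```

The per-1,000-words rate is one possible way to make counts comparable across documents of very different length; raw counts or a simple present/absent flag may work just as well.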

We are experimenting with a number of different approaches including Bayesian classifiers and weighting the different factors available. We are building the classifier in Ruby. My question is: if you were thinking about this problem, where would you start?
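For reference, the Bayesian approach mentioned above could be sketched roughly like this: a minimal multinomial naive Bayes over word tokens with add-one smoothing. The class name `NaiveBayes`, its methods, the `:report`/`:other` labels and the toy training strings are all made up for illustration; it is a sketch under those assumptions, not our production classifier.

```ruby
# Minimal multinomial naive Bayes with add-one (Laplace) smoothing.
# All names here are illustrative, not taken from an existing gem.
class NaiveBayes
  def initialize
    @doc_counts  = Hash.new(0)                             # label => training docs
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) }  # label => token => count
    @totals      = Hash.new(0)                             # label => total tokens
    @vocab       = {}                                      # every token seen so far
  end

  # Very crude tokenizer (assumption): lowercase ASCII words only.
  def tokenize(text)
    text.downcase.scan(/[a-z]+/)
  end

  def train(text, label)
    @doc_counts[label] += 1
    tokenize(text).each do |token|
      @word_counts[label][token] += 1
      @totals[label] += 1
      @vocab[token] = true
    end
  end

  # Returns label => unnormalized log-probability for the text.
  def scores(text)
    total_docs = @doc_counts.values.sum
    tokens = tokenize(text)
    @doc_counts.each_with_object({}) do |(label, n), out|
      log_prob = Math.log(n.to_f / total_docs)   # class prior
      denom = @totals[label] + @vocab.size       # smoothed denominator
      tokens.each do |token|
        log_prob += Math.log((@word_counts[label][token] + 1.0) / denom)
      end
      out[label] = log_prob
    end
  end

  def classify(text)
    scores(text).max_by { |_, score| score }.first
  end
end

nb = NaiveBayes.new
nb.train("annual report and financial statements for shareholders", :report)
nb.train("consolidated balance sheet annual report auditors", :report)
nb.train("product brochure installation guide", :other)
nb.train("press release new product launch", :other)
nb.classify("annual report with balance sheet")  # => :report
```

With the 4,000 manually classified documents, part of the set could be held back to measure how often `classify` agrees with the manual label, which would also give a baseline for comparing against the weighted-factor approach.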

© Programmers or respective owner
