Extracting information from PDFs of research papers

Posted by Christopher Gutteridge on Stack Overflow
Published on 2009-11-28T19:03:42Z Indexed on 2010/04/21 10:13 UTC

I need a mechanism for extracting bibliographic metadata from PDF documents, to save people entering it by hand or cut-and-pasting it.

At the very least I need the title and abstract. The list of authors and their affiliations would be good, and extracting the references would be amazing.

Ideally this would be an open source solution.

The problem is that not all PDFs encode the text, and many that do fail to preserve its logical order, so just running pdf2text gives you line 1 of column 1, line 1 of column 2, line 2 of column 1, etc.
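To make the column problem concrete, here is a minimal sketch of the kind of reordering that would be needed. It assumes some extraction step can already supply each text fragment with its position on the page; the `(x, y, text)` tuple shape, the `reorder_two_columns` name, and the "split at the middle of the page" rule are all illustrative assumptions, not the behaviour of any particular library, and a real paper with a full-width title or figures would need more than this.

```python
from typing import List, Tuple

# A fragment is (x, y, text): left edge, top edge, and the text of one line piece.
# This shape is an assumption; real extractors expose coordinates differently.
Fragment = Tuple[float, float, str]


def reorder_two_columns(fragments: List[Fragment], page_width: float) -> List[str]:
    """Return fragment texts in reading order for a simple two-column page."""
    middle = page_width / 2
    # Assign each fragment to a column by where its left edge falls.
    left = [f for f in fragments if f[0] < middle]
    right = [f for f in fragments if f[0] >= middle]
    # Within a column, read top to bottom (smaller y is assumed to be higher up).
    left.sort(key=lambda f: f[1])
    right.sort(key=lambda f: f[1])
    # Read the whole left column first, then the whole right column.
    return [f[2] for f in left + right]


# The interleaved order described above: line 1 of column 1, line 1 of column 2, ...
interleaved = [
    (50, 100, "col 1, line 1"),
    (320, 100, "col 2, line 1"),
    (50, 115, "col 1, line 2"),
    (320, 115, "col 2, line 2"),
]
ordered = reorder_two_columns(interleaved, page_width=600)
```

With this input, `ordered` lists both lines of column 1 before either line of column 2. PDFs that do not encode the text at all would still need OCR before anything like this could apply.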

I know there are a lot of libraries. What I need to solve is identifying the abstract, title, authors, etc. in the document. This is never going to be possible every time, but 80% would save a lot of human effort.
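As an illustration of the sort of identification step meant here, the sketch below guesses a title and an abstract from lines that are already in reading order. Everything in it is an assumption made for the example: the `(text, font_size)` line shape, the function names, the rule "the title is whatever is set in the largest font", and the rule "the abstract runs from a line starting with 'Abstract' to the next 'Introduction' or 'Keywords' heading". Heuristics like these would miss plenty of papers, which is why the target is a good hit rate rather than every document.

```python
import re
from typing import List, Optional, Tuple

# A line is (text, font_size). This shape is assumed for the example only.
Line = Tuple[str, float]

# "Abstract", optionally followed by punctuation, at the start of a line.
_ABSTRACT_START = re.compile(r"^\s*abstract\b[\s.:\-]*", re.IGNORECASE)
# A heading that plausibly ends the abstract, with an optional section number.
_ABSTRACT_END = re.compile(
    r"^\s*(\d+\.?\s*)?(introduction|keywords|index terms)\b", re.IGNORECASE
)


def guess_title(lines: List[Line]) -> str:
    """Guess the title as all first-page lines set in the largest font."""
    if not lines:
        return ""
    biggest = max(size for _, size in lines)
    return " ".join(text.strip() for text, size in lines if size == biggest)


def guess_abstract(lines: List[Line]) -> Optional[str]:
    """Guess the abstract as the text between 'Abstract' and the next heading."""
    collected: List[str] = []
    inside = False
    for text, _ in lines:
        if not inside:
            match = _ABSTRACT_START.match(text)
            if match:
                inside = True
                # Keep any text that follows "Abstract" on the same line.
                rest = text[match.end():].strip()
                if rest:
                    collected.append(rest)
            continue
        if _ABSTRACT_END.match(text):
            break
        collected.append(text.strip())
    return " ".join(collected) if inside else None


# Made-up first-page lines, purely to exercise the heuristics.
first_page = [
    ("Extracting Metadata from", 18.0),
    ("Research Papers", 18.0),
    ("A. Author, B. Author", 11.0),
    ("Abstract", 10.0),
    ("We describe a heuristic", 10.0),
    ("for finding titles.", 10.0),
    ("1. Introduction", 12.0),
    ("Body text.", 10.0),
]
title = guess_title(first_page)
abstract = guess_abstract(first_page)
```

Returning `None` when no "Abstract" marker is found lets a caller fall back to manual entry for that field instead of storing a wrong guess.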

