Extracting information from PDFs of research papers

Posted by Christopher Gutteridge on Stack Overflow See other posts from Stack Overflow or by Christopher Gutteridge
Published on 2009-11-28T19:03:42Z Indexed on 2010/04/21 10:13 UTC
Read the original article Hit count: 497

Filed under:

pdf

|

metadata

|

extraction

I need a mechanism for extracting bibliographic metadata from PDF documents, to save people entering it by hand or cut-and-pasting it.

At the very least, the title and abstract. The list of authors and their affiliations would be good. Extracting out the references would be amazing.

Ideally this would be an open source solution.

The problem is that not all PDF's encode the text, and many which do fail to preserve the logical order of the text, so just doing pdf2text gives you line 1 of column 1, line 1 of column 2, line 2 of column 1 etc.

I know there's a lot of libraries. It's identifying the abstract, title authors etc. on the document that I need to solve. This is never going to be possible every time, but 80% would save a lot of human effort.

© Stack Overflow or respective owner

Related posts about pdf

PDF Converter wanted: Convert 8.5*11 PDF images into 600*800px PDF images for the Nook

as seen on Super User - Search for 'Super User'
I have PDF files that are maritime charts, For example this one from the Delaware Bay http://ocsdata.ncd.noaa.gov/BookletChart/12304_BookletChart_HomeEd.pdf There is a lot of detailed information in the image. When I show them on a monitor the details are shown. When I put them on the Nook they… >>> More
Loop through values and display in a pdf file

as seen on Stack Overflow - Search for 'Stack Overflow'
Hello all! i have written the following code: As you can see there is a for loop to go through some values and display them in the generated pdf. The problem is that all the values are being written at the same place. I have tried to insert a new line but it does not seem to work. Can anyone suggest… >>> More
convert scanned images pdf file to searchable pdf file

as seen on Super User - Search for 'Super User'
I have a pdf of a scanned book. I'm looking for a free software that will perform ocr and then provide an option to save it as pdf/doc. Is there one? Thanks. >>> More
Integrate Nitro PDF Reader with Windows 7

as seen on How to geek - Search for 'How to geek'
Would you like a lightweight PDF reader that integrates nicely with Office and Windows 7? Here we look at the new Nitro PDF Reader, a nice PDF viewer that also lets you create and markup PDF files. Adobe Reader is the de-facto PDF viewer, but it only lets you view PDFs and not much else. … >>> More
foxpro to pdf and pdf to foxpro

as seen on Stack Overflow - Search for 'Stack Overflow'
how will i convert pdf database converted from foxpro back to foxpro >>> More

Related posts about metadata

Storing SCA Metadata in the Oracle Metadata Services Repository by Nicolás Fonnegra Martinez and Markus Lohn

as seen on Oracle Blogs - Search for 'Oracle Blogs'
The advantages of using the Oracle Metadata Services Repository as a central storage for the metadata. SCA has been available since the release of the Oracle SOA Suite 11g. This technology combines and orchestrates several SOA components inside an SCA composite, making design, development… >>> More
Logical Domain Modeling Made Simple

as seen on Oracle Blogs - Search for 'Oracle Blogs'
How can logical domain modeling be made simple and collaborative? Many non-technical end-users, managers and business domain experts find it difficult to understand the visual models offered by many UML tools. This creates trouble in capturing and verifying the information that goes into a logical… >>> More
Error starting modern compiler

as seen on Stack Overflow - Search for 'Stack Overflow'
In my servlet , I m using Tomcat 5.0 and JRE is 1.5.0 but it is giving error when I click on the URL . As when I created a war file of my project and deployed in tomcat than it is working fine . It means that only problem with my eclipse configuration ERROR IS : - Apr 5, 2010 3:20:22 PM org.apache… >>> More
How to remove Thumbnail property of a JPEG image without distrubing other Exif data in C ++ .net.

as seen on Stack Overflow - Search for 'Stack Overflow'
I have an application which edits the metadata part of the JPEG image. I have to remove the thumbnail metadata with out disturbing other metadata. I have tried out the code below but was not successful in removing thumbnail metadata. can some help me out in solving this query thanks in advance. array<String^>^… >>> More
Ask Tom: "On Constraints, Metadata, and Truth"

as seen on Oracle Technology Network - Search for 'Oracle Technology Network'
Our technologist uses constraints to improve query performance. >>> More