Is there a way to extract semantic information from a PDF? (Converting PDF to pure XHTML)

Posted by Eonil on Stack Overflow
Published on 2010-02-05T09:46:14Z Indexed on 2010/03/22 13:41 UTC

Filed under: pdf | semantic | structure

Hi. I'm looking for a way to extract semantic, structural information (such as titles, headings, paragraphs, or lists) from a PDF, because I want to get purely structural data out of it.

Ultimately, I want to create pure XHTML from the PDF, containing only structural information, with no design or layout.

I know a PDF can be created without any structural information. I'm not considering those PDFs, only well-structured ones.

I'm new to PDF, so I don't know whether it offers a regular semantic structure. If it does, its libraries will presumably expose it. So I want to know whether the PDF spec includes that information and, if so, the best way to get at it.
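To make the goal concrete, here is a minimal sketch of the output I have in mind. It assumes the structural elements have already been pulled out of the PDF somehow and are available as (role, content) pairs. The role names, the `ROLE_TO_TAG` mapping, and the `to_xhtml` function are all hypothetical and only illustrate the target format; the sketch does not read a PDF at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from structural roles to XHTML tags (an assumption,
# not something taken from the PDF spec or any PDF library).
ROLE_TO_TAG = {"title": "h1", "heading": "h2", "paragraph": "p"}


def to_xhtml(elements):
    """Build a bare XHTML document from (role, content) pairs.

    content is a string, or a list of strings when role is "list".
    """
    html = ET.Element("html", {"xmlns": "http://www.w3.org/1999/xhtml"})
    head = ET.SubElement(html, "head")
    title = ET.SubElement(head, "title")
    body = ET.SubElement(html, "body")
    for role, content in elements:
        if role == "list":
            ul = ET.SubElement(body, "ul")
            for item in content:
                ET.SubElement(ul, "li").text = item
        else:
            ET.SubElement(body, ROLE_TO_TAG[role]).text = content
            # Reuse the first title as the document <title>.
            if role == "title" and title.text is None:
                title.text = content
    return ET.tostring(html, encoding="unicode")


# Made-up input standing in for whatever a PDF library would return.
sample = [
    ("title", "Report"),
    ("heading", "Introduction"),
    ("paragraph", "Some text."),
    ("list", ["first item", "second item"]),
]
print(to_xhtml(sample))
```

The open question is the step before this one: whether a PDF carries enough structure to produce the `sample` list reliably.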

© Stack Overflow or respective owner
