Is there a way to extract semantic information from a PDF? (Converting PDF to pure XHTML)

Posted by Eonil on Stack Overflow
Published on 2010-02-05T09:46:14Z Indexed on 2010/03/22 13:41 UTC

Hi. I'm looking for a way to extract semantic structural information (such as titles, headings, paragraphs, and lists) from a PDF, because I want to get purely structural data out of it.

Ultimately, I want to create pure XHTML from the PDF, containing only structural information, with no design or layout.

I know that a PDF can be created without any structural information. I'm not considering those PDFs, only well-structured ones.

I'm new to PDF, so I don't know whether it offers a regular semantic structure. If it does, I assume PDF libraries will expose it. So I'd like to know whether the PDF spec defines this information and, if so, the best way to extract it.
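
To make the goal above concrete, here is a minimal sketch of the last step only: turning an already-extracted structure tree into layout-free XHTML. It assumes the PDF is tagged and that its structure elements (standard types such as Document, H1, P, L, LI) have somehow been read into nested `(type, children)` tuples. That tuple form, `TAG_MAP`, `to_xhtml`, and `convert` are all hypothetical names for illustration, not part of any PDF library.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from PDF standard structure types to XHTML element names.
# Anything not listed falls back to a generic <div>.
TAG_MAP = {
    "Document": "body",
    "Sect": "div",
    "H1": "h1",
    "H2": "h2",
    "H3": "h3",
    "P": "p",
    "L": "ul",
    "LI": "li",
}


def to_xhtml(node, parent):
    """Append node (a (type, children) tuple or a plain string) to parent."""
    if isinstance(node, str):
        # Text goes into the parent's text, or after its last child element.
        if len(parent):
            last = parent[-1]
            last.tail = (last.tail or "") + node
        else:
            parent.text = (parent.text or "") + node
        return
    struct_type, children = node
    elem = ET.SubElement(parent, TAG_MAP.get(struct_type, "div"))
    for child in children:
        to_xhtml(child, elem)


def convert(tree):
    """Return an XHTML string for a hypothetical extracted structure tree."""
    html = ET.Element("html", xmlns="http://www.w3.org/1999/xhtml")
    to_xhtml(tree, html)
    return ET.tostring(html, encoding="unicode")


# Hypothetical structure tree, as it might look after extraction.
example = ("Document", [
    ("H1", ["Title"]),
    ("P", ["First paragraph."]),
    ("L", [("LI", ["one"]), ("LI", ["two"])]),
])

print(convert(example))
```

The hard part, reading the structure tree and its text out of the PDF itself, is not shown here and depends on the library used.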

© Stack Overflow or respective owner
