Is there a way to extract semantic information from a PDF? (Converting PDF to pure XHTML)
Posted by Eonil on Stack Overflow
Published on 2010-02-05T09:46:14Z
Indexed on 2010/03/22 13:41 UTC
Hi. I'm looking for a way to extract semantic, structural information (such as titles, headings, paragraphs, or lists) from a PDF, because I want to get purely structural data out of it.
Ultimately, I want to create pure XHTML from the PDF, containing only structural information, with no design or layout.
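To illustrate the kind of output I mean, here is a minimal sketch in Python. It assumes the structural items have already been pulled out of the PDF somehow. The `items` list, its `(kind, content)` format, and the `to_xhtml` function are hypothetical names made up for this example, not part of any PDF library, and the sample text is invented.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical structural items, as they might come out of a well-structured PDF.
# The (kind, content) format is an assumption for illustration only.
items = [
    ("title", "Annual Report"),
    ("heading", "Overview"),
    ("paragraph", "This year went well."),
    ("list", ["First point", "Second point"]),
]

def to_xhtml(items):
    """Build an XHTML string that carries structure only, no styling."""
    html = ET.Element("html", {"xmlns": XHTML_NS})
    head = ET.SubElement(html, "head")
    body = ET.SubElement(html, "body")
    for kind, content in items:
        if kind == "title":
            ET.SubElement(head, "title").text = content
        elif kind == "heading":
            ET.SubElement(body, "h1").text = content
        elif kind == "paragraph":
            ET.SubElement(body, "p").text = content
        elif kind == "list":
            ul = ET.SubElement(body, "ul")
            for entry in content:
                ET.SubElement(ul, "li").text = entry
        else:
            raise ValueError(f"unknown structural kind: {kind}")
    return ET.tostring(html, encoding="unicode")

xhtml = to_xhtml(items)
print(xhtml)
```

The part I don't know how to do is the step before this: getting items like these out of the PDF in the first place.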
I know that a PDF can be created without any structural information. I'm not considering those PDFs, only regular, well-structured ones.
I'm new to PDF, so I don't know whether it offers a regular semantic structure. If it does, its libraries should expose it. So I want to know whether the PDF spec includes this information and, if so, the best way to get it.
© Stack Overflow or respective owner