Is there a way to extract semantic information from a PDF? (Converting PDF to pure XHTML)
- by Eonil
Hi. I'm looking for a way to extract semantic structural information (such as titles, headings, paragraphs, or lists) from a PDF, because I want to get purely structural data out of it.
Ultimately, I want to create pure XHTML from the PDF, containing only structural information, with no design or layout.
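To illustrate the kind of output I have in mind, here is a rough sketch in Python. It assumes the structure has already been extracted somehow; the `Node` type, the role names, and the `TAGS` mapping are my own made-up placeholders, not part of any PDF library or of the PDF spec.

```python
from dataclasses import dataclass, field
from typing import List
from xml.etree import ElementTree as ET


@dataclass
class Node:
    """Hypothetical structural node; not from any real PDF library."""
    kind: str                       # e.g. "title", "heading", "paragraph", "list", "item"
    text: str = ""
    children: List["Node"] = field(default_factory=list)


# Assumed mapping from structural roles to XHTML element names.
TAGS = {"title": "h1", "heading": "h2", "paragraph": "p", "list": "ul", "item": "li"}


def to_xhtml(nodes):
    """Serialize a list of structural nodes as a bare XHTML document string."""
    html = ET.Element("html", xmlns="http://www.w3.org/1999/xhtml")
    body = ET.SubElement(html, "body")

    def emit(parent, node):
        # Unknown roles fall back to a neutral <div>.
        el = ET.SubElement(parent, TAGS.get(node.kind, "div"))
        if node.text:
            el.text = node.text
        for child in node.children:
            emit(el, child)

    for node in nodes:
        emit(body, node)
    return ET.tostring(html, encoding="unicode")


if __name__ == "__main__":
    doc = [
        Node("title", "My document"),
        Node("paragraph", "Some text."),
        Node("list", children=[Node("item", "first"), Node("item", "second")]),
    ]
    print(to_xhtml(doc))
```

The part I don't know how to do is filling in those nodes from a real PDF.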
I know that a PDF can be created without any structural information. I'm not considering those PDFs, only well-structured ones.
I'm new to PDF, so I don't know whether it offers a standard semantic structure. If it does, I assume PDF libraries will expose it. So I want to know whether the PDF spec defines this information and, if so, the best way to get at it.