Reliable and fast way to convert a zillion ODT files to PDF?

Posted by Marco Mariani on Stack Overflow
Published on 2010-05-25T10:24:40Z

I need to pre-produce a million or two PDF files from a simple template (a few pages and tables) with embedded fonts. Usually, I would stay low level in a case like this, and compose everything with a library like ReportLab, but I joined late in the project.

Currently, I have a template.odt and fill markers in its content.xml with data from a DB. Creating the ODT files works smoothly, and they always look right.
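As a minimal sketch of that marker-filling step, using only the standard library: the `@@KEY@@` marker syntax and the `fill_template` name are assumptions made for illustration, not what the project actually uses. It also assumes each marker sits in a single text node of content.xml, which may not hold if the editor split it across spans.

```python
import zipfile
from xml.sax.saxutils import escape


def fill_template(template_path, output_path, values):
    """Copy an ODT file, replacing @@KEY@@ markers in content.xml.

    The @@KEY@@ syntax is hypothetical; adapt it to the real markers.
    """
    with zipfile.ZipFile(template_path) as src, \
            zipfile.ZipFile(output_path, "w") as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "content.xml":
                text = data.decode("utf-8")
                for key, value in values.items():
                    # Escape &, < and > so the result stays well-formed XML.
                    text = text.replace("@@%s@@" % key, escape(str(value)))
                data = text.encode("utf-8")
            # Reusing the original ZipInfo keeps the entry order and the
            # per-entry compression (the "mimetype" entry stays first and
            # stored uncompressed, as ODF packages expect).
            dst.writestr(info, data)
```

A call might look like `fill_template("template.odt", "out/0001.odt", {"NAME": "Mario Rossi"})`, with one call per DB row.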

For the ODT to PDF conversion, I'm using OpenOffice in server mode (with PyODConverter over a named pipe), but it's not very reliable: in a batch of documents, there is eventually a point after which every processed file is converted into garbage (wrong fonts and letters sprawled all over the page).

The problem is not predictably reproducible (it does not depend on the data) and happens in OOo 2.3 and 3.2, on Ubuntu, XP, Server 2003 and Windows 7. My Heisenbug detector is ticking.

I tried reducing the batch size and restarting OOo after each batch; still, a small percentage of the documents are messed up.
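The batching with restarts could be sketched roughly as follows. `convert` and `restart_office` are hypothetical callables standing in for the PyODConverter call and for whatever kills and relaunches the OOo server; neither is shown here, and the function names are made up for illustration.

```python
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def convert_in_batches(paths, batch_size, convert, restart_office):
    """Convert every path, restarting the office process before each batch.

    `convert` and `restart_office` are hypothetical callables supplied by
    the caller. Returns the number of batches processed.
    """
    count = 0
    for chunk in batches(paths, batch_size):
        restart_office()  # start each batch from a fresh process
        for path in chunk:
            convert(path)
        count += 1
    return count
```

Setting `batch_size` to 1 gives the restart-per-document variant, at the cost of one restart for every file.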

Of course I'll write about this on the OOo mailing lists, but in the meantime I have a delivery to make and have already lost too much time.

Where do I go?

  1. Completely avoid the ODT format and go for another template system.

    • Suggestions? Anything that takes a few seconds to run is way too slow. OOo takes around a second per document, and that already adds up to 15 days of processing time. I had to write a program to distribute the jobs over several clients.
  2. Keep the format but go for another tool/program for the conversion.

    • Which one? There are many apps in the shareware or commercial repositories for Windows, but trying each one is a daunting task. Some are too slow, some cannot be run in batch mode without buying them first, some cannot work from the command line, etc.
    • Open source tools tend not to reinvent the wheel and often depend on OpenOffice.
  3. Converting to an intermediate .DOC format could help to avoid the OOo bug, but it would double the processing time and complicate a task that is already too hairy.

  4. Try to produce the PDFs twice and compare them, discarding the whole batch if there's something wrong.

    • Although the documents look equal, I know of no way to compare the binary content.
  5. Restart OOo after processing each document.

    • It would take a lot more time to produce them.
    • It would lower the percentage of bad files, and make them very hard to identify.
  6. Go for ReportLab and recreate the pages programmatically. This is the approach I'm going to try in a few minutes.

  7. Learn to properly format bulleted lists
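
For option 4, one possible way to compare two conversions of the same document is to blank out the fields that are expected to change between runs and hash what is left. This is only a sketch under assumptions: the list of volatile fields below (dates, trailer `/ID`, `/DocChecksum`) is a guess rather than something taken from OOo documentation, and it would not catch differences hidden inside compressed metadata streams.

```python
import hashlib
import re

# Fields assumed to differ between two runs on identical input.
# This list is a guess; extend it if other fields turn out to vary.
VOLATILE_PATTERNS = [
    re.compile(rb"/(CreationDate|ModDate)\s*\([^)]*\)"),
    re.compile(rb"/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]"),
    re.compile(rb"/DocChecksum\s*/[0-9A-Fa-f]+"),
]


def normalized_digest(pdf_bytes):
    """Hash the PDF bytes after removing the assumed-volatile fields."""
    for pattern in VOLATILE_PATTERNS:
        pdf_bytes = pattern.sub(b"", pdf_bytes)
    return hashlib.sha256(pdf_bytes).hexdigest()


def same_rendering(first, second):
    """Return True if two PDFs match once volatile fields are ignored."""
    return normalized_digest(first) == normalized_digest(second)
```

Both arguments are the raw file contents, e.g. `open(path, "rb").read()`. If two good runs still differ after this masking, the pattern list is incomplete and the check would reject everything, so it needs to be validated on known-good output first.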

Thanks a lot.
