pypdf - Developer IT

Wrong values reported by pyPDF for various box regions

- by romor

Using pyPdf, for most files the box dimensions I get match what Acrobat reports. For some files, however, pyPdf and Acrobat report different values:

pyPdf:
    artBox: 595.3 x 841.9
    bleedBox: 595.3 x 841.9
    cropBox: 595.3 x 841.9
    trimBox: 517.3 x 754

Acrobat:
    artBox: 439.35 x 666.13 pt
    bleedBox: 439.35 x 666.13 pt
    cropBox: 439.35 x 666.13 pt
    trimBox: 439.35 x 666.13 pt

I thought it was a units issue, but the ratios between the widths and heights don't match either, not to mention the trimBox mismatch. The correct results are, of course, those reported by Acrobat. Does anyone know why this happens, and is there a way to get the correct dimensions using pyPdf? Thanks.

A couple of minutes later...

After reading the question "Are PDF box coordinates relative or absolute?", I realized I hadn't considered that a box's lower-left corner can be something other than 0 (zero). It turned out that the trimBox starts at 77.95 x 87.87, so subtracting these values from the reported trimBox values gives the correct result. The box origins are:

    artBox: 0 x 0
    bleedBox: 0 x 0
    cropBox: 0 x 0
    trimBox: 77.95 x 87.87

The other boxes seem to have misleading values, or I am misinterpreting them.

Snippet:

    from pyPdf import PdfFileReader

    pdfread = PdfFileReader(file('my.pdf', 'rb'))
    page = 1
    width = pdfread.getPage(page).trimBox[2] - pdfread.getPage(page).trimBox[0]
    height = pdfread.getPage(page).trimBox[3] - pdfread.getPage(page).trimBox[1]
    print width, height
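
The arithmetic in this question can be checked without any PDF library. The following is a minimal sketch, assuming a box is given as four numbers (first corner x, first corner y, opposite corner x, opposite corner y), which is how the trimBox indices 0 to 3 are used in the snippet. The helper name box_size is hypothetical and not part of pyPdf.

```python
def box_size(box):
    """Return (width, height) of a box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = (float(v) for v in box)
    return abs(x1 - x0), abs(y1 - y0)


# trimBox values from the question: origin 77.95 x 87.87,
# opposite corner 517.3 x 754.
trim_box = (77.95, 87.87, 517.3, 754)
width, height = box_size(trim_box)
print(round(width, 2), round(height, 2))  # 439.35 666.13, as Acrobat reports
```

Reading the corner coordinates as if they were dimensions only works when the box starts at 0 x 0, which is why the other three boxes looked plausible while the trimBox did not.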

pyPDF - Retrieve page numbers from document

- by SquidneyPoitier

At the moment I'm looking into doing some PDF merging with pyPdf, but sometimes the inputs are not in the right order, so I'm looking into scraping each page for its page number to determine where it should go (e.g. if someone split a book into 20 10-page PDFs and I want to put them back together). I have two questions:

1.) I know that the page number is sometimes stored somewhere in the document data, as I've seen PDFs that render in Adobe as something like [1243] (10 of 150), but I've read documents of this sort into pyPDF and I can't find any information indicating the page number. Where is this stored?

2.) If avenue #1 isn't available, I think I could iterate through the objects on a given page to try to find a page number; it would likely be its own object with a single number in it. However, I can't find any clear way to determine the contents of objects. If I run:

    pdf.getPage(0).getContents()

this usually returns either:

    {'/Filter': '/FlateDecode'}

or a list of IndirectObject(num, num) objects. I don't really know what to do with either of these, and there's no real documentation on it as far as I can tell. Is anyone familiar with this kind of thing who could point me in the right direction?
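
The ordering idea in this question can be sketched independently of how the text is obtained. The example below assumes the text of each file's first page has already been extracted into a string (for instance with pyPdf's extractText(), whose output quality varies) and that the page number appears on a line of its own. The helper names guess_page_number and order_chunks, and the sample data, are hypothetical.

```python
import re


def guess_page_number(page_text):
    """Return the last line made up only of digits, as an int, or None."""
    candidates = [
        line.strip()
        for line in page_text.splitlines()
        if re.fullmatch(r"\d{1,4}", line.strip())
    ]
    return int(candidates[-1]) if candidates else None


def order_chunks(first_page_texts):
    """Sort file names by the page number guessed from each first page."""
    numbered = {
        name: guess_page_number(text)
        for name, text in first_page_texts.items()
    }
    missing = sorted(name for name, num in numbered.items() if num is None)
    if missing:
        raise ValueError("no page number found in: " + ", ".join(missing))
    return sorted(numbered, key=numbered.get)


# Made-up first-page texts for three split files.
samples = {
    "part_b.pdf": "Chapter 2\nSome text\n11",
    "part_a.pdf": "Chapter 1\nSome text\n1",
    "part_c.pdf": "Chapter 3\n21\nSome text",
}
print(order_chunks(samples))  # ['part_a.pdf', 'part_b.pdf', 'part_c.pdf']
```

This heuristic will misfire on pages that contain other standalone numbers, so it is a starting point rather than a reliable detector.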

Change metadata of pdf file with pypdf.

- by Baudouin Tamines

Hello! I'd like to create or modify the title of a PDF document using pypdf. It seems that the title is read-only. Is there a way to access this metadata for both reading and writing? If so, a piece of code would be appreciated. Thanks.

Dynamically generated PDF files working in most readers except Adobe Reader

- by Shane

I'm trying to dynamically generate PDFs from user input: I print the user input and overlay it on an existing PDF that I did not create. It works, with one major exception: Adobe Reader doesn't read the result properly, on Windows or on Linux. QuickOffice on my phone doesn't read it either. So I thought I'd trace the steps I take to create the files:

1 - Original PDF of background: PDF 1.2, made with Adobe Distiller with LZW encoding. I didn't make this.

2 - PDF of background: PDF 1.4, made with Ghostscript. I ran pdf2ps and then ps2pdf on the above to strip the LZW so that the reportlab and pyPDF libraries would recognize it. Note that this file looks "fuzzy," like a bad scan, in Adobe Reader, but looks fine in other readers.

3 - PDF of user-input text formatted to be combined with the background: PDF 1.3, made with Reportlab from user input. Opens properly and looks good in every reader I've tried.

4 - Finished PDF: PDF 1.3, made with PyPDF's mergePage() function from 2 and 3.

The finished PDF does not open in:

    Adobe Reader for Windows
    Adobe Reader for Linux
    QuickOffice for Android

It opens perfectly in:

    Google Docs' PDF viewer on the web
    evince for Linux
    ghostscript viewer for Linux
    Foxit Reader for Windows
    Preview for Mac

Are there known issues that I should know about? I don't know exactly what "flate" is, but from the internet I gather that it's some sort of open source alternative to LZW for PDF compression. Could that be causing my problem? If so, are there any libraries I could use to fix the cause in my code?
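
On the "flate" part of the question: FlateDecode is the PDF filter name for zlib/deflate compression, the same algorithm exposed by Python's standard zlib module. The sketch below only illustrates what the filter does to a stream; the content bytes are made up, and nothing here diagnoses the Adobe Reader failure.

```python
import zlib

# Made-up page content: PDF text operators repeated to give zlib something
# to compress.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n" * 20

compressed = zlib.compress(content)    # what a Flate-encoded stream stores
restored = zlib.decompress(compressed)  # what a reader does before rendering

print(len(content), len(compressed), restored == content)
```

Because the round trip is lossless, the compression filter by itself would not be expected to change how a page looks.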

How to read a PDF file line by line using pyPdf?

- by Rami Jarrar

Hi, I'm working with PDF files, and I have some code that reads from a PDF file. Is there a way to read the file line by line (not page by page) using pyPdf with Python 2.6 on Windows?

Python module for converting PDF to text

- by cnu

Is there any Python module to convert PDF files into text? I tried one piece of code found on ActiveState that uses pypdf, but the generated text had no spaces between words and was of no use.

How to remove package from apt-get autoremove "queue"

- by Darth

I just installed Calibre for ebook management via apt-get on Ubuntu 10.04. However, I found out that it's one major version behind the current release, so I decided to reinstall it directly from sources. When I uninstalled the packaged version, apt added a bunch of dependencies to the autoremove queue, and since I installed the newer version of Calibre from sources, apt has no knowledge of it depending on those packages. Now I basically have all the libraries that I want, but they are still in the autoremove queue:

    The following packages were automatically installed and are no longer required:
      libqt4-script libqt4-designer libqt4-dbus python-lxml python-cherrypy3
      python-encutils libqt4-xmlpatterns libqt4-help python-qt4 python-clientform
      python-sip python-django python-mechanize libqt4-svg python-django-tagging
      libphonon4 libqt4-xml libqt4-assistant libqt4-webkit libqt4-scripttools
      python-beautifulsoup python-pypdf python-dateutil python-cssutils
    Use 'apt-get autoremove' to remove them.

How do I tell apt that I want to keep these packages installed, without reinstalling them manually?

How can I programmatically renumber pages in a PDF?

- by Andrew

As a graduate student, I come across PDFs of articles and book chapters on a daily basis. Sometimes these PDFs are paginated correctly internally (that is, if an article starts on page 67, the PDF starts on page 67 as well, not on page 1). When they aren't, I have to open the file in Acrobat and renumber the pages in the "Page Thumbnails" panel. I would love to be able to automate this whole process with a script (bash, Python, AppleScript, whatever) that lets me pass the first actual page number, something like:

    fixpagination example.pdf 67

However, I cannot find any terminal-based program that can re-paginate PDFs. Neither pdftk nor PyPDF seems to be able to deal with pagination. Are there any scriptable programs that can internally re-paginate PDF files?

Dynamically generate PDF and email it using django

- by Shane

I have a Django app that dynamically generates a PDF (using reportlab + pypdf) from user input on an HTML form and returns an HTTP response with the application/pdf MIME type. I want to have the option of either doing the above or emailing the generated PDF, but I cannot figure out how to use the EmailMessage class's attach(filename=None, content=None, mimetype=None) method. The documentation doesn't say much about what kind of object content is supposed to be. I've tried a file object and the above application/pdf HTTP response. I currently have a workaround where my view saves the PDF to disk, and then I attach the resulting file to an outgoing email using the attach_file() method. This seems wrong to me, and I'm pretty sure there is a better way.
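
The idea of attaching a PDF straight from memory, without saving it to disk first, can be sketched with Python's standard-library email package. This is a swapped-in illustration, not Django's API: build_pdf_bytes is a hypothetical stand-in for the reportlab + pypdf step, the addresses are placeholders, and whether Django's attach() accepts the raw bytes in the same way should be checked against the Django documentation.

```python
import io
from email.message import EmailMessage


def build_pdf_bytes():
    """Hypothetical stand-in for the PDF generation step; returns raw bytes."""
    buffer = io.BytesIO()
    buffer.write(b"%PDF-1.3\n% placeholder, not a valid document\n")
    return buffer.getvalue()


msg = EmailMessage()
msg["Subject"] = "Your generated PDF"
msg["From"] = "sender@example.com"
msg["To"] = "recipient@example.com"
msg.set_content("The generated PDF is attached.")

# The attachment is built from bytes held in memory; no file on disk is used.
msg.add_attachment(
    build_pdf_bytes(),
    maintype="application",
    subtype="pdf",
    filename="generated.pdf",
)

attachment = next(msg.iter_attachments())
print(attachment.get_filename(), attachment.get_content_type())
```

The point of the sketch is that the attachment body is the document's data itself, taken from the in-memory buffer with getvalue(), rather than a file object or an HTTP response.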

How do I overwrite a file currently being read by Python

- by Brian

Hi guys, I'm not sure of the best way to word this, but what I want to do is read a PDF file, make various modifications, and save the modified PDF over the original file. As of now, I am able to save the modified PDF to a separate file, but I am looking to replace the original, not create a new file. Here is my current code:

    from pyPdf import PdfFileWriter, PdfFileReader

    output = PdfFileWriter()
    input = PdfFileReader(file('input.pdf', 'rb'))
    blank = PdfFileReader(file('C:\\BLANK.pdf', 'rb'))

    # Copy the input pdf to the output.
    for page in range(int(input.getNumPages())):
        output.addPage(input.getPage(page))

    # Add a blank page if needed.
    if (input.getNumPages() % 2 != 0):
        output.addPage(blank.getPage(0))

    # Write the output to pdf.
    outputStream = file('input.pdf', 'wb')
    output.write(outputStream)
    outputStream.close()

If I change the outputStream to a different file name, it works fine; I just can't save over the input file because it is still being used. I have tried to .close() the stream, but that gave me errors as well. I have a feeling this has a fairly simple solution; I just haven't had any luck finding it. Thanks!
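
One common way around this kind of problem is to finish reading the original, write the result to a temporary file in the same directory, and only then swap it over the original. The sketch below shows that pattern on plain bytes; rewrite_in_place and the transform callback are hypothetical stand-ins for the pyPdf read and write steps, and the likely cause it works around (opening the input with 'wb' truncates a file that is still being read) is an assumption about the code above.

```python
import os
import tempfile


def rewrite_in_place(path, transform):
    """Read path fully, write transform(data) to a temp file, then swap it in."""
    with open(path, "rb") as src:
        data = src.read()  # the input handle is closed when this block ends

    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as dst:
            dst.write(transform(data))
        os.replace(tmp_path, path)  # the original is replaced only at the end
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Demonstration on a throwaway file; the lambda stands in for the PDF edits.
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "input.bin")
    with open(target, "wb") as f:
        f.write(b"original")
    rewrite_in_place(target, lambda data: data + b" + blank page")
    with open(target, "rb") as f:
        result = f.read()

print(result)  # b'original + blank page'
```

If writing fails partway, the original file is left untouched, which is the main reason for going through a temporary file instead of opening the input for writing directly.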
