Search Results

Search found 9 results on 1 page for 'pdftotext'.

  • pdftotext can't find any of the files to convert when called within a Python script

    - by hatorade
    I have a Python script which keeps crashing on:

    subprocess.call(["pdftotext", pdf_filename])

    the error being:

    OSError: [Errno 2] No such file or directory

    The absolute path to the filename (which I am storing in a log file as I debug) is fine; on the command line, if I type pdftotext <pdf_filename_goes_here>, it works for any of the allegedly bad file names. But when called using subprocess in Python I keep getting that error. What is going on?
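
    A common cause here is that Errno 2 refers to the executable, not the PDF: the interpreter may run with a different PATH than the login shell, or the filename pulled from the log may carry an invisible trailing newline or quote characters. A minimal diagnostic sketch (the path is hypothetical):

      import shutil
      import subprocess

      pdf_filename = "/absolute/path/to/file.pdf"  # hypothetical

      # Errno 2 from subprocess.call([...]) can mean the *executable* was
      # not found; check what this interpreter's PATH actually resolves.
      print("pdftotext resolved to:", shutil.which("pdftotext"))

      # repr() exposes trailing newlines or quotes that a log file hides.
      print("filename as Python sees it:", repr(pdf_filename))

      subprocess.call(["pdftotext", pdf_filename])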

    Read the article

  • Converting PDF portfolios to plain text (pdftotext?)

    - by Andrea
    I am trying to convert a large number of PDFs (~15000) to plain text using pdftotext. This is working pretty well except for a few of the PDFs (~600) which, I guess, are "PDF portfolios." When I run these PDFs through pdftotext, it just outputs: For the best experience, open this PDF portfolio in Acrobat 9 or Adobe Reader 9, or later. Get Adobe Reader Now! If I do open these PDFs in Adobe Reader, they look like two or more PDFs inside a single file. Has anyone encountered this issue before? Is there any tool I can use to convert these PDFs automatically? (Either directly to text or at least to regular PDFs that pdftotext can then understand.)
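
    PDF portfolios wrap their member documents as embedded files inside a cover PDF, which is why pdftotext only sees the cover message. If poppler-utils is available, its pdfdetach tool can save those embedded files, after which pdftotext can handle them; a sketch of that unwrap-then-convert idea (file and directory names are hypothetical):

      import glob
      import os
      import subprocess

      # pdfdetach ships with poppler-utils; -saveall writes out every
      # embedded file, and -o names the target directory.
      os.makedirs("unpacked", exist_ok=True)
      subprocess.run(["pdfdetach", "-saveall", "-o", "unpacked",
                      "portfolio.pdf"], check=True)

      # Convert each extracted member with pdftotext as usual.
      for member in glob.glob("unpacked/*.pdf"):
          subprocess.run(["pdftotext", member], check=True)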

    Read the article

  • cleaning up pdftotext font issues

    - by mankoff
    I'm using pdftotext to make an ASCII version of a PDF document (made with LaTeX), because collaborators prefer a simple document in MS Word. The plain text version I see looks good, but upon closer inspection the f character seems to be frequently mis-converted depending on what characters follow. For example, fi and fl often seem to become one special character, which I will try to paste here: ﬁ and ﬂ. What is the best way to clean up the output of pdftotext? I am thinking sed might be the right tool, but am not sure how to detect these special characters.
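
    Those two characters are almost certainly the typographic ligatures ﬁ (U+FB01) and ﬂ (U+FB02), which are awkward to type into a sed expression. Unicode compatibility normalization decomposes them back into plain letters, so a few lines of Python can do the cleanup (the file names are hypothetical):

      import unicodedata

      with open("paper.txt", encoding="utf-8") as f:  # hypothetical name
          text = f.read()

      # NFKC normalization decomposes compatibility characters such as
      # U+FB01 (fi) and U+FB02 (fl) into their plain ASCII letters.
      clean = unicodedata.normalize("NFKC", text)

      with open("paper_clean.txt", "w", encoding="utf-8") as f:
          f.write(clean)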

    Read the article

  • pdftotext not outputting Hebrew characters

    - by Ofri Raviv
    I'm using Xpdf's pdftotext to get the text out of some Hebrew PDF files on Ubuntu. On my local machine this worked fine. I then tried to do it on another machine, and the Hebrew characters don't show up in the text file. I verified that I have the language package (see below why I think so). Where else can I look for the problem?

    >> tail -2 /etc/xpdf/xpdfrc
    include /etc/xpdf/includes
    >> cat /etc/xpdf/includes
    # This file was automatically generated by /usr/sbin/update-xpdfrc.
    # Instead, add or remove files in /etc/xpdf/ then run
    # /usr/sbin/update-xpdfrc to regenerate this file.
    include /etc/xpdf/xpdfrc-latin2
    include /etc/xpdf/xpdfrc-thai
    include /etc/xpdf/xpdfrc-greek
    include /etc/xpdf/xpdfrc-turkish
    include /etc/xpdf/xpdfrc-arabic
    include /etc/xpdf/xpdfrc-hebrew
    include /etc/xpdf/xpdfrc-cyrillic
    >> cat /etc/xpdf/xpdfrc-hebrew
    #----- begin Hebrew support package (2003-feb-16)
    unicodeMap ISO-8859-8 /usr/share/xpdf/hebrew/ISO-8859-8.unicodeMap
    unicodeMap Windows-1255 /usr/share/xpdf/hebrew/Windows-1255.unicodeMap
    #----- end Hebrew support package
    >> ls /usr/share/xpdf/hebrew/
    ISO-8859-8.unicodeMap Windows-1255.unicodeMap
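
    One variable worth isolating is the output encoding: the xpdfrc above maps Hebrew text to ISO-8859-8 or Windows-1255, so the result depends on which -enc the run uses and on how the viewer interprets the bytes. A quick experiment (a sketch; -enc is a documented pdftotext option, the input name is hypothetical):

      import subprocess

      # Try each Hebrew-capable encoding; if any output file contains the
      # Hebrew text, the xpdfrc setup works and the problem is the default
      # encoding or the program used to view the result.
      for enc in ("UTF-8", "ISO-8859-8", "Windows-1255"):
          subprocess.run(["pdftotext", "-enc", enc,
                          "input.pdf", f"out-{enc}.txt"])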

    Read the article

  • PHP explode with a Unicode character as separator

    - by Young Roger
    Xpdf's pdftotext converts PDF to text and writes it to standard output. If needed, it inserts page breaks between the pages, as specified in TextOutputDev.cc:

    eopLen = uMap->mapUnicode(0x0c, eop, sizeof(eop));

    This Unicode symbol is encoding-independent; -enc ASCII7 wouldn't change it. I want to use PHP to convert and split a PDF file into several TXT pages for database storage. The following loop does work, but takes twice as long as converting the whole book in one pass:

    for ($i = 1; $i <= $pages[0]; $i++)
        $page[$i] = shell_exec('/usr/bin/pdftotext sample.pdf -f '.$i.' -l '.$i.' -');

    How am I supposed to explode(0x0c, $wholePDF) with a Unicode character as separator? Currently, $page[$i] doesn't seem to retrieve those page-break characters from the shell_exec(). I tried several encodings (UTF-8 especially), but it didn't work out so far.
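
    The separator pdftotext emits is the form-feed control character U+000C, so the whole document can be converted once and split on "\x0c" - in PHP, explode("\x0c", $wholePDF). The same idea sketched in Python:

      import subprocess

      # One pdftotext run for the whole document; "-" writes to stdout.
      out = subprocess.run(["/usr/bin/pdftotext", "sample.pdf", "-"],
                           capture_output=True)

      # pdftotext separates pages with form feeds (U+000C), so splitting
      # on that character yields one entry per page.
      pages = out.stdout.decode("utf-8").split("\x0c")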

    Read the article

  • How to extract text from a PDF page using Zend_Pdf

    - by Brant
    Can anyone help with extracting text from a page in a PDF?

    <?php
    $pdf = Zend_Pdf::load('example.pdf');
    $page = $pdf->pages[0];

    I would assume a page method would exist, but I could not find anything that lets me extract the contents. For example:

    $page->getContents();
    $page->toString();
    $page->extractText();

    Help! This is driving me crazy!
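
    As far as I know, Zend_Pdf (in Zend Framework 1) exposes page geometry and drawing operations but no text-extraction method, so a common workaround is to shell out to pdftotext, whose -f and -l options select a page range. A sketch of that fallback (shown in Python for consistency with the other examples; PHP's shell_exec works the same way):

      import subprocess

      def page_text(pdf_path: str, page: int) -> str:
          """Extract the text of a single 1-based page via pdftotext."""
          out = subprocess.run(
              ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, "-"],
              capture_output=True, check=True)
          return out.stdout.decode("utf-8")

      print(page_text("example.pdf", 1))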

    Read the article

  • "No such file or directory" when the file is there

    - by Arlaud Agbe Pierre
    I'm trying to run Xpdf on a Linux (probably Red Hat) OVH shared server. I've managed to get FTP and SSH access and put the 64-bit binaries into a folder. The problem is: even though the files are there with the right permissions, trying to run the binary gives a "file not found" error (I'm thinking of a missing link...). Long story short:

    jurisedi@ssh1:~/xpdf$ file pdftotext
    pdftotext: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), stripped
    jurisedi@ssh1:~/xpdf$ ./pdftotext
    -ovh: ./pdftotext: No such file or directory

    Any ideas?
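
    When the shell reports "No such file or directory" for a binary that plainly exists, the file the kernel cannot find is usually the ELF interpreter (the dynamic loader named inside the executable), which happens when a 64-bit binary lands on a host without the matching loader. A quick check, assuming readelf is available on the server:

      import os
      import subprocess

      # The "program interpreter" line names the loader the kernel must
      # find (e.g. /lib64/ld-linux-x86-64.so.2); if that path is missing
      # on the host, exec fails with ENOENT even though ./pdftotext exists.
      out = subprocess.run(["readelf", "-l", "pdftotext"],
                           capture_output=True, text=True)
      for line in out.stdout.splitlines():
          if "interpreter" in line:
              loader = line.split(":")[1].strip(" ]")
              print(loader, "exists:", os.path.exists(loader))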

    Read the article

  • Find keyword values from PDF [closed]

    - by JukkaA
    I have a lot of PDF reports I need to index. They're mostly "text-based" PDFs, not images. I know they all have an account number in a certain format, 123456AAAAA, and some other keyword info like addresses, customer names, etc. needed for indexing these files. Basically, if the file is ab.pdf, I need to create ab.txt that contains:

    ACC=123456AAAA
    Customer=John Doe
    Date=20120808

    What would be the best software/solution to generate this indexing information? I know there's pdftotext, but piping it through different grep/awk commands is a hack... It would be nice to specify an area in the PDF to search for the account number, and to specify the format it is in.
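
    Since the PDFs are text-based, pdftotext plus a few anchored regular expressions can generate the index files directly. The sketch below infers the patterns from the examples in the question, so they are assumptions that would need tuning against the real reports:

      import re
      import subprocess
      from pathlib import Path

      # Hypothetical patterns inferred from the question; adjust to match
      # the actual reports.
      PATTERNS = {
          "ACC": re.compile(r"\b\d{6}[A-Z]{4,5}\b"),
          "Date": re.compile(r"\b20\d{6}\b"),
      }

      def index_pdf(pdf: Path) -> None:
          text = subprocess.run(["pdftotext", str(pdf), "-"],
                                capture_output=True).stdout.decode("utf-8")
          lines = []
          for key, rx in PATTERNS.items():
              m = rx.search(text)
              if m:
                  lines.append(f"{key}={m.group()}")
          pdf.with_suffix(".txt").write_text("\n".join(lines))

      for pdf in Path(".").glob("*.pdf"):
          index_pdf(pdf)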

    Read the article

  • SharePoint OCR image files indexing

    Introduction

    This article describes how to set up indexing of image files (including TIFF, PDF, JPEG, BMP...) using OCR technology. The indexing described below utilizes Microsoft IFilter technology and as such is not specific to SharePoint; it can be used with any product that uses Microsoft indexing: Microsoft Search, Desktop Search, SQL Server search, and, through plug-ins, Google Desktop Search. I, however, use it with Microsoft Windows SharePoint Services 2003. For those other products, the registration may need to be slightly different.

    Background

    One of the projects I was working on required storage of old documents scanned into PDF files. A separate team of people was then responsible for providing tags for a search engine so those image documents could be found. The whole process was clumsy, labor-intensive, and error-prone. That was what started me on my exploration path.

    OCR

    The first search I fired was for Open Source OCR products. Pretty quickly, I narrowed it down to Tesseract (http://code.google.com/p/tesseract-ocr/). Tesseract is an orphaned brainchild of HP, which worked on it from 1985 to 1995. It was then moved to Open Source, and now, if I understand correctly, Google is working on it. With credentials like that, it's no wonder that Tesseract scores some of the highest marks on OCR recognition and accuracy. After downloading and struggling just a bit, I got Tesseract to work. The struggling part was that the home page claims its base input format is a TIFF file. Maybe my TIFFs were bad, but I was able to get it to work only with BMP files.

    Image file conversion

    So now that I have an OCR engine that can convert BMP files into text, how do I get text out of the image PDF files? One more search, and I settled on ImageMagick (http://www.imagemagick.org/). This is another wonderful Open Source utility that can convert almost any file into an image. It worked out of the box, converting TIFF files into bitmaps, but to get PDF files converted it requires Ghostscript (http://mirror.cs.wisc.edu/pub/mirrors/ghost/GPL/gs864/gs864w32.exe).

    Dealing with text PDFs

    With those utilities installed, I was cooking - I can convert any file (in particular PDF and TIFF) into a bitmap, and then extract the text out of the bitmap. The only consideration was to somehow treat PDF files that already contain text differently - after all, OCR is very computation-intensive and somewhat error-prone even with perfect image quality and resolution. So another quick search, and I have pdftotext (ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl4-win32.zip) - thank God for Open Source! With it, I can pull the text out of a PDF in an eye blink. I would get nothing for pure image PDFs, but I already have a solution for that!

    Batch process

    It took another 15 minutes to set up a batch script to automate the process:

      • Check the file extension.
      • If the file is a PDF:
          • try to extract text out of it;
          • if there is more than a certain amount of text in the file - done!
          • if there is no text, convert the first page into a bitmap and run OCR on the bitmap.
      • For any other file type, convert the file into a bitmap and run OCR on the bitmap.

    Once you unzip the attached project, check out the bin\OCR.BAT file. It will create a temporary file in the directory where your source file is, with the same name plus the '.txt' extension.
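
    The decision logic of that batch file is easy to restate compactly. The sketch below mirrors it with the same tools the article names (pdftotext, ImageMagick's convert, Tesseract); the text-length threshold and file names are assumptions:

      import subprocess
      from pathlib import Path

      MIN_TEXT = 100  # hypothetical threshold for "the PDF already has text"

      def to_text(src: Path) -> None:
          out_base = src.with_suffix("")
          if src.suffix.lower() == ".pdf":
              # Try the cheap route first: extract any embedded text.
              text = subprocess.run(["pdftotext", str(src), "-"],
                                    capture_output=True).stdout
              if len(text.strip()) >= MIN_TEXT:
                  out_base.with_suffix(".txt").write_bytes(text)
                  return  # enough embedded text - done
              # Image-only PDF: render page 1 to a bitmap (needs Ghostscript).
              bmp = out_base.with_suffix(".bmp")
              subprocess.run(["convert", f"{src}[0]", str(bmp)], check=True)
          else:
              # Any other file type: convert straight to a bitmap.
              bmp = out_base.with_suffix(".bmp")
              subprocess.run(["convert", str(src), str(bmp)], check=True)
          # tesseract writes its output to out_base.txt.
          subprocess.run(["tesseract", str(bmp), str(out_base)], check=True)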

    Read the article
