Search Results

Search found 4479 results on 180 pages for 'pdf scraping'.

Page 9/180 | < Previous Page | 5 6 7 8 9 10 11 12 13 14 15 16  | Next Page >

  • Batch OCR for many PDF files (not already OCRed) ?

    - by David
    Hello, I use Google Desktop Search (on Vista), and not all the PDF files in my archive folder are indexed. That is expected, since "PDF files that contain scanned images" are not indexed (http://desktop.google.com/support/bin/answer.py?hl=en&answer=90651). So I would like to OCR the many PDF files that have not been OCRed yet. My goal: I give the program a folder, and it searches the subfolders on its own for the PDF files that need to be converted into OCRed PDF files. Note: in the past, if a PDF file was password protected, I removed the password with another (paid) batch tool: verypdf.com "pwdremover". Any (not too expensive) ideas? I have already tried: FineReader 6 Pro (on XP at the time), but no batch processor was included. Paperfile (paperfile.net), which uses Tesseract (code.google.com/p/tesseract-ocr/), but its OCR only goes from PDF to text, not PDF to PDF! There is also another project, code.google.com/p/ocropus. Thanks in advance ;)
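
The "find the PDFs that still need OCR" step could be sketched in Python as below. This is only a rough illustration, not a tested tool: the function names are hypothetical, and the check rests on the assumption that a PDF with a text layer exposes a `/Font` entry in its raw bytes. PDFs that keep their resources in compressed object streams will be misjudged, so treat the result as a candidate list for the OCR program, not a verdict.

```python
import os


def pdf_probably_needs_ocr(path):
    """Crude heuristic: a PDF with no /Font entry likely has no text layer."""
    with open(path, "rb") as handle:
        data = handle.read()
    # Assumption: text-bearing PDFs expose a /Font resource in plain bytes.
    # Files that hide resources in compressed object streams are misjudged.
    return b"/Font" not in data


def find_pdfs_needing_ocr(root):
    """Walk root recursively and return the PDF paths flagged by the heuristic."""
    flagged = []
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            if name.lower().endswith(".pdf"):
                path = os.path.join(folder, name)
                if pdf_probably_needs_ocr(path):
                    flagged.append(path)
    return sorted(flagged)
```

Each returned path would then be handed to whichever OCR tool is chosen.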

    Read the article

  • Simple Merging Of PDF Documents with iTextSharp 5.4.5.0

    - by Mladen Prajdic
    As we were working on our first SQL Saturday in Slovenia, we came to the point where we had to print out the so-called SpeedPASSes for attendees. A SpeedPASS file is a PDF that contains their raffle, lunch and admission tickets. The problem is that we have to download one PDF per attendee and print it out, and printing more than 10 docs at once is a pain. So I decided to make a little console app that merges multiple PDF files into a single file that is much easier to print. I used an open source PDF manipulation library called iTextSharp, version 5.4.5.0.

    This is the console program I used. It’s brilliantly named MergeSpeedPASS. It only has two methods and is really short. Don't let the name fool you: it can be used to merge any PDF files. The first parameter is the name of the target PDF file that will be created. The second parameter is the directory containing the PDF files to be merged into a single file.

        using iTextSharp.text;
        using iTextSharp.text.pdf;
        using System;
        using System.IO;

        namespace MergeSpeedPASS
        {
            class Program
            {
                static void Main(string[] args)
                {
                    // No arguments or a help switch: print usage
                    if (args.Length == 0 || args[0] == "-h" || args[0] == "/h")
                    {
                        Console.WriteLine("Welcome to MergeSpeedPASS. Created by Mladen Prajdic. Uses iTextSharp 5.4.5.0.");
                        Console.WriteLine("Tool to create a single SpeedPASS PDF from all downloaded generated PDFs.");
                        Console.WriteLine("");
                        Console.WriteLine("Example: MergeSpeedPASS.exe targetFileName sourceDir");
                        Console.WriteLine(" targetFileName = name of the new merged PDF file. Must include .pdf extension.");
                        Console.WriteLine(" sourceDir = path to the dir containing downloaded attendee SpeedPASS PDFs");
                        Console.WriteLine("");
                        Console.WriteLine(@"Example: MergeSpeedPASS.exe MergedSpeedPASS.pdf d:\Downloads\SQLSaturdaySpeedPASSFiles");
                    }
                    else if (args.Length == 2)
                        CreateMergedPDF(args[0], args[1]);

                    Console.WriteLine("");
                    Console.WriteLine("Press any key to exit...");
                    Console.Read();
                }

                static void CreateMergedPDF(string targetPDF, string sourceDir)
                {
                    using (FileStream stream = new FileStream(targetPDF, FileMode.Create))
                    {
                        Document pdfDoc = new Document(PageSize.A4);
                        PdfCopy pdf = new PdfCopy(pdfDoc, stream);
                        pdfDoc.Open();

                        var files = Directory.GetFiles(sourceDir);
                        Console.WriteLine("Merging files count: " + files.Length);

                        // Append every file found in the source directory
                        int i = 1;
                        foreach (string file in files)
                        {
                            Console.WriteLine(i + ". Adding: " + file);
                            pdf.AddDocument(new PdfReader(file));
                            i++;
                        }

                        if (pdfDoc != null)
                            pdfDoc.Close();

                        Console.WriteLine("SpeedPASS PDF merge complete.");
                    }
                }
            }
        }

    Hope it helps you and have fun.

    Read the article

  • Rails PDF Generation with Prawn in IE7

    - by fluid_chelsea
    I'm using Prawn and Prawnto to generate a PDF in a Ruby on Rails app (Rails version 2.2.2). This works great in Firefox: the PDFs are generated happily and sent to the user to download. The problem is in IE7. I have a route set up like so:

        map.invoice_pdf '/invoices.pdf', :controller => 'invoices', :action => 'index', :format => 'pdf'

    Which I then call with a link like so:

        invoice_pdf_path(:year => params[:year], :month => params[:month], :unpaid_only => params[:unpaid_only])

    And the following in my controller:

        def index
          params[:year] = default params[:year]
          params[:month] = default params[:month]
          params[:page] ||= 1
          @invoices = Arobl.find_invoices_for_customer(current_customer.strCustomerID, params)
          respond_to do |format|
            format.html { render :action => 'index' }
            format.pdf { prawnto :inline => false, :filename => "#{current_customer.strCustomerID}_invoice.pdf" }
          end
        end

    In FF this works as expected: when the link is clicked, the index action is invoked with a format of .pdf and responds with the correctly named PDF. When it's hit with IE7, it says that the file or website could not be found, and references "invoices.pdf" instead of the expected customer_id_invoice.pdf filename. Any idea what could be causing this behaviour? Thanks!

    Read the article

  • Rendering PDF on WebPage

    - by Priyank
    Hi. We are trying to load a PDF file in the web browser using the PDFObject JavaScript API. The PDFs we are trying to display are currently close to 10 MB each. This causes a long delay before a PDF is displayed on the web page, while the complete PDF gets downloaded. We need to remove this lag by achieving one of the following alternatives:

    1. Show a progress bar until the PDF is actually displayed. We couldn't find an event that is triggered when the PDF becomes visible, so we cannot decide when to stop showing the progress bar/spinner.
    2. Lazy load the PDF so that it is displayed as soon as the first page is loaded. That way the user at least has a visual indication that something is happening. We couldn't find anything in PDFObject that lets us do a lazy load.
    3. Use an alternative PDF rendering API. This is a low priority, as we already have complete code in place, but if the first two alternatives cannot be met we would have to consider this option, so please feel free to suggest one.

    Any other ideas on how the user interaction can be made more intuitive or pleasant would be welcome. Cheers

    Read the article

  • HTML Agility Pack Screen Scraping XPATH isn't returning data

    - by Matthias Welsh
    I'm attempting to write a screen scraper for Digikey that will allow our company to keep accurate track of pricing, part availability and product replacements when a part is discontinued. There seems to be a discrepancy between the XPATH that I'm seeing in Chrome Devtools as well as Firebug on Firefox and what my C# program is seeing. The code I'm currently using is pretty quick and dirty...

        //This function retrieves data from the digikey
        private static List<string> ExtractProductInfo(HtmlDocument doc)
        {
            List<HtmlNode> m_unparsedProductInfoNodes = new List<HtmlNode>();
            List<string> m_unparsedProductInfo = new List<string>();

            //Base Node for part info
            string m_baseNode = @"//html[1]/body[1]/div[2]";

            //Write part info to list
            m_unparsedProductInfoNodes.Add(doc.DocumentNode.SelectSingleNode(m_baseNode + @"/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]"));
            //More lines of similar form will go here for more info

            //this retrieves digikey PN
            foreach(HtmlNode node in m_unparsedProductInfoNodes)
            {
                m_unparsedProductInfo.Add(node.InnerText);
            }
            return m_unparsedProductInfo;
        }

    Although the path I'm using appears to be "correct", I keep getting NULL when I look at the list "m_unparsedProductInfoNodes". Any idea what's going on here? I'll also add that if I do a "SelectNodes" on the baseNode it only returns a div... not sure what that indicates, but it doesn't seem right.

    Read the article

  • Python web scraping involving HTML tags with attributes

    - by rohanbk
    I'm trying to make a web scraper that will parse a web page of publications and extract the authors. The skeletal structure of the web page is the following:

        <html>
          <body>
            <div id="container">
              <div id="contents">
                <table>
                  <tbody>
                    <tr>
                      <td class="author">####I want whatever is located here ###</td>
                    </tr>
                  </tbody>
                </table>
              </div>
            </div>
          </body>
        </html>

    I've been trying to use BeautifulSoup and lxml thus far to accomplish this task, but I'm not sure how to handle the two div tags and the td tag because they have attributes. In addition to this, I'm not sure whether I should rely more on BeautifulSoup or lxml, or on a combination of both. What should I do? At the moment, my code looks like what is below:

        import re
        import urllib2,sys
        import lxml
        from lxml import etree
        from lxml.html.soupparser import fromstring
        from lxml.etree import tostring
        from lxml.cssselect import CSSSelector
        from BeautifulSoup import BeautifulSoup, NavigableString

        address='http://www.example.com/'
        html = urllib2.urlopen(address).read()
        soup = BeautifulSoup(html)
        html=soup.prettify()
        html=html.replace('&nbsp', '&#160')
        html=html.replace('&iacute','&#237')
        root=fromstring(html)

    I realize that a lot of the import statements may be redundant, but I just copied whatever I currently had in my source file. EDIT: I suppose that I didn't make this quite clear, but I have multiple tags in the page that I want to scrape.
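
For the skeleton above, one way to pull out every author cell without caring about the enclosing div attributes is to match on the td's class alone. The following is a minimal sketch using only Python 3's standard library `html.parser` (a swapped-in technique, not the BeautifulSoup/lxml code from the question); the class and function names are made up for illustration, and it assumes author cells do not contain nested tables.

```python
from html.parser import HTMLParser


class AuthorCellParser(HTMLParser):
    """Collect the text of every <td class="author"> cell."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self._in_author_cell = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "author" in classes:
            self._in_author_cell = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_author_cell:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_author_cell:
            self.authors.append("".join(self._chunks).strip())
            self._in_author_cell = False


def extract_authors(html):
    """Return the stripped text of all author cells, in document order."""
    parser = AuthorCellParser()
    parser.feed(html)
    parser.close()
    return parser.authors
```

Because the match is on the cell's class, it picks up multiple author cells on one page; the surrounding `div id="container"` and `div id="contents"` never need to be addressed explicitly.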

    Read the article

  • Problem with eastern european characters when scraping data from the European Parliaments Website

    - by Thomas Jensen
    Dear Experts, I am trying to scrape a lot of data from the European Parliament website for a research project. The first step is to create a list of all parliamentarians; however, due to the many Eastern European names and the accents they use, I get a lot of missing entries. Here is an example of what is giving me trouble (notice the accent at the end of the family name):

        ANDRIKIENĖ, Laima Liucija
        Group of the European People's Party (Christian Democrats)

    So far I have been using pyparsing and the following code:

        # parser_names
        name = Word(alphanums + alphas8bit)
        begin, end = map(Suppress, "><")
        names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end

        for name in names.searchString(page):
            print(name)

    However, this does not catch the name from the HTML above. Any advice on how to proceed? Best, Thomas
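
One possible reason for the missing entries, as far as I know, is that pyparsing's `alphas8bit` only covers the Latin-1 range, while letters such as Ė (U+0116) lie outside it. A hedged alternative sketch using Python 3's standard `re` module is shown below; it assumes `page` is already a decoded `str` (not raw bytes) and that names appear as "FAMILY, Given" between `>` and `<`. The names `LETTER`, `WORDS` and `extract_names` are illustrative only.

```python
import re

# [^\W\d_] matches any Unicode letter, because str patterns are Unicode-aware.
LETTER = r"[^\W\d_]"
# A run of letters that may also contain spaces, apostrophes and hyphens.
WORDS = rf"{LETTER}(?:{LETTER}|[ '\-])*"
NAME_PATTERN = re.compile(rf">\s*({WORDS}),\s*({WORDS})\s*<")


def extract_names(page):
    """Return (family_name, given_names) pairs found between '>' and '<'."""
    return [(family.strip(), given.strip())
            for family, given in NAME_PATTERN.findall(page)]
```

If the page is fetched as bytes, it would need to be decoded first with the encoding the site actually declares; which encoding that is remains an assumption to check.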

    Read the article

  • Scraping a page from a secure URL which is possibly using a session ID

    - by VN44CA
    How can I scrape a page like this one? https://www.procom.ca/JobList.aspx?keywords=&Cities=&reference=&JobType=0 It is secure, and possibly requires a referrer? I can't get anything using wget or httplib2. If you go through the following page, you get a list, and it works in a browser but not from the command line: https://www.procom.ca/jobsearch.aspx I am interested in command-line fetching. Thanks
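
If the site does depend on a session cookie and a Referer header (this is a guess, not something confirmed for this site), the usual approach is to visit the entry page first with a cookie-aware client and then request the list page with the same client. A minimal sketch with Python 3's standard library, where the function names and the default User-Agent string are placeholders:

```python
import http.cookiejar
import urllib.request


def build_session_opener(referer, user_agent="Mozilla/5.0"):
    """Return (opener, jar): an opener that keeps cookies and sends fixed headers."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent), ("Referer", referer)]
    return opener, jar


def fetch_via_entry_page(entry_url, target_url):
    """Visit entry_url first to pick up any session cookie, then fetch target_url."""
    opener, _jar = build_session_opener(referer=entry_url)
    opener.open(entry_url).read()           # network call; stores cookies if any are set
    return opener.open(target_url).read()   # same opener, so the cookies are sent back
```

Here `entry_url` would be the jobsearch.aspx page and `target_url` the JobList.aspx URL from the question.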

    Read the article

  • Screen scraping an application window and interacting with the mouse and keyboard

    - by ccook
    The other day I found myself addicted to a Flash game and frustrated by the thing at the same time. In a moment of frustration with the game I thought I would make a 'bot' to beat it for me. Well, I really wouldn't, but it made me realize: I don't know how to interact with another application in a way to do this. Which brings me to the question: how would one take screenshots of another running application and interact with it with the keyboard and mouse? Ideally the solution would be in a managed language like C#. When I did the background reading, the net was drowning in articles on scraping HTML. There were not many articles on actually screen scraping an application. Diverse answers are appreciated, as I’m really looking to survey what’s out there.

    Read the article

  • Scraping Google docs (can't use API)

    - by Andy Waite
    I'm building an iPhone app which needs a piece of metadata from a user's Google Spreadsheet. Unfortunately the metadata I need is not exposed by the API, so I will need to scrape it from the document's HTML source (it would not be present in any of the exported variants). Is there any way to include authentication parameters in a call such as: http://spreadsheets.google.com/ccc?key=abc123&username=...&password=...

    Read the article

  • Scraping ASP.NET site with Ruby

    - by JillianK
    I would like to scrape the search results of this ASP.NET site using Ruby, preferably just using Hpricot (I cannot open an instance of Firefox): http://www.ngosinfo.gov.pk/SearchResults.aspx?name=&foa=0 However, I am having trouble figuring out how to go through each page of results. Basically, I need to simulate clicking on links like these:

        <a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$Pager1$2','')" class="blue_11" id="ctl00_ContentPlaceHolder1_Pager1">2</a>
        <a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$Pager1$3','')" class="blue_11" id="ctl00_ContentPlaceHolder1_Pager1">3</a>

    etc. I tried using Net::HTTP to handle the post, but while that received the correct HTML, there were no search results (I'm probably not doing that correctly). In addition, the URL of the page does not contain any parameters indicating the page, so it is not possible to force the results that way. Any help would be greatly appreciated.
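
In ASP.NET Web Forms, `__doPostBack(target, argument)` fills the hidden fields `__EVENTTARGET` and `__EVENTARGUMENT` and submits the page's form, so a scraper generally has to POST those two values together with the other hidden fields of the current page (such as `__VIEWSTATE`). A minimal sketch of building that POST body is shown below; it is written in Python for illustration rather than Ruby, the helper names are hypothetical, and whether this particular site needs further fields is not known.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> fields."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and (attributes.get("type") or "").lower() == "hidden":
            name = attributes.get("name")
            if name:
                self.fields[name] = attributes.get("value") or ""


def build_postback_body(page_html, event_target, event_argument=""):
    """Return the url-encoded POST body that mimics __doPostBack(target, argument)."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    parser.close()
    fields = dict(parser.fields)
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields).encode("ascii")
```

The body would be posted back to the same SearchResults.aspx URL, with the hidden fields taken from the page that was just received, and the step repeated for each pager target such as `ctl00$ContentPlaceHolder1$Pager1$3`.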

    Read the article

  • PHP Magento Screen Scraping

    - by Grant unwin
    I am trying to scrape a supplier's Magento site in an effort to save some time, because there are around 2000 products I need to gather info for. I'm totally OK with writing a screen scraper for pretty much anything, but I've encountered a major problem. I'm using file_get_contents to gather the HTML of the product page. The problem is: you need to be logged in to view the product page. It's a standard Magento login, so how can I get round this in my screen scraper? I don't require a full script, just advice on a method. Thanks

    Read the article

  • Creating a multi-page PDF doc

    - by codemercenary
    Hi, has anyone already created a PDF document in an iPad app? I see that there are new functions in UIKit to do this, but I can't find any code example for them.

        BOOL UIGraphicsBeginPDFContextToFile ( NSString *path, CGRect bounds, NSDictionary *documentInfo );
        void UIGraphicsBeginPDFPage ( void );

    I found an example that is supposed to work on the iPhone, but it gives me these errors:

        Fri Apr 30 11:55:32 wks104.hs.local PDF[1963] <Error>: CGFont/Freetype: The function `create_subset' is currently unimplemented.
        Fri Apr 30 11:55:32 wks104.hs.local PDF[1963] <Error>: invalid Type1 font: unable to stream font.
        Fri Apr 30 11:55:32 wks104.hs.local PDF[1963] <Error>: FT_Load_Glyph failed: error 6.
        Fri Apr 30 11:55:32 wks104.hs.local PDF[1963] <Error>: FT_Load_Glyph failed: error 6.
        Fri Apr 30 11:55:32 wks104.hs.local PDF[1963] <Error>: FT_Load_Glyph failed: error 6.
        Fri Apr 30 11:55:32 wks104.hs.local PDF[1963] <Error>: FT_Load_Glyph failed: error 6.

    Read the article

  • PDF form submission

    - by Jeff
    I have a PDF form (made in Acrobat) that has a button to submit via HTTP. What I want to do is have a PHP script that takes the submitted PDF form and e-mails it to me as an attachment. What I don't want:

    --A PDF submit-via-e-mail button. This requires webmail users to save the PDF and attach it, and is just too confusing for most users. I want one click and done.
    --Submit via mailto:[email protected]. Does the same thing as above.

    If there's a PDF on the server, I know how to use PHP's mail() function to e-mail it to someone. What I don't know how to do is process the PDF once someone hits Submit within the PDF. Does that make sense? Thanks, Jeff

    Read the article

  • PostScript versus PDF as an output format

    - by Brecht Machiels
    I'm currently writing a typesetting application and I'm using PSG as the backend for producing PostScript files. I'm now wondering whether that choice makes sense. It seems the ReportLab Toolkit offers all the features PSG offers, and more. ReportLab outputs PDF, however. Advantages PDF offers:

        - transparency
        - better support for character encodings (Unicode, for example)
        - ability to embed TrueType and even OpenType fonts
        - hyperlinks and bookmarks

    Is there any reason to use PostScript instead of directly outputting to PDF? While PostScript, unlike PDF, is a full programming language, that doesn't seem to offer any advantage for a basic document output format. I assume a PDF can be readily converted to PostScript for printing? Some useful links: Wikipedia: PDF Adobe: PostScript vs. PDF

    Read the article

  • What is a good PDF report generator tool for python?

    - by jlouis
    What is a good tool for PDF report generation in Python? I've checked out ReportLab, but it seems to be awfully low-level for what I want to do. My current hunch is to call TeX on the command-line and let it produce the PDF, but if there is something that is easier to work with (and looks professional - We'll send this to customers) I'd very much like a prod in the right direction.

    Read the article

  • How to embed evince in firefox 4?

    - by Alaukik
    I installed mozplugger and created the file mozpluggerrc with the content below, according to this post. But whenever I open a .pdf, it opens in a separate evince window. Is there a way I can truly embed it in Firefox, like the Chrome PDF reader?

        application/pdf: pdf: PDF file
        application/x-pdf: pdf: PDF file
        text/pdf: pdf: PDF file
        text/x-pdf: pdf: PDF file
        application/x-postscript: ps: PostScript file
        application/postscript: ps: PostScript file
        application/x-dvi: dvi: DVI file
            : evince $file

    Read the article

  • Implements EAN13 and UPC-A barcode in PDF using fpdf in classic ASP

    - by Jeremy N
    /*
    FPDF library for ASP can be downloaded from: http://www.aspxnet.it/public/default.asp

    INFORMATIONS:
    Translated by: Jeremy
    Author: Olivier
    License: Freeware

    DESCRIPTION:
    This script implements EAN13 and UPC-A barcodes (the second being a particular case of the first one). Bars are drawn directly in the PDF (no image is generated).

    function EAN13(x,y,barcode,h,w)
    -x = x coordinate to start drawing the barcode
    -y = y coordinate to start drawing the barcode
    -barcode = code to write (must be all numeric)
    -h = height of the bar
    -w = the minimum width of individual bar

    function UPC_A(x,y,barcode,h,w)
    Same parameters

    An EAN13 barcode is made up of 13 digits, UPC-A of 12 (leading zeroes are added if necessary). The last digit is a check digit; if it's not supplied or if it is incorrect, it will be automatically computed.

    USAGE:
    Copy all of this text and save it in a file called barcode.ext under the fpdf/extends folder

    EXAMPLE:
    Set pdf=CreateJsObject("FPDF")
    pdf.CreatePDF "P","mm","letter"
    pdf.SetPath("fpdf/")
    pdf.LoadExtension("barcode")
    pdf.Open()
    pdf.AddPage()
    'set the fill color to black
    pdf.setfillcolor 0,0,0
    pdf.UPC_A 80,40,"123456789012",16,0.35
    pdf.Close()
    pdf.NewOutput "" , true, "test.pdf"
    */

    this.EAN13=function (x,y,barcode,h,w) { return this.Barcode(x,y,barcode,h,w,13); };

    this.UPC_A=function (x,y,barcode,h,w) { return this.Barcode(x,y,barcode,h,w,12); };

    //Compute the check digit of a 12-digit code (EAN-13 weights 1,3,1,3,... from the left)
    function GetCheckDigit(barCode) {
        var bc = barCode.replace(/[^0-9]+/g,'');
        var total = 0;
        var i;
        //Digits at positions 2,4,...,12 (counted from the left) have weight 3
        for (i=1; i<=11; i=i+2) { total = total + 3 * parseInt(bc.substr(i,1), 10); }
        //Digits at positions 1,3,...,11 (counted from the left) have weight 1
        for (i=0; i<=10; i=i+2) { total = total + parseInt(bc.substr(i,1), 10); }
        //Determine the checksum
        var modDigit = (10 - total % 10) % 10;
        return modDigit.toString();
    }

    //Test validity of check digit
    function TestCheckDigit(barcode) {
        var cd=GetCheckDigit(barcode.substring(0,barcode.length-1));
        return cd==barcode.substr(barcode.length-1,1);
    }

    this.Barcode=function Barcode(x,y,barcode,h,w,len) {
        //Padding
        while(barcode.length < len-1) { barcode = '0' + barcode; }
        if(len==12) {barcode='0' + barcode;}

        //Add or control the check digit
        if(barcode.length==12) {
            barcode += GetCheckDigit(barcode);
        } else {
            //if the check digit is incorrect, fix the check digit.
            if(!TestCheckDigit(barcode)) {
                barcode = barcode.substring(0,barcode.length-1) + GetCheckDigit(barcode.substring(0,barcode.length-1));
            }
        }

        //Convert digits to bars
        var codes=[['0001101','0011001','0010011','0111101','0100011','0110001','0101111','0111011','0110111','0001011'],
                   ['0100111','0110011','0011011','0100001','0011101','0111001','0000101','0010001','0001001','0010111'],
                   ['1110010','1100110','1101100','1000010','1011100','1001110','1010000','1000100','1001000','1110100']];
        var parities=[[0,0,0,0,0,0], [0,0,1,0,1,1], [0,0,1,1,0,1], [0,0,1,1,1,0], [0,1,0,0,1,1],
                      [0,1,1,0,0,1], [0,1,1,1,0,0], [0,1,0,1,0,1], [0,1,0,1,1,0], [0,1,1,0,1,0]];
        var code='101';
        var p=parities[parseInt(barcode.substr(0,1))];
        var i;
        for(i=1;i<=6;i++) { code+= codes[p[i-1]][parseInt(barcode.substr(i,1))]; }
        code+='01010';
        for(i=7;i<=12;i++) { code+= codes[2][parseInt(barcode.substr(i,1))]; }
        code+='101';

        //Draw bars
        for(i=0;i<code.length;i++) {
            if(code.substr(i,1)=='1') { this.Rect(x+i*w,y,w,h,'F'); }
        }

        //Print text under barcode.
        this.SetFont('Arial','',12);
        //Set the x so that the font is centered under the barcode
        this.Text(x+parseInt(0.5*barcode.length)*w,y+h+11/this.k,barcode.substr(barcode.length-len,len));
    }
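
For reference, the standard EAN-13 check digit rule (weights 1 and 3 alternating from the leftmost digit, then the amount needed to reach the next multiple of 10) can be sketched in a few lines of Python. This is a standalone illustration, not part of the FPDF extension; the function names are made up.

```python
def ean13_check_digit(first12):
    """Return the EAN-13 check digit (0-9) for a string of exactly 12 digits."""
    if len(first12) != 12 or not first12.isdigit():
        raise ValueError("expected exactly 12 digits")
    # Weights alternate 1, 3, 1, 3, ... starting from the leftmost digit.
    total = sum(int(ch) * (3 if index % 2 else 1)
                for index, ch in enumerate(first12))
    return (10 - total % 10) % 10


def complete_ean13(code):
    """Left-pad code with zeroes to 12 digits and append the check digit."""
    padded = code.zfill(12)
    return padded + str(ean13_check_digit(padded))
```

A UPC-A code is handled the same way after a leading zero is added, which is why the sample code "123456789012" from the usage example already carries a valid check digit (2).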

    Read the article

  • Does Google use any “Language” flags / tags set within a PDF file when determining its language?

    - by Ally Ak
    When determining the language of a HTML page, I understand that Google looks at any language declarations that the page owner has set, and then also applies its own language detection algorithms. But does Google similarly look at language meta data set in PDF files when determining a PDF file's language? (Authors of PDF files can set document-wide properties describing the language (or languages) contained within it.) Or does Google rely exclusively on language detection algorithms and disregard the language flag set within the PDF file? Can anyone shed any light?

    Read the article

  • How to convert a .pdf file into a folder of images?

    - by Shawn
    I have some .pdf files that I would like to convert to my preferred reading format of .cbr or .cbz or, if this isn't directly possible, I need to extract all pages from the .pdf as images and then compress them into my format of choice. I have only been able to save pages one at a time with Document Viewer. Obviously, I'd like to do it a little quicker. I have tried pdfsam, pdf shuffler, and pdfmod all with no luck. I am using Ubuntu 11.10.
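
Since a .cbz file is just a ZIP archive of page images, the two steps described above (render pages to images, then pack them) can be scripted. The sketch below assumes poppler's `pdftoppm` tool is installed for the rendering step and only builds its command line; the packing step uses Python's standard `zipfile`. Function names and the default resolution are illustrative choices.

```python
import os
import zipfile


def pdftoppm_command(pdf_path, out_dir, dpi=150):
    """Command line for poppler's pdftoppm: one PNG per page, prefixed 'page'."""
    return ["pdftoppm", "-png", "-r", str(dpi), pdf_path,
            os.path.join(out_dir, "page")]


def pack_cbz(image_dir, cbz_path):
    """Store the images in image_dir (sorted by name) in a .cbz archive."""
    extensions = (".png", ".jpg", ".jpeg")
    names = sorted(name for name in os.listdir(image_dir)
                   if name.lower().endswith(extensions))
    # Images are already compressed, so they are stored without recompression.
    with zipfile.ZipFile(cbz_path, "w", zipfile.ZIP_STORED) as archive:
        for name in names:
            archive.write(os.path.join(image_dir, name), arcname=name)
    return names
```

The command from `pdftoppm_command` would be run first (for example with `subprocess.run(..., check=True)`), and `pack_cbz` afterwards on the same output folder.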

    Read the article

  • How to import a pdf in libreoffice? under ubuntu, all pages are blank

    - by Daniele
    I have some .pdf generated by a scanner, that I want to import in LibreOffice and do some small editing. The PDF has only one object per page, a page-size image. If I open it in LibreOffice under Ubuntu 12.10, it imports "successfully" but all pages are blank. I have the libreoffice-pdfimport package installed. That is true with both LibreOffice 3.6 (part of Ubuntu 12.10) and with 4.0.2, from libreoffice ppa. The same .pdf files open perfectly fine on both LibreOffice for Windows and LibreOffice for Mac (yes, I have three computers with all three OSes), but on Ubuntu 12.10, all pages are blank, so I can only conclude this is an issue with Ubuntu packaging, or something really weird prevents it from working under linux. How can I import these kinds of .pdf into LibreOffice for editing?

    Read the article

  • How to reduce the size of a pdf file?

    - by Nicole
    I'm looking for a way in Ubuntu to reduce the size of a PDF (by reducing the quality of the images). I know that this can be done with Ghostscript by typing the following command in a terminal:

        gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf

    The problem is that I can't specify the quality with any accuracy. The parameter -dPDFSETTINGS=/screen is the one that decides the quality, but the alternatives are quite rigid (for example, -dPDFSETTINGS=/ebook gives slightly better quality). I'm looking for a way to reduce the size of a PDF that allows me to specify the desired quality numerically. I know that this is possible on a Mac, so it must be possible in Linux -- right? Any help would be much appreciated.
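
One way to get a numeric knob is to replace the -dPDFSETTINGS preset with Ghostscript's individual image downsampling parameters and pass a target resolution. The sketch below only builds and runs such a command from Python; the helper names and the default of 100 dpi are assumptions, and whether a given image is actually downsampled also depends on Ghostscript's own threshold settings, so results should be checked on a sample file.

```python
import subprocess


def build_gs_command(source, target, dpi):
    """Build a Ghostscript command that downsamples images to roughly dpi."""
    return [
        "gs", "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4",
        "-dNOPAUSE", "-dQUIET", "-dBATCH",
        "-dDownsampleColorImages=true", f"-dColorImageResolution={dpi}",
        "-dDownsampleGrayImages=true", f"-dGrayImageResolution={dpi}",
        "-dDownsampleMonoImages=true", f"-dMonoImageResolution={dpi}",
        f"-sOutputFile={target}", source,
    ]


def shrink_pdf(source, target, dpi=100):
    """Run Ghostscript; raises CalledProcessError if gs exits with an error."""
    subprocess.run(build_gs_command(source, target, dpi), check=True)
```

Calling `shrink_pdf("input.pdf", "output.pdf", dpi=72)` would correspond roughly to a "screen"-like result, with the resolution free to be tuned up or down.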

    Read the article
