pdf scraping - Page 7 - Developer IT

What's the requests/second standard for scraping websites?

- by feydr

This was the closest question to mine (http://stackoverflow.com/questions/2022030/web-scraping-etiquette), and in my opinion it wasn't answered very well. I'm looking for the answer to #1: how many requests per second should you make when scraping? Right now I pull from a queue of links. Every site that gets scraped has its own thread, which sleeps for 1 second between requests. I ask for gzip compression to save bandwidth. Are there standards for this? Surely all the big search engines follow some set of guidelines in this regard.
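
The one-second-per-site policy described above can be sketched as a small per-host throttle. This is a minimal illustration in Python, not a recommended standard: the class name `HostThrottle`, its parameters, and the one-second default are assumptions made for the example. It is not thread-safe on its own, which matches the one-thread-per-site setup in the question.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical helper for illustration; `clock` and `sleep` are
    injectable so the behaviour can be checked without real waiting.
    """

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the previous request

    def wait(self, url):
        """Block until the host of `url` may be contacted; return seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            # Sleep only for the part of min_delay that has not yet elapsed.
            delay = max(0.0, self.min_delay - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last_request[host] = now + delay
        return delay
```

Calling `throttle.wait(url)` before each request keeps requests to one host at least `min_delay` seconds apart, while requests to different hosts are not delayed by each other.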


Creating PDF Documents with ASP.NET and iTextSharp

The Portable Document Format (PDF) is a popular file format for documents. Because of PDF's ubiquity and layout capabilities, it's not uncommon for websites to use PDF technology. For example, an eCommerce store may offer a "printable receipt" option that, when selected, displays a PDF file within the browser. Last week's article, Filling in PDF Forms with ASP.NET and iTextSharp, looked at how to work with a special kind of PDF document, namely one that has one or more fields defined. A PDF document can contain various types of user interface elements, which are referred to as fields. For instance, there is a text field, a checkbox field, a combobox field, and more. Typically, the person viewing the PDF on her computer interacts with the document's fields; however, it is possible to enumerate and fill a PDF's fields programmatically, as we saw in last week's article. This article continues our investigation into iTextSharp, a .NET open source library for PDF generation, showing how to use iTextSharp to create PDF documents from scratch. We start with an example of how to programmatically define and piece together paragraphs, tables, and images into a single PDF file. Following that, we explore how to use iTextSharp's built-in capabilities to convert HTML into PDF.


Problems in "Save as PDF" plugin with Arabic numbers

- by Mohamed Mohsen

I use the "Save as PDF" plugin with Word 2007 to generate a PDF document from a DOCX document. It works great, except that the Arabic numbers in the Word file are converted to English numbers in the PDF document. Kindly find two links containing two screenshots explaining the problem. The first image is the generated PDF file with the English numbers highlighted. The second image is the original Word file with the Arabic numbers highlighted. Update: Thanks very much Isaac, ChrisF and Wil. I changed the Numeral setting in Word to Context and confirmed that all the numbers in the Word file are Arabic. The problem remains: the PDF file still has English numbers. (Note: the Arabic numbers are also called Hindi numbers.) I also tried changing the font to Tahoma, with no luck.


ifilter not working with MOSS 2007, can't crawl .pdf

- by SORRYPROFESSEROFYEARNING

I installed the ifilter and followed the guide at http://msmvps.com/blogs/sundar_narasiman/archive/2008/02/06/configuring-moss-2007-to-search-pdf-documents-install-and-configure-pdf-ifilters.aspx along with the accompanying link to the MS hotfix. I have initiated multiple crawls, but they don't show any .pdf documents, let alone the contents of the .pdfs (I kept uploading test documents with real content). The 'file types' menu of the shared services didn't show the pdf icon as I think it was meant to, and it lists 'pdf' as file type 'AcroExch.Document'. Is this correct? Any ideas, anyone?


Apple Automator "New PDF from Images" maintaining same filename

- by mech

I potentially have 26k old legacy PICT images to convert to PDF first as part of a migration. I am using Apple Automator with the "Dispense Items Incrementally" action to loop through them. However, I can't seem to get "New PDF from Images" to remember the original filename. Is anyone able to offer some advice? :) FYI, I am converting to PDF because I can't use ImageMagick to convert directly to my ultimate JPEG format: my PICT files were created very long ago, and convert fails with an "improper image header" error. See this ticket for more information. So I am doing an intermediate conversion from PICT to PDF first, then converting that PDF to JPEG :) The only thing left is the "Output File Name" setting, which does not let me identify the original filename. See the screen here:


Libre Office/Writer PDF export: white borders appear between lines even after setting borders to none

- by Yttric

My document prints ok (exactly as it appears on the screen in Libre Office), but when I export to PDF and view the PDF on screen there are white borders around each text or picture object. Here's a sample snapshot from PDF/Preview: http://imgur.com/TWip5 I've tried selecting a paragraph and changing the border property to None as described in Libre Office help (http://help.libreoffice.org/Common/Borders), setting the "Line arrangement Default" to "Set no borders". But borders set by the Format dialog don't correspond to the borders I see in PDF/Preview. In PDF/Preview the border appears on line boundaries. Borders set in Format appear around each picture, for example. What am I doing wrong?


Finding Image resolution in PDF file?

- by Dave

I have a problem with some users creating very large PDFs. On the other hand, the PDFs sent from our fax machines are really small in size and perfectly printable. My question: is there any way to find the resolution (DPI) of a PDF? I searched the internet and could not find an answer. I checked the properties of the file, but this information was not stored there, at least in my case. What is the optimum resolution for converting a text file into an image PDF: 96 dpi, 300 dpi, or more? Fun question: can I resize a PDF that was scanned at a high dpi to a lower dpi? I know some answers might not be available, as I have already searched the internet and could not find any. Note: my PDFs are entirely images (text converted to images). I am also familiar with primoPDF (free), which is something you can experiment with.
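
For image-only pages like these, the effective resolution of an embedded image is its pixel count divided by the size at which it is drawn in inches, and the default PDF user-space unit is 1/72 inch. The arithmetic can be sketched in Python as below. The function names are made up for the example, and the pixel and point values would have to be read from the PDF with a separate inspection tool, which is not shown.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch


def effective_dpi(pixels, drawn_points):
    """Resolution of an image `pixels` wide that is drawn `drawn_points` wide."""
    return pixels / (drawn_points / POINTS_PER_INCH)


def resampled_pixels(pixels, current_dpi, target_dpi):
    """Pixel count after resampling from `current_dpi` to `target_dpi`."""
    return round(pixels * target_dpi / current_dpi)


# A scan 2550 pixels wide that fills a US Letter page (612 points = 8.5 inches).
print(effective_dpi(2550, 612))          # 300.0
print(resampled_pixels(2550, 300, 96))   # 816
```

The second function shows why downsampling shrinks files: going from 300 dpi to 96 dpi cuts each dimension to about a third, so the image holds roughly a tenth of the pixels.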


Stripping Non-Text from a Scanned, OCRd PDF

- by Daniel S.

I have a PDF created from a scanned document. OCR was used to recognize text. In Acrobat, if I select text and click 'copy with formatting', I can paste the formatted text into Word, so it seems that fonts and colors, and possibly the size, are embedded in the document in addition to the plain text. Is there any way to use this information to create a PDF that contains just the formatted OCR'd text, without the scanned image? Currently, my document only shows the scanned image, and the text is on an invisible layer. I would like to create a PDF document that removes the scanned image and displays the formatted text that is currently hidden. The post "PDF has an extra blank in all words after running through Ghostscript" has a section on "How can we make the invisible text visible?" However, doing this does not show the correct text formatting (which is retained when pasting into Word), and I would also like to remove the scanned image so that the final PDF contains just formatted (color, font, size) vector fonts, and no images.


PowerPoint 2007 animated slides are only partially converted to PDF

- by Tim

I have recently encountered a problem with PowerPoint 2007. When I use "Save as PDF/XPS" to create a PDF version of my presentation, some slides are only partially included in the resulting PDF file. For example, this: is reduced to this: So far, I have only encountered this with slides that contain animation elements, but which part of the elements remain in the PDF version appears not to have anything to do with the order in which the animated elements appear, so that might just be a coincidence. When viewing the affected slides in Acrobat Reader, it complains about this file containing invalid elements, and that I should complain to whoever generated the PDF file... Perhaps it has something to do with the Office 2007 Service Pack 3, because these problems started only after it had been installed. Has anyone noticed something similar? Is there a workaround?


Automate opening HTML and printing to PDF

- by craigpatik

I need a way to automate the following process in Windows 7:

1. Open an .html file in Internet Explorer
2. Print to PDF
3. Save the PDF with a patterned file name (i.e., original_name_YYYY-MM-DD.pdf)

Ideally, I could drag and drop several files or open a whole folder of files at once and a PDF would be created for each one. A command line solution is also acceptable. The files have to be opened in the browser because parts of the page are rendered with JavaScript on page load. In other words, if you simply right-click on the file in Explorer and choose "print", the resulting file is not the same because the JS didn't run. If it helps, Internet Explorer can be set as the default browser, and a PDF printer can be set as the default printer.
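
Only the naming step of this process is sketched here, in Python; opening the page in a browser and printing it are not covered. The helper name `dated_pdf_name` is hypothetical. It uses `PureWindowsPath` so that Windows-style paths are split correctly on any platform.

```python
from datetime import date
from pathlib import PureWindowsPath


def dated_pdf_name(html_path, day=None):
    """Build `original_name_YYYY-MM-DD.pdf` from the path of an .html file."""
    day = day or date.today()
    stem = PureWindowsPath(html_path).stem  # file name without its extension
    return f"{stem}_{day.isoformat()}.pdf"


print(dated_pdf_name(r"C:\pages\original_name.html", date(2012, 3, 5)))
# original_name_2012-03-05.pdf
```

A batch script could call this once per input file and pass the result to whatever tool ends up producing the PDF.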


Converting massive images to PDF, without crashing applications

- by BloodyIron

I'm trying to work with a large-format scanner, and we are scanning very long documents. Example, one of our documents we cut into two pieces, and one of those pieces is 3633x82486 in resolution. My application, Scanning Master 21+, which comes with the device (Graphtec CSX300-09) can output PDF, however when I try to save to PDF it complains about file being too large. I can successfully output to BMP however. GIMP can even open this BMP, after taking a while to load it. The resulting files range from 200MB - 1.2GB in size. Acrobat refuses to open the BMP format, saying it isn't supported or is damaged (which I know is not true). As I mentioned, the PDF plugin for GIMP crashes when I try to export to PDF. I'm really not sure what is the best tool for this job. So what is the best tool to produce PDF documents of very large images?


Copying first 1000 PDF files having single, double quotes in their name to another folder

- by racer_ace

I have a folder full of PDFs and I need to process 1000 of them at a time. So I need to move them into another folder, process them, and delete them. For this I tried using:

    $ find . -maxdepth 1 -type f |head -1000|xargs cp -t $destdir

It gives an error on single and double quotes in filenames. There are thousands of files and I have no idea how many of them have these quotes in their names. Can anyone help me find a solution? I also tried the -0 option, but it did not work.
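
One way to sidestep shell quoting entirely is to do the copy from a script, where file names are passed around as values and never re-parsed. The following Python sketch copies the first `n` PDFs from one folder to another. The function name and its defaults are assumptions for illustration, "first" here means first in sorted name order, and the `*.pdf` pattern is case-sensitive on most Linux file systems.

```python
import shutil
from itertools import islice
from pathlib import Path


def copy_first_n(src_dir, dest_dir, n=1000, pattern="*.pdf"):
    """Copy the first `n` matching files (sorted by name) into `dest_dir`.

    Returns the copied file names. Quotes and spaces in names need no
    special handling because no shell is involved.
    """
    src = Path(src_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in src.glob(pattern) if p.is_file())
    copied = []
    for path in islice(files, n):
        shutil.copy2(path, dest / path.name)
        copied.append(path.name)
    return copied
```

After a batch has been processed, the returned names identify which source files can be deleted before the next call.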


Adobe Acrobat: How to batch to combine multiple pdf files?

- by Andrei Andre

I have 3 folders:

    Folder 1
    Folder 2
    Folder 3

In each folder I have two pdf files:

    Folder 1
        file1.pdf
        file2.pdf
    Folder 2
        file1.pdf
        file2.pdf
    Folder 3
        file1.pdf
        file2.pdf

I want each folder to end up with a combined file of those two files:

    Folder 1
        binder.pdf
    Folder 2
        binder.pdf
    Folder 3
        binder.pdf

Any idea? Don't tell me to do it manually. This case is just to explain my problem to you; imagine that I have hundreds of folders. :) Maybe I can use another tool instead of Adobe Acrobat?!


Open Source PDF reader for windows as an alternative to Adobe reader

- by Tom Feiner

With the latest JavaScript vulnerabilities in Adobe Reader and the bloat it has acquired over the years, I've been thinking of moving the network I'm in charge of to a different product for PDF reading on Windows. The ideal PDF reader should be:

- Small in size (Adobe Reader is more than 200MB these days after installation).
- As secure by default as possible (for example, JavaScript disabled by default).
- Nice looking, with an easy-to-use interface.
- Not bloated with features (I just want to read PDFs, that's it).
- Free of toolbars, unwanted add-ons, and spyware.
- Free of ads while viewing PDFs.
- Preferably open source (this pretty much ensures no ads).
- Capable of full Unicode support.

Ideally, something like Evince from GNOME would be the best option, but unfortunately that's not available on Windows. Foxit is an option, as it is small and has a nice interface. But it still has JavaScript enabled by default, which might lead to vulnerabilities, and it installs a toolbar and displays ads while reading PDFs, which is distracting. There is a site dedicated to open source PDF readers, pdfreaders.org; however, the Windows PDF readers each have their problems, mostly that the interface is not as convenient (as Evince, Adobe or Foxit). Here's a list of all PDF software from Wikipedia. There's a "Viewers" section for each OS. What Windows PDF reader would you recommend?


How to retain headers for all the pages of an exported pdf in php?

- by udaya

Hi, I am exporting data from a PHP page to PDF. When the data exceeds the page limit, the header is not shown on the following pages. The function where I call the export to PDF is:

    function changeDetails()
    {
        $bType = $this->input->post('textvalue');
        if($bType == "pdf")
        {
            $this->load->library('table');
            $this->load->plugin('to_pdf');
            $data['countrytoword'] = $this->AddEditmodel1->export();
            $this->table->set_heading('Country','State','Town','Name');
            $out = $this->table->generate($data['countrytoword']);
            $html = $this->load->view( 'newpdf',$data, true);
            pdf_create($html, $cur_date);
        }
    }

This is my view page, from which I export data to PDF:

    Name Country State Town

Here is the result I am getting:

    page:1
    Name    country  State     Town
    udaya   india    Tamilnadu kovai
    chandru srilanka columbo   aaaaa

    page:2
    vivek   england  gggkj     gjgjkj

On page 2 I don't get the headers Name, Country, State and Town.


Very heavy .PDF file. How to handle it?

- by Luigi

I need to print this file on paper, but it is large (2.3 MB) and the printing process is very slow. That is not the whole problem, though. I first have to create another .pdf file (grayscale) with four pages on each sheet. When I try to print this file to a new .pdf file, creating the new file is even slower than printing, and the output file is much larger than the original (hundreds of megabytes). How can I shrink the pdf file? Is there a way to create a printer-friendly .pdf version of the file? Before you ask, I can't simply print it on pages. I must create this .pdf file with four pages on each sheet.


Screen Scraping HTML with C#

- by WildBill

I have been given the task at work of screen scraping one of our legacy web apps to extract certain data from the code. The data is formatted and "should" be displayed exactly the same every time. I am just not sure how to go about doing this. It's a full html file with header and footer navigations but in the middle of all this is the data I need. I need to extract the Company Name value, Contact Name, Telephone, email address, etc. Here is an example of what the code looks like:

    ...html above here
    <br /><br />
    <table cellpadding="0" cellspacing="12" border="0">
      <tr>
        <td valign="top" align="center">
          <table cellpadding="0" cellspacing="0" border="0">
            <tr>
              <td class="black">
                <table cellspacing="1" cellpadding="0" border="0" width="370">
                  <tr>
                    <th>ABC INDUSTRIES</th>
                  </tr>
                  <tr>
                    <td class="search">
                      <table cellpadding="5" cellspacing="0" border="0" width="100%">
                        <tr>
                          <td>
                            <table cellpadding="1" cellspacing="0" border="0" width="100%">
                              <tr>
                                <td align="center" colspan="2"><hr></td>
                              </tr>
                              <tr>
                                <td align="right" nowrap><b><font color="FF0000">Contact Person <img src="/images/icon_contact.gif" align="absmiddle"> :</font></b></td>
                                <td align="left" width="100%"> Joe Smith</td>
                              </tr>
                              <tr>
                                <td align="right" nowrap><b><font color="FF0000">Phone Number <img src="/images/icon_phone.gif" align="absmiddle"> :</font></b></td>
                                <td align="left" width="100%"> 555-555-5555</td>
                              </tr>
                              <tr>
                                <td align="right" nowrap><b><font color="FF0000">E-mail Address <img src="/images/icon_email.gif" align="absmiddle"> :</font></b></td>
                                <td align="left" width="100%"> <a HREF="mailto:[email protected]">[email protected]</a></td>
                              </tr>
    more...

There is more code on the screen in a different table structure that I also need to pull.
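
The markup above follows a regular pattern: the company name sits alone in a `<th>` row, and each field is a two-cell row whose first cell ends in a colon. A minimal sketch of walking that pattern with an event-based HTML parser is shown below, in Python rather than C#. The class and function names are hypothetical, and the sketch assumes every field row in the real page has exactly this label-and-value shape.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Collect <th> headings and `label : value` pairs from table rows."""

    def __init__(self):
        super().__init__()
        self.rows = []      # stack of open rows; each row is a list of [tag, text]
        self.fields = {}    # label -> value, e.g. "Contact Person" -> "Joe Smith"
        self.headings = []  # text of rows holding a single <th>, e.g. the company

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])              # nested tables push nested rows
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append([tag, ""])   # start a new cell in the open row

    def handle_data(self, data):
        if self.rows and self.rows[-1]:
            self.rows[-1][-1][1] += data      # text goes to the most recent cell

    def handle_endtag(self, tag):
        if tag != "tr" or not self.rows:
            return
        # Collapse runs of whitespace (including non-breaking spaces) per cell.
        cells = [(kind, " ".join(text.split())) for kind, text in self.rows.pop()]
        if len(cells) == 1 and cells[0][0] == "th":
            self.headings.append(cells[0][1])
        elif len(cells) == 2 and cells[0][1].endswith(":"):
            self.fields[cells[0][1].rstrip(": ")] = cells[1][1]


def extract_fields(html):
    """Return (headings, fields) found in `html`."""
    parser = LabelValueParser()
    parser.feed(html)
    parser.close()
    return parser.headings, parser.fields
```

Because rows are kept on a stack, the deeply nested layout tables do not confuse the pairing: only the innermost row that closes with exactly two cells is treated as a field.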


Alternatives to ImageMagick for PDF downsizing

- by Craig Coston

We are having an issue with some PDF files not displaying properly in our iPad app. I have come to the conclusion that we need to standardize by "converting" PDF to PDF. I have successfully done this by using ImageMagick to convert the PDF to PNG (resized), and then pushing the PNG(s) back into a PDF. However, something within ImageMagick is making photos within PDFs display wrong. The same issue occurs when just converting a JPG or other graphic to PDF in ImageMagick. I solved that by taking the output of the converted ImageMagick file and converting it again to PNG using GD, then pushing it through our PDF converter. So my question is this: what other PHP workflows would work for this, other than using ImageMagick for the conversion back to PDF? We are not opposed to a paid solution; we just need something that works. Our server runs CentOS.


scale down pdf to fit in iframe in uiwebview

- by Claes Gustavsson

I have a uiwebview that loads an iframe to display a pdf file. But only the top left corner of the pdf file is visible. I need to fit the pdf file inside the iframe; that is, I need to scale down the pdf file so that all of it is visible. I know that with webpages you can open a pdf and set the starting scale on it like this: page1.pdf#zoom=25, but that doesn't work in the webview. Does anybody know what to do?


pdf file creation in php

- by pavun_cool

I have used the following code to create a simple pdf file. It executes fine in the browser, but I am not able to get the pdf file. It does give me some output when I run the code from the CLI. My question is: where do I need to specify the pdf file name (the file to be created)?

    <?php
    require('fpdf.php');

    $pdf=new FPDF();
    $pdf->AddPage();
    $pdf->SetFont('Arial','B',16);
    $pdf->Cell(40,10,'Hello World!');
    $pdf->Output();
    ?>

CLI output:

    2 0 obj << /Type /Page /Parent 1 0 R /Contents 3 0 R endobj
    3 0 obj << /Length 4 0 R stream
    2.834646 0 0 2.834646 0 841.9 cm
    2 J
    0.2 w
    BT /F1 5.64 Tf ET
    BT 11 -16.692 Td (Hello World!) Tj ET


FPDF Page Break Question

- by Michael

I am using PHP and FPDF to generate a PDF with a list of items. My problem is if the item list goes on to a second or third page, I want to keep the Item Name, Quantity and Description together. Right now, it will go to a second page, but it may split up all of the details for a particular item. PLEASE HELP!

    <?php
    require_once('auth.php');
    require_once('config.php');
    require_once('connect.php');

    $sqlitems="SELECT * FROM $tbl_items WHERE username = '" . $_SESSION['SESS_LOGIN'] . "'";
    $resultitems=mysql_query($sqlitems);

    require_once('pdf/fpdf.php');
    require_once('pdf/fpdi.php');

    $pdf =& new FPDI();
    $pdf->AddPage('P', 'Letter');
    $pdf->setSourceFile('pdf/files/healthform/meds.pdf');
    $tplIdx = $pdf->importPage(1);
    $pdf->useTemplate($tplIdx);
    $pdf->SetAutoPageBreak(on, 30);
    $pdf->SetTextColor(0,0,0);
    $pdf->Ln(10);

    while($rowsitems=mysql_fetch_array($resultitems)){
        $pdf->SetFont('Arial','B',10);
        $pdf->Cell(50,4,'Item Name:',0,0,'L');
        $pdf->SetFont('');
        $pdf->Cell(100,4,$rowsitems['itemname'],0,0,'L');
        $pdf->SetFont('Arial','B',10);
        $pdf->Cell(50,4,'Quantity:',0,0,'L');
        $pdf->SetFont('');
        $pdf->Cell(140,4,$rowsitems['itemqty'],0,1,'L');
        $pdf->SetFont('Arial','B');
        $pdf->Cell(50,4,'Description:',0,0,'L');
        $pdf->SetFont('');
        $pdf->Cell(140,4,$rowsitems['itemdesc'],0,1,'L');
    }

    $pdf->Output('Items.pdf', 'I');
    ?>
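
One common way to keep a block together is to check, before drawing each item, whether the whole block still fits above the page-break threshold, and to start a new page first if it does not. The Python sketch below models only that arithmetic and is not FPDF code. It assumes the threshold is the page height minus the bottom margin, and `assign_pages` is a hypothetical name. In the PHP loop above, the equivalent check would presumably compare FPDF's `GetY()` plus the item height against that threshold and call `AddPage()` when it is exceeded.

```python
def assign_pages(block_heights, page_height, bottom_margin, top_y):
    """Return the 1-based page number on which each block is drawn.

    A block moves to a fresh page when it would cross the break
    threshold, so it is never split across two pages.
    """
    limit = page_height - bottom_margin  # y position that triggers a page break
    page, y = 1, top_y
    pages = []
    for height in block_heights:
        # The `y > top_y` guard stops a block taller than a page from
        # triggering endless page breaks.
        if y + height > limit and y > top_y:
            page += 1
            y = top_y
        pages.append(page)
        y += height
    return pages


# Five items of height 8 on a page where drawing runs from y=10 to y=30.
print(assign_pages([8] * 5, page_height=40, bottom_margin=10, top_y=10))
# [1, 1, 2, 2, 3]
```

In the loop from the question each item occupies two 4-unit lines, so its block height would be 8.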


Show a PDF file to the user with perl or php, not creating it nor download it. Just show it.

- by dimassony

Hello guys. I want to show my users PDF files. The reason I use CGI to show the pdf is that I want to track the clicks for the pdf and cloak the real location of the saved pdf. I've been searching the Internet and only found how to show a save dialog to the users and how to create a pdf, not how to show the files to the users. What I want is to show the users my pdf files, not to create or download the pdf. Here is what I got from the official PHP documentation:

    <?php
    header('Content-type: application/pdf');
    readfile('the.pdf');
    ?>

And here is the Perl code from my Google search results:

    open(PDF, "the.pdf") or die "could not open PDF [$!]";
    binmode PDF;
    my $output = do { local $/; <PDF> };
    close (PDF);
    print "Content-Type: application/pdf\n";
    print "Content-Length: " .length($output) . "\n\n";
    print $output

If you know how to do it in Ruby, please tell me, although I'm not sure whether my server supports Rails. Sorry if my code is far from the right way to show the pdf; I don't know anything about pdf processing or how to approach this problem. Let's assume that the users have the Adobe Reader plug-in. So, how do I fix my problem?


How to work around a site forbidding me to scrape their images with PHP

- by Petruza

I'm scraping a site, searching for JPGs to download. Scraping the site's HTML pages works fine. But when I try getting the JPGs with CURL, copy(), fopen(), etc., I get a 403 forbidden status. I know that's because the site owners don't want their images scraped, so I understand a good answer would be "just don't do it, because they don't want you to". OK, but let's say it's fine and I try to work around this; how could it be achieved? If I request the same URL with a browser, I can open the image perfectly. It's not that my IP is banned or anything, and I'm testing the scraper one file at a time, so it's not blocking me for making too many requests too often. From my understanding, it could be that the site is checking for some cookies that confirm I'm using a browser and browsing their site before I download a JPG. Or maybe PHP is using some user agent for the requests that the server can detect and filter out. Anyway, any ideas?


Is pdf virus proof?

- by Jonathan

I am creating a secure document portal for a client. One of the things that they are worried about is having someone upload a document with a virus. One solution that might work is to enforce that they only upload pdf files. My question is twofold:

1. Most importantly, is a pdf document virus proof?
2. How can you determine in a *nix environment that a file is a pdf, besides just looking at the extension?

Thanks! Jonathan
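
For the second part of the question, a file's type can be checked by its leading bytes rather than its name: a PDF file begins with the signature `%PDF-` followed by a version number. The Python sketch below performs only that signature check, and the function name is hypothetical. It is a weak filter by design, since a file can start with a valid signature and still carry harmful content such as embedded JavaScript, so passing this check says nothing about whether the document is safe.

```python
def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF signature.

    This checks the file type only; it is not a virus scan.
    """
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"
```

This is the same idea the `file` utility uses on *nix systems, where type detection is based on content signatures rather than extensions.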


Imagemagick PDF to JPG conversion failing

- by Scott

I'm trying to convert the first page of a PDF to a JPG. I'm pretty sure I got this to work with certain PDFs, but is it really possible that certain PDFs are made incorrectly and cannot be converted? I tried running this first:

    $ convert 10-03-26.pdf[1] test.jpg

And I got the following:

    Error: /syntaxerror in readxref
    Operand stack:
    Execution stack:
       %interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- false 1 %stopped_push 1 3 %oparray_pop 1 3 %oparray_pop --nostringval-- --nostringval-- --nostringval-- --nostringval-- --nostringval-- --nostringval--
    Dictionary stack:
       --dict:1062/1417(ro)(G)-- --dict:0/20(G)-- --dict:73/200(L)-- --dict:73/200(L)-- --dict:97/127(ro)(G)-- --dict:229/230(ro)(G)-- --dict:14/15(L)--
    Current allocation mode is local
    ESP Ghostscript 7.07.1: Unrecoverable error, exit code 1
    convert: Postscript delegate failed `10-03-26.pdf'.

Running this instead:

    $ convert -verbose -colorspace rgb '10-03-26.pdf[1]' test.jpg

I get the following:

    Error: /syntaxerror in readxref
    Operand stack:
    Execution stack:
       %interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- false 1 %stopped_push 1 3 %oparray_pop 1 3 %oparray_pop --nostringval-- --nostringval-- --nostringval-- --nostringval-- --nostringval-- --nostringval--
    Dictionary stack:
       --dict:1062/1417(ro)(G)-- --dict:0/20(G)-- --dict:73/200(L)-- --dict:73/200(L)-- --dict:97/127(ro)(G)-- --dict:229/230(ro)(G)-- --dict:14/15(L)--
    Current allocation mode is local
    ESP Ghostscript 7.07.1: Unrecoverable error, exit code 1
    "gs" -q -dBATCH -dSAFER -dMaxBitmap=500000000 -dNOPAUSE -dAlignToPixels=0 "-sDEVICE=pnmraw" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-g792x1611" "-r72x72" -dFirstPage=2 -dLastPage=2 "-sOutputFile=/tmp/magick-XXU3T44P" "-f/tmp/magick-XXoMKL8Z" "-f/tmp/magic2eec1F"
    Start of Image
    Define Huffman Table 0x00 0 1 5 1 1 1 1 1 1 0 0 0 0 0 0 0
    Define Huffman Table 0x01 0 3 1 1 1 1 1 1 1 1 1 0 0 0 0 0
    Define Huffman Table 0x10 0 2 1 3 3 2 4 3 5 5 4 4 0 0 1 125
    Define Huffman Table 0x11 0 2 1 2 4 4 3 4 7 5 4 4 0 1 2 119
    End Of Image
    convert: Postscript delegate failed `10-03-26.pdf'.

Why would the conversion fail? Just as an aside, this is happening on a (gs) Grid-Service on (mt) Media Temple hosting. I cannot install programs on the server, but both Imagemagick and Ghostscript are installed. Thanks!
