Search Results

Search found 4479 results on 180 pages for 'pdf scraping'.


Is there a good tutorial for figuring out what a website is doing so your program can do the same th

- by brian d foy

Is there a good guide or tutorial for people who need to programmatically interact with dynamic websites? There's been a rash of Perl questions about that lately, and I haven't found a good resource to point people toward. I'm asking not because I need one but because I don't want to waste my time writing it if it already exists. Although I'm most interested in Perl, the extra tools and techniques are mostly the same.

Typically, I see these problems in people's questions:

- Handling, setting, and saving cookies
- Finding and interacting with forms
- Handling JavaScript inside your user-agent, especially things like onLoad, onSubmit, and Ajax
- Using HTTP sniffer tools
- Using Web developer plugins in interactive browsers
- Interacting with DOM, screen scraping, etc.

If there's no good tutorial, I'll add it to my list of things to do (unless someone else wants to do it :). Along the way, if you don't have a suggestion for an existing tutorial, please suggest the things that you think should be in a new one, including links, your favorite tools, and your own user-agent development experiences. I don't care about the particular language you use.
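
As one illustration of the "finding and interacting with forms" item, here is a minimal sketch in Python (chosen only because the question says the language doesn't matter). It uses the standard library's html.parser to list each form's action, method, and named fields. The names FormFinder and find_forms are hypothetical, and the sketch ignores JavaScript-generated forms entirely.

```python
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Collect each <form> with its action, method, and named fields."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per form found
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action", ""),
                "method": (attrs.get("method") or "get").lower(),
                "fields": {},
            }
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea") and self._current is not None:
            name = attrs.get("name")
            if name:  # unnamed controls (e.g. a bare submit button) are not sent
                self._current["fields"][name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def find_forms(html):
    """Return a list of form descriptions found in an HTML string."""
    parser = FormFinder()
    parser.feed(html)
    return parser.forms
```

The returned dictionaries give a program the same information a person would get from "view source": where the form posts to and which field names (including hidden ones) it expects.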

How to protect/monitor your site from crawling by malicious user

- by deathy

Situation:

- Site with content protected by username/password (not all controlled, since they can be trial/test users)
- A normal search engine can't get at it because of username/password restrictions
- A malicious user can still log in and pass the session cookie to a "wget -r" or something else

The question is: what is the best solution to monitor such activity and respond to it (considering the site policy is that no crawling/scraping is allowed)?

I can think of some options:

1. Set up some traffic monitoring solution to limit the number of requests for a given user/IP.
2. Related to the first point: automatically block some user-agents.
3. (Evil :)) Set up a hidden link that, when accessed, logs out the user and disables his account. (Presumably this would not be accessed by a normal user since he wouldn't see it to click it, but a bot will crawl all links.)

For point 1: do you know of a good already-implemented solution? Any experiences with it? One problem would be that some false positives might show up for very active but human users.

For point 3: do you think this is really evil? Or do you see any possible problems with it?

Also accepting other suggestions.
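
The first option (limiting the number of requests for a given user/IP) can be sketched as a sliding-window counter. This is a minimal in-memory illustration in Python, not an existing product; the class name SlidingWindowLimiter and its parameters are hypothetical, and a real site would likely keep the counters in shared storage rather than in one process.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per key within the last window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock                # injectable so it can be tested
        self._hits = defaultdict(deque)   # key -> timestamps of allowed requests

    def allow(self, key):
        """Record a request for key (user ID or IP); return False if over the limit."""
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A request handler would call allow() with the session's user ID or the client IP and respond with an error (or flag the account for review) when it returns False. Choosing a generous limit is one way to reduce the false positives mentioned above for very active human users.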

Groovy htmlunit getFirstByXPath returning null

- by StartingGroovy

I have had a few issues with HtmlUnit returning nulls lately and am looking for guidance. Each of my results for grabbing the first row of a website has returned null. I am wondering if someone can A) explain why they might be returning null and B) explain better ways (if there are some) to go about getting the information.

Here is my current code (URL is in the source):

    client = new WebClient(BrowserVersion.FIREFOX_3)
    client.javaScriptEnabled = false

    def url = "http://www.hidemyass.com/proxy-list/"
    page = client.getPage(url)

    IpAddress = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[2]").getValue()
    println "IP Address is: $IpAddress" //returns null

    //Port_Number is an Image

    Country = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[4][@class='country']/@rel").getValue()
    println "Country abbreviation is: $Country"

    //differentiate speed and connection by name of gif?

    Type = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[7]").getValue()
    println "Proxy type is: $Type"

    Anonymity = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[8]").getValue()
    println "Anonymity Level is: $Anonymity"

    client.closeAllWindows()

Right now all of my XPaths return null, and .getValue() obviously doesn't work on null.

I also have a question about what I should do about the port, since it is an image. Is there a better alternative than downloading it and attempting to solve it by OCR?

Side note: There is no significance in this site; I was just looking for a site that I could practice scraping on (with the last one I ran into issues with fragment identities and couldn't get an answer: see "HtmlUnit getByXpath returns null" and "HtmlUnit and Fragment Identities").

Screenscraping and reverse engineering health based web tool

- by ArbInv

Hi. There is a publicly available free tool which has been built to help people understand the impact of various risk factors on their health / life expectancy. I am interested in understanding the data that sits behind the tool. To get this out would require putting in a range of different socio-demographic factors and analyzing the resulting outputs. This would need to be done across many thousands of different individual profiles. The tool was probably built on some standard BI platform. I have no interest in how the tool was built but do want to get to the data within it.

The site has a Terms of Use Agreement which includes:

- Not copying, distribute, adapt, create derivative works of, translate, or otherwise modify the said tool
- Not decompile, disassemble, reverse assemble, or otherwise reverse engineer the tool
- The said institution retains all rights, title and interest in and to the Tool, and any and all modifications thereof, including all copyright, copyright registrations, trade secrets, trademarks, goodwill and confidential and proprietary information related thereto.

Would I in effect be breaking the law if I were to point a screen scraping tool at it to download the data that sits behind the tool in question? Any advice welcomed. Thanks.

Modify Okular highlight to automatically copy highlighted text into comment

- by JDD

Despite what old SE questions state, the PDF software Okular can now write annotations directly to the PDF. This makes it very useful in conjunction with Docear for annotating academic literature. However, Docear imports annotations from the comments, rather than importing from highlighted text. In Okular, when you highlight text it can then be clicked to reveal a comments bubble, which is empty by default. Copying the highlighted text into the resulting bubble allows it to be imported into Docear, but this is laborious. How can I modify the highlight tool to automatically copy the highlighted text into the resulting comment bubble?

2 pdfs look same on XP, different on Win7

- by David Dai

I have 2 PDF files. I compared them with WinMerge, BeyondCompare, and even compared their checksums. They are exactly the same to me in every way. If I open them with Adobe Reader in XP and compare them with my bare eyes, they look the same. But!!! If I open them with Adobe Reader in Win7 and compare them with my bare eyes, they look very different (particularly border width). I'm sorry I cannot share the 2 PDF files, but I will appreciate it if anyone could come up with any idea!

pdftk utility and batch file

- by duhaas

I can't for the life of me figure out what I'm missing. I have the following batch file: As you can see, when I run this batch file from my desk against a mapped drive it runs just fine: When I run the same exact batch file on the server itself, the place where the mapped drive is located, it doesn't run and makes me think I have a syntax problem: I just don't understand what's going on, and my eyes are having a hard time keeping track of what might be different. The server where it isn't working is Windows 2003; my desktop, where the same batch file is working, is Windows 7.

Here is the batch file, nothing crazy:

    FOR /D /r %%G in ("*") DO pdftk "%%G\*.pdf" cat output "%%G\Report.pdf"
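
For comparison, the same per-directory merge can be sketched in Python, which avoids depending on how a particular Windows shell expands the wildcard. This does not diagnose the Windows 2003 problem; it only restates what the batch file does. The function names are hypothetical, pdftk is assumed to be on the PATH, and unlike the batch file this sketch also visits the root directory and deliberately leaves an existing Report.pdf out of the inputs.

```python
import os
import subprocess


def build_pdftk_commands(root, report_name="Report.pdf"):
    """Build one 'pdftk <inputs> cat output <dir>/Report.pdf' command per directory."""
    commands = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Collect the PDFs in this directory, skipping a previously built report.
        pdfs = sorted(
            os.path.join(dirpath, name)
            for name in filenames
            if name.lower().endswith(".pdf") and name.lower() != report_name.lower()
        )
        if pdfs:
            output = os.path.join(dirpath, report_name)
            commands.append(["pdftk", *pdfs, "cat", "output", output])
    return commands


def run_all(root):
    """Run every merge command; raises CalledProcessError if pdftk fails."""
    for command in build_pdftk_commands(root):
        subprocess.run(command, check=True)
```

Printing the result of build_pdftk_commands() before calling run_all() shows exactly which files each pdftk invocation would receive, which may help when comparing the behavior of two machines.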

Why do my Google sitelinks show gibberish for a PDF link?

- by Tom

I have a website which Google lists nicely along with site links. One of the site links, to a PDF file, shows un-human gibberish, e.g.:

    67,8;45:: 56 83 @7<1. (7/0;,*;: /59( (7/0;,;<7, <7)(60:4 (9<7 /+ +2, VU

I thought it might be due to the PDF's title property, so I changed it. But there hasn't been an improvement to the site link. Other PDF site links are fine and display the title property as desired. Does anyone know how I might rectify this problem or what might be the cause? My uninformed guess is it's some transliteration problem between code and display text which, I suppose, means I ought to recondition the PDF file in some way. Not sure how.

Printing the second page from a Calc sheet

- by Luke

I've created a simple invoice document in LibreOffice Calc and it consists of 2 pages. I have defined the print range for my document as: $A$1:$D$33,$D$34:$D$35. My first page holds the actual invoice information; the second page is a single merged cell holding terms and conditions text (wrapped to the cell width). The second page is defined by a row break. When I export the sheet as a PDF the first page comes out great, but the second page with the terms text is all wrong. On the left-hand side I see a portion of text (looks like a single column), and when I select the text inside the PDF I can see it go off the page somewhere to the left. I get the same result in a print preview. I'm at a complete loss on how to approach this problem and any insight is much appreciated.

How can I use Font Awesome vectors in Illustrator?

- by david.joyce13

I'm stumped. As directed, I printed the Font Awesome Cheatsheet to PDF. When I open it with Acrobat Reader, it looks fine. However, when I try to open it with Illustrator, I get this warning: The font MuseoSlab-500 is missing. Affected text will be displayed using a substitute font. The font OTS-derived-font is missing. Affected text will be displayed using a substitute font. The font ProximaNova-Regular is missing. Affected text will be displayed using a substitute font. How can I 'fix' the PDF, so that I can see and use the icons in Illustrator?

How do I save a Web Page?

- by Remus Rigo

Hi all, I have tried many programs and solutions to save web pages (html, mht, doc, pdf). My favorite software was an addon for browsers from Omnipage (OCR). What I like about this is that it prints the whole page (continuously) and it doesn't write the URL or page numbers on the page, which I find annoying. Does anyone know a software like this one (freeware or not)? I tried CutePDF and it didn't work for me. I want this for my offline use and would prefer a PDF.

Reference manager for Ubuntu

- by user36511

I'm in dire need of a reference/citation manager in Ubuntu. The features I need the most are:

1) Metadata extraction/editing of PDFs
2) Fetch metadata from online databases such as Google Scholar
3) Attach a PDF or other file to a reference
4) Tag references and recall those with a given tag or set of tags
5) Provide APA style citations for references (in integration with OOffice and/or LaTeX)

Optional: Would be great if it can annotate/highlight PDFs.

Mendeley probably does all of these, but its behavior has driven me insane, especially when the number of references it's trying to handle is large. It constantly tries to sync with the web and creates duplicate references. I've tried JabRef, and while it looks like a decent piece of freeware, it doesn't do some of the above. I found others like Bibus, Referencer, etc. to be lacking, buggy, or no longer in active development. Is there another option, or should I give up the search?

Is there a (free or commercial) print server which print PDFs from networks?

- by Eonil

I'm working in an office which uses a Windows server for printing, because our printer only has a Windows driver. But there are also Mac OS X machines here which require network printing, and I'm sure there is no driver for this printer on the Mac. So I came up with an idea for doing this: on the Mac, a virtual printer driver generates a PDF file and sends it to the print server. The print server then prints the PDF file with its local printer. Is there a solution that can do this? (free or commercial)

Programmatic Reaction to Receiving New Email

- by vicatcu

I'm interested in automating some reactive work I do when receiving certain emails in one of my email accounts. What I would like to have happen is:

1. On receipt of new email in the account
2. If the new email passes the "Need to React" criteria (based on body content and subject line)
   3a. Scrape some content out of the email body and subject lines
   3b. Populate a template form (e.g. Excel spreadsheet) with the scraped data
   3c. Print the populated form and save the populated form in some folder (e.g. as a pdf)

What's the best (defined as easiest to implement by myself) approach / combination of technologies for achieving this automation?
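
Steps 2 and 3a (checking the "Need to React" criteria and scraping fields) can be sketched with Python's standard email and re modules. The patterns below are purely hypothetical stand-ins for whatever the real criteria are, the function names are invented for illustration, and fetching the mail (step 1) and filling or printing the form (3b, 3c) are left out.

```python
import re
from email import message_from_string

# Hypothetical "Need to React" criteria: a subject starting with an order
# number, and a body line carrying a dollar total.
SUBJECT_PATTERN = re.compile(r"^Order (?P<order_id>\d+)", re.IGNORECASE)
AMOUNT_PATTERN = re.compile(r"Total:\s*\$(?P<amount>\d+(?:\.\d{2})?)")


def plain_text_body(msg):
    """Return the first text/plain payload of a message (no transfer decoding)."""
    if msg.is_multipart():
        for part in msg.walk():
            if part.get_content_type() == "text/plain":
                return part.get_payload()
        return ""
    return msg.get_payload()


def extract_fields(raw_email):
    """Return the scraped fields, or None if the email does not meet the criteria."""
    msg = message_from_string(raw_email)
    subject_match = SUBJECT_PATTERN.search(msg.get("Subject", ""))
    amount_match = AMOUNT_PATTERN.search(plain_text_body(msg))
    if not (subject_match and amount_match):
        return None
    return {
        "order_id": subject_match.group("order_id"),
        "amount": amount_match.group("amount"),
    }
```

The dictionary returned by extract_fields() is the natural input for step 3b: each key maps to one cell or placeholder in the template.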

Crystal Enterprise PDF Files

- by Vivek Chandraprakash

Hi, I have a website that generates a PDF file from CSP. Will this PDF file be generated on the server before it is shown to the user? Will it be saved on the server? Is there a way to copy these files? -Vivek

android webview pdf

- by shriya

Hi, I want to display PDF contents in a WebView. Here is my code:

    WebView webview = new WebView(this);
    setContentView(webview);
    webview.getSettings().setJavaScriptEnabled(true);
    webview.loadUrl("http://www.adobe.com/devnet/acrobat/pdfs/pdf_open_parameters.pdf");

I am getting a blank screen. I have set the internet permission also. Thanks

iphone PDF view CGPDFDocument

- by i.novice

I am developing an app where I need to show PDF documents. After many hours of googling I was able to build a view that shows a PDF document fetched from a URL. I only know how to display a single page, using CGPDFDocumentGetPage(ref, pageNumber).

What I would like to have:

- Pagination
- Zoom
- Scrolling

Password protected PDF/Excel reports using SQL Reporting services

- by balaweblog

I am generating reports using SQL Reporting Services 2005. I need to protect the reports (PDF and Excel) with a password when the user exports the report to Excel or PDF. Is there any way to protect the report?

Display large PDF using iPhone SDK

- by MadJawa

Hello, I was wondering what is the best way to display a big PDF file (it's a map, actually) using the iPhone SDK. The file is around 5MB, and it's really slow in a UIWebView. I want to be able to scroll through the PDF and zoom in/out. Also, do you think that it would be better to convert it to a PNG? Thanks in advance

Create a table of contents from a pdf file

- by ryan

I'm using quartz to display pdf content, and I need to create a table of contents to navigate through the pdf. From reading Apple's documentation I think I am supposed to use CGPDFDocumentGetCatalog, but I can't find any examples on how to use this anywhere. Any ideas?

Does not display the PDF files in listview thumbnail format c# 2008

- by nithi

I'm trying to display PDF files in a ListView as thumbnails, like Windows Explorer does. I have no idea how to go about this. How can I display the PDF files? Please suggest some ideas. Thanks in advance.

Available alternative libraries in java to generate PDF documents

- by Fazal

I have been using XSL-FO and the FOP engine to generate PDF documents for required data. This works great, but lately I have seen some limitations in FOP, especially when it comes to allowing the user to enter text in an HTML editor which can be transformed to XSL-FO and given to the FOP driver. This brought me to ask this large community of well-informed individuals: what are the possible open source, or even non-open-source, libraries for generating PDF documents in Java?

iphone : open pdf in safari

- by yakub_moriss

Hi all, can we open a PDF from a website in Safari, so that we can save that PDF to the local disk? Thanks in advance...

Store png/pdf file in sqlite

- by milanjansari

Hello, how can I store a PDF or PNG file in SQLite? Is it possible to store PDF and PNG files? Please help me out. If you have any ideas, please share them with me. Thank you,
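
SQLite can hold arbitrary bytes in a BLOB column, so a PDF or PNG can be stored as-is. The sketch below uses Python's standard sqlite3 module only for illustration (the question does not name a platform); the table name files and the helper names are hypothetical, and the same CREATE/INSERT/SELECT statements would apply from other SQLite bindings.

```python
import sqlite3


def store_file(conn, name, data):
    """Insert (or overwrite) a file's raw bytes under the given name."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO files (name, data) VALUES (?, ?)",
        (name, sqlite3.Binary(data)),
    )
    conn.commit()


def load_file(conn, name):
    """Return the stored bytes for name, or None if nothing is stored."""
    row = conn.execute(
        "SELECT data FROM files WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else bytes(row[0])
```

In practice the bytes would come from reading the file in binary mode. For large files, a common alternative is to keep the file on disk and store only its path in the database.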

Crawling within a pdf

- by Saubhagya

Hi, I'm developing a tool that searches for a keyword entered by the user on a given site. My problem is that it searches for the keyword only in HTML/web pages, not in the PDF/MS-Word files found on the site. Can anyone suggest an API/tool, or provide code, that can search the text of a given online PDF/MS-Word/text file?

