Search Results

Search found 210 results on 9 pages for 'scrape'.

Page 5/9 | < Previous Page | 1 2 3 4 5 6 7 8 9  | Next Page >

  • How to set timeout with python-mechanize?

    - by Michal Cihar
    I'm using python-mechanize to scrape some web sites, which sometimes simply don't respond to requests, and these requests stay open too long, so I need to limit the timeout for them. When using the urlopen method, the timeout can be set with the timeout parameter, but I have not found an easy way to do this with higher-level API calls such as the submit or click methods. Ideally the timeout would be set just once for the whole Browser class and all calls would honor it. It would probably be possible to customize this by passing a custom request_class to every click and submit call, but that would just pollute the code, so I'm looking for a nicer way to set a timeout on mechanize's Browser class (and no, I don't want to change the default socket timeout using socket.setdefaulttimeout).
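
    One approach, sketched below in Python and untested: mechanize's click() and click_link() build a request object without sending it, so every submission can be routed through open(), which (in the mechanize versions that mirror urllib2's timeout support) accepts a timeout keyword. Whether your installed mechanize exposes that keyword is an assumption here, and the form and field names are illustrative only.

        # Sketch: funnel submits through open() so a single timeout applies everywhere.
        # Assumes Browser.open() accepts timeout= in the installed mechanize version.
        import mechanize

        TIMEOUT = 15.0  # seconds

        br = mechanize.Browser()
        br.open("http://example.com/form", timeout=TIMEOUT)
        br.select_form(nr=0)                          # illustrative: first form on the page
        br["q"] = "something"                         # illustrative field name

        request = br.click()                          # build the submit request, don't send it
        response = br.open(request, timeout=TIMEOUT)  # send it with the timeout applied
        print(response.code)

    Wrapping those last two lines in a small helper gets close to the "set it once for the whole browser" behavior without touching socket.setdefaulttimeout.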

    Read the article

  • Jaxer and HTTP proxy requests...

    - by rakhavan
    Thanks to everyone in advance. I'm using Jaxer.Sandbox and making requests just fine. I'd like these requests to go through my HTTP proxy (Squid, for example). Here is the code that is currently working for me:

        window.onload = function() {
            //the url to scrape
            var url = "http://www.cnn.com/";
            //our sandboxed browser
            var sandbox = new Jaxer.Sandbox();
            //open options
            var openOptions = new Jaxer.Sandbox.OpenOptions();
            openOptions.allowJavaScript = false;
            openOptions.allowMetaRedirects = false;
            openOptions.allowSubFrames = false;
            openOptions.allowSubFrames = false;
            openOptions.onload = function() {
                //do something onload
            };
            //make the call
            sandbox.open(url, null, openOptions);
            //write the response
            Jaxer.response.setContents(sandbox.toHTML());
        };

    How can I send this request through a proxy server? Thanks, Reza.

    Read the article

  • A database of questions with unambiguous numeric answers.

    - by dreeves
    My co-hackers and I are building a sort of trivia game inspired by this blog post: http://messymatters.com/calibration. The idea is to give confidence intervals and learn how to be calibrated (when you're "90% sure" you should be right 90% of the time). We're thus looking for, ideally, thousands of questions with unambiguous numerical answers. Also, they shouldn't be too boring. There are a lot of random statistics out there -- e.g., enclosed water area in different countries -- that would make the game mind-numbing. Things like release dates of classic movies are more interesting (to most people). Other interesting ones we've found include Olympic records, median incomes for different professions, dates of famous inventions, and celebrity ages. Scraping things like the above, by the way, was my reason for asking this question: http://stackoverflow.com/questions/2611418/scrape-html-tables. So, if you know of other sources of interesting numerical facts (in a parsable form), I'm eager for pointers to them. Thanks!

    Read the article

  • Legality, terms of service for performing a web crawl

    - by Berlin Brown
    I was going to crawl a site for some research I was collecting, but apparently the terms of service are quite clear on the topic. Is it illegal to not "follow" the terms of service, and what can the site normally do? Here is an example clause in the TOS (also, what about sites that don't provide this particular clause?). Restrictions: "use any robot, spider, site search application, or other automated device, process or means to access, retrieve, scrape, or index the site". It is just research? Edit: "OK, from the standpoint of designing an efficient crawler: should I provide some form of natural language engine to read terms of service and then abide by them?"

    Read the article

  • Having trouble scraping an ASP.NET web page

    - by Seth
    I am trying to scrape an ASP.NET website but am having trouble getting the results from a post. I have the following Python code and am using httplib2 and BeautifulSoup:

        conn = Http()
        # do a get first to retrieve important values
        page = conn.request(u"http://somepage.com/Search.aspx", "GET")
        # event_validation and viewstate variables retrieved from GET here...
        body = {"__EVENTARGUMENT" : "",
                "__EVENTTARGET" : "",
                "__EVENTVALIDATION": event_validation,
                "__VIEWSTATE" : viewstate,
                "ctl00_ContentPlaceHolder1_GovernmentCheckBox" : "On",
                "ctl00_ContentPlaceHolder1_NonGovernmentCheckBox" : "On",
                "ctl00_ContentPlaceHolder1_SchoolKeyValue" : "",
                "ctl00_ContentPlaceHolder1_SchoolNameTextBox" : "",
                "ctl00_ContentPlaceHolder1_ScriptManager1" : "ctl00_ContentPlaceHolder1_UpdatePanel1|cct100_ContentPlaceHolder1_SearchImageButton",
                "ct100_ContentPlaceHolder1_SearchImageButton.x" : "375",
                "ct100_ContentPlaceHolder1_SearchImageButton.y" : "11",
                "ctl00_ContentPlaceHolder1_SuburbTownTextBox" : "Adelaide,SA,5000",
                "hiddenInputToUpdateATBuffer_CommonToolkitScripts" : 1}
        headers = {"Content-type": "application/x-www-form-urlencoded"}
        resp, content = conn.request(url, "POST", headers=headers, body=urlencode(body))

    When I print content I still seem to be getting the same results as the GET. Is there a fundamental concept I'm missing about retrieving the result values of an ASP.NET post?
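
    One common gotcha: ASP.NET posts back form controls under their name attributes, which use '$' separators (ctl00$ContentPlaceHolder1$...), while the '_' variants are the client-side IDs; posting the underscore names is typically ignored, which looks exactly like getting the GET page again. Below is a minimal, untested sketch of the usual flow with httplib2 and BeautifulSoup (Python 2 style to match the question); the search control names are illustrative guesses, not taken from the real page.

        # Sketch: pull the ASP.NET hidden fields out of the GET response, then POST
        # them back together with the '$'-separated control names.
        from urllib import urlencode
        from httplib2 import Http
        from BeautifulSoup import BeautifulSoup

        url = "http://somepage.com/Search.aspx"
        conn = Http()
        resp, content = conn.request(url, "GET")

        soup = BeautifulSoup(content)
        viewstate = soup.find("input", {"id": "__VIEWSTATE"})["value"]
        event_validation = soup.find("input", {"id": "__EVENTVALIDATION"})["value"]

        body = {
            "__EVENTTARGET": "",
            "__EVENTARGUMENT": "",
            "__VIEWSTATE": viewstate,
            "__EVENTVALIDATION": event_validation,
            # use the '$'-separated names from the <input name="..."> attributes
            "ctl00$ContentPlaceHolder1$SuburbTownTextBox": "Adelaide,SA,5000",
            "ctl00$ContentPlaceHolder1$SearchImageButton.x": "1",
            "ctl00$ContentPlaceHolder1$SearchImageButton.y": "1",
        }
        headers = {"Content-type": "application/x-www-form-urlencoded"}
        resp, content = conn.request(url, "POST", headers=headers, body=urlencode(body))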

    Read the article

  • See if any application has a DLL from the GAC loaded

    - by rwmnau
    I'm trying to deploy new copies of my DLL to the GAC on remote servers, but I need to identify whether any currently running processes have a loaded copy of the DLL I'm replacing - I'd like to restart them, or at least tell the user. For example, BizTalk seems to load the DLLs it needs the first time they're used, and then replacing them keeps the old copy in memory until the Host Instances are restarted - something I could easily do as part of my deployment. Is there a way to tell, using .NET, which processes have loaded a particular DLL from the GAC? UPDATE: Some further investigation shows that Process Explorer has this functionality, and that another Sysinternals tool, ListDLLs, does exactly what I want to be able to do. I'd like to know how they do it, since I'd love to replicate this functionality in my application without having to include and screen-scrape ListDLLs (if that's even allowed under the license).

    Read the article

  • Programmatically grabbing text from a web page that is dynamically generated.

    - by bstullkid
    There is a website I am trying to pull information from in Perl; however, the section of the page I need is being generated using JavaScript, so all you see in the source is <div id="results"></div>. I need to somehow pull out the contents of that div and save them to a file using Perl/proxies/whatever. E.g., the information I want to save would be document.getElementById('results').innerHTML; I am not sure if this is possible, or if anyone has any ideas or a way to do this. I was using a lynx source dump for other pages, but since I can't screen scrape this page in a straightforward way, I came here to ask about it! If anyone is interested, the page is http://downloadcenter.trendmicro.com/index.php?clk=left_nav&clkval=pattern_file&regs=NABU and the info I am trying to get is the row about the ConsumerOPR.

    Read the article

  • How to select nth element of particular type in enlive?

    - by Mad Wombat
    I am trying to scrape some data from a page with a table-based layout. So, to get some of the data I need to select something like the 3rd table inside the 2nd table inside the 5th table inside the 1st table inside the body. I am trying to use enlive, but cannot figure out how to use nth-of-type and other selector steps. To make matters worse, the page in question has a single top-level table inside the body, but (select data [:body :> :table]) returns 6 results for some reason. What the hell am I doing wrong?

    Read the article

  • Reading Ontology with Jena, feeding it with RDF triples, and producing correct RDF string output.

    - by JonB
    Hi, I have an ontology, which I read in with Jena to help me scrape some RDFa triples from a website. I don't currently store these triples in a Jena model, but that is fairly straightforward to do; it's next on my to-do list. The area I am struggling with, though, is getting Jena to output correct RDF for the ontology I have. The ontology uses OWL and RDFS definitions, but when I pass some example triples into the model, they don't appear correctly, almost as if it doesn't know anything about the ontology. The output is, however, still valid RDF; it's just not coming out in the form I was hoping for. Am I correct in thinking that Jena should be able to produce well-written RDF (not just valid) for the triples I have collected, based on the ontology, or does this go beyond what it is capable of? Many thanks for any input.

    Read the article

  • Why Shouldn't I Programmatically Submit Username/Password to Facebook/Twitter/Amazon/etc?

    - by viatropos
    I wish there was a central, fully customizable, open source, universal login system that allowed you to login and manage all of your online accounts (maybe there is?)... I just found RPXNow today after starting to build a Sinatra app to login to Google, Facebook, Twitter, Amazon, OpenID, and EventBrite, and it looks like it might save some time. But I keep wondering, not being an authentication guru, why couldn't I just have a sleek login page saying "Enter username and password, and check your login service", and then in the background either scrape the login page from say EventBrite and programmatically submit the form with Mechanize, or use an API if there was one? It would be so much cleaner and such a better user experience if they didn't have to go through popups and redirects and they could use any previously existing accounts. My question is: What are the reasons why I shouldn't do something like that? I don't know much about the serious details of cookies/sessions/security, so if you could be descriptive or point me to some helpful links that would be awesome. Thanks!

    Read the article

  • Organizing a lot of models that use STI in rails

    - by DavidP6
    I have a scenario where I am going to be creating a large number of models that use STI and I'm wondering what the best way to organize this is. I already have other models using STI and I really do not want to add any more files to my models folder. Is there any way to create a folder and add the models using STI there (there could be upwards of 40 b/c each uses its own methods to scrape a different site, but they all save the same data)? This seems like it would be best, or I could add them all to one file but I would rather separate them.

    Read the article

  • Most efficient way to update a MySQL Database on a Linux host with that of an ASP.NET Form on Windows

    - by NJTechGuy
    My kind webhost (1and1) royally asked me to go elsewhere to do something like this. I have 2 sites. One of them was developed by a .NET programmer. Now I am contracted to implement a PHP site and fetch data from the .NET site. There is an ASP.NET form that a customer fills in, and when they hit submit, the data gets stored in a SQL Server DB. How do I also store the same data in MySQL in parallel? I cannot directly use database connectors with ASP.NET since MySQL connectivity is not supported on 1and1 Windows hosting (biz account, no less!). What I thought of is to publish an RSS feed of entries on the ASP.NET site and routinely scrape that data into MySQL on the Linux host. It is overkill, I know, and not efficient. I thought I would pick the best brains on SOF to get a different, efficient opinion. Thanks in advance guys...
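
    For the "poll the feed, mirror it into MySQL" route, a cron job on the Linux host is usually enough. The sketch below is in Python with feedparser and MySQLdb purely to make it concrete (a PHP script with SimpleXML/PDO would be the direct equivalent); the feed URL, credentials, table and column names are placeholders, and INSERT IGNORE assumes a unique index on the link column so reruns don't duplicate rows.

        # Sketch: mirror new RSS entries from the ASP.NET site into a MySQL table.
        import feedparser
        import MySQLdb

        FEED_URL = "http://example.com/submissions.rss"   # placeholder

        feed = feedparser.parse(FEED_URL)

        db = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                             db="mirror", charset="utf8")
        cur = db.cursor()
        for entry in feed.entries:
            # relies on a UNIQUE index on `link` so duplicates are skipped
            cur.execute(
                "INSERT IGNORE INTO submissions (title, link, published) "
                "VALUES (%s, %s, %s)",
                (entry.get("title"), entry.get("link"), entry.get("published")))
        db.commit()
        db.close()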

    Read the article

  • Screen scraping C application without using OCR or DOM?

    - by Mrgreen
    We have a legacy system that is essentially a glorified telnet interface. We cannot use an alternative telnet client program to connect to the system, since there are special features built into the client software they have provided us. I want to be able to screen scrape from this program, but that's proving very difficult. I have tried using WindowSpy and Spy++ to check the window text, and it comes up blank. It's a custom C program written by the vendor (they have even disabled selecting text). I'm really looking for a free option, and something I may perhaps be able to use in conjunction with a scripting language. It seems the only ways to grab the text are directly from the Windows GDI or from memory, but that seems a little extreme. Can anyone recommend any software/DLLs that might be able to accomplish this? I'd be extremely appreciative.

    Read the article

  • How can I use Perl to grab text from a web page that is dynamically generated with JavaScript?

    - by bstullkid
    There is a website I am trying to pull information from in Perl; however, the section of the page I need is being generated using JavaScript, so all you see in the source is <div id="results"></div>. I need to somehow pull out the contents of that div and save them to a file using Perl/proxies/whatever. E.g., the information I want to save would be document.getElementById('results').innerHTML; I am not sure if this is possible, or if anyone has any ideas or a way to do this. I was using a lynx source dump for other pages, but since I can't screen scrape this page in a straightforward way, I came here to ask about it! If anyone is interested, the page is http://downloadcenter.trendmicro.com/index.php?clk=left_nav&clkval=pattern_file&regs=NABU and the info I am trying to get is the row about the ConsumerOPR.
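
    The underlying issue is that lynx (and any plain HTTP fetch) never runs the page's JavaScript, so the div stays empty; one workaround is to drive a real browser and read the DOM after the scripts have run. The sketch below uses Python with Selenium purely to illustrate the approach; in Perl, a Selenium binding or WWW::Mechanize::Firefox would play the same role. The output filename is arbitrary, and the page may need an explicit wait if the div is filled asynchronously.

        # Sketch: let a real browser execute the JavaScript, then pull the generated
        # contents of the #results div out of the live DOM and save them.
        from selenium import webdriver

        driver = webdriver.Firefox()
        try:
            driver.get("http://downloadcenter.trendmicro.com/index.php"
                       "?clk=left_nav&clkval=pattern_file&regs=NABU")
            html = driver.execute_script(
                "return document.getElementById('results').innerHTML;")
            with open("results.html", "w", encoding="utf-8") as f:
                f.write(html)
        finally:
            driver.quit()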

    Read the article

  • What's the requests/second standard for scraping websites?

    - by feydr
    This was the closest question to mine, and it wasn't really answered very well IMO: http://stackoverflow.com/questions/2022030/web-scraping-etiquette. I'm looking for the answer to #1: how many requests/second should you be doing when you scrape? Right now I pull from a queue of links. Every site that gets scraped has its own thread and sleeps for 1 second in between requests. I ask for gzip compression to save bandwidth. Are there standards for this? Surely all the big search engines have some set of guidelines they follow with regard to this.
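
    There is no single published number; the usual etiquette is roughly one request per second per host unless the site's robots.txt declares a Crawl-delay, in which case honor that instead. A minimal per-host throttle sketch in Python 3 (my own illustration, not a standard; crawl_delay() needs Python 3.6+):

        # Sketch: wait long enough between requests to the same host, honoring a
        # declared Crawl-delay and falling back to one request per second.
        import time
        import urllib.robotparser
        from urllib.parse import urlparse

        DEFAULT_DELAY = 1.0   # seconds between requests to one host
        _last_hit = {}        # host -> timestamp of the previous request

        def polite_wait(url, user_agent="my-crawler"):
            host = urlparse(url).netloc
            # real code should cache the parsed robots.txt per host
            rp = urllib.robotparser.RobotFileParser("http://%s/robots.txt" % host)
            try:
                rp.read()
                delay = rp.crawl_delay(user_agent) or DEFAULT_DELAY
            except OSError:
                delay = DEFAULT_DELAY
            wait = _last_hit.get(host, 0) + delay - time.time()
            if wait > 0:
                time.sleep(wait)
            _last_hit[host] = time.time()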

    Read the article

  • Can you detect a 301 redirect with Microsoft.XMLHTTP object?

    - by dmb
    I'm using VBScript and the Microsoft.XMLHTTP object to scrape some web data. I have a list of URLs to check, but unfortunately some of them 301 redirect to others on the list, so I wind up with redundant data. Is it at all possible to make the XMLHTTP object fail on 301 redirect? Or at least cache the original response header? Or otherwise just let me know what happened? (notes: I have no control over the server I'm requesting data from; when I get new data, I could check if it's redundant, but I'd like to avoid that if possible). Any ideas would be greatly appreciated.
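
    The general technique is to use an HTTP client that does not follow redirects on its own, issue a cheap request, and read the status line and Location header yourself. I can't speak authoritatively to Microsoft.XMLHTTP's options, so the sketch below only illustrates the idea in Python (http.client never auto-follows redirects), with a placeholder URL and plain HTTP for brevity.

        # Sketch: detect a 301/302 by asking at a level that never auto-follows.
        import http.client
        from urllib.parse import urlparse

        def redirect_target(url):
            """Issue one HEAD request; report the status and where it points."""
            parts = urlparse(url)
            conn = http.client.HTTPConnection(parts.netloc)
            conn.request("HEAD", parts.path or "/")
            resp = conn.getresponse()
            status = resp.status
            location = resp.getheader("Location")
            conn.close()
            if status in (301, 302, 307, 308):
                return status, location
            return status, url

        print(redirect_target("http://example.com/old-page"))   # placeholder URL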

    Read the article

  • Best tools to parse reports

    - by Andy Schaefer
    I have a report that I need to parse/scrape for loading into an alternate or queryable data store. The report looks like something akin to: this. My gut is that Perl would do a decent job, but I have several different permutations of the report and I don't really want to write a script around each form. This is a pretty stock type of report, and I have seen that Monarch Pro can parse these types of reports, but I have had a difficult time finding alternatives for parsing them, since I'm looking to do this primarily in a Linux environment. Any suggestions?

    Read the article

  • Rails architecture questions

    - by justinbach
    I'm building a Rails site that, among other things, allows users to build their own recipe repository. Recipes are entered either manually or via a link to another site (think epicurious, cooks.com, etc). I'm writing scripts that will scrape a recipe from these sites given a link from a user, and so far (legal issues notwithstanding) that part isn't giving me any trouble. However, I'm not sure where to put the code that I'm writing for these scraper scripts. My first thought was to put it in the recipes model, but it seems a bit too involved to go there; would a library or a helper be more appropriate? Also, as I mentioned, I'm building several different scrapers for different food websites. It seems to me that the elegant way to do this would be to define an interface (or abstract base class) that determines a set of methods for constructing a recipe object given a link, but I'm not sure what the best approach would be here, either. How might I build out these OO relationships, and where should the code go?

    Read the article

  • getting real link from rss feed link

    - by pfunc
    I am experimenting with scraping certain pages from an RSS feed using curl and PHP. The page scraping was working fine when I was just using actual links, not links from the RSS feeds. However, I realize now that links in RSS feeds are usually just redirects to the actual page (at least this is what it seems like), because now when I scrape a page with the RSS link, it doesn't actually get the information I am looking for. Has anyone encountered this and knows of a workaround? Is there any way to see where the RSS link is redirecting to and capture that value?
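
    Most HTTP clients can follow the redirect and then report the URL they actually ended up on; in PHP/cURL the relevant pieces are CURLOPT_FOLLOWLOCATION together with curl_getinfo() and CURLINFO_EFFECTIVE_URL. As a language-neutral illustration, the same idea in Python (the feed-item link is a placeholder):

        # Sketch: resolve the feed link's redirect chain, then scrape the final URL.
        import urllib.request

        def resolve_final_url(url):
            with urllib.request.urlopen(url) as resp:   # urlopen follows redirects
                return resp.geturl()                    # the URL actually served

        feed_link = "http://example.com/feed-item"      # placeholder
        print(resolve_final_url(feed_link))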

    Read the article

  • Scraping ASP.NET site with Ruby

    - by JillianK
    I would like to scrape the search results of this ASP.NET site using Ruby and preferably just using Hpricot (I cannot open an instance of Firefox): http://www.ngosinfo.gov.pk/SearchResults.aspx?name=&foa=0 However, I am having trouble figuring out how to go through each page of results. Basically, I need to simulate clicking on links like these: <a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$Pager1$2','')" class="blue_11" id="ctl00_ContentPlaceHolder1_Pager1">2</a> <a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$Pager1$3','')" class="blue_11" id="ctl00_ContentPlaceHolder1_Pager1">3</a> etc. I tried using Net::HTTP to handle the post, but while that received the correct HTML, there were no search results (I'm probably not doing that correctly). In addition, the URL of the page does not contain any parameters indicating the page, so it is not possible to force the results that way. Any help would be greatly appreciated.

    Read the article

  • UTF-8 conversion doesn't always work

    - by Marco Piccinni
    I searched through other questions before typing here and didn't find anything similar. I have to scrape various UTF-8 web pages which contain text like "Oggi è una bellissima giornata"; the problem is with the character "è". I extract this text with JTidy and an XPath query expression, and I convert it with

        byte[] content = filteredEncodedString.getBytes("utf-8");
        String result = new String(content, "utf-8");

    where filteredEncodedString contains the text "Oggi è una bellissima giornata". This procedure works on most of the web pages analyzed so far, but in some cases it doesn't extract a UTF-8 string. The page encoding is always the same and the text is similar. Any ideas about the problem? Thanks, Marco

    Read the article

  • How to get regex split from other var

    - by Dean
    Hi,

        $dbLink = mysql_connect('localhost', 'root', 't');
        mysql_select_db('pc_spec', $dbLink);
        $html = file_get_contents('http://localhost/pc_spec/scrape.php?q=amd+955');
        echo $html;
        $html = strip_tags($html);
        $price = ereg("\$.{6}", $html);
        $html = mysql_real_escape_string($html);
        $price = mysql_real_escape_string($price);
        $insert = mysql_query("INSERT INTO parts(part, price) values('$html','$price')") or var_dump(mysql_error());

    How can I get $price to match $.{6} and insert this value (eg. $111.11) into a database and remove it from $html? Do I need to use explode? Thanks

    Read the article

  • Controlling a browser from Python

    - by Noio
    I am looking for a way to control a browser from Python, i.e. fill out form fields and submit them, possibly call JS functions. I've looked around a bit, but as far as I could see PyWebKitGtk only lets you show the browser as a GUI element, not interface with it. Is there a way to do this easily? I wrote my program logic in Python, and I would hate to port it to JS. Besides that, even if I'd use pure JS "bookmarklets", those wouldn't be able to read/write to my local filesystem, would they? P.S. to quell your suspicions, I'm not trying to automatically fill out forum account creation forms or something similarly spammious, though the task is technically similar. I need to crawl/scrape sites for my research project.
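
    Selenium is one straightforward way to do this from Python: it drives a real browser, so filling fields, submitting forms, and calling page JavaScript all work, and because the logic stays in Python, reading and writing the local filesystem is not a problem. A minimal sketch; the URL, field name, and JS snippet are placeholders, and older Selenium releases spell the lookup find_element_by_name instead of find_element(By.NAME, ...).

        # Sketch: drive a real browser from Python with Selenium.
        from selenium import webdriver
        from selenium.webdriver.common.by import By

        driver = webdriver.Firefox()
        try:
            driver.get("http://example.com/form")            # placeholder URL
            field = driver.find_element(By.NAME, "q")        # placeholder field name
            field.send_keys("hello from python")
            field.submit()                                   # submit the enclosing form
            driver.execute_script("console.log('hi');")      # call into the page's JS
            with open("page.html", "w", encoding="utf-8") as f:
                f.write(driver.page_source)                  # local file access is fine
        finally:
            driver.quit()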

    Read the article

  • how to convert webpage apostrophe (&#8217;) to ascii 39 in ruby 1.8.7

    - by maninwarren
    That's pretty much it. I'm using Nokogiri to scrape a web page that has &#8217; characters in it, and I can't figure out how to do the conversion. Here's what I tried:

        str.gsub(/&#8217;/,"'")
        str.gsub("&#8217;","'")
        str.gsub("GÇÖ","'") # that's how it looks when I do a puts

    (In the above, there's no space between the &#8217 and the ";", but if I don't put the space in, SO converts it to an apostrophe -- the cruel, cruel irony!) I'm sure this is covered somewhere, but couldn't find the solution here or on the web. TIA

    Read the article
