Search Results

Search found 346 results on 14 pages for 'scraping'.

Page 3/14

  • What's the best way to write a maintainable web scraping app?

    - by Benj
    I wrote a Perl script a while ago that logged into my online banking and emailed me my balance and a mini-statement every day. I found it very useful for keeping track of my finances. The only problem is that I wrote it using just Perl and curl, and it was quite complicated and hard to maintain. After a few instances of my bank changing their webpage, I got fed up with debugging it to keep it up to date. So what's the best way to write such a program so that it's easy to maintain? I'd like to write a nice, well-engineered version in either Perl or Java that will be easy to update when the bank inevitably fiddles with their website.

  • Scraping *.aspx content using Python

    - by tomato
    I'm having difficulty scraping a dynamically generated table in ASPX. I'm trying to scrape the gas prices from a site like this: GasPrices. I can extract all the information in the gas price table (address, time submitted, etc.) except for the actual gas price. Is there a way I could scrape the gas prices, i.e. somehow get a text representation of them? I'm not very familiar with ASP/ASPX, but what's being generated is not showing up in the final HTML. I'm using Python to do the scraping, but that's irrelevant unless there's a specific library... Thanks in advance.

  • Scraping with multiple IP, in java.

    - by Titi Wangsa bin Damhore
    Basically, I have a scraping application. It scrapes around n items per minute. Currently I have only one IP, and the site I'm scraping allows me 3 connections per IP. I'm thinking about getting another IP so that I'll be able to get 6 connections. In theory I should then be able to get n items in 40 seconds, more or less. Currently I'm using Java (commons-httpcore) to get the job done. I'm not sure if this is a Java question or an OS question. My machine has IP 1 and IP 2. How do I connect to, say, www.microsoft.com using IP 1 and using IP 2? How can I specify which IP I want to use for a connection?
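
    The mechanism being asked about is binding the client socket to a specific local address before it connects. In Java, java.net.Socket has a constructor that takes a local address and port for this purpose. The same idea is sketched below in Python, where http.client.HTTPConnection accepts a source_address tuple. The helper name connection_from and both local IPs are hypothetical placeholders, not values from the question.

```python
import http.client

def connection_from(local_ip, host, port=80, timeout=10):
    """Create an HTTP connection that will bind to local_ip when it connects."""
    # Port 0 lets the OS pick any free local port on that address.
    return http.client.HTTPConnection(
        host, port, timeout=timeout, source_address=(local_ip, 0)
    )

# Hypothetical local addresses; replace with the machine's real ones.
conn_a = connection_from("192.0.2.10", "www.example.com")
conn_b = connection_from("192.0.2.11", "www.example.com")
print(conn_a.source_address, conn_b.source_address)
```

    No network traffic happens until a request is made on a connection, so requests can be distributed over conn_a and conn_b to stay within the per-IP limit.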

  • scrapy cannot find div on this website [on hold]

    - by Jaspal Singh Rathour
    I am very new at this and have been trying to get my head around my first selector. Can somebody help? I am trying to extract data from the page http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false, specifically all the info under the div with class "listing clearfix shelfListing", but I can't figure out how to format response.xpath(). I have managed to launch the Scrapy console, but no matter what I type into response.xpath(), I can't select the right node. I know it works because when I type response.xpath('//div[@class="container"]') I get a response, but I don't know how to navigate to the "listing clearfix shelfListing" div. I am hoping that once I get this bit I can continue working my way through the spider. Thank you in advance! PS: I wonder whether it is simply not possible to scrape this site. Is it possible for the owners to block spiders?

  • Issue in Webscrapping in C# : Downloading and parsing zipped text files

    - by user64094
    I am writing a web scraper to download content from a website. Navigating to the website URL triggers the creation of a temporary URL. This new URL points to a zipped text file, which is to be downloaded and parsed. I have written the scraper in C# using WebClient and its DownloadFileAsync() method. The zipped file is read from the designated location when the DownloadFileCompleted event fires. My issue: the Windows "Open/Save" dialog is triggered. This requires user input, so the automation is disrupted. Can you suggest a way to bypass the issue? I am cool with rewriting the code using any alternative libraries. :) Thanks for reading.

  • Source for Names to use in web scraping

    - by PyNEwbie
    Can anyone suggest a good source of names that I can use to help analyze some tables on web pages? The first column of the tables I am scraping has names alone, names and titles, or just titles. The names can be as varied as John Smith and Vikram Saksena. I have been poking around for a compiled list of words that can be found in proper names.

  • Web scraping with Python

    - by Jack
    I'm currently trying to scrape a website that has fairly poorly formatted HTML (often missing closing tags, no use of classes or ids so it's incredibly difficult to go straight to the element you want, etc.). I've been using BeautifulSoup with some success so far, but every once in a while (though quite rarely) I run into a page where BeautifulSoup creates the HTML tree a bit differently from, for example, Firefox or WebKit. While this is understandable, as the formatting of the HTML leaves this ambiguous, if I were able to get the same parse tree as Firefox or WebKit produces I would be able to parse things much more easily. The problems are usually something like this: the site opens a <b> tag twice, and when BeautifulSoup sees the second <b> tag, it immediately closes the first, while Firefox and WebKit nest the <b> tags. Is there a web scraping library for Python (or even any other language; I'm getting desperate) that can reproduce the parse tree generated by Firefox or WebKit, or at least get closer than BeautifulSoup in cases of ambiguity?

  • a question on webpage data scraping using Java

    - by Gemma
    Hi there. I am trying to implement a simple HTML webpage scraper using Java, and I have a small problem. Suppose I have the following HTML fragment:

        <div id="sr-h-left" class="sr-comp">
          <a class="link-gray-underline" id="compare_header" rel="nofollow" href="javascript:i18nCompareProd('/serv/main/buyer/ProductCompare.jsp?nxtg=41980a1c051f-0942A6ADCF43B802');">
            <span style="cursor: pointer;" class="sr-h-o">Compare</span>
          </a>
        </div>
        <div id="sr-h-right" class="sr-summary">
          <div id="sr-num-results">
            <div class="sr-h-o-r">Showing 1 - 30 of 1,439 matches,

    The data I am interested in is the integer 1,439 shown at the bottom. How can I get that integer out of the HTML? I am considering using a regular expression with java.util.regex.Pattern to get the data out, but I am still not very clear about the process. I would be grateful if you could give me some hints or ideas on this data scraping. Thanks a lot.

  • Scraping paginated items from a website using scrapy

    - by Mridang Agarwalla
    I'm using Scrapy to scrape items from a site, and I'm not able to implement the following scraping pattern. The site I'm trying to scrape is a forum, and I scrape it once a day. Each page has a table containing posts. New posts are added to the top of the table, and as more and more posts are added to the site, the older posts move further back through the pages due to pagination. This is a very simple scenario, and we will assume that the order of the posts never changes. I would like to scrape all the "new" records until the last scraped post from yesterday is encountered. I have configured my spider to paginate endlessly; when it encounters yesterday's last scraped post, it should stop. How can I implement this? (My Scrapy installation works with my Django installation using django-dynamic-scraper.)
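
    The stopping rule described above can be sketched independently of Scrapy: walk the pages newest-first and stop as soon as yesterday's last scraped post turns up. Everything named here is an assumption for illustration: the new_posts helper, the pages iterable, and the idea that each post carries a stable "id". In a real spider, the same check would decide whether to request the next page.

```python
def new_posts(pages, last_seen_id):
    """Yield posts newest-first until the post with last_seen_id is reached.

    pages: iterable of pages, each a list of dicts with an "id" key
           (hypothetical shape), ordered newest-first.
    """
    for page in pages:
        for post in page:
            if post["id"] == last_seen_id:
                return  # everything from here on was scraped yesterday
            yield post

pages = [
    [{"id": 105}, {"id": 104}],
    [{"id": 103}, {"id": 102}],
    [{"id": 101}, {"id": 100}],
]
fresh = list(new_posts(pages, last_seen_id=102))
print([p["id"] for p in fresh])
```

    Because pages is consumed lazily, passing a generator that fetches pages on demand means no page past the stopping point is ever requested.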

  • Webpage data scraping using Java

    - by Gemma
    I am trying to implement a simple HTML webpage scraper using Java, and I have a small problem. Suppose I have the following HTML fragment:

        <div id="sr-h-left" class="sr-comp">
          <a class="link-gray-underline" id="compare_header" rel="nofollow" href="javascript:i18nCompareProd('/serv/main/buyer/ProductCompare.jsp?nxtg=41980a1c051f-0942A6ADCF43B802');">
            <span style="cursor: pointer;" class="sr-h-o">Compare</span>
          </a>
        </div>
        <div id="sr-h-right" class="sr-summary">
          <div id="sr-num-results">
            <div class="sr-h-o-r">Showing 1 - 30 of 1,439 matches,

    The data I am interested in is the integer 1,439 shown at the bottom. How can I get that integer out of the HTML? I am considering using a regular expression with java.util.regex.Pattern to get the data out, but I am still not very clear about the process. I would be grateful if you could give me some hints or ideas on this data scraping. Thanks a lot.
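
    One way to pull the total out of text like "Showing 1 - 30 of 1,439 matches" is a regular expression with a capture group. The sketch below is in Python; the pattern itself should carry over to java.util.regex.Pattern with little change. The function name total_matches is hypothetical, and the pattern assumes the wording "of N matches" stays stable on the page.

```python
import re

# Assumed wording: "... of 1,439 matches"; the digits may contain thousands separators.
TOTAL_RE = re.compile(r"of\s+([\d,]+)\s+matches")

def total_matches(html):
    """Return the total as an int, or None when the pattern is not found."""
    m = TOTAL_RE.search(html)
    if m is None:
        return None
    return int(m.group(1).replace(",", ""))

fragment = '<div class="sr-h-o-r">Showing 1 - 30 of 1,439 matches,'
print(total_matches(fragment))
```

    Returning None rather than raising makes a layout change on the site easy to detect in the calling code.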

  • What's the fastest way to scrape a lot of pages in php?

    - by Yegor
    I have a data aggregator that relies on scraping several sites and indexing their information in a way that is searchable by the user. I need to be able to scrape a vast number of pages daily, and I have run into problems using simple curl requests, which are fairly slow when executed in rapid sequence for a long time (the scraper basically runs 24/7). Running a multi curl request in a simple while loop is fairly slow. I sped it up by doing individual curl requests in a background process, which works faster, but sooner or later the slower requests start piling up, which ends up crashing the server. Are there more efficient ways of scraping data? Perhaps command-line curl?
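
    Requests piling up until the server crashes suggests that nothing bounds the number of requests in flight. One common pattern is a fixed-size worker pool. The question concerns PHP; the sketch below only illustrates the pattern, in Python with concurrent.futures. The names scrape_many and fake_fetch, the worker count, and the example URLs are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_many(urls, fetch, max_workers=8):
    """Run fetch(url) for each URL with at most max_workers in flight."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one bad page should not stop the run
                errors[url] = exc
    return results, errors

def fake_fetch(url):
    # Stand-in for a real HTTP request with its own timeout.
    if url.endswith("/bad"):
        raise ValueError("simulated failure")
    return "content of " + url

results, errors = scrape_many(
    ["http://example.com/a", "http://example.com/b", "http://example.com/bad"],
    fake_fetch,
    max_workers=2,
)
print(sorted(results), sorted(errors))
```

    Giving each request its own timeout inside fetch keeps slow pages from occupying a worker indefinitely.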

  • scraping website with javascript cookie with c#

    - by erwin
    Hi all, I want to scrape some things from the following site: http://www.conrad.nl/modelspoor This is my function:

        public string SreenScrape(string urlBase, string urlPath)
        {
            CookieContainer cookieContainer = new CookieContainer();
            HttpWebRequest httpWebRequest = (HttpWebRequest)WebRequest.Create(urlBase + urlPath);
            httpWebRequest.CookieContainer = cookieContainer;
            httpWebRequest.UserAgent = "Mozilla/6.0 (Windows; U; Windows NT 7.0; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.9 (.NET CLR 3.5.30729)";
            WebResponse webResponse = httpWebRequest.GetResponse();
            string result = new System.IO.StreamReader(webResponse.GetResponseStream(), Encoding.Default).ReadToEnd();
            webResponse.Close();
            if (result.Contains("<frame src="))
            {
                Regex metaregex = new Regex("http:[a-z:/._0-9!?=A-Z&]*", RegexOptions.Multiline);
                result = result.Replace("\r\n", "");
                Match m = metaregex.Match(result);
                string key = m.Groups[0].Value;
                foreach (Match match in metaregex.Matches(result))
                {
                    HttpWebRequest redirectHttpWebRequest = (HttpWebRequest)WebRequest.Create(key);
                    redirectHttpWebRequest.CookieContainer = cookieContainer;
                    webResponse = redirectHttpWebRequest.GetResponse();
                    string redirectResponse = new System.IO.StreamReader(webResponse.GetResponseStream(), Encoding.Default).ReadToEnd();
                    webResponse.Close();
                    return redirectResponse;
                }
            }
            return result;
        }

    But when I do this, I get a string containing an error from the website saying that it uses JavaScript. Does anybody know how to fix this? Kind regards, Erwin

  • scraping text from multiple html files into a single csv file

    - by Lulu
    I have just over 1500 HTML pages (1.html to 1500.html). I have written code using Beautiful Soup that extracts most of the data I need but "misses" some of the data within the table. My input: e.g. file 1500.html. My code:

        #!/usr/bin/env python
        import glob
        import codecs
        from BeautifulSoup import BeautifulSoup

        with codecs.open('dump2.csv', "w", encoding="utf-8") as csvfile:
            for file in glob.glob('*html*'):
                print 'Processing', file
                soup = BeautifulSoup(open(file).read())
                rows = soup.findAll('tr')
                for tr in rows:
                    cols = tr.findAll('td')
                    #print >> csvfile,"#".join(col.string for col in cols)
                    #print >> csvfile,"#".join(td.find(text=True))
                    for col in cols:
                        print >> csvfile, col.string
                    print >> csvfile, "==="
                print >> csvfile, "***"

    Output: one CSV file, with 1500 lines of text and columns of data. For some reason my code does not pull out all the required data but "misses" some of it, e.g. the Address1 and Address 2 data at the start of the table do not come out. I modified the code to put in *** and === separators, and I then use Perl to turn the output into a clean CSV file. Unfortunately, I'm not sure how to change my code to get all the data I'm looking for!
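
    One possible cause, offered as a guess since the input file is not shown: in BeautifulSoup, a tag's .string is None when the tag contains more than one child, so a cell with nested markup such as <br> yields nothing useful. The sketch below uses only the standard library's html.parser to show the alternative of collecting every text fragment inside each cell. The class name CellText and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

class CellText(HTMLParser):
    """Collect the full text of every <td>, including text in nested tags."""

    def __init__(self):
        super().__init__()
        self.rows = []      # list of rows, each a list of cell strings
        self._cell = None   # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join the fragments and normalise runs of whitespace.
            text = " ".join(" ".join(self._cell).split())
            if not self.rows:
                self.rows.append([])
            self.rows[-1].append(text)
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

sample = "<table><tr><td>Address1<br>Address 2</td><td><b>Name</b></td></tr></table>"
parser = CellText()
parser.feed(sample)
print(parser.rows)
```

    Each inner list can then be passed to csv.writer(...).writerow, which avoids the separate clean-up pass over separator lines.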

  • Automating scraping of table data to XML

    - by thewinchester
    Problem: I have a YQL query result that I'm trying to convert and sort into a clean XML file.

    Background: Being the pains that they are, information from the World Cup isn't freely available in an easy-to-reuse format. So, after a bit of finessing with YQL, I have managed to liberate the required table rows, which contain the data I'm after. The YQL query can be viewed at: http://query.yahooapis.com/v1/public/yql/ravingbeefsteak/worldcup2010groupliberator?diagnostics=true

    I'd now like to convert this information into XML, and being an absolute n00b I don't know where to start or what to look for. I also need to do a find and replace on the data to get the URLs working as they should without manual changes, and hopefully an initial sorting of the data. If anyone can point me in the right direction, it would be greatly appreciated.

  • Difficulty screen scraping http://www.momondo.com using nokogiri

    - by Khai Kiong
    I am having difficulty extracting the total price (CSS selector '.total') from the flight results at http://www.momondo.com/multicity/?Search=true&TripType=oneway&SegNo=1&SO0=KUL&SD0=KBR&SDP0=31-12-2012&AD=2&CA=0,0&DO=false&NA=false#Search=true&TripType=oneway&SegNo=1&SO0=KUL&SD0=KBR&SDP0=31-12-2012&AD=2&CA=0,0&DO=false&NA=false I get the error "undefined method `text' for nil:NilClass" from Nokogiri. My code:

        desc "Fetch product prices"
        task :fetch_details => :environment do
          require 'nokogiri'
          require 'open-uri'
          include ERB::Util
          OneWayFlight.find_all_by_money(nil).each do |flight|
            url = "http://www.momondo.com/multicity/Search=true&TripType=oneway&SegNo=1&SO0=KUL&SD0=KBR&SDP0=31-12-2012&AD=2&CA=0,0&DO=false&NA=false#Search=true&TripType=oneway&SegNo=1&SO0=KUL&SD0=KBR&SDP0=31-12-2012&AD=2&CA=0,0&DO=false&NA=false"
            doc = Nokogiri::HTML(open(url))
            price = doc.at_css(".total").text[/[0-9\.]+/]
            flight.update_attribute(:price, price)
          end
        end

  • Scraping *.aspx content using Python

    - by tomato
    I'm having difficulty scraping a dynamically generated table in ASPX. I'm trying to scrape the gas prices from a site like this: GasPrices. I can extract all the information in the gas price table (address, time submitted, etc.) except for the actual gas price. Is there a way I could scrape the gas prices, i.e. somehow get a text representation of them? I'm not very familiar with ASP/ASPX, but what's being generated is not showing up in the final HTML. I'm using Python to do the scraping, but that's irrelevant unless there's a specific library...

  • Having trouble scraping an ASP .NET web page

    - by Seth
    I am trying to scrape an ASP.NET website but am having trouble getting the results from a POST. I have the following Python code and am using httplib2 and BeautifulSoup:

        conn = Http()
        # do a get first to retrieve important values
        page = conn.request(u"http://somepage.com/Search.aspx", "GET")
        # event_validation and viewstate variables retrieved from GET here...
        body = {"__EVENTARGUMENT" : "",
                "__EVENTTARGET" : "",
                "__EVENTVALIDATION": event_validation,
                "__VIEWSTATE" : viewstate,
                "ctl00_ContentPlaceHolder1_GovernmentCheckBox" : "On",
                "ctl00_ContentPlaceHolder1_NonGovernmentCheckBox" : "On",
                "ctl00_ContentPlaceHolder1_SchoolKeyValue" : "",
                "ctl00_ContentPlaceHolder1_SchoolNameTextBox" : "",
                "ctl00_ContentPlaceHolder1_ScriptManager1" : "ctl00_ContentPlaceHolder1_UpdatePanel1|cct100_ContentPlaceHolder1_SearchImageButton",
                "ct100_ContentPlaceHolder1_SearchImageButton.x" : "375",
                "ct100_ContentPlaceHolder1_SearchImageButton.y" : "11",
                "ctl00_ContentPlaceHolder1_SuburbTownTextBox" : "Adelaide,SA,5000",
                "hiddenInputToUpdateATBuffer_CommonToolkitScripts" : 1}
        headers = {"Content-type": "application/x-www-form-urlencoded"}
        resp, content = conn.request(url, "POST", headers=headers, body=urlencode(body))

    When I print content, I still seem to be getting the same results as from the GET. Is there a fundamental concept I'm missing for retrieving the result values of an ASP.NET POST?
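
    ASP.NET WebForms pages typically round-trip state in hidden inputs such as __VIEWSTATE and __EVENTVALIDATION, so one approach is to copy every hidden input from the GET response into the POST body rather than typing field names by hand. A minimal standard-library sketch follows. HiddenInputs, build_post_body, and the sample HTML are hypothetical, and it is only an assumption that this addresses the problem described.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

class HiddenInputs(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        a = dict(attrs)
        if (a.get("type") or "").lower() == "hidden" and a.get("name"):
            self.fields[a["name"]] = a.get("value") or ""

def build_post_body(get_html, extra_fields):
    """Merge hidden fields from the GET page with the caller's form values."""
    parser = HiddenInputs()
    parser.feed(get_html)
    body = dict(parser.fields)
    body.update(extra_fields)
    return urlencode(body)

# Made-up page; a real page would come from the initial GET request.
page = (
    '<form><input type="hidden" name="__VIEWSTATE" value="abc123" />'
    '<input type="hidden" name="__EVENTVALIDATION" value="xyz789" />'
    '<input type="text" name="SuburbTown" /></form>'
)
body = build_post_body(page, {"SuburbTown": "Adelaide,SA,5000"})
print(body)
```

    Reading the field names from each input's name attribute in the page also guards against hand-typed variations in control prefixes.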

  • Top techniques to avoid 'data scraping' from a website database

    - by Addsy
    I am setting up a site using PHP and MySQL that is essentially just a web front end to an existing database. Understandably, my client is very keen to prevent anyone from being able to make a copy of the data in the database, yet at the same time wants everything publicly available, and even a "view all" link to display every record in the database. Whilst I have put everything in place to prevent attacks such as SQL injection, there is nothing to prevent anyone from viewing all the records as HTML and running some sort of script to parse this data back into another database. Even if I were to remove the "view all" link, someone could still, in theory, use an automated process to go through each record one by one and compile these into a new database, essentially pinching all the information. Does anyone have any good tactics for preventing or even just deterring this that they could share? Thanks

  • grabbing a substring while scraping with Python2.6

    - by Diego
    Hey, can someone help with the following? I'm trying to scrape a site that has the following information, and I need to pull just the number after the </strong> tag:

        [<li><strong>ISBN-13:</strong> 9780375853401</li>, <li><strong>Pub. Date: </strong> 05/11/2010</li>]
        [<li><strong>UPC:</strong> 490355000372</li>, <li><strong>Catalog No:</strong> 15024/25</li>, <li><strong>Label:</strong> CAMERATA</li>]

    Here's a piece of the code I've been using to grab the above data using mechanize and BeautifulSoup. I'm stuck here, as it won't let me use the find() function on a list:

        br_results = mechanize.urlopen(br_results)
        html = br_results.read()
        soup = BeautifulSoup(html)
        local_links = soup.findAll("a", {"class" : "down-arrow csa"})
        upc_code = soup.findAll("ul", {"class" : "bc-meta3"})
        for upc in upc_code:
            upc_text = upc.contents.contents
            print upc_text
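
    The value wanted here is the text that follows each </strong>. As a sketch that does not depend on a particular BeautifulSoup version, the regular expression below captures the label and the trailing text from each <li>. The helper name label_values is hypothetical, and the pattern assumes the flat "<li><strong>label</strong> value</li>" shape shown above.

```python
import re

# Label inside <strong>, then the text that follows it up to </li>.
ITEM_RE = re.compile(r"<li>\s*<strong>([^<]*)</strong>\s*([^<]*?)\s*</li>")

def label_values(html):
    """Map each label (without the trailing colon) to the text after </strong>."""
    result = {}
    for label, value in ITEM_RE.findall(html):
        result[label.strip().rstrip(":").strip()] = value
    return result

html = (
    "<li><strong>ISBN-13:</strong> 9780375853401</li>"
    "<li><strong>Pub. Date: </strong> 05/11/2010</li>"
    "<li><strong>UPC:</strong> 490355000372</li>"
)
info = label_values(html)
print(info["ISBN-13"], info["UPC"])
```

    Keying the result by label means a missing UPC or ISBN shows up as a missing key instead of shifting the other values.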

  • scraping blog contents

    - by goh
    Hi lads, after obtaining the URLs for various Blogspot, Tumblr and WordPress pages, I ran into some problems processing the HTML pages. I wish to distinguish between the content, title and date of each blog post. I might be able to get the date through regex, but people use so many custom scripts now that the HTML classes and structure differ widely. Does anyone have a solution that may help?

  • rcurl web scraping timeout exits program

    - by user1742368
    I am using a loop and RCurl to scrape data from multiple pages. This works fine at times but fails when there is a timeout due to the server not responding. I am using timeout=30, which traps the timeout error; however, the program stops after the timeout. I would like the program to continue to the next page when the timeout occurs, but I can't figure out how to do this.

        url = getCurlHandle(cookiefile = "", verbose = TRUE)

    Here is the statement I am using that causes the timeout. I am happy to share the code if there is interest.

        webpage = getURLContent(url, followlocation=TRUE, curl = curl, .opts=list(verbose = TRUE, timeout=90, maxredirs = 2))

    woodwardjj
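
    A common way to keep such a loop running is to catch the error per page; in R that is typically done by wrapping the request in tryCatch. The same control flow is sketched below in Python, with a caller-supplied fetch function standing in for the real HTTP call. The names fetch_all and fake_fetch and the example URLs are all hypothetical.

```python
def fetch_all(urls, fetch):
    """Fetch every URL; record failures instead of aborting the loop."""
    pages, failed = {}, []
    for url in urls:
        try:
            pages[url] = fetch(url)
        except OSError as exc:  # socket timeouts and URL errors derive from OSError
            failed.append((url, str(exc)))
            continue
    return pages, failed

def fake_fetch(url):
    # Stand-in for a real request, e.g. urllib.request.urlopen(url, timeout=30).read()
    if "slow" in url:
        raise TimeoutError("timed out")
    return "<html>%s</html>" % url

pages, failed = fetch_all(
    ["http://example.com/1", "http://example.com/slow", "http://example.com/3"],
    fake_fetch,
)
print(sorted(pages), failed)
```

    Keeping the failed list makes it possible to retry only the pages that timed out once the first pass has finished.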
