scrape - Developer IT

How can I scrape the current webpage with php/javascript?

- by Robert

I have made the following webpage for generating interactive todo lists: http://robert-kent.com/todo/todo.php Basically, the user pastes a numbered todo list and each task is placed into its own div with a unique id. Users can add notes to the tasks (done with JavaScript) and can click the green check when the task is done to hide it. I'd like to add an Export button which would generate a report of which tasks were completed and which were not, along with the user-entered notes. After a bit of searching I understand that what I want to do is scrape the page, but I haven't the faintest idea of the best way to do it. Many of the articles and tutorials I have found with Google involve scraping other sites and don't really explain how I could iterate over each div on the page. Full source here: http://pastebin.com/r7V3P5jK Any suggestions?

How to scrape Google SERP based on copyright year?

- by Michael Mao

Hi all: I know there must be ways to do this sort of thing. I am not a pro in RoR or Python, not even an expert in PHP, so my solution tends to be quite dumb: it uses a Firefox add-on called iMacros to scrape the target URLs from the Google SERP, and uses PHP to store the info in a database. At the very core of my workaround lies a problem: how do I specifically find target URLs based on their copyright year? I mean, something like "copyright 1998-2006" in the footer is to be considered a target, but my search results are not 100% accurate. I used the following URL to search: http://www.google.com.au/#hl=en&q=inurl:.com.au+intext:copyright+1995..2007+--2008+--2009&start=0&cad=b&fp=6a8119b094529f00 It reads: search for pages that have .com.au in the URL and a copyright range from 1995 to 2007, excluding the years 2008 and 2009. The starting position is 0; of course the offset can be changed. I've already built a dummy list and honestly I am not pleased with the result. That's mostly because I cannot find a way to restrict the search terms to the exact order in which they are entered in the search URL: "copyright" can appear anywhere on the page and doesn't necessarily come before the years. Is there a cleaner way to sort this out? Oh, I almost forgot to say that the client doesn't want to spend too much on this, so unfortunately I cannot persuade him to simply buy some cool software. I hope there is a way to use clever Google search operators or something similar to get around this issue. Many thanks in advance!

How do I scrape information off ASP.NET websites when paging and JavaScript links are being used?

- by Ian Roke

I have been given a staff list which is supposed to be up to date, but it doesn't match an intranet People Finder which is written in ASP.NET. As the information is sensitive I am not able to access the database the People Finder is using, so the only way I can get at the information is by scraping the structure, starting with the top brass and then going through each tier in turn. Each person has a Staff number which forms the URL http://intranet/peoplefinder/index.aspx?srn=ABC1234 and all the people who report to them are listed underneath in the format <a id="gvEmployees_ctl03_lnkFullName" href="index.aspx?srn=ABC4321" target="_self"> where each URL indicates the Staff number and provides a link to their team. The trouble arises when the teams are big, as paging is implemented in the GridView with a URL such as <a href="javascript:__doPostBack('gvEmployees','Page$2')">2</a>. How would I scrape this page, capture the SRN and other details along with the people who report to the person on all pages of the GridView, then loop through each reportee and repeat the process until the whole list is complete?
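One possible approach to the paging part, as a minimal Python sketch rather than a tested solution: a `__doPostBack(target, argument)` link in ASP.NET Web Forms normally submits the page's form with the hidden fields it already contains, plus `__EVENTTARGET` and `__EVENTARGUMENT` set to the two arguments. The helper names (`parse_page`, `build_postback`) and the sample markup below are assumptions made for illustration; the real People Finder page will carry its own hidden fields and values.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlencode, urlsplit


class PeopleFinderParser(HTMLParser):
    """Collect hidden form fields and the srn values found in link hrefs."""

    def __init__(self):
        super().__init__()
        self.fields = {}  # hidden inputs such as __VIEWSTATE
        self.srns = []    # staff numbers found in href="index.aspx?srn=..."

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            if attrs.get("name"):
                self.fields[attrs["name"]] = attrs.get("value") or ""
        elif tag == "a":
            query = parse_qs(urlsplit(attrs.get("href") or "").query)
            self.srns.extend(query.get("srn", []))


def parse_page(page_html):
    """Return (hidden_fields, srns) for one People Finder page."""
    parser = PeopleFinderParser()
    parser.feed(page_html)
    return parser.fields, parser.srns


def build_postback(fields, target, argument):
    """Return URL-encoded POST data emulating __doPostBack(target, argument)."""
    data = dict(fields)
    data["__EVENTTARGET"] = target
    data["__EVENTARGUMENT"] = argument
    return urlencode(data)


# Hypothetical page fragment modelled on the markup quoted in the question.
sample = """
<form method="post" action="index.aspx?srn=ABC1234">
<input type="hidden" name="__VIEWSTATE" value="dDwtMTA4" />
<a id="gvEmployees_ctl03_lnkFullName" href="index.aspx?srn=ABC4321" target="_self">A. Person</a>
<a href="javascript:__doPostBack('gvEmployees','Page$2')">2</a>
</form>
"""
fields, srns = parse_page(sample)
body = build_postback(fields, "gvEmployees", "Page$2")
print(srns)
print(body)
```

The encoded body would then be sent as a POST to the same URL the page was fetched from, and each collected SRN queued for the same treatment until no unseen staff numbers remain.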

Screen scrape a web page that uses javaScript and frames

- by Mello

Hi, I want to scrape data from www.marktplaats.nl . I want to analyze the scraped description, price, date and views in Excel/Access. I tried to scrape data with Ruby (nokogiri, scrapi) but nothing worked. (on other sites it worked well) The main problem is that for example selectorgadget and the add-on firebug (Firefox) don’t find any css I can use to scrape the page. On other sites I can extract the css with selectorgadget or firebug and use it with nokogiri or scrapi. Due to lack of experience it is difficult to identify the problem and therefore searching for a solution isn’t easy. Can you tell me where to start solving this problem and where I maybe can find more info about a similar scraping process? Thanks in advance!

How can I scrape specific data from a website

- by Stoney

I'm trying to scrape data from a website for research. The urls are nicely organized in an example.com/x format, with x as an ascending number and all of the pages are structured in the same way. I just need to grab certain headings and a few numbers which are always in the same locations. I'll then need to get this data into structured form for analysis in Excel. I have used wget before to download pages, but I can't figure out how to grab specific lines of text. Excel has a feature to grab data from the web (Data-From Web) but from what I can see it only allows me to download tables. Unfortunately, the data I need is not in tables.
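For pages that all share one layout, one option is to parse each page and write one CSV row per page, which Excel can open directly. The following is a minimal sketch using only the Python standard library; it assumes the wanted values sit in `<h1>` to `<h3>` elements, and the function names and sample pages are hypothetical. Downloading the pages (for example with wget, as already used) is left out.

```python
import csv
import io
from html.parser import HTMLParser


class HeadingParser(HTMLParser):
    """Collect the text of every <h1>, <h2> and <h3> element in document order."""

    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._buffer = None  # text chunks of the heading currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._buffer is not None:
            # Join the chunks and normalise internal whitespace.
            self.headings.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def extract_headings(page_html):
    parser = HeadingParser()
    parser.feed(page_html)
    return parser.headings


def pages_to_csv(pages):
    """pages maps a page number to its HTML; returns CSV text, one row per page."""
    out = io.StringIO()
    writer = csv.writer(out)
    for number in sorted(pages):
        writer.writerow([number] + extract_headings(pages[number]))
    return out.getvalue()


# Hypothetical stand-ins for example.com/1 and example.com/2.
pages = {
    1: "<html><body><h1>Alpha</h1><p>Count: 12</p></body></html>",
    2: "<html><body><h1>Beta</h1><p>Count: 7</p></body></html>",
}
csv_text = pages_to_csv(pages)
print(csv_text)
```

Numbers that live outside headings could be picked up the same way by watching for whichever tag or attribute surrounds them on the real pages.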

perl script to scrape out sentences

- by kivien

I need a Perl script that scrapes out the sentences that mention 'Calvin Klein' from the articles in a file named by $file. (Sentences can cross zero or more CR/LF characters.) It should create an array of the scraped sentences and print it at the end. Can anyone please help me with that?
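The logic can be sketched in a few lines. The version below is written in Python rather than Perl, purely as an illustration of the steps (flatten line breaks, split into sentences, filter); the sentence splitter is deliberately naive and will mis-split abbreviations such as "e.g.", and the function name and sample text are made up.

```python
import re


def sentences_mentioning(text, phrase):
    """Return the sentences of text that contain phrase (case-insensitive)."""
    # Collapse CR/LF and other whitespace runs so a sentence may span lines.
    flat = " ".join(text.split())
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", flat)
    needle = phrase.lower()
    return [s for s in sentences if needle in s.lower()]


sample = "He wore Calvin\r\nKlein jeans. It rained. She likes\nCalvin Klein too!"
found = sentences_mentioning(sample, "Calvin Klein")
for sentence in found:
    print(sentence)
```

Because whitespace is collapsed before matching, a phrase broken across a line ending (as in the first sample sentence) is still found.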

I want to scrape a site using GAE and post the results into a Google Entity

- by cozza

I want to scrape this URL: https://www.xstreetsl.com/modules.php?searchSubmitImage_x=0&searchSubmitImage_y=0&SearchLocale=0&name=Marketplace&SearchKeyword=business&searchSubmitImage.x=0&searchSubmitImage.y=0&SearchLocale=0&SearchPriceMin=&SearchPriceMax=&SearchRatingMin=&SearchRatingMax=&sort=&dir=asc, go into each of the links and extract various pieces of information (e.g. permissions, prims), then post the results into an entity on Google App Engine. What is the best way to go about it? Chris

What movie website allows people to scrape it?

- by Sergio Tapia

I've wanted to make a C# library to scrape movie information and return it to the application, but someone told me that it's against the TOS. RottenTomatoes seems to have no problems with it from what I've read on their licensing page, but I'm not quite sure. Where could I acquire movie information legally and without cost? It's for an open source application hosted here: LINK

How do I get data from the iTunes app store

- by Bodie

I'm trying to scrape the entire iTunes App Store so that I can store it in a database for a project I'm working on. I'm having a hard time finding the best way to do this. I know there are ways to get specific information about price changes but I can't find anything that describes how to scrape the entire app store. Any additional info is appreciated.

Writing a program to scrape forums

- by seanieb

Hi, I need to write a program to scrape forums. Should I write the program in Python using the Scrapy framework, or should I use PHP with cURL? Also, is there a PHP equivalent to Scrapy? Thanks

How can I screen scrape with Perl?

- by Sakthivel

I need to display some values that are stored in a website, for that I need to scrape the website and fetch the content from the table. Any ideas?

scrape data from a website and post it on the blog (wordpress)

- by Pennf0lio

This could be in DocType, but I'm looking for software or just a plugin for WordPress. I want to fetch data from a website and automatically post it on my blog (WordPress powered). The site doesn't have RSS or an API to get the data, so I currently need to copy and paste it manually, one item at a time, and post it on WordPress. Do you know an alternative to my process, or software or a plugin that does the job? Thanks!

Scrape HTML tables from a given URL into CSV

- by dreeves

I seek a tool that can be run on the command line like so: tablescrape 'http://someURL.foo.com' [n] If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list. If n is specified or if there's only one table, it should parse the table and spit it to stdout as CSV or TSV. Potential additional features: To be really fancy you could parse a table within a table, but for my purposes -- fetching data from wikipedia pages and the like -- that's overkill. The Perl module HTML::TableExtract can do this and may be good place to start for writing the tool I have in mind. An option to asciify any unicode. An option to apply an arbitrary regex substitution for fixing weirdnesses in the parsed table. Related questions: http://stackoverflow.com/questions/259091/how-can-i-scrape-an-html-table-to-csv http://stackoverflow.com/questions/1403087/how-can-i-convert-an-html-table-to-csv http://stackoverflow.com/questions/2861/options-for-html-scraping
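The behaviour described above can be sketched with the Python standard library alone. This is a minimal illustration, not the requested tool: it works on an HTML string instead of fetching a URL, assumes well-formed markup with explicit closing tags and no nested tables, and the names `TableParser`, `summarize` and `tablescrape` are invented for the sketch.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row currently open, if any
        self._cell = None  # text chunks of the cell currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def summarize(tables):
    """One numbered line per table: row count and header row."""
    lines = []
    for i, rows in enumerate(tables, start=1):
        header = ", ".join(rows[0]) if rows else ""
        lines.append(f"{i}. {len(rows)} row(s); header: {header}")
    return "\n".join(lines)


def tablescrape(page_html, n=None):
    """Summarize the tables, or return table n (1-based) as CSV text."""
    parser = TableParser()
    parser.feed(page_html)
    tables = parser.tables
    if n is None and len(tables) != 1:
        return summarize(tables)
    return_index = 0 if n is None else n - 1
    out = io.StringIO()
    csv.writer(out).writerows(tables[return_index])
    return out.getvalue()


sample = (
    "<table><tr><th>City</th><th>Pop</th></tr>"
    "<tr><td>Oslo</td><td>700000</td></tr></table>"
    "<table><tr><th>A</th></tr></table>"
)
print(tablescrape(sample))
print(tablescrape(sample, 1))
```

Switching `csv.writer(out)` to `csv.writer(out, delimiter="\t")` would give TSV instead; real-world pages with omitted closing tags would need a more forgiving parser.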

Using Nokogiri to scrape groupon deal

- by hyngyn

I'm following the Nokogiri railscast to write a scraper for Groupon. I keep on getting the following error when I run my rb file.

    traveldeal_scrape.rb:10: warning: regular expression has ']' without escape: /\[0-9 \.]+/
    Flamingo Conference Resort and Spa Deal of the Day | Groupon Napa / Sonoma
    traveldeal_scrape.rb:9:in `block in <main>': undefined local variable or method `item' for main:Object (NameError)

Here is my scrape file.

    require 'rubygems'
    require 'nokogiri'
    require 'open-uri'

    url = "http://www.groupon.com/deals/ga-flamingo-conferences-resort-spa?c=all&p=0"
    doc = Nokogiri::HTML(open(url))
    puts doc.at_css("title").text
    doc.css(".deal").each do |deal|
      title = deal.at_css("#content//a").text
      price = deal.at_css("#amount").text[/\[0-9\.]+/]
      puts "#{title} - #{price}"
      puts deal.at_css(".deal")[:href]
    end

I used the exact same rubular expression as the tutorial. I am also unsure of whether or not my CSS tags are correct. Thanks!
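The warning points at the backslash before the opening bracket: with `\[`, the bracket is a literal character instead of the start of a character class, so the pattern looks for the literal text `[0-9.]` and never matches a price. The sketch below reproduces that parse with Python's `re` module, which is a different engine from Ruby's but treats this pattern the same way; the sample price string is made up, and `[0-9.]+` is only the presumably intended pattern.

```python
import re

price_text = "$129.00"  # hypothetical stand-in for the text of the price element

# Backslash-escaped '[': matches a literal "[0-9.]" followed by nothing useful.
escaped = re.search(r"\[0-9\.]+", price_text)

# Unescaped '[': a character class of digits and dots, repeated one or more times.
char_class = re.search(r"[0-9.]+", price_text)

print(escaped)             # None
print(char_class.group())  # 129.00
```

In the Ruby file the equivalent change would be dropping the backslash in front of `[`.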

How to scrape a _private_ google group?

- by John

Hi there, I'd like to scrape the discussion list of a private Google group. It's a multi-page list and I might have to do this again later, so scripting sounds like the way to go. Since this is a private group, I need to log in to my Google account first. Unfortunately I can't manage to log in using wget or Ruby's Net::HTTP. Surprisingly, Google Groups is not accessible with the Client Login interface, so all the code samples are useless. My Ruby script is embedded at the end of the post. The response to the authentication query is a 200 OK, but there are no cookies in the response headers and the body contains the message "Your browser's cookie functionality is turned off. Please turn it on." I got the same output with wget. See the bash script at the end of this message. I don't know how to work around this. Am I missing something? Any ideas? Thanks in advance. John

Here is the ruby script:

    # a ruby script
    require 'net/https'

    http = Net::HTTP.new('www.google.com', 443)
    http.use_ssl = true
    path = '/accounts/ServiceLoginAuth'

    email='[email protected]'
    password='topsecret'

    # form inputs from the login page
    data = "Email=#{email}&Passwd=#{password}&dsh=7379491738180116079&GALX=irvvmW0Z-zI"
    headers = { 'Content-Type' => 'application/x-www-form-urlencoded',
                'user-agent' => "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/6.0"}

    # Post the request and print out the response to retrieve our authentication token
    resp, data = http.post(path, data, headers)
    puts resp
    resp.each {|h, v| puts h+'='+v}
    #warning: peer certificate won't be verified in this SSL session

Here is the bash script:

    # A bash script for wget
    CMD=""
    CMD="$CMD --keep-session-cookies --save-cookies cookies.tmp"
    CMD="$CMD --no-check-certificate"
    CMD="$CMD --post-data='[email protected]&Passwd=topsecret&dsh=-8408553335275857936&GALX=irvvmW0Z-zI'"
    CMD="$CMD --user-agent='Mozilla'"
    CMD="$CMD https://www.google.com/accounts/ServiceLoginAuth"
    echo $CMD
    wget $CMD
    wget --load-cookies="cookies.tmp" http://groups.google.com/group/mygroup/topics?tsc=2

Client side page call/scrape?

- by Silvre

Here is the problem: I have a web application - a frequently changing notification system - that runs on a series of local computers. The application refreshes every couple of seconds to display the new information. The computers only display info and do not have keyboards or ANY input device. The issue is that if the connection to the server is lost (say updates are installed and a server must be rebooted), a page not found error is displayed. We must then either reboot all computers that are running this app, OR add a keyboard and refresh the browser, OR try to access each computer remotely and refresh the browser. None of these are good options, and they result in a lot of frustration. I cannot change the actual application OR the server environment. So what I need is some way to test the call to the application and, if an error is returned or it times out, keep trying every minute or so until the connection is reestablished. My idea is to create a client-side page scraper that makes a JS request to the application (which displays basic HTML) and can run locally on the machine, no server required. If the scrape returns the correct content, it displays it. If not, it continues to request the page until the actual page content is returned. Is this possible? What is the best way to do it?
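The retry logic itself is small, whatever language ends up running on the display machines. Below is a minimal sketch in Python (not the client-side JavaScript the question has in mind) of "keep requesting until the real content comes back". The `marker` argument is an assumption: some string that only appears in the genuine application page and not in an error page. The fetcher and sleep function are injectable so the loop can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return the page body, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Covers connection errors, HTTP error statuses and socket timeouts.
        return None


def wait_for_page(url, marker, retry_seconds=60, fetcher=fetch, sleep=time.sleep):
    """Poll url until the body contains marker, then return the body."""
    while True:
        body = fetcher(url)
        if body is not None and marker in body:
            return body
        sleep(retry_seconds)


# Example call (not executed here; URL and marker are placeholders):
# html = wait_for_page("http://server/notifications", "notification-list")
```

The same loop ports directly to a local HTML page using a timer and a background request, subject to the browser's same-origin restrictions.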

Scrape data from HTML pages using Java, output to database

- by Tanith

I need to know how to create a scraper (in Java) to gather data from HTML pages and output to a database...do not have a clue where to start so any information you can give me on this would be great. Also, you can't be too basic or simple here...thanks :)

Scrape zipcode table for different urls based on county

- by Dr.Venkman

I used lxml and ran into a wall, as my new computer won't install lxml and the code doesn't work. I know this is simple - maybe someone can help with a Beautiful Soup script. This is my code:

    import codecs
    import lxml as lh
    from selenium import webdriver
    import time
    import re

    results = []
    city = [ 'amador']
    state = [ 'CA']
    for state in states:
        for city in citys:
            browser = webdriver.Firefox()
            link2 = 'http://www.getzips.com/cgi-bin/ziplook.exe?What=3&County='+ city +'&State=' + state + '&Submit=Look+It+Up'
            browser.get(link2)
            bcontent = browser.page_source
            zipcode = bcontent[bcontent.find('<td width="15%"'):bcontent.find('<p>')+0]
            if len(zipcode) > 0:
                print zipcode
            else:
                print 'none'
            browser.quit()

Thanks for the help

scrape a user's entire tweets

- by whitman

I'd like to pull all of a user's tweets. I could do this the hard way (manually scraping twitter) or the easy way: using their api. The problem with the easy (api) way is that I seem to be limited to the 200 most recent tweets. What's a simple way to get all tweets? Thanks

How to work around a site forbidding me to scrape their images with PHP

- by Petruza

I'm scraping a site, searching for JPGs to download. Scraping the site's HTML pages works fine. But when I try getting the JPGs with cURL, copy(), fopen(), etc., I get a 403 Forbidden status. I know that's because the site owners don't want their images scraped, so I understand a good answer would be "just don't do it, because they don't want you to". OK, but let's say it's OK and I try to work around this; how could this be achieved? If I get the same URL with a browser, I can open the image perfectly, so it's not that my IP is banned or anything, and I'm testing the scraper one file at a time, so it's not blocking me because I make too many requests too often. From my understanding, it could be that the site is checking for some cookies that confirm that I'm using a browser and browsing their site before I download a JPG, or that PHP is using some user agent for the requests that the server can detect and filter out. Anyway, any ideas?

rcurl web scraping timeout exits program

- by user1742368

I am using a loop and RCurl to scrape data from multiple pages, which seems to work fine at certain times but fails when there is a timeout due to the server not responding. I am using timeout=30, which traps the timeout error; however, the program stops after the timeout. I would like the program to continue to the next page when the timeout occurs, but I can't figure out how to do this.

    url = getCurlHandle(cookiefile = "", verbose = TRUE)

Here is the statement I am using that causes the timeout. I am happy to share the code if there is interest.

    webpage = getURLContent(url, followlocation=TRUE, curl = curl,.opts=list( verbose = TRUE, timeout=90, maxredirs = 2))

woodwardjj

Help with PHP simplehtmldom - Modifiying a form.

- by onemyndseye

I've gotten some great help here and I am so close to solving my problem that I can taste it. But I seem to be stuck. I need to scrape a simple form from a local webserver and only return the lines that match a user's local email (i.e. onemyndseye@localhost). simplehtmldom makes easy work of extracting the correct form element:

    foreach($html->find('form[action*="delete"]') as $form) echo $form;

Returns:

    <form action="/delete" method="post">
    <input type="checkbox" id="D1" name="D1" /><a href="http://www.linux.com/rss/feeds.php"> http://www.linux.com/rss/feeds.php </a> [email: onemyndseye@localhost (Default) ]<br />
    <input type="checkbox" id="D2" name="D2" /><a href="http://www.ubuntu.com/rss.xml"> http://www.ubuntu.com/rss.xml </a> [email: onemyndseye@localhost (Default) ]<br />

However, I am having trouble making the next step, which is returning the lines that contain 'onemyndseye@localhost' and removing that text so that only the following is returned:

    <input type="checkbox" id="D1" name="D1" /><a href="http://www.linux.com/rss/feeds.php">http://www.linux.com/rss/feeds.php</a> <br />
    <input type="checkbox" id="D2" name="D2" /><a href="http://www.ubuntu.com/rss.xml">http://www.ubuntu.com/rss.xml</a> <br />

Thanks to the wonderful users of this site I've gotten this far and can even return just the links, but I am having trouble getting the rest... It's important that the complete <input> tags are returned EXACTLY as shown above, as the id and name values will need to be passed back to the original form in post data later on. Thanks in advance!

Why are Facebook Likes Insisting on using Wrong Product Image...?

- by Joan Kent

Firstly, I'm not a web developer so please be patient. I have read the other posts but I think I have everything covered. My website http://www.joaniesgifts.co.uk includes the like button on the product pages. However, I've found that certain product pages are using the incorrect image when a user likes the page. For example - http://www.joaniesgifts.co.uk/terramundi-money-pots/terramundi-money-pot-holiday-fund I think this may have been down to an original incorrect setup which is now corrected. However, the problem remains... The only thing I have to go on: if I use the Facebook URL linter (developers.facebook.com/tools/debug) on the above product page, I receive the following error:

    Object at URL 'http://www.joaniesgifts.co.uk/terramundi-money-pot-holiday-fund' of type '213689662010141:product' is invalid because the domain 'www.joaniesgifts.co.uk' is not allowed for the application id '213689662010141' which owns the specified object type. If you are the owner of this application, you can verify your configured 'Site Domain' at developers.facebook.com/apps/213689662010141.

(I have verified my site's domain.) Everything else appears fine, except it is also showing the wrong image!! However, under Raw Open Graph Document Information it has the correct link! If I then click graph api - graph.facebook.com/10150450766583352 - it again shows the wrong image was linked! I've no idea what else to do - can you help me? Kind Regards, Joan

PS Graph API shows the incorrect image after a scrape only minutes ago:

    {
      "url": "http://www.joaniesgifts.co.uk/terramundi-money-pot-holiday-fund",
      "type": "website",
      "title": "Terramundi Money Pot - Holiday Fund",
      "image": [
        { "url": "http://www.joaniesgifts.co.uk/index.php?route=product\u00252Fproduct\u00252Fcaptcha" }
      ],
      "updated_time": "2011-11-11T18:54:38+0000",
      "id": "10150450766583352"
    }

Screen Scraping Twitter

- by BRADINO

I got an email today asking for help to scrape Twitter. In particular, to be able to log in. So I am going to show everyone, NOT to encourage anyone to violate Twitter's terms of use but as an educational blog post about how PHP and cURL can be used to post variables and store cookies. Again, I am using the cScrape class I wrote, which you can download.

Step 1

First go to twitter.com and look at the source code of the login form to get the form field names and the form post location. You will see that the form posts to https://twitter.com/session and the username and password fields are session[username_or_email] and session[password] respectively.

Step 2

Now you are ready to log in. Using the fetch function in the Scrape class, you create an associative array to contain the form values you want to post. The other thing you will need to do is uncomment the lines for CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR. Cookies will be required to stay logged in and scrape around. The paths to the cookie files need to be writable by your app. You will also need to uncomment the line about CURLOPT_FOLLOWLOCATION.

    $data = array('session[username_or_email]' => "bradino", 'session[password]' => "secret");
    $scrape->fetch('https://twitter.com/sessions',$data);

Step 1.5

Oops, that didn't work. All I got back was 403 Forbidden: The server understood the request, but is refusing to fulfill it. Ahhh, I see another variable called authenticity_token; I bet Twitter was looking for that. So let's back up and first hit twitter.com to get the authenticity_token variable, and then make the login post request with that variable included in our array of parameters.

    $scrape->fetch('https://twitter.com');
    $data = array('session[username_or_email]' => "bradino", 'session[password]' => "secret");
    $data['authenticity_token'] = $scrape->fetchBetween('name="authenticity_token" type="hidden" value="','"',$scrape->result);
    $scrape->fetch('https://twitter.com/sessions',$data);
    echo $scrape->result;

So that's basically it. Now you are logged in and can scrape around and request other pages as you normally would. Sorry it wasn't a longer post. I really do enjoy this kind of stuff, so if anyone has a request, hit me up.

Errors?

1) Make sure that you are properly parsing the token variable
2) Make sure that you uncommented the lines about CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR; those options need to be enabled, and be sure the path set is writable by your application
3) Make sure that the path to the cookie file is writable and that it is getting data written to it
4) If you get a message about being redirected, you need to uncomment the line about CURLOPT_FOLLOWLOCATION; that option needs to be set to true
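The token step boils down to pulling the text between two markers out of the fetched page. Here is a minimal Python sketch of that idea; it is a guess at what the class's fetchBetween does based on how it is called above, not its actual implementation, and the login-page fragment and token value are invented for the example.

```python
def fetch_between(start, end, text):
    """Return the text between the first start marker and the next end marker.

    Returns None when either marker is missing.
    """
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    stop = text.find(end, begin)
    if stop == -1:
        return None
    return text[begin:stop]


# Hypothetical login-page fragment; the real markup may differ.
login_page = '<input name="authenticity_token" type="hidden" value="abc123def" />'
token = fetch_between('name="authenticity_token" type="hidden" value="', '"', login_page)

# The token is then posted together with the credentials.
form_data = {
    "session[username_or_email]": "bradino",
    "session[password]": "secret",
    "authenticity_token": token,
}
print(form_data["authenticity_token"])
```

Matching on an exact attribute order is brittle: if the site reorders the attributes of the hidden input, the start marker no longer matches and the function returns None, which corresponds to item 1 in the error checklist.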

PHP DOMDocument Error Handling Problem

- by Jon

I'm having trouble trying to write an if statement for DOM that will check if $html is blank. However, whenever the HTML page does end up blank, it just removes everything that would be below DOM (including what I had to check if it was blank).

    $html = file_get_contents("http://example.com/");
    $dom = new DOMDocument;
    @$dom->loadHTML($html);
    $links = $dom->getElementById('dividhere')->getElementsByTagName('img');
    foreach ($links as $link) {
        echo $link->getAttribute('src');
    }

All this does is grab an image URL in the specified div, which works perfectly until the page is a blank HTML page. I've tried using SimpleHTMLDOM, which didn't work either (it didn't even fetch the image on working pages). Did I happen to miss something with this one, or am I just missing something in both?

    include_once('simple_html_dom.php');
    $html = file_get_html("http://example.com/");
    foreach($html->find('div[id="dividhere"]') as $div) {
        if(empty($div->src)) {
            continue;
        }
        echo $div->src;
    }

Search Results

Search found 210 results on 9 pages for 'scrape'.
