Search Results

Search found 210 results on 9 pages for 'scrape'.

Page 2/9

  • Ruby - RegEx problem or maybe another solution altogether

    - by r3nrut
    OK, the problem I'm having is that I have a block of JavaScript I've successfully scraped out of a website's source, and now I have to sift through the JS to get the specific values I'm looking for. Below is the chunk I need to deal with. I need to find "flvFileName" and get all the file names listed; in this case that's trailer1, trailer2, trailer3. At first I used a regex to match the start and end tags and then match the file names and extract them into an array, but the problem is that there aren't always three videos in the list. It could be 0, 1, 2, 3, 4, etc., so that kind of matching doesn't work. Any thoughts on an approach that won't make me continue to abuse my laptop?

        ["", "\r\n", "\n", "\r\n function IgnoreEnter(e) {\r\n var code;\r\n if (!e) // IE\r\n {\r\n var e = window.event;\r\n }\r\n if (e.keyCode) {\r\n code = e.keyCode;\r\n }\r\n else if (e.which) // Firefox, Opera\r\n {\r\n code = e.which;\r\n }\r\n\r\n if (code == 13) {\r\n e.cancelBubble = true;\r\n e.returnValue = false;\r\n }\r\n }\r\n\r\n function ResetDefault() {\r\n __defaultFired = false;\r\n }\r\n", "", "\r\n// <![CDATA[\r\n$(doc).ready(function () { $('#VideoObject').flash({ swf: '/scinema/video.swf', height: 300, width: 480, hasVersion: 8, menu: false, wmode: 'transparent', bgcolor: '#000',flashvars: {flvFileName: 'trailer1,trailer2,trailer3', age: 'no', isForced: 'true'} }); });
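
    For what it's worth, a rough sketch of the capture-then-split approach in Python; the question is Ruby, but the same regex drops into String#match or String#scan, and it copes with any number of file names, including none. The js string is a trimmed stand-in for the scraped blob above.

        import re

        # Trimmed stand-in for the scraped JavaScript blob shown above.
        js = "flashvars: {flvFileName: 'trailer1,trailer2,trailer3', age: 'no', isForced: 'true'}"

        # Capture whatever sits between the quotes after flvFileName, then split on
        # commas; an empty capture simply yields an empty list, so 0..n names all work.
        match = re.search(r"flvFileName:\s*'([^']*)'", js)
        names = match.group(1).split(",") if match and match.group(1) else []
        print(names)  # ['trailer1', 'trailer2', 'trailer3']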

    Read the article

  • How can I use Perl to scrape a website that reveals its content with JavaScript?

    - by AmbroseChapel
    I need to write a Perl script to scrape a website. The website only reveals its content via JavaScript, and the user is on Windows. I got some way with Win32::IE::Mechanize on my work machine, which has IE6, but then I moved to my netbook, which has IE8, and now I can't even get as far as fetching a simple page. Is Win32::IE::Mechanize up to date with the latest versions of IE? But, more to the point, given a recent WinXP machine, what's the quickest, easiest way to scrape a site which only reveals its content via JavaScript?
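
    One commonly suggested route for JavaScript-only sites is to drive a real browser and read the DOM after the scripts have run. A sketch of that idea with Selenium, shown in Python only for brevity (browser-automation modules in other languages follow the same pattern); the URL is a placeholder.

        from selenium import webdriver

        # Any installed browser driver will do; Firefox is just an example here.
        driver = webdriver.Firefox()
        driver.get("http://example.com/js-only-page")   # placeholder URL

        # page_source is the DOM after the page's JavaScript has run, which is
        # what a plain HTTP fetch never sees.
        rendered_html = driver.page_source
        print(len(rendered_html))

        driver.quit()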

    Read the article

  • What's the fastest way to scrape a lot of pages in php?

    - by Yegor
    I have a data aggregator that relies on scraping several sites and indexing their information in a way that is searchable to the user. I need to be able to scrape a vast number of pages daily, and I have run into problems using simple curl requests, which are fairly slow when executed in rapid sequence for a long time (the scraper runs 24/7, basically). Running a multi-curl request in a simple while loop is fairly slow. I sped it up by doing individual curl requests in a background process, which works faster, but sooner or later the slower requests start piling up, which ends up crashing the server. Are there more efficient ways of scraping data? Perhaps command-line curl?
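
    Not the author's PHP setup, but a sketch of the usual shape of a fix: keep a fixed number of requests in flight with a bounded worker pool rather than spawning background processes that can pile up. Shown with Python's standard library purely as an illustration (PHP's curl_multi_* family can be driven in a similar rolling fashion); the URLs and worker count are placeholders.

        from concurrent.futures import ThreadPoolExecutor
        from urllib.request import urlopen

        # Hypothetical URL list; the real scraper would feed its own queue here.
        urls = ["http://example.com/page/%d" % i for i in range(1, 51)]

        def fetch(url):
            # A per-request timeout keeps one slow host from tying up a worker forever.
            with urlopen(url, timeout=10) as resp:
                return url, resp.read()

        # A bounded pool (10 workers here) caps how many requests are in flight at
        # once, which is the usual cure for slow requests piling up over time.
        with ThreadPoolExecutor(max_workers=10) as pool:
            for url, body in pool.map(fetch, urls):
                print(url, len(body))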

    Read the article

  • Trying to scrape a page with php/curl, trouble with cookies, post vars, and hidden fields.

    - by Patrick
    I'm trying to use cURL to scrape the results from the page http://adlab.msn.com/Online-Commercial-Intention/Default.aspx. My understanding is that I visit this page, it places cookie info and sets a few variables, then I enter my query, select the query radio option, and click Go. The problem is that the code below doesn't work the way the site does when used through a browser. I've tweaked several things, but I'm posting here in hopes someone can find what I'm missing. This is my code as it stands now:

        include("simple_html_dom.php");
        $ckfile = tempnam("/tmp", "CURLCOOKIE");
        $query = $_GET['query'];

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, "http://adlab.msn.com/Online-Commercial-Intention/default.aspx");
        curl_setopt($ch, CURLOPT_HEADER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3");
        $pagetext1 = curl_exec($ch);
        curl_exec($ch);

        $html2 = str_get_html($pagetext1);
        $viewstate = $html2->find('input[id=__VIEWSTATE]', 1)->plaintext;
        echo $query."<br>".$viewstate."<br>";

        $params = array(
            '__EVENTTARGET' => "",
            '__EVENTARGUMENT' => "",
            '__LASTFOCUS' => "",
            '__VIEWSTATE' => "$viewstate",
            'MyMaster%3ADemoPageContent%3AtxtQuery' => "$query",
            'MyMaster%3ADemoPageContent%3Alan' => "QueryRadio",
            'MyMaster%3ADemoPageContent%3AgoButton.x' => "17",
            'MyMaster%3ADemoPageContent%3AgoButton.y' => "12",
            'MyMaster%3ADemoPageContent%3AidQuery' => "$query",
            'MyMaster%3AHiddenKeywordTextBox' => "",
        );

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, 'http://adlab.msn.com/Online-Commercial-Intention/default.aspx');
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_REFERER, 'http://adlab.msn.com/Online-Commercial-Intention/default.aspx');
        curl_setopt($ch, CURLOPT_POSTFIELDS, '$params');
        curl_setopt($ch, CURLOPT_HEADER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3");
        $pagetext = curl_exec($ch);
        curl_exec($ch);
        // echo $ckfile;

        $html = str_get_html($pagetext);
        $ret = $html->find('.intentionLabel', 1)->plaintext;
        echo $ret."<br><br><br>";
        echo $pagetext;
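
    Not a fix for the PHP above, just a sketch of the same round trip (GET the page, lift the hidden ASP.NET field, POST it back on the same cookie session) using Python's requests library for comparison. The form-field name is illustrative; it has to match the form's actual name attribute exactly, not a pre-URL-encoded version of it.

        import re
        import requests

        URL = "http://adlab.msn.com/Online-Commercial-Intention/Default.aspx"  # from the question
        session = requests.Session()        # carries the cookies from the GET into the POST

        first = session.get(URL)
        # Hidden ASP.NET fields have to be echoed back verbatim with the POST.
        viewstate = re.search(r'id="__VIEWSTATE" value="([^"]*)"', first.text).group(1)

        payload = {
            "__VIEWSTATE": viewstate,
            # Illustrative field name: use the exact name="" attribute from the form;
            # the HTTP layer does the URL-encoding itself.
            "MyMaster:DemoPageContent:txtQuery": "example query",
        }
        result = session.post(URL, data=payload, headers={"Referer": URL})
        print(result.status_code, len(result.text))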

    Read the article

  • Scraping paginated items from a website using scrapy

    - by Mridang Agarwalla
    I'm using scrapy to scrape items from a site, and I haven't been able to implement this scraping pattern. The site I'm trying to scrape is a forum, and I scrape it once a day. Each page has a table containing posts. New posts are added to the top of the table, and as more and more posts are posted to the site, the older posts move further back through the pages due to pagination. This is a very simple scenario and we will assume that the order of the posts never changes. I would like to scrape this site and collect all the "new" records until the last scraped post from yesterday is encountered. I have configured my spider to paginate endlessly, and when it encounters yesterday's last scraped post, it should stop. How can I implement this? (My Scrapy installation works with my Django installation using django-dynamic-scraper.)
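
    A rough sketch, in plain Scrapy, of one way to express "keep paginating until yesterday's newest post shows up"; the URL, selectors, data-post-id attribute and the stored last_seen_id are all placeholders, and a django-dynamic-scraper setup would wire the stored ID in differently.

        import scrapy

        class ForumSpider(scrapy.Spider):
            # Illustrative only: URL, selectors and the stored "last seen" ID are placeholders.
            name = "forum"
            start_urls = ["http://example.com/forum?page=1"]
            last_seen_id = "12345"   # in the real project this comes from yesterday's run

            def parse(self, response):
                reached_old_posts = False
                for row in response.css("table tr"):
                    post_id = row.attrib.get("data-post-id")
                    if post_id == self.last_seen_id:
                        reached_old_posts = True   # everything below was scraped yesterday
                        break
                    yield {"id": post_id, "title": row.css("td a::text").get()}

                # Only keep paginating while we are still seeing unseen posts.
                if not reached_old_posts:
                    next_page = response.css("a.next::attr(href)").get()
                    if next_page:
                        yield response.follow(next_page, callback=self.parse)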

    Read the article

  • How to scrape the first paragraph from a wikipedia page?

    - by David
    Let's say I want to grab the first paragraph of this Wikipedia page. How do I get the main text between the title and the contents box using XPath, or the DOM and PHP, or something similar? Is there any PHP library for that? I don't want to use the API because it's a bit complex. Note: I just need this to add a widget under my pages that displays related info from Wikipedia.
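
    A sketch of the no-API approach, shown with Python's lxml for brevity (PHP's DOMDocument/DOMXPath take the same XPath): fetch the article and keep the first non-empty paragraph inside the main content container. The container id and the example URL are assumptions about Wikipedia's current markup and may change.

        from urllib.request import Request, urlopen
        from lxml import html  # assumes lxml is installed

        # Illustrative article; any Wikipedia page works the same way.
        url = "https://en.wikipedia.org/wiki/Web_scraping"
        req = Request(url, headers={"User-Agent": "example-widget/0.1"})
        tree = html.fromstring(urlopen(req, timeout=10).read())

        # The lead paragraphs sit inside the main content container; take the first
        # <p> with actual text. The container id is an assumption and may change.
        paragraphs = tree.xpath('//div[@id="mw-content-text"]//p')
        first = next((p.text_content().strip() for p in paragraphs if p.text_content().strip()), "")
        print(first)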

    Read the article

  • What is the fastest way to scrape HTML webpage in Android?

    - by kunjaan
    I need to extract information from an unstructured web page in Android. The information I want is embedded in a table that doesn't have an id:

        <table> <tr><td>Description</td><td></td><td>I want this field next to the description cell</td></tr> </table>

    Should I use pattern matching? Use a BufferedReader to extract the information? Or is there a faster way to get that information?
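
    The question is Android, where an HTML parser such as jsoup is the usual suggestion, but the lookup itself is language-neutral: anchor on the cell whose text is "Description" and step to the sibling that holds the value. A sketch of that idea in Python/lxml, using the fragment from the question.

        from lxml import html  # jsoup plays the same role on Android

        fragment = """
        <table>
          <tr><td>Description</td><td></td><td>I want this field next to the description cell</td></tr>
        </table>
        """
        doc = html.fromstring(fragment)

        # Anchor on the "Description" cell and walk to the sibling that holds the
        # value; no table id is needed for this kind of lookup.
        value = doc.xpath('//td[normalize-space()="Description"]/following-sibling::td[2]/text()')
        print(value)   # ['I want this field next to the description cell']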

    Read the article

  • How to scrape user's data without being banned by the server?

    - by embedded
    I'm developing a site which monitors users' data. It uses cURL over PHP. It first gets authorized using a cookie and then parses the required data. My problem is that it needs to fire multiple requests to the server (for all registered users), and this may get me banned by the remote server. I would like to know if there is something I could do to prevent being banned. (This activity is legal - the users have provided their login information.) Thanks
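
    Not specific to the site in question, but the usual first line of defence is simply pacing: space the per-user requests out and add a little jitter so the traffic doesn't look automated. A minimal sketch in Python, with placeholder URLs and delays.

        import random
        import time
        from urllib.request import urlopen

        # Hypothetical per-user URLs; the real list comes from the registered accounts.
        user_urls = ["http://example.com/data?user=%d" % i for i in range(1, 6)]

        for url in user_urls:
            with urlopen(url, timeout=10) as resp:
                body = resp.read()        # parsing of the fetched data would go here
            print(url, len(body))
            # Spacing requests out, with a little jitter so they don't arrive like a
            # metronome, is the simplest way to stay under most servers' rate limits.
            time.sleep(5 + random.uniform(0, 5))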

    Read the article

  • Use the Django ORM in a standalone script (again)

    - by Rishabh Manocha
    I'm trying to use the Django ORM in some standalone screen scraping scripts. I know this question has been asked before, but I'm unable to figure out a good solution for my particular problem. I have a Django project with defined models. What I would like to do is use these models and the ORM in my scraping script. My directory structure is something like this:

        project
            scrape                  # scraping scripts
                ...
                test.py
            web
                django_project
                    settings.py
                    ...             # Django files

    I tried doing the following in project/scrape/test.py:

        print os.path.join(os.path.abspath('..'), 'web', 'django_project')
        sys.path.append(os.path.join(os.path.abspath('..'), 'web', 'django_project'))
        print sys.path
        print "-------"
        os.environ['DJANGO_SETTINGS_MODULE'] = 'django_project.settings'
        #print os.environ
        from django_project.myapp.models import MyModel
        print MyModel.objects.count()

    However, I get an ImportError when I try to run test.py:

        Traceback (most recent call last):
          File "test.py", line 12, in <module>
            from django_project.myapp.models import MyModel
        ImportError: No module named django_project.myapp.models

    One solution I found around this problem is to create a symbolic link to ../web/govcheck in the scrape folder:

        :scrape rmanocha$ ln -s ../web/govcheck ./govcheck

    With this, I can then run test.py just fine. However, this seems like a hack, and more importantly, is not very portable (I will have to create this symbolic link everywhere I run this code). So, I was wondering if anyone has any better solutions for my problem?
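
    A sketch of the usual no-symlink fix: put the directory that contains the django_project package (i.e. web, not web/django_project) on sys.path, so the dotted import can be resolved. Paths below assume the layout shown above; on newer Django versions an explicit django.setup() call is also needed before touching the ORM.

        import os
        import sys

        # Add .../project/web (the directory that CONTAINS the django_project
        # package) so that "django_project.myapp.models" resolves as a package import.
        WEB_DIR = os.path.abspath(
            os.path.join(os.path.dirname(os.path.abspath(__file__)), '..', 'web'))
        sys.path.append(WEB_DIR)

        os.environ['DJANGO_SETTINGS_MODULE'] = 'django_project.settings'
        # import django; django.setup()   # required on modern Django, not on old versions

        from django_project.myapp.models import MyModel
        print(MyModel.objects.count())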

    Read the article

  • YQL How to use wildcard in XPath

    - by uku
    Hello, I have a malformed page to scrape and have had a hard time getting the correct XPath for YQL. I can scrape individual fields that I need using, for example: //*[@id="cell_12345"] But what I really need to do is return all elements whose ID begins with cell_. Something like: //*[@id="cell_"*] How do I do this? Also, if anybody can point me to a good XPath reference it would be very helpful. Thanks!
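
    XPath 1.0 already has a function for this, starts-with(); a quick illustration with Python's lxml on a stand-in document. The same expression should drop straight into the YQL query's xpath parameter.

        from lxml import html  # assumes lxml is available

        # Tiny stand-in document; a real page would be fetched and parsed the same way.
        doc = html.fromstring(
            "<html><body>"
            "<div id='cell_123'>a</div><div id='cell_456'>b</div><div id='other'>c</div>"
            "</body></html>"
        )

        # starts-with() is the XPath 1.0 way to say "the id attribute begins with cell_".
        cells = doc.xpath('//*[starts-with(@id, "cell_")]')
        print([c.get("id") for c in cells])   # ['cell_123', 'cell_456']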

    Read the article

  • Can YQL parse web sites requiring cookie-based authentication?

    - by user249488
    First, my use case: I'm trying to use YQL's built-in XPath capabilities to scrape content from Yahoo! Fantasy Sports. It uses some sort of cookie-based authentication scheme. Basically, the sequence is:
    1) Do an HTTP GET on the Yahoo! Login page
    2) Parse the hidden inputs from the response and do an HTTP PUT with your Yahoo! Login on the form URL
    3) Use the cookies returned from step 2 to GET any of the Fantasy Sports! websites that you have access to
    My question is, does YQL support doing this to scrape data? The only authentication-based examples I've seen use OAuth, but I haven't seen any examples of using YQL to parse websites with cookie-based authentication schemes.

    Read the article

  • Scraping *.aspx content using Python

    - by tomato
    I'm having difficulties scraping a dynamically generated table in ASPX. I'm trying to scrape the gas prices from a site like this GasPrices. I can extract all the information in the gas price table (address, time submitted, etc.), except for the actual gas price. Is there a way I could scrape the gas prices, i.e. somehow get a text representation of them? I'm not very familiar with ASP/ASPX, but what's being generated now is not showing up in the final HTML. I'm using Python to do the scraping, but that's irrelevant unless there's a specific library... Thanks in advance.

    Read the article

  • Screen Scraping

    - by Sambo
    Hi, I'm trying to implement a screen-scraping scenario on my website and have the following set up so far. What I'm ultimately trying to do is replace all links in the $results variable that contain "ResultsDetails.aspx?" with "results-scrape-details/" and then output it again. Can anyone point me in the right direction?

        <?php
        $url = "http://mysite:90/Testing/label/stuff/ResultsIndex.aspx";
        $raw = file_get_contents($url);
        $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
        $content = str_replace($newlines, "", html_entity_decode($raw));
        $start = strpos($content,"<div id='pageBack'");
        $end = strpos($content,'</body>',$start) + 6;
        $results = substr($content,$start,$end-$start);
        $pattern = 'ResultsDetails.aspx?';
        $replacement = 'results-scrape-details/';
        preg_replace($pattern, $replacement, $results);
        echo $results;
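
    For what it's worth, a sketch of the same rewrite-the-links idea in Python on a placeholder URL. The point it illustrates also applies to the PHP above: string replacement functions return a new string rather than modifying their input, so the result has to be assigned back (PHP's str_replace and preg_replace behave the same way).

        from urllib.request import urlopen

        # Placeholder source page; the real one is the ResultsIndex.aspx URL above.
        raw = urlopen("http://example.com/ResultsIndex.aspx", timeout=10).read().decode("utf-8", "replace")

        # A plain substring swap is enough for a fixed string; no regex needed.
        # The return value must be captured, since str.replace does not modify in place.
        rewritten = raw.replace("ResultsDetails.aspx?", "results-scrape-details/")
        print(rewritten)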

    Read the article

  • Perl modules for controlling browsers

    - by AmbroseChapel
    I need to write a Perl script to scrape a website. The website only reveals its content via JavaScript, and the user is on Windows. I got some way with Win32::IE::Mechanize on my work machine, which has IE6, but then I moved to my netbook, which has IE8, and now I can't even get as far as fetching a simple page. Is Win32::IE::Mechanize up to date with the latest versions of IE? But, tl;dr -- more to the point, given a recent WinXP machine, what's the quickest, easiest way to scrape a site which only reveals its content via JavaScript?

    Read the article

  • Selenium navigation inside foreach loop

    - by smudgedlens
    I am having an issue with navigation. I get a list of rows from an HTML table, iterate over the rows, and scrape information from them. Each row also contains a link that I click to go to more information related to that row, which I also scrape, and then I navigate back to the page with the original table. This works for the first row, but for the subsequent rows it throws an exception. When I look at my row collection after the first time a link inside a row has been clicked, none of the rows have the correct values they had before I clicked the link. I believe there is something going on when I navigate to a different URL that I'm not getting. My code is below. How do I get this working so I can iterate over the parent table, click the links in each row, navigate to the child table, and still continue iterating over the rows in the parent table?

        private List<Document> getResults()
        {
            var documents = new List<Document>();

            //Results
            IWebElement docsTable = this.webDriver.FindElements(By.TagName("table"))
                .Where(table => table.Text.Contains("Document List"))
                .FirstOrDefault();

            var validDocRowRegex = new Regex(@"^(\d{3}\s+)");
            var docRows = docsTable.FindElements(By.TagName("tr"))
                .Where(row =>
                    //It throws an exception with .FindElement() when there isn't one.
                    row.FindElements(By.TagName("td")).FirstOrDefault() != null &&
                    //Yeah, I don't get this one either. I negate the match and so it works??
                    !validDocRowRegex.IsMatch(row.FindElement(By.TagName("td")).Text))
                .ToList();

            foreach (var docRow in docRows)
            {
                //Todo: find out why this is crashing on some documents.
                var cells = docRow.FindElements(By.TagName("td"));
                var document = new Document
                {
                    DocID = Convert.ToInt32(cells.First().Text),
                    PNum = Convert.ToInt32(cells[1].Text),
                    AuthNum = Convert.ToInt32(cells[2].Text)
                };

                //Go to history for the current document.
                cells.Where(cell => cell.FindElements(By.TagName("a")).FirstOrDefault() != null)
                    .FirstOrDefault().Click();

                //Todo: scrape child table.
                this.webDriver.Navigate().Back();
            }

            return documents;
        }
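
    Not the author's C#, but a sketch of the usual workaround in Python: element references go stale as soon as the page is left, so count the rows first and re-locate the current row by index after every navigation back. The URL and locators are placeholders.

        from selenium import webdriver
        from selenium.webdriver.common.by import By

        driver = webdriver.Firefox()                  # any WebDriver works the same way
        driver.get("http://example.com/documents")    # hypothetical parent-table page

        rows_locator = (By.CSS_SELECTOR, "table tr")  # placeholder locator
        row_count = len(driver.find_elements(*rows_locator))

        for i in range(row_count):
            # Re-find the rows on every pass: references captured before navigating
            # away are stale once we come back to this page.
            row = driver.find_elements(*rows_locator)[i]
            cells = row.find_elements(By.TAG_NAME, "td")
            print([c.text for c in cells])            # scrape the parent row here

            links = row.find_elements(By.TAG_NAME, "a")
            if links:
                links[0].click()                      # open the detail/child page
                # ... scrape the child table here ...
                driver.back()                         # return to the parent table

        driver.quit()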

    Read the article

  • Is there an SO API which can fetch all Questions & Answers for a particular keyword

    - by user4203
    I am looking for an API which helps in fetching all the Questions & Answers from SO and other Stack Exchange sites, but only for a particular "keyword". Later, using XML-RPC, these questions will be posted as blog posts and their answers as the posts' answers. Just wondering whether it's possible with an API. One of my friends suggested that we should scrape, but I don't want screen scraping; instead, I am looking for API requests which could handle this.

    Read the article

  • What icon would you use to denote an XML (not rss) feed available [closed]

    - by mplungjan
    Given two sites - one aimed at regular users and one for automated access. The first site is the better known, so many people are (still) screen scraping that site for data. It is preferable to have them move to the other site, where the same data is available in XML format. What icon (plus text/title) on a page you are about to screen scrape would make you pay attention and decide to see what it was about? Examples from Google Image search for xml icon

    Read the article

  • R: extracting "clean" UTF-8 text from a web page scraped with RCurl

    - by SlowLearner
    Using R, I am trying to scrape a web page and save the text, which is in Japanese, to a file. Ultimately this needs to be scaled to tackle hundreds of pages on a daily basis. I already have a workable solution in Perl, but I am trying to migrate the script to R to reduce the cognitive load of switching between multiple languages. So far I am not succeeding. Related questions seem to be this one on saving csv files and this one on writing Hebrew to a HTML file. However, I haven't been successful in cobbling together a solution based on the answers there. The pages are from Yahoo! Japan Finance and my Perl code looks like this:

        use strict;
        use HTML::Tree;
        use LWP::Simple;
        #use Encode;
        use utf8;
        binmode STDOUT, ":utf8";
        my @arr_links = ();
        $arr_links[1] = "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203";
        $arr_links[2] = "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201";
        foreach my $link (@arr_links){
            $link =~ s/"//gi;
            print("$link\n");
            my $content = get($link);
            my $tree = HTML::Tree->new();
            $tree->parse($content);
            my $bar = $tree->as_text;
            open OUTFILE, ">>:utf8", join("","c:/", substr($link, -4),"_perl.txt") || die;
            print OUTFILE $bar;
        }

    This Perl script produces a CSV file that looks like the screenshot below, with proper kanji and kana that can be mined and manipulated offline. My R code, such as it is, looks like the following. The R script is not an exact duplicate of the Perl solution just given, as it doesn't strip out the HTML and leave the text (this answer suggests an approach using R but it doesn't work for me in this case), and it doesn't have the loop and so on, but the intent is the same.

        require(RCurl)
        require(XML)
        links <- list()
        links[1] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203"
        links[2] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201"
        txt <- getURL(links, .encoding = "UTF-8")
        Encoding(txt) <- "bytes"
        write.table(txt, "c:/geturl_r.txt", quote = FALSE, row.names = FALSE,
                    sep = "\t", fileEncoding = "UTF-8")

    This R script generates the output shown in the screenshot below: basically rubbish. I assume that there is some combination of HTML, text and file encoding that will allow me to generate in R a result similar to that of the Perl solution, but I cannot find it. The header of the HTML page I'm trying to scrape says the charset is utf-8, and I have set the encoding in the getURL call and in the write.table function to utf-8, but this alone isn't enough.

    The question: how can I scrape the above web page using R and save the text as CSV in "well-formed" Japanese text rather than something that looks like line noise?

    Edit: I have added a further screenshot to show what happens when I omit the Encoding step. I get what look like Unicode codes, but not the graphical representation of the characters. So it may be some kind of locale-related issue, but in the exact same locale the Perl script does provide useful output. So this is still puzzling.

    Read the article
