Search Results

Search found 346 results on 14 pages for 'scraping'.

Page 6/14

  • How would you protect a database of links from being scraped?

    - by Yegor
    I have a large database of links, all sorted in specific ways and attached to other information that is valuable (to some people). Currently my setup (which seems to work) simply calls a PHP file like link.php?id=123, which logs the request with a timestamp in the DB. Before it spits out the link, it checks how many requests were made from that IP in the last 5 minutes. If it's greater than x, it redirects you to a captcha page. That all works fine and dandy, but the site has been getting really popular (and has been getting DDoSed for about 6 weeks), so PHP has been getting floored, and I'm trying to minimize how often I have to hit PHP at all. I wanted to show links in plain text instead of through link.php?id= and use an onclick function to simply add 1 to the view count. I'm still hitting PHP, but at least if it lags, it does so in the background, and the user sees the link they requested right away. The problem is that this makes the site REALLY scrapable. Is there anything I can do to prevent this while still not relying on PHP to do the check before spitting out the link?
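
    A minimal sketch of the five-minute sliding-window check described above, in Python with SQLite for illustration; the table layout and the threshold value are assumptions, not the asker's actual schema:

        import sqlite3
        import time

        WINDOW = 5 * 60   # the five-minute window from the setup above
        LIMIT = 20        # assumed value for the "x" threshold

        db = sqlite3.connect('hits.db')
        db.execute('CREATE TABLE IF NOT EXISTS hits (ip TEXT, ts INTEGER)')

        def allow_request(ip):
            """Log this hit, then count recent hits from the same IP."""
            now = int(time.time())
            db.execute('INSERT INTO hits VALUES (?, ?)', (ip, now))
            count = db.execute('SELECT COUNT(*) FROM hits WHERE ip = ? AND ts > ?',
                               (ip, now - WINDOW)).fetchone()[0]
            db.commit()
            return count <= LIMIT   # False -> redirect to the captcha page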

    Read the article

  • Nokogiri Doc Element Not Returning Correctly

    - by TenJack
    I am trying to scrape a Wiktionary entry:

        require 'cgi'
        require 'open-uri'
        require 'nokogiri'

        uri = URI.parse("http://en.wiktionary.org/wiki/" + CGI.escape('abjure'))
        doc = Nokogiri::HTML(open(uri, 'User-Agent' => 'ruby'))

    but the doc shows no elements for this word. Other words work fine, and this word used to work; I have no idea what changed. Does anyone see anything wrong with this?

    Read the article

  • How to export a scrubyt extractor?

    - by robintw
    I've written a scrubyt extractor based on the 'learning' technique - that is, specifying the current text on the page and letting scrubyt work out the XPath expressions itself. However, I now want to export the extractor so that it can be used even when the page has changed. The documentation for scrubyt seems to be all over the place now, but from what I can find, I should be able to add the line extractor.export(__FILE__) and it should work. It doesn't - I just get an error saying that export was given the wrong number of arguments (it should have 0). I've tried it without any arguments and it still fails. I would ask on the scrubyt forum, but it seems like no one's been there for ages! Any ideas what to do here?

    Read the article

  • Calling UIGetScreenImage() on manually-spawned thread prints "_NSAutoreleaseNoPool():" message to log

    - by jtrim
    This is the body of the selector that is specified in NSThread +detachNewThreadSelector:(SEL)aSelector toTarget:(id)aTarget withObject:(id)anArgument:

        NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
        while (doIt) {
            if (doItForSure) {
                NSLog(@"checking");
                doItForSure = NO;
                (void)gettimeofday(&start, NULL);
                /* do some stuff */
                // the next line prints "_NSAutoreleaseNoPool():" message to the log
                CGImageRef screenImage = UIGetScreenImage();
                /* do some other stuff */
                (void)gettimeofday(&end, NULL);
                elapsed = ((double)(end.tv_sec) + (double)(end.tv_usec) / 1000000)
                        - ((double)(start.tv_sec) + (double)(start.tv_usec) / 1000000);
                NSLog(@"Time elapsed: %e", elapsed);
                [pool drain];
            }
        }
        [pool release];

    Even with the autorelease pool present, I get this printed to the log when I call UIGetScreenImage():

        2010-05-03 11:39:04.588 ProjectName[763:5903] *** _NSAutoreleaseNoPool(): Object 0x15a2e0 of class NSCFNumber autoreleased with no pool in place - just leaking

    Has anyone else seen this with UIGetScreenImage() on a separate thread?

    Read the article

  • What must I learn to write a PHP grabber (parser)?

    - by butteff
    What must I learn to write a PHP website grabber (parser)? It just needs to collect information from other websites - a weather forecast, the Wikipedia "on this day" entries, some news, and other useful "every day" information. Also, what should I read to write an m3u player in PHP?

    Read the article

  • What movie website allows people to scrape it?

    - by Sergio Tapia
    I want to make a C# library that scrapes movie information and returns it to the application, but someone told me that doing so is against the TOS. RottenTomatoes seems to have no problem with it, from what I've read on their licensing page, but I'm not quite sure. Where could I acquire movie information legally and without cost? It's for an open source application hosted here: LINK

    Read the article

  • Saving HttpResponse/Request to file system

    - by chrisjlong
    Here is my scenario. The user fills out a large page which is dynamically created based on DB values. Those values can change. When the user fills out the page and hits submit, we want to save a copy of the page as HTML on the server; this way, if the text or wording changes, when they go back to view their posted information it is historically accurate. So I basically need to do this:

        protected void buttonSave_Click(object sender, EventArgs e)
        {
            // collect information into an object to save it in the db
            bool result = BusinessLogic.Save(myBusinessObject);
            if (result)
                // !!! Here is where I need to save this page as an HTML file on my server's IFS !!!
            else
                // whatever
            Response.Redirect("~/SomeOtherPage.aspx");
        }

    Any help is greatly appreciated. Also, I CANNOT just request the data from the URL, because query string parameters are a big no-no in this case. The key to pull the database info up (at its highest level) is all in session, so I can't just request a URL and save it. Thanks!
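
    Not ASP.NET, but the shape of one common answer - render the page to a string first, write that string to disk, then send the same markup to the client - can be sketched. A hypothetical Python/Flask version for illustration only (route, template, and snapshot path are all made up; in ASP.NET the analogous hook is capturing the page's rendered output before it reaches the response):

        import os
        import time

        from flask import Flask, render_template_string

        app = Flask(__name__)

        # Hypothetical stand-in for the dynamically built page.
        PAGE = '<html><body>Form contents as of {{ when }}</body></html>'

        @app.route('/submit', methods=['POST'])
        def submit():
            # Render once, save the exact markup the user saw, then serve it.
            html = render_template_string(PAGE, when=time.ctime())
            if not os.path.isdir('snapshots'):
                os.makedirs('snapshots')
            with open(os.path.join('snapshots', '%d.html' % time.time()), 'w') as f:
                f.write(html)
            return html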

    Read the article

  • How to open URLs in Rails?

    - by yuval
    I'm trying to read in the HTML of a certain website. Trying

        @something = open("http://www.google.com/")

    fails with the following error:

        Errno::ENOENT in testController#show
        No such file or directory - http://www.google.com/

    Going to http://www.google.com/, I obviously see the site. What am I doing wrong? Thanks!

    Read the article

  • I want to scrape a site using GAE and post the results into a datastore Entity

    - by cozza
    I want to scrape this URL: https://www.xstreetsl.com/modules.php?searchSubmitImage_x=0&searchSubmitImage_y=0&SearchLocale=0&name=Marketplace&SearchKeyword=business&searchSubmitImage.x=0&searchSubmitImage.y=0&SearchLocale=0&SearchPriceMin=&SearchPriceMax=&SearchRatingMin=&SearchRatingMax=&sort=&dir=asc then go into each of the links, extract various pieces of information (e.g. permissions, prims, etc.), and post the results into an Entity on Google App Engine. What is the best way to go about it? Chris
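
    A minimal sketch of that flow on the Python GAE runtime. The datastore model and the link-matching pattern below are illustrative guesses, not derived from the target page's actual markup:

        import re

        from google.appengine.api import urlfetch
        from google.appengine.ext import db

        class Listing(db.Model):          # assumed schema for the scraped fields
            url = db.StringProperty()
            permissions = db.StringProperty()
            prims = db.IntegerProperty()

        def scrape(search_url):
            page = urlfetch.fetch(search_url).content
            # Hypothetical pattern: check the real detail-link markup first.
            for link in re.findall(r'href="(/modules\.php\?[^"]+)"', page):
                detail = urlfetch.fetch('https://www.xstreetsl.com' + link).content
                # ... parse permissions/prims out of `detail` here ...
                Listing(url=link).put()   # one datastore Entity per item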

    Read the article

  • Programmatically grabbing text from a web page that is dynamically generated

    - by bstullkid
    There is a website I am trying to pull information from in Perl; however, the section of the page I need is generated using JavaScript, so all you see in the source is:

        <div id="results"></div>

    I need to somehow pull out the contents of that div and save it to a file using Perl/proxies/whatever. E.g., the information I want to save would be document.getElementById('results').innerHTML;. I am not sure if this is possible, or if anyone has any ideas or a way to do this. I was using a lynx source dump for other pages, but since I can't straightforwardly screen-scrape this page, I came here to ask about it! If anyone is interested, the page is http://downloadcenter.trendmicro.com/index.php?clk=left_nav&clkval=pattern_file&regs=NABU and the info I am trying to get is the row about the ConsumerOPR.
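
    One common approach, sketched in Python rather than Perl: the div is filled by a background request that a browser's network inspector can reveal, and fetching that request's URL directly returns the data without running any JavaScript. The endpoint below is a placeholder - the real one has to be discovered by watching the page load:

        import urllib2

        # Placeholder: substitute the XHR endpoint seen in the network inspector.
        ajax_url = 'http://downloadcenter.trendmicro.com/example_ajax_endpoint'
        req = urllib2.Request(ajax_url, headers={'User-Agent': 'Mozilla/5.0'})
        html = urllib2.urlopen(req).read()   # the raw markup that lands in the div
        open('results.html', 'w').write(html)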

    Read the article

  • Yahoo Web Scrapes: What are the limits?

    - by bvandrunen
    We are using a web scraper, set up with a sleep call whose duration is randomized (so that the delay between scrapes is never the same), but we are still getting blocked by Yahoo after 20-30 requests. Does anyone know if there is a limit (e.g. 20 requests per minute, 200 an hour)? Right now our average delay between requests is around 3-6 seconds. Thanks for any help.
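
    For reference, a minimal Python sketch of the randomized-delay pattern being described; the 5-12 second range is an assumption (deliberately wider than the 3-6 seconds above), not a documented Yahoo threshold:

        import random
        import time
        import urllib2

        def polite_fetch(url, min_delay=5.0, max_delay=12.0):
            # A randomized, longer pause makes the request pattern less regular;
            # the actual rate limits can only be found empirically or in the ToS.
            time.sleep(random.uniform(min_delay, max_delay))
            return urllib2.urlopen(url).read()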

    Read the article

  • How can I get all content within a <td> tag using the HTML Agility Pack?

    - by Bob Dylan
    So I'm writing an application that will do a little screen scraping. I'm using the HTML Agility Pack to load an entire HTML page into an instance of HtmlDocument called doc. Now I want to parse that doc, looking for this:

        <table border="0" cellspacing="3">
          <tr><td>First rows stuff</td></tr>
          <tr>
            <td>
              The data I want is in here <br />
              and it's separated by these annoying <br />
              's. No id's, classes, or even a single <p> tag. </p>
              Just a bunch of <br /> tags.
            </td>
          </tr>
        </table>

    So I just need to get the data within the 2nd row. How can I do this? Should I use a regex or something else?
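
    For comparison, a hedged sketch of the same extraction in Python with lxml (the HTML Agility Pack accepts very similar XPath); the XPath assumes the table can only be identified by the two attributes shown above:

        import lxml.html

        doc = lxml.html.fromstring(html)   # `html` holds the fetched page source
        # Second <tr> of the table carrying exactly these attributes;
        # text_content() flattens the <br/>-separated runs into one string.
        cell = doc.xpath('//table[@border="0"][@cellspacing="3"]/tr[2]/td')[0]
        print(cell.text_content())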

    Read the article

  • Help converting code using httplib2 to use urllib2

    - by ThinkCode
    What am I trying to do? Visit a site, retrieve a cookie, then visit the next page on that site, sending the cookie info along. It all works, but httplib2 is giving me one too many problems with a SOCKS proxy on one site.

        http = httplib2.Http()
        main_url = 'http://mywebsite.com/get.aspx?id=' + id + '&rows=25'
        response, content = http.request(main_url, 'GET', headers=headers)
        main_cookie = response['set-cookie']
        referer = 'http://google.com'
        headers = {'Content-type': 'application/x-www-form-urlencoded',
                   'Cookie': main_cookie,
                   'User-Agent': USER_AGENT,
                   'Referer': referer}

    How do I do the exact same thing using urllib2 (cookie retrieving, passing to the next page on the same site)? Thank you.
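
    A minimal sketch of the urllib2 equivalent, using cookielib so the cookie round-trips automatically; the second URL is a placeholder for whatever the next page actually is, and `id`/USER_AGENT are taken from the snippet above:

        import cookielib
        import urllib2

        jar = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

        # First request: the Set-Cookie header lands in the jar automatically.
        main_url = 'http://mywebsite.com/get.aspx?id=' + id + '&rows=25'
        content = opener.open(main_url).read()

        # Second request: the jar re-sends the cookie; other headers go on the Request.
        req = urllib2.Request('http://mywebsite.com/next_page.aspx')  # placeholder
        req.add_header('User-Agent', USER_AGENT)
        req.add_header('Referer', 'http://google.com')
        next_content = opener.open(req).read()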

    Read the article

  • Why Shouldn't I Programmatically Submit Username/Password to Facebook/Twitter/Amazon/etc?

    - by viatropos
    I wish there were a central, fully customizable, open source, universal login system that allowed you to log in to and manage all of your online accounts (maybe there is?)... I just found RPXNow today, after starting to build a Sinatra app to log in to Google, Facebook, Twitter, Amazon, OpenID, and EventBrite, and it looks like it might save some time. But not being an authentication guru, I keep wondering: why couldn't I just have a sleek login page saying "Enter username and password, and check your login service", and then in the background either scrape the login page from, say, EventBrite and programmatically submit the form with Mechanize, or use an API if there is one? It would be so much cleaner and such a better user experience if users didn't have to go through popups and redirects and could use any previously existing accounts. My question is: what are the reasons why I shouldn't do something like that? I don't know much about the serious details of cookies/sessions/security, so if you could be descriptive or point me to some helpful links, that would be awesome. Thanks!

    Read the article

  • Voting on Hacker News stories programmatically?

    - by igorgue
    I decided to write an app like http://michaelgrinich.com/hackernews/ but for Android devices. My idea is to use a web application backend (because I'd rather code in Python for the web than entirely in Java for Android). What I have implemented right now is something like this:

        $ curl -i http://localhost:8080/stories.json?page=1\&stories=1
        HTTP/1.0 200 OK
        Date: Sun, 25 Apr 2010 07:59:37 GMT
        Server: WSGIServer/0.1 Python/2.6.5
        Content-Length: 296
        Content-Type: application/json

        [{"title": "Don\u2019t talk to aliens, warns Stephen Hawking", "url": "http://www.timesonline.co.uk/tol/news/science/space/article7107207.ece?", "unix_time": 1272175177, "comments": 15, "score": 38, "user": "chaostheory", "position": 1, "human_time": "Sun Apr 25 01:59:37 2010", "id": "1292241"}]

    The next step (and final one, I think) is voting. My design is to do something like this to vote up:

        $ curl -i http://localhost:8080/stories/1?vote=up -u username:password

    and this to vote down:

        $ curl -i http://localhost:8080/stories/1?vote=down -u username:password

    I have no idea how to do it, though... I was planning to use Twill, but the login link is always different, e.g. http://news.ycombinator.com/x?fnid=7u89ccHKln. Later the Android app will consume this API. Any experience with programmatically browsing Hacker News?
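
    One hedged approach, sketched with Python's mechanize: instead of precomputing the ever-changing fnid URL, load the live page each time and follow the link it actually contains. The login URL, form index, field names, and vote-link pattern below are guesses that must be verified against the real pages:

        import mechanize

        br = mechanize.Browser()
        # Assumed login URL and field names -- inspect the live page to confirm.
        br.open('http://news.ycombinator.com/newslogin')
        br.select_form(nr=0)       # assumed: the login form is first on the page
        br['u'] = 'myuser'
        br['p'] = 'mypassword'
        br.submit()

        # The story page carries vote links minted for this session, so follow
        # the link that is actually there rather than rebuilding its URL.
        br.open('http://news.ycombinator.com/item?id=1292241')
        br.follow_link(url_regex=r'vote\?for=1292241&dir=up')  # assumed shape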

    Read the article

  • Is selling a "website screen scraper" illegal?

    - by Yatendra Goel
    I have coded a "website screen scraper" and want to sell it commercially. I know that webpages scraped by the screen scraper are restricted to be scraped by the webmaser of that website. The robots.txt file of the website says that its webpages must not be scraped. So my question is whether selling that screen scraper is a crime or using that screen scraper is a crime in legal terms. I know that this question is related to law but I thought the software experts on SO must also have answer to this question.

    Read the article

  • Reading an ontology with Jena, feeding it RDF triples, and producing correct RDF string output

    - by JonB
    Hi, I have an ontology which I read in with Jena to help me scrape some RDFa triples from a website. I don't currently store these triples in a Jena model, but that is fairly straightforward to do; it's next on my to-do list. The area I am struggling with, though, is getting Jena to output correct RDF for the ontology I have. The ontology uses OWL and RDFS definitions, but when I pass some example triples into the model, they don't appear correctly - almost as if it doesn't know anything about the ontology. The output is still valid RDF, just not in the form I was hoping for. Am I correct in thinking that Jena should be able to produce well-written RDF (not just valid RDF) for the triples I have collected, based on the ontology, or does this outstretch what it is capable of? Many thanks for any input.

    Read the article

  • Submitting forms with mechanize and Python

    - by MATELIN Alexis
    I'm trying to scrape a website that requires submitting two forms: a first one to log in, and a second one to specify my search. I'm using Python and the mechanize package. No problem with the first one, but I just can't figure out how to get through the second one. Here is the part of my code related to the form mentioned above:

        agemin = 18
        agemax = 25
        by = 'region'
        country = 'France'
        region = 2
        newcustomers = 1

        browser.select_form(nr=0)
        browser['age[min]'] = agemin
        browser['age[max]'] = agemax
        browser['country'] = country
        browser['region'] = region
        browser['by'] = by
        browser['new-customers'] = newcustomers
        response = browser.submit()
        content = response.read()

    But when I submit the variable 'age[min]', for example, I get the following error message:

        TypeError: object of type 'int' has no len()

    To give you some more information, here is what I get with print br.form:

        <POST http://www.adopteunmec.com/qsearch/ajax_quick application/x-www-form-urlencoded
        <SelectControl(age[min]=[, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, *30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])>
        <SelectControl(age[max]=[, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, *45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])>
        <SelectControl(by=[*region, distance])>
        <SelectControl(country=[*fr, be, ch, ca])>
        <SelectControl(region=[*1, 2, 3, 4, 5, 6, 7, 8, 22, 23, 9, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 11])>
        <SelectControl(distance[min]=[*, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800, 810, 820, 830, 840, 850, 860, 870, 880, 890, 900, 910, 920, 930, 940, 950, 960, 970, 980, 990, 1000])>
        <SelectControl(distance[max]=[, 0, 10, 20, 30, 40, 50, 60, 70, *80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800, 810, 820, 830, 840, 850, 860, 870, 880, 890, 900, 910, 920, 930, 940, 950, 960, 970, 980, 990, 1000])>
        <CheckboxControl(new=[*1])>>

    My guess is that the form needs an object (like a list) containing all the variables before it will accept them; that's why it refuses the variables submitted one by one. Thank you in advance for any help! Alexis
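
    For what it's worth, mechanize's select and checkbox controls expect a list of option *strings*, not bare ints - a sketch of the assignments rewritten that way, with the option values taken from the print br.form dump above:

        browser.select_form(nr=0)
        browser['age[min]'] = ['18']   # a list of option strings, not an int
        browser['age[max]'] = ['25']
        browser['by'] = ['region']
        browser['country'] = ['fr']    # the dump shows fr/be/ch/ca, not 'France'
        browser['region'] = ['2']
        browser['new'] = ['1']         # the checkbox is named 'new' in the dump
        response = browser.submit()
        content = response.read()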

    Read the article

  • Automated download of website content using ASP.NET

    - by Yaaqov
    Using ASP.NET, what methods can I use to do the following:

    1. Open a connection to a given URL and read the HTML content.
    2. Parse that content for hyperlinks and place them in an array.
    3. Loop through each hyperlink (only 1 level down), opening each one, saving the HTML contents in a table, and moving to the next hyperlink until done.

    If ASP.NET is not up to the task, other languages or free scripts/toolkits would be acceptable. Thanks.
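
    Since the question allows other languages, a minimal Python sketch of the three steps; the crude href regex is a stand-in for a real HTML parser:

        import re
        import urllib2

        def crawl_one_level(start_url):
            html = urllib2.urlopen(start_url).read()         # step 1
            links = re.findall(r'href="(http[^"]+)"', html)  # step 2 (crude)
            pages = {}
            for link in links:                               # step 3: 1 level down
                try:
                    pages[link] = urllib2.urlopen(link).read()
                except urllib2.URLError:
                    pass        # skip links that fail to open
            return pages        # caller stores these rows in a table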

    Read the article

  • Is there a way to programmatically extract the feed of a podcast from the iTunes page?

    - by J. Pablo Fernández
    From an iTunes page, like http://itunes.apple.com/us/podcast/this-week-in-tech-mp3-edition/id73329404, is there a way to extract the corresponding feed address? In this case it would be http://leoville.tv/podcasts/twit.xml. I know that if you open it in iTunes you can extract the feed manually, but I want to do it programmatically. There's a link to the website of the podcast, but it may not be accurate; in this case it points to a web site with 20 podcasts on it.
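
    One possibility, sketched in Python: Apple's iTunes lookup endpoint returns podcast metadata as JSON, and for podcast IDs the payload has been observed to include a feedUrl field - treat that field name as an assumption to verify, not a documented contract:

        import json
        import urllib2

        podcast_id = '73329404'   # taken from the iTunes page URL above
        data = urllib2.urlopen(
            'http://itunes.apple.com/lookup?id=' + podcast_id).read()
        results = json.loads(data).get('results', [])
        if results:
            print(results[0].get('feedUrl'))   # e.g. the twit.xml address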

    Read the article

  • How can I get all content within <table></table> tags using a regex?

    - by Bob Dylan
    So I'm writing an application that will do a little screen scraping. All the pages (about 1000 or so) contain this:

        <table border="0" cellspacing="3">
          <tr><td>First rows stuff</td></tr>
          <tr>
            <td>
              The data I want is in here <br />
              and it's separated by these annoying <br />
              's. No id's, classes, or even a single <p> tag.
              Just a bunch of <br /> tags.
            </td>
          </tr>
        </table>

    So I just need to get the data within the 2nd row out. How can I do this? Should I use a regex or something else?
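
    A hedged Python sketch of the regex route - workable only because every page repeats this exact table markup, and fragile the moment tables nest or the attributes change:

        import re

        # re.S lets '.' span newlines; non-greedy stops at the first </table>.
        table = re.search(r'<table border="0" cellspacing="3">(.*?)</table>',
                          html, re.S)
        if table:
            rows = re.findall(r'<tr>\s*<td>(.*?)</td>\s*</tr>',
                              table.group(1), re.S)
            second_row = rows[1] if len(rows) > 1 else None   # the wanted data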

    Read the article

  • R: extracting "clean" UTF-8 text from a web page scraped with RCurl

    - by SlowLearner
    Using R, I am trying to scrape a web page and save the text, which is in Japanese, to a file. Ultimately this needs to be scaled to tackle hundreds of pages on a daily basis. I already have a workable solution in Perl, but I am trying to migrate the script to R to reduce the cognitive load of switching between multiple languages. So far I am not succeeding. Related questions seem to be this one on saving csv files and this one on writing Hebrew to a HTML file. However, I haven't been successful in cobbling together a solution based on the answers there. The pages are from Yahoo! Japan Finance, and my Perl code looks like this:

        use strict;
        use HTML::Tree;
        use LWP::Simple;
        #use Encode;
        use utf8;

        binmode STDOUT, ":utf8";

        my @arr_links = ();
        $arr_links[1] = "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203";
        $arr_links[2] = "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201";

        foreach my $link (@arr_links){
            $link =~ s/"//gi;
            print("$link\n");
            my $content = get($link);
            my $tree = HTML::Tree->new();
            $tree->parse($content);
            my $bar = $tree->as_text;
            open OUTFILE, ">>:utf8", join("", "c:/", substr($link, -4), "_perl.txt") || die;
            print OUTFILE $bar;
        }

    This Perl script produces a CSV file that looks like the screenshot below, with proper kanji and kana that can be mined and manipulated offline. My R code, such as it is, looks like the following. The R script is not an exact duplicate of the Perl solution just given, as it doesn't strip out the HTML and leave just the text (this answer suggests an approach using R, but it doesn't work for me in this case), and it doesn't have the loop and so on, but the intent is the same.

        require(RCurl)
        require(XML)

        links <- list()
        links[1] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203"
        links[2] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201"

        txt <- getURL(links, .encoding = "UTF-8")
        Encoding(txt) <- "bytes"
        write.table(txt, "c:/geturl_r.txt",
                    quote = FALSE, row.names = FALSE,
                    sep = "\t", fileEncoding = "UTF-8")

    This R script generates the output shown in the second screenshot - basically rubbish. I assume that there is some combination of HTML, text, and file encoding that will allow me to generate in R a similar result to that of the Perl solution, but I cannot find it. The header of the HTML page I'm trying to scrape says the charset is utf-8, and I have set the encoding in the getURL call and in the write.table function to utf-8, but this alone isn't enough.

    The question: how can I scrape the above web page using R and save the text as CSV in "well-formed" Japanese text rather than something that looks like line noise?

    Edit: I have added a further screenshot to show what happens when I omit the Encoding step. I get what look like Unicode codes, but not the graphical representation of the characters. So it may be some kind of locale-related issue, but in the exact same locale the Perl script does provide useful output. So this is still puzzling.

    Read the article
