pdf scraping - Page 56 - Developer IT

How to open URL's in rails?

- by yuval

I'm trying to read in the html of a certain website. Trying @something = open("http://www.google.com/") fails with the following error: Errno::ENOENT in testController#show No such file or directory - http://www.google.com/ Going to http://www.google.com/, I obviously see the site. What am I doing wrong? Thanks!

Read the article

How to export scrubyt extractor?

- by robintw

I've written a scrubyt extractor based on the 'learning' technique - that is, specifying the current text on the page and getting it to work out the XPath expressions itself. However, I now want to export the extractor so that it can be used even when the page has changed. The documentation for scrubyt seems to be all over the place now, but from what I can find I should be able to put the line extractor.export(__FILE__) and it should work. It doesn't - I just get an error saying that there is the wrong number of arguments for export, it should have 0. I've tried it without any arguments and it still fails. I would ask on the scrubyt forum, but it seems like no-one's been there for ages! Any ideas what to do here?

Read the article

Saving HttpResponse/Request to file system

- by chrisjlong

Here is my scenario. User fills out this large page which is dynamically created based off DB values. Those values can change. When the user fills out the page and hits submit we want to save a copy of the page as html on the server, this way if the text or wording changes, when they go back to view their posted information, it is historically accurate. So I basically need to do this protected void buttonSave_Click(object sender, EventArgs e) { //collect information into an object to save it in the db bool result = BusinessLogic.Save(myBusinessObject); if (result) //!!! Here is where I need to save this page as an html file on my servers IFS!!!! else //whatever Response.Redirect("~/SomeOtherPage.aspx"); } Any help is greatly apprciated. Also I CANNOT just request the data from the url because query string parameters are a big no no in this case. The key to pull the database info up (at its highest level) is all in session so I cant just request a url and save it. Thanks!

Read the article

I want to scrape a site using GAE and post the results into a Google Entity

- by cozza

I want to scrape this URL : https://www.xstreetsl.com/modules.php?searchSubmitImage_x=0&searchSubmitImage_y=0&SearchLocale=0&name=Marketplace&SearchKeyword=business&searchSubmitImage.x=0&searchSubmitImage.y=0&SearchLocale=0&SearchPriceMin=&SearchPriceMax=&SearchRatingMin=&SearchRatingMax=&sort=&dir=asc Go into each of the links and extract out various pieces of information e.g. permissions, prims etc then post the results into a Entity on google app engine. I want to know the best way to go about it? Chris

Read the article

What I must to learn to write php graber (parser)?

- by butteff

What I must to learn to write php web-site graber (parser)? It's just must collect any other information from other websites, as weather forecast, wiki "on this day", some news and other useful and interesting "every day" information! and what i must to read for writing m3u player on php? sorry for my bad english

Read the article

What movie website allows people to scrape it?

- by Sergio Tapia

I've wanted to make a C# library to scrape movie information and return it to the application, but someone told me that it's against the TOS. RottenTomatoes seems to have no problems with it from what I've read on their licensing page, but I'm not quite sure. Where could I aquire movie information legally and without cost? It's for an open source application hosted here: LINK

Read the article

Help converting code using httlib2 to use urllib2

- by ThinkCode

What am I trying to do? Visit a site, retrieve cookie, visit the next page by sending in the cookie info. It all works but httplib2 is giving me one too many problems with socks proxy on one site. http = httplib2.Http() main_url = 'http://mywebsite.com/get.aspx?id='+ id +'&rows=25' response, content = http.request(main_url, 'GET', headers=headers) main_cookie = response['set-cookie'] referer = 'http://google.com' headers = {'Content-type': 'application/x-www-form-urlencoded', 'Cookie': main_cookie, 'User-Agent' : USER_AGENT, 'Referer' : referer} How to do the same exact thing using urllib2 (cookie retrieving, passing to the next page on the same site)? Thank you.

Read the article

Yahoo Web Scrapes: What are the limits?

- by bvandrunen

We are using a web scraper and have it set up to have a sleep function which has a random function set up (so that it isn't the same time between each scrape) but we are still getting blocked from Yahoo after 20-30 requests. Does any one know if there is a limit (i.e: 20 requests per minutes, 200 an hour) Right now our average between each request is around 3-6 seconds. Thanks for any help

Read the article

Why Shouldn't I Programmatically Submit Username/Password to Facebook/Twitter/Amazon/etc?

- by viatropos

I wish there was a central, fully customizable, open source, universal login system that allowed you to login and manage all of your online accounts (maybe there is?)... I just found RPXNow today after starting to build a Sinatra app to login to Google, Facebook, Twitter, Amazon, OpenID, and EventBrite, and it looks like it might save some time. But I keep wondering, not being an authentication guru, why couldn't I just have a sleek login page saying "Enter username and password, and check your login service", and then in the background either scrape the login page from say EventBrite and programmatically submit the form with Mechanize, or use an API if there was one? It would be so much cleaner and such a better user experience if they didn't have to go through popups and redirects and they could use any previously existing accounts. My question is: What are the reasons why I shouldn't do something like that? I don't know much about the serious details of cookies/sessions/security, so if you could be descriptive or point me to some helpful links that would be awesome. Thanks!

Read the article

form submitting with mechanize and Python

- by MATELIN Alexis

I'm trying to scrap a website that requires to submit two forms : a first one to loggin and a second one to specify my research. I'm using Python and the mechanize package. No problem with the first one, but i just can't figure out how to pass through the second one. Here is the part of my code related to the firm above-mentionned agemin=18 agemax=25 by='region' country='France' region=2 newcustomers=1 browser.select_form(nr=0) browser['age[min]']=agemin browser['age[max]']=agemax browser['country']=country browser['region']=region browser['by']=by browser['new-customers']=newcustomers response=browser.submit() content=response.read() but when I submit the variable 'age[min]' by example, I get the following error message : TypeError: object of type 'int' has no len() to give you some more informations, here is what I get with 'print br.form' <POST http://www.adopteunmec.com/qsearch/ajax_quick application/x-www-form-urlencoded <SelectControl(age[min]=[, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, *30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])> <SelectControl(age[max]=[, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, *45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])> <SelectControl(by=[*region, distance])> <SelectControl(country=[*fr, be, ch, ca])> <SelectControl(region=[*1, 2, 3, 4, 5, 6, 7, 8, 22, 23, 9, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 11])> <SelectControl(distance[min]=[*, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800, 810, 820, 830, 840, 850, 860, 870, 880, 890, 900, 910, 920, 930, 940, 950, 960, 970, 980, 990, 1000])> <SelectControl(distance[max]=[, 0, 10, 20, 30, 40, 50, 60, 70, *80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800, 810, 820, 830, 840, 850, 860, 870, 880, 890, 900, 910, 920, 930, 940, 950, 960, 970, 980, 990, 1000])> <CheckboxControl(new=[*1])>> My guess is that the form needs an object (like a list) containing all the variables to accept it ; that's why it refuses the variables submited one by one. Thank you in advance for any help ! Alexis

Read the article

Voting on Hacker News stories programmatically?

- by igorgue

I decided to write an app like: http://michaelgrinich.com/hackernews/ but for Android devices, my idea will use a web application backend (because I rather code in Python and for the web than completely in Java for Android devices). What I have right now implemented is something like this: $ curl -i http://localhost:8080/stories.json?page=1\&stories=1 HTTP/1.0 200 OK Date: Sun, 25 Apr 2010 07:59:37 GMT Server: WSGIServer/0.1 Python/2.6.5 Content-Length: 296 Content-Type: application/json [{"title": "Don\u2019t talk to aliens, warns Stephen Hawking", "url": "http://www.timesonline.co.uk/tol/news/science/space/article7107207.ece?", "unix_time": 1272175177, "comments": 15, "score": 38, "user": "chaostheory", "position": 1, "human_time": "Sun Apr 25 01:59:37 2010", "id": "1292241"}] The next step (and final I think) is voting, my design is doing something like this: $ curl -i http://localhost:8080/stories/1?vote=up -u username:password Will vote up and: $ curl -i http://localhost:8080/stories/1?vote=down -u username:password Down. I have no idea how to do it though... I was planning to use Twill but the login link is always different, e.g.: http://news.ycombinator.com/x?fnid=7u89ccHKln Later the Android app will consume this API. Any experience with programmatically browsing Hacker News?

Read the article

Reading Ontology with Jena, feeding it with RDF triples, and producing correct RDF string output.

- by JonB

Hi, I have an ontology, which I read in with Jena to help me scrape some RDFa triples from a website. I don't currently store these triples in a Jena model, but that is fairly straight forward to do, its on my to do next list. The area I am struggling with, though, is to get Jena to output correct RDF for the ontology I have. The ontology uses Owl and RDFS definitions, but when I pass some example triples into the model, they don't appear correctly. Almost as if it doesn't know anything about the ontology. The output is, however, still valid RDF, just it's not coming out in the form I was hoping for. Am I correct in thinking that Jena should be able to produce well written RDF (not just valid) about the triples I have collected, based on the ontology or does this out stretch what it is capable of? Many thanks for any input.

Read the article

Is selling a "website screen scraper" illegal?

- by Yatendra Goel

I have coded a "website screen scraper" and want to sell it commercially. I know that webpages scraped by the screen scraper are restricted to be scraped by the webmaser of that website. The robots.txt file of the website says that its webpages must not be scraped. So my question is whether selling that screen scraper is a crime or using that screen scraper is a crime in legal terms. I know that this question is related to law but I thought the software experts on SO must also have answer to this question.

Read the article

How can I get all content within <td> tag using a HTML Agility Pack?

- by Bob Dylan

So I'm writing an application that will do a little screen scrapping. I'm using the HTML Agility Pack to load an entire HTML page into an instance of HtmlDocoument called doc. Now I want to parse that doc, looking for this: <table border="0" cellspacing="3"> <tr><td>First rows stuff</td></tr> <tr> <td> The data I want is in here <br /> and it's seperated by these annoying <br /> 's. No id's, classes, or even a single <p> tag. </p> Just a bunch of <br /> tags. </td> </tr> </table> So I just need to get the data within the 2nd row. How can I do this? Should I use a regex or something else?

Read the article

Programmaticaly grabbing text from a web page that is dynamically generated.

- by bstullkid

There is a website I am trying to pull information from in perl, however the section of the page I need is being generated using javascript so all you see in the source is <div id="results"></div> I need to somehow pull out the contents of that div and save it to a file using perl/proxies/whatever. e.g. the information I want to save would be document.getElementById('results').innerHTML; I am not sure if this is possible or if anyone had any ideas or a way to do this. I was using a lynx source dump for other pages but since I cant straight forward screen scrape this page I came here to ask about it! If anyone is interested, the page is http://downloadcenter.trendmicro.com/index.php?clk=left_nav&clkval=pattern_file&regs=NABU and the info I am trying to get is the row about the ConsumerOPR

Read the article

What I must to learn to parse dynamic HTML sites with PHP?

- by butteff

What I must to learn to write php web-site grabber (parser)? It must collect information from other websites, such as as weather forecast, wiki "on this day", some news and other useful and interesting "every day" information! what i must to read for writing m3u player on php? sorry for my bad english

Read the article

How can I download information from a website if it returns XML/JSON in its response?

- by Sergio Tapia

Does Python3 have a built in method to do this? Any guidance at all would be great! :) The website in question exposes all of its information and even gives you an API key to use.

Read the article

Is there a way to programmatically extract the feed of a podcast from the iTunes page?

- by J. Pablo Fernández

From an iTunes page, like http://itunes.apple.com/us/podcast/this-week-in-tech-mp3-edition/id73329404, is there a way to extract the corresponding feed address? In this case it would be http://leoville.tv/podcasts/twit.xml. I know that if you open on iTunes you can extract it manually, but I want to do it programmatically. There's a link to the website of the podcast, but it may not be accurate. In this case it points to a web site with 20 podcasts on it.

Read the article

CasperJS Load next page in loop

- by SquiresSquire

I've been working on a script which collates the scores for a list of user from a website. One problem is though, I'm trying to load the next page in the while loop, but the function is not being loaded... this.thenOpen("http://www.url.com?ul=" + currentName + "&sortdir=desc&sort=lastfound", function (id) { return function () { this.capture("Screenshots/" + json.username[id] + ".png"); if (!casper.exists(x("//*[contains(text(), 'That username does not exist in the system')]"))) { if (casper.exists(x('//*[@id="ctl00_ContentBody_ResultsPanel"]/table[2]'))){ this.thenEvaluate(tgsagc.tagNextLink); tgsagc.cacheCount = 0; tgsagc.continue = true; this.echo("------------ " + json.username[id] + " ------------"); while (tgsagc.continue) { this.then(function(){ this.evaluate(tgsagc.tagNextLink); var findDates, pageNumber; pageNumber = this.evaluate(tgsagc.pageNumber); findDates = this.evaluate(tgsagc.getFindDates); this.echo("Found " + findDates.length + " on page " + pageNumber); tgsagc.checkFinds(findDates); this.echo(tgsagc.cacheCount + " Caches for " + json.username[id]); this.echo("Continue? " + tgsagc["continue"]); return this.click("#tgsagc-link-next"); }); } leaderboard[json.username[id]] = tgsagc.cacheCount; console.log("Final Count: " + leaderboard[json.username[id]]); console.log(JSON.stringify(leaderboard)); } else { this.echo("------------ " + json.username[id] + " ------------"); this.echo("0 Caches Found"); leaderboard[json.username[id]] = 0; console.log(JSON.stringify(leaderboard)); } } else { this.echo("------------ " + json.username[id] + " ------------"); this.echo("No User found with that Username"); leaderboard[json.username[id]] = null; console.log(JSON.stringify(leaderboard)); }

Read the article

C# WebClient - View source question

- by Jim

I'm using a C# WebClient to post login details to a page and read the all the results. The page I am trying to load includes flash (which, in the browser, translates into HTML). I'm guessing it's flash to avoid being picked up by search engines??? The flash I am interested in is just text (not an image/video) etc and when I "View Selection Source" in firefox I do actually see the text, within HTML, that I want to see. (Interestingly when I view the source for the whole page I do not see the text, within HTML, that I want to see. Could this be related?) Currently after I have posted my login details, and loaded the HTML back, I see the page which does NOT show the flash HTML (as if I had viewed source for the whole page). Thanks in advance, Jim PS: I should point out that the POST is actually working, my log in is successful.

Read the article

R: extracting "clean" UTF-8 text from a web page scraped with RCurl

- by SlowLearner

Using R, I am trying to scrape a web page save the text, which is in Japanese, to a file. Ultimately this needs to be scaled to tackle hundreds of pages on a daily basis. I already have a workable solution in Perl, but I am trying to migrate the script to R to reduce the cognitive load of switching between multiple languages. So far I am not succeeding. Related questions seem to be this one on saving csv files and this one on writing Hebrew to a HTML file. However, I haven't been successful in cobbling together a solution based on the answers there. The pages are from Yahoo! Japan Finance and my Perl code that looks like this. use strict; use HTML::Tree; use LWP::Simple; #use Encode; use utf8; binmode STDOUT, ":utf8"; my @arr_links = (); $arr_links[1] = "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203"; $arr_links[2] = "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201"; foreach my $link (@arr_links){ $link =~ s/"//gi; print("$link\n"); my $content = get($link); my $tree = HTML::Tree->new(); $tree->parse($content); my $bar = $tree->as_text; open OUTFILE, ">>:utf8", join("","c:/", substr($link, -4),"_perl.txt") || die; print OUTFILE $bar; } This Perl script produces a CSV file that looks like the screenshot below, with proper kanji and kana that can be mined and manipulated offline: My R code, such as it is, looks like the following. The R script is not an exact duplicate of the Perl solution just given, as it doesn't strip out the HTML and leave the text (this answer suggests an approach using R but it doesn't work for me in this case) and it doesn't have the loop and so on, but the intent is the same. require(RCurl) require(XML) links <- list() links[1] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203" links[2] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201" txt <- getURL(links, .encoding = "UTF-8") Encoding(txt) <- "bytes" write.table(txt, "c:/geturl_r.txt", quote = FALSE, row.names = FALSE, sep = "\t", fileEncoding = "UTF-8") This R script generates the output shown in the screenshot below. Basically rubbish. I assume that there is some combination of HTML, text and file encoding that will allow me to generate in R a similar result to that of the Perl solution but I cannot find it. The header of the HTML page I'm trying to scrape says the chartset is utf-8 and I have set the encoding in the getURL call and in the write.table function to utf-8, but this alone isn't enough. The question How can I scrape the above web page using R and save the text as CSV in "well-formed" Japanese text rather than something that looks like line noise? Edit: I have added a further screenshot to show what happens when I omit the Encoding step. I get what look like Unicode codes, but not the graphical representation of the characters. So it may be some kind of locale-related issue, but in the exact same locale the Perl script does provide useful output. So this is still puzzling.

Read the article

Automated download of website content using ASP.net

- by Yaaqov

Using ASP.net, what methods can I use to do the following: Open up a connection to a given URL to read HTML content Parse the given URL for hyperlinks, and place them in an array Loop through each hyperlink (only 1 level down), opening each one, saving the HTML contents in a table, and move to the next hyperlink until done. If ASP.net is not up to the task, other languages or free scripts/toolkits would be acceptable. Thanks.

Read the article

Add a script tag to HTML with JSoup preserve path tags

- by Bhuvnesh Pratap

I am adding a new script tag to my DOM with JSoup like this : rootElement.after("<script src='<?=base_url()?>js/read.js'></script>") but what I end up getting is this : <script src="<?=base_url()?>js/read.js"></script> I intend to preserve this "<?=base_url()?>" and achieve something like: <script src="<?=base_url()>js/read.js"></script> string here , How could I do this ?

Read the article

How can I get all content within <table></table> tags using a regex?

- by Bob Dylan

So I'm writing an application that will do a little screen scrapping. All the pages (about 1000 or so) contain this line: <table border="0" cellspacing="3"> <tr><td>First rows stuff</td></tr> <tr> <td> The data I want is in here <br /> and it's seperated by these annoying <br /> 's. No id's, classes, or even a single <p> tag. Just a bunch of <br /> tags. </td> </tr> </table> So I just need to get the data within the 2nd row out. How can I do this? Should I use a regex or something else?

Read the article

linux piping ( convert -> pdf2ps -> lp)

- by Bor

Ok, so I can print a pdf doing: pdf2ps file.pdf - | lp -s But now I want to use convert to merge several pdf files, I can do this with: convert file1.pdf file2.pdf merged.pdf which merges file1.pdf and file2.pdf into merged.pdf, target can be replaced with '-'. Question How could I pipe convert into pdf2ps and then into lp though?

Search Results

Search found 4479 results on 180 pages for 'pdf scraping'.

Page 56/180 | < Previous Page | 52 53 54 55 56 57 58 59 60 61 62 63 | Next Page >

- by yuval

- by robintw

- by chrisjlong

- by cozza

- by butteff

- by Sergio Tapia

- by ThinkCode

- by bvandrunen

- by viatropos

- by MATELIN Alexis

- by igorgue

- by JonB

- by Yatendra Goel

- by Bob Dylan

- by bstullkid

- by butteff

- by Sergio Tapia

- by J. Pablo Fernández

- by SquiresSquire

- by Jim

- by SlowLearner

- by Yaaqov

- by Bhuvnesh Pratap

- by Bob Dylan

- by Bor

< Previous Page | 52 53 54 55 56 57 58 59 60 61 62 63 | Next Page >