Search Results

Search found 4479 results on 180 pages for 'pdf scraping'.

Page 55/180 | < Previous Page | 51 52 53 54 55 56 57 58 59 60 61 62  | Next Page >

  • How to scrape a _private_ Google group?

    - by John
    Hi there, I'd like to scrape the discussion list of a private Google group. It's a multi-page list and I might have to do this again later, so scripting sounds like the way to go. Since this is a private group, I need to log in to my Google account first. Unfortunately I can't manage to log in using wget or Ruby's Net::HTTP. Surprisingly, Google Groups is not accessible through the ClientLogin interface, so all the code samples are useless. My Ruby script is embedded at the end of the post. The response to the authentication query is a 200 OK, but there are no cookies in the response headers and the body contains the message "Your browser's cookie functionality is turned off. Please turn it on." I got the same output with wget; see the bash script at the end of this message. I don't know how to work around this. Am I missing something? Any ideas? Thanks in advance. John

    Here is the Ruby script:

        # a ruby script
        require 'net/https'

        http = Net::HTTP.new('www.google.com', 443)
        http.use_ssl = true
        path = '/accounts/ServiceLoginAuth'
        email = '[email protected]'
        password = 'topsecret'

        # form inputs from the login page
        data = "Email=#{email}&Passwd=#{password}&dsh=7379491738180116079&GALX=irvvmW0Z-zI"
        headers = {
          'Content-Type' => 'application/x-www-form-urlencoded',
          'user-agent' => "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/6.0"
        }

        # Post the request and print out the response to retrieve our authentication token
        resp, data = http.post(path, data, headers)
        puts resp
        resp.each { |h, v| puts h + '=' + v }
        # warning: peer certificate won't be verified in this SSL session

    Here is the bash script:

        # A bash script for wget
        CMD=""
        CMD="$CMD --keep-session-cookies --save-cookies cookies.tmp"
        CMD="$CMD --no-check-certificate"
        CMD="$CMD --post-data='[email protected]&Passwd=topsecret&dsh=-8408553335275857936&GALX=irvvmW0Z-zI'"
        CMD="$CMD --user-agent='Mozilla'"
        CMD="$CMD https://www.google.com/accounts/ServiceLoginAuth"
        echo $CMD
        wget $CMD
        wget --load-cookies="cookies.tmp" http://groups.google.com/group/mygroup/topics?tsc=2
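    A likely culprit is that ServiceLoginAuth expects the cookies and the fresh GALX token handed out by the login page itself, so a POST with a hard-coded GALX and no cookie jar gets bounced with the "cookies turned off" message. A minimal sketch of that two-step flow in Python with the requests library (the endpoints and field names come from the question and are long obsolete, so treat this purely as an illustration of the cookie flow):

        # Sketch: fetch the login page first so the session holds Google's
        # cookies and a fresh GALX token, then POST the credentials.
        # NOTE: /accounts/ServiceLogin(Auth) are the obsolete endpoints from
        # the question; modern Google logins no longer work this way.
        import re
        import requests

        session = requests.Session()  # persists cookies across requests
        login_page = session.get('https://www.google.com/accounts/ServiceLogin')
        match = re.search(r'name="GALX"\s+value="([^"]+)"', login_page.text)
        galx = match.group(1) if match else ''

        resp = session.post(
            'https://www.google.com/accounts/ServiceLoginAuth',
            data={'Email': '[email protected]', 'Passwd': 'topsecret', 'GALX': galx},
        )
        print(resp.status_code, session.cookies.get_dict())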

    Read the article

  • Scrapy issue with iTunes' AppStore

    - by Eric
    I am using Scrapy to fetch some data from iTunes' App Store database. I start with this list of apps: http://itunes.apple.com/us/genre/mobile-software-applications/id36?mt=8

    In the following code I have used the simplest regex, which targets all apps in the US store:

        from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
        from scrapy.contrib.spiders import CrawlSpider, Rule

        class AppStoreSpider(CrawlSpider):
            domain_name = 'itunes.apple.com'
            start_urls = ['http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8']
            rules = (
                Rule(SgmlLinkExtractor(allow='itunes\.apple\.com/us/app'),
                     'parse_app', follow=True),
            )

            def parse_app(self, response):
                ...

        SPIDER = AppStoreSpider()

    When I run it I receive the following:

        [itunes.apple.com] DEBUG: Crawled (200) <GET http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8> (referer: None)
        [itunes.apple.com] DEBUG: Filtered offsite request to 'itunes.apple.com': <GET http://itunes.apple.com/us/app/bloomberg/id281941097?mt=8>

    As you can see, when it starts crawling the first page it says "Filtered offsite request to 'itunes.apple.com'", and then the spider stops. It also returns this message:

        [ScrapyHTTPPageGetter,client] /usr/lib/python2.5/cookielib.py:1577: exceptions.UserWarning: cookielib bug!
        Traceback (most recent call last):
          File "/usr/lib/python2.5/cookielib.py", line 1575, in make_cookies
            parse_ns_headers(ns_hdrs), request)
          File "/usr/lib/python2.5/cookielib.py", line 1532, in _cookies_from_attrs_set
            cookie = self._cookie_from_cookie_tuple(tup, request)
          File "/usr/lib/python2.5/cookielib.py", line 1451, in _cookie_from_cookie_tuple
            if version is not None: version = int(version)
        ValueError: invalid literal for int() with base 10: '"1"'

    I have used the same script for another website and I didn't have this problem. Any suggestions?
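    Two separate things appear to be going on. The offsite middleware decides what is "offsite" from the spider's allowed-domains attribute, so if the Scrapy version in use expects allowed_domains rather than the older domain_name, every link looks foreign and gets filtered. And the traceback is Python 2.5's cookielib choking on a quoted Version="1" attribute in a Set-Cookie header sent by iTunes. A hedged sketch of the same spider against the modern Scrapy API (COOKIES_ENABLED = False is my suggested workaround for the cookie bug, not something from the question):

        # Sketch with the modern Scrapy API; URLs are from the question.
        import scrapy
        from scrapy.linkextractors import LinkExtractor
        from scrapy.spiders import CrawlSpider, Rule

        class AppStoreSpider(CrawlSpider):
            name = 'appstore'
            allowed_domains = ['itunes.apple.com']  # consulted by the offsite middleware
            start_urls = ['http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8']
            custom_settings = {'COOKIES_ENABLED': False}  # sidestep the cookielib parse bug

            rules = (
                Rule(LinkExtractor(allow=r'itunes\.apple\.com/us/app'),
                     callback='parse_app', follow=True),
            )

            def parse_app(self, response):
                yield {'url': response.url}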

    Read the article

  • A good web data extraction/screen scraper program?

    - by Taylor
    I need to capture product data from a site on a regular basis and wondered if anyone knows of a good software program. I've trialed Mozenda, but it's a monthly subscription and pricey in the long term. Obviously something that's free would be best, but I don't mind paying either. I just need a decent program that's reliable and doesn't require much programming knowledge.

    Read the article

  • How to quickly acquire and process real time screen output

    - by Akusete
    I am trying to write a program to play a full-screen PC game for fun (as an experiment in computer vision and artificial intelligence). For this experiment I am assuming the game has no underlying API for AI players (nor is the source available), so I intend to process the visual information rendered by the game on the screen. The game runs in full-screen mode on a Win32 system (DirectX, I assume). Currently I am using the Win32 functions:

        #include <windows.h>
        #include <cvaux.h>

        class Screen {
        public:
            HWND windowHandle;
            HDC windowContext;
            HBITMAP buffer;
            HDC bufferContext;
            CvSize size;
            uchar* bytes;
            int channels;

            Screen () {
                windowHandle = GetDesktopWindow();
                windowContext = GetWindowDC(windowHandle);
                size = cvSize(GetDeviceCaps(windowContext, HORZRES),
                              GetDeviceCaps(windowContext, VERTRES));
                buffer = CreateCompatibleBitmap(windowContext, size.width, size.height);
                bufferContext = CreateCompatibleDC(windowContext);
                SelectObject(bufferContext, buffer);
                channels = 4;
                bytes = new uchar[size.width * size.height * channels];
            }

            ~Screen () {
                ReleaseDC(windowHandle, windowContext);
                DeleteDC(bufferContext);
                DeleteObject(buffer);
                delete[] bytes;
            }

            void CaptureScreen (IplImage* img) {
                BitBlt(bufferContext, 0, 0, size.width, size.height,
                       windowContext, 0, 0, SRCCOPY);
                int n = size.width * size.height;
                int imgChannels = img->nChannels;
                GetBitmapBits(buffer, n * channels, bytes);

                uchar* src = bytes;
                uchar* dest = (uchar*) img->imageData;
                uchar* end = dest + n * imgChannels;
                while (dest < end) {
                    dest[0] = src[0];
                    dest[1] = src[1];
                    dest[2] = src[2];
                    dest += imgChannels;
                    src += channels;
                }
            }
        };

    The rate at which I can process frames using this approach is much too slow. Is there a better way to acquire screen frames?
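    The GDI BitBlt/GetBitmapBits round trip is usually the bottleneck in this approach; libraries that use the platform's native capture path are typically much faster. As a point of comparison, a minimal sketch with the third-party Python mss library (the library choice is my suggestion, not something from the question):

        # Sketch: grab the primary monitor repeatedly and time the frame rate.
        # Requires the third-party packages mss and numpy.
        import time
        import numpy as np
        import mss

        with mss.mss() as sct:
            monitor = sct.monitors[1]        # monitors[0] is the full virtual screen
            start, frames = time.time(), 0
            while frames < 100:
                shot = sct.grab(monitor)     # raw BGRA pixels
                frame = np.array(shot)       # H x W x 4 array for OpenCV-style processing
                frames += 1
            print(f'{frames / (time.time() - start):.1f} FPS')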

    Read the article

  • How does Blippy get its data

    - by Ali
    I was wondering how Blippy is able to get my data. It requires me to put in my bank name, bank card number and password, so is it doing a simple website scrape by logging in? My bank, however, also requires a separate passphrase. How does it get around that? Can urllib and similar libraries be used in Python to replicate Blippy's functionality?

    Read the article

  • How do I automate navigation to a website that requires authentication?

    - by Wiz
    Here's what I'm trying to achieve. I would like to write a script that will navigate to a website that requires me to be authenticated as myself (say Facebook, Live Spaces, Twitter or any other) and then have that script search for certain information on one of the pages of the site. I've done something similar in the past with the Windows.Forms WebBrowser control, which is a full-blown implementation of IE that can be controlled through code and will store whatever cookies you get once you're authenticated, but it was very unfriendly to modify, and I was hoping to use a scripting language instead, maybe PowerShell or something of that sort. Are there maybe some good tutorials about this out there on the web? Thanks!
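    When the site uses an ordinary login form rather than heavy JavaScript, the pattern is the same in any scripting language: POST the credentials once, keep the cookie jar, then fetch and search the target page. A hedged sketch in Python (the URLs, form field names, and selector are placeholders, not a real site's):

        # Sketch: log in once, reuse the authenticated session, scan a page.
        # example.com URLs and form field names below are placeholders.
        import requests
        from bs4 import BeautifulSoup  # third-party: beautifulsoup4

        session = requests.Session()
        session.post('https://example.com/login',
                     data={'username': 'me', 'password': 'secret'})

        page = session.get('https://example.com/friends')
        soup = BeautifulSoup(page.text, 'html.parser')
        for row in soup.select('div.friend-name'):  # placeholder selector
            print(row.get_text(strip=True))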

    Read the article

  • Howto: Download local copy of Google's Pacman game

    - by macek
    It looks like this is HTML + JavaScript. Is there a way I can download a copy so I can continue playing after they take it down? Thanks for any help :)

    Edit: OK, I wasn't completely forthcoming. Not only would I like to continue playing it, I kind of want to look at the source code, too. I was able to find this: Google pacman10-hp.2.js. See it reformatted on GitHub here. Thanks @SteD.

    GitHub repo: I set up a GitHub repo: macek/google_pacman. Check out the README, I think we're very close! Send me a pull request if you make any progress. Put any useful details in the README. Let's get this working! :)

    Read the article

  • Get coordinates of google maps markers

    - by App_beginner
    I am creating a database containing the names and coordinates of all bus stops in my local area. I have all the names stored in my database, and now I need to add the coordinates. I am trying to get these off a website that displays them all as placemarks on a Google map. It seems to me like they are being generated by a local server and then added to the map; however, I am unable to find exactly where the server is queried for the coordinates. I am hoping to collect the coordinates with a screen scraper, but unless I can find where in the source code the coordinates are created, this seems impossible. I could of course search for and collect all these placemarks manually, but that would be very time-consuming, so I am hoping someone here can help me. This is the website I am trying to scrape; the placemarks are marked by the blue bus sign: http://reiseplanlegger.skyss.no/scripts/travelmagic/TravelMagicWE.dll/?from=Brimnes+ferjekai+%28Eidfjord%29&to= You can also get the coordinates of a single placemark by writing the name of the stop in the search field and pushing the "Vis i kart" button. I hope someone can help me with this.
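    Marker coordinates on pages like this are usually fetched by the page's JavaScript from a JSON or XML endpoint, so the practical route is to watch the browser's network traffic while searching for a stop, note the request the map makes, and call that endpoint directly. A hedged sketch of what the replay tends to look like in Python (the endpoint, parameters, and JSON keys here are invented for illustration; read the real ones from the network panel):

        # Sketch: replay the XHR the map page makes for a stop's coordinates.
        # The URL, parameters, and JSON keys below are hypothetical.
        import requests

        resp = requests.get(
            'http://reiseplanlegger.skyss.no/some/marker/endpoint',  # hypothetical
            params={'stop': 'Brimnes ferjekai (Eidfjord)'},
        )
        for marker in resp.json():  # assumed JSON list of placemarks
            print(marker['name'], marker['lat'], marker['lng'])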

    Read the article

  • Get Mechanize to handle cookies from an arbitrary POST (to log into a website programmatically)

    - by Horace Loeb
    I want to log into https://www.t-mobile.com/ programmatically. My first idea was to use Mechanize to submit the login form. However, it turns out that this isn't even a real form: when you click "Log in", some JavaScript grabs the values of the fields, creates a new form dynamically, and submits it.

    "Log in" button HTML:

        <button onclick="handleLogin(); return false;" class="btnBlue" id="myTMobile-login"><span>Log in</span></button>

    The handleLogin() function:

        function handleLogin() {
            if (ValidateMsisdnPassword()) { // client-side form validation logic
                var a = document.createElement("FORM");
                a.name = "form1";
                a.method = "POST";
                a.action = mytmoUrl; // defined elsewhere as https://my.t-mobile.com/Login/LoginController.aspx
                var c = document.createElement("INPUT");
                c.type = "HIDDEN";
                c.value = document.getElementById("myTMobile-phone").value; // the value of the phone number input field
                c.name = "txtMSISDN";
                a.appendChild(c);
                var b = document.createElement("INPUT");
                b.type = "HIDDEN";
                b.value = document.getElementById("myTMobile-password").value; // the value of the password input field
                b.name = "txtPassword";
                a.appendChild(b);
                document.body.appendChild(a);
                a.submit();
                return true
            } else {
                return false
            }
        }

    I could simulate this form submission by POSTing the form data to https://my.t-mobile.com/Login/LoginController.aspx with Net::HTTP#post_form, but I don't know how to get the resultant cookie into Mechanize so I can continue to scrape the UI available when I'm logged in. Any ideas?

    Read the article

  • Nokogiri and Special Characters

    - by Moe
    I'm using Nokogiri to grab the contents of the title tag on a webpage, but am having trouble with accented characters. What's the best way to deal with these? Here's what I'm doing:

        require 'open-uri'
        require 'nokogiri'

        doc = Nokogiri::HTML(open(link))
        title = doc.at_css("title")

    At this point, the title looks like this:

        Rag\303\271

    Instead of:

        Ragù

    How can I have Nokogiri return the proper character (e.g. ù in this case)?
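    Note that \303\271 is just the octal-escaped rendering of the two UTF-8 bytes for ù, so the data is often already correct and only the way it is inspected (p versus puts on Ruby 1.8, for instance) makes it look mangled; the other common fix is to tell the parser the encoding explicitly. A quick Python check of the byte math (illustration only, since the question itself is Ruby):

        # The octal escapes \303\271 are the UTF-8 byte pair 0xC3 0xB9,
        # which decodes to the single character 'ù'.
        raw = b'Rag\303\271'             # same bytes Ruby displayed as Rag\303\271
        print(raw.decode('utf-8'))       # -> Ragù
        print(hex(raw[3]), hex(raw[4]))  # -> 0xc3 0xb9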

    Read the article

  • How to submit a form automatically using HttpWebResponse

    I am looking for an application that can do the following: (a) programmatically auto-login to a page (login.aspx) using HttpWebResponse, with an already specified username and password; (b) detect the redirect URL if the login is successful; (c) submit another form (settings.aspx) to update certain fields in the database. The required coding needs to be in ASP.NET, and the application needs to complete this entire process in the same session cookie.
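    In ASP.NET the usual way to keep one session across requests is a CookieContainer shared between HttpWebRequest calls; the three-step flow itself is compact enough to sketch. Here is a hedged illustration of the same flow in Python, with placeholder URLs and field names (not the asker's actual site):

        # Sketch of the login -> detect redirect -> post settings flow.
        # URLs and form field names are placeholders.
        import requests

        session = requests.Session()  # plays the role of CookieContainer

        # (a) log in with fixed credentials
        resp = session.post('https://example.com/login.aspx',
                            data={'user': 'me', 'pass': 'secret'})

        # (b) if login succeeded, the server redirected us; inspect where
        if resp.history:              # non-empty on any redirect
            print('redirected to', resp.url)

        # (c) submit the settings form inside the same session
        session.post('https://example.com/settings.aspx',
                     data={'field1': 'new value'})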

    Read the article

  • Capture ASP output for monitoring

    - by scourge.zero
    How do I capture ASP.NET output and store it in temporary memory so that I can use it in an application to do comparisons? For example, there's a site which has ASP output. Sorry, I do not have server access; what I can do is view the output. The site is a monitor of all users logged in, per channel. Example output:

        Channel 1
        Username      logged in (0 / 1)
        Username 1    1
        John Smith    1
        George B      0

        Channel 2
        Username      logged in (0 / 1)
        Username 1    1
        John Smith    0
        George B      0

    What I want to do is capture this output and then show it this way:

        Username      Channel 1   Channel 2   Total
        Username 1    1           1           2
        John Smith    1           0           1
        George B      0           0           0

    I don't know where to start.
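    Since the page is just HTML you can view, the job splits into scraping each channel's table and merging the per-user flags into one summary. A minimal sketch of the merging step in Python (the parsing side depends on the page's real markup, so the input here is hard-coded from the question's example):

        # Sketch: merge per-channel "logged in" flags into one summary table.
        # In practice the two dicts would come from parsing the scraped HTML.
        channels = {
            'Channel 1': {'Username 1': 1, 'John Smith': 1, 'George B': 0},
            'Channel 2': {'Username 1': 1, 'John Smith': 0, 'George B': 0},
        }

        names = sorted(channels)
        print('Username      ' + '   '.join(names) + '   Total')
        for user in channels[names[0]]:
            flags = [channels[ch][user] for ch in names]
            row = '           '.join(str(f) for f in flags)
            print(f'{user:<13} {row}           {sum(flags)}')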

    Read the article

  • Headless, scriptable Firefox/Webkit on linux?

    - by Parand
    I'm looking to automate some web interactions, namely periodic downloads of files from a secure website. This basically involves entering my username/password and navigating to the appropriate URL. I tried simple scripting in Python, followed by more sophisticated scripting, only to discover this particular website uses some obnoxious JavaScript- and Flash-based mechanism for login, rendering my methods useless. I then tried HtmlUnit, but that doesn't seem to want to work either; I suspect the use of Flash is the issue. I don't really want to think about it any more, so I'm leaning towards scripting an actual browser to log in and grab the file I need. Requirements are:

    - Run on a Linux server (i.e. no X running). If I really need to have X I can make that happen, but I won't be happy.
    - Be reliable. I want to start this thing and never think about it again.
    - Be scriptable. Nothing too sophisticated, but I should be able to tell the browser the various steps to take and pages to visit.

    Are there any good toolkits for a headless, X-less scriptable browser? Have you tried something like this, and if so do you have any words of wisdom?
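    At the time the usual answer was a real browser under Xvfb; today the stock answer is Selenium driving a genuinely headless Firefox or Chromium, which needs no display server at all. A minimal sketch with Python's selenium bindings (URLs, element ids, and credentials are placeholders):

        # Sketch: headless Firefox logging in and fetching a page, no X needed.
        # URLs, element ids, and credentials are placeholders.
        from selenium import webdriver
        from selenium.webdriver.common.by import By

        options = webdriver.FirefoxOptions()
        options.add_argument('-headless')   # run without a display server
        driver = webdriver.Firefox(options=options)
        try:
            driver.get('https://example.com/login')
            driver.find_element(By.ID, 'username').send_keys('me')
            driver.find_element(By.ID, 'password').send_keys('secret')
            driver.find_element(By.ID, 'submit').click()
            driver.get('https://example.com/reports/latest')
            print(driver.page_source[:200])
        finally:
            driver.quit()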

    Read the article

  • Extracting Window Contents

    - by user293392
    I need to extract a window's contents if it is text-based, or at least the file path associated with that window. To date, I have considered:

    1. win32api
    2. 3rd-party libraries
    3. wrapper classes

    However, I am not satisfied with these solutions. Any ideas how this can be done in a clean way?

    Read the article

  • Copying data across tabs

    - by Guillermo
    Hello, I have two different forms in two different tabs. One has data from our system, and the other is an interface to another, external system into which we need to copy data (XML or API integration is not an option here). The thing is that, with both forms open in two different tabs, I need a Greasemonkey script or something similar that allows me to copy data from one form to the other (using the getValue method in JavaScript). The problem right now is that I cannot figure out how to reference one particular tab or window (to read data from or write data to) with a Greasemonkey script. Do you think it would be possible to do what I'm thinking of? THANKS

    Read the article

  • Why is python decode replacing more than the invalid bytes from an encoded string?

    - by dangra
    Trying to decode an invalidly encoded UTF-8 HTML page gives different results in Python, Firefox and Chrome. The invalidly encoded fragment from the test page looks like 'PREFIX\xe3\xabSUFFIX':

        >>> fragment = 'PREFIX\xe3\xabSUFFIX'
        >>> fragment.decode('utf-8', 'strict')
        ...
        UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-8: invalid data

    What follows is a summary of the replacement policies used to handle decoding errors by Python, Firefox and Chrome. Note how the three differ, and especially how the Python builtin removes the valid S (plus the invalid sequence of bytes).

    By Python: the builtin replace error handler replaces the invalid \xe3\xab plus the S from SUFFIX with U+FFFD.

        >>> fragment.decode('utf-8', 'replace')
        u'PREFIX\ufffdUFFIX'
        >>> print _
        PREFIX?UFFIX

    A Python implementation of the builtin replace error handler looks like:

        >>> python_replace = lambda exc: (u'\ufffd', exc.end)

    As expected, trying this gives the same result as the builtin:

        >>> codecs.register_error('python_replace', python_replace)
        >>> fragment.decode('utf-8', 'python_replace')
        u'PREFIX\ufffdUFFIX'
        >>> print _
        PREFIX?UFFIX

    By Firefox: Firefox replaces each invalid byte with U+FFFD.

        >>> firefox_replace = lambda exc: (u'\ufffd', exc.start+1)
        >>> codecs.register_error('firefox_replace', firefox_replace)
        >>> fragment.decode('utf-8', 'firefox_replace')
        u'PREFIX\ufffd\ufffdSUFFIX'
        >>> print _
        PREFIX??SUFFIX

    By Chrome: Chrome replaces each invalid sequence of bytes with U+FFFD.

        >>> chrome_replace = lambda exc: (u'\ufffd', exc.end-1)
        >>> codecs.register_error('chrome_replace', chrome_replace)
        >>> fragment.decode('utf-8', 'chrome_replace')
        u'PREFIX\ufffdSUFFIX'
        >>> print _
        PREFIX?SUFFIX

    The main question is why the builtin replace error handler for str.decode removes the S from SUFFIX. Also, is there any official Unicode-recommended way of handling decoding replacements?
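    For the record, this was a genuine bug in CPython's UTF-8 decoder: it reported an overlong error range for truncated multi-byte sequences, so the 'replace' handler swallowed the following valid byte. It was later fixed (tracked as CPython issue 8271), and modern Python 3 follows the Unicode standard's recommended practice of substituting one U+FFFD per maximal invalid subpart, matching Chrome's behavior above:

        # Python 3: one U+FFFD per maximal invalid subsequence (\xe3\xab),
        # and the valid 'S' is preserved.
        fragment = b'PREFIX\xe3\xabSUFFIX'
        print(fragment.decode('utf-8', 'replace'))  # -> PREFIX\N{REPLACEMENT CHARACTER}SUFFIX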

    Read the article

  • Are there any free .NET OCR libraries that will perform OCR on an application window directly?

    - by Kelsey
    I am looking for a free .NET OCR library that can perform OCR on a given application window or an image in memory (I can take a snapshot of the application window myself). I have looked at tessnet2 and MODI, but both require an image located on disk. I need to use OCR because the application I am trying to write a script for does some wacky stuff that cannot be read using the Windows API, so I need to scrape data from the screen. I have tested both tessnet2 and MODI and they can mostly read the text, but because this has to run in an environment that cannot write to disk, I need to be able to read directly from the application window or some type of memory stream. I am thinking OCR is my only solution, but there could be other methods I am not thinking of. Suggestions?
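    It may be worth double-checking whether tessnet2 can be fed a System.Drawing.Bitmap held in memory rather than a file path. The snapshot-then-OCR pipeline itself is easy to sketch in-memory; here is the flow in Python with Pillow and pytesseract (the library choice is mine, not the question's .NET stack):

        # Sketch: grab a screen region and OCR it without touching disk.
        # Requires Pillow and pytesseract (plus a Tesseract install).
        from PIL import ImageGrab
        import pytesseract

        # bounding box of the target window: left, top, right, bottom
        snapshot = ImageGrab.grab(bbox=(0, 0, 800, 600))
        text = pytesseract.image_to_string(snapshot)  # PIL image, in memory
        print(text)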

    Read the article

  • How would you protect a database of links from being scraped?

    - by Yegor
    I have a large database of links, all sorted in specific ways and attached to other information which is valuable (to some people). Currently my setup (which seems to work) simply calls a PHP file like link.php?id=123; it logs the request with a timestamp into the DB, and before it spits out the link, it checks how many requests were made from that IP in the last 5 minutes. If that's greater than x, it redirects you to a captcha page. That all works fine and dandy, but the site has been getting really popular (as well as being DDoSed for about 6 weeks), so PHP has been getting floored, and I'm trying to minimize the number of times I have to hit up PHP to do something. I wanted to show links in plain text instead of through link.php?id= and have an onclick function simply add 1 to the view count. I'm still hitting up PHP, but at least if it lags, it does so in the background, and the user can see the link they requested right away. Problem is, that makes the site REALLY scrapable. Is there anything I can do to prevent this, without relying on PHP to do the check before spitting out the link?
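    One middle ground is to keep the links opaque but make them cheap to validate: embed a short-lived signed token in each href when the page is rendered, so a scraper's harvested URLs expire quickly and bulk requests can be rejected without a database hit. A sketch of the token scheme in Python (the asker's stack is PHP, so take this as the idea rather than drop-in code):

        # Sketch: per-link tokens signed with HMAC and an expiry timestamp.
        # Validation needs no DB lookup, just the shared server-side secret.
        import hashlib
        import hmac
        import time

        SECRET = b'server-side-secret'

        def sign_link(link_id: int, ttl: int = 300) -> str:
            expires = int(time.time()) + ttl
            msg = f'{link_id}:{expires}'.encode()
            sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:16]
            return f'link.php?id={link_id}&exp={expires}&sig={sig}'

        def check(link_id: int, expires: int, sig: str) -> bool:
            if expires < time.time():
                return False  # harvested link went stale
            msg = f'{link_id}:{expires}'.encode()
            good = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:16]
            return hmac.compare_digest(sig, good)

        print(sign_link(123))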

    Read the article

  • Is selling a "website screen scraper" illegal?

    - by Yatendra Goel
    I have coded a "website screen scraper" and want to sell it commercially. I know that the webmaster of the target website has restricted scraping of its pages; the site's robots.txt file says its pages must not be scraped. So my question is whether selling that screen scraper, or using it, is a crime in legal terms. I know this question is related to law, but I thought the software experts on SO might also have an answer to it.

    Read the article

  • Nokogiri Doc Element Not Returning Correctly

    - by TenJack
    I am trying to scrape a Wiktionary entry:

        uri = URI.parse("http://en.wiktionary.org/wiki/" + CGI.escape('abjure'))
        doc = Nokogiri::HTML(open(uri, 'User-Agent' => 'ruby'))

    but the doc shows no elements for this word. Other words work fine, and this word used to work. I have no idea what changed. Does anyone see anything wrong with this?

    Read the article

  • How to export scrubyt extractor?

    - by robintw
    I've written a scrubyt extractor based on the 'learning' technique - that is, specifying the current text on the page and getting it to work out the XPath expressions itself. However, I now want to export the extractor so that it can be used even when the page has changed. The documentation for scrubyt seems to be all over the place now, but from what I can find I should be able to add the line extractor.export(__FILE__) and it should work. It doesn't - I just get an error saying that there is the wrong number of arguments for export: it should have 0. I've tried it without any arguments and it still fails. I would ask on the scrubyt forum, but it seems like no one's been there for ages! Any ideas what to do here?

    Read the article
