Search Results

Search found 82 results on 4 pages for 'scraper'.

  • How do I deal with content scrapers? [closed]

    - by aem
    Possible Duplicate: How to protect SHTML pages from crawlers/spiders/scrapers? My Heroku (Bamboo) app has been getting a bunch of hits from a scraper identifying itself as GSLFBot. Googling for that name produces various reports from people who've concluded that it doesn't respect robots.txt (e.g., http://www.0sw.com/archives/96). I'm considering updating my app to keep a list of banned user-agents, serve all requests from those user-agents a 400 or similar, and add GSLFBot to that list. Is that an effective technique, and if not, what should I do instead? (As a side note, it seems weird for an abusive scraper to use a distinctive user-agent.)
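
    Blocking by user-agent is easy to sketch. The asker's app runs on Heroku's Ruby stack, but the technique is stack-agnostic; here is a minimal illustration as Python WSGI middleware. The asker suggests a 400; a 403 Forbidden is the conventional choice, and the names below are illustrative rather than taken from the question:

        BANNED_AGENTS = ('GSLFBot',)

        def block_scrapers(app):
            """Wrap a WSGI app so requests from banned user-agents get a 403."""
            def middleware(environ, start_response):
                ua = environ.get('HTTP_USER_AGENT', '')
                if any(bot in ua for bot in BANNED_AGENTS):
                    start_response('403 Forbidden', [('Content-Type', 'text/plain')])
                    return ['Forbidden\n']
                return app(environ, start_response)
            return middleware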

    Read the article

  • problem with MANIFEST.MF in jar

    - by dhananjay
    I have created my jar file at /usr/local/bin/niidle.jar, and I have a dependency at /Projects/EnwelibDatedOct13/Niidle/lib/hector-0.6.0-17.jar which I have to reference from the jar's MANIFEST.MF. When I give the absolute path in MANIFEST.MF:

        Manifest-Version: 1.0
        Main-Class: com.ensarm.niidle.web.scraper.NiidleScrapeManager
        Class-Path: /Projects/EnwelibDatedOct13/Niidle/lib/hector-0.6.0-17.jar

    and run it with java -jar /usr/local/bin/niidle.jar, it works properly. But I don't want to give the full Class-Path; I want to give a relative one:

        Manifest-Version: 1.0
        Main-Class: com.ensarm.niidle.web.scraper.NiidleScrapeManager
        Class-Path: lib/hector-0.6.0-17.jar

    With that manifest, the same command fails with:

        Exception in thread "main" java.lang.NoClassDefFoundError: me/prettyprint/hector/api/Serializer
            at com.ensarm.niidle.web.scraper.NiidleScrapeManager.main(NiidleScrapeManager.java:21)
        Caused by: java.lang.ClassNotFoundException: me.prettyprint.hector.api.Serializer
            at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
            at java.security.AccessController.doPrivileged(Native Method)
            at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
            at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
            at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
            ... 1 more

    Please tell me the solution.
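
    Class-Path entries in MANIFEST.MF are resolved relative to the directory containing the jar itself, not the current working directory. So with Class-Path: lib/hector-0.6.0-17.jar and the jar at /usr/local/bin/niidle.jar, the loader looks for the dependency next to the jar. A sketch of the layout that would make the relative entry work (copy the lib directory alongside the jar):

        /usr/local/bin/niidle.jar
        /usr/local/bin/lib/hector-0.6.0-17.jar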

    Read the article

  • ImportError and Django driving me crazy

    - by John Peebles
    OK, I have the following directory structure (it's a Django project):

        project/
            app/

    and within the app folder there is a scraper.py file which needs to reference a class defined in models.py. I'm trying to do the following:

        import urllib2
        import os
        import sys
        import time
        import datetime
        import re
        import BeautifulSoup

        sys.path.append('/home/userspace/Development/')
        os.environ['DJANGO_SETTINGS_MODULE'] = 'project.settings'
        from project.app.models import ClassName

    and this code just isn't working. I get an error of:

        Traceback (most recent call last):
          File "scraper.py", line 14, in <module>
            from project.app.models import ClassName
        ImportError: No module named project.app.models

    This code used to work, but broke somewhere along the line, and I'm extremely confused as to why I'm having problems. On Snow Leopard, using Python 2.5.
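
    For from project.app.models import ClassName to resolve with /home/userspace/Development/ on sys.path, every package directory needs an __init__.py; a missing one is the most common cause of this ImportError on Python 2.5. A sketch of the layout the import assumes (file names beyond those the asker mentions are illustrative):

        /home/userspace/Development/project/__init__.py
        /home/userspace/Development/project/settings.py
        /home/userspace/Development/project/app/__init__.py
        /home/userspace/Development/project/app/models.py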

    Read the article

  • PHP Magento Screen Scraping

    - by Grant unwin
    I am trying to scrape a supplier's Magento site to save some time, because there are around 2,000 products I need to gather info for. I'm totally OK with writing a screen scraper for pretty much anything, but I've encountered a major problem. I'm using file_get_contents to gather the HTML of the product page. The problem is: you need to be logged in to view the product page. It's a standard Magento login, so how can I get round this in my screen scraper? I don't require a full script, just advice on a method. Thanks
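
    The usual method is to script the login itself and carry the session cookie into every later request. The asker is working in PHP, where cURL with CURLOPT_COOKIEJAR does this job; the flow is sketched below in Python for illustration. The login URL and field names follow common Magento 1 conventions but should be checked against the actual form (later Magento 1 versions also require a form_key field), and the credentials and domain are placeholders:

        import urllib
        import urllib2
        import cookielib

        # One cookie jar shared by every request made through this opener.
        jar = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

        # Step 1: POST the login form so the session cookie lands in the jar.
        login_data = urllib.urlencode({
            'login[username]': 'you@example.com',
            'login[password]': 'secret',
        })
        opener.open('http://supplier.example.com/customer/account/loginPost/', login_data)

        # Step 2: fetch product pages through the same opener -- now logged in.
        html = opener.open('http://supplier.example.com/some-product.html').read()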

    Read the article

  • Scraping HTML WITHOUT unique identifiers using Python

    - by Nicholas Law
    I would like to design an algorithm, using Python, that scrapes thousands of pages like this one and this one, gathers all the data, and inserts it into a MySQL database. The script will be run on a weekly or bi-weekly basis to update the database with any new information added to each individual page. Ideally I would like a scraper that is easy to work with for table-structured data, but also for data that does not have unique identifiers (i.e. id and class attributes). Which scraping library should I use: BeautifulSoup, Scrapy, or Mechanize? Are there any particular tutorials/books I should be looking at for this desired result? In the long run I will be implementing a mobile app that works with all this data by querying the database.
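
    When the markup offers no id or class hooks, a common trick is to anchor on structure or visible text and walk the tree from there. A minimal BeautifulSoup 3 sketch (the HTML fragment and the 'Price:' label are invented for illustration):

        from BeautifulSoup import BeautifulSoup

        html = '<table><tr><td>Price:</td><td>9.99</td></tr></table>'
        soup = BeautifulSoup(html)
        label = soup.find(text='Price:')                   # anchor on the visible label text
        value = label.parent.findNextSibling('td').string  # walk to the neighbouring cell
        print value                                        # prints: 9.99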

    Read the article

  • MANIFEST.MF issue

    - by dhananjay
    Hi, I have created a jar at /usr/local/bin/niidle.jar in Eclipse. Inside niidle.jar there is a lib folder, and in that lib folder there is another jar file, hector-0.6.0-17.jar. I have referenced hector-0.6.0-17.jar in MANIFEST.MF as follows:

        Manifest-Version: 1.0
        Main-Class: com.ensarm.niidle.web.scraper.NiidleScrapeManager
        Class-Path: hector-0.6.0-17.jar

    But when I run it with the command:

        java -jar /usr/local/bin/niidle.jar arguments...

    it is not working. It shows the error message:

        Exception in thread "main" java.lang.NoClassDefFoundError: me/prettyprint/hector/api/Serializer
            at com.ensarm.niidle.web.scraper.NiidleScrapeManager.main(NiidleScrapeManager.java:21)
        Caused by: java.lang.ClassNotFoundException: me.prettyprint.hector.api.Serializer
            at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
            at java.security.AccessController.doPrivileged(Native Method)
            at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
            at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
            at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
            ... 1 more

    What is the problem? Please tell me the solution for this exception.
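
    The nesting is the likely culprit: the standard Java class loader cannot load classes from a jar packaged inside another jar, and Class-Path entries must name files on the filesystem, resolved relative to the directory containing niidle.jar. A layout that would satisfy the manifest above, assuming the hector jar is extracted out of niidle.jar and placed beside it:

        /usr/local/bin/niidle.jar
        /usr/local/bin/hector-0.6.0-17.jar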

    Read the article

  • problem with MANIFEST.MF in jar

    - by Atul
    Hi, I have created a jar at /usr/local/bin/niidle.jar, and my MANIFEST.MF file is as follows:

        Manifest-Version: 1.0
        Main-Class: com.ensarm.niidle.web.scraper.NiidleScrapeManager
        Class-Path: hector-0.6.0-17.jar

    I verified that hector-0.6.0-17.jar is present at /Projects/EnwelibDatedOct13/Niidle/lib/hector-0.6.0-17.jar. I don't want to give the full class path in MANIFEST.MF, because I have to run this jar on other machines, so I gave only the jar file name (Class-Path: hector-0.6.0-17.jar). In spite of mentioning the Class-Path in MANIFEST.MF, when I run it with the command:

        java -jar /usr/local/bin/niidle.jar arguments...

    it shows the error message:

        Exception in thread "main" java.lang.NoClassDefFoundError: me/prettyprint/hector/api/Serializer
            at com.ensarm.niidle.web.scraper.NiidleScrapeManager.main(NiidleScrapeManager.java:21)
        Caused by: java.lang.ClassNotFoundException: me.prettyprint.hector.api.Serializer
            at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
            at java.security.AccessController.doPrivileged(Native Method)
            at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
            at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
            at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
            ... 1 more

    Please give me a solution for this error message.

    Read the article

  • Accessing a persistent ssh tunnel

    - by woowaa
    How do I pass shell commands to a persistent SSH tunnel, rather than opening a connection for every instance? I have a Python scraper running on a client server which passes URL variables and shell commands to a remote host via a reverse tunnel (forwarded port), so that the URLs are then executed on the host (Python Fabric: ssh localhost:12345 'browser open URL'). I could make the reverse tunnel persistent, but how do I echo the URL/command to the session? Update: ControlMaster (built into SSH) solves this one.
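
    For reference, ControlMaster multiplexes later ssh invocations over one persistent master connection, so each command reuses the open session instead of negotiating a new one. A typical ~/.ssh/config stanza (the host alias and timeout are illustrative):

        Host tunnelhost
            ControlMaster auto
            ControlPath ~/.ssh/cm-%r@%h:%p
            ControlPersist 10m

    Subsequent ssh tunnelhost 'command' calls then ride the existing connection.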

    Read the article

  • Migrating from register.com webmail

    - by mgb
    Does anyone know of any tools to migrate emails from register.com's webmail to anything else (Gmail, local mbox, anything!)? I have nearly 15 years of mail and folders I want to keep. Other than spending ages writing some Perl scraper code, I wondered if anyone else had already done this. (Sorry, it's impossible to search Google for the name of a registrar!)

    Read the article

  • Best tool for DOM manipulation?

    - by Olivier Lalonde
    I'm working on a web scraper which will aggregate data from various websites. I started with PHP's built-in DOM functions, but after running into a couple of issues (especially regarding malformed markup and character encoding) I've chosen to ditch PHP. I was thinking of server-side JavaScript but am open to other suggestions. If I go with JavaScript, which interpreter should I use?

    Read the article

  • Yahoo Web Scrapes: What are the limits?

    - by bvandrunen
    We are using a web scraper set up with a sleep function that uses a random interval (so that the time between scrapes isn't constant), but we are still getting blocked by Yahoo after 20-30 requests. Does anyone know if there is a limit (e.g. 20 requests per minute, 200 an hour)? Right now our average time between requests is around 3-6 seconds. Thanks for any help
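
    Yahoo doesn't publish hard request limits, so the practical approach is to treat error responses as a throttling signal and back off, rather than tuning to a fixed rate. A rough Python 2 sketch of the idea (the status codes and delays are illustrative; Yahoo has historically answered throttled clients with status 999):

        import random
        import time
        import urllib2

        def polite_get(url):
            time.sleep(3 + random.random() * 3)   # the 3-6 s jitter the asker describes
            try:
                return urllib2.urlopen(url).read()
            except urllib2.HTTPError, e:
                if e.code in (403, 429, 999):     # looks like throttling: back off hard
                    time.sleep(300)
                raise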

    Read the article

  • What is the correct terminology to describe a visual display that is about the size of a living room

    - by JW
    I'm thinking that, as flat screens get bigger and cheaper, it won't be too long before 'digital wallpaper'-like screens become popular in people's living rooms, with a host of applications that could take advantage of this particular screen size/resolution. Is there a proper name for this size of screen?

        'Wall Screen' - too ambiguous
        'Massive Screen' - probably best reserved for something you'd put on the side of a skyscraper
        'Small Screen' - nabbed by the mobiles
        'Large Screen' - kind of means desktop

    I'm thinking of the kind of screen used in 'Minority Report'.

    Read the article

  • PHP cURL not formatting quotes properly; producing weird character set for single/double quotes

    - by user595052
    I wrote an HTML scraper to scrape my various social identities, so I can make a real-time 'biography' website. However, after using PHP's curl_exec, I find that text I have quoted ends up in a weird character set. For example: "I love dogs" gets formatted to ’I love dogs ’ and "I hate cheese" gets formatted to “I hate cheese�. How do I either scrub these characters, or set cURL not to format quotes like this? Also, I have already turned off magic_quotes.
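
    This is classic mojibake rather than anything cURL does to the bytes: the scraped pages use UTF-8 'curly' quotes, and those bytes are later being interpreted as Windows-1252, typically because the output page is served without a UTF-8 charset declaration (in PHP, sending a Content-Type: text/html; charset=utf-8 header is the usual fix). A small Python 2 demonstration of the mechanism:

        # U+2019 (right single quotation mark) encoded as UTF-8 is the bytes E2 80 99.
        s = '\xe2\x80\x99'
        print repr(s.decode('utf-8'))    # u'\u2019' -- the intended curly apostrophe
        print repr(s.decode('cp1252'))   # u'\xe2\u20ac\u2122', i.e. the mojibake the asker sees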

    Read the article

  • Is there a way to flush HTML to the wire in Sinatra?

    - by thismatt
    I have a Sinatra app with a long-running process (a web scraper). I'd like the app to flush the results of the crawler's progress as the crawler is running, instead of at the end. I've considered forking the request and doing something fancy with Ajax, but this is a really basic one-page app that just needs to output a log to the browser as it's happening. Any suggestions?

    Read the article

  • What does your Lisp workflow look like?

    - by Duncan Bayne
    I'm learning Lisp at the moment, coming from a language progression that is Locomotive BASIC - Z80 assembler - Pascal - C - Perl - C# - Ruby. My approach is to simultaneously write a simple web scraper (using SBCL, Quicklisp, closure-html, and Drakma) and watch the SICP lectures. I think this is working well; I'm developing good 'Lisp goggles', in that I can now read Lisp reasonably easily, and I'm getting a feel for how the Lisp ecosystem works, e.g. Quicklisp for dependencies. What I'm really missing, though, is a sense of how a seasoned Lisper actually works. When I'm coding for .NET, I have Visual Studio set up with ReSharper and VisualSVN. I write tests, I implement, I refactor, I commit. When I've done enough of that to complete a story, I write some AUATs. Then I kick off a release build on TeamCity to push the new functionality out to the customer for testing and, hopefully, approval. If it's an app that needs an installer, I use either WiX or Inno Setup, obviously building the installer through the CI system. So, my question is: as an experienced Lisper, what does your workflow look like? Do you work mostly in the REPL, or in the editor? How do you do unit tests? Continuous integration? Packaging and deployment? When you sit down at your desk, steaming mug of coffee to one side and a framed photo of John McCarthy to the other, what is it that you do? Currently, I feel like I am getting to grips with Lisp coding, but not Lisp development...

    Read the article

  • The Iron Bird Approach

    - by David Paquette
    It turns out that designing software is not so different from designing commercial aircraft. I just finished watching a video about the approach Bombardier is taking in designing the new C Series aircraft, and I was struck by the similarities to agile approaches to software design. In the video, Bombardier describes how they are using an Iron Bird to work through a number of design questions in advance of ever having a version of the aircraft that can fly. The Iron Bird is a life-size replica of the plane; based on the name, I would assume the replica is built in a very heavy material that could never fly. Using this replica, Bombardier is able to validate certain assumptions, such as the length of each wire in the electrical system. They are also able to confirm that some parts are working properly (like the rudders). They even go as far as having a complete replica of the cockpit, which allows Bombardier to put pilots in it to run through simulated take-off and landing sequences. The basic tenets of the approach seem to be: validate your design early with working prototypes, and get feedback from users early, well in advance of finishing the end product. In software development, we tend to think of ourselves as special. I often tell people that it is difficult to draw comparisons to building items in the physical world ("Building software is nothing like building a skyscraper"). After watching this video, I am wondering if designing/building software is actually a lot like designing/building commercial aircraft. Watch the video here (http://www.theglobeandmail.com/report-on-business/video/video-selling-the-c-series/article4400616/)

    Read the article

  • Optimising news fetching

    - by aceBox
    I have a web scraper for scraping news from different sources on WP7. My current approach for doing this is:

        1. Load newspaper information from an XML file.
        2. Go to the specified sections and fetch the URLs of the news items.
        3. Go to each URL and fetch the headline, image, and publisher.
        4. Display using the MVVM architecture of Windows Phone.

    The whole thing takes place asynchronously: as soon as a URL from a section of a newspaper is fetched, it is added to the queue, and the second stage, fetching the headline, image, etc., starts; as soon as these are fetched, even for one article, it is displayed. Later on, as more articles are fetched, they are added to the list. For the fetching I am using a SmartThreadPool (http://www.codeproject.com/Articles/7933/Smart-Thread-Pool) for Windows Phone. My problem is that even fetching around 80 items (in total) from 9 publications takes more than a minute. How can I speed up the procedure? Note: I have a two-stage approach because many times the images are not available with the headlines and are only found in the article.

    Read the article

  • Getting BeautifulSoup to find a specific <p>

    - by Ryan
    I'm trying to put together a basic HTML scraper for a variety of scientific journal websites, specifically trying to get the abstract or introductory paragraph. The current journal I'm working on is Nature, and the article I've been using as my sample can be seen at http://www.nature.com/nature/journal/v463/n7284/abs/nature08715.html. I can't get the abstract out of that page, however. I'm searching for everything between the <p class="lead">...</p> tags, but I can't seem to figure out how to isolate them. I thought it would be something simple like:

        from BeautifulSoup import BeautifulSoup
        import re
        import urllib2

        address = 'http://www.nature.com/nature/journal/v463/n7284/full/nature08715.html'
        html = urllib2.urlopen(address).read()
        soup = BeautifulSoup(html)
        abstract = soup.find('p', attrs={'class': 'lead'})
        print abstract

    Using Python 2.5 and BeautifulSoup 3.0.8, running this returns None. I have no option of using anything else that needs to be compiled/installed (like lxml). Is BeautifulSoup confused, or am I?
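
    The find() call itself is valid BeautifulSoup 3 usage, so a sensible first step is to check how much of the page survived parsing: BS3 can stop building the tree at badly malformed markup, so a tag that exists in the raw HTML may simply never make it into the soup. A small diagnostic sketch under that assumption:

        from BeautifulSoup import BeautifulSoup
        import urllib2

        address = 'http://www.nature.com/nature/journal/v463/n7284/full/nature08715.html'
        html = urllib2.urlopen(address).read()
        soup = BeautifulSoup(html)

        # A parsed tree much shorter than the raw page means the parser bailed out
        # early, possibly before it ever reached <p class="lead">.
        print len(html), len(str(soup))
        print [p.get('class') for p in soup.findAll('p')]   # the <p> classes BS3 actually sees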

    Read the article

  • problem loading class 'me.prettyprint.hector.api.Serializer'

    - by dhananjay patil
    I have created an executable jar but am having a problem with a ClassNotFoundException. When I type the command:

        java -jar JarFileName.jar arguments...

    I get the error message:

        Exception in thread "main" java.lang.NoClassDefFoundError: me/prettyprint/hector/api/Serializer
            at com.ensarm.niidle.web.scraper.NiidleScrapeManager.main(NiidleScrapeManager.java:21)
        Caused by: java.lang.ClassNotFoundException: me.prettyprint.hector.api.Serializer
            at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
            at java.security.AccessController.doPrivileged(Native Method)
            at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
            at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
            at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
            ... 1 more

    Please tell me the solution for this; the class is not getting loaded from the external jar.

    Read the article

  • A question on webpage data scraping using Java

    - by Gemma
    Hi there. I am trying to implement a simple HTML webpage scraper using Java, and I have a small problem. Suppose I have the following HTML fragment:

        <div id="sr-h-left" class="sr-comp">
        <a class="link-gray-underline" id="compare_header" rel="nofollow"
           href="javascript:i18nCompareProd('/serv/main/buyer/ProductCompare.jsp?nxtg=41980a1c051f-0942A6ADCF43B802');"
        Compare
        Showing 1 - 30 of 1,439 matches

    The data I am interested in is the number 1,439 shown at the bottom. I am just wondering how I can get that number out of the HTML. I am considering using a regular expression and then java.util.Pattern to help get the data out, but I am still not very clear about the process. I would be grateful if you could give me some hints or ideas on this data scraping. Thanks a lot.
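
    A regular expression over the raw HTML is enough here: match the "of ... matches" phrase and strip the thousands separator. Sketched below in Python for brevity; the pattern string carries over unchanged to java.util.Pattern:

        import re

        text = 'Showing 1 - 30 of 1,439 matches'
        m = re.search(r'of\s+([\d,]+)\s+matches', text)
        if m:
            total = int(m.group(1).replace(',', ''))
            print total   # prints: 1439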

    Read the article

  • Scraping paginated items from a website using Scrapy

    - by Mridang Agarwalla
    I'm using Scrapy to scrape items from a site, but I'm not able to implement this scraping pattern. The site I'm trying to scrape is a forum, and I scrape the site once a day. Each page has a table containing posts. New posts are added to the top of the table, and as more and more posts are posted to the site, the older posts move further down the pages due to pagination. This is a very simple scenario, and we will assume that the order of the posts never changes. I would like to scrape this site and collect all the "new" records until the last scraped post from yesterday is encountered. I have configured my spider to paginate endlessly, and when it encounters yesterday's last scraped post, it should stop. How can I implement this? (My Scrapy installation works with my Django installation using django-dynamic-scraper.)
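
    One way to express "stop when you hit yesterday's newest post" is to raise CloseSpider from the parse callback. A rough sketch in plain Scrapy, using current idioms; the XPaths, the post-id format, and loading last_seen from storage are all placeholders, and django-dynamic-scraper would wire this in differently:

        import scrapy
        from scrapy.exceptions import CloseSpider

        class ForumSpider(scrapy.Spider):
            name = 'forum'
            start_urls = ['http://forum.example.com/?page=1']
            last_seen = 'post-12345'   # id of yesterday's newest post; load from the DB in practice

            def parse(self, response):
                for post_id in response.xpath('//table[@class="posts"]//tr/@id').extract():
                    if post_id == self.last_seen:
                        # Everything below this point was scraped yesterday: stop.
                        raise CloseSpider('reached last scraped post')
                    yield {'post_id': post_id}
                next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
                if next_page:
                    yield response.follow(next_page, callback=self.parse)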

    Read the article
