crawler - Page 9 - Developer IT

Prevent bot from crawling certain areas of site.

- by Skoder

Hey, I don't know much about SEO and how web spiders work, so forgive my ignorance here. I'm creating a site (using ASP.NET-MVC) which has areas that displays information retrieved from the database. The data is unique to the user, so there's no real server-side output caching going on. However, since the data can contain things the user may not wish to have displayed from search engine results, I'd like to prevent any spiders from accessing the search results page. Are there any special actions I should take to ensure that the search result directory isn't crawled? Also, would a spider even crawl a page that's dynamically generated and would any actions preventing certain directories being search mess up my search engine rankings? edit: I should add, I'm reading up on robots.txt protocol, but it relies on co-operation from the web crawler. However, I'd also like to prevent any data-mining users who will ignore the robots.txt file. I appreciate any help!

Read the article

fuzzy implementaion for capture specific strings

- by kasun-456

I am going to develop a web crawler using java to capture hotel room prices from hotel websites. In this case i want to capture room price with the room type and the meal type, so my algorithm should intelligent for that. as an example: Room type: Delux Meal type: HalfBoad price : $20.00 The main problem is room prices can be in different different ways in different different hotel sites. so my algorithm should independent from hotel sites. I am plan to use above room types and meal types as a fuzzy sets and compare the words in webpage with above fuzzy sets using a suitable membership function. any one experienced with this??? or have an Idea for my problem??

Read the article

Creating a spider using Scrapy, Spider generation error.

- by Nacari

I just downloaded Scrapy (web crawler) on Windows 32 and have just created a new project folder using the "scrapy-ctl.py startproject dmoz" command in dos. I then proceeded to created the first spider using the command: scrapy-ctl.py genspider myspider myspdier-domain.com but it did not work and returns the error: Error running: scrapy-ctl.py genspider, Cannot find project settings module in python path: scrapy_settings. I know I have the path set right (to python26/scripts), but I am having difficulty figuring out what the problem is. I am new to both scrapy and python so there is a good possibility that I have failled to do something important. Also, I have been using eclipse with the Pydev plugin to edit the code if that might cause some problems.

Read the article

ruby 1.9: invalid byte sequence in UTF-8

- by Marc Seeger

I'm writing a crawler in ruby (1.9) that consumes lots of HTML from a lot of random sites. When trying to extract links, I decided to just use .scan(/href="(.*?)"/i) instead of nokogiri/hpricot (major speedup). The problem is that I now receive a lot of "invalid byte sequence in UTF-8" errors. From what I understood, the net/http library doesn't have any encoding specific options and the stuff that comes in is basically not properly tagged. What would be the best way to actually work with that incoming data? I tried .encode with the replace and invalid options set, but no success so far...

Read the article

Django : proper way to use model, duplicates!

- by llazzaro

Hello, I have a question about the proper, best way to manage the model. I am relative newbie to django, so I think I need to read more docs, tutorials,etc (suggestions for this would be cool!). Anyway, this is my question : I have a python web crawler, that is "connected" with django model. Crawling is done once a day, so its really common to find "duplicates". To avoid duplicates I do this : cars = Car.Objects.filter(name=crawledItem['name']) if len(cars) 0: #object already exists, update it car = cars[0] else: car = Car() #some non-relevant code here car.save() I want to know, if this is the proper/correct way to do it or its any "automatic" way to do it. Its possible to put the logic inside the Car() constructor also, should I do that? Thanks a lot!

Read the article

Programmatically get screenshot of page

- by Ender

I'm writing a specialized crawler and parser for internal use and I require the ability to take a screenshot of a web page in order to check what colours are being used throughout. The program will take in around ten web addresses and will save them as a bitmap image, from there I plan to use LockBits in order to create a list of the five most used colours within the image. To my knowledge it's the easiest way to get the colours used within a web page but if there is an easier way to do it please chime in with your suggestions. Anyway, I was going to use this program until I saw the price tag. I'm also fairly new to C#, having only used it for a few months. Can anyone provide me with a solution to my problem of taking a screenshot of a web page in order to extract the colour scheme?

Read the article

How to free virtual memory ?

- by Mehdi Amrollahi

I have a crawler application (with C#) that downloads pages from web . The application take more virtual memory , even i dispose every object and even use GC.Collect() . This , have 10 thread and each thread has a socket that downloads pages . I use dispose method and even use GC.Collect() in my application , but in 3 hour my application take 500 MB on virtual memory (500 MB on private bytes in Process explorer) . Then my system will be hang and i should restart my pc . Is there any way that i use to free vitual memory ? Thanks .

Read the article

Best way to implement a 404 in ASP.NET

- by Ben Mills

I'm trying to determine the best way to implement a 404 page in a standard ASP.NET web application. I currently catch 404 errors in the Application_Error event in the Global.asax file and redirect to a friendly 404.aspx page. The problem is that the request sees a 302 redirect followed by a 404 page missing. Is there a way to bypass the redirect and respond with an immediate 404 containing the friendly error message? Does a web crawler such as Googlebot care if the request for a non existing page returns a 302 followed by a 404?

Read the article

Can EC2 instances be set up to come from different IP ranges?

- by Joshua Frank

I need to run a web crawler and I want to do it from EC2 because I want the HTTP requests to come from different IP ranges so I don't get blocked. So I thought distributing this on EC2 instances might help, but I can't find any information about what the outbound IP range will be. I don't want to go to the trouble of figuring out the extra complexity of EC2 and distributed data, only to find that all the instances use the same address block and I get blocked by the server anyway. NOTE: This isn't for a DoS attack or anything. I'm trying to harvest data for a legitimate business purpose, I'm respecting robots.txt, and I'm only making one request per second, but the host is still shutting me down. Edit: Commenter Paul Dixon suggests that the act of blocking even my modest crawl indicates that the host doesn't want me to crawl them and therefore that I shouldn't do it (even assuming I can work around the blocking). Do people agree with this?

Read the article

how to install python-spidermonkey on windows

- by paul

Hello all, im making some script with python mechanize, one of problem is it really hard to find which support javascript supported web client scraping or crawler. actually i was found some such as python-spidermonkey and pykhtml and so on. but most of all only support on linux . i want to make my python script with exe file. so definitely i have to install on windows platform. my question is ..are there any method to can install python-spidermonkey or pykhtml on windows platform? i really need to support windows platform. if anyone can hint or help really appreicate! thanks in advance Paul

Read the article

Guidelines for good webcrawler 'Etiquette'

- by Harry

I'm building a search engine (for fun) and it has just struck me that potentially my little project might wreak havok by clicking on ads and all sorts of problems. So what are the guidelines for good webcrawler 'Etiquette'? Things that spring to mind: Observe Robot.txt instructions Limit the number of simultaneous requests to the same domain Don't follow ad links? Stopping the crawler from clicking on ads - This one is particularly on my mind at the moment... how do i stop my bot from 'clicking' on ads? if it is going straight to the url in the ad is it counted as a click?

Read the article

clear html webpage?

- by noname

i am currently developing a crawler that crawls all links on the web and displays them in the web browser (and saving it of course). but after some hours there will be a huge list displayed on the web browser and i want to only display lets say 1000 links at the same time. then i clear the html and display another 1000 links. this is also good for the RAM or it will eat up all memory. how do i clear the web browser screen? EDIT: i have seen some scripts using some flush buffer functions. has this anything to do with my case?

Read the article

Error in Decompression?

- by user595606

I am writing a crawler for a website. Its response is gzip encoded. I am not able to parse correctly a particular field, though the decompression is successful. I am also using htmlagilitypack to parse it, the parsed value of the field is only a part of the original value as an example : I am getting only /wEWAwKc04vTCQKb86mzBwKln/PuCg== whereas the firebug shows the actual value as much longer: /wEWBgKj7IuJCgKb86mzBwKln/PuCgLT250qAtC0+8cMAvimiNYD what does the '==' at the end means? I am assuming it that its a error on decompressors behalf?

Read the article

How To Discover RSS Feeds for a given site.

- by ktolis

The quest is, given a site url (say http://stackoverflow.com/ ) to return the list of all the feeds available on the site. Methods acceptable: a) use a 3rd party service (google?, yahoo?, ...) programmatically b) using a crawler/spider (and some tips on how to configure the spider to return the rss/xml feeds only) c) programmatically using c/c++/php (any language/library) The task here is not to get the feeds contained on the page returned by the url but ALL the feeds that are available on the server at any depth... in any cases please provide a simple usage example.

Read the article

Build Interactive Floor With Projection !?

- by Synxmax

Dear Guys I am somehow newbie and sometimes guru , but at this moment i don't have enough time to research My nightmare is , my boss ask me to create an interactive floor ( he just saw in an exhibition ) , he ask me to create one of them instead of buying , i am an actionscript crawler and developer with some skills in java and c# programming , i just made some track motion with a simple web cam , and this idea came to my mind if i can use an infrared or thermographic camera instead of simple camera so i can get better positioning if camera place at top of floor ! Now i just came here to ask you guys is there any resource , tip , help i can know before getting into this deal !? is there any lib or api out to deal with this ?! EVEN, if there is any resource , article from another language c++ , c , .... could help i just didn't have enough time to test lot of ways If you search interactive floors , or interactive floor projection you can find some companies who provide such a thing Thanks in advance ( and sorry for my damn poor english , français could be better :D )

Read the article

Is there such a thing as a converter from php to html?

- by 0plus1

Don't think that I'm mad, I understand how php works! That being said. I develop personal website and I usually take advantage of php to avoid repetion during the development phase nothing truly dynamic, only includes for the menus, a couple of foreach and the likes. When the development phase ends I need to give the website in html files to the client. Is there a tool (crawler?) that can do this for me instead of visiting each page and saving the interpreted html?

Read the article

How to end a thread in java?

- by beagleguy

hi all, I have 2 pools of threads ioThreads = (ThreadPoolExecutor)Executors.newCachedThreadPool(); cpuThreads = (ThreadPoolExecutor)Executors.newFixedThreadPool(numCpus); I have a simple web crawler that I want to create an iothread, pass it a url, it will then fetch the url and pass the contents over to a cpuThread to be processed and the ioThread will then fetch another url, etc... At some point the IO thread will not have any new pages to crawl and I want to update my database that this session is complete. How can I best tell when the threads are all done processing and the program can be ended?

Read the article

The Sitemap Paradox

- by Jeff Atwood

We use a sitemap on Stack Overflow, but I have mixed feelings about it. Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata. Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site. Based on our two years' experience with sitemaps, there's something fundamentally paradoxical about the sitemap: Sitemaps are intended for sites that are hard to crawl properly. If Google can't successfully crawl your site to find a link, but is able to find it in the sitemap it gives the sitemap link no weight and will not index it! That's the sitemap paradox -- if your site isn't being properly crawled (for whatever reason), using a sitemap will not help you! Google goes out of their way to make no sitemap guarantees: "We cannot make any predictions or guarantees about when or if your URLs will be crawled or added to our index" citation "We don't guarantee that we'll crawl or index all of your URLs. For example, we won't crawl or index image URLs contained in your Sitemap." citation "submitting a Sitemap doesn't guarantee that all pages of your site will be crawled or included in our search results" citation Given that links found in sitemaps are merely recommendations, whereas links found on your own website proper are considered canonical ... it seems the only logical thing to do is avoid having a sitemap and make damn sure that Google and any other search engine can properly spider your site using the plain old standard web pages everyone else sees. By the time you have done that, and are getting spidered nice and thoroughly so Google can see that your own site links to these pages, and would be willing to crawl the links -- uh, why do we need a sitemap, again? The sitemap can be actively harmful, because it distracts you from ensuring that search engine spiders are able to successfully crawl your whole site. "Oh, it doesn't matter if the crawler can see it, we'll just slap those links in the sitemap!" Reality is quite the opposite in our experience. That seems more than a little ironic considering sitemaps were intended for sites that have a very deep collection of links or complex UI that may be hard to spider. In our experience, the sitemap does not help, because if Google can't find the link on your site proper, it won't index it from the sitemap anyway. We've seen this proven time and time again with Stack Overflow questions. Am I wrong? Do sitemaps make sense, and we're somehow just using them incorrectly?

Read the article

Why do some user agents have spam urls in them?

- by Erx_VB.NExT.Coder

If you go to (say) the last 100 entries (visits) to the botsvsbrowsers.com website (exact link, feel free to take a look: http://www.botsvsbrowsers.com/recent/listings/index.html ), you'd notice that almost every User Agent that has the keywords "Opera" and "Presto" inside them, will almost certainly have a web link (URL/Web Address) inside it, and it won't just be a normal web address, but a HTML anchor tag/link to that address. Why is this so, I could not even find a single discussion about it on the internet, nowhere, I tried varying my search terms many times. If the user agent contains the words "Opera" and "Presto" it doesnt mean it will have this weblink, but it means there is about an 80% change that it will. A typical anchor tag/link inside a user agent will look like this: Mozilla/4.0 <a href="http://osis-uk.co.uk/disabled-equipment">disability equipment</a> (Windows NT 5.1; U; en) Presto/2.10.229 Version/11.60 If you check it out at the website, http://www.botsvsbrowsers.com/recent/listings/index.html you will notice that the back and forward arrows are in there unescaped format. This isn't just true for botsvsbrowsers, but several other user agent listing sites. I'm really confused and feel line I'm in a room full of 10,000 people and am the only one seeing this ghost :). If I'm doing statistical analysis, should I include or exclude this type of user agent from my listing (ie: are these just normal users who've set their user agents to attempt to drive some traffic to their sites as they browser the web), or is there something else going on? The fact that it is so consistent in terms of its format leads me to believe that it is an automated process (the setting or alteration of the user agent) so I cannot decide or understand the process by which this change is made (I know how to change a user agent), but unsure which program or facility is doing this, especially since it is exclusive to Opera (Presto) user agents that are beyond I think an 8 or 9 point something browser version. I've run some statistical tests, parsing entries from all over the place, writing custom programs, to get a better understanding of this. Keep in mind that I see normal URL's in user agents infrequently, they are just text such as +http://www.someSite.com appended to a user agent normally, especially if its a crawler or bot it provided its service URL, this is normal and isnt done with an embedded link (A HREF=) etc, so I'm not talking about "those".

Read the article

Meaning of Crawl errors

- by com

My question is about definition of Crawl errors in Google Webmaster Tools. Crawl errors is devided into few sections. Let's first consider HTTP section. I assume that all broken links in this section was somehow found by crawler, this is not the links from sitemap. If all this links was found by scanning pages from sitemap for links, why it doesn't mention what was the source page, like in sitemap section with column Linked From. Please correct me if I am wrong. Sitemap section. Looks like all those links came from my sitemap. But there is Linked From column, I already know, that all those broken links is from sitemap, so in order to fix the error, I should revise my sitemap. Am I wrong? Not followed section. I don't know what does it mean. Looks like it accumulates all links that caused redirect, but for some reason Google considers all those redirect as wrong redirect. Do you know if there are any set of rules how to determine wrong redirect. Actually I found were was my mistake, I tried to normalize URL and redirect it to the right URL, but I did normalization in a wrong way. Not found section. This section like HTTP section but with 404 errors. This section has Linked From column. But very often Linked From has unavailable. What does it mean, Google can not say me how it found this non existing page. How this section related to sitemap section. Does this section contains all 404 links from sitemap too. But there is too many 404 links, much more than in sitemap. I tried to take a look what we have in Linked From, and I saw that this link came from sitemap two month ago. But why Google keeps it indexed, the link is already dead, new sitemap doesn't have it. If there is any expire date for old links? Unreachable section. Looks like this section for 500 errors. This section doesn't contain Linked From column. There are too many completely meaningless links, I really don't know where this stuff came from, and without Linked From I am not able to figure out how to deal with it. Sorry for such a big topic, but I just want to make it clear, what every section stands for, because it's extremely crucial in order to deal with all those problems. Hopefully it will be useful not just for me. Thanks!

Read the article

Why do some user agents have spam urls in them (and why are they always Opera/Presto User-Agents)?

- by Erx_VB.NExT.Coder

If you go to (say) the last 100 entries (visits) to the botsvsbrowsers.com website (exact link, feel free to take a look: http://www.botsvsbrowsers.com/recent/listings/index.html ), you'd notice that almost every User Agent that has the keywords "Opera" and "Presto" inside them, will almost certainly have a web link (URL/Web Address) inside it, and it won't just be a normal web address, but a HTML anchor tag/link to that address. Why is this so, I could not even find a single discussion about it on the internet, nowhere, I tried varying my search terms many times. If the user agent contains the words "Opera" and "Presto" it doesnt mean it will have this weblink, but it means there is about an 80% change that it will. A typical anchor tag/link inside a user agent will look like this: Mozilla/4.0 <a href="http://osis-uk.co.uk/disabled-equipment">disability equipment</a> (Windows NT 5.1; U; en) Presto/2.10.229 Version/11.60 If you check it out at the website, http://www.botsvsbrowsers.com/recent/listings/index.html you will notice that the back and forward arrows are in there unescaped format. This isn't just true for botsvsbrowsers, but several other user agent listing sites. I'm really confused and feel line I'm in a room full of 10,000 people and am the only one seeing this ghost :). If I'm doing statistical analysis, should I include or exclude this type of user agent from my listing (ie: are these just normal users who've set their user agents to attempt to drive some traffic to their sites as they browser the web), or is there something else going on? The fact that it is so consistent in terms of its format leads me to believe that it is an automated process (the setting or alteration of the user agent) so I cannot decide or understand the process by which this change is made (I know how to change a user agent), but unsure which program or facility is doing this, especially since it is exclusive to Opera (Presto) user agents that are beyond I think an 8 or 9 point something browser version. I've run some statistical tests, parsing entries from all over the place, writing custom programs, to get a better understanding of this. Keep in mind that I see normal URL's in user agents infrequently, they are just text such as +http://www.someSite.com appended to a user agent normally, especially if its a crawler or bot it provided its service URL, this is normal and isnt done with an embedded link (A HREF=) etc, so I'm not talking about "those".

Read the article

What can cause an increase in inactive memory and how to reclaim it?

- by Boaz

I have heavy application running on a CentOS server and I'm seeing a strange memory behavior. Here is a snapshot of a munin graph: As you can see the amount of committed memory increases gradually causing the swap file to be use. What strikes me odd is that the amount of inactive memory keeps growing as well. It is my understanding that the inactive memory is actually memory freed up but not yet clean by the OS and put back in the free memory pool. It seems that running out of memory is acutally caused by this lack of clean up, but I may be wrong. Can you give some tips to find the cause of the problem and/or cause CentOS to reclaim the inactive memory? Thanks. Some extra info: 1) I have a tmpfs mounted on /tmp and the number of files stored there grows (but it is double the amount of the inactive memory). 2) cat /proc/meminfo (at a later stage than the image) gives: MemTotal: 14371428 kB MemFree: 1207108 kB Buffers: 35440 kB Cached: 4276628 kB SwapCached: 785316 kB Active: 9038924 kB Inactive: 3902876 kB HighTotal: 0 kB HighFree: 0 kB LowTotal: 14371428 kB LowFree: 1207108 kB SwapTotal: 10223608 kB SwapFree: 6438320 kB Dirty: 627792 kB Writeback: 0 kB AnonPages: 7844560 kB Mapped: 49304 kB Slab: 146676 kB PageTables: 27480 kB NFS_Unstable: 0 kB Bounce: 0 kB CommitLimit: 17409320 kB Committed_AS: 16471488 kB VmallocTotal: 34359738367 kB VmallocUsed: 275852 kB VmallocChunk: 34359462007 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 Hugepagesize: 2048 kB 3) The application is a combination of MySQL, Heritrix (http://crawler.archive.org/ ) and a Tomcat based Java servlet to manage things.

Read the article

What can cause an increase in inactive memory and how to reclaim it?

- by Boaz

Hi All, I have heavy application running on a CentOS server and I'm seeing a strange memory behavior. Here is a snapshot of a munin graph: As you can see the amount of committed memory increases gradually causing the swap file to be use. What strikes me odd is that the amount of inactive memory keeps growing as well. It is my understanding that the inactive memory is actually memory freed up but not yet clean by the OS and put back in the free memory pool. It seems that running out of memory is acutally caused by this lack of clean up, but I may be wrong. Can you give some tips to find the cause of the problem and/or cause CentOS to reclaim the inactive memory? Thanks. Some extra info: 1) I have a tmpfs mounted on /tmp and the number of files stored there grows (but it is double the amount of the inactive memory). 2) cat /proc/meminfo (at a later stage than the image) gives: MemTotal: 14371428 kB MemFree: 1207108 kB Buffers: 35440 kB Cached: 4276628 kB SwapCached: 785316 kB Active: 9038924 kB Inactive: 3902876 kB HighTotal: 0 kB HighFree: 0 kB LowTotal: 14371428 kB LowFree: 1207108 kB SwapTotal: 10223608 kB SwapFree: 6438320 kB Dirty: 627792 kB Writeback: 0 kB AnonPages: 7844560 kB Mapped: 49304 kB Slab: 146676 kB PageTables: 27480 kB NFS_Unstable: 0 kB Bounce: 0 kB CommitLimit: 17409320 kB Committed_AS: 16471488 kB VmallocTotal: 34359738367 kB VmallocUsed: 275852 kB VmallocChunk: 34359462007 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 Hugepagesize: 2048 kB 3) The application is a combination of MySQL, Heritrix (http://crawler.archive.org/ ) and a Tomcat based Java servlet to manage things.

Read the article

Is it worthwhile to block malicious crawlers via iptables?

- by EarthMind

I periodically check my server logs and I notice a lot of crawlers search for the location of phpmyadmin, zencart, roundcube, administrator sections and other sensitive data. Then there are also crawlers under the name "Morfeus Fucking Scanner" or "Morfeus Strikes Again" searching for vulnerabilities in my PHP scripts and crawlers that perform strange (XSS?) GET requests such as: GET /static/)self.html(selector?jQuery( GET /static/]||!jQuery.support.htmlSerialize&&[1, GET /static/);display=elem.css( GET /static/.*. GET /static/);jQuery.removeData(elem, Until now I've always been storing these IPs manually to block them using iptables. But as these requests are only performed a maximum number of times from the same IP, I'm having my doubts if it does provide any advantage security related by blocking them. I'd like to know if it does anyone any good to block these crawlers in the firewall, and if so if there's a (not too complex) way of doing this automatically. And if it's wasted effort, maybe because these requests come from from new IPs after a while, if anyone can elaborate on this and maybe provide suggestion for more efficient ways of denying/restricting malicious crawler access. FYI: I'm also already blocking w00tw00t.at.ISC.SANS.DFind:) crawls using these instructions: http://spamcleaner.org/en/misc/w00tw00t.html

Read the article

Software to automate website screenshot capture

- by Leniel Macaferi

Do you know any software that can automate the process of getting screenshots of every page of a website? It would act like a spider/crawler/robot. You name it... For example: I developed a website and now I'd like to get a screenshot of every page of the site. I of course could do it manually (a lot of work). For each module of the site (Student, Payment, etc) I have different pages (Create, Edit, Details, Delete, etc) forms. The thing I'm looking for is a software that can visit every link of the site and then capture the screen - a software that can automate the whole process. It would also be good if the software allowed the user to pass a list of URLs to capture screenshots allowing even more fine grained configuration. EDIT: I tried Selenium mentioned by Aaron in his answer but I managed to find an app that does exactly what I needed. It's called Paparazzi!. I wrote a blog post to showcase my attempt at Selenium and the findings regarding Paparazzi!'s batch capture functionality: Software to automate website screenshot capture

Search Results

Search found 261 results on 11 pages for 'crawler'.

Page 9/11 | < Previous Page | 5 6 7 8 9 10 11 | Next Page >

- by Skoder

- by kasun-456

- by Nacari

- by Marc Seeger

- by llazzaro

- by Ender

- by Mehdi Amrollahi

- by Ben Mills

- by Joshua Frank

- by paul

- by Harry

- by noname

- by user595606

- by ktolis

- by Synxmax

- by 0plus1

- by beagleguy

- by Jeff Atwood

- by Erx_VB.NExT.Coder

- by com

- by Erx_VB.NExT.Coder

- by Boaz

- by Boaz

- by EarthMind

- by Leniel Macaferi

< Previous Page | 5 6 7 8 9 10 11 | Next Page >