crawler - Page 2 - Developer IT

Malicious crawler blocker for ASP.NET

- by Marek

I have just stumbled upon Bad Behavior - a plugin for PHP that promises to detect spam and malicious crawlers by preventing them from accessing the site at all. Does something similar exist for ASP.NET/ASP.NET MVC? I am interested in blocking access to the site altogether, not in detecting spam after it was posted.


Google crawler reports a not-found error for a link inside the <head> tag

- by inckka

I've found a crawl error on my site, listed as a page-not-found (404) link. Here's the broken link: http://mydomain.com/blog/comments/feed/

I'm using Google Webmaster Tools and found that the broken link comes from the head tag of my site's pages. Here's the actual code where that link is located:

    <head>
    <link rel="alternate" type="application/rss+xml" title="My Domain Blog » Feed" href="http://www.my-domain.com/blog/feed/" />
    </head>

So Google reports this link as not found. The link target is not an actual page or location, but it is essential for the blog feeds. I need to fix this and remove it from Google's crawl error list, but I have no idea how, because I cannot redirect this link target or send a 404 header for it. Does anyone have an idea for fixing this?


Grapeshot crawler ignoring robots.txt

- by QF_Developer

Has anyone come across a crawler called Grapeshot? They are hammering the same page repeatedly on our website. I believe they are looking for ad-related keywords, based on previous content ad campaigns. The odd thing is that we never ran any such campaigns on the page they are so interested in. We have only a few pages running AdSense; is this what has attracted Grapeshot? I've added the following declaration to my robots.txt, but they don't seem to be honouring it:

    User-agent: grapeshot
    Disallow: /

Any ideas on how to block this nuisance crawler? I'm starting to think the best way is to set up IP rules in IIS.


Logic behind crawling web pages the way Screaming Frog does? [on hold]

- by sree

I would like to know which parameters should be considered when developing a crawler like Screaming Frog. I'm looking for information on the do's and don'ts of web page crawling. What problems can a crawler cause for the pages it crawls, such as increased load time or anything else that affects a page during crawling? What rules does a crawler need to follow? Basically, any info that makes the crawler well behaved and accurate. Just point me in the right direction to achieve it. I hope my requirement is clear this time. :)
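One commonly cited politeness rule is to space out requests to the same host so the crawl does not slow the site down. The following is a minimal Python sketch of that idea only; the class name `HostThrottle` and the delay values are illustrative assumptions, not taken from Screaming Frog.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Tracks the last request time per host and reports how long to wait."""

    def __init__(self, min_delay=1.0, clock=time.monotonic):
        self.min_delay = min_delay  # seconds between hits on one host (assumed value)
        self.clock = clock          # injectable so the logic can be checked without sleeping
        self._last_hit = {}         # host -> timestamp of the last recorded request

    def wait_time(self, url):
        """Return the seconds to wait before fetching url (0.0 if none)."""
        host = urlsplit(url).netloc
        last = self._last_hit.get(host)
        if last is None:
            return 0.0
        remaining = self.min_delay - (self.clock() - last)
        return max(0.0, remaining)

    def record(self, url):
        """Note that url was fetched just now."""
        self._last_hit[urlsplit(url).netloc] = self.clock()
```

A crawler loop would call `time.sleep(throttle.wait_time(url))` before each fetch and `throttle.record(url)` after it. Checking robots.txt before fetching is a separate rule that this sketch does not cover.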


Submitting a sitemap to take care of inherited Google crawler errors

- by leeand00

I have an awful lot of Google Crawler errors (1000 or so) after I inherited a site that the previous owner migrated without moving much of their content. Would generating a map of the current site and submitting it to Google help fix this? Is there any quicker, automated way to eliminate errors other than clicking each and every site error? Note: I have already tried automating this on my own.


How to hide pages from Google crawler? [closed]

- by NoobDev4iPhone

Possible Duplicate: What are the most important things I need to do to encourage Google Sitelinks? I'm currently working on a website and need to keep certain pages hidden from the Google crawler. How can I make search engines see only what I want them to see in a directory? Also, you know how Google results give you shortcut links like 'Login', 'About', etc.? How do I get these links into the search result?


How to best develop web crawlers

- by Fernando Barrocal

Hey all, I often write crawlers to compile information. When I come across a website whose info I need, I start a new crawler specific to that site, using shell scripts most of the time and sometimes PHP. My approach is a simple for loop to iterate over the page list, wget to download each page, and sed, tr, awk or other utilities to clean the page and grab the specific info I need. The whole process takes some time depending on the site, and more to download all the pages. And I often run into an AJAX site, which complicates everything. I was wondering if there are better or faster ways to do this, or even applications or languages that help with such work.
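As one possible alternative to the wget plus sed/awk pipeline, an HTML parser can pull out the needed elements without text-level cleanup. This is only a sketch using Python's standard library; `LinkExtractor` and `extract_links` are hypothetical names, and it does not execute JavaScript, so AJAX-loaded content would still need a different approach.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in html, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

The page body itself could be downloaded with `urllib.request.urlopen` and passed to `extract_links` together with its URL.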


Weird 301 redirection by google crawler

- by Ace

I have some pages on my website www.acethem.com which are reported as 301 redirects, but they are not actually 301 redirects. E.g. www.acethem.com/pastpapers/by-year/2007/ is seen by Google as a 301 redirect to www.acethem.com/pastpapers/by-year (I am using "Fetch as Google" in Webmaster Tools). Now it gets weirder: my paginated pages with page = 10 are all redirected to the homepage, e.g. http://www.acethem.com/pastpapers/o-level/chemistry/page/10/, while http://www.acethem.com/pastpapers/o-level/chemistry/page/9/ works properly in the Google crawler. Note that all these pages work fine with no redirect in browsers. Sidenote: on www.acethem.com/pastpapers/by-year/2007/, the Facebook share button also points to www.acethem.com/pastpapers/by-year/.


Can AdSense crawler view pages that require cookies?

- by moomoochoo

Details

I require users to agree to terms and conditions before they can view several pages on my site. Once they have agreed, a cookie is set and they can proceed to the webpage. If a user somehow manages to end up on the webpage without a cookie, they will not be able to access the page's content.

My question(s)

- Is the AdSense crawler able to set the cookie and visit these pages?
- If yes, how will it know to agree to the TOS?
- Is there some way to allow it access to the pages even if it couldn't use cookies?


Which web crawler to use to save news articles from a website into .txt files?

- by brokencoding

Hi, I am currently in dire need of news articles to test an LSI implementation (it's in a foreign language, so there aren't the usual packs of files ready to use). So I need a crawler that, given a starting URL, let's say http://news.bbc.co.uk/, follows all the contained links and saves their content into .txt files. If we could specify the format to be UTF-8, I would be in heaven. I have zero expertise in this area, so I beg you for some suggestions on which crawler to use for this task.
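For the "save page content as UTF-8 .txt" part, a rough Python sketch using only the standard library might look like the following. The names `TextExtractor`, `html_to_text` and `save_as_utf8` are hypothetical, and the sketch handles a single already-downloaded page; following links and downloading are left out.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the page's text, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def save_as_utf8(html, path):
    """Write the extracted text to path, always encoded as UTF-8."""
    Path(path).write_text(html_to_text(html), encoding="utf-8")
```

The page must be decoded with its real source encoding before it is passed in; the output encoding is then fixed to UTF-8 by `write_text`.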


Extracting data from internet

- by Ankiov Spetsnaz

I would like to extract data from the internet like www.mozenda.com does, but I want to write my own program to do that. The specific data I'm looking for is various event data. Based on my research, I think a custom web crawler is my answer, but I would like to confirm that and see if there are any suggestions for building custom web crawlers, if a web crawler is indeed the answer. Personally, I would prefer Java, and I'm planning on using Glassfish if that matters...


suspicious crawler activity

- by ithkuil

I'm noticing that I get accesses like these:

    66.249.66.198 - - [01/Jul/2011:17:13:46 +0200] "GET /img/clip.incubus.torrent.phtml HTTP/1.1" 404 143 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    66.249.66.198 - - [01/Jul/2011:17:13:48 +0200] "GET /img/clip.global.deejays.download.phtml HTTP/1.1" 404 143 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Those files don't exist, and there is no file on my site that has this content (I hope). Why is Googlebot trying out these links? Reverse DNS and whois state that 66.249.66.198 really is Googlebot.
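The reverse DNS check mentioned above is usually paired with a forward lookup that confirms the hostname resolves back to the same IP. A hedged Python sketch of that two-step check follows; the function names are hypothetical, and the accepted domain suffixes are an assumption based on Google's published crawler verification guidance.

```python
import socket

# Domains that Google's crawler hostnames are expected to fall under (assumption).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def looks_like_google_host(hostname):
    """True if a reverse-DNS name falls under a Google crawler domain."""
    return hostname.rstrip(".").lower().endswith(GOOGLE_SUFFIXES)


def is_verified_googlebot(ip):
    """Reverse lookup, domain check, then forward lookup back to the same IP."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
        if not looks_like_google_host(hostname):
            return False
        forward_ips = socket.gethostbyname_ex(hostname)[2]
    except OSError:
        # Lookup failures are treated as "not verified".
        return False
    return ip in forward_ips
```

`is_verified_googlebot("66.249.66.198")` needs network access, so its result depends on DNS at the time it runs.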


Restricting crawler activity to certain directories with robots.txt

- by neimad

I would like to use robots.txt to prevent indexing of some parts of my website. I want search engines to index only the / directory and not search inside my controllers. In my robots.txt, I have this:

    User-Agent: *
    Disallow: /compagnies/
    Disallow: /floors/
    Disallow: /spaces/
    Disallow: /buildings/
    Disallow: /users/
    Disallow: /

I put this file in /mysite/public. I tested the file with a robots.txt validator and got no errors. However, Google always returns results from my site. For testing, I added Disallow: /, but again, Google indexed all pages. floors, spaces, buildings, etc. are not physical directories. Is this a bug? How can I work around it?
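One way to see what a rule set permits is to run it through a robots.txt parser. The sketch below uses Python's standard `urllib.robotparser` on the rules from the question, without the final test rule; `RULES` and `allowed` are hypothetical names. Note that robots.txt controls crawling, so pages that are already indexed may keep appearing in results for some time regardless of what the parser reports.

```python
from urllib.robotparser import RobotFileParser

# The rules from the question, minus the "Disallow: /" added for testing.
RULES = """\
User-Agent: *
Disallow: /compagnies/
Disallow: /floors/
Disallow: /spaces/
Disallow: /buildings/
Disallow: /users/
"""


def allowed(rules_text, path, agent="*"):
    """Return True if the rules let the given agent fetch path."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(agent, path)
```

With these rules, `allowed(RULES, "/")` is true and `allowed(RULES, "/floors/12")` is false. Appending `Disallow: /` makes every path disallowed, because rules match by path prefix whether or not a physical directory exists.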


Clicks counting and crawler bots

- by Dennis

I am currently running a small affiliate program for Facebook users. We use an auto-poster to publish links to fan pages. Every hit is stored in our database, and we have included a 24-hour reload block for the IP addresses. My problem right now is that the PHP script also stores every hit from all the bots that crawl my website. I was thinking of blocking those bots with my website's robots.txt, but I am afraid that this will have a negative effect on my AdSense ads. Does anybody have an idea how to work this out?


Redirect Google crawler to different robots.txt via .htaccess

- by user3474818

I have googled for the answer all day and still couldn't find one. I have a virtual subdomain www.static.example.com which is a mirror of www.example.com. That means I have just one root folder for both the subdomain and the domain. I want to redirect crawlers to a different robots.txt file, robots_static.txt, when they see .static in the URL, and in that file I will forbid indexing via the Disallow directive. I want to do this because I have duplicated content in Google search results: the subdomain shows exactly the same content as the main domain. Does anyone know how I can make crawlers see robots_static.txt instead of robots.txt? What I have managed to find so far is this:

    RewriteCond %{HTTP_HOST} ^www.static.*$ [NC]
    RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*robots\.txt.*\ HTTP/ [NC]
    RewriteRule ^robots\.txt /robots_static.txt [NC,L]

but when I check in Webmaster Tools, it still sees robots.txt as my robots file instead of robots_static.txt, so it crawls and indexes everything twice. What did I do wrong? Thanks

EDIT: This is my .htaccess file:

    ##
    # @package Joomla
    # @copyright Copyright (C) 2005 - 2013 Open Source Matters. All rights reserved.
    # @license GNU General Public License version 2 or later; see LICENSE.txt
    ##

    ##
    # READ THIS COMPLETELY IF YOU CHOOSE TO USE THIS FILE!
    #
    # The line just below this section: 'Options +FollowSymLinks' may cause problems
    # with some server configurations. It is required for use of mod_rewrite, but may already
    # be set by your server administrator in a way that dissallows changing it in
    # your .htaccess file. If using it causes your server to error out, comment it out (add # to
    # beginning of line), reload your site in your browser and test your sef url's. If they work,
    # it has been set by your server administrator and you do not need it set here.
    ##

    ## Can be commented out if causes errors, see notes above.
    Options +FollowSymLinks

    ## Mod_rewrite in use.

    RewriteEngine On
    RewriteEngine On
    RewriteCond %{HTTP_HOST} !^www\.
    RewriteRule ^(.*)$ http://www.%{HTTP_HOST}/$1 [R=301,L]

    RewriteCond %{HTTP_HOST} ^www.static.*$ [NC]
    RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*robots\.txt.*\ HTTP/ [NC]
    RewriteRule ^robots\.txt /robots_static.txt [NC,L]

    ## Begin - Rewrite rules to block out some common exploits.
    # If you experience problems on your site block out the operations listed below
    # This attempts to block the most common type of exploit `attempts` to Joomla!
    #
    # Block out any script trying to base64_encode data within the URL.
    RewriteCond %{QUERY_STRING} base64_encode[^(]*$[^)]*$ [OR]
    # Block out any script that includes a <script> tag in URL.
    RewriteCond %{QUERY_STRING} (<|%3C)([^s]*s)+cript.*(>|%3E) [NC,OR]
    # Block out any script trying to set a PHP GLOBALS variable via URL.
    RewriteCond %{QUERY_STRING} GLOBALS(=|\[|\%[0-9A-Z]{0,2}) [OR]
    # Block out any script trying to modify a _REQUEST variable via URL.
    RewriteCond %{QUERY_STRING} _REQUEST(=|\[|\%[0-9A-Z]{0,2})
    # Return 403 Forbidden header and show the content of the root homepage
    RewriteRule .* index.php [F]
    #
    ## End - Rewrite rules to block out some common exploits.

    ## Begin - Custom redirects
    #
    # If you need to redirect some pages, or set a canonical non-www to
    # www redirect (or vice versa), place that code here. Ensure those
    # redirects use the correct RewriteRule syntax and the [R=301,L] flags.
    #
    ## End - Custom redirects

    ##
    # Uncomment following line if your webserver's URL
    # is not directly related to physical file paths.
    # Update Your Joomla! Directory (just / for root).
    ##

    # RewriteBase /

    RewriteCond %{THE_REQUEST} ^GET.*index\.php [NC]
    RewriteCond %{THE_REQUEST} !/system/.*
    RewriteRule (.*?)index\.php/*(.*) /$1$2 [R=301,L]
    RewriteCond %{THE_REQUEST} ^GET

    ## Begin - Joomla! core SEF Section.
    #
    RewriteRule .* - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization}]
    #
    # If the requested path and file is not /index.php and the request
    # has not already been internally rewritten to the index.php script
    RewriteCond %{REQUEST_URI} !^/index\.php
    # and the request is for something within the component folder,
    # or for the site root, or for an extensionless URL, or the
    # requested URL ends with one of the listed extensions
    RewriteCond %{REQUEST_URI} /component/|(/[^.]*|\.(php|html?|feed|pdf|vcf|raw))$ [NC]
    # and the requested path and file doesn't directly match a physical file
    RewriteCond %{REQUEST_FILENAME} !-f
    # and the requested path and file doesn't directly match a physical folder
    RewriteCond %{REQUEST_FILENAME} !-d
    # internally rewrite the request to the index.php script
    RewriteRule .* index.php [L]
    #
    ## End - Joomla! core SEF Section.

    <FilesMatch "\.(ico|pdf|flv|jpg|ttf|jpg|jpeg|png|gif|js|css|swf)$">
    Header set Expires "Wed, 15 Apr 2020 20:00:00 GMT"
    Header set Cache-Control "public"
    </FilesMatch>

    <ifModule mod_headers.c>
    Header set Connection keep-alive
    </ifModule>

    ########## Begin - Remove Etags
    #
    FileETag none
    #
    ########## End - Remove Etags


How to build a web crawler to find a specific advert, which is in an iframe loaded by Javascript

- by ZoFreX

I'm trying to find all instances of an advert on a website. The advert is in an iframe which is loaded by javascript (it doesn't appear at all if javascript is turned off). Detecting the advert itself is extremely simple, both the name of the flash file and the target of the href always contain a certain string. What would be the best "starting point" for achieving this? At the moment I'm considering an Adobe AIR app, which could crawl the site and examine the DOM to find the ad, and would run javascript and load the content of the iframe. The other option I can think of is using Firefox as the platform (using maybe GreaseMonkey or Selenium? I don't really know how to leverage Firefox like this). Does anyone know of anything suitable to build this, or have any suggestions on using Firefox to do it? Extra details: Being CPU intensive isn't really an issue, nor is anything depending on a browser being open. This doesn't need to run on a headless server, it will be running on a powerful desktop box. OS is also not an issue. It would be advantageous if the crawler loaded each page multiple times, as the advert is in rotation. While the crawler does need to execute the javascript and load the content of the iframe, it does not need to be able to display flash files.


A good open source web crawler for indexing specific websites for specific content?

- by Peeyush

Hello, please suggest a good open source web crawler written in C++, Java or PHP. I just need to crawl/index some specific websites for specific content (images, text, videos). I know that there are already a lot of questions and answers about this topic on this website, but I am a little confused after reading all of them. So I am sorry if I am repeating the same question. Thanks in advance.


Crawling for geotagged data

- by abe3

I have no experience with web crawlers -- but I know that Apache maintains an open source web crawler called "Lucene." How would I go about writing such a crawler to search the web for geo tagged data close to a particular location? What would a general road map look like? How do I pick which slice of the web to crawl? Do I use regular expressions to find things that look like longitudes and latitudes? What does a general sketch of that solution look like?
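To make the "regular expressions for longitudes and latitudes" idea concrete, here is a small Python sketch. It assumes coordinates appear as plain decimal "lat, lon" pairs in the page text, which real pages often do not follow (they may use meta tags, microformats or degree notation instead); all names are hypothetical.

```python
import math
import re

# Matches decimal "lat, lon" pairs such as "48.8584, 2.2945" (assumed text format).
COORD_RE = re.compile(r"(?<![\d.])(-?\d{1,2}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")


def find_coordinates(text):
    """Return (lat, lon) float pairs whose values fall in valid ranges."""
    pairs = []
    for lat_s, lon_s in COORD_RE.findall(text):
        lat, lon = float(lat_s), float(lon_s)
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            pairs.append((lat, lon))
    return pairs


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def near(text, center, radius_km):
    """Return the coordinate pairs in text within radius_km of center."""
    return [p for p in find_coordinates(text) if haversine_km(p, center) <= radius_km]
```

A crawler would run `near` over each downloaded page's text and keep the pages that return at least one match.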


Message Queue: Which one is the best scenario?

- by pandaforme

I am writing a web crawler. The crawler has 2 steps: get an HTML page, then parse the page. I want to use a message queue to improve performance and throughput. I have 2 scenarios in mind:

Scenario 1:

    structure: urlProducer -> queue1 -> urlConsumer -> queue2 -> parserConsumer

- urlProducer: get a target url and add it to queue1
- urlConsumer: according to the job info, get the HTML page and add it to queue2
- parserConsumer: according to the job info, parse the page

Scenario 2:

    structure: urlProducer -> queue1 -> urlConsumer
               parserProducer -> queue2 -> parserConsumer

- urlProducer: get a target url and add it to queue1
- urlConsumer: according to the job info, get the HTML page and write it to the db
- parserProducer: get the HTML page from the db and add it to queue2
- parserConsumer: according to the job info, parse the page

There can be multiple producers or consumers in each structure. Scenario 1 is like a chained call: it is difficult to locate the problem when errors occur. Scenario 2 decouples queue1 and queue2: it is easy to locate the problem when errors occur. I'm not sure this reasoning is correct. Which scenario is best? Or are there other scenarios? Thanks~
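Scenario 1 can be sketched in-process with Python's `queue.Queue` and two threads, as below. This is an illustration of the chained structure only: `fetch_page` and `parse_page` are hypothetical stand-ins, there is one worker per stage, and a real deployment would likely use an external message broker rather than in-memory queues.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to shut down


def fetch_page(url):
    # Hypothetical stand-in for a real HTTP download.
    return f"<html>{url}</html>"


def parse_page(html):
    # Hypothetical stand-in for real parsing.
    return len(html)


def url_consumer(queue1, queue2):
    """Take URLs from queue1, fetch them, and pass the pages to queue2."""
    while True:
        url = queue1.get()
        if url is STOP:
            queue2.put(STOP)  # propagate shutdown to the parser stage
            break
        queue2.put((url, fetch_page(url)))


def parser_consumer(queue2, results):
    """Take fetched pages from queue2 and store the parse result per URL."""
    while True:
        item = queue2.get()
        if item is STOP:
            break
        url, html = item
        results[url] = parse_page(html)


def run_pipeline(urls):
    queue1, queue2, results = queue.Queue(), queue.Queue(), {}
    workers = [
        threading.Thread(target=url_consumer, args=(queue1, queue2)),
        threading.Thread(target=parser_consumer, args=(queue2, results)),
    ]
    for worker in workers:
        worker.start()
    for url in urls:  # this loop plays the urlProducer role
        queue1.put(url)
    queue1.put(STOP)
    for worker in workers:
        worker.join()
    return results
```

With several workers per stage, each worker would need its own sentinel. Scenario 2 would replace the direct `queue2.put` in `url_consumer` with a database write, plus a separate producer that reads from the database.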


What's holding up my PHP script?

- by gAMBOOKa

We've got a PHP crawler running on our web server. When the crawler is running, there are no CPU, memory or network bandwidth spikes. Everything is normal. But our website (also PHP), hosted on the same server, stops responding. Basically the crawler blocks any other PHP script from running. What could be the problem? EDIT: fsockopen is being used to download files to the crawler!


Best way to store data for Greasemonkey based crawler?

- by Björn

I want to crawl a site with Greasemonkey and wonder if there is a better way to temporarily store values than with GM_setValue. What I want to do is crawl my contacts in a social network and extract the Twitter URLs from their profile pages. My current plan is to open each profile in its own tab, so that it looks more like a normal browsing person (i.e. CSS, scripts and images will be loaded by the browser). Then store the Twitter URL with GM_setValue. Once all profile pages have been crawled, create a page using the stored values. I am not so happy with the storage option, though. Maybe there is a better way? I have considered inserting the user profiles into the current page so that I could process them all with the same script instance, but I am not sure if XMLHttpRequest looks indistinguishable from normal user-initiated requests.


Where are the crawled files stored in the Heritrix web crawler?

- by zahir hussain

Hi, I want to know where the crawled files are stored in the Heritrix web crawler. Thanks in advance.


Is there a search engine that indexes source code of a web-page?

- by Dexter

I need to search the web for sites in our industry that use the same AdWords management company, to ensure that the company is not violating our contract, as they have been accused of doing. They use a tracking code in the template of every page, which has a certain domain in the URL, and I'm wondering if it's possible to "Google" the source code using some bot that crawls the code rather than the content. For example, I bought an unlimited license for an image gallery, and I was asked to type the license number in a comment just before the script. I thought it was just so a human could look at the source and find out if someone paid, but it turned out that they had a crawler looking for their source code and that comment. If it ran across the code on your site, it would look for the comment, and if it found one, it would check whether it was an existing license number. If not, it would first notify you of your noncompliance, and then notify the owner of the script. Edit: I'm looking to index HTML and JavaScript only, not the server-side languages or Java.


Convert url for crawler

- by user260223

Hi, I'm working on a crawler. Usually, when I type url1 in my browser, the browser converts it to url2. How can I do this in Python?

    url1: www.odevsitesi.com/ara.asp?kelime=doganin dengesinin bozulmasi
    url2: www.odevsitesi.com/ara.asp?kelime=do%F0an%FDn%20dengesinin%20bozulmas%FD
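The %F0 and %FD bytes in url2 suggest that the original query contained the Turkish letters ğ and ı and that the browser encoded them in a single-byte Turkish charset. Under that assumption (ISO-8859-9; the helper name `encode_query_value` is hypothetical), a Python 3 sketch with the standard library could be:

```python
from urllib.parse import quote


def encode_query_value(value, charset="iso-8859-9"):
    """Percent-encode value using the site's legacy charset (assumed Turkish Latin-5)."""
    return quote(value, safe="", encoding=charset)


# The query text with Turkish letters is an assumption inferred from url2.
url = "www.odevsitesi.com/ara.asp?kelime=" + encode_query_value("doğanın dengesinin bozulması")
```

If the site expected UTF-8 instead, the same call with `charset="utf-8"` would produce multi-byte escapes rather than %F0 and %FD.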


Trouble with go tour crawler exercise

- by David Mason

I'm going through the Go tour and I feel like I have a pretty good understanding of the language except for concurrency. On slide 71 there is an exercise that asks the reader to parallelize a web crawler (and to make it not cover repeats, but I haven't gotten there yet). Here is what I have so far:

    func Crawl(url string, depth int, fetcher Fetcher, ch chan string) {
        if depth <= 0 {
            return
        }
        body, urls, err := fetcher.Fetch(url)
        if err != nil {
            ch <- fmt.Sprintln(err)
            return
        }
        ch <- fmt.Sprintf("found: %s %q\n", url, body)
        for _, u := range urls {
            go Crawl(u, depth-1, fetcher, ch)
        }
    }

    func main() {
        ch := make(chan string, 100)
        go Crawl("http://golang.org/", 4, fetcher, ch)
        for i := range ch {
            fmt.Println(i)
        }
    }

The issue I have is where to put the close(ch) call. If I put a defer close(ch) somewhere in the Crawl method, then I end up writing to a closed channel in one of the spawned goroutines, since the method will finish execution before the spawned goroutines do. If I omit the call to close(ch), as shown in my example code, the program deadlocks after all the goroutines finish executing, because the main thread is still waiting on the channel in the for loop since the channel was never closed.

Search Results

Search found 261 results on 11 pages for 'crawler'.
