crawlers - Page 2 - Developer IT

What bots are really worth letting onto a site?

- by blunders

Having written a number of bots, and seen the massive amounts of random bots that happen to crawl a site, I am wondering if the goal of the site allowing bots is for the potential for the bot to send real traffic back to the site if there is any reason to allow bots that are not known to be sending real traffic back, and how to spot these "good" bots; based on how they ID themselves, IPs they come from, behaviors, etc.

Read the article

GWT: reporting crawling errors for non existing links

- by pixeline

Google Webmaster Tools is reporting crawl errors for links that never existed, and if i check the "Linked from" tab for a given error link, it shows another that never existed. They all mention joomla/ which is not the cms used on this domain (it's wordpress fyi). Exampled: http://example.com/joomla/index.php/component/user/register Linked from: http://example.com/joomla/component/user/login?return=L2###### What is going on? UPDATE 1 I tried something: I provided one of the faulty urls to the "Fetch as Google" functionality. Instead of returning a 404, it returns a 301 to another Joomla page. HTTP/1.1 301 Moved Permanently Server: Apache/2.4.3 X-Powered-By: PHP/5.4.4-10 X-Pingback: http://example.com/xmlrpc.php Expires: Wed, 11 Jan 1984 05:00:00 GMT Cache-Control: no-cache, must-revalidate, max-age=0 Pragma: no-cache Set-Cookie: PHPSESSID=1fgr5v2oip39miibuptd51s8h0; path=/ Set-Cookie: woocommerce_items_in_cart=0; expires=Sat, 12-Jan-2013 11:44:01 GMT; path=/ Location: http://example.com/joomla/component/user/register Content-Type: text/html; charset=iso-8859-1 Content-Length: 387 Date: Sat, 12 Jan 2013 12:44:01 GMT Via: 1.1 varnish Connection: keep-alive Accept-Ranges: bytes Age: 0 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head> <title>301 Moved Permanently</title> </head><body> <h1>Moved Permanently</h1> <p>The document has moved <a href="http://example.com/joomla/component/user/register">here</a>.</p> <p>Additionally, a 301 Moved Permanently error was encountered while trying to use an ErrorDocument to handle the request.</p> </body></html>

Read the article

how to get IP adress of google search/crawler bot to add to our white list of ip address

- by Jayapal Chandran

Hi, Google webmaster tools says network unreachable. When i contacted my hosting provider they said that they have installed firewall which could block frequent incoming ip addresses and they dont know the google's ip adress to unblock. so they requested me to find google search/crawler bot's ip adress so that they can add it to their whitelist. How to find the ip address of google search bot or crawler bot? My site stopped appearing in google search. My hits had gone too low. What should i do? any kind of reply would he helpful.

Read the article

Regexp that matches user-agents of end-user browsers but NOT crawlers with >90 % accuracy

- by knorv

I'm trying to construct a regexp that will evaluate to true for User-Agent:s of "browsers navigated by humans", but false for bots. Needless to say the matching will not be exact, but if it gets things right in say 90 % of cases that is more than good enough. My approach so far is to target the User-Agent string of the the five major desktop browsers (MSIE, Firefox, Chrome, Safari, Opera). Specifically I want the regexp NOT to match if the user-agent is a bot (Googlebot, msnbot, etc.). Currently I'm using the following regexp which appears to achieve the desired precision: ^(Mozilla.*(Gecko|KHTML|MSIE|Presto|Trident)|Opera).*$ I've observed small number of false negatives which are mostly mobile browsers. The exceptions all match: (BlackBerry|HTC|LG|MOT|Nokia|NOKIAN|PLAYSTATION|PSP|SAMSUNG|SonyEricsson) My question is: Given the desired accuracy level, how would you improve the regexp? Can you think of any major false positives or false negatives to the given regexp? Please note that the question is specifically about regexp-based User-Agent matching. There are a bunch of other approaches to solving this problem, but those are out of the scope of this question.

Read the article

Does a "nofollow" attribute on a link prevent URL discovery by search engines?

- by Stephen Ostermiller

I know that nofollow prevents link juice from being passed across a link. But if search engine robots discover a link with a nofollow on it, will they add that link to their crawl queue? In other words, if I create a link to a brand new page and put a rel=nofollow attribute on that link, will it prevent search engine bots (particularly Googlebot) from crawling the page. (Assuming that this link remains the only link into that page.) I've read conflicting reports about this over the years and I'm looking for authoritative references about the current state of affairs. Official statements from Google or published results of independent testing would be ideal.

Read the article

How do I remove a LOT of indexed pages from Google?

- by Thierry

A few weeks ago we have figured out that Google has indexed some information we would rather keep in some confidentiality, in the format of individual PDF files. Our assumption was that this was a problem with our robots.txt we had overlooked. Even though we are not sure whether or not this is the case, we are certain that the robots.txt file is in a valid format and is, according to Google's webmaster tools, blocking the files. However, even after this adjustment that has been made weeks ago, Google still has the PDF files indexed, but does tell us further information cannot be provided due to the robots.txt file being present. As you can hopefully understand, this is unwanted behaviour due to the nature of the documents. I am aware that there is a request page being provided by Google for this purpose, but there are a lot of files. Is there an easier way to get Google to remove all of the files from its search engine? If not, is there anything else you could advise us to do besides manually requesting Google to remove every single page? Thanks in advance.

Read the article

Grapeshot crawler ignoring robots.txt

- by QF_Developer

Has anyone come across a crawler called Grapeshot? They are hammering the same page repeatedly on our website. I believe they are looking for ad related keywords, based on previous content ad campaigns. The odd thing is we never ran any such campaigns on the page they are so interested in. We do have only a few pages running AdSense, is this what has attracted Grapeshot? I've added the following declaration to my robots.txt, but they don't seem to be honouring it? User-agent: grapeshot Disallow: / Any ideas on how to block this nuisance crawler? I'm starting to think the best way is by setting up IP rules in IIS?

Read the article

Understanding the maximum hit-rate supported by a web-server

- by SNag

I would like to crawl a publicly available site (and one that's legal to crawl) for a personal project. From a brief trial of the crawler, I gathered that my program hits the server with a new HTTPRequest 8 times in a second. At this rate, as per my estimate, to obtain the full set of data I need about 60 full days of crawling. While the site is legal to crawl, I understand it can still be unethical to crawl at a rate that causes inconvenience to the regular traffic on the site. What I'd like to understand here is -- how high is 8 hits per second to the server I'm crawling? Could I possibly do 4 times that (by running 4 instances of my crawler in parallel) to bring the total effort down to just 15 days instead of 60? How do you find the maximum hit-rate a web-server supports? What would be the theoretical (and ethical) upper-limit for the crawl-rate so as to not adversely affect the server's routine traffic?

Read the article

Page load speeds effect on crawl rate

- by Sam Pegler

We've noticed a big drop in the total pages crawled per day on our site, we have no control over the crawl rate in google webmaster tools so it's possible this has been changed by google. However it's a fairly large site and I wouldn't of thought that the crawl rate would've been decreased. What we have noticed though is a sizeable increase in page load times, in my mind this would be the cause. Can anyone else confirm if the crawl rate is directly correlated to page load time? Seems logical, longer page load time, less pages crawled. Any decent documentation on this would be appreciated, I don't normally have any input on SEO so this is new to me.

Read the article

Webmaster Tools word count

- by Henrik Erlandsson

Is there a way to somehow verify that the googlebot finds the headings and the content, for example by word count? I'm asking this because I tried a program called Screaming Frog, which fails to even fetch the first h1 on a validated page - for about 1/3 of all the pages(!) - and got insecure. Even though the site looks hunky dory in Webmaster Tools, I'd like to know what a googlebot-like content crawler finds on my page and in what order. Any tips on such tools is appreciated. This is not about keyword count.

Read the article

Will Google crawl session based website

- by DonShwep

I have a website, it is split into 3 categories but using PHP its an all-in one kind of style. When a user chooses a category on the home page a session is set, this is then used to set the style and contents of the website. Would Googlebot and other bots be able to still scan my website? If a page is accessed and no session is set then the user is sent back to the home page. I have created special links, that set a session but go straight to the contact page. Even this page doesn't seem to be showing up. Any ideas if a sitemap with specially crafted links (to set the session) will help Google?

Read the article

Fake links cause crawl error in Google Webmaster Tools

- by Itai

Google reported Crawl Errors last week on my largest site though Webmaster Tools. Here is the message: Google detected a significant increase in the number of URLs that return a 404 (Page Not Found) error. Investigating these errors and fixing them where appropriate ensures that Google can successfully crawl your site's pages. The Crawl Errors list is now full of hundreds of fake links like these causing 16,519 errors so far: Note that my site does not even have a search.html and is not related to any of the terms shown in the above image. Inspecting sources for one of those links, I can see this is not simply an isolated source but a concerted effort: Each of the links has a few to a dozen sources all from different, seemingly unrelated sites. It is completely baffling as to why would someone to spending effort doing this. What are they hoping to achieve? Is this an attack? Most importantly: Does this have a negative effect on my side? Could it negatively impact my ranking? If so, what to do about it? The few linking pages I looked at are full of thousands of links to tons of sites and have no contact information and do not seem like the kind of people who would simply stop if asked nicely! According to Google Webmaster Tools, these errors have appeared in a span of 11 days. No crawl errors were being reported previously.

Read the article

Is there a weight for html link?

- by Questions

I would like to know if there is any weight associated with a html link (its backlink) when google does crawling/indexing. Will 1,2 & 3 be ever considered as a backlink by Google? 1. <a href="xyz.com">1</a> // one character link 2. <a href="xyz.com"> </a> //blank space link 3. <a href="xyz.com">!</a> //special character link 4. <a href="xyz.com">keyword</a> //meaningful word link I hope my question is understandable and guess this is right forum. I don't know to put it in other words. Thanks in advance.

Read the article

Does Submit to Index on a page with new content update Content Keywords for the site?

- by Dan Kanze

Using Google Webmaster Tools I'm trying to update the Content Keywords of my site. I'm confused about the relationship between Submit to Index and Content Keywords Does Fetch as Google -- Submit to Index on a previously existing indexed page containing new content expidite updating the Content Keywords crawled by the real Google bot? Does Submit to Index only submit new URL's so that previously indexed URL's still point to the older cached version until Google crawls specifically for new content on its own? Does Submit to Index have anything to do with Content Keywords or crawling new content being a previously indexed page or never been indexed page?

Read the article

How frequently Googlebot fetch sitemaps? Is it depending on page rank?

- by JITHIN JOSE

How much frequently google fetches sitemaps? I am now working with a high traffic website normally have 30 new posts per minute.But currently it provides sitemaps which links to new 100 posts(3 minutes). Is this method is enough ?. Is Bots fetch sitemaps every 3 minutes?. Did need to change sitemaps to list all 5M posts(indexed sitemaps)?. How this change will effect on traffic and page rank. Is google bot remove urls that previously listed on sitemap but not now?

Read the article

Is this Anti-Scraping technique viable with Crawl-Delay?

- by skibulk

I want to prevent web scrapers from abusing 1,000,000 on my website. I'd like to do this by returning a "503 Service Unavailable" error code for users that access an abnormal number of pages per minute. I don't want search engine spiders to ever receive the error. My inclination is to set a robots.txt crawl-delay which will ensure spiders access a number of pages per minute under my 503 threshold. Is this an appropriate solution? Do all major search engines support the directive? Could it negatively affect SEO? Are there any other solutions or recommendations?

Read the article

Can I make query strings produce separate pages?

- by John Smith

I have a profile page with a URL like so: localhost/profile.php/?username=Bob I was wondering, if I had a separate <title> which changed according to the username, would they produce separate pages in the google search results? How do I tell Google to only use the username string or does it search within the title? On a similar note, how would I create a separate page with the username like so: localhost/bob instead of a query string like facebook does. Do that make a new file for each user?

Read the article

'Buy the app' landing page implementations: redirect or javascript popup?

- by benwad

My site (using Django) has an app that I'm trying to push - I currently have a piece of middleware that redirects the user to a page advertising the app if they're accessing the page on the iPhone, then setting a cookie so that the user isn't bugged by the message every time they visit the site. This works fine, however checking the page with the mobile Googlebot checker shows that the Googlebot gets stuck in the redirect (since it doesn't store cookies) and therefore won't index the proper content. So, I'm trying to think of an alternative implementation that won't hurt the site's Google ranking and won't have any other adverse effects. I've considered a couple of options: Redirect (the current solution), but don't redirect if the user agent matches the Googlebot's UA string. This would be ideal, however I'm not sure if Google like their bot being treated differently from other users, and I'm afraid the site's ranking may be somehow penalised if I go ahead with this. Use a Javascript popup instead of a redirect. This would make sure the Googlebot finds the content it needs, however I envision this approach causing compatibility issues with the myriad mobile devices/browsers out there, and may affect the page load time. How valid are these options? And is there a better option for implementing this feature out there? I've tried researching this topic but surprisingly can't find any reputable-looking blog posts that explore this topic.

Read the article

Web Crawler for Learnign Topics on Wikipedia

- by Chris Okyen

When I want to learn a vast topic on wikipedia, I don't know where to start. For instance say I want to learn about Binary Stars, I then have to know other things linked on that pages and linked pages on all the linked pages and so on for the specified number of levels. I want to write a web crawler like HTTracker or something similiar, that will display a heiarchy of the links on a certain page and the links on those linked pages.I wish to use as much prewritten code as possible. Here is an example: Pretending we are bending the rules by grabing links from only the first sentence of each pages The example archives and "processes" two levels deep The page is Ternary operation The First Level In mathematics a ternary operation is an N-ary operation The Second Level Under Mathmatics: Mathematics (from Greek µ???µa máthema, “knowledge, study, learning”) is the abstract study of topics encompassing quantity, structure, space, change and others; it has no generally accepted definition. Under N-ary In logic,mathematics, and computer science, the arity i/'ær?ti/ of a function or operation is the number of arguments or operands that the function takes Under Operation In its simplest meaning in mathematics and logic, an operation is an action or procedure which produces a new value from one or more input values ------------------------------------------------------------------------- I need some way to determine what oder to approach all these wiki pages to learn the concept ( in this case ternary operations )... Following along with this exmpakle, one way to show the path to read would a printout flowout like so: This shows that the first sentence of the Mathematics page doesn't link to the first sentence of pages linked on ternary page two levels deep. (Please tell me how I should explain this ) --- In otherwords, the child node of the top pages first sentence, ternary_operation, does not have any child nodes that reference the children of the top pages other children nodes- N-ary and operation. Thus it is safe to read this first. Since N-ary has a link to operations we shoudl read the operation page second and finally read the N-ary page last. Again, I wish to use as much prewritten code as possible, and was wondering what language to use and what would be the simpliest way to go about doing this if there isn't already somethign out there? Thank You!

Read the article

Logic behind crawling an webpages like that of Screaming Frog? [on hold]

- by sree

I would like to know what is the parameters to be considered while developing a crawler like that of Screaming Frog. Am looking forward for information on do's and dont's of webpage crawling. What are the problems the crawler may infuse on the webpages like loadtime (maybe?) or anything that effects webpage during crawling. What are the rules the crawler needs to follow etc. Basically anything info that makes the crawler look good and accurate. Just point me in a right direction to achieve it.. Hope my requirement is clear this time.. :)

Read the article

Disqus thread migration. Gotchas?

- by sramsay

I've been migrating a site to a new domain. The site itself is pretty straightforward (it uses Jekyll), and everything has gone fine -- except migration of Disqus threads. I've had partial success -- some of the threads have migrated successfully, but not all. I've tried the domain migration wizard (which caught a few), the URL mapper (which caught a few), and the 301 redirect crawler (which caught a few). But the remaining threads just won't move, no matter which method I use. So, I suppose I suppose I'm asking if there are any "gotchas" I should know about with this. When you execute any of these migration tools, it says it will "take awhile." Does that mean hours? Days? I can't tell if it's working, and there's no logging or error reporting that I can see.

Read the article

'Buy the app' landing page implementations

- by benwad

My site (using Django) has an app that I'm trying to push - I currently have a piece of middleware that redirects the user to a page advertising the app if they're accessing the page on the iPhone, then setting a cookie so that the user isn't bugged by the message every time they visit the site. This works fine, however checking the page with the mobile Googlebot checker shows that the Googlebot gets stuck in the redirect (since it doesn't store cookies) and therefore won't index the proper content. So, I'm trying to think of an alternative implementation that won't hurt the site's Google ranking and won't have any other adverse effects. I've considered a couple of options: Redirect (the current solution), but don't redirect if the user agent matches the Googlebot's UA string. This would be ideal, however I'm not sure if Google like their bot being treated differently from other users, and I'm afraid the site's ranking may be somehow penalised if I go ahead with this. Use a Javascript popup instead of a redirect. This would make sure the Googlebot finds the content it needs, however I envision this approach causing compatibility issues with the myriad mobile devices/browsers out there, and may affect the page load time. How valid are these options? And is there a better option for implementing this feature out there? I've tried researching this topic but surprisingly can't find any reputable-looking blog posts that explore this topic. EDIT: I posted this on SF because it seemed unsuitable for SO, but if there's another site that would be better for this issue then I'd be happy to move the question elsewhere.

Read the article

Can AdSense crawler view pages that require cookies?

- by moomoochoo

Details I require users to agree to terms and conditions before they can view several pages on my site. Once they have agreed a cookie is set and they can proceed to the webpage. If a user somehow manages to end up on the webpage without a cookie they will not be able to access the page's content. My question(s) Is the AdSense crawler able to set the cookie and visit these pages? If yes, how will it know to agree to the TOS? Is there some way to allow it access to the pages even if it couldn't use cookies?

Read the article

Why google isn't updating my site title in search results? [closed]

- by SharkTheDark

Possible Duplicate: Google doesn't seem to update the description or title of my homepage I had my domain for few days before I uploaded site to it, and it had one title, and then when I uploaded content it should get new title, but with my misunderstanding of WordPress it had blocked robots.txt and keyword with no-index and no-follow. But I removed that like 7 days ago, and I see in reports that Google bot is crawling over my site, but my site title isn't updating, it still has old domain title when site wasn't there... My robots.txt has now: User-agent: * Allow: / I have clear title tag on every page. How long does it take to update? Do I need to check something else?

Read the article

robots.txt dissalow url containing string with a '/' at the end

- by thanili

i have a website with thousands of dynamic pages. I want to use the robots.txt file in order to dissalow certain url patterns corresponding to pages with duplicate content. For example i have a page for article itemA belonging to category catA/subcatA, with URL: /catA/subcatA/itemA this is the URL that i want to be indexed from google. this article is also visible via tagging in various other places in the web site. The URLs produced via tagging is like: /tagA1/itemA this URL i want NOT to be indexed from google. However i want to have indexed all tag listings: /tagA1 so how can i achieve this? dissalow URLs of including a specific string with a '/' at the end? /tagA1/ itemA - dissalow /tagA1 - allow

Search Results

Search found 149 results on 6 pages for 'crawlers'.

Page 2/6 | < Previous Page | 1 2 3 4 5 6 | Next Page >

- by blunders

- by pixeline

- by Jayapal Chandran

- by knorv

- by Stephen Ostermiller

- by Thierry

- by QF_Developer

- by SNag

- by Sam Pegler

- by Henrik Erlandsson

- by DonShwep

- by Itai

- by Questions

- by Dan Kanze

- by JITHIN JOSE

- by skibulk

- by John Smith

- by benwad

- by Chris Okyen

- by sree

- by sramsay

- by benwad

- by moomoochoo

- by SharkTheDark

- by thanili

< Previous Page | 1 2 3 4 5 6 | Next Page >