Search Results

Search found 499 results on 20 pages for 'robots'.

Page 7/20 | < Previous Page | 3 4 5 6 7 8 9 10 11 12 13 14 | Next Page >

Is browser and bot whitelisting a practical approach?

- by Sn3akyP3t3

With blacklisting it takes plenty of time to monitor events to uncover undesirable behavior and then taking corrective action. I would like to avoid that daily drudgery if possible. I'm thinking whitelisting would be the answer, but I'm unsure if that is a wise approach due to the nature of deny all, allow only a few. Eventually someone out there will be blocked unintentionally is my fear. Even so, whitelisting would also block plenty of undesired traffic to pay per use items such as the Google Custom Search API as well as preserve bandwidth and my sanity. I'm not running Apache, but the idea would be the same I'm assuming. I would essentially be depending on the User Agent identifier to determine who is allowed to visit. I've tried to take into account for accessibility because some web browsers are more geared for those with disabilities although I'm not aware of any specific ones at the moment. The need to not depend on whitelisting alone to keep the site away from harm is fully understood. Other means to protect the site still need to be in place. I intend to have a honeypot, checkbox CAPTCHA, use of OWASP ESAPI, and blacklisting previous known bad IP addresses.

Read the article
SEO consideration for duplicate sites

- by Malk

I am building a brochure-ware website for a company that sells products all across the world. They need the site to ask the user what region they are in before using the site; there are 5 regions. This is because there are different products offered to different regions and each region may or may not want to customize their own content. However, at launch and likely forever, most of the pages will be the exact same minus what is listed in the footer and in the product selection menu. My question is how should I structure the sitemap for this site for best SEO? Should I be concerned with duplicate content penalties and/or cannibalizing the site's presence on the SERP? Some considerations: The client wants to be able to print links directly to regional specific content bypassing any prompt for the user to select a region (to ensure they land on the target page). The client cannot have a 'default' region so the user must have a region specified "Clean" urls are important, but there is wiggle room The client does not want each region to have its own domain There will be a link on the page to allow users to specify a different region The client is not concerned with localization ...at this time Some products are available in multiple regions A quick list of options I am considering: www.site.com/region/page region.site.com/page www.site.com/page?region (no cookie, pages require the parameter. If visited without; the user must select a region) www.site.com/page (using cookie and a splash screen if needed; could pass parameter in to set the region for direct linking) Thanks in advance for your advice.

Read the article
How can I allow search engines to index my invite only website in ruby on rails?

- by tstyle

I have a ruby on rails website that will be in invite-only mode for the next couple of months. Currently I have it set up so visits to any page performs an authentication: before_filter :authenticate, :except => [:beta] //authenticate checks for a logged in user But the webpage has a lot of content that I would like to see indexed by search engines, and I was wondering if there's an easy way to allow crawlers to do their work? I am not very knowledgable on SEO related stuff at all, so sorry if this is an suboptimal way to phrase the question.

Read the article
Why am I getting this message "Some important page is blocked by robots.txt"?

- by Rounak

My site's URL is: www.hackinguniverse.org From some day in Google Webmaster tool a message is showing that says "Some important page is blocked by robots.txt". My robots.txt is: User-agent: Mediapartners-Google Disallow: User-agent: * Disallow: /search Allow: / Sitemap: http://www.hackinguniverse.org/feeds/posts/default?orderby=updated Now for other information - I hosted this website on blogger. I have some other sitemaps too where I had only included "/Search" in Disallow like in this robots.txt file. But those sites are ok. I mean no message is showing on those. So why am I getting that message telling that I have blocked some important page via robot.txt?

Read the article
Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt"

- by Diego

Is there a way to get around the following? httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt Is the only way around this to contact the site-owner (barnesandnoble.com).. i'm building a site that would bring them more sales, not sure why they would deny access at a certain depth. I'm using mechanize and BeautifulSoup on Python2.6. hoping for a work-around

Read the article
Robots.txt with one site but two domains

- by Dofs

I have a website which has two domains added. Both domains point to the root of the website. Is it possible to alter the robots.txt so that one of the domains doesn't get crawled, while the other still does?

Read the article
Blocking yandex.ru bot

- by Ross

I want to block all request from yandex.ru search bot. It is very traffic intensive (2GB/day). I first blocked one C class IP range, but it seems this bot appear from different IP ranges. For example: spider31.yandex.ru - 77.88.26.27 spider79.yandex.ru - 95.108.155.251 etc.. I can put some deny in robots.txt but not sure if it respect this. I am thinking of blocking a list of IP ranges. Can somebody suggest some general solution.

Read the article
The API v2 Robots Have Risen: Check them out!

We launched the new Robots API (v2) last month, and since then, we've seen a a slew of new and interesting robots from Google Wave developers -- proving...

Read the article
How to create real-life robots?

- by Click Upvote

Even before I learnt programming I've been fascinated with how robots could work. Now I know how the underlying programming instructions would be written, but what I don't understand is how those intructions are followed by the robot. For example, if I wrote this code: object=Robot.ScanSurroundings(300,400); if (Objects.isEatable(object)) { Robot.moveLeftArm(300,400); Robot.pickObject(object); } How would this program be followed by the CPU in a way that would make the robot do the physical action of looking to the left, moving his arm, and such? Is it done primarily in binary language/ASM? Lastly, where would i go if I wanted to learn how to create a robot?

Read the article
Googlebot substitutes the links of Rails app with subdomain.

- by Victor

I have this Rails app, with domain name abc.com. I am also having a separate subdomain for Piwik stats, in this subdomain stats.abc.com. Googlebot somehow listed some of the links with my subdomain too. http://abc.com/login http://stats.abc.com/login http://abc.com/signup http://stats.abc.com/signup The ones with stats will reference to the same page in the app, but are treated entirely different website. I have put in robots.txt in stats after this matter, but wondering if there is any appropriate way to block this because I may have new subdomains in future. Here's my content in robots.txt User-agent: * Disallow: / Thanks.

Read the article
SEO Help with Pages Indexed by Google

- by Joe Majewski

I'm working on optimizing my site for Google's search engine, and lately I've noticed that when doing a "site:www.joemajewski.com" query, I get results for pages that shouldn't be indexed at all. Let's take a look at this page, for example: http://www.joemajewski.com/wow/profile.php?id=3 I created my own CMS, and this is simply a breakdown of user id #3's statistics, which I noticed is indexed by Google, although it shouldn't be. I understand that it takes some time before Google's results reflect accurately on my site's content, but this has been improperly indexed for nearly six months now. Here are the precautions that I have taken: My robots.txt file has a line like this: Disallow: /wow/profile.php* When running the url through Google Webmaster Tools, it indicates that I did, indeed, correctly create the disallow command. It did state, however, that a page that doesn't get crawled may still get displayed in the search results if it's being linked to. Thus, I took one more precaution. In the source code I included the following meta data: <meta name="robots" content="noindex,follow" /> I am assuming that follow means to use the page when calculating PageRank, etc, and the noindex tells Google to not display the page in the search results. This page, profile.php, is used to take the $_GET['id'] and find the corresponding registered user. It displays a bit of information about that user, but is in no way relevant enough to warrant a display in the search results, so that is why I am trying to stop Google from indexing it. This is not the only page Google is indexing that I would like removed. I also have a WordPress blog, and there are many category pages, tag pages, and archive pages that I would like removed, and am doing the same procedures to attempt to remove them. Can someone explain how to get pages removed from Google's search results, and possibly some criteria that should help determine what types of pages that I don't want indexed. In terms of my WordPress blog, the only pages that I truly want indexed are my articles. Everything else I have tried to block, with little luck from Google. Can someone also explain why it's bad to have pages indexed that don't provide any new or relevant content, such as pages for WordPress tags or categories, which are clearly never going to receive traffic from Google. Thanks!

Read the article
Can I tell sitecrawlers to visit a certain page?

- by Ace

Hi there! I have this drupal website that revolves around a document database. By design you can only find these documents by searching the site. But I want all the results to be indexed by Googlebot and other crawlers, so I was thinking, what if I make a page that lists all the documents, and then tell the robots to visit the page to index all my documents..? Is this possible, or is there a better way to do it?

Read the article
How can I allow robots access to my sitemap, but prevent casual users from accessing it?

- by morpheous

I am storing my sitemaps in my web folder. I want web crawlers (Googlebot etc) to be able to access the file, but I dont necessarily want all and sundry to have access to it. For example, this site (superuser.com), has a site index - as specified by its robots.txt file (http://superuser.com/robots.txt). However, when you type http://superuser.com/sitemap.xml, you are directed to a 404 page. How can I implement the same thing on my website? I am running a LAMP website, also I am using a sitemap index file (so I have multiple site maps for the site). I would like to use the same mechanism to make them unavailable via a browser, as described above.

Read the article
How much HDD space would I need to cache the web while respecting robot.txts?

- by Koning Baard XIV

I want to experiment with creating a web crawler. I'll start with indexing a few medium sized website like Stack Overflow or Smashing Magazine. If it works, I'd like to start crawling the entire web. I'll respect robot.txts. I save all html, pdf, word, excel, powerpoint, keynote, etc... documents (not exes, dmgs etc, just documents) in a MySQL DB. Next to that, I'll have a second table containing all restults and descriptions, and a table with words and on what page to find those words (aka an index). How much HDD space do you think I need to save all the pages? Is it as low as 1 TB or is it about 10 TB, 20? Maybe 30? 1000? Thanks

Read the article
how to rewrite or redirect old or missing or invalid url to 404 page

- by kath

I recently upgraded a site and almost all URLs have changed. I have redirected all of them (or so I hope) but it may be possible that some of them have slipped by me. Is there a way to somehow catch all invalid URLs and send the user to a certain page I am using PHP Thanks so much! error file is already in .htaccess but seems nothing going to change you can see the error file as below AddHandler application/x-httpd-php5s .php ErrorDocument 404 /content/404.php <IfModule mod_rewrite.c> RewriteEngine on RewriteBase / here are 2 different url one the first one is old one which i edited and the secound one is edited one #1 old one (which is no longer on the server) http://adsbuz.com/vehicles-cars/toyoya/2009-toyota-land-cruiser-gxr-4686.htm #2 the editet one which is on the server http://adsbuz.com/vehicles-cars-for-sale/toyoya/2009-toyota-land-cruiser-gxr-4686.htm i need only the secound one with the vehicles-cars-for-sale because the other directory is already modified and its not on the server but as you can see after the (adsbuz) site name vehicles-cars and vehicles-cars-for-sale both are opening for same location I hope I made myself clear

Read the article
Mysterious visitor to hidden PHP page

- by B. VB.

On my website, I have a "hidden" page that displays a list of the most recent visitors. There exist no links at all to this single PHP page, and, theoretically, only I know of its existence. I check it many times per day to see what new hits I have. However, about once a week, I get a hit from a 208.80.194.* address on this supposedly hidden page (it records hits to itself). The strange thing is this: this mysterious person/bot does not visit any other page on my site. Not the public PHP pages, but only this hidden page that prints the visitors. It's always a single hit, and the HTTP_REFERER is blank. The other data is always some variation of Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; YPC 3.2.0; FunWebProducts; .NET CLR 1.1.4322; SpamBlockerUtility 4.8.4; yplus 5.1.04b) ... but sometimes MSIE 6.0 instead of 7, and various other plug ins. The browser is different every time, as with the lowest-order bits of the address. And it's just that. One hit per week or so, to that one page. Absolutely no other pages are touched by this mysterious vistor. Doing a whois on that IP address showed it's from the new york area, and from the "Websense" ISP. The lowest order 8 bits of their address are always different, but always from 208.80.194.*/8. From most of the computers that I access my website, doing a tracerout to my server does not contain a router anywhere along the way with the IP 208.80.*. So that rules out any kind of HTTP sniffing, I might think. I have NO idea how, why this is happening. Does anyone have any clue, or have seen something as strange as this before? It seems completely benign, but unexplainable and a little creepy. Thanks in advance!

Read the article
How to track humans, but not spiders?

- by johnnietheblack

I am writing a nice little PHP/MySQL/JS ad system for my site - charging by impression. Therefore, my clients would like to be sure that they aren't paying for impressions caused by robots / spiders / etc. Is there a decently effective way to decipher between human and robot, without doing something obstructive like a captcha, or requesting "no index" or something? Basically, is there some sort of "name tag" that legit spiders have on so that my site knows their presence, therefore letting me react accordingly? And, if there isn't...how do ad companies like Google defend against this?

Read the article
What is Robot Army Testing?

- by joe

What is Robot Army Testing? Where is it used? How can I learn it?

Read the article
Allowing Google to bypass CAPTCHA verification - sensible or not?

- by edanfalls

My web site has a database lookup; filling out a CAPTCHA gives you 5 minutes of lookup time. There is also some custom code to detect any automated scripts. I do this as I don't want someone data mining my site. The problem is that Google does not see the lookup results when it crawls my site. If someone is searching for a string that is present in the result of a lookup, I would like them to find this page by Googling it. The obvious solution to me is to use the PHP variable $_SERVER['HTTP_USER_AGENT'] to bypass the CAPTCHA and custom security code for the Google bots. My question is whether this is sensible or not. People could then use Google's cache to view the lookup results without having to fill out the CAPTCHA, but would Google's own script detection methods prevent them from data mining these pages? Or would there be some way for people to make $_SERVER['HTTP_USER_AGENT'] appear as Google to bypass the security measures? Thanks in advance.

Read the article
Web bot in C++/PHP.

- by John

Hello, I've recently started learning PHP, but I have a wide knowledge on C++. I've been wondering how to make a web bot and now, I would greatly like to make one. I won't be using this robot for spamming or anything, just as a test of what PHP/C++ can do online. I was wondering how I could go about doing this and if you have any articles/tutorials that would be helpful. Thanks, John

Read the article
What's a good Web Crawler tool

- by Glenn Slaven

I need to index a whole lot of webpages, what good webcrawler utilities are there? I'm preferably after something that .NET can talk to, but that's not a showstopper. What I really need is something that I can give a site url to & it will follow every link and store the content for indexing.

Read the article
How do I forbid search engine to index subdirectory /CRM with robot.txt?

- by user198729

Or even forbid indexing the whole site?

Read the article
Is it a good idea to add robots "noindex" m tags deep, low content pages, e.g. product model data

- by Cognize

I'm considering adding robots "noindex, follow" tags to the very numerous product data pages that are linked from the product style pages in our online store. For example, each product style has a page with full text content on the product: http://www.shop.example/Product/Category/Style/SOME-STYLE-CODE Then many data pages with technical data for each model code is linked from the product style page. http://www.shop.example/Product/Category/Style/SOME-STYLE-CODE-1 http://www.shop.example/Product/Category/Style/SOME-STYLE-CODE-2 http://www.shop.example/Product/Category/Style/SOME-STYLE-CODE-3 It is these technical data pages that I intend to add the no index code to, as I imagine that this might stop these pages from cannibalizing keyword authority for more important content rich pages on the site. Any advice appreciated.

Read the article
Category to Page and blocking category url via robots.txt -Good for SEO?

- by user2952353

I am using a template which in the pages it allows me to add sidebars / more content under and above the content I want to pull from a category which is very helpful. If I create pages to display my categories content wont the page urls go in conflict with the category urls? By conflict I mean causing a duplicate content error? What I thought might help was to block from robots.txt the category urls of the blog ex. /category/books /category/music Would that be a good practice in order to avoid the duplicate content penalty? Any tips appreciated.

Read the article
Is it a good idea to add robots "noindex" meta tags to deep low content pages, e.g. product model data

- by Cognize

I'm considering adding robots "noindex, follow" tags to the very numerous product data pages that are linked from the product style pages in our online store. For example, each product style has a page with full text content on the product: http://www.shop.example/Product/Category/Style/SOME-STYLE-CODE Then many data pages with technical data for each model code is linked from the product style page. http://www.shop.example/Product/Category/Style/SOME-STYLE-CODE-1 http://www.shop.example/Product/Category/Style/SOME-STYLE-CODE-2 http://www.shop.example/Product/Category/Style/SOME-STYLE-CODE-3 It is these technical data pages that I intend to add the no index code to, as I imagine that this might stop these pages from cannibalizing keyword authority for more important content rich pages on the site. Any advice appreciated.

Read the article

< Previous Page | 3 4 5 6 7 8 9 10 11 12 13 14 | Next Page >