Search Results

Search found 4984 results on 200 pages for 'robots txt'.

Page 2/200 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12  | Next Page >

  • My First robots.txt

    - by Whitechapel
    I'm creating my first robots.txt and wanted to get a second opinion on it. Basically I have an FTP setup on my board for some special users to transfer files between each other, and I do NOT want that included in searches by the bots. I also want to point to my sitemap, which gets auto-generated by a PHP page. So here is what I have; what else should I include, and do I need to fix anything? Also, it links to xmlsitemap.php because that generates the sitemap when called. My goal is to allow any search bot to crawl the forums to grab metadata.
      User-agent: *
      Disallow: /admin/
      Disallow: /ali/
      Disallow: /benny/
      Disallow: /cgi-bin/
      Disallow: /ders/
      Disallow: /empire/
      Disallow: /komodo_117/
      Disallow: /xanxan/
      Disallow: /zeroordie/
      Disallow: /tmp/
      Sitemap: http://www.vivalanation.com/forums/xmlsitemap.php
    Edit: I'm not sure how to handle all of the users' folders under /public_html/, since the robots.txt will be going in /public_html.
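    One possible simplification, assuming the member directories could be moved under a single shared prefix (a hypothetical /users/ path, not part of the current layout), is to disallow that prefix once instead of listing every folder. A minimal sketch under that assumption:
      User-agent: *
      Disallow: /admin/
      Disallow: /cgi-bin/
      Disallow: /tmp/
      # assumes ali/, benny/, ders/, etc. have been moved under /users/
      Disallow: /users/
      Sitemap: http://www.vivalanation.com/forums/xmlsitemap.php
    Otherwise each user folder needs its own Disallow line, since robots.txt rules are simple URL-prefix matches and there is no way to say "every directory except these".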

    Read the article

  • blocking bad bots with robots.txt in 2012 [closed]

    - by Rachel Sparks
    does it still work good? I have this: # Generated using http://solidshellsecurity.com services # Begin block Bad-Robots from robots.txt User-agent: asterias Disallow:/ User-agent: BackDoorBot/1.0 Disallow:/ User-agent: Black Hole Disallow:/ User-agent: BlowFish/1.0 Disallow:/ User-agent: BotALot Disallow:/ User-agent: BuiltBotTough Disallow:/ User-agent: Bullseye/1.0 Disallow:/ User-agent: BunnySlippers Disallow:/ User-agent: Cegbfeieh Disallow:/ User-agent: CheeseBot Disallow:/ User-agent: CherryPicker Disallow:/ User-agent: CherryPickerElite/1.0 Disallow:/ User-agent: CherryPickerSE/1.0 Disallow:/ User-agent: CopyRightCheck Disallow:/ User-agent: cosmos Disallow:/ User-agent: Crescent Disallow:/ User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0 Disallow:/ User-agent: DittoSpyder Disallow:/ User-agent: EmailCollector Disallow:/ User-agent: EmailSiphon Disallow:/ User-agent: EmailWolf Disallow:/ User-agent: EroCrawler Disallow:/ User-agent: ExtractorPro Disallow:/ User-agent: Foobot Disallow:/ User-agent: Harvest/1.5 Disallow:/ User-agent: hloader Disallow:/ User-agent: httplib Disallow:/ User-agent: humanlinks Disallow:/ User-agent: InfoNaviRobot Disallow:/ User-agent: JennyBot Disallow:/ User-agent: Kenjin Spider Disallow:/ User-agent: Keyword Density/0.9 Disallow:/ User-agent: LexiBot Disallow:/ User-agent: libWeb/clsHTTP Disallow:/ User-agent: LinkextractorPro Disallow:/ User-agent: LinkScan/8.1a Unix Disallow:/ User-agent: LinkWalker Disallow:/ User-agent: LNSpiderguy Disallow:/ User-agent: lwp-trivial Disallow:/ User-agent: lwp-trivial/1.34 Disallow:/ User-agent: Mata Hari Disallow:/ User-agent: Microsoft URL Control - 5.01.4511 Disallow:/ User-agent: Microsoft URL Control - 6.00.8169 Disallow:/ User-agent: MIIxpc Disallow:/ User-agent: MIIxpc/4.2 Disallow:/ User-agent: Mister PiX Disallow:/ User-agent: moget Disallow:/ User-agent: moget/2.1 Disallow:/ User-agent: mozilla/4 Disallow:/ User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95) Disallow:/ User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 95) Disallow:/ User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 98) Disallow:/ User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows NT) Disallow:/ User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows XP) Disallow:/ User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 2000) Disallow:/ User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows ME) Disallow:/ User-agent: mozilla/5 Disallow:/ User-agent: NetAnts Disallow:/ User-agent: NICErsPRO Disallow:/ User-agent: Offline Explorer Disallow:/ User-agent: Openfind Disallow:/ User-agent: Openfind data gathere Disallow:/ User-agent: ProPowerBot/2.14 Disallow:/ User-agent: ProWebWalker Disallow:/ User-agent: QueryN Metasearch Disallow:/ User-agent: RepoMonkey Disallow:/ User-agent: RepoMonkey Bait & Tackle/v1.01 Disallow:/ User-agent: RMA Disallow:/ User-agent: SiteSnagger Disallow:/ User-agent: SpankBot Disallow:/ User-agent: spanner Disallow:/ User-agent: suzuran Disallow:/ User-agent: Szukacz/1.4 Disallow:/ User-agent: Teleport Disallow:/ User-agent: TeleportPro Disallow:/ User-agent: Telesoft Disallow:/ User-agent: The Intraformant Disallow:/ User-agent: TheNomad Disallow:/ User-agent: TightTwatBot Disallow:/ User-agent: Titan Disallow:/ User-agent: toCrawl/UrlDispatcher Disallow:/ User-agent: True_Robot Disallow:/ User-agent: True_Robot/1.0 Disallow:/ User-agent: turingos Disallow:/ User-agent: URLy Warning Disallow:/ User-agent: VCI Disallow:/ User-agent: VCI WebViewer VCI WebViewer Win32 Disallow:/ 
User-agent: Web Image Collector Disallow:/ User-agent: WebAuto Disallow:/ User-agent: WebBandit Disallow:/ User-agent: WebBandit/3.50 Disallow:/ User-agent: WebCopier Disallow:/ User-agent: WebEnhancer Disallow:/ User-agent: WebmasterWorldForumBot Disallow:/ User-agent: WebSauger Disallow:/ User-agent: Website Quester Disallow:/ User-agent: Webster Pro Disallow:/ User-agent: WebStripper Disallow:/ User-agent: WebZip Disallow:/ User-agent: WebZip/4.0 Disallow:/ User-agent: Wget Disallow:/ User-agent: Wget/1.5.3 Disallow:/ User-agent: Wget/1.6 Disallow:/ User-agent: WWW-Collector-E Disallow:/ User-agent: Xenu's Disallow:/ User-agent: Xenu's Link Sleuth 1.1c Disallow:/ User-agent: Zeus Disallow:/ User-agent: Zeus 32297 Webster Pro V2.9 Win32 Disallow:/
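    Worth noting: robots.txt is purely advisory, and the kind of scrapers on a list like this routinely ignore it, so a server-side block is generally more dependable. A minimal .htaccess sketch, assuming Apache with mod_rewrite enabled and using a small subset of the user-agents from the list above:
      RewriteEngine On
      # return 403 Forbidden to a few of the scrapers named above (extend the pattern as needed)
      RewriteCond %{HTTP_USER_AGENT} (EmailCollector|EmailSiphon|WebStripper|WebZip|SiteSnagger) [NC]
      RewriteRule .* - [F,L]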

    Read the article

  • Valid robots.txt? [closed]

    - by psot
    I am waiting for Google to crawl my site and display the results in search. Is my robots.txt all right, and will it let Google, Bing etc. crawl my site? Thanks!
      User-agent: *
      Disallow: /cgi-bin/
      Disallow: /wp-admin/
      Disallow: /wp-includes/
      Disallow: /wp-content/
      Disallow: /build/
      Disallow: /css/
      Disallow: /trackback/
      Disallow: /comments
      Disallow: /assets/graphics/
      Disallow: /assets/visual/
      Disallow: /category/*/*
      Disallow: */trackback
      Disallow: */feed
      Disallow: */comments
      Disallow: /*?*
      Disallow: /*?
      User-agent: Slurp
      Disallow: /
      User-agent: Baiduspider
      Disallow: /
      User-agent: ia_archiver
      Disallow: /
      User-agent: duggmirror
      Disallow: /
      User-agent: Yandex
      Disallow: /
      Sitemap: http://example.com/sitemap.xml.gz

    Read the article

  • Googlebot fetches my pages very frequently: rel-nofollow, meta-noindex or robots.txt-disallow?

    - by trante
    Googlebot fetches pages on my site very frequently, and this slows my website down. I don't want Googlebot to crawl so often. I have already decreased the crawl rate in Google Webmaster Tools, but I am considering these three tools:
      1. Adding rel="nofollow" to my inner pages, so Googlebot won't crawl and index them.
      2. Adding a "noindex" robots meta tag, so Google will remove the page from its index and won't fetch it again.
      3. Adding Disallow: /mySomeFolder/ to robots.txt, so Googlebot won't crawl those pages.
    I'm planning to use these methods for my 56,000 pages, except for the most important 6-7 pages. Which method would you prefer, and what would the advantages or disadvantages be? Or won't it change my website speed at all?

    Read the article

  • Grapeshot crawler ignoring robots.txt

    - by QF_Developer
    Has anyone come across a crawler called Grapeshot? They are hammering the same page repeatedly on our website. I believe they are looking for ad-related keywords, based on previous content ad campaigns. The odd thing is we never ran any such campaigns on the page they are so interested in. We do have only a few pages running AdSense; is this what has attracted Grapeshot? I've added the following declaration to my robots.txt, but they don't seem to be honouring it:
      User-agent: grapeshot
      Disallow: /
    Any ideas on how to block this nuisance crawler? I'm starting to think the best way is by setting up IP rules in IIS?

    Read the article

  • Robots meta tag with "noimageindex"

    - by jimy
    I have some doubt regarding the noimageindex value in the robots meta tag. Suppose I add this tag to a page, say http://www.example.com/somepage/someaction.php, and on that page the images are served from another domain, say http://www.exampleimg.com. Will the tag still have any effect? I mean, will the images be ignored by bots? Or is exampleimg.com not affected by that tag, so all the images will be indexed anyway? Note: we want to stop indexing of the images on that particular page.
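    For reference, the directive itself goes in the head of the page being crawled; a minimal sketch (whether an engine then skips images hosted on exampleimg.com is exactly the open question here, so treat the cross-domain behaviour as engine-specific rather than guaranteed):
      <!-- in the <head> of someaction.php -->
      <meta name="robots" content="noimageindex">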

    Read the article

  • robots.txt: disallow URLs containing a string with a '/' at the end

    - by thanili
    I have a website with thousands of dynamic pages. I want to use the robots.txt file to disallow certain URL patterns corresponding to pages with duplicate content. For example, I have a page for article itemA belonging to category catA/subcatA, with the URL /catA/subcatA/itemA; this is the URL that I want indexed by Google. The article is also visible via tagging in various other places on the website. The URLs produced via tagging look like /tagA1/itemA, and these I do NOT want indexed by Google. However, I do want all the tag listings indexed:
      /tagA1/itemA - disallow
      /tagA1 - allow
    So how can I achieve this? Disallow URLs containing a specific string with a '/' at the end?
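    A trailing slash in the rule does exactly this, because robots.txt rules are URL-prefix matches: /tagA1/ matches /tagA1/itemA but not /tagA1 itself. A minimal sketch, assuming the tag URLs all follow the pattern above:
      User-agent: *
      # blocks /tagA1/itemA, /tagA1/itemB, ... but leaves the /tagA1 listing crawlable
      Disallow: /tagA1/
    With many tags this still means one Disallow line per tag, unless all tag pages share a common path prefix that can be disallowed instead.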

    Read the article

  • Google Webmaster Tools robots test not working

    - by tracy_snap
    Within Webmaster Tools I have supplied my test content:
      User-agent: *
      Disallow:/admin/
      Disallow: /tag/
    When I specify the URL to test against, for example http://www.site.com/tag/, it gives me this result: "Allowed: Detected as a directory; specific files may have different restrictions". As far as I know I have set this up correctly; shouldn't Google be saying that the /tag/ directory is "disallowed"?

    Read the article

  • Stop bots from crawling old links with extensions

    - by Jared
    I've recently switched to MVC3, which uses extension-less URLs, but Google and Bing have a wealth of links that they are crawling which no longer exist. So I'm trying to find out if there is a way to format robots.txt (or use some other method) to tell Google/Bing that any link that ends in an extension isn't a valid link... Is this possible? On pages that I'm concerned a user may have saved as a favorite I'm displaying a 404 page that lists the links to take once they are redirected to the new page (I decided not to just redirect them, as I don't want to maintain these forever). For Google's/Bing's sake I do have the canonical tag in the header.
      User-agent: *
      Allow: /
      Disallow: /*.*
    EDIT: I just added the 3rd line (in the text above) and it APPEARS to do what I want: allow a path, but disallow a file. Can anyone confirm this?
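    One caution with that third line: /*.* matches any URL containing a dot, so it would also block things like sitemap.xml, CSS, JavaScript and image files if the engines try to fetch them. Google and Bing both document support for the * and $ wildcards, so a narrower sketch would target only the legacy extensions (the .aspx/.html extensions below are placeholders; substitute whatever the old URLs actually ended in):
      User-agent: *
      # block only the legacy extensions rather than every URL containing a dot
      Disallow: /*.aspx$
      Disallow: /*.html$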

    Read the article

  • Googlebot can't access my site, Webmaster Tools reports "Unreachable robots.txt"

    - by Ahmad Ahmadi
    When I try to fetch my site as Googlebot in Webmaster Tools it returns "Unreachable robots.txt". After investigating, I can see that Googlebot reaches my server:
      tcpdump | grep google
    shows that Google hits my server from IP 66.249.81.172 or 66.249.75.111, but there is nothing in the access log, error log or other Apache logs:
      cat access_log | grep google
      cat error_log | grep 66.249.81.172
    Other bots (Bing, ...) can reach Apache, but Google can't. There is no problem with my robots.txt or its permissions; as you know robots.txt is not required, so I deleted it, but Webmaster Tools again returned "Unreachable robots.txt", not "404 not found"! Information about the server:
      Server OS: CentOS 6
      Web Server: Apache 2.x
      Firewall: iptables is stopped
      SELinux is disabled
    There is nothing else for security on my server. How can I investigate the problem, and is there any other command that can help me find it?
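    One quick check, as a diagnostic sketch: fetch robots.txt yourself while presenting Googlebot's user-agent string, both from the server itself and from an outside host, to see whether something in front of Apache (a hosting-provider filter, a proxy, a rate limiter) is dropping requests that look like Googlebot. The URL below is a placeholder for your own domain:
      # -I requests headers only, -A sets the User-Agent header
      curl -I -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" http://example.com/robots.txt
    If this is refused from outside but works locally, the block is upstream of Apache, which would also explain the empty access logs.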

    Read the article

  • apache robots.txt with SSL

    - by user224013
    I have an .htaccess file with a rewrite rule that redirects every HTTP request to HTTPS. But now I have a problem: my robots.txt is not recognized by some online checkers. If I remove the redirect from the .htaccess file, the robots.txt is recognized correctly. Maybe I should exclude robots.txt from being redirected to an HTTPS connection? This is the part of the .htaccess that redirects to HTTPS:
      RewriteCond %{SERVER_PORT} !^443$
      RewriteRule (.*) https://%{HTTP_HOST}/$1 [L]
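    If the goal is simply to keep robots.txt reachable over plain HTTP, one common approach is an extra RewriteCond that skips the redirect for that one path; a sketch of the same rule with the exclusion added (assuming nothing earlier in the .htaccess rewrites robots.txt first):
      RewriteCond %{SERVER_PORT} !^443$
      # leave robots.txt alone so plain-HTTP checkers can still fetch it
      RewriteCond %{REQUEST_URI} !^/robots\.txt$
      RewriteRule (.*) https://%{HTTP_HOST}/$1 [L]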

    Read the article

  • Hiding elements based on last closed element jquery script

    - by Jared
    Hi my question is, how can I make this jquery script close all previously opened children when entering a new parent? At the moment it traverses thru all the tree structure fine, but switching from one parent to another does not close the previous children, but rather only the each individual parents elements as a user browses. Here is the jquery I'm using: <script type="text/javascript"> $(document).ready($(function(){ $('#nav>li>ul').hide(); $('.children').hide(); $('#nav>li').mousedown(function(){ // check that the menu is not currently animated if ($('#nav ul:animated').size() == 0) { // create a reference to the active element (this) // so we don't have to keep creating a jQuery object $heading = $(this); // create a reference to visible sibling elements // so we don't have to keep creating a jQuery object $expandedSiblings = $heading.siblings().find('ul:visible'); if ($expandedSiblings.size() > 0) { $expandedSiblings.slideUp(0, function(){ $heading.find('ul').slideDown(0); }); } else { $heading.find('ul').slideDown(0); } } }); $('#nav>li>ul>li').mousedown(function(){ // check that the menu is not currently animated if ($('#nav ul:animated').size() == 0) { // create a reference to the active element (this) // so we don't have to keep creating a jQuery object $heading2 = $(this); // create a reference to visible sibling elements // so we don't have to keep creating a jQuery object $expandedSiblings2 = $heading2.siblings().find('.children:visible'); if ($expandedSiblings2.size() > 0) { $expandedSiblings2.slideUp(0, function(){ $heading2.find('.children').slideDown(0); }); } else { $heading2.find('.children').slideDown(0); } } }); })); </script> and here is my html output <ul id="nav"> <li><a href="#">folder 4</a> <ul><li><a href="#">2001</a> <ul><li class="children"><a href="./directory//folder 4/2001/doc1.txt">doc1.txt</a></li> <li class="children"><a href="./directory//folder 4/2001/doc2.txt">doc2.txt</a></li> <li class="children"><a href="./directory//folder 4/2001/doc3.txt">doc3.txt</a></li> </ul> </li> <li><a href="#">2002</a> <ul><li class="children"><a href="./directory//folder 4/2002/doc1.txt">doc1.txt</a></li> <li class="children"><a href="./directory//folder 4/2002/doc2.txt">doc2.txt</a></li> <li class="children"><a href="./directory//folder 4/2002/doc3.txt">doc3.txt</a></li> <li class="children"><a href="./directory//folder 4/2002/doc4.txt">doc4.txt</a></li> </ul> </li> <li><a href="#">2003</a> <ul><li class="children"><a href="./directory//folder 4/2003/Copy of doc1.txt">Copy of doc1.txt</a></li> <li class="children"><a href="./directory//folder 4/2003/doc1.txt">doc1.txt</a></li> <li class="children"><a href="./directory//folder 4/2003/doc2.txt">doc2.txt</a></li> </ul> </li> <li><a href="#">2004</a> <ul><li class="children"><a href="./directory//folder 4/2004/doc1.txt">doc1.txt</a></li> <li class="children"><a href="./directory//folder 4/2004/doc2.txt">doc2.txt</a></li> <li class="children"><a href="./directory//folder 4/2004/doc3.txt">doc3.txt</a></li> <li class="children"><a href="./directory//folder 4/2004/doc4.txt">doc4.txt</a></li> </ul> </li> </ul> </li> <li><a href="#">folder1</a> <ul><li><a href="#">2001</a> <ul><li class="children"><a href="./directory//folder1/2001/doc1.txt">doc1.txt</a></li> <li class="children"><a href="./directory//folder1/2001/doc2.txt">doc2.txt</a></li> <li class="children"><a href="./directory//folder1/2001/doc3.txt">doc3.txt</a></li> </ul> </li> <li><a href="#">2002</a> <ul><li class="children"><a 
href="./directory//folder1/2002/doc1.txt">doc1.txt</a></li> <li class="children"><a href="./directory//folder1/2002/doc2.txt">doc2.txt</a></li> <li class="children"><a href="./directory//folder1/2002/doc3.txt">doc3.txt</a></li> <li class="children"><a href="./directory//folder1/2002/doc4.txt">doc4.txt</a></li> </ul> </li> <li><a href="#">2003</a> <ul><li class="children"><a href="./directory//folder1/2003/Copy of doc1.txt">Copy of doc1.txt</a></li> <li class="children"><a href="./directory//folder1/2003/doc1.txt">doc1.txt</a></li> <li class="children"><a href="./directory//folder1/2003/doc2.txt">doc2.txt</a></li> </ul> </li> <li><a href="#">2004</a> <ul><li class="children"><a href="./directory//folder1/2004/doc1.txt">doc1.txt</a></li> <li class="children"><a href="./directory//folder1/2004/doc2.txt">doc2.txt</a></li> <li class="children"><a href="./directory//folder1/2004/doc3.txt">doc3.txt</a></li> <li class="children"><a href="./directory//folder1/2004/doc4.txt">doc4.txt</a></li> </ul> </li> </ul> </li> <li><a href="#">folder2</a> <ul><li><a href="#">2001</a> <ul><li class="children"><a href="./directory//folder2/2001/doc1.txt">doc1.txt</a></li> <li class="children"><a href="./directory//folder2/2001/doc2.txt">doc2.txt</a></li> <li class="children"><a href="./directory//folder2/2001/doc3.txt">doc3.txt</a></li> </ul> </li> <li><a href="#">2002</a> <ul><li class="children"><a href="./directory//folder2/2002/doc1.txt">doc1.txt</a></li> <li class="children"><a href="./directory//folder2/2002/doc2.txt">doc2.txt</a></li> <li class="children"><a href="./directory//folder2/2002/doc3.txt">doc3.txt</a></li> <li class="children"><a href="./directory//folder2/2002/doc4.txt">doc4.txt</a></li> </ul> </li> <li><a href="#">2003</a> <ul><li class="children"><a href="./directory//folder2/2003/Copy of doc1.txt">Copy of doc1.txt</a></li> <li class="children"><a href="./directory//folder2/2003/doc1.txt">doc1.txt</a></li> <li class="children"><a href="./directory//folder2/2003/doc2.txt">doc2.txt</a></li> </ul> </li> <li><a href="#">2004</a> <ul><li class="children"><a href="./directory//folder2/2004/doc1.txt">doc1.txt</a></li> <li class="children"><a href="./directory//folder2/2004/doc2.txt">doc2.txt</a></li> <li class="children"><a href="./directory//folder2/2004/doc3.txt">doc3.txt</a></li> <li class="children"><a href="./directory//folder2/2004/doc4.txt">doc4.txt</a></li> </ul> </li> </ul> </li> <li><a href="#">folder3</a> <ul><li><a href="#">2001</a> <ul><li class="children"><a href="./directory//folder3/2001/doc1.txt">doc1.txt</a></li> <li class="children"><a href="./directory//folder3/2001/doc2.txt">doc2.txt</a></li> <li class="children"><a href="./directory//folder3/2001/doc3.txt">doc3.txt</a></li> </ul> </li> <li><a href="#">2002</a> <ul><li class="children"><a href="./directory//folder3/2002/doc1.txt">doc1.txt</a></li> <li class="children"><a href="./directory//folder3/2002/doc2.txt">doc2.txt</a></li> <li class="children"><a href="./directory//folder3/2002/doc3.txt">doc3.txt</a></li> <li class="children"><a href="./directory//folder3/2002/doc4.txt">doc4.txt</a></li> </ul> </li> <li><a href="#">2003</a> <ul><li class="children"><a href="./directory//folder3/2003/Copy of doc1.txt">Copy of doc1.txt</a></li> <li class="children"><a href="./directory//folder3/2003/doc1.txt">doc1.txt</a></li> <li class="children"><a href="./directory//folder3/2003/doc2.txt">doc2.txt</a></li> </ul> </li> <li><a href="#">2004</a> <ul><li class="children"><a href="./directory//folder3/2004/doc1.txt">doc1.txt</a></li> <li 
class="children"><a href="./directory//folder3/2004/doc2.txt">doc2.txt</a></li> <li class="children"><a href="./directory//folder3/2004/doc3.txt">doc3.txt</a></li> <li class="children"><a href="./directory//folder3/2004/doc4.txt">doc4.txt</a></li> </ul> </li> </ul> </li> </ul> I assume my problem is, jquery isn't closing the children between each new parent so I need to make a call, but I'm a bit lost on how to do that. I know the code is pretty messy, this project was done in a huge rush and a very tight timeframe. Appreciate your answers and any other constructive comments, cheers :)

    Read the article

  • How to sort the file names in bash in this circumstance?

    - by Nicolas
    I have run a program to generate some results with different parameters (i.e. R, C and RP). These results are saved in files named results.txt, which I now need to parse for analysis. In params_R_7_C_16_RP_0, the 7 is the value of parameter R, the 16 is the value of parameter C and the 0 is the value of parameter RP. I want to collect the results.txt files in the current directory and sort the paths by the parameter values of R, C and RP. I first use the following command to get the results.txt files that I want to parse:
      find ./ -name "results.txt"
    and the output is:
      ./params_R_11_C_9_RP_0/results.txt
      ./params_R_7_C_9_RP_0/results.txt
      ./params_R_7_C_4_RP_0/results.txt
      ./params_R_11_C_16_RP_0/results.txt
      ./params_R_9_C_4_RP_0/results.txt
      ./params_R_5_C_9_RP_0/results.txt
      ./params_R_9_C_25_RP_0/results.txt
      ./params_R_7_C_16_RP_0/results.txt
      ./params_R_5_C_25_RP_0/results.txt
      ./params_R_5_C_16_RP_0/results.txt
      ./params_R_11_C_4_RP_0/results.txt
      ./params_R_9_C_16_RP_0/results.txt
      ./params_R_7_C_25_RP_0/results.txt
      ./params_R_15_C_4_RP_0/results.txt
      ./params_R_5_C_4_RP_0/results.txt
      ./params_R_9_C_9_RP_0/results.txt
    I then change the command to:
      find ./ -name "results.txt" | sort
    and the output is:
      ./params_R_11_C_16_RP_0/results.txt
      ./params_R_11_C_25_RP_0/results.txt
      ./params_R_11_C_4_RP_0/results.txt
      ./params_R_11_C_9_RP_0/results.txt
      ./params_R_5_C_16_RP_0/results.txt
      ./params_R_5_C_25_RP_0/results.txt
      ./params_R_5_C_4_RP_0/results.txt
      ./params_R_5_C_9_RP_0/results.txt
      ./params_R_7_C_16_RP_0/results.txt
      ./params_R_7_C_25_RP_0/results.txt
      ./params_R_7_C_4_RP_0/results.txt
      ./params_R_7_C_9_RP_0/results.txt
      ./params_R_9_C_16_RP_0/results.txt
      ./params_R_9_C_25_RP_0/results.txt
      ./params_R_9_C_4_RP_0/results.txt
      ./params_R_9_C_9_RP_0/results.txt
    But I want the output to be:
      ./params_R_5_C_4_RP_0/results.txt
      ./params_R_5_C_9_RP_0/results.txt
      ./params_R_5_C_16_RP_0/results.txt
      ./params_R_5_C_25_RP_0/results.txt
      ./params_R_7_C_4_RP_0/results.txt
      ./params_R_7_C_9_RP_0/results.txt
      ./params_R_7_C_16_RP_0/results.txt
      ./params_R_7_C_25_RP_0/results.txt
      ./params_R_9_C_4_RP_0/results.txt
      ./params_R_9_C_9_RP_0/results.txt
      ./params_R_9_C_16_RP_0/results.txt
      ./params_R_9_C_25_RP_0/results.txt
      ...
    I should have used zero-padded names like params_R_005_C_004_RP_0 when generating the results, but it would take a long time to rerun the program. So I wonder if there is any way to achieve this with a bash command.
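    Since the directory names use "_" as a consistent separator, one option is to sort numerically on the R and C fields; a sketch assuming GNU sort and that every path follows the params_R_<n>_C_<n>_RP_<n> pattern shown above:
      # splitting on "_" gives fields: 1=./params 2=R 3=<R value> 4=C 5=<C value> 6=RP 7=<RP value>/results.txt
      find ./ -name "results.txt" | sort -t_ -k3,3n -k5,5n -k7,7n
    On GNU coreutils, find ./ -name "results.txt" | sort -V (version sort) should give the same natural numeric ordering without naming the fields.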

    Read the article

  • Is there a way to disallow crawling of only HTTPS in robots.txt?

    - by David Wilkins
    I just realized that Bingbot is crawling my company's website's pages over HTTPS. Bing already crawls the site over HTTP, so this seems frivolous. Is there a way to specify Disallow: / for HTTPS only? According to Wikipedia, each protocol has its own robots.txt, and according to Google's robots.txt specification, robots.txt applies to both HTTP and HTTPS. I don't want to Disallow: / for Bing entirely, just over HTTPS.
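    Since each protocol and host combination reads its own robots.txt, one approach is to serve a different file when the request arrives over HTTPS. A sketch for Apache with mod_rewrite, assuming both protocols share the same document root; robots_https.txt is a hypothetical second file that would hold the stricter rules (for example User-agent: bingbot / Disallow: /):
      RewriteEngine On
      # hand out the stricter robots file on HTTPS requests only
      RewriteCond %{HTTPS} on
      RewriteRule ^robots\.txt$ /robots_https.txt [L]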

    Read the article

  • Strange robots.txt - how and why did it get there?

    - by Mick
    I recently created a very simple, pure HTML website which I have hosted with "hostmonster". Hostmonster had very good reviews on some comparison website, and in general so far they appear to be perfectly good in every way... At least I thought so until just now... I have been making lots of edits to my site on an almost daily basis. My site now appears on the first page (7th on the list) for my most important keyphrase when doing a Google search. But I did notice a problem with the snippet chosen by Google. I asked a question on this site about snippets and got some great answers. I then made some modifications to my metadata, and within 48 hours the Google snippet for my search was perfect. The odd thing, though, was that the "cached" version Google had still appeared to be very old, like three weeks previous. This seemed very odd: how could the Google robots have read my new metadata without updating the cache? This puzzled me greatly. Just now it occurred to me that maybe I had some goofy setting in my robots.txt file. I didn't actually remember even making one, but I thought I'd have a look just in case. Much to my horror, I saw that there was a robots.txt and it contained the disturbing text below:
      sitemap: http://cdn.attracta.com/sitemap/728687.xml.gz
    Intuitively this looks like some kind of junk spam trick, and I had indeed been getting some spam from "attracta". So my questions are:
      1. Should I simply delete this robots.txt?
      2. Was the file there all along, placed there because of some commercial tie-in between Attracta and Hostmonster?
      3. Does the Attracta robots file explain the lack of re-caching?

    Read the article

  • Disallow robots.txt from being accessed in a browser but still accessible by spiders?

    - by Michael Irigoyen
    We make use of the robots.txt file to prevent Google (and other search spiders) from crawling certain pages/directories in our domain. Some of these directories/files are secret, meaning they aren't linked (except perhaps on other pages encompassed by the robots.txt file). Some of these directories/files aren't secret; we just don't want them indexed. If somebody browses directly to www.mydomain.com/robots.txt, they can see the contents of the robots.txt file. From a security standpoint, this is not something we want publicly available to anybody. Any directories that contain secure information are set behind authentication, but we still don't want them to be discoverable unless the user specifically knows about them. Is there a way to provide a robots.txt file but have its presence masked from John Doe accessing it from his browser? Perhaps by using PHP to generate the document based on certain criteria? Perhaps something I'm not thinking of? We'd prefer a way to do it centrally (meaning a <meta> tag solution is less than ideal).
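    One thing to keep in mind: crawlers don't authenticate, so any filter has to key on something spoofable such as the User-Agent header, and anyone who sends a matching User-Agent still gets the file. With that caveat, a sketch of the idea in Apache mod_rewrite terms (the bot names are examples only):
      RewriteEngine On
      RewriteCond %{REQUEST_URI} ^/robots\.txt$
      # browsers and unknown agents get 403 Forbidden; listed crawlers get the real file
      RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot|Slurp|DuckDuckBot) [NC]
      RewriteRule ^ - [F,L]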

    Read the article

  • Disadvantages of a fake phpMyAdmin honeypot that causes ip blacklisting and robots.txt disallow/exclusion of the honeypot?

    - by Tchalvak
    I'm trying to figure out whether I should set up a honeypot system with a fake phpMyAdmin (the site gets hits all the time from people spidering for insecurities in that app). My thought was to create a honeypot PHP script that would mimic a phpMyAdmin login, and then blacklist IPs that hit that URL (and aren't already whitelisted). I would then add the appropriate URLs to robots.txt so that spiders that actually respect my robots.txt wouldn't be caught by the blacklist. Are there disadvantages to this approach? Do legitimate robots sometimes fail to respect robots.txt in certain circumstances? Are there any problems with this that I should consider in advance?
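    The robots.txt side of that plan is just a Disallow line for the honeypot path; a minimal sketch (the path is hypothetical, and note that the line itself advertises the honeypot URL to anyone who reads the file):
      User-agent: *
      # crawlers that honour robots.txt will skip the trap
      Disallow: /phpmyadmin/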

    Read the article

  • How to save data in a txt file in MATLAB

    - by Jessy
    I have 3 txt files: s1.txt, s2.txt, s3.txt. Each has the same format and number of data points. I want to combine only the second column of each of the 3 files into one file. Before I combine the data, I sort it according to the 1st column.
    UnSorted file:
      s1.txt    s2.txt    s3.txt
      1 23      2 33      3 22
      4 32      4 32      2 11
      5 22      1 10      5 28
      2 55      8 11      7 11
    Sorted file:
      s1.txt    s2.txt    s3.txt
      1 23      1 10      2 11
      2 55      2 33      3 22
      4 32      4 32      5 28
      5 22      8 11      7 11
    Here is the code I have so far:
      BaseFile ='s'
      n=3
      fid=fopen('RT.txt','w');
      for i=1:n
          %Open each file consecutively
          d(i)=fopen([BaseFile num2str(i) '.txt']);
          %read data from file
          A=textscan(d(i),'%f%f')
          a=A{1}
          b=A{2}
          ab=[a,b];
          %sort the data according to the 1st column
          B=sortrows(ab,1);
          %delete the 1st column after being sorted
          B(:,1)=[]
          %write to a new file
          fprintf(fid,'%d\n',B');
          %close (d(i));
      end
      fclose(fid);
    How can I get the output in the new txt file in this format?
      23 10 11
      55 33 22
      32 32 28
      22 11 11
    instead of this format?
      23 55 32 22 10 33 32 11 11 22 28 11

    Read the article

  • What dangers await if I block non-standard, non-major-USA search engine bots from my USA-only website?

    - by Ryan
    I noticed tons of bandwidth being used by non-USA search engine bots, so I began blocking them in an effort to save bandwidth and cpu cycles for actual users and the search engines they come from (Google, Bing, Yahoo, Ask, etc.). Other than potentially losing some international traffic (which isn't really important to us since all of our content is very USA-centric), what additional dangers should I be concerned about? I'm using a modified version of Jeff Starr's User Agent Blocklist

    Read the article

  • Android - store .txt filenames in array

    - by JustMe
    Hello! I would like to store only the .txt filenames from a specific dir (let's say /sdcard/docs/) into a String array. Example: in /sdcard/docs/ there are 5 files: 'foo.txt', 'bar.jpg', 'foofoo.txt', 'loool.png' and 'foobar.txt'. I would like to get an array with the contents "foo.txt", "foofoo.txt", "foobar.txt". How could I do it? Thanks =)

    Read the article

  • No description for any page on the website is available in Google despite robots.txt allowing crawling

    - by Abhijit
    I seem to have the weirdest issue with search engine optimization. I have asked the IT folks at my university, asked people on Joomla forums, and have been trying to sort this issue out using Google Webmaster Tools for more than 2 months, to little avail. I want to know if I have some blatantly wrong configuration somewhere that is causing search engines to be unable to index this site. I noticed a similar issue with another website I searched for online (ECEGSA - The University of British Columbia at gsa.ece.ubc.ca), making me believe this might be a concern that other people are looking for an answer to. Here are the details: the website in question is http://gsa.ece.umd.edu/. It runs Joomla 2.5.x (latest). The site has been up since around mid-December 2013, and I noticed right from the get-go that it was not being indexed correctly on Google. Specifically, I see the following message when I search for the website on Google: "A description for this result is not available because of this site's robots.txt – learn more." The thing is, from December until around March I used the default Joomla robots.txt file, which is:
      User-agent: *
      Disallow: /administrator/
      Disallow: /cache/
      Disallow: /cli/
      Disallow: /components/
      Disallow: /images/
      Disallow: /includes/
      Disallow: /installation/
      Disallow: /language/
      Disallow: /libraries/
      Disallow: /logs/
      Disallow: /media/
      Disallow: /modules/
      Disallow: /plugins/
      Disallow: /templates/
      Disallow: /tmp/
    Nothing there should stop Google from searching my website. And even more confusingly, in Google Webmaster Tools, under the "Blocked URLs" tab, when I try many of the links on the site, they all show up as "Allowed". I then tried adding a sitemap and putting it in the robots.txt file. That did not help: same exact search result, same behavior in the "Blocked URLs" tab in Webmaster Tools. Additionally, the "Sitemaps" tab shows an error for several links saying "URL is robotted out". I tried those exact links in "Blocked URLs" and they are allowed! I then tried deleting the robots.txt file. No use; same exact problem. Here is an example screenshot from Google's Webmaster Tools:
    At this point I cannot give a rational explanation for why this is happening, and neither can anyone in the IT department here. No one on the Joomla forums seems to understand what is going on. Based on what I explained, does it seem that I have somehow set something in robots.txt or .htaccess or somewhere else incorrectly?

    Read the article

  • Why do Google search results include pages disallowed in robots.txt?

    - by Ilmari Karonen
    I have some pages on my site that I want to keep search engines away from, so I disallowed them in my robots.txt file like this:
      User-Agent: *
      Disallow: /email
    Yet I recently noticed that Google still sometimes returns links to those pages in their search results. Why does this happen, and how can I stop it? Background: several years ago, I made a simple web site for a club a relative of mine was involved in. They wanted to have e-mail links on their pages, so, to try and keep those e-mail addresses from ending up on too many spam lists, instead of using direct mailto: links I made those links point to a simple redirector / address harvester trap script running on my own site. This script would return either a 301 redirect to the actual mailto: URL, or, if it detected a suspicious access pattern, a page containing lots of random fake e-mail addresses and links to more such pages. To keep legitimate search bots away from the trap, I set up the robots.txt rule shown above, disallowing the entire space of both legit redirector links and trap pages. Just recently, however, one of the people in the club searched Google for their own name and was quite surprised when one of the results on the first page was a link to the redirector script, with a title consisting of their e-mail address followed by my name. Of course, they immediately e-mailed me and wanted to know how to get their address out of Google's index. I was quite surprised too, since I had no idea that Google would index such URLs at all, seemingly in violation of my robots.txt rule. I did manage to submit a removal request to Google, and it seems to have worked, but I'd like to know why and how Google is circumventing my robots.txt like that, and how to make sure that none of the disallowed pages will show up in their search results. P.S. I actually found a possible explanation and solution, which I'll post below, while preparing this question, but I thought I'd ask it anyway in case someone else might have the same problem. Please do feel free to post your own answers. I'd also be interested in knowing if other search engines do this too, and whether the same solutions work for them also.

    Read the article

< Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12  | Next Page >