Search Results

Search found 211 results on 9 pages for 'spider paddy'.

Page 6/9 | < Previous Page | 2 3 4 5 6 7 8 9 | Next Page >

Scrapy issue with iTunes' AppStore

- by Eric

I am using Scrapy to fetch some data from iTunes' AppStore database. I start with this list of apps: http://itunes.apple.com/us/genre/mobile-software-applications/id36?mt=8 In the following code I have used the simplest regex which targets all apps in the US store. from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.contrib.spiders import CrawlSpider, Rule class AppStoreSpider(CrawlSpider): domain_name = 'itunes.apple.com' start_urls = ['http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8'] rules = ( Rule(SgmlLinkExtractor(allow='itunes\.apple\.com/us/app'), 'parse_app', follow=True, ), ) def parse_app(self, response): .... SPIDER = AppStoreSpider() When I run it I receive the following: [itunes.apple.com] DEBUG: Crawled (200) <GET http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8> (referer: None) [itunes.apple.com] DEBUG: Filtered offsite request to 'itunes.apple.com': <GET http://itunes.apple.com/us/app/bloomberg/id281941097?mt=8> As you can see, when it starts crawling the first page it says: "Filtered offsite request to 'itunes.apple.com'". and then the spider stops.. it also returns this message: [ScrapyHTTPPageGetter,client] /usr/lib/python2.5/cookielib.py:1577: exceptions.UserWarning: cookielib bug! Traceback (most recent call last): File "/usr/lib/python2.5/cookielib.py", line 1575, in make_cookies parse_ns_headers(ns_hdrs), request) File "/usr/lib/python2.5/cookielib.py", line 1532, in _cookies_from_attrs_set cookie = self._cookie_from_cookie_tuple(tup, request) File "/usr/lib/python2.5/cookielib.py", line 1451, in _cookie_from_cookie_tuple if version is not None: version = int(version) ValueError: invalid literal for int() with base 10: '"1"' I have used the same script for other website and I didn't have this problem. Any suggestion?

Read the article
Do I need to use http redirect code 302 or 307?

- by Iain Fraser

I am working on a CMS that uses a search facility to output a list of content items. You can use this facility as a search engine, but in this instance I am using it to output the current month's Media Releases from an archive of all Media Releases. The default parameters for these "Data Lists" as they are called, don't allow you to specify "current month" or "current year" for publication date - only "last x days" or "from dateA to dateB". The search facility will accept querystring parameters though, so I intend to code around it like this: Page loads How many days into the current month are we? Do we have a query string that asks for a list including this many days? If no, redirect the client back to this page with the appropriate query-string included. If yes, allow the CMS to process the query Now here's the rub. Suppose the spider from your favourite search engine comes along and tries to index your main Media Releases page. If you were to use a 301 redirect to the default query page, the spider would assume the main page was defunct and choose to add the query page to its index instead of the main page. Now I see that 302 and 307 indicate that a page has been moved temporarily; if I do this, are spiders likely to pop the main page into their index like I want them to? Thanks very much in advance for your help and advice. Kind regards Iain

Read the article
dealing cards in Clojure

- by Ralph

I am trying to write a Spider Solitaire player as an exercise in learning Clojure. I am trying to figure out how to deal the cards. I have created (with the help of stackoverflow), a shuffled sequence of 104 cards from two standard decks. Each card is represented as a (defstruct card :rank :suit :face-up) The tableau for Spider will be represented as follows: (defstruct tableau :stacks :complete) where :stacks is a vector of card vectors, 4 of which contain 5 cards face down and 1 card face up, and 6 of which contain 4 cards face down and 1 card face up, for a total of 54 cards, and :complete is an (initially) empty vector of completed sets of ace-king (represented as, for example, king-hearts, for printing purposes). The remainder of the undealt deck should be saved in a ref (def deck (ref seq)) During the game, a tableau may contain, for example: (struct-map tableau :stacks [[AH 2C KS ...] [6D QH JS ...] ... ] :complete [KC KS]) where "AH" is a card containing {:rank :ace :suit :hearts :face-up false}, etc. How can I write a function to deal the stacks and then save the remainder in the ref?

Read the article
Grep... What patterns to extract href attributes, etc. with PHP's preg_grep?

- by inktri

Hi, I'm having trouble with grep.. Which four patterns should I use with PHP's preg_grep to extract all instances the "____" stuff in the strings below? 1. <h2><a ....>_____</a></h2> 2. <cite><a href="_____" .... >...</a></cite> 3. <cite><a .... >________</a></cite> 4. <span>_________</span> The dots denote some arbitrary characters while the underscores denote what I want. An example string is: </style></head> <body><div id="adBlock"><h2><a href="https://www.google.com/adsense/support/bin/request.py?contact=afs_violation&hl=en" target="_blank">Ads by Google</a></h2> <div class="ad"><div><a href="http://www.google.com/aclk?sa=L&ai=C4vfT4Sa3S97SLYO8NN6F-ckB5oq5sAGg6PKlDaT-kwUQASCF4p8UKARQtobS9AVgyZbRhsijoBnIAQGqBBxP0OSEnIsuRIv3ZERDm8GiSKZSnjrVf1kVq-_Y&num=1&sig=AGiWqtwG1qHnwpZ_5BNrjrzzXO5Or6EDMg&q=http://www.crackle.com/c/Spider-Man_The_New_Animated_Series/%3Futm_source%3Dgoogle%26utm_medium%3Dcpc%26utm_campaign%3DGST_10016_CRKL_US_PRD_S_TeleV_SPID_Tele_Spider-Man%26utm_term%3Dspiderman%26utm_content%3Ds264Yjg9f_3472685742_487lrz1638" class="titleLink" target="_parent">Spider-<b>Man</b> Animated Serie</a></div> <span>See Your Favorite Spiderman <br> Episodes for Free. Only on Crackle.</span> <cite><a href="http://www.google.com/aclk?sa=L&ai=C4vfT4Sa3S97SLYO8NN6F-ckB5oq5sAGg6PKlDaT-kwUQASCF4p8UKARQtobS9AVgyZbRhsijoBnIAQGqBBxP0OSEnIsuRIv3ZERDm8GiSKZSnjrVf1kVq-_Y&num=1&sig=AGiWqtwG1qHnwpZ_5BNrjrzzXO5Or6EDMg&q=http://www.crackle.com/c/Spider-Man_The_New_Animated_Series/%3Futm_source%3Dgoogle%26utm_medium%3Dcpc%26utm_campaign%3DGST_10016_CRKL_US_PRD_S_TeleV_SPID_Tele_Spider-Man%26utm_term%3Dspiderman%26utm_content%3Ds264Yjg9f_3472685742_487lrz1638" class="domainLink" target="_parent">www.Crackle.com/Spiderman</a></cite></div> <div class="ad"><div><a href="http://www.google.com/aclk?sa=l&ai=CnQFi4Sa3S97SLYO8NN6F-ckB3M7nQtyU2PQEq6bCBRACIIXinxQoBFCm15KB-f____8BYMmW0YbIo6AZoAHiq_X-A8gBAaoEIU_Q9JKLiy1MiwdnHpZoBnmpR1J8pP2jpTwMx2uj2nN4WA&num=2&sig=AGiWqtwDrI5pWBCncdDc80FKt32AJMAQ6A&q=http://www.costumeexpress.com/browse/TV-Movies/_/N-1z141uu/Ntt-batman/results1.aspx%3FREF%3DKNC-CEgoogle" class="titleLink" target="_parent">Kids <b>Batman</b> Costumes</a></div> <span>Great Selection of <b>Batman</b> & Batgirl <br> Costumes For Kids. Ships Same Day!</span> <cite><a href="http://www.google.com/aclk?sa=l&ai=CnQFi4Sa3S97SLYO8NN6F-ckB3M7nQtyU2PQEq6bCBRACIIXinxQoBFCm15KB-f____8BYMmW0YbIo6AZoAHiq_X-A8gBAaoEIU_Q9JKLiy1MiwdnHpZoBnmpR1J8pP2jpTwMx2uj2nN4WA&num=2&sig=AGiWqtwDrI5pWBCncdDc80FKt32AJMAQ6A&q=http://www.costumeexpress.com/browse/TV-Movies/_/N-1z141uu/Ntt-batman/results1.aspx%3FREF%3DKNC-CEgoogle" class="domainLink" target="_parent">www.CostumeExpress.com</a></cite></div> <div class="ad"><div><a href="http://www.google.com/aclk?sa=l&ai=CAMYT4Sa3S97SLYO8NN6F-ckB3ZnWmgGdoNLrDaumwgUQAyCF4p8UKARQrqSVxwdgyZbRhsijoBmgAZH77uwDyAEBqgQYT9DU7oqLLEyLB2dHlxZFnQzyeg-yHt88&num=3&sig=AGiWqtzqAphZ9DLDiEFBJlb0Ou_1HyEyyA&q=http://www.OfficialBatmanCostumes.com" class="titleLink" target="_parent"><b>Batman</b> Costume</a></div> <span>Official <b>Batman</b> Costumes. <br> Huge Selection & Same Day Shipping!</span> <cite><a href="http://www.google.com/aclk?sa=l&ai=CAMYT4Sa3S97SLYO8NN6F-ckB3ZnWmgGdoNLrDaumwgUQAyCF4p8UKARQrqSVxwdgyZbRhsijoBmgAZH77uwDyAEBqgQYT9DU7oqLLEyLB2dHlxZFnQzyeg-yHt88&num=3&sig=AGiWqtzqAphZ9DLDiEFBJlb0Ou_1HyEyyA&q=http://www.OfficialBatmanCostumes.com" class="domainLink" target="_parent">www.OfficialBatmanCostumes.com</a></cite></div> <div class="ad"><div><a href="http://www.google.com/aclk?sa=l&ai=C767t4Sa3S97SLYO8NN6F-ckBkZfSfoOppaMHq6bCBRAEIIXinxQoBFDX2bw6YMmW0YbIo6AZoAHpprP8A8gBAaoEG0_QhJSMiytMiwdnHpZoF3g0Uj8_Vl2r4TpI_g&num=4&sig=AGiWqtyGO2DnFq_jMhP6ufj8pufT9sWQWA&q=http://www.discountsuperherocostumes.com/batman-costumes.html" class="titleLink" target="_parent">Discount <b>Batman</b> Costumes</a></div> <span>Discount adult and kids <b>batman</b> <br> superhero costumes.</span> <cite><a href="http://www.google.com/aclk?sa=l&ai=C767t4Sa3S97SLYO8NN6F-ckBkZfSfoOppaMHq6bCBRAEIIXinxQoBFDX2bw6YMmW0YbIo6AZoAHpprP8A8gBAaoEG0_QhJSMiytMiwdnHpZoF3g0Uj8_Vl2r4TpI_g&num=4&sig=AGiWqtyGO2DnFq_jMhP6ufj8pufT9sWQWA&q=http://www.discountsuperherocostumes.com/batman-costumes.html" class="domainLink" target="_parent">www.discountsuperherocostumes.com</a></cite></div></div></body> <script type="text/javascript"> var relay = ""; </script> <script type="text/javascript" src="/uds/?file=ads&v=1&packages=searchiframe&nodependencyload=true"></script></html> Thanks!

Read the article
10 Easy DIY Father’s Day Gift Ideas

- by Jason Fitzpatrick

If you’re looking for a DIY gift for this Father’s Day that really shows off your maker ethic, this roundup of 10 DIY gifts is sure to have something to offer–fire pistons anyone? Courtesy of Make magazine, we find this 10 item roundup for great DIY projects you could hammer out between now and Father’s Day. The roundup includes everything from the mini-toolbox (really, more of a parts box) see in the photo here to more dynamic gifts like a homemade fire piston and a spider rifle. Hit up the link below to check out all the neat projects which, intended as a gift or not, will prompt you to head out to the workshop. Top 10: Easy DIY Gifts My Dad Would Dig [Make] HTG Explains: What Is RSS and How Can I Benefit From Using It? HTG Explains: Why You Only Have to Wipe a Disk Once to Erase It HTG Explains: Learn How Websites Are Tracking You Online

Read the article
Why Google skips page title

- by Bob

I have no idea why this is happening. An example http://www.londonofficespace.com/ofdj17062004934429t.htm Title tag is: Unfurnished Office Space Wimbledon – Serviced Office on Lombard Road SW19 But is indexed as: Lombard Road – SW19 - London Office Space If you look in the source code and search for this portion ‘Lombard Road – SW19’ You then find that it's next to an office image alt=’Lombard Road – SW19’. The only thing I could think of is that the spider somehow ‘skips’ our title tag and grabs this bit, and then inserts the name of the site (but WHY?) Is there anything I can do with this? or is this a Google behaviour?

Read the article
scrapy cannot find div on this website [on hold]

- by Jaspal Singh Rathour

I am very new at this and have been trying to get my head around my first selector can somebody help? i am trying to extract data from page http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false all the info under div class = listing clearfix shelfListing but i cant seem to figure out how to format response.xpath(). I have managed to launch scrapy console but no matter what I type in response.xpath() i cant seem to select the right node. I know it works because when I type response.xpath('//div[@class="container"]') I get a response but don't know how to navigate to the listsing cleardix shelflisting. I am hoping that once i get this bit I can continue working my way through the spider. Thank you in advance! PS I wonder if it is not possible to scan this site - is it possible for the owners to block spiders?

Read the article
Scrapy cannot find div on this website [on hold]

- by Jaspal Singh Rathour

I am very new at this and have been trying to get my head around my first selector can somebody help? i am trying to extract data from page http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false all the info under div class = listing clearfix shelfListing but i cant seem to figure out how to format response.xpath(). I have managed to launch scrapy console but no matter what I type in response.xpath() i cant seem to select the right node. I know it works because when I type response.xpath('//div[@class="container"]') I get a response but don't know how to navigate to the listsing cleardix shelflisting. I am hoping that once i get this bit I can continue working my way through the spider. Thank you in advance! PS I wonder if it is not possible to scan this site - is it possible for the owners to block spiders?

Read the article
Yandex frequently replaces page names with ampersands

- by Guy

The Yandex spider is a frequent visitor to one of the sites I manage. On ocassion it replaces the page name with two ampersands and a space. So if the page is: /mypage.aspx?param=value then it will try and crawl it as: /&& ?param=value Any idea why it is doing this? [EDIT] If I remember correctly the IP that this "mistake" is coming from is based in California and not Russia. I believe that they crawl US sites from a US based IP address. Not sure if that helps. More Info about request: IP: 199.21.99.82 City: Palo Alto State: California Country: United States ISP: Yandex Inc. User-Agent: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

Read the article
Does schema.org improve SEO?

- by marko

http://schema.org This site provides a collection of schemas, i.e., html tags, that webmasters can use to markup their pages in ways recognized by major search providers. Search engines including Bing, Google, Yahoo! and Yandex rely on this markup to improve the display of search results, making it easier for people to find the right web pages. It sounds wonderful, but does the search spider ignore the extra attributes and elements? Is it just too clever and ignores it? May it also be that it lowers your visibility because of such alteration?

Read the article
SEO tool is telling me title, description and keywords don't exist, but they do. Where is the problem?

- by DaveDev

I'm using the following tool to analyse how 'optimal' a site that I'm working on is for search engines: http://tools.seobook.com/general/spider-test/ I enter the URL for the site - http://ftmsuat.moneymate.com - into the search bar, and it returns a breakdown of the contents of the page. I'm a little confused by what I see though. According to the results, the page doesn't have a title, description or keywords. But if you check the source of the page, those elements are definitely there. So I'm wondering now, which is wrong? seobook.com or my page?

Read the article
Does SEO optimisation count on the responsive side of a site?

- by Rick Donohoe

I'm looking at making some SEO optimisation fixes, and at this point I'm sorting out the heading structure and keywords - H1's, H2's etc We have a site where there are a number of similar blocks, and one is always visible, and one is hidden depending on the screen size. This is our method of making a single site responsive. Firstly, how does this technique affect the SEO, and in general does the responsive side of a site matter at all to search engines? What I mean by this is if the site has different content depending on screen sizes, then which content would the search spider crawl?

Read the article
Working on the search

The one thing I've always like working on and about this site, is the full text search engine and spider. Bascially it goes out and spiders all the major development blogs on the web, and then indexes them. The engine uses Lucene for it's index. Lucene is another open source project and it works really fast. Currently the directory is indexed, and the rss feeds are underway. We're talking about a lot of content, but once it's done you'll be able to pull podcasts, and videos as they get posted to...Did you know that DotNetSlackers also publishes .net articles written by top known .net Authors? We already have over 80 articles in several categories including Silverlight. Take a look: here.

Read the article
Is the use of hashbang really a good idea? [on hold]

- by user32642

I've been working on a WordPress site lately that was design with hashbang or shebang in the dynamically generated URLs. After doing some research, I noticed that there was some preference by Google in their use and how it crawled the site. However, after I ran several sitemap generators and Screaming Frog SEO Spider, I realized that the only page being crawled was the index page. So now I am questioning the use of hashbangs. What do you think? Should I attempt to remove them? Or will it even matter? And does anyone know of a easy way to remove this? The site is www.modernvintage1005.com

Read the article
How does Google index our site when you're doing a Website Optimizer experiment?

- by user305175

I'm about to use Google's Website Optimizer to do a/b testing on the home page of my site. My question is: which of the alternative pages will google's spider index? All of them? I couldn't find any info about this on google or on GWO pages.

Read the article
MSN like box for Ad rotation.

- by Muhammad Umar Siddique

Hi Everyone. I want to create a JavaScript based box much like the one found on MSN or AOL with the navigational buttons. Box content must be spider-able by search engines. On MSN you can find the box near top left corner. Note this box contains links and images. Any idea how to implement this ? Thanks.

Read the article
Executing Javascript without a browser?

- by Daniel

I am looking into Javascript programming without a browser. I want to run scripts from the Linux or Mac OS X command line, much like we run any other scripting language (ruby, php, perl, python...) $ javascript my_javascript_code.js I looked into spider monkey (Mozilla) and v8 (Google), but both of these appear to be embedded. Is anyone using Javascript as a scripting language to be executed from the command line? If anyone is curious why I am looking into this, I've been poking around node.js

Read the article
Developer workload management software: recommendations?

- by PaddyC

In my place, we use two issue tracking tools, one for production bugs, and another for issues on projects in development. These tools are good for managers. However, at an individual level, other work can also arrive through desk visits, email, or the phone, and those allocating the work aren't always interested in issue tracking systems. I've recently rolled off an 18-month project where I got into a cycle of overtime, and I found that I didn't always have good visibility of all work assigned to me. As a result, I was always very busy, and felt constantly laden down with work, but didn't have the clear data to show my manager (or, ironically, the time to stop and gather the information!). A handwritten list, updated at the end of each day, is a good start, but can anyone recommend better tools to help me get a clear view of my own workload? Ideally, I'm thinking of software tools for developers, which could incorporate estimates, but all suggestions welcome. Thanks, Paddy

Read the article
How to iterate loop inside a string searching for any word aftera fixed keyword?

- by Parth

Suppose I have a sting as "PHP Paddy,PHP Pranav,PHP Parth", now i have a count as 3,now how should I iterate loop in the string aiming on string after "PHP " to display the all the names? Alright This is the string "BEGIN IF (NEW.name != OLD.name) THEN INSERT INTO jos_menuaudit set menuid=OLD.id, oldvalue = OLD.name, newvalue = NEW.name, field = "name"; END IF; IF (NEW.alias != OLD.alias) THEN INSERT INTO jos_menuaudit set menuid=OLD.id, oldvalue = OLD.alias, newvalue = NEW.alias, field = "alias"; END IF; END" in which i am searching the particular word after " IF (NEW.", and after that particualar others strings should not b displayed, hence whenever in a loop it finds " IF (NEW." I musr get a word just next to it. and in this way an array should b ready for to use.

Read the article
How can I find unused images and CSS styles in a website?

- by Jon Galloway

Is there a tool or methodology (other than trial and error) I can use to find unused image files? How about CSS declarations for ID's and Classes that don't even exist in the site? It seems like there might be a way to just spider the site, profile it, and see which images and styles are never loaded.

Read the article
Data extract from website URL

- by user2522395

From this below script I am able to extract all links of particular website, But i need to know how I can generate data from extracted links especially like eMail, Phone number if its there Please help how i will modify the existing script and get the result or if you have full sample script please provide me. Private Sub btnGo_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles btnGo.Click 'url must be in this format: http://www.example.com/ Dim aList As ArrayList = Spider("http://www.qatarliving.com", 1) For Each url As String In aList lstUrls.Items.Add(url) Next End Sub Private Function Spider(ByVal url As String, ByVal depth As Integer) As ArrayList 'aReturn is used to hold the list of urls Dim aReturn As New ArrayList 'aStart is used to hold the new urls to be checked Dim aStart As ArrayList = GrabUrls(url) 'temp array to hold data being passed to new arrays Dim aTemp As ArrayList 'aNew is used to hold new urls before being passed to aStart Dim aNew As New ArrayList 'add the first batch of urls aReturn.AddRange(aStart) 'if depth is 0 then only return 1 page If depth < 1 Then Return aReturn 'loops through the levels of urls For i = 1 To depth 'grabs the urls from each url in aStart For Each tUrl As String In aStart 'grabs the urls and returns non-duplicates aTemp = GrabUrls(tUrl, aReturn, aNew) 'add the urls to be check to aNew aNew.AddRange(aTemp) Next 'swap urls to aStart to be checked aStart = aNew 'add the urls to the main list aReturn.AddRange(aNew) 'clear the temp array aNew = New ArrayList Next Return aReturn End Function Private Overloads Function GrabUrls(ByVal url As String) As ArrayList 'will hold the urls to be returned Dim aReturn As New ArrayList Try 'regex string used: thanks google Dim strRegex As String = "<a.*?href=""(.*?)"".*?>(.*?)</a>" 'i used a webclient to get the source 'web requests might be faster Dim wc As New WebClient 'put the source into a string Dim strSource As String = wc.DownloadString(url) Dim HrefRegex As New Regex(strRegex, RegexOptions.IgnoreCase Or RegexOptions.Compiled) 'parse the urls from the source Dim HrefMatch As Match = HrefRegex.Match(strSource) 'used later to get the base domain without subdirectories or pages Dim BaseUrl As New Uri(url) 'while there are urls While HrefMatch.Success = True 'loop through the matches Dim sUrl As String = HrefMatch.Groups(1).Value 'if it's a page or sub directory with no base url (domain) If Not sUrl.Contains("http://") AndAlso Not sUrl.Contains("www") Then 'add the domain plus the page Dim tURi As New Uri(BaseUrl, sUrl) sUrl = tURi.ToString End If 'if it's not already in the list then add it If Not aReturn.Contains(sUrl) Then aReturn.Add(sUrl) 'go to the next url HrefMatch = HrefMatch.NextMatch End While Catch ex As Exception 'catch ex here. I left it blank while debugging End Try Return aReturn End Function Private Overloads Function GrabUrls(ByVal url As String, ByRef aReturn As ArrayList, ByRef aNew As ArrayList) As ArrayList 'overloads function to check duplicates in aNew and aReturn 'temp url arraylist Dim tUrls As ArrayList = GrabUrls(url) 'used to return the list Dim tReturn As New ArrayList 'check each item to see if it exists, so not to grab the urls again For Each item As String In tUrls If Not aReturn.Contains(item) AndAlso Not aNew.Contains(item) Then tReturn.Add(item) End If Next Return tReturn End Function

Read the article
JSON.parse vs. eval()

- by Kevin Major

My Spider Sense warns me that using eval() to parse incoming JSON is a bad idea. I'm just wondering if JSON.parse() - which I assume is a part of JavaScript and not a browser-specific function - is more secure.

Read the article
Blocking 'good' bots in nginx with multiple conditions for certain off-limits URL's where humans can go

- by Glenn Plas

After 2 days of searching/trying/failing I decided to post this here, I haven't found any example of someone doing the same nor what I tried seems to be working OK. I'm trying to send a 403 to bots not respecting the robots.txt file (even after downloading it several times). Specifically Googlebot. It will support the following robots.txt definition. User-agent: * Disallow: /*/*/page/ The intent is to allow Google to browse whatever they can find on the site but return a 403 for the following type of request. Googlebot seems to keep on nesting these links eternally adding paging block after block: my_domain.com:80 - 66.x.67.x - - [25/Apr/2012:11:13:54 +0200] "GET /2011/06/ page/3/?/page/2//page/3//page/2//page/3//page/2//page/2//page/4//page/4//pag e/1/&wpmp_switcher=desktop HTTP/1.1" 403 135 "-" "Mozilla/5.0 (compatible; G ooglebot/2.1; +http://www.google.com/bot.html)" It's a wordpress site btw. I don't want those pages to show up, even though after the robots.txt info got through, they stopped for a while only to begin crawling again later. It just never stops .... I do want real people to see this. As you can see, google get a 403 but when I try this myself in a browser I get a 404 back. I want browsers to pass. root@my_domain:# nginx -V nginx version: nginx/1.2.0 I tried different approaches, using a map and plain old nono if's and they both act the same: (under http section) map $http_user_agent $is_bot { default 0; ~crawl|Googlebot|Slurp|spider|bingbot|tracker|click|parser|spider 1; } (under the server section) location ~ /(\d+)/(\d+)/page/ { if ($is_bot) { return 403; # Please respect the robots.txt file ! } } I recently had to polish up my Apache skills for a client where I did about the same thing like this : # Block real Engines , not respecting robots.txt but allowing correct calls to pass # Google RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Googlebot/2\.[01];\ \+http://www\.google\.com/bot\.html\)$ [NC,OR] # Bing RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ bingbot/2\.[01];\ \+http://www\.bing\.com/bingbot\.htm\)$ [NC,OR] # msnbot RewriteCond %{HTTP_USER_AGENT} ^msnbot-media/1\.[01]\ \(\+http://search\.msn\.com/msnbot\.htm\)$ [NC,OR] # Slurp RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Yahoo!\ Slurp;\ http://help\.yahoo\.com/help/us/ysearch/slurp\)$ [NC] # block all page searches, the rest may pass RewriteCond %{REQUEST_URI} ^(/[0-9]{4}/[0-9]{2}/page/) [OR] # or with the wpmp_switcher=mobile parameter set RewriteCond %{QUERY_STRING} wpmp_switcher=mobile # ISSUE 403 / SERVE ERRORDOCUMENT RewriteRule .* - [F,L] # End if match This does a bit more than I asked nginx to do but it's about the same principle, I'm having a hard time figuring this out for nginx. So my question would be, why would nginx serve my browser a 404 ? Why isn't it passing, The regex isn't matching for my UA: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.30 Safari/536.5" There are tons of example to block based on UA alone, and that's easy. It also looks like the matchin location is final, e.g. it's not 'falling' through for regular user, I'm pretty certain that this has some correlation with the 404 I get in the browser. As a cherry on top of things, I also want google to disregard the parameter wpmp_switcher=mobile , wpmp_switcher=desktop is fine but I just don't want the same content being crawled multiple times. Even though I ended up adding wpmp_switcher=mobile via the google webmaster tools pages (requiring me to sign up ....). that also stopped for a while but today they are back spidering the mobile sections. So in short, I need to find a way for nginx to enforce the robots.txt definitions. Can someone shell out a few minutes of their lives and push me in the right direction please ? I really appreciate ANY response that makes me think harder ;-)

Read the article
AWS Load balancer connection reset

- by joshmmo

I have an ELB set up with two instances. The issue I have with it is that when I do not add www. to it, the ELB just hangs. This is some info I get when I spider with wget: Spider mode enabled. Check if remote file exists. --2013-06-20 13:40:54-- http://learning.example.com/ Resolving learning.example.com... 54.xxx.x.x53, 50.xx.xxx.x71 Connecting to learning.example.com|54.xxx.x.x53|:80... connected. HTTP request sent, awaiting response... No data received. Retrying. when I add www. it works great. I have a GoDaddy SSL cert that I added to the listener section that covers 3 domains, www.learning.example.com, files.learning.example.com and learning.example.com. These are my listener settings: - HTTP 80 HTTPS 443 N/A N/A - SSL 443 SSL 443 Change canvasNew (Change) My EC2 instances are running apache2 on Ubuntu 12.04. I will be happy to post my vhosts file if needed. However, when I ran the server with the domains pointing to just one EC2 instance things worked fine. How can I fix this issue for learning.example.com? Why does www work just fine? A second question would be what is the difference between instance protocol and load balancer protocol? EDIT: Here are the dig results for learning.example.com from yesterday. I changed the DNS entry to point to one instance to make sure it was the elb. When I switch it back I will do it for www.learning.example.com ; <<>> DiG 9.9.1-P2 <<>> learning.example.com ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20210 ;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0 ;; QUESTION SECTION: ;learning.example.com. IN A ;; ANSWER SECTION: learning.example.com. 2559 IN CNAME canvas-22222222222.us-west-1.elb.amazonaws.com. canvas-22222222222.us-west-1.elb.amazonaws.com. 60 IN A 54.xxx.x.x53 canvas-22222222222.us-west-1.elb.amazonaws.com. 60 IN A 50.xx.xxx.x71 ;; Query time: 83 msec ;; SERVER: 10.x.xx.20#53(10.x.xx.20) ;; WHEN: Thu Jun 20 13:40:47 2013 ;; MSG SIZE rcvd: 137 EDIT 2: Here is some more info that might be helpful. Port Configuration: 80 (HTTP) forwarding to 443 (HTTPS) Backend Authentication: Disabled Stickiness: Disabled(edit) 443 (SSL, Certificate: canvasNew) forwarding to 443 (SSL) Backend Authentication: Disabled So I switched everything to one EC2 IP address to bypass the elb to make sure things are working. It's running great. www and the non-www url work perfectly fine. Its only when I switch things to the ELB that learning.example.com hangs and www.learning.example.com works. Hopefully you can get some ideas flowing.

Read the article
Would an array of SSD drives be able to succesfully substitute the system memory?

- by Florin Mircea

I watched a few videos trying to answer this. This video (youtube.com/watch?v=eULFf6F5Ri8) shows a bunch of guys stacking 24 SSD's reaching a peak of around 2GBps r/w. That's under the limit of the worst DDR3 in this list (memorybenchmark.net/write_ddr3_amd.html) - that shows DDR3 memory performance varying from 2.78 to 6.55 Gb per second, but that video is over 3 years old. This video (youtube.com/watch?v=27GmBzQWwP0) shows a more optimistic situation, but for PCI-E SSD drives: 5 drives peaking at around 4Gb. And this other video shows that stacking up more than 3 SSD's doesn't realistically offer a substantial added performance. This and the fact that in all benchmarks the drives act quite poorly when dealing with small files (5k file read/write averaging from 10MB to around 30-40MBps) as opposed to how native memory handles such files, seems to indicate a definite NO to this question. Also, the write life cycle is indeed limited and the drives might wear out quickly, as kindly pointed out by paddy. However, I wanted to get more opinions on this. Would it be possible to at least obtain current memory performance with SSD's in RAID 0? And if so, in what circumstances? I am assuming using this configuration with a Windows OS that has a memory pagefile resident to that stack of SSD's, thus making it very fast to work with.

Read the article

< Previous Page | 2 3 4 5 6 7 8 9 | Next Page >