Search Results

Search found 149 results on 6 pages for 'crawlers'.

Page 6/6 | < Previous Page | 2 3 4 5 6

Changing domain name - what are the practical steps involved

- by Homunculus Reticulli

I launched a website a couple of years ago, bright eyed and bushy tailed, with dreams of conquering the world. Unfortunately it wasn't to be. Now, that I am a bit older and wiser, I have spent some money on branding and creating more quality content etc, I am rebranding and relaunching the site with a new domain name. Although the traffic on the old site is laughable (i.e. non-existent), there are a few pages of good information on there and I don't want to lose any "juice" those pages may have gained because web crawlers have been seeing it for a few years now. Ok, the upshot of all that is this: I want to change my domain name from xyz.com to abc.com. I am maintaining the same friendly urls I had before, only the domanin name part of the url will change, so that any traffic coming to the old page will be forwarded/redirected? to the new page seamlessly. How do I go about achieving this (i.e. what are the steps I need to carry out, and to minimize any "disruption" to any credibility the existing site has with Googlebot etc? I am running Apache 2.x on a headless Linux (Ubuntu) server.

Read the article
Homepage not showing on Google

- by MIke Mayberry

About six weeks ago my homepage (mayberrykayakingdotcodotuk) disappeared from the google organic search for "kayaking pembrokeshire" despite it having been number 2 within a few weeks of it's launch last summer. My previous site (www.mikemayberrykayakingdotcodotuk) had been 2nd for about six years and has 301 redirects for all pages to the new site. Google toolbar still rates the homepage as 3/10 and the domain is still showing in search results, just not the homepage. A little research suggests that this is most likely to be due to an issue with google treating two pages as identical content (one with www. and one with not) since the changes in their algorithms around that time and that the way to fix this is to add some code somewhere. This makes sense to me as my print advertising doesn't have the www part of the address. I have cpanel access but a limited knowledge on web coding, having picked things up as I've gone along and paid for designers etc., when needed. Would someone be able to let me know where I have to go to add the code and what code I need to add to redirect the crawlers to one page? Or is there another issue that is causing this? Thanks in advance.

Read the article
SEO and external sites that serve responsive images (like Re-SRC)

- by Baumr

Re-SRC is a tool that allows you to automatically serve responsive images for your website from their cloud servers. It delivers a new image file each time the browser window (viewport) is resized. To use it in your HTML when linking to an image, you would do the following: <img src="http://app.resrc.it//www.your-domain.com/img/img001.jpg"/> Some more background for SEO considerations: As an example, looking at their demo page's code, the src of the Arc de Triomphe photo — when the browser window is resized to be at a tablet-width — shows this particular file at it's widest. It is found under the following URL: http://app4-uk.resrc.it/s=w560,pd1/ro=h//www.resrc.it/img/demo/demo-image-1.jpg If the viewport is increased to desktop-width, then a smaller image is served in line with the design; see this URL: http://app4-uk.resrc.it/s=w320,pd1/ro=h//www.resrc.it/img/demo/demo-image-1.jpg If I change the viewport to be about half-way between those two, then the image's URL is: http://app4-uk.resrc.it/s=w240,pd1/ro=h//www.resrc.it/img/demo/demo-image-1.jpg In other words, I found that there is a separate file for every 10-pixel increment of the image width. Very cool for saving bandwidth on mobile devices and service responsive/retina images on others, but... Here are two problems I see for SEO: The img on your site, part of your semantic markup, will not be hosted on your site at all, or even a server you control. Any links to these images will pass on "link juice" to Re-SRC's site instead. You are serving a vast array of different image files to different people — some may link to one, others to another size. Then there's the question of what different search engine crawlers will see. Also: There seems to be no fallback option if their servers are down. Do you see any other concerns? Or, perhaps, do you not see those as concerns?

Read the article
Google Bot information?

- by Bub Bradlee

Does anyone know any more details about google's web-crawler (aka GoogleBot)? I was curious about what it was written in (I've made a few crawlers myself and am about to make another) and if it parses images and such. I'm assuming it does somewhere along the line, b/c the images in images.google.com are all resized. It also wouldn't surprise me if it was all written in Python and if they used all their own libraries for most everything, including html/image/pdf parsing. Maybe they don't though. Maybe it's all written in C/C++. Thanks in advance-

Read the article
Millions of anonymous ASP.Net profiles!?

- by Mantorok

Hi all, some advice needed! Our website receives approximately 50,000 hits a day, and we use anonymous ASP.Net membership profiles/users, this is resulting in millions (4.5m currently) of "active" profiles and the database is 'crawling', we have a nightly task that cleans up all the inactive ones. There is no way that we have 4.5m unique visitors (our county population is only 1/2 million), could this be caused by crawlers and spiders? Also, if we have to live with this huge number of profiles is there anyway of optimising the DB? Thanks Kev

Read the article
Spring-Security http-basic auth in addition to other authentication types

- by Keith

I have a pretty standard existing webapp using spring security that requires a database-backed form login for user-specific paths (such as /user/**), and some completely open and public paths (such as /index.html). However, as this webapp is still under development, I'd like to add a http-basic popup across all paths (/**) to add some privacy. Therefore, I'm trying to add a http-basic popup that asks for a universal user/pass combo (ex admin/foo) that would be required to view any path, but then still keep intact all of the other underlying authentication mechanisms. I can't really do anything with the <http> tag, since that will confuse the "keep out the nosy crawlers" authentication with the "user login" authentication, and I'm not seeing any way to associate different paths with different authentication mechanisms. Is there some way to do this with spring security? Alternatively, is there some kind of a dead simple filter that I can apply independently of spring-security's authentication mechanisms?

Read the article
disallow certain url in robots.txt

- by chrism

We implemented a rating system on a site a while back that involves a link to a script. However, with the vast majority of ratings on the site at 3/5 and the ratings very even across 1-5 we're beginning to suspect that search engine crawlers etc. are getting through. The urls used look like this: http://www.thesite.com/path/to/the/page/rate?uid=abcdefghijk&value=3 When we started we add the following to our robots.txt: User-agent: * Disallow: /rate Is this incorrect or are googlebot and others simply ignoring our robots.txt?

Read the article
How can I handle this kind of exceptions (in Doctrine)

- by ppavlovic

Can you tell me how can I handle this kind of exceptions: Fatal error: Uncaught exception 'Doctrine_Connection_Exception' with message 'PDO Connection Error: SQLSTATE[HY000] [2013] Lost connection to MySQL server at 'reading initial communication packet', system error: 110' in ... It happens when connection with MySQL is lost during query. I need to handle this exception so I can show 500 error page so the crawlers do not cache page, and to redirect user to appropriate "Try again" page. P.S. I have a lot's of code, so I can not go trough all code to put try/catch block. I need something simple and yet effective.

Read the article
Php text spinner

- by Sir Lojik

Hi, i spent time writing a rss feed aggregator and have come to find out it completely has no impact on seo. infact it could be damaging my website. i cant get rid of it as its a commonly used resource. So was wondering is there any hardcore php text spinner(synonimizer) out there. that way i could server crawlers/spiders with spun text. is this ethical? or would it cause more damage? please i need feedback.

Read the article
SSIS Catalog, Windows updates and deployment failures due to System.Core mismatch

- by jamiet

This is a heads-up for anyone doing development on SSIS. On my current project where we are implementing a SQL Server Integration Services (SSIS) 2012 solution we recently encountered a situation where we were unable to deploy any of our projects even though we had successfully deployed in the past. Any attempt to use the deployment wizard resulted in this error dialog: The text of the error (for all you search engine crawlers out there) was: A .NET Framework error occurred during execution of user-defined routine or aggregate "create_key_information": System.IO.FileLoadException: Could not load file or assembly 'System.Core, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089' or one of its dependencies. The located assembly's manifest definition does not match the assembly reference. (Exception from HRESULT: 0x80131040) ---> System.IO.FileLoadException: The located assembly's manifest definition does not match the assembly reference. (Exception from HRESULT: 0x80131040) System.IO.FileLoadException: System.IO.FileLoadException: at Microsoft.SqlServer.IntegrationServices.Server.Security.CryptoGraphy.CreateSymmetricKey(String algorithm) at Microsoft.SqlServer.IntegrationServices.Server.Security.CryptoGraphy.CreateKeyInformation(SqlString algorithmName, SqlBytes& key, SqlBytes& IV) . (Microsoft SQL Server, Error: 6522) After some investigation and a bit of back and forth with some very helpful members of the SSIS product team (hey Matt, Wee Hyong) it transpired that this was due to a .Net Framework fix that had been delivered via Windows Update. I took a look at the server update history and indeed there have been some recently applied .Net Framework updates: This fix had (in the words of Matt Masson) “somehow caused a mismatch on System.Core for SQLCLR” and, as you may know, SQLCLR is used heavily within the SSIS Catalog. The fix was pretty simple – restart SQL Server. This causes the assemblies to be upgraded automatically. If you are using Data Quality Services (DQS) you may have experienced similar problems which are documented at Upgrade SQLCLR Assemblies After .NET Framework Update. I am hoping the SSIS team will follow-up with a more thorough explanation on their blog soon. You DBAs out there may be questioning why Windows Update is set to automatically apply updates on our production servers. We’re checking that out with our hosting provider right now You have been warned! @Jamiet

Read the article
How exactly is Google Webmaster Tools measuring "Site Performance"?

- by Rémi

I've been working for two months now on improving our response time (mainly server side) on a new forum (a brand new product on a technical point of view) we've launched in Germany a few month ago and I'm a lot surprised by the results I get. I monitor our response time using Apache logs and our own implementation of Boomerang beacon. Using my stats, I can see that our new product responds in about 680 ms where our old product was responding in about 1050 ms. On the other side, Google Webmaster Tool tells us that our pages have an average reponse time of about 1500 ms today where it was 700 three months ago with our old product. I've figured that GWT was taking client side metrics into account so I've added some measures on our Boomerang beacon and everything looks just fine. I've also ran some random pages on ySlow and Google's Page Speed and everything looks better than it was before. We event have a 82% on Google's Page Speed tool which is quite cool for a site with some ads in it :) Lately, we have signed a deal with Akamai to use two of their products : CDN for our static files (we were using another CDN before but it wasn't very effective) and RMA to improve Networks routes. We have also introduced a new agressive cache mecanism to ensure that most of the pages served to crawlers are cached by our memcache grid. After checking my metrics, it seems that this changes have improved from 650ms to about 500ms, which is good (still not great but it is definitly an improvement). But webmaster tools continues to report an increasing average response time where we see it decreasing in the same time. Have you ever had the same kind of wierd behavior on your sites while doing performance improvements ? Do you have any idea how to monitor the same thing Google does with Site Performance in Google Webmaster Tools so that we could improve our site and constantly check if it is what Google wants ? Edit 2011/07/26 : Thanks for your answers guys ! Nevertheless, I was not precise enough. The main issue we have is not with the Site Performance page but with the Crawl Stats one for now. We probably found an issue on our side with some very slow pages (around 3000 ms !!) and we are trying to fix them. I'll keep you posted as soon I'll have some infos. Thanks again !

Read the article
SEO to ensure visibility for a narrow, non-competitive, non-commercial site

- by hen3ry

I'm webmaster of a non-commercial site in English. A non-native-English speaker asked me why our site doesn't produce hits in Google searches she conducts for relevant keywords in her native language. I asked her for a list of keywords in her native language, and I naively tried inserting those into the META info in the page headers and waited a couple of weeks. No help. A little searching informed me that Google doesn't use the META info, and has not done so for a very long time. D'oh! To be entirely concrete, suppose the StackExchange folks want Russian speakers to find this site, Pro Webmasters. The direct translation in Russian of "webmaster" --thanks, Google Translator-- is: "?????????". (Not sure this will render properly, but that's not essential to my question.) Assuming Pro Webmasters has a common template for all pages it generates, inserting "?????????" into the Keywords META for that template won't help, it seems. What could StackExchange do to make this site visible to users searching with the Russian keyword "?????????" ? Pretty much all the advice I've seen boils down to this, if I understand correctly: use the desired search term often (but not too often) among site content, and the problem will be solved. That's great, but I don't think sprinkling "?????????" visibly all over Pro Webmasters is going to fly. Just for completeness, crawlers must be long immune to the invisible-to-visitors scheme, e.g, format "?????????" in a tiny text size in a color the same as an existing background, e.g. white-over-white. Or, put that text inside a div styled: ' style="visibility: hidden" '. Probably some other equivalents. I can only think of one slightly effective method, along these lines: place an unobtrusive link on the common template to a page titled "for international users" , and on that page list desired synonyms for "webmaster" in various languages on that page. A test case --admittedly, just one-- using my site implies that a Google search for "international users" ????????? will produce a hit for this page, and thus make the site minimally visible, despite the fact that the page will almost never be visited. At the moment, anyway. Note: All the SEO discussions I have found so far are about competitive and --almost certainly-- commercial sites. To repeat: my site is non-commercial, and it is about an obscure, narrow topic that is of interest to only a small number of people worldwide. This isn't about clawing our way to the top of competitive rankings, just making this content minimally visible to interested non-native-English speakers. Ideas? TIA

Read the article
Finding all IP ranges blelonging to a specific ISP

- by Jim Jim

I'm having an issue with a certain individual who keeps scraping my site in an aggressive manner; wasting bandwidth and CPU resources. I've already implemented a system which tails my web server access logs, adds each new IP to a database, keeps track of the number of requests made from that IP, and then, if the same IP goes over a certain threshold of requests within a certain time period, it's blocked via iptables. It may sound elaborate, but as far as I know, there exists no pre-made solution designed to limit a certain IP to a certain amount of bandwidth/requests. This works fine for most crawlers, but an extremely persistent individual is getting a new IP from his/her ISP pool each time they're blocked. I would like to block the ISP entirely, but don't know how to go about it. Doing a whois on a few sample IPs, I can see that they all share the same "netname", "mnt-by", and "origin/AS". Is there a way I can query the ARIN/RIPE database for all subnets using the same mnt-by/AS/netname? If not, how else could I go about getting every IP belonging to this ISP? Thanks.

Read the article
Copy a website and preserve the file & folder structure

- by DrStalker

I have an old web site running on an ancient version of Oracle Portal that we need to convert to a flat-html structure. Due to damage to the server we are not able to access the administrative interface, and even if we could there is no export functionality that can work with modern software versions. It would be enough to crawl the website and have all the pages & images saved to a folder, but the file structure needs to be preserved; that is, if a page is located at http://www.oldserver.com/foo/bar/baz/mypage.html then it needs to be saved to /foo/bar/baz/mypage.html so that the various Javascript bits will continue to function. None of the web crawlers I've found have been able to do this; they all want to rename the pages (page01.html, page02.html etc) and break the folder structure. Is there any crawler out there that will recreate the site structure as it appears to a user accessing the site? It doesn't need to redo any of teh content of the pages; once rehosted the pages will all have the same names they did originally so links will continue to work.

Read the article
Web scraping etiquette

- by Ash

I'm considering writing a simple web scraping application to extract information from a website that does not seem to specifically prohibit this. I've checked for other alternatives (eg RSS, web service) to get this information, but there are none available at this stage. Despite this I've also developed/maintained a few websites myself and so I realize that if web scraping is done naively/greedily it can slow things down for other users and generally become a nuisance. So, what etiquette is involved in terms of: Number of requests per second/minute/hour. HTTP User Agent content. HTTP Referer content. HTTP Cache settings. Buffer size for larger files/resources. Legalities and licensing issues. Good tools or design approaches to use. Robots.txt, is this relevant for web scraping or just crawlers/spiders? Compression such as GZip in requests. Update Found this relevant question on Meta: Etiquette of Screen Scaping StackOverflow. Jeff Atwood's answer has some helpful recommendations. Other related StackOverflow questions: Options for html scraping Legalities of screen scraping

Read the article
CSS sliding-door buttons center alignment

- by rochal

Hi guys, I need help to align CSS buttons. I tried many different variations and I just cannot center my button the way I want. Firstly, have a look at this url: http://www.front-end-developer.net/cssbuttons/example.htm I'm using 2 images to form a button (this could be done on 1 image, but in this case we've got two). Everything works as expected as long as we apply float:left or float:right to the parent div element, to 'limit' width of the div and close it as soon as the content of the div ends. You can remove float:left from the button to see what I mean. But what about center positioned buttons? I cannot add float:left/right because I want align it in the middle. In theory, I could set { width:XXpx; margin:0 auto; } And I will get what you can see on this picture: But I don't know the length of the text inside. Having different translations my button can be very short, or 5 times that long. I also tried to use <span> instead of <div>, but unfortunately nested inline elements don't respect their padding correctly... And yes, I must use <a> inside, so buttons can be accessed by web crawlers. I'm really stuck on this one.

Read the article
Webcrawler, feedback?

- by Jan Kuboschek

Hey folks, every once in a while I have the need to automate data collection tasks from websites. Sometimes I need a bunch of URLs from a directory, sometimes I need an XML sitemap (yes, I know there is lots of software for that and online services). Anyways, as follow up to my previous question I've written a little webcrawler that can visit websites. Basic crawler class to easily and quickly interact with one website. Override "doAction(String URL, String content)" to process the content further (e.g. store it, parse it). Concept allows for multi-threading of crawlers. All class instances share processed and queued lists of links. Instead of keeping track of processed links and queued links within the object, a JDBC connection could be established to store links in a database. Currently limited to one website at a time, however, could be expanded upon by adding an externalLinks stack and adding to it as appropriate. JCrawler is intended to be used to quickly generate XML sitemaps or parse websites for your desired information. It's lightweight. Is this a good/decent way to write the crawler, provided the limitations above? http://pastebin.com/VtgC4qVE - Main.java http://pastebin.com/gF4sLHEW - JCrawler.java http://pastebin.com/VJ1grArt - HTMLUtils.java Thanks for your feedback in advance! :)

Read the article
What techniques can be used to detect so called "black holes" (a spider trap) when creating a web crawler?

- by Tom

When creating a web crawler, you have to design somekind of system that gathers links and add them to a queue. Some, if not most, of these links will be dynamic, which appear to be different, but do not add any value as they are specifically created to fool crawlers. An example: We tell our crawler to crawl the domain evil.com by entering an initial lookup URL. Lets assume we let it crawl the front page initially, evil.com/index The returned HTML will contain several "unique" links: evil.com/somePageOne evil.com/somePageTwo evil.com/somePageThree The crawler will add these to the buffer of uncrawled URLs. When somePageOne is being crawled, the crawler receives more URLs: evil.com/someSubPageOne evil.com/someSubPageTwo These appear to be unique, and so they are. They are unique in the sense that the returned content is different from previous pages and that the URL is new to the crawler, however it appears that this is only because the developer has made a "loop trap" or "black hole". The crawler will add this new sub page, and the sub page will have another sub page, which will also be added. This process can go on infinitely. The content of each page is unique, but totally useless (it is randomly generated text, or text pulled from a random source). Our crawler will keep finding new pages, which we actually are not interested in. These loop traps are very difficult to find, and if your crawler does not have anything to prevent them in place, it will get stuck on a certain domain for infinity. My question is, what techniques can be used to detect so called black holes? One of the most common answers I have heard is the introduction of a limit on the amount of pages to be crawled. However, I cannot see how this can be a reliable technique when you do not know what kind of site is to be crawled. A legit site, like Wikipedia, can have hundreds of thousands of pages. Such limit could return a false positive for these kind of sites. Any feedback is appreciated. Thanks.

Read the article
Redirecting to frontpage after 404 error in PHP

- by Saif Bechan

I have a php web page that now uses custom error pages when a page is not found. The custom error pages are included in PHP. So when somebody types in an URL that does not exists I just include an error page, and the error page starts with: <?php header("HTTP/1.1 404 Not> Found"); ?> This also tells crawlers that the page does not exist. Now I have set up a new system. When a user types a wrong url, the user is sent back to the frontpage and a message is displayed on the frontpage. I redirect to the frontpage like this: header('Location:' . __TINY_URL . '/'); Now the problem is PHP just sends back a 200 code, page found. How can I mix these two to create a 404 code on the frontpage. And is this overall a nice way of presenting and error page.

Read the article
Best practice to hide/encrypt email adress in webpage

- by Sebi

I couldn't find a similar question, that's why here it is: Whats the best way to hide or encrypt an email link in a website, so that a crawler can't read it, but the user can nevertheless click it? I don't want to conufse the users by typing the email like this: john (at) mail.com or similar ways. (and i think this kind of links can nevertheless read by crawlers?) I also tried things like that: <script>// <![CDATA[eval(unescape('%76%61%72%20%73%3D%27%61%6D%6C%69%6F%74%72%3A%62%61%40%65%64%61%6E%6F%6C%2E%69%27%3B%76%61%72%20%72%3D%27%27%3B%66%6F%72%28%76%61%72%20%69%3D%30%3B%69%3C%73%2E%6C%65%6E%67%74%68%3B%69%2B%2B%2C%69%2B%2B%29%7B%72%3D%72%2B%73%2E%73%75%62%73%74%72%69%6E%67%28%69%2B%31%2C%69%2B%32%29%2B%73%2E%73%75%62%73%74%72%69%6E%67%28%69%2C%69%2B%31%29%7D%64%6F%63%75%6D%65%6E%74%2E%77%72%69%74%65%28%27%3C%61%20%68%72%65%66%3D%22%27%2B%72%2B%27%22%3E%4F%62%65%72%70%61%72%6C%65%69%74%65%72%3C%2F%61%3E%27%29%3B'))]]></script> but i heard this can also be read by crawler and it isn't really good practices are ther any common approaches?

Read the article
DB Strategy for inserting into a high read table (Sql Server)

- by Tom

Looking for strategies for a very large table with data maintained for reporting and historical purposes, a very small subset of that data is used in daily operations. Background: We have Visitor and Visits tables which are continuously updated by our consumer facing site. These tables contain information on every visit and visitor, including bots and crawlers, direct traffic that does not result in a conversion, etc. Our back end site allows management of the visitor's (leads) from the front end site. Most of the management occurs on a small subset of our visitors (visitors that become leads). The vast majority of the data in our visitor and visit tables is maintained only for a much smaller subset of user activity (basically reporting type functionality). This is NOT an indexing problem, we have done all we can with indexing and keeping our indexes clean, small, and not fragmented. ps: We do not currently have the budget or expertise for a data warehouse. The problem: We would like the system to be more responsive to our end users when they are querying, for instance, the list of their assigned leads. Currently the query is against a huge data set of mostly irrelevant data. I am pondering a few ideas. One involves new tables and a fairly major re-architecture, I'm not asking for help on that. The other involves creating redundant data, (for instance a Visitor_Archive and a Visitor_Small table) where the larger visitor and visit tables exist for inserts and history/reporting, the smaller visitor1 table would exist for managing leads, sending lead an email, need leads phone number, need my list of leads, etc.. The reason I am reaching out is that I would love opinions on the best way to keep the Visitor_Archive and the Visitor_Small tables in sync... Replication? Can I use replication to replicate only data with a certain column value (FooID = x) Any other strategies?

Read the article
Crawling engine architecture - Java/ Perl integration

- by Bigtwinz

Hi all, I am looking to develop a management and administration solution around our webcrawling perl scripts. Basically, right now our scripts are saved in SVN and are manually kicked off by SysAdmin/devs etc. Everytime we need to retrieve data from new sources we have to create a ticket with business instructions and goals. As you can imagine, not an optimal solution. There are 3 consistent themes with this system: the retrieval of data has a "conceptual structure" for lack of a better phrase i.e. the retrieval of information follows a particular path we are only looking for very specific information so we dont have to really worry about extensive crawling for awhile (think thousands-tens of thousands of pages vs millions) crawls are url-based instead of site-based. As I enhance this alpha version to a more production-level beta I am looking to add automation and management of the retrieval of data. Additionally our other systems are Java (which I'm more proficient in) and I'd like to compartmentalize the perl aspects so we dont have to lean heavily on outside help. I've evaluated the usual suspects Nutch, Droid etc but the time spent on modifying those frameworks to suit our specific information retrieval cant be justified. So I'd like your thoughts regarding the following architecture. I want to create a solution which use Java as the interface for managing and execution of the perl scripts use Java for configuration and data access stick with perl for retrieval An example use case would be a data analyst delivers us a requirement for crawling perl developer creates the required script and uses this webapp to submit the script (which gets saved to the filesystem) the script gets kicked off from the webapp with specific parameters .... Webapp should be able to create multiple threads of the perl script to initiate multiple crawlers. So questions are what do you think how solid is integration between Java and Perl specifically from calling perl from java has someone used such a system which actually is part perl repository The goal really is to not have a whole bunch of unorganized perl scripts and put some management and organization on our information retrieval. Also, I know I can use perl do do the web part of what we want - but as I mentioned before - trying to keep perl focused. But it seems assbackwards I'm not adverse to making it an all perl solution. Open to any all suggestions and opinions. Thanks

Read the article
nginx: How can I set proxy_* directives only for matching URIs?

- by Artem Russakovskii

I've been at this for hours and I can't figure out a clean solution. Basically, I have an nginx proxy setup, which works really well, but I'd like to handle a few urls more manually. Specifically, there are 2-3 locations for which I'd like to set proxy_ignore_headers to Set-Cookie to force nginx to cache them (nginx doesn't cache responses with Set-Cookie as per http://wiki.nginx.org/HttpProxyModule#proxy_ignore_headers). So for these locations, all I'd like to do is set proxy_ignore_headers Set-Cookie; I've tried everything I could think of outside of setting up and duplicating every config value, but nothing works. I tried: Nesting location directives, hoping the inner location which matches on my files would just set this value and inherit the rest, but that wasn't the case - it seemed to ignore anything set in the outer location, most notably proxy_pass and I end up with a 404). Specifying the proxy_cache_valid directive in an if block that matches on $request_uri, but nginx complains that it's not allowed ("proxy_cache_valid" directive is not allowed here). Specifying a variable equal to "Set-Cookie" in an if block, and then trying to set proxy_cache_valid to that variable later, but nginx isn't allowing variables for this case and throws up. It should be so simple - modifying/appending a single directive for some requests, and yet I haven't been able to make nginx do that. What am I missing here? Is there at least a way to wrap common directives in a reusable block and have multiple location blocks refer to it, after adding their own unique bits? Thank you. Just for reference, the main location / block is included below, together with my failed proxy_ignore_headers directive for a specific URI. location / { # Setup var defaults set $no_cache ""; # If non GET/HEAD, don't cache & mark user as uncacheable for 1 second via cookie if ($request_method !~ ^(GET|HEAD)$) { set $no_cache "1"; } if ($http_user_agent ~* '(iphone|ipod|ipad|aspen|incognito|webmate|android|dream|cupcake|froyo|blackberry|webos|s8000|bada)') { set $mobile_request '1'; set $no_cache "1"; } # feed crawlers, don't want these to get stuck with a cached version, especially if it caches a 302 back to themselves (infinite loop) if ($http_user_agent ~* '(FeedBurner|FeedValidator|MediafedMetrics)') { set $no_cache "1"; } # Drop no cache cookie if need be # (for some reason, add_header fails if included in prior if-block) if ($no_cache = "1") { add_header Set-Cookie "_mcnc=1; Max-Age=17; Path=/"; add_header X-Microcachable "0"; } # Bypass cache if no-cache cookie is set, these are absolutely critical for Wordpress installations that don't use JS comments if ($http_cookie ~* "(_mcnc|comment_author_|wordpress_(?!test_cookie)|wp-postpass_)") { set $no_cache "1"; } if ($request_uri ~* wpsf-(img|js)\.php) { proxy_ignore_headers Set-Cookie; } # Bypass cache if flag is set proxy_no_cache $no_cache; proxy_cache_bypass $no_cache; # under no circumstances should there ever be a retry of a POST request, or any other request for that matter proxy_next_upstream off; proxy_read_timeout 86400s; # Point nginx to the real app/web server proxy_pass http://localhost; # Set cache zone proxy_cache microcache; # Set cache key to include identifying components proxy_cache_key $scheme$host$request_method$request_uri$mobile_request; # Only cache valid HTTP 200 responses for this long proxy_cache_valid 200 15s; #proxy_cache_min_uses 3; # Serve from cache if currently refreshing proxy_cache_use_stale updating timeout; # Send appropriate headers through proxy_set_header Host $host; # no need for this proxy_set_header X-Real-IP $remote_addr; # no need for this proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; # Set files larger than 1M to stream rather than cache proxy_max_temp_file_size 1M; access_log /var/log/nginx/androidpolice-microcache.log custom; }

Read the article
Agile: User Stories for Machine Learning Project?

- by benjismith

I've just finished up with a prototype implementation of a supervised learning algorithm, automatically assigning categorical tags to all the items in our company database (roughly 5 million items). The results look good, and I've been given the go-ahead to plan the production implementation project. I've done this kind of work before, so I know how the functional components of the software. I need a collection of web crawlers to fetch data. I need to extract features from the crawled documents. Those documents need to be segregated into a "training set" and a "classification set", and feature-vectors need to be extracted from each document. Those feature vectors are self-organized into clusters, and the clusters are passed through a series of rebalancing operations. Etc etc etc etc. So I put together a plan, with about 30 unique development/deployment tasks, each with time estimates. The first stage of development -- ignoring some advanced features that we'd like to have in the long-term, but aren't high enough priority to make it into the development schedule yet -- is slated for about two months worth of work. (Keep in mind that I already have a working prototype, so the final implementation is significantly simpler than if the project was starting from scratch.) My manager said the plan looked good to him, but he asked if I could reorganize the tasks into user stories, for a few reasons: (1) our project management software is totally organized around user stories; (2) all of our scheduling is based on fitting entire user stories into sprints, rather than individually scheduling tasks; (3) other teams -- like the web developers -- have made great use of agile methodologies, and they've benefited from modelling all the software features as user stories. So I created a user story at the top level of the project: As a user of the system, I want to search for items by category, so that I can easily find the most relevant items within a huge, complex database. Or maybe a better top-level story for this feature would be: As a content editor, I want to automatically create categorical designations for the items in our database, so that customers can easily find high-value data within our huge, complex database. But that's not the real problem. The tricky part, for me, is figuring out how to create subordinate user stories for the rest of the machine learning architecture. Case in point... I know that the algorithm requires two major architectural subdivisions: (A) training, and (B) classification. And I know that the training portion of the architecture requires construction of a cluster-space. All the Agile Development literature I've read seems to indicate that a user story should be the "smallest possible implementation that provides any business value". And that makes a lot of sense when designing a piece of end-user software. Start small, and then incrementally add value when users demand additional functionality. But a cluster-space, in and of itself, provides zero business value. Nor does a crawler, or a feature-extractor. There's no business value (not for the end-user, or for any of the roles internal to the company) in a partial system. A trained cluster-space is only possible with the crawler and feature extractor, and only relevant if we also develop an accompanying classifier. I suppose it would be possible to create user stories where the subordinate components of the system act as the users in the stories: As a supervised-learning cluster-space construction routine, I want to consume data from a feature extractor, so that I can exist. But that seems really weird. What benefit does it provide me as the developer (or our users, or any other stakeholders, for that matter) to model my user stories like that? Although the main story can be easily divided along architectural-component boundaries (crawler, trainer, classifier, etc), I can't think of any useful decomposition from a user's perspective. What do you guys think? How do you plan Agile user stories for sophisticated, indivisible, non-user-facing components?

Read the article

Search Results

Search found 149 results on 6 pages for 'crawlers'.

Page 6/6 | < Previous Page | 2 3 4 5 6

- by Homunculus Reticulli

- by MIke Mayberry

- by Baumr

- by Bub Bradlee

- by Mantorok

- by Keith

- by chrism

- by ppavlovic

- by Sir Lojik

- by jamiet

- by Rémi

- by hen3ry

- by Jim Jim

- by DrStalker

- by Ash

- by rochal

- by Jan Kuboschek

- by Tom

- by Saif Bechan

- by Sebi

- by Tom

- by Bigtwinz

- by Artem Russakovskii

- by benjismith

< Previous Page | 2 3 4 5 6