robots - Page 9 - Developer IT

Including hostname in apache logwatch reports

- by Robert Munteanu

When hosting multiple domains with apache it's useful to see the logwatch apache output with the virtual host name included, but I only get: --------------------- httpd Begin ------------------------ Requests with error response codes 400 Bad Request /: 1 Time(s) /robots.txt: 1 Time(s) whereas I would like something like --------------------- httpd Begin ------------------------ Requests with error response codes 400 Bad Request example.com/: 1 Time(s) example.org/robots.txt: 1 Time(s) How can I achieve this with logwatch?

Read the article

robocode engine: how to design (write) the runtime engine -- the robot world

- by Bob

IBM has (had) a free learn-Java program called RoboCode, in which custom robots could be written that would then do battle in a 2D space. I would like to write the environment that supports such robots, but don't know what pattern or design to use. Each robot is a thread. Each thread is given a certain (indeterminate) amount of run-time by the JRE. A robot world engine would start by building a list of the players/robots, then enter a loop that allows each player in turn to do whatever it likes: move forward, turn left, fire toward the northwest corner, whatever. Each robot is also informed of events of interest: an enemy or several enemies are within range, it has been hit by a bullet, etc. The combination of handling threads that are also event listeners (are they also producers?) confuses me; it seems that there's more to the game engine. Even some general ideas would help. TIA.

Read the article

Prevent Django from redirecting to add trailing slash

- by konrad

UPDATED: Sorry, it looks like it's Apache that's rewriting it for some reason, not Django. I'll investigate further and post my findings. I need to add a /xmlrpc.php to my Byteflow installation to handle an application that is written for PHP blog engines and uses this hardcoded path. For some reason Byteflow appends a slash to this URL using a 301 Moved Permanently redirect, which breaks the application. It does not do so for the /robots.txt that is configured in a similar way. Relevant lines from the project urls.py: url(r'^xmlrpc.php$', 'django_xmlrpc.views.xmlrpc_handler'), url(r'^robots.txt$', include('robots.urls')), I read that the behavior was changed in the Django codebase in commit 6852 (in 2007) to prevent redirects being done for urls that have been explicitly configured not to contain any trailing slashes. I'm using Django 1.1. I assume that once I have fixed this problem, I should be able to use this application with Byteflow, because the application uses the MetaWeblog XML-RPC API. Any clue?

Read the article

What does upload in Google AppEngine app.yaml do?

- by Great Turtle

I am going through the various Google AppEngine tutorials sometimes, and I just noticed something odd in a StackOverflow question about favicon.ico - specifically this question: http://stackoverflow.com/questions/887328/favicon-ico-not-found-error-in-app-engine - url: /favicon.ico static_files: media/img/favicon.ico upload: media/img/favicon.ico - url: /robots.txt static_files: media/robots.txt upload: media/robots.txt All of the posters included an "upload:" line in their app.yaml definitions The application appears to work the same with or without the upload: line, and I have not seen any mention of it in the official documentation. Where is it used, or what difference does it make if this line is included or not?

Read the article

Google I/O 2010 - Google Wave API design principles

Google I/O 2010 - Google Wave API design principles Google I/O 2010 - Google Wave API design principles + anatomy of a great extension Wave 201 Pamela Fox, Michael Goderbauer (Hasso Plattner Institute) Google Wave is all about collaboration. The most successful extensions are user-friendly and collaborative. Wave robots should be as intuitive to communicate with as a human, and play well with other robots; Wave gadgets should extend the metaphors of the textual collaboration into the visual. In this talk, we'll discuss the design and privacy principles you should consider while building extensions, and show examples of extensions that demonstrate these principles. For all I/O 2010 sessions, please go to code.google.com/events/io/2010/sessions.html From: GoogleDevelopers Views: 6 0 ratings Time: 01:01:54 More in Science & Technology

Read the article

Microsoft TechEd 2010 - Day 2 @ Bangalore

- by sathya

Microsoft TechEd 2010 - Day 2 @ Bangalore Today is the day 2 @ Microsoft TechEd 2010. We had lot of technical sessions as usual there were many tracks going on side by side and I was attending the Web simplified track, Which comprised of the following sessions : Developing a scalable Media Application using ASP.NET MVC - This was a kind of little advanced stuff. Anyways I couldn't understand much because this was not my piece of cake and I havent worked on this before ASP.Net MVC Unplugged - This was really great because this session covered from the basics of MVC showing what is Model,View and Controller and how it worked and the speaker went into the details of the same. Building RESTful Applications with the Open Data Protocol - There were some concepts explained about this from the basics on how to build RESTful Services and it went on till some advanced configurations of the same. Developing Scalable Web Applications with AppFabric Caching - This session showed about the integration of AppFabric with the .Net Web Applications. Instead of using Inproc Sessions, we can use this AppFabric as a substitute for Caching and outofProc Session Storage without writing code and doing a little bit of configurations which brings in High Scalability, performance to our applications. (But unfortunately there were no demos for this session ) Deep Dive : WCF RIA Services - This session was also an interactive one, in this the speaker presented from the basics of WCF and took a Book Store Application as a sample and explained all details concepts on linking with RIA Services Apart from these sessions, in between there happened some small events in the breaks like Some discussions about Technology, Innovations Music Jokes Mimicry, etc. And on doing all these things, the developers were given some kool gifts / goodies like USBs, T-Shirts, etc. And today I got a chance to do the following certification : (70-562) Microsoft Certified Technology Specialist in .NET 3.5 Web Applications Since I already have an MCTS in .NET 2.0, I wanted to do an MCPD and for doing the same I was required to do an update to my MCTS with the .NET 3.5 framework and I did the same I cleared it and now am an MCTS in .NET 3.5 Web Apps And on doing this I got a T-Shirt and they gave something called Learning $ of worth 30$. And in various stalls for attending each quiz or some game or some referrals we got some Learning $ which we can redeem later based on our Total Learning $. I got 105 $ which i was able to redeem and got a Microsoft Learning BagPack, 1 free Microsoft certification offer, a laptop light and an e-learning content activated. And after all these sessions and small events, we had something called Demo Extravaganza like I mentioned yesterday. This was a great funfilled event with lot of goodies for the attendees. There were some lucky draw which enabled 2 attendees to get Netbooks (Sponsored by Intel) and 1 attendee to get X-box (Sponsored by Citrix). After Choosing the raffle in the lucky draw they kept it on a device called Microsoft Surface which is a kind of big touch screen device and on putting the raffle on that it detected the code of the attendee and said intelligently how many sessions that person has attended and if he has attended more than 5 he got a Netbook and this was coded by a guy called Imran. Apart from they showed demos on : Research by 2 Tamilnadu students from Krishna Arts and Science college, taken 1200 photographs of their college from different angles and put that up in Bing maps using silverlight and linked with Photosynth, which showed a 3d view of their college based on the photos they uploaded Reasearch by Microsoft on Panaramic HD views of the images. One young guy from Microsoft Research showed a demo of this on Srivilliputhur Andal Temple, in Tamil Nadu and its history with a panoramic view of the temple and the near by places with narration of the historical information on the same and with the videos embedded in it with high definition images which we can zoom to a very detailed level. Some Demo on a business app with Silverlight, Business Intelligence (BI) and maps integrated. It showed the sales of a particular product across locations. Some kool demos by 2 geeks who used Robots to show their development talents. 2 Robots fought with each other 2 Robots danced in sync for the A.R. Rehman song Humma Humma... A dream home project by Raman. He is currently using the same in his home too. Robots are controlling his home currently. They showed a video on this. Here are the list of activities that Robot does for him When he reads a book, robot automatically scans that and shows that image of that person in the screen (TV or comp) in front of him. It shows a wikipedia about that person. It says that person is not in linked in. do you want to add him If he sees an IPL Match news in the book and smiles it understands he is interested in that and opens a website related to that and shows the current game and the scorecard. It cooks for him It cleans the room for him whenever he leaves the house when he is doing something if some intruder comes inside his house his computer automatically switches his screen showing the video of the person coming inside. When he wakes up it automatically opens up the system, loads his mails and the news by the side, etc. Some Demos on Microsoft Pivot. This was there in livelabs but it is now available in getpivot.com its a pivoting of the pictorial data based on some categories and filters on the searches that we do. And finally on filling up some feedback forms we got T-Shirts and Microsoft Visual Studio 2010 Training Kit CDs. Whats more on TechEd??? Stay tuned!!! Will update you soon on the other happenings!! PS : I typed a lot of content for more than a hour but I pressed a backspace and it went to the previous page and all my content were lost and I was not able to retrieve the same and I typed everything again.

Read the article

Is there any reason to allow Yahoo! Slurp to crawl my site?

- by James Skemp

I thought a year or more ago Yahoo! would be using another search engine for results, and no longer using their own Slurp bot. However, a couple of the sites I manage Yahoo! Slurp continues to crawl pages, and seems to ignore the Gone status code when returned (as it keeps coming back). Is there any reason why I wouldn't want to block Yahoo! Slurp via robots.txt or by IP (since it tends to ignore robots.txt in some cases anyways)? I've confirmed that when the bot does hit it is from Yahoo! IPs, so I believe this is a legit instance of the bot. Is Yahoo Search the same as Bing Search now? is a related question, but I don't think it completely answers whether one should add a new block of the bot.

Read the article

Little Wheel Is An Atmospheric and Engaging Point-and-Click Adventure

- by Jason Fitzpatrick

If you’re a fan of the resurgence of highly stylized and atmospheric adventure games–such as Spirit, World of Goo, and the like–you’ll definitely want to check out this well executed, free, and more than a little bit charming browser-based game. Little Wheel is set in a world of robots where, 10,000 years ago, a terrible accident at the central power plant left all the robots without power. The entire robot world went into a deep sleep and now, thanks to a freak lightning strike, one little robot has woken up. Your job, as that little robot, is to navigate the world of Little Wheel and help bring it back to life. Hit up the link below to play the game for free–the quality of the visual and audio design make going full screen and turning the speakers on a must. Little Wheel [via Freeware Genuis] How to Make Your Laptop Choose a Wired Connection Instead of Wireless HTG Explains: What Is Two-Factor Authentication and Should I Be Using It? HTG Explains: What Is Windows RT and What Does It Mean To Me?

Read the article

Are Meta tags useful for SEO?

- by Lynda

Reading a Search Results with this or a similar phrased question can lead to reading a lot of conflicting answers. Are there any meta tags that matter in SEO? From what I have read I do know that meta keywords are no longer used (or so little it is not worth using them) and don't worry with using them. Meta Description tags are not used for page ranking but can effect click through rates so should be used but be less than 160 characters. I know the following meta tags exist: author - the author's name and possibly email address robots - to allow or disallow indexing by robots copyright - the copyright date of the page How much do these meta tags matter and are there others that are new (including ones that may not be used by all but might be used in the future or are used by only one of the big players like Google or Bing) meta tags that should be included? Note: Even if a meta tag doesn't matter in SEO but helps with click through rates similar to the descript tag does then feel free to include it with your answer.

Read the article

How to link subprograms to a main program's game loop?

- by Jim

I recently discovered Crobot which is (briefly) a game where each player codes a virtual robot in a pseudo-C language. Each robot is then put in an arena where it fights against other robots. A robots' source code has this shape : /* Beginning file robot.r */ main() { while (1) { /* Do whatever you want */ ... move(); ... fire(); } } /* End file robot.r */ You can see that : The code is totally independent from any library/include Some predefined functions are available (move, fire, etc…) The program has its own game loop, and consequently is not called every frame My question is roughly : how does it work ? It seems that each robot's code is compiled by the main program and then used in a way I cannot understand. I thought it could yields a thread for each robot, but I have not any proof of this and it seems a bit complicated to achieve it. Any idea how it could work, someone ?

Read the article

How to get scripted programs governing game entities run in parallel with a game loop?

- by Jim

I recently discovered Crobot which is (briefly) a game where each player codes a virtual robot in a pseudo-C language. Each robot is then put in an arena where it fights against other robots. A robots' source code has this shape : /* Beginning file robot.r */ main() { while (1) { /* Do whatever you want */ ... move(); ... fire(); } } /* End file robot.r */ You can see that : The code is totally independent from any library/include Some predefined functions are available (move, fire, etc…) The program has its own game loop, and consequently is not called every frame My question is: How to achieve a similar result using scripted languages in collaboration with a C/C++ main program ? I found a possible approach using Python, multi-threading and shared memory, although I am not sure yet that it is possible this way. TCP/IP seems a bit too complicated for this kind of application.

Read the article

How can I clone or mirror a site without SEO penalties for duplicate content?

- by Amanda

I am a web developer and I want to create clones of the sites I've developed for clients, so that I have an "original copy" on a subdomain of my own website, so that I can showcase my work to new clients. What is the best way to not get my clients original websites penalised for duplicate content? I am planning to have a robots.txt file that disallows all robots, as well as using <link href="http://www.client-canonical-site.com/" rel="canonical" /> in the <head> of the pages. Is that sufficient? Should I use rel=nofollow on all the links as well?

Read the article

best/simplest way to inform search engine of sitemap location

- by Don

AFAIK, there are 2 ways to make search engines aware of a sitemap's location: Include an absolute link to it in robots.txt Submit it to them directly. The relevant URLs are: http://www.google.com/webmasters/tools/ping?sitemap=SITEMAP_URL http://www.bing.com/webmaster/ping.aspx?sitemap=SITEMAP_URL Where SITEMAP_URL is the absolute URL of the sitemap. Currently, I do both. Regarding (2), I have a job that runs automatically every day which submits the sitemap to Bing and Google. I don't think there's any reason to do (1) and (2), but I'm paranoid, so I do. I imagine you can avoid both (1) and (2) if you just make your sitemap accessible at a conventional URL (like robots.txt). What's the simplest and most reliable way to ensure that search engines can find your sitemap?

Read the article

How can I clone or mirror a site without SEO penalties for duplicate content?

- by Amanda

I am a web developer and I want to create clones of the sites I've developed for clients, so that I have an "original copy" on a subdomain of my own website, so that I can showcase my work to new clients. What is the best way to not get my clients original websites penalised for duplicate content? I am planning to have a robots.txt file that disallows all robots, as well as using <link href="http://www.client-canonical-site.com/" rel="canonical" /> in the <head> of the pages. Is that sufficient? Should I use rel=nofollow on all the links as well?

Read the article

Google indexed my home page incorrectly: How can I fix it?

- by louis_coetzee

I finished my website and launched it, I think I had a problem with my robots.txt - so I changed it to look like this: 03/08/2012 # Allows all bots Sitemap: http://www.mysite.co.za/sitemap.xml User-agent: * Disallow: /dashboard/ When I google my domain.co.za - I get this back: Home A description for this result is not available because of this site's robots.txt – learn more. You've visited this page 3 times. Last visit: 2012/08/15 Now since I fixed this and added a 301 redirect to redirect mysite.co.za to www.mysite.co.za I would love it if google bot would come do a visit. Is there anything I can do to get this fixed?

Read the article

Why are new pages not being indexed and old pages stay in the index?

- by ZakGottlieb

I currently have a site that was recently restructured, causing much of its content to be reposted, creating new URL's for each page. To avoid duplicates, all of the existing pages were added to the robots file. That said, it has now been over a week - I know Google has recrawled the site - and when I search for term X, it is stil the old page that is ranking, with the new one nowhere to be seen. I'm assuming it's a cached version, but why are so many of the old pages still appearing in the index? Furthermore, all "tags" pages (it's a Q&A site, like this one) were also added to the robots a few months ago, yet I think they are all still appearing in the index. Anyone got any ideas about why this is happening, and how I can get my new pages indexed?

Read the article

How can I automatically restart Apache and Varnish if can't fetch a file?

- by Tyler

I need to restart Apache and Varnish and email some logs when the script can't fetch robots.txt but I am getting an error ./healthcheck: 43 [[: not found My server is Ubuntu 12.04 64-bit #!/bin/sh # Check if can fetch robots.txt if not then restart Apache and Varnish # Send last few lines of logs with date via email PATH=/bin:/usr/bin THEDIR=/tmp/web-server-health [email protected] mkdir -p $THEDIR if ( wget --timeout=30 -q -P $THEDIR http://website.com/robots.txt ) then # we are up touch ~/.apache-was-up else # down! but if it was down already, don't keep spamming if [[ -f ~/.apache-was-up ]] then # write a nice e-mail echo -n "Web server down at " > $THEDIR/mail date >> $THEDIR/mail echo >> $THEDIR/mail echo "Apache Log:" >> $THEDIR/mail tail -n 30 /var/log/apache2/error.log >> $THEDIR/mail echo >> $THEDIR/mail echo "AUTH Log:" >> $THEDIR/mail tail -n 30 /var/log/auth.log >> $THEDIR/mail echo >> $THEDIR/mail # kick apache echo "Now kicking apache..." >> $THEDIR/mail /etc/init.d/varnish stop >> $THEDIR/mail 2>&1 killall -9 varnishd >> $THEDIR/mail 2>&1 /etc/init.d/varnish start >> $THEDIR/mail 2>&1 /etc/init.d/apache2 stop >> $THEDIR/mail 2>&1 killall -9 apache2 >> $THEDIR/mail 2>&1 /etc/init.d/apache2 start >> $THEDIR/mail 2>&1 # prepare the mail echo >> $THEDIR/mail echo "Good luck troubleshooting!" >> $THEDIR/mail # send the mail sendemail -o message-content-type=html -f [email protected] -t $EMAIL -u ALARM -m < $THEDIR/mail rm ~/.apache-was-up fi fi rm -rf $THEDIR

Read the article

Detecting 'stealth' web-crawlers

- by Jacco

What options are there to detect web-crawlers that do not want to be detected? (I know that listing detection techniques will allow the smart stealth-crawler programmer to make a better spider, but I do not think that we will ever be able to block smart stealth-crawlers anyway, only the ones that make mistakes.) I'm not talking about the nice crawlers such as googlebot and Yahoo! Slurp. I consider a bot nice if it: identifies itself as a bot in the user agent string reads robots.txt (and obeys it) I'm talking about the bad crawlers, hiding behind common user agents, using my bandwidth and never giving me anything in return. There are some trapdoors that can be constructed updated list (thanks Chris, gs): Adding a directory only listed (marked as disallow) in the robots.txt, Adding invisible links (possibly marked as rel="nofollow"?), style="display: none;" on link or parent container placed underneath another element with higher z-index detect who doesn't understand CaPiTaLiSaTioN, detect who tries to post replies but always fail the Captcha. detect GET requests to POST-only resources detect interval between requests detect order of pages requested detect who (consistently) requests https resources over http detect who does not request image file (this in combination with a list of user-agents of known image capable browsers works surprisingly nice) Some traps would be triggered by both 'good' and 'bad' bots. you could combine those with a whitelist: It trigger a trap It request robots.txt? It doest not trigger another trap because it obeyed robots.txt One other important thing here is: Please consider blind people using a screen readers: give people a way to contact you, or solve a (non-image) Captcha to continue browsing. What methods are there to automatically detect the web crawlers trying to mask themselves as normal human visitors. Update The question is not: How do I catch every crawler. The question is: How can I maximize the chance of detecting a crawler. Some spiders are really good, and actually parse and understand html, xhtml, css javascript, VB script etc... I have no illusions: I won't be able to beat them. You would however be surprised how stupid some crawlers are. With the best example of stupidity (in my opinion) being: cast all URLs to lower case before requesting them. And then there is a whole bunch of crawlers that are just 'not good enough' to avoid the various trapdoors.

Read the article

Best practice for setting up wsgi on root directory?

- by Timmy

what's the best ways to mix static files and wsgi app served on the root directory? http://code.google.com/p/modwsgi/wiki/QuickConfigurationGuide recommends setting up WSGIScriptAlias / /usr/local/www/wsgi-scripts/myapp.wsgi and alias other directories and files: Alias /robots.txt /usr/local/www/documents/robots.txt Alias /favicon.ico /usr/local/www/documents/favicon.ico Alias /media/ /usr/local/www/documents/media/ is there a cleaner way to do this?

Read the article

Java loading user-specified classes at runtime

- by user349043

I'm working on robot simulation in Java (a Swing application). I have an abstract class "Robot" from which different types of Robots are derived, e.g. public class StupidRobot extends Robot { int m_stupidness; int m_insanityLevel; ... } public class AngryRobot extends Robot { float m_aggression; ... } As you can see, each Robot subclass has a different set of parameters. What I would like to do is control the simulation setup in the initial UI. Choose the number and type of Robots, give it a name, fill in the parameters etc. This is one of those times where being such a dinosaur programmer, and new to Java, I wonder if there is some higher level stuff/thinking that could help me here. So here is what I've got: (1) User Interface Scrolling list of Robot types on the left. "Add " and "<< Remove" buttons in the middle. Default-named scrolling list of Robots on the right. "Set Parameters" button underneath. (So if you wanted an AngryRobot, you'd select AngryRobot on the left list, click "Add" and "AngryRobot1" would show up on the right.) When selecting a Robot on the right, click "Set Parameters..." button which would call yet another model dialog where you'd fill in the parameters. Different dialog called for each Robot type. (2) Data structures an implementation As an end-product I think a HashMap would be most convenient. The keys would be Robot types and the accompanying object would be all of the parameters. The initializer could just retrieve each item one and a time and instantiate. Here's what the data structures would look like: enum ROBOT_TYPE {STUPID, ANGRY, etc} public class RobotInitializer { public ROBOT_TYPE m_type; public string m_name; public int[] m_int_params; public float[] m_float_params; etc. The initializer's constructor would create the appropriate length parameter arrays based on the type: public RobotInitializer(ROBOT_TYPE type, int[] int_array, float[] float_array, etc){ switch (type){ case STUPID: m_int_params = new int[STUPID_INT_PARAM_LENGTH]; System.arraycopy(int_array,0,m_int_params,0,STUPID_INT_PARAM_LENGTH); etc. Once all the RobotInitializers are instantiated, they are added to the HashMap. Iterating through the HashMap, the simulation initializer takes items from the Hashmap and instantiates the appropriate Robots. Is this reasonable? If not, how can it be improved? Thanks

Read the article

How can I get FOUND_ROW()s from an active record object in rails?

- by Justin Tanner

When querying the database with: @robots = Robot.all(:limit => 50, :offset => 0) What is the best way to get the total number of rows without the :limit? In raw MySQL you could do something like this: SELECT SQL_CALC_FOUND_ROWS * FROM robots LIMIT 0, 50 SELECT FOUND_ROWS();

Read the article

Drupal Modules for SEO & Content

- by Aditi

When we talk about Drupal SEO, there are two things to consider one is about the relevant SEO practices and about appropriate Drupal Modules available. Optimizing your website for search engines is one of the most important aspect of launching & promoting your website especially if ranking matters to you. Understanding SEO For starters, you have begin with Keyword research and then optimize your content according to your findings by tagging, meta tags etc, Drupal modules once installed help you manage a lot of such parameters. Identifying the target keywords Using the Page Title and Token modules PathAuto configuration <H1> heading tags Optimizing Drupal’s default robots.txt file Etc. While Drupal gives you a lot of ability to make your website content worthy & search engine friendly it is important for you to make sure you are not crossing the line or you could get penalized. Modules Overview Drupal Power is at its best when you have these modules & great brain working together. The basic SEO improvements can be achieved easily with the modules enlisted below, but you can win magical rankings if you use them logically & wisely. Understanding your keyword competition & enhancing your content is the basic key to success and ofcourse the modules: Pathauto Automatically create search enging friendly readable URLS from tokens. A token is a piece of data from content, say the author’s username, or the content’s title. For example mysite.com/an-article, rather than mysite.com/node/114 for every node you make. NodeWords Amazingly useful drupal module that allows you to create custom meta tags and descriptions for your nodes, which gives you the ability to target specific keywords and phrases. Page Title Enables you to set an alternative title for the <title></title> tags and for the <h1></h1> tags on a node. Global Redirect Manage content duplication, 301 redirects, and URL validation with this small, but powerful module. Taxonomy manager Make large additions, or changes to taxonomy very easy. This module provides a powerful interface for managing taxonomies. A vocabulary gets displayed in a dynamic tree view, where parent terms can be expanded to list their nested child terms or can be collapsed. robotstxt A robots.txt file is vital for ensuring that search engine spiders don’t index the unwanted areas of your site. This Drupal module gives you the ability to manage your robots.txt file through the CMS admin. xmlsitemap An XML Sitemap lets the search engines index your website content. This module helps in generating and maintaining a complete sitemap for your website and gives you control over exactly which parts of the site you want to be included in the index. It even gives you the ability to automatically submit your sitemap to Google, Yahoo!, Ask.com and Windows Live every time you update a node or at specific interval. Node Import This module allows you to import a set of nodes from a Comma Seperated Values (CSV) or Tab Seperated Values (TSV) text file. Makes it easy to import hundreds-thousands of csv rows and you get to tie up these rows to CCK fields (or locations), and it can file it under the right taxonomy hierarchy. This is Super life saver module.

Read the article

Trouble with piping through sed

- by Joel

I am having trouble piping through sed. Once I have piped output to sed, I cannot pipe the output of sed elsewhere. wget -r -nv http://127.0.0.1:3000/test.html Outputs: 2010-03-12 04:41:48 URL:http://127.0.0.1:3000/test.html [99/99] -> "127.0.0.1:3000/test.html" [1] 2010-03-12 04:41:48 URL:http://127.0.0.1:3000/robots.txt [83/83] -> "127.0.0.1:3000/robots.txt" [1] 2010-03-12 04:41:48 URL:http://127.0.0.1:3000/shop [22818/22818] -> "127.0.0.1:3000/shop.29" [1] I pipe the output through sed to get a clean list of URLs: wget -r -nv http://127.0.0.1:3000/test.html 2>&1 | grep --line-buffered -v ERROR | sed 's/^.*URL:$[^ ]*$.*/\1/g' Outputs: http://127.0.0.1:3000/test.html http://127.0.0.1:3000/robots.txt http://127.0.0.1:3000/shop I would like to then dump the output to file, so I do this: wget -r -nv http://127.0.0.1:3000/test.html 2>&1 | grep --line-buffered -v ERROR | sed 's/^.*URL:$[^ ]*$.*/\1/g' > /tmp/DUMP_FILE I interrupt the process after a few seconds and check the file, yet it is empty. Interesting, the following yields no output (same as above, but piping sed output through cat): wget -r -nv http://127.0.0.1:3000/test.html 2>&1 | grep --line-buffered -v ERROR | sed 's/^.*URL:$[^ ]*$.*/\1/g' | cat Why can I not pipe the output of sed to another program like cat?

Read the article

cannot get FilesMatch and mod_rewrite working together

- by neil

Hello, I'm having a little difficulty with my .htaccess when combining a password protection on a file and a mod_rewrite: If I only have this as my .htaccess <FilesMatch "manage.php"> AuthName "Member Only" AuthType Basic AuthUserFile /home/myuser/.htpasswd require valid-user </FilesMatch> It password protects manage.php fine. If I have this as my .htaccess RewriteEngine on RewriteCond $1 !^(index\.php|manage\.php|robots\.txt) RewriteCond %{REQUEST_FILENAME} !-f RewriteCond %{REQUEST_FILENAME} !-d RewriteRule ^(.*)$ index.php/$1 [L,QSA] It works as expected: robots.txt, index.php and manage.php are left alone and are called as normal, everything else is redirected to index.php/$1 HOWEVER, If I combine them then when I visit manage.php it gets redirected as index.php/$1, i.e. the FilesMatch is causing the entry in rewritecond to be ignored. If I actually enter a password in the original test and get to the file, it works in this last test. So I guess something is up with the rewrite and the file match. i.e. entering this causes .\manage.php to be called as .\index.php\manage.php <FilesMatch "manage.php"> AuthName "Member Only" AuthType Basic AuthUserFile /home/myuser/.htpasswd require valid-user </FilesMatch> RewriteEngine on RewriteCond $1 !^(index\.php|manage\.php|robots\.txt) RewriteCond %{REQUEST_FILENAME} !-f RewriteCond %{REQUEST_FILENAME} !-d RewriteRule ^(.*)$ index.php/$1 [L,QSA] Thanks for any help :)

Read the article

recursive wget with hotlinked requisites

- by dongle

I often use wget to mirror very large websites. Sites that contain hotlinked content (be it images, video, css, js) pose a problem, as I seem unable to specify that I would like wget to grab page requisites that are on other hosts, without having the crawl also follow hyperlinks to other hosts. For example, let's look at this page https://dl.dropbox.com/u/11471672/wget-all-the-things.html Let's pretend that this is a large site that I would like to completely mirror, including all page requisites – including those that are hotlinked. wget -e robots=off -r -l inf -pk ^^ gets everything but the hotlinked image wget -e robots=off -r -l inf -pk -H ^^ gets everything, including hotlinked image, but goes wildly out of control, proceeding to download the entire web wget -e robots=off -r -l inf -pk -H --ignore-tags=a ^^ gets the first page, including both hotlinked and local image, does not follow the hyperlink to the site outside of scope, but obviously also does not follow the hyperlink to the next page of the site. I know that there are various other tools and methods of accomplishing this (HTTrack and Heritrix allow for the user to make a distinction between hotlinked content on other hosts vs hyperlinks to other hosts) but I'd like to see if this is possible with wget. Ideally this would not be done in post-processing, as I would like the external content, requests, and headers to be included in the WARC file I'm outputting.

Search Results

Search found 499 results on 20 pages for 'robots'.

Page 9/20 | < Previous Page | 5 6 7 8 9 10 11 12 13 14 15 16 | Next Page >

- by Robert Munteanu

- by Bob

- by konrad

- by Great Turtle

- by sathya

- by James Skemp

- by Jason Fitzpatrick

- by Lynda

- by Jim

- by Jim

- by Amanda

- by Don

- by Amanda

- by louis_coetzee

- by ZakGottlieb

- by Tyler

- by Jacco

- by Timmy

- by user349043

- by Justin Tanner

- by Aditi

- by Joel

- by neil

- by dongle

< Previous Page | 5 6 7 8 9 10 11 12 13 14 15 16 | Next Page >