Search Results

Search found 178 results on 8 pages for 'hobo spider'.

Page 7/8 | < Previous Page | 3 4 5 6 7 8 | Next Page >

Unknown Apache2 + PHP5 FastCGI 500 error .. caused by search engine bots?

- by rdjurovich

My Ubuntu server is configured with Apache 2.2.8 and PHP 5.2.4-2ubuntu5.18 in FastCGI mode. Everything works well, except I am seeing 500 errors that only seem to come from bots accessing the server.. for example (access.log): x.125.71.104 - - [16/Nov/2011:10:27:39 +1100] "GET / HTTP/1.1" 500 41377 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" x.40.103.239 - - [16/Nov/2011:11:05:56 +1100] "GET / HTTP/1.0" 500 14717 "-" "Mozilla/5.0 (compatible; mon.itor.us - free monitoring service; http://mon.itor.us)" x.249.67.114 - - [14/Nov/2011:20:57:17 +1100] "GET / HTTP/1.1" 500 101 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" x.55.39.85 - - [14/Nov/2011:19:31:06 +1100] "GET / HTTP/1.1" 500 7032 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)._" It is my understanding that a 500 error will be thrown when the PHP process fails to respond to Apache, which could be caused by a fatal PHP error or if PHP runs out of processes.. so my assumption is that either the bots are hitting the server too hard, killing the PHP processes, or something in the request header from bots is causing a fatal error in my PHP script? If anyone can offer advice on this it would be greatly appreciated! Ryan

Read the article
Software to automate website screenshot capture

- by Leniel Macaferi

Do you know any software that can automate the process of getting screenshots of every page of a website? It would act like a spider/crawler/robot. You name it... For example: I developed a website and now I'd like to get a screenshot of every page of the site. I of course could do it manually (a lot of work). For each module of the site (Student, Payment, etc) I have different pages (Create, Edit, Details, Delete, etc) forms. The thing I'm looking for is a software that can visit every link of the site and then capture the screen - a software that can automate the whole process. It would also be good if the software allowed the user to pass a list of URLs to capture screenshots allowing even more fine grained configuration. EDIT: I tried Selenium mentioned by Aaron in his answer but I managed to find an app that does exactly what I needed. It's called Paparazzi!. I wrote a blog post to showcase my attempt at Selenium and the findings regarding Paparazzi!'s batch capture functionality: Software to automate website screenshot capture

Read the article
illegitimate traffic from user agent Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)

- by user114293

Since the beginning of the year, I'm getting a lot of traffic with the user agent Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729). My access logs show 40% - 60% from that user agent. That's strange because the user agent states a Firefox 3.0.10 browser (is anybody using that browser in 2012? Definitely not 40%-60% of visitors on a normal website). Also, the logs show that this user agent only requested the HTML document and no referenced assets like images, css, js files. I checked the IPs of those requests (with that UA). It's coming from all over the world. I recognized that those IPs sometimes have a mobile user agent. So my suspicion is a mobile app that is doing a lot of "spider requests" - but if that would be the case than other web sites should have the same problem. That's actually my question: Does anybody experience same/similar problems?

Read the article
Strange Domain name under the same IP Address

- by Mike Chip

There's something really weird happening in my server. But first things first: I wanted to have my website and chose the domain name "myowndomain.com", Now on my domain registrar I point "myowndomain.com" to the address of my recently setup VPS, let's say 50.50.50.50 So I installed everything I needed to run my website, and I started to notice strange queries coming from different IP Addresses. Like these [client 123.123.123.123] File does not exist: /var/www/html/api, referer: http://www.strangedomain.com/api/manyou/my.php [client 456.456.456.456] File does not exist: /var/www/html/api, referer: http://www.strangedomain.com/api/manyou/my.php or like this (Really a long line, I cut some things) GET /?s=vod-show-id-22-area-%E5%85%B6%E4%BB%96-language-%E9%9F%A9%E8%AF%AD.html HTTP/1.1" 301 295 "http://v.strangedomain.com/?s=vod-s ...[cut]... spider" That above is happening the most. The 'strangedomain.com' returns the same IP address of my VPS which my website is hosted on. The whois of such domain shows it's registered to a chinese. But the street name didn't look so right (like a huge single word), so I think all of that info might be fake, but still might be a chinese. I also noticed that all 'clients' trying to access the 'strangedomain.com' is coming from china. If I type in the browser 'strangedomain.com', I see my website. I'm worried, because my website is actually an e-commerce. I don't know if 'strangedomain.com' WAS a website on 50.50.50.50 in the not so far past, or if it's something else.

Read the article
How to avoid duplicates when copying files that have been renamed at the destination

- by Benoitt

I have to get pictures from a folder – with subfolders which are updated automatically – with their extensions. These files have to be copied in a folder where a website based on PHP will edit them (by renaming and creating an XML file) to be downloadable and integrated in an XML feed. Because of the rename function of the script, when I perform the copy gain, all the files are duplicated, because the script has renamed the original ones already. I've tried a few things with rsync but I'm looking for something more powerful because I can't copy files with an external "history". #!/bin/bash find '/home/name/picture' -name '*.jpg' | while read FILE ; do rsync --backup --backup-dir=incremental --suffix=.old "$FILE" /var/www/media ; done wget --spider 'http://myscript.php' ; #exit 0 PS: As a little addition, I'd like to replace '.' with a 'space' just after the *.jpeg copy. My PHP script has some problem to define files with comma because of the extension. I'm finking about a command with find – like I did before – with a sed function? Is that a good idea?

Read the article
How do I get an overview and a methodology for programming in Python

- by Peter Nielsen

I've started to learn Python and programming from scratch. I have not programmed before so it's a new experience. I do seem to grasp most of the concepts, from variables to definitions and modules. I still need to learn a lot more about what the different libraries and modules do and also I lack knowledge on OOP and classes in Python. I see people who just program in Python like that's all they have ever done and I am still just coming to grips with it. Is there a way, some tools, a logical methodology that would give me an overview or a good hold of how to handle programming problems ? For instance, I'm trying to create a parser which we need at the office . I also need to create a spider that would collect links from various websites. Is there a formidable way of studying the various modules to see what is needed ? Or is it just nose to the grind stone and understand what the documentation says ? Sorry for the lengthy question..

Read the article
Web application architecture, and application servers?

- by seanieb

Hi, I'm building a web application, and I need to use an architecture that allows me to run it on two servers. The application scrapes information from other sites periodically, and on input from the end user. To do this I'm using Php+curl to scrape the information, Php or python to parse it and store the results in a MySQLDB. Then I will use Python to run some algorithms on the data, this will happen both periodically and on input from the end user. I'm going to cache some of the results in the MySQL DB and sometimes if it is specific to the user, skip storing the data and serve it to the user. I'm think of using Php for the website front end on a separate web server, running the Php spider, MySQL DB and python on another server. As you can see I'm fairly clueless. I'm familiar with using Php, MySQL and the basics of Python, but bringing this all together using something more complex than a cron job is new to me. How do go about implementing this? What frame work(s) should I use? Is MVC a good architecture for this? (I'm new to MVC, architectures etc.) Is Cakephp a good solution? If so will I be able to control and monitor the Python code using it?

Read the article
Hide a single content block from search engines?

- by jonas

A header is automatically added on top of each content URL, but its not relevant for search and messing up the all the results beeing the first line of every page (in the code its the last line but visually its the first, which google is able to notice) Solution1: You could put the header (content to exculde from google searches) in an iframe with a static url domain.com/header.html and a <meta name="robots" content="noindex" /> ? - are there takeoffs of this solution? Solution2: You could deliver it conditionally by apache mod rewrite, php or javascript -takeoff(?): google does not like it? will google ever try pages with a standard users's useragent and compare? -takeoff: The hidden content will be missing in the google cache version as well... example: add-header.php: <?php $path = $_GET['path']; echo file_get_contents($_SERVER["DOCUMENT_ROOT"].$path); ?> apache virtual host config: RewriteCond %{HTTP_USER_AGENT} !.*spider.* [NC] RewriteCond %{HTTP_USER_AGENT} !Yahoo.* [NC] RewriteCond %{HTTP_USER_AGENT} !Bing.* [NC] RewriteCond %{HTTP_USER_AGENT} !Yandex.* [NC] RewriteCond %{HTTP_USER_AGENT} !Baidu.* [NC] RewriteCond %{HTTP_USER_AGENT} !.*bot.* [NC] RewriteCond %{SCRIPT_FILENAME} \.htm$ [NC,OR] RewriteCond %{SCRIPT_FILENAME} \.html$ [NC,OR] RewriteCond %{SCRIPT_FILENAME} \.php$ [NC] RewriteRule ^(.*)$ /var/www/add-header.php?path=%1 [L]

Read the article
Accessing bitmap array in another class? C#

- by Marius Mathisen

I have this array : Bitmap[] bildeListe = new Bitmap[21]; bildeListe[0] = Properties.Resources.ål; bildeListe[1] = Properties.Resources.ant; bildeListe[2] = Properties.Resources.bird; bildeListe[3] = Properties.Resources.bear; bildeListe[4] = Properties.Resources.butterfly; bildeListe[5] = Properties.Resources.cat; bildeListe[6] = Properties.Resources.chicken; bildeListe[7] = Properties.Resources.dog; bildeListe[8] = Properties.Resources.elephant; bildeListe[9] = Properties.Resources.fish; bildeListe[10] = Properties.Resources.goat; bildeListe[11] = Properties.Resources.horse; bildeListe[12] = Properties.Resources.ladybug; bildeListe[13] = Properties.Resources.lion; bildeListe[14] = Properties.Resources.moose; bildeListe[15] = Properties.Resources.polarbear; bildeListe[16] = Properties.Resources.reke; bildeListe[17] = Properties.Resources.sheep; bildeListe[18] = Properties.Resources.snake; bildeListe[19] = Properties.Resources.spider; bildeListe[20] = Properties.Resources.turtle; I want that array and it´s content in a diffenrent class, and access it from my main form. I don´t know if should use method, function or what to use with arrays. Are there some good way for me to access for instanse bildeListe[0] in my new class?

Read the article
Proper API Design for Version Independence?

- by Justavian

I've inherited an enormous .NET solution of about 200 projects. There are now some developers who wish to start adding their own components into our application, which will require that we begin exposing functionality via an API. The major problem with that, of course, is that the solution we've got on our hands contains such a spider web of dependencies that we have to be careful to avoid sabotaging the API every time there's a minor change somewhere in the app. We'd also like to be able to incrementally expose new functionality without destroying any previous third party apps. I have a way to solve this problem, but i'm not sure it's the ideal way - i was looking for other ideas. My plan would be to essentially have three dlls. APIServer_1_0.dll - this would be the dll with all of the dependencies. APIClient_1_0.dll - this would be the dll our developers would actual refer to. No references to any of the mess in our solution. APISupport_1_0.dll - this would contain the interfaces which would allow the client piece to dynamically load the "server" component and perform whatever functions are required. Both of the above dlls would depend upon this. It would be the only dll that the "client" piece refers to. I initially arrived at this design, because the way in which we do inter process communication between windows services is sort of similar (except that the client talks to the server via named pipes, rather than dynamically loading dlls). While i'm fairly certain i can make this work, i'm curious to know if there are better ways to accomplish the same task.

Read the article
Memory leak in PHP with Symfony + Doctrine

- by Nicklas Ansman

I'm using PHP with the Symfony framework (with Doctrine as my ORM) to build a spider that crawls some sites. My problem is that the following code generates a memory leak: $q = $this -> createQuery('Product p'); if($store) { $q -> andWhere('p.store_id = ?', $store -> getId()) -> limit(1); } $q -> andWhere('p.name = ?', $name); $data = $q -> execute(); $q -> free(true); $data -> free(true); return NULL; This code is placed in a subclass of Doctrine_Table. If I comment out the execute part (and of course the $data - free(true)) the leak stops. This has led me to the conclusion that it's the Doctrine_Collection that's causing the leak. Any ideas on what I can do to fix this? I'd be happy to provide more info on the problem if you need it. Regards Nicklas

Read the article
SEO Problem for new dictionary site, google hasn't indexed content.

- by John

I loaded about 15,000 pages, letters A & B of a dictionary and submitted to google a text site map. I'm using google's search with advertisement as the planned mechanism to go through my site. Google's webmaster accepted the site mapps as good but then did not index. My index page has been indexed by google and at this point have not linked to any pages. So to get google's search to work I need to get all my content indexed. It appears google will not just index from the site map and so I was thinking of adding pages that spider in links from the main index page. But I don't want to create a bunch of pages that programicly link all of the pages without knowing if this has a chance to work. Eventually I plan on having about 150,000 pages each page being a word or phrase being defined. I wrote a program that is pulling this from a dictionary database. I would like to prove the content that I have to anyone interested to show the value of the dictionary in releation to the dictionary software that I'm completing. Suggestions for getting the entire site indexed by google so I can appear in the search results? Thanks

Read the article
PHP and storing stats

- by John

Using PHP5 and the latest version of MySQL I want to be able to track impressions and clicks for business listings. My question is if I did this myself what would be the best method in storing it so I can run reports? Before I just had a table that had the listing id, user ip address and if it was a click or impression as well as the date it was tracked. However the database itself is approaching 2GB of data and its very slow, part of the problem is its a pretty simple script that includes impressions and clicks from anyone including search engines and basically anyone or anything that accesses the listing page. Is there an api or file out there that has an update to date list that can detect if the person viewing is a actually person and not a spider so I dont fill up the database with unneeded stats? Just looking for suggestions, do I just have a raw database that gets just the hits then a cron job at night tally up for the day for each listing for each ip and store the cumulative stats in a different table? Also what type of database should it be? Innodb? MyISAM?

Read the article
.htaccess template, suggestions needed

- by purpler

DefaultLanguage en-US FileETag None Header unset ETag ServerSignature Off SetEnv TZ Europe/Belgrade # Rewrites Options +FollowSymLinks RewriteEngine On RewriteBase / # Redirect to WWW RewriteCond %{HTTP_HOST} ^serpentineseo.com RewriteRule (.*) http://www.serpentineseo.com/$1 [R=301,L] # Redirect index to root RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.html\ HTTP/ RewriteRule ^(.*)index\.html$ /$1 [R=301,L] # Cache media files: ExpiresActive On ExpiresDefault A0 # Month <filesMatch "\.(gif|jpg|jpeg|png|ico|swf|js)$"> Header set Cache-Control "max-age=2592000, public" </filesMatch> # Week <FilesMatch "\.(css|pdf)$"> Header set Cache-Control "max-age=604800" </FilesMatch> # 10 Min <FilesMatch "\.(html|htm|txt)$"> Header set Cache-Control "max-age=600" </FilesMatch> # Do not cache <FilesMatch "\.(pl|php|cgi|spl|scgi|fcgi)$"> Header unset Cache-Control </FilesMatch> # Compress output <IfModule mod_deflate.c> <FilesMatch "\.(html|js|css)$"> SetOutputFilter DEFLATE </FilesMatch> </IfModule> # Error Documents ErrorDocument 206 /error/206.html ErrorDocument 401 /error/401.html ErrorDocument 403 /error/403.html ErrorDocument 404 /error/404.html ErrorDocument 500 /error/500.html # Prevent hotlinking RewriteCond %{HTTP_REFERER} !^$ RewriteCond %{HTTP_REFERER} !^http://(www\.)?serpentineseo.com/.*$ [NC] RewriteRule \.(gif|jpg|png)$ http://www.serpentineseo.com/images/angryman.png [R,L] # Prevent offline browsers RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR] RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:[email protected] [OR] RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR] RewriteCond %{HTTP_USER_AGENT} ^Custo [OR] RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR] RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR] RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR] RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR] RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR] RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR] RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR] RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR] RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR] RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR] RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR] RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR] RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR] RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR] RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR] RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR] RewriteCond %{HTTP_USER_AGENT} ^HMView [OR] RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR] RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR] RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR] RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR] RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR] RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR] RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR] RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR] RewriteCond %{HTTP_USER_AGENT} ^larbin [OR] RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR] RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR] RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR] RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR] RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR] RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR] RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR] RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR] RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR] RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR] RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR] RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR] RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR] RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR] RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR] RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR] RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR] RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR] RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR] RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR] RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR] RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR] RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR] RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR] RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR] RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR] RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR] RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR] RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR] RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR] RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR] RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR] RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR] RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR] RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR] RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR] RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR] RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR] RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR] RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR] RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR] RewriteCond %{HTTP_USER_AGENT} ^Wget [OR] RewriteCond %{HTTP_USER_AGENT} ^Widow [OR] RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR] RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR] RewriteCond %{HTTP_USER_AGENT} ^Zeus RewriteRule ^.*$ http://www.google.com [R,L] # Protect against DOS attacks by limiting file upload size LimitRequestBody 10240000 # Deny access to sensitive files <FilesMatch "\.(htaccess|psd|log)$"> Order Allow,Deny Deny from all </FilesMatch>

Read the article
.htaccess template, suggestions needed

- by purpler

# Defaults AddDefaultCharset UTF-8 DefaultLanguage en-US FileETag None Header unset ETag ServerSignature Off SetEnv TZ Europe/Belgrade # Rewrites Options +FollowSymLinks RewriteEngine On RewriteBase / # Redirect to WWW RewriteCond %{HTTP_HOST} ^serpentineseo.com RewriteRule (.*) http://www.serpentineseo.com/$1 [R=301,L] # Redirect index to root RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.html\ HTTP/ RewriteRule ^(.*)index\.html$ /$1 [R=301,L] # Cache media files: ExpiresActive On ExpiresDefault A0 # Month <filesMatch "\.(gif|jpg|jpeg|png|ico|swf|js)$"> Header set Cache-Control "max-age=2592000, public" </filesMatch> # Week <FilesMatch "\.(css|pdf)$"> Header set Cache-Control "max-age=604800" </FilesMatch> # 10 Min <FilesMatch "\.(html|htm|txt)$"> Header set Cache-Control "max-age=600" </FilesMatch> # Do not cache <FilesMatch "\.(pl|php|cgi|spl|scgi|fcgi)$"> Header unset Cache-Control </FilesMatch> # Compress output <IfModule mod_deflate.c> <FilesMatch "\.(html|js|css)$"> SetOutputFilter DEFLATE </FilesMatch> </IfModule> # Error Documents ErrorDocument 206 /error/206.html ErrorDocument 401 /error/401.html ErrorDocument 403 /error/403.html ErrorDocument 404 /error/404.html ErrorDocument 500 /error/500.html # Prevent hotlinking RewriteCond %{HTTP_REFERER} !^$ RewriteCond %{HTTP_REFERER} !^http://(www\.)?serpentineseo.com/.*$ [NC] RewriteRule \.(gif|jpg|png)$ http://www.serpentineseo.com/images/angryman.png [R,L] # Prevent offline browsers RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR] RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:[email protected] [OR] RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR] RewriteCond %{HTTP_USER_AGENT} ^Custo [OR] RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR] RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR] RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR] RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR] RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR] RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR] RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR] RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR] RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR] RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR] RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR] RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR] RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR] RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR] RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR] RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR] RewriteCond %{HTTP_USER_AGENT} ^HMView [OR] RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR] RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR] RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR] RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR] RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR] RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR] RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR] RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR] RewriteCond %{HTTP_USER_AGENT} ^larbin [OR] RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR] RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR] RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR] RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR] RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR] RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR] RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR] RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR] RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR] RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR] RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR] RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR] RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR] RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR] RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR] RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR] RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR] RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR] RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR] RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR] RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR] RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR] RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR] RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR] RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR] RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR] RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR] RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR] RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR] RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR] RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR] RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR] RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR] RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR] RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR] RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR] RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR] RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR] RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR] RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR] RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR] RewriteCond %{HTTP_USER_AGENT} ^Wget [OR] RewriteCond %{HTTP_USER_AGENT} ^Widow [OR] RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR] RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR] RewriteCond %{HTTP_USER_AGENT} ^Zeus RewriteRule ^.*$ http://www.google.com [R,L] # Protect against DOS attacks by limiting file upload size LimitRequestBody 10240000 # Deny access to sensitive files <FilesMatch "\.(htaccess|psd|log)$"> Order Allow,Deny Deny from all </FilesMatch>

Read the article
Repeated requests on our server?

- by pitty.platsch

I encountered something strange in the access log of our Apache server which I cannot explain. Requests for webpages that I or my colleagues do from the office's Windows network get repeated by another IP (that we don't know) a couple of seconds later. The user agent repeating our requests is Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.2) Has anyone an idea? Update: I've got some more information now. The referrer of the replicate is set to the URL I requested before and it's not the exact same request as the protocol version is changed from 'HTTP/1.1' to 'HTTP/1.0'. The IP is not just one, it's just one of a subnet (80.40.134.*). It's just the first request to a resource that's get repeated, so it seems the "spy" is building up some kind of cache of visited places. The repeater is also picky. I tried randomly URLs with different HTTP status codes and different file patterns. 301s and 200s are redone, 404s not. Image extensions seem to be ignored. While doing my tests I discovered that this behavior seems to be common as I found other clients visiting just after the first requests: 66.249.73.184 - - [25/Oct/2012:10:51:33 +0100] "GET /foobar/ HTTP/1.1" 200 10952 "-" "Mediapartners-Google" 50.17.125.180 - - [25/Oct/2012:10:51:33 +0100] "GET /foobar/ HTTP/1.1" 200 41312 "-" "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)" I wasn't aware about this practice, so I don't see it that much as a threat anymore. I still want to find out who this is, so any further help is appreciated. I'll try later if this also happens if I query some other server where I have access to the access logs and will update here then.

Read the article
WordPress SEO Plugins to make your Blog Search Engine Friendly

- by Vaibhav

WordPress is the most common blogging system in use today and its use as a CMS is also wide spread. With hundreds of millions of sites using wordpress, getting correct SEO for your WordPress based Blog or Site is very important. We get regular queries from people who want Search Engine Optimisation for their site or blog which is made using wordpress. Here is a list of 16 of the best WordPress Plug-ins That can help you achieve better rankings: All in one SEO Pack This is most popular plugin among all SEO plugins for WordPress. It is easy to use and is compatible with most of the WordPress plugins. It works as a complete package of SEO plugin – automatically generating META tags and optimizing search engines for your titles and avoiding duplicate content. You can also include META tags manually (Met title, Meta description and Met keywords) for all pages and post in your website. HeadSpace2 HeasSpace2 is available in different languages , you can manage a wide range of SEO Tasks related with meta data, you can tag your posts, Custom descriptions and titles. So your page can rank the created relevancy on Search engines and you can load different settings for different pages. Platinum SEO plugin Automatic 301 redirects permalink changes, META tags generation, avoids duplicate content, and does SEO optimization of post and page titles and a lots of other features. TGFI.net SEO WordPress Plugin It’s a modified version of all-in-one SEO Pack. It has some unique feature over All-in-one SEO plugin, It generate titles, meta descriptions and meta keywords automatically when overrides are not present. Google XML Sitemaps Sitemaps Generated by this tool are supported by Google, Yahoo, Bing, and Ask. We all know Sitemaps make indexing of web pages easier for web crawlers. Crawlers can retrieve complete structure of site and more information by sitemaps. They notify all major search engines about new posts every time you create a new post. Sitemap Generator You can generate highly customizable sitemap for your WordPress page. You can choose what to show and what not to show, you can list the items in your choice of orde. It supports pages and permalinks and multi-level categories. SEO Slugs They can generate more search engine friendly URLs for your site. Slugs are filename assigned to your post , this plugin removes all common words like ‘a’, ‘the’, ‘in’, ‘what’, ‘you’ from slug which are assigned automatically to your post. SEO Post Links This is a similar plugin to SEO Slug, it removes unnecessary keywords from slug to make it short and SEO friendly and you can fix the number of characters in your post. Automatic SEO links With this tool you can create auto linking in your post. You can use this tool for inter linking or external linking too. Just select your words, anchor text target URL nature of links ( Do fallow / No follow ). This plugin will replace the matches found in post, WP Backlinks A helpful plugin for link exchange , whenever any webmaster submits a link for link exchange, the plugin will spider webmasters site for reciprocal link, and if everything is found good , your link will be exchanged. SEO Title Tag You can optimize your Title tags of Word press blog through this plugin . You can also override the title tag with custom titles , mass editing and title tags for 404 pages which are the main feature of this plugin. 404 SEO plugin With this Plugin you can customize 404 page of your site; you can give customized error message and links to relevant pages of your site. Redirection A powerful plugins to manage 301 redirection and logs related with redirection, with this plugin you can track 404 errors and track the log of all redirected URLs , this plugin can redirect post automatically when URL changes for that post. AddToAny This plugin helps your readers to share, save, email and bookmark your posts and pages. It supports more than a hundred social bookmarking , networking and sharing sites. SEO Friendly Images You can make SEO friendly images available on your site with the help of this tool. It updates images with proper titles and ALT tags. Robots Meta A plugin which prevents Search engines to index comments on your post, login and admin pages. It also allows to add tags for individual pages.

Read the article
Selling Visual Studio ALM

- by Tarun Arora

Introduction As a consultant I have been selling Application Lifecycle Management services using Visual Studio and Team Foundation Server. I’ve been contacted various times by friends working in organization telling me that ALM processes in their company were benchmarked when dinosaurs walked the earth. Most of these individuals already know the great features Microsoft ALM tools offer and are keen to start a conversation with the CIO but don’t exactly know where to start. It is very important how you engage in your first conversation, if you start the conversation with ‘There is this great tooling from Microsoft which offers amazing features to boost developer productivity, … ‘ from experience I can tell you the reply from your CIO would be ‘I already know! Our existing landscape has a combination of bleeding edge open source and cutting edge licensed tools which already cover these features quite well, more over Microsoft products have a high licensing cost associated to them.’ You will always find it harder to sell by feature, the trick is to highlight the gap in the existing processes & tools and then highlight the impact of these gaps to the overall development processes, by now you would have captured enough attention to show off how the ALM tooling offered by Microsoft not only fills those gaps but offers great value adds to take their development practices to the next level. Rangers ALM Assessment Guide Image 1 – Welcome! First look at the Rangers ALM assessment guide Most organization already have some processes in place to cover aspects of ALM. How do you go about proving that there isn’t enough cover in place? This is where Visual Studio ALM Rangers ALM Assessment guide can help. The ALM assessment guide is really a tool that helps you gather information about Development practices and processes within a customer's environment. Several questionnaires are used to identify the current state of individual development lifecycle areas and decide on a desired state for those processes. It also presents guidance and roll-up summaries to help with recommendations moving forward. The ALM Rangers assessment guide can be downloaded from here. Image 2 – ALM Assessment guide divided into different functions of SDLC The assessment guide is divided into different functions of Software Development Lifecycle (listed below), this gives you the ability to access how mature the company is in different areas of SDLC. Architecture & Design Requirement Engineering & UX Development Software Configuration Management Governance Deployment & Operations Testing & Quality Assurance Project Planning & Management Each section has a set of questions, fill in the assessment by selecting “Never/Sometimes/Always” from the Answer column in the question sheets. Each answer has weightage to the overall score. Each question has a link next to it, clicking the link takes you to the Reference sheet which gives you more details about the question along with a reason for “why you need to ask this question?”, “other ways to phrase the question” and “what to expect as an answer from the customer”. The trick is to engage the customer in a discussion. You need to probe a lot, listen to the customer and have a discussion with several team members, preferably without management to ensure that you receive candid feedback. This reminds me of a funny incident when during an ALM review a customer told me that they have a sophisticated semi-automated application deployment process, further discussions revealed that deployment actually involved 72 manual configuration steps per production node. Such observations can be recorded in the Issue Brainstorming worksheet for further consideration later. It is also worth mentioning the different levels of ALM maturity to the customer. By default the desired state of ALM maturity is set to Standard, it is possible to set a desired state by area, you should strive for Advanced or Dynamic, it always helps by explaining the classification and advantages. Image 3 – ALM levels by description The ALM assessment guide helps you arrive at a quantitative measure of the company’s ALM maturity. The resultant graph plotted on a spider’s web shows you the company’s current state of ALM maturity and the desired state of ALM maturity. Further since the results are classified by area you can immediately spot the areas where the customer needs immediate help. Image 4 – The spiders web! The red cross icons are areas shouting out for immediate attention, the yellow exclamation icons are areas that need improvement. These icons are calculated on the difference between the Current State of ALM maturity VS the Desired state of ALM maturity. Image 5 – Results by area Conclusion To conclude the Rangers ALM assessment guide gives you the ability to, Measure the customer’s current ALM maturity level Understand the ALM maturity level the customer desires to achieve Capture a healthy list of issues the customer wants to brainstorm further Now What’s next…? Download and get started with the Rangers ALM Assessment Guide. If you have successfully captured the above listed three pieces of information you are in a great state to make recommendations on the identified areas highlighting the benefits that Visual Studio ALM tools would offer. In the next post I will be covering how to take the ALM assessment results as the base to actually convert your recommendation into a sell. Remember to subscribe to http://feeds.feedburner.com/TarunArora. I would love to hear your feedback! If you have any recommendations on things that I should consider or any questions or feedback, feel free to leave a comment. *** A special thanks goes out to fellow ranges Willy, Ethem and Philip for reviewing the blog post and providing valuable feedback. ***

Read the article
What IDE to use for Python

- by husayt

As a Python newbie, it is interesting to know what IDE's ("GUIs/editors") others use for Python coding. If you can just give the name (e.g. Textpad, Eclipse ..) that will be enough. If it is already mentioned, you can just vote for it. But if you can also give some more comparative information, that will be much appreciated. Thanks. Update: Results so far PyDev with Eclipse (CP, F, AC, PD, EM, SI, MLS, UML, SC, UT, LN, CF, BM) Komodo (CP, C/F, MLS, PD, AC, SC, SI, BM, LN, CF, CT) Emacs (CP, F, AC, MLS, PD, EM, SC, SI, BM, LN, CF, CT, UT, UML) Vim (CP, F, AC, MLS, SI, BM, LN, CF ) TextMate (Mac, CT, CF, MLS, SI, BM, LN) Gedit (Linux, F, AC, MLS, BM, LN, CT [sort of]) Idle (CP, F, AC) PIDA (Linux, CP, F, AC, MLS, SI, BM, LN, CF)(VIM Based) NotePad++ (Windows) BlueFish (Linux) JEdit (CP, F, BM, LN, CF, MLS) E-Texteditor (TextMate Clone for Windows) WingIde (CP, C, AC, MLS (support for C), PD, EM, SC, SI, BM, LN, CF, CT, UT) Eric Ide (CP, F, AC, PD, EM, SI, LN, CF, UT) Pyscripter (Windows, F, AC, PD, EM, SI, LN, CT, UT) ConTEXT (Windows, C) SPE (F, AC, UML) SciTE (CP, F, MLS, EM, BM, LN, CF, CT, SH) Zeus (W, C, BM, LN, CF, SI, SC, CT) NetBeans (CP, F, PD, UML, AC, MLS, SC, SI, BM, LN, CF, CT, UT, RAD) DABO (CP) BlackAdder (C, CP, CF, SI) PythonWin (W, F, AC, PD, SI, BM, CF) Geany (CP, F, very limited AC, MLS, SI, BM, LN, CF) UliPad (CP, F, AC, PD, MLS, SI, LI, CT, UT, BM) Boa Constructor (CP, F, AC, PD, EM, SI, BM, LN, UML, CF, CT) ScriptDev (W, C, AC, MLS, PD, EM, SI, BM, LN, CF, CT) Spider (CP, F, AC) Editra (CP, F, AC, MLS, SC, SI, BM, LN, CF) Pfaide (Windows, C, AC, MLS, SI, BM, LN, CF, CT) KDevelop (CP, F, MLS, SC, SI, BM, LN, CF) Acronyms used: CP - Cross Platfom C - Commercial F - Free AC - Automatic Code-completion MLS - Multi-Language Support PD - Integrated Python Debugging EM - ErrorMarkup SC - Source Control integration SI - Smart Indent BM - Bracket Matching LN - Line Numbering UML - UML editing / viewing CF - Code Folding CT - Code Templates UT - Unit Testing UID - Gui Designer (e.g. QT, Eric, ..) DB - integrated database support RAD - Rapid app development support I don't mention basics like Syntax highlighting as I expect these by default. This is a just dry list reflecting your feedback and comments, I am not advocating any of these tools. I will keep updating this list as you keep posting your answers. PS. Can you help me to add features of the above editors to the list (like autocomplete, debugging, or etc)?

Read the article
SQLAlchemy unsupported type error - and table design issues?

- by Az

Hi there, back again with some more SQLAlchemy shenanigans. Let me step through this. My table is now set up as so: engine = create_engine('sqlite:///:memory:', echo=False) metadata = MetaData() students_table = Table('studs', metadata, Column('sid', Integer, primary_key=True), Column('name', String), Column('preferences', Integer), Column('allocated_rank', Integer), Column('allocated_project', Integer) ) metadata.create_all(engine) mapper(Student, students_table) Fairly simple, and for the most part I've been enjoying the ability to query almost any bit of information I want provided I avoid the error cases below. The class it is mapped from is: class Student(object): def __init__(self, sid, name): self.sid = sid self.name = name self.preferences = collections.defaultdict(set) self.allocated_project = None self.allocated_rank = 0 def __repr__(self): return str(self) def __str__(self): return "%s %s" %(self.sid, self.name) Explanation: preferences is basically a set of all the projects the student would prefer to be assigned. When the allocation algorithm kicks in, a student's allocated_project emerges from this preference set. Now if I try to do this: for student in students.itervalues(): session.add(student) session.commit() It throws two errors, one for the allocated_project column (seen below) and a similar error for the preferences column: sqlalchemy.exc.InterfaceError: (InterfaceError) Error binding parameter 4 - probably unsupported type. u'INSERT INTO studs (sid, name, allocated_rank, allocated_project) VALUES (?, ?, ?, ?, ?, ?, ?)' [1101, 'Muffett,M.', 1, 888 Human-spider relationships (Supervisor id: 123)] If I go back into my code I find that, when I'm copying the preferences from the given text files, it actually refers to the Project class which is mapped to a dictionary, using the unique project id's (pid) as keys. Thus, as I iterate through each student via their rank and to the preferences set, it adds not a project id, but the reference to the project id from the projects dictionary. students[sid].preferences[int(rank)].add(projects[int(pid)]) Now this is very useful to me since I can find out all I want to about a student's preferred projects without having to run another check to pull up information about the project id. The form you see in the error has the object print information passed as: return "%s %s (Supervisor id: %s)" %(self.proj_id, self.proj_name, self.proj_sup) My questions are: I'm trying to store an object in a database field aren't I? Would the correct way then, be copying the project information (project id, name, etc) into its own table, referenced by the unique project id? That way I can just have the project id field for one of the student tables just be an integer id and when I need more information, just join the tables? So and so forth for other tables? If the above makes sense, then how does one maintain the relationship with a column of information in one table which is a key index on another table? Does this boil down into a database design problem? Are there any other elegant ways of accomplishing this? Apologies if this is a very long-winded question. It's rather crucial for me to solve this, so I've tried to explain as much as I can, whilst attempting to show that I'm trying (key word here sadly) to understand what could be going wrong.

Read the article
Nginx traffic is going to wrong upsteam when mixing named servers and default servers

- by Morgan

I have the below config file for nginx. The problem is all traffic is going to upstream clustera. How do I configure nginx to only send traffic for example.com to clustera and all the rest to clusterb? user www-data; worker_processes 1; error_log /var/log/nginx/error.log; pid /var/run/nginx.pid; events { worker_connections 1024; } http { include /etc/nginx/mime.types; log_format cache '\n*** $remote_addr [$time_local] ' '[$upstream_cache_status] $upstream_response_time ' '$host "$request" ($status) $body_bytes_sent ' '"$http_referer" "$http_user_agent" ' 'Cache-Control: $upstream_http_cache_control ' 'Expires: $upstream_http_expires ' ; access_log /var/log/nginx/access.log cache; sendfile on; keepalive_timeout 65; gzip on; gzip_vary on; gzip_comp_level 6; gzip_proxied any; gzip_disable "MSIE [1-6]\.(?!.*SV1)"; gzip_buffers 16 8k; include /etc/nginx/conf.d/*.conf; proxy_cache_key "$scheme$host$request_uri"; proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=main:10m max_size=1g inactive=30m; upstream clustera { ip_hash; server a.example.com:80; } upstream clusterb { ip_hash; server b.example.com:80; } client_max_body_size 20m; client_body_buffer_size 128k; proxy_connect_timeout 300; proxy_send_timeout 300; proxy_read_timeout 300; # host for example.com should send traffic to clustera server { listen 80; server_name example.com; location ~*(png|jpeg|jpg|gif|ico|css|js)$ { proxy_pass http://clustera; proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_cache main; proxy_cache_valid 200 5m; proxy_cache_valid 302 1m; } location / { proxy_pass http://clustera; proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; } } # host for everyone else. traffic goes to clusterb server { listen 80; server_name _; if ( $http_user_agent ~* (spider|crawler|slurp) ) { return 503; } set $slow 0; if ( $http_user_agent ~* (bot) ) { set $slow 1; } if ( $slow ) { set $limit_rate 1k; } location ~*(png|jpeg|jpg|gif|ico|css|js)$ { proxy_pass http://clusterb; proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_cache main; proxy_cache_valid 200 5m; proxy_cache_valid 302 1m; } location /images { proxy_pass http://clisterb; proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_cache main; proxy_cache_valid 200 5m; proxy_cache_valid 302 1m; } location / { proxy_pass http://clusterb; proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; } } }

Read the article
.htaccess template, suggestions needed.

- by purpler

I compiled myself a .htaccess template and would like to know whether the caching and compressions is set up right, constructive suggestions and critics needed. # Defaults AddDefaultCharset UTF-8 DefaultLanguage en-US FileETag None Header unset ETag ServerSignature Off SetEnv TZ Europe/Belgrade # Rewrites Options +FollowSymLinks RewriteEngine On RewriteBase / # Redirect to WWW RewriteCond %{HTTP_HOST} ^serpentineseo.com RewriteRule (.*) http://www.serpentineseo.com/$1 [R=301,L] # Redirect index to root RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.html\ HTTP/ RewriteRule ^(.*)index\.html$ /$1 [R=301,L] # Cache media files: ExpiresActive On ExpiresDefault A0 # Month <filesMatch "\.(gif|jpg|jpeg|png|ico|swf|js)$"> Header set Cache-Control "max-age=2592000, public" </filesMatch> # Week <FilesMatch "\.(css|pdf)$"> Header set Cache-Control "max-age=604800" </FilesMatch> # 10 Min <FilesMatch "\.(html|htm|txt)$"> Header set Cache-Control "max-age=600" </FilesMatch> # Do not cache <FilesMatch "\.(pl|php|cgi|spl|scgi|fcgi)$"> Header unset Cache-Control </FilesMatch> # Compress output <IfModule mod_deflate.c> <FilesMatch "\.(html|js|css)$"> SetOutputFilter DEFLATE </FilesMatch> </IfModule> # Error Documents ErrorDocument 206 /error/206.html ErrorDocument 401 /error/401.html ErrorDocument 403 /error/403.html ErrorDocument 404 /error/404.html ErrorDocument 500 /error/500.html # Prevent hotlinking RewriteCond %{HTTP_REFERER} !^$ RewriteCond %{HTTP_REFERER} !^http://(www\.)?serpentineseo.com/.*$ [NC] RewriteRule \.(gif|jpg|png)$ http://www.serpentineseo.com/images/angryman.png [R,L] # Prevent offline browsers RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR] RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:[email protected] [OR] RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR] RewriteCond %{HTTP_USER_AGENT} ^Custo [OR] RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR] RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR] RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR] RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR] RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR] RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR] RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR] RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR] RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR] RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR] RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR] RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR] RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR] RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR] RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR] RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR] RewriteCond %{HTTP_USER_AGENT} ^HMView [OR] RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR] RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR] RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR] RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR] RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR] RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR] RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR] RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR] RewriteCond %{HTTP_USER_AGENT} ^larbin [OR] RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR] RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR] RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR] RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR] RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR] RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR] RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR] RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR] RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR] RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR] RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR] RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR] RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR] RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR] RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR] RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR] RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR] RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR] RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR] RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR] RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR] RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR] RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR] RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR] RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR] RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR] RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR] RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR] RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR] RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR] RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR] RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR] RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR] RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR] RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR] RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR] RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR] RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR] RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR] RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR] RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR] RewriteCond %{HTTP_USER_AGENT} ^Wget [OR] RewriteCond %{HTTP_USER_AGENT} ^Widow [OR] RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR] RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR] RewriteCond %{HTTP_USER_AGENT} ^Zeus RewriteRule ^.*$ http://www.google.com [R,L] # Protect against DOS attacks by limiting file upload size LimitRequestBody 10240000 # Deny access to sensitive files <FilesMatch "\.(htaccess|psd|log)$"> Order Allow,Deny Deny from all </FilesMatch>

Read the article
CodePlex Daily Summary for Monday, June 09, 2014

CodePlex Daily Summary for Monday, June 09, 2014Popular ReleasesSmarterSql: Release 2014-06-09: Added a new strip connected to a text window that can be configured to show a color per connection (per server and database).SCCM 2012 Application Importer: SCCM 2012 Application Importer Alpha 0.3: Fixes since 0.2: Added GUI browser for choosing Limiting Collection Fix bug related to Deployments Bug when getting DP groups in settings tool Fixed bug with deployment and client did not download content. NOTE: Now application is required to run elevated Previous version must first be uninstalled and everything must be reconfigured since there have been changes in the config.xml.Papercut: Papercut v3.0.0.0 beta: Papercut has switched to semantic versioning! That means you will have to uninstall old "clickonce" versions to get the latest as it will see it as an older version. Latest Has Tons of New Features: Modern UI MVVM Architecture Watch Directories for New Messages Optional Backend Papercut Service Load on Windows Startup Attachments/Mime SectionsDIII Save Editor: DIII.SaveEdit_Alpha1.1.10.91.zip: Everything should work report what dontExperfwiz (Exchange Performance Data Collection tool): Experfwiz 1.3.8: List of updates in 1.3.8 Added support for Windows 2012 & 2012 R2 (for future use) Added support for Exchange 2013 Full is now enabled by default. To disable full mode, use 'nofull'. Exchange 2013 requires full. Added "\Processor Information()\" counters to Exchange 2010 full Blocked Exmon execution on Exchange 2013BugNET Issue Tracker: BugNET 1.6: Version 1.6 is a major upgrade to the latest frameworks and components by Microsoft. It includes a major UI overhaul using the bootstrap framework to have a modern, mobile friendly and easily customized layout. Upgrade to .NET 4.5 ASP.NET social auth Add script bundling and optimization Improvements for mobile devices Bootstrap (UI overhaul) Rewritten / friendly URL's Please read our release notes for BugNET 1.6: http://blog.bugnetproject.com/2014/06/08/bugnet-1-6-and-bugnet-pro-1...SFDL.NET: SFDL.NET (2.2.9.3): Changelog: Retry Bugfix (Error Counter wurde nicht korrekt zurückgesetzt) Neue Einstellung: Retry Wartezeit ist nun EinstellbarAspose for Apache POI: Aspose.Words vs Apache POI WP - v 1.1: Release contain the Code Comparison for Features in Apache POI WP SDK and Aspose.Words What's New ?Following Examples: Working with Headers Working with Footers Access Ranges in Document Delete Ranges in Document Insert Before and After Ranges Feedback and Suggestions Many more examples are yet to come here. Keep visiting us. Raise your queries and suggest more examples via Aspose Forums or via this social coding site.babelua: 1.5.7.0: V1.5.7.0 - 2014.6.6Stability improvement: use "lua scripts folder" as lua search path when debugging;SEToolbox: SEToolbox 01.033.007 Release 1: Fixed breaking changes in Space Engineers in latest update. Installation of this version will replace older version.PlaySly: GetDirectLink 1.0.1.0: ConfiguraciónNuevo sistema de configuración Nuevo formulario de configuración Añadidas configuraciones visuales Fuente principal, fuente secundaria, fuente menus, fuente botones grandes, fuente botones pequeños Color fondo, color fondo menus, color fondo botones y colores de fuentes Ahora las fuentes soportan estilo (irregular, negrita, tachado y subrayado) Añadidas configuraciones de reproductor Pausar al iniciar para que el vídeo cargue (buffering) Añadidas configuraciones sobre...Virto Commerce Enterprise Open Source eCommerce Platform (asp.net mvc): Virto Commerce 1.10: Virto Commerce Community Edition version 1.10. To install the SDK package, please refer to SDK getting started documentation To configure source code package, please refer to Source code getting started documentation This release includes bug fixes and improvements (including Commerce Manager localization and https support). More details about this release can be found on our blog at http://blog.virtocommerce.com.NPOI: NPOI 2.1: Assembly Version: 2.1.0 New Features a. XSSFSheet.CopySheet b. Excel2Html for XSSF c. insert picture in word 2007 d. Implement IfError function in formula engine Bug Fixes a. fix conditional formatting issue b. fix ctFont order issue c. fix vertical alignment issue in XSSF d. add IndexedColors to NPOI.SS.UserModel e. fix decimal point issue in non-English culture f. fix SetMargin issue in XSSF g.fix multiple images insert issue in XSSF h.fix rich text style missing issue in XSSF i. fix cell...51Degrees - Device Detection and Redirection: 3.1.2.3: Version 3.1 HighlightsDevice detection algorithm is over 100 times faster. Regular expressions and levenshtein distance calculations are no longer used. The device detection algorithm performance is no longer limited by the number of device combinations contained in the dataset. Two modes of operation are available: Memory – the detection data set is loaded into memory and there is no continuous connection to the source data file. Slower initialisation time but faster detection performanc...CS-Script for Notepad++ (C# intellisense and code execution): Release v1.0.27.0: CodeMap now indicates the type name for all members Implemented running scripts 'as administrator'. Just add '//css_npp asadmin' to the script and run it as usual. 'Prepare script for distribution' now aggregates script dependency assemblies. Various improvements in CodeSnipptet, Autcompletion and MethodInfo interactions with each other. Added printing line number for the entries in CodeMap (subject of configuration value) Improved debugging step indication for classless scripts ...ClosedXML - The easy way to OpenXML: ClosedXML 0.72.3: 70426e13c415 ClosedXML for .Net 4.0 now uses Open XML SDK 2.5 b9ef53a6654f Merge branch 'master' of https://git01.codeplex.com/forks/vbjay/closedxml 727714e86416 Fix range.Merge(Boolean) for .Net 3.5 eb1ed478e50e Make public range.Merge(Boolean checkIntersects) 6284cf3c3991 More performance improvements when saving.TEncoder: 4.0.0: --4.0.0 -Added: Video downloader -Added: Total progress will be updated more smoothly -Added: MP4Box progress will be shown -Added: A tool to create gif image from video -Added: An option to disable trimming -Added: Audio track option won't be used for mpeg sources as default -Fixed: Subtitle position wasn't used -Fixed: Duration info in the file list wasn't updated after trimming -Updated: FFMpegVeraCrypt: VeraCrypt version 1.0d: Changes between 1.0c and 1.0d (03 June 2014) : Correct issue while creating hidden operating system. Minor fixes (look at git history for more details).Aspose for DNN: DNN Import from PDF using Aspose.Pdf: Aspose DNN PDF Import Module allows developers to get/read contents of PDF document without requiring any other software such as Adobe Acrobat or PDF reader. This module demonstrates the powerful import feature provided by Aspose.Pdf. It adds a simple file browser control and Import from PDF button on the page where the module is added. When clicking the button, users get the document contents displayed on screen immediately.Windows ??? ???????????? ??????: ??? ??? ?????? 1.1: C#/XAML 　?????????、??? ???????????????。???????、CSV????????????????????????????????。 ???????????、?????????????????。New ProjectsAmericanMarket: american marketCatering: CateringeeCommerce: blog with ecommerceGFramework: ????，????！littleAdminHelper: Little Admin Helper is a Software to gather Systeminformations for Helpdesk Calls. The Application is able to send Mails to a specified Support adress.mkv2mp4: converter for mkv2mp4My Explorer: This is a modern version of the desktop's File Explorer.NAAE v2 companion code: This project contains the companion code for the "Microsoft .NET - Architecting Applications for the Enterprise (2nd Edition)" book.OpenSSDT: OpenSSDT is open source project that collect set of tools and documentation that where created from expirance with SSDT and the DacFX framework. Feel free to adOperations Documentation Templates: A set of document templates to be used by operations groups to have a rapid and formalized process for solution delivery. Redback: A simple web spider written in C# and exclusively targeting Windows RT/Windows Phone platforms.SQL Library: Collection of t-sql queries.systempro: systempro?????-?????【??】: ?????????????????????????,???????????,??????,??????????????...????????。??????！ ?????-?????【??】: ?????????:?????,??????!???????????????,???????、?????、??????“??”????,????,????! ??????-??????【??】: ?????????????????,??????????、??????,??????????、????、????、???????。??????-??????【??】: ?????????????????,??????????、??????,??????????、????、????、???????。?????-?????【??】: ????????????????、?????????,??????、??????????????????,???????.??????????,????????。

Read the article
View Generated Source (After AJAX/JavaScript) in C#

- by Michael La Voie

Is there a way to view the generated source of a web page (the code after all AJAX calls and JavaScript DOM manipulations have taken place) from a C# application without opening up a browser from the code? Viewing the initial page using a WebRequest or WebClient object works ok, but if the page makes extensive use of JavaScript to alter the DOM on page load, then these don't provide an accurate picture of the page. I have tried using Selenium and Watin UI testing frameworks and they work perfectly, supplying the generated source as it appears after all JavaScript manipulations are completed. Unfortunately, they do this by opening up an actual web browser, which is very slow. I've implemented a selenium server which offloads this work to another machine, but there is still a substantial delay. Is there a .Net library that will load and parse a page (like a browser) and spit out the generated code? Clearly, Google and Yahoo aren't opening up browsers for every page they want to spider (of course they may have more resources than me...). Is there such a library or am I out of luck unless I'm willing to dissect the source code of an open source browser? SOLUTION Well, thank you everyone for you're help. I have a working solution that is about 10X faster then Selenium. Woo! Thanks to this old article from beansoftware I was able to use the System.Windows.Forms.WebBrwoswer control to download the page and parse it, then give em the generated source. Even though the control is in Windows.Forms, you can still run it from Asp.Net (which is what I'm doing), just remember to add System.Window.Forms to your project references. There are two notable things about the code. First, the WebBrowser control is called in a new thread. This is because it must run on a single threaded apartment. Second, the GeneratedSource variable is set in two places. This is not due to an intelligent design decision :) I'm still working on it and will update this answer when I'm done. wb_DocumentCompleted() is called multiple times. First when the initial HTML is downloaded, then again when the first round of JavaScript completes. Unfortunately, the site I'm scraping has 3 different loading stages. 1) Load initial HTML 2) Do first round of JavaScript DOM manipulation 3) pause for half a second then do a second round of JS DOM manipulation. For some reason, the second round isn't cause by the wb_DocumentCompleted() function, but it is always caught when wb.ReadyState == Complete. So why not remove it from wb_DocumentCompleted()? I'm still not sure why it isn't caught there and that's where the beadsoftware article recommended putting it. I'm going to keep looking into it. I just wanted to publish this code so anyone who's interested can use it. Enjoy! using System.Threading; using System.Windows.Forms; public class WebProcessor { private string GeneratedSource{ get; set; } private string URL { get; set; } public string GetGeneratedHTML(string url) { URL = url; Thread t = new Thread(new ThreadStart(WebBrowserThread)); t.SetApartmentState(ApartmentState.STA); t.Start(); t.Join(); return GeneratedSource; } private void WebBrowserThread() { WebBrowser wb = new WebBrowser(); wb.Navigate(URL); wb.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler( wb_DocumentCompleted); while (wb.ReadyState != WebBrowserReadyState.Complete) Application.DoEvents(); //Added this line, because the final HTML takes a while to show up GeneratedSource= wb.Document.Body.InnerHtml; wb.Dispose(); } private void wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e) { WebBrowser wb = (WebBrowser)sender; GeneratedSource= wb.Document.Body.InnerHtml; } }

Read the article
urllib2 misbehaving with dynamically loaded content

- by Sheena

Some Code headers = {} headers['user-agent'] = 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0' headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' headers['Accept-Language'] = 'en-gb,en;q=0.5' #headers['Accept-Encoding'] = 'gzip, deflate' request = urllib.request.Request(sURL, headers = headers) try: response = urllib.request.urlopen(request) except error.HTTPError as e: print('The server couldn\'t fulfill the request.') print('Error code: {0}'.format(e.code)) except error.URLError as e: print('We failed to reach a server.') print('Reason: {0}'.format(e.reason)) else: f = open('output/{0}.html'.format(sFileName),'w') f.write(response.read().decode('utf-8')) A url http://groupon.cl/descuentos/santiago-centro The situation Here's what I did: enable javascript in browser open url above and keep an eye on the console disable javascript repeat step 2 use urllib2 to grab the webpage and save it to a file enable javascript open the file with browser and observe console repeat 7 with javascript off results In step 2 I saw that a whole lot of the page content was loaded dynamically using ajax. So the HTML that arrived was a sort of skeleton and ajax was used to fill in the gaps. This is fine and not at all surprising Since the page should be seo friendly it should work fine without js. in step 4 nothing happens in the console and the skeleton page loads pre-populated rendering the ajax unnecessary. This is also completely not confusing in step 7 the ajax calls are made but fail. this is also ok since the urls they are using are not local, the calls are thus broken. The page looks like the skeleton. This is also great and expected. in step 8: no ajax calls are made and the skeleton is just a skeleton. I would have thought that this should behave very much like in step 4 question What I want to do is use urllib2 to grab the html from step 4 but I cant figure out how. What am I missing and how could I pull this off? To paraphrase If I was writing a spider I would want to be able to grab plain ol' HTML (as in that which resulted in step 4). I dont want to execute ajax stuff or any javascript at all. I don't want to populate anything dynamically. I just want HTML. The seo friendly site wants me to get what I want because that's what seo is all about. How would one go about getting plain HTML content given the situation I outlined? To do it manually I would turn off js, navigate to the page and copy the html. I want to automate this. stuff I've tried I used wireshark to look at packet headers and the GETs sent off from my pc in steps 2 and 4 have the same headers. Reading about SEO stuff makes me think that this is pretty normal otherwise techniques such as hijax wouldn't be used. Here are the headers my browser sends: Host: groupon.cl User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0 Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 Accept-Language: en-gb,en;q=0.5 Accept-Encoding: gzip, deflate Connection: keep-alive Here are the headers my script sends: Accept-Encoding: identity Host: groupon.cl Accept-Language: en-gb,en;q=0.5 Connection: close Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 User-Agent: User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0 The differences are: my script has Connection = close instead of keep-alive. I can't see how this would cause a problem my script has Accept-encoding = identity. This might be the cause of the problem. I can't really see why the host would use this field to determine the user-agent though. If I change encoding to match the browser request headers then I have trouble decoding it. I'm working on this now... watch this space, I'll update the question as new info comes up

Read the article

Search Results

Search found 178 results on 8 pages for 'hobo spider'.

Page 7/8 | < Previous Page | 3 4 5 6 7 8 | Next Page >

- by rdjurovich

- by Leniel Macaferi

- by user114293

- by Mike Chip

- by Benoitt

- by Peter Nielsen

- by seanieb

- by jonas

- by Marius Mathisen

- by Justavian

- by Nicklas Ansman

- by John

- by John

- by purpler

- by purpler

- by pitty.platsch

- by Vaibhav

- by Tarun Arora

- by husayt

- by Az

- by Morgan

- by purpler

- by Michael La Voie

- by Sheena

< Previous Page | 3 4 5 6 7 8 | Next Page >