Search Results

Search found 24376 results on 976 pages for 'site crawler'.

How To Discover RSS Feeds for a given site.

- by ktolis

The task: given a site URL (say http://stackoverflow.com/), return the list of all the feeds available on the site. Acceptable methods:

a) use a third-party service (Google? Yahoo? ...) programmatically
b) use a crawler/spider (with some tips on how to configure the spider to return only the RSS/XML feeds)
c) do it programmatically in C/C++/PHP (any language/library)

The task is not to get the feeds contained on the page returned by the URL, but ALL the feeds available on the server at any depth. In any case, please provide a simple usage example.
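One building block for option (b) or (c) is feed autodiscovery: many sites advertise their feeds with `<link rel="alternate">` tags in the page head. The sketch below is a minimal illustration using only the Python standard library; the names `FeedLinkParser`, `discover_feeds` and the sample HTML are hypothetical, and it inspects a single page only. A crawler would have to call it on every page it visits to approach "all feeds at any depth", and feeds that are not advertised this way would still be missed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to advertise feeds (assumption: RSS and Atom only).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised via <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None for valueless attributes.
        attr = {key: (value or "") for key, value in attrs}
        rel = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").lower()
        href = attr.get("href", "")
        if "alternate" in rel and mime in FEED_TYPES and href:
            # Resolve relative links against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html, base_url):
    """Return the feed URLs advertised in one HTML document."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds


# Hypothetical sample page; a real crawler would download the HTML instead.
SAMPLE = """<html><head>
<link rel="alternate" type="application/rss+xml" href="/feeds/all.rss">
<link rel="stylesheet" href="/site.css">
<link rel="alternate" type="application/atom+xml" href="http://example.com/atom.xml">
</head><body></body></html>"""

if __name__ == "__main__":
    for feed_url in discover_feeds(SAMPLE, "http://example.com/"):
        print(feed_url)
```

Keeping the parser separate from any download logic means the same function can be reused whether the HTML comes from a spider, a cache or a third-party service.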

Ranking hit after site migration

- by Ben

I migrated my site from its old domain over a month ago. I followed the Google Webmaster Tools guidance completely, including 301 redirects from every existing URL to the new domain, and then submitted a change of address. Traffic continued as normal, but a few days after submitting the change of address it plummeted to about 20-30% of what it was previously. Most of my traffic comes from organic search, and I can see that I am now ranking much lower for the keywords I had previously targeted and performed well with. For low-competition keywords I've sometimes lost only a few places, but for higher-competition terms I have really suffered. This has started to pick up a bit (for one of my keywords I have risen from 195 to 100 in the last week), but it seems to be a very slow process.

How seamless is this process normally? I was under the impression that the move would not affect my rankings too severely, but it has now been a month and recovery seems to be very slow, if it is happening at all. Is it likely that I've missed something? The only change is that I have moved what was the home page to be more of a sub-page, and in its place there is now a magazine-style home page. I understand that links to the old site's home page will now point to the new home page, which means that rankings for some keywords attributed to the old home page will take a hit, but I have seen a drop in rankings even on other pages that fit exactly the same page structure as on the previous site.

Install NPM Packages Automatically for Node.js on Windows Azure Web Site

- by Shaun

In one of my previous posts I described and demonstrated how to use NPM packages in Node.js and Windows Azure Web Site (WAWS). In that post I used the NPM command to install packages, then used Git for Windows to commit my changes and sync them to the WAWS git repository. WAWS then triggers a new deployment to host my Node.js application.

You may have noticed that an NPM package can contain many files and be fairly large. For example, the “azure” package, which is the Windows Azure SDK for Node.js, is about 6MB. Another popular package, “express”, a web application framework for Node.js, is about 1MB. When I first push my code to Windows Azure, all of it must be uploaded to the cloud. Is it possible to let Windows Azure download and install these packages for us? In this post, I will show how to make WAWS install all required packages when deploying.

Let’s Start with Demo

A demo is the most straightforward explanation. Let’s create a new WAWS and clone it to my local disk. Drag the folder into Git for Windows so that it can help us commit and push. Please refer to this post if you are not familiar with how to use Windows Azure Web Site, Git deployment, git clone and Git for Windows.

Then open a command window and install a package in our code folder. Let’s say I want to install “express”. Next, create a new Node.js file named “server.js” and paste in the code below.

    var express = require("express");
    var app = express();

    app.get("/", function(req, res) {
        res.send("Hello Node.js and Express.");
    });

    console.log("Web application opened.");
    app.listen(process.env.PORT);

If we switch to Git for Windows right now we will find that it detected the changes we made, which include “server.js” and all the files under the “node_modules” folder. What we need to upload is only our source code, but at this point the large package files would be uploaded as well. Now I will show you how to exclude them and let Windows Azure install the package in the cloud.

First we need to add a special file named “.gitignore”. It seems this cannot be done directly from the file explorer, since the file name consists only of an extension, so we need to do it from the command line. Navigate to the local repository folder and execute the command below to create a file named “.gitignore”. If the command window asks for input, just press Enter.

    echo > .gitignore

Now open this file, replace whatever it contains with the content below, and save.

    node_modules

If we switch to Git for Windows now, we will find that the packages under “node_modules” are no longer in the change list. So if we commit and push, the “express” package will not be uploaded to Windows Azure.

Second, let’s tell Windows Azure which packages it needs to install when deploying. Create another file named “package.json”, copy the content below into it, and save.

    {
        "name": "npmdemo",
        "version": "1.0.0",
        "dependencies": {
            "express": "*"
        }
    }

Now go back to Git for Windows, commit our changes and push them to WAWS. Then open the WAWS in the developer portal; we will see that a new deployment has finished. Clicking the arrow on the right side of this deployment shows how WAWS handled it. In particular, we can see that WAWS executed NPM. If we open the log we can review which command WAWS executed to install the packages, along with the installation output messages. As you can see, WAWS installed “express” for me on the cloud side, so I don’t need to upload the whole package to Azure. Open the website and we can see the result, which proves that “express” was installed successfully.

What’s Happened Under the Hood

Now let’s explain a bit about what “.gitignore” and “package.json” mean.

“.gitignore” is an ignore configuration file for a git repository. Untracked files and folders that match an entry in “.gitignore” are not added to commits, and therefore never get pushed. In the example above I put “node_modules” into this file in my local repository. This means: do not track or upload any files under the “node_modules” folder. So by using “.gitignore” I kept all packages from being uploaded to Windows Azure. “.gitignore” can list files and folders. It can also list files and folders that we do NOT want to ignore. In the next section we will see how to use the un-ignore syntax to include the SQL package.

The “package.json” file is the package definition file for a Node.js application. We can define the application name, version, description, author and similar information in it in JSON format. We can also list the dependent packages, to indicate which packages this Node.js application needs. In WAWS, name and version are required. When a deployment happens, WAWS looks into this file, finds the dependent packages and executes the NPM command to install them one by one. So in the demo above I put “express” into this file so that WAWS would install it for me automatically.

I updated the dependencies section of the “package.json” file manually, but this can be partly automated. If we have a valid “package.json” in our local repository, then when we install a package we can pass the “--save” parameter to the “npm install” command, so that NPM updates the dependencies section for us. For example, when I wanted to install the “azure” package I executed the command below. Note that I added “--save” to the command.

    npm install azure --save

Once it finishes, my “package.json” is updated automatically. Each dependent package is listed there. The JSON key is the package name and the value is the version range. Below is a brief list of the version range formats. For more information about “package.json” please refer here.

Format      Example               Description
version     "azure": "0.6.7"      Must match the version exactly.
>=version   "azure": ">=0.6.0"    Must be equal to or greater than the version.
1.2.x       "azure": "0.6.x"      Must start with the supplied digits; any digit may be used in place of the x.
~version    "azure": "~0.6.7"     Must be at least the given version and less than the next minor version (here, at least 0.6.7 and below 0.7.0).
*           "azure": "*"          Matches any version.

WAWS will install the proper version of each package based on what you define here.

[Figure: the WAWS git deployment and NPM installation process]

But Some Packages…

As we know, when we specify the dependencies in “package.json”, WAWS downloads and installs them in the cloud. For most packages this works very well, but some special packages may not work: if installing a package requires a particular environment, the installation might fail. For example, the SQL Server Driver for Node.js package needs “node-gyp”, Python and C++ 2010 installed on the target machine during the NPM installation. If we just put “msnodesql” in the “package.json” file and push it to WAWS, the deployment will fail, since there is no “node-gyp”, Python or C++ 2010 on the WAWS virtual machine.

For example, the “server.js” file:

    var express = require("express");
    var app = express();

    app.get("/", function(req, res) {
        res.send("Hello Node.js and Express.");
    });

    var sql = require("msnodesql");
    var connectionString = "Driver={SQL Server Native Client 10.0};Server=tcp:tqy4c0isfr.database.windows.net,1433;Database=msteched2012;Uid=shaunxu@tqy4c0isfr;Pwd=P@ssw0rd123;Encrypt=yes;Connection Timeout=30;";
    app.get("/sql", function (req, res) {
        sql.open(connectionString, function (err, conn) {
            if (err) {
                console.log(err);
                res.send(500, "Cannot open connection.");
            }
            else {
                conn.queryRaw("SELECT * FROM [Resource]", function (err, results) {
                    if (err) {
                        console.log(err);
                        res.send(500, "Cannot retrieve records.");
                    }
                    else {
                        res.json(results);
                    }
                });
            }
        });
    });

    console.log("Web application opened.");
    app.listen(process.env.PORT);

The “package.json” file:

    {
        "name": "npmdemo",
        "version": "1.0.0",
        "dependencies": {
            "express": "*",
            "msnodesql": "*"
        }
    }

And it failed to deploy to WAWS. From the NPM log we can see that this is because “msnodesql” cannot be installed on WAWS.

The solution is to ignore all packages except “msnodesql” in the “.gitignore” file, and upload that package ourselves. This can be done with the content below. We first un-ignore the “node_modules” folder. Then we ignore all of its sub folders, while still letting git check each of them. Finally we un-ignore the one sub folder named “msnodesql”, which is the SQL Server Node.js Driver.

    !node_modules/

    node_modules/*
    !node_modules/msnodesql

For more information about the syntax of “.gitignore” please refer to this thread.

Now if we go to Git for Windows we will find that “msnodesql” is included in the uncommitted set while “express” is not. I also need to remove the “msnodesql” dependency from “package.json”. Commit and push to WAWS, and we can see the deployment complete successfully. We can then use the Windows Azure SQL Database from our Node.js application through the “msnodesql” package we uploaded.

Summary

In this post I demonstrated how to leverage the deployment process of Windows Azure Web Site to install NPM packages during the publish action. With the “.gitignore” and “package.json” files we can leave the dependent packages out of our Node.js repository and let Windows Azure Web Site download and install them at deployment time. Special packages that cannot be installed by Windows Azure Web Site, such as “msnodesql”, can be put into the publish payload instead.

The combination of Windows Azure Web Site, Node.js and NPM makes it even easier and quicker for us to develop and deploy our Node.js applications to the cloud.

Hope this helps,
Shaun

All documents and related graphics, codes are provided "AS IS" without warranty of any kind. Copyright © Shaun Ziyan Xu. This work is licensed under the Creative Commons License.
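As a footnote to the version range formats listed in the post above, the matching rules can be sketched in a few lines. This is a simplified illustration written in Python, not NPM's actual implementation: the `parse` and `satisfies` names are hypothetical, versions are assumed to be plain three-part numbers such as 0.6.7 (no pre-release tags), and the tilde rule assumes npm's usual semantics, where ~0.6.7 means at least 0.6.7 and below 0.7.0.

```python
def parse(version):
    """Turn a dotted version such as "0.6.7" into a tuple of ints: (0, 6, 7)."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, spec):
    """Return True if `version` falls inside the range described by `spec`.

    Only the five formats from the post are handled: exact, >=, x-ranges,
    tilde ranges and "*".
    """
    current = parse(version)
    if spec == "*":
        return True
    if spec.startswith(">="):
        return current >= parse(spec[2:])
    if spec.startswith("~"):
        low = parse(spec[1:])
        # Upper bound: the next minor version, e.g. ~0.6.7 stays below 0.7.0.
        high = (low[0], low[1] + 1, 0)
        return low <= current < high
    if "x" in spec:
        # Compare only the digits given before the "x" placeholder.
        prefix = [part for part in spec.split(".") if part != "x"]
        return [str(n) for n in current[:len(prefix)]] == prefix
    return current == parse(spec)


if __name__ == "__main__":
    for spec in ["0.6.7", ">=0.6.0", "0.6.x", "~0.6.7", "*"]:
        print(spec, satisfies("0.6.9", spec))
```

Comparing versions as integer tuples rather than strings avoids the classic mistake of treating "0.10.0" as lower than "0.9.0".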

Price comparison sites and their effect on Google ranking

- by Jivago

I am the webmaster of a website that contains roughly 10,000 products. I am considering listing those products on a price comparison site like PriceGrabber, Nextag, Shopbot, etc. The principle of price comparison sites is great for a user who wants to compare prices, but my main concern is the effect it could have on my current ranking on Google. Since a site like Shopbot uses a CPC (cost-per-click) model, all the links on the website are built to track clicks (e.g. http://www.shopbot.ca/r.html?i=3&catc=2&refshop=5706&refshopcodeid=42587349); it uses redirection rather than direct links (so no direct backlinking). In your opinion and/or experience, is this a smart move, business-wise and SEO-wise, or not? Thanks!

Are there any decent open-source search engine solutions?

- by Nazariy

A few weeks ago a friend asked me how hard it is to launch your own search engine service with a list of websites that are supposed to be crawled from time to time. The first thing that came to mind was Google Custom Search; however, its pricing policy is quite tricky and would drain your budget if you reach 500K queries per year. Another solution I found here was SearchBlox, which can be compared to the Google Mini service. It's quite a good solution if you are planning to cover search over a small number of websites, but for larger projects it is not very handy. I also found a few other search platforms like Lucene, Hadoop and Xapian, which seem to be quite powerful solutions for reaching Google search quality, and Nutch as a web crawler. Like most open-source projects they share the same problem: a lack of comprehensive usage guidance and examples, and an expectation that you are already an expert in the subject. I'm wondering whether any of you are using these solutions, which of them you would recommend, and what I should be aware of.

Google bots are severely affecting site performance

- by Lynn

I have an aggregator site on a Linux server that pulls in feeds from a universe of about 2,000 blogs. It runs WordPress 3.4.2, and a cron job, staggered to run five times an hour on another server, pulls in the stories and then publishes them to the front page of this site. I set it up this way so I didn't put too much pressure on one server. However, the Google bots, which visit a few times every hour, bring the server to its knees in the mornings and evenings when there is an increase in traffic on the site. The bots have something like 30,000 links to follow at this point. How do I throttle the bots so they simply grab the new stories off the front page and stop there?

EDIT - Details of my server configuration: The server that handles all the publishing is an unmanaged instance on AWS. It mounts the NFS server and connects to the RDS to update content, etc. You get to this publishing instance via a plugin that detects the wp-admin link and then redirects you there. The front-end app server also mounts the NFS and requests data from the RDS. It is the only one that has WP Super Cache on it. The OS is Ubuntu on the app server, and the NFS server runs CentOS. The front end is Nginx and the publishing server is Apache.

Ranking hit after WP site migration

- by Ben

I migrated my site from its old domain over a month ago. I followed the Webmaster Tools (WMT) guidance completely, including 301 redirects from every existing URL to the new domain, and then submitted a change of address. Traffic continued as normal, but a few days after submitting the change of address it plummeted to about 20-30% of what it was previously. Most of my traffic comes from organic search, and I can see that I am now ranking much lower for the keywords I had previously targeted and performed well with. For low-competition keywords I've sometimes lost only a few places, but for higher-competition terms I have really suffered. This has started to pick up a bit (for one of my keywords I have risen from 195 to 100 in the last week), but it seems to be a very slow process.

How seamless is this process normally? I was under the impression that the move would not affect my rankings too severely, but it has now been a month and recovery seems to be very slow, if it is happening at all. Is it likely that I've missed something? The only change is that I have moved what was the home page to be more of a sub-page, and in its place there is now a magazine-style home page. I understand that links to the old site's home page will now point to the new home page, which means that rankings for some keywords attributed to the old home page will take a hit, but I have seen a drop in rankings even on other pages that fit exactly the same page structure as on the previous site. Any help would be greatly appreciated. Thanks!

Why old (301) links stay on Google when breaking site down to multiple domains

- by Sampo Sarrala

Some background: We used to have a single site on a single domain (let's call it mainsite.com) with product information, but things have changed since then and the product database has grown fast. So we decided to move some major products/manufacturers to their own domains (let's call one of them subsite.com) while still using our main database/codebase.

What we've done:
- Added the subsite.com domain for product 1 by Great Products Co., with some nice-looking new front pages, info pages, etc., and detail pages that use information from the original database.
- Redirected product/group links from mainsite.com using 301 redirects.
- Verified that the redirects work as expected.
- Waited some time for Google to reindex (over 30 days; I've heard that should be more than enough).

Results: If I search Google for our moved products, it finds and lists them, but with the old links to our main site, like mainsite.com/group/product1, when it should show links to the new site, subsite.com/product1. The links from Google redirect as they should; as I said, the redirects are verified [301].

Main question: Are there any reasons why Google would not follow the 301 redirects and update the links so that they point to our new manufacturer/product site, subsite.com?

Advice on software infrastructure for a FLOSS bounty site

- by michaeljt

I am planning to set up a simple web site where people can offer bounties for work on FLOSS projects. Unfortunately I have no experience at web development (I am a C/C++ developer), so I was hoping someone might be able to suggest out-of-the-box packages (preferably Debian ones) I could use to build the site from. My idea of how the site would work is to keep things as simple as possible. The person proposing a bounty would enter a description with relevant links (particularly to a bugtracker entry with the project the work is to be done on, where the real discussion and work would take place) and information and place an initial contribution. Other people would be able to add (donate, not pledge) contributions, but any discussion would take place on the project's bugtracker. I am also planning to run a mailing list rather than a forum (at least initially), so that is not a requirement. Paypal seems to me to be the handiest payment mechanism. So overall what I need is probably a simple interface with Paypal integration and a simple database backend. I hope this is the right place for my question, if not I would be grateful for pointers to somewhere better. And of course, this is purely about the technical side, though I am more than happy to discuss other aspects of the project elsewhere.

How much HDD space would I need to cache the web while respecting robots.txt?

- by Koning Baard XIV

I want to experiment with creating a web crawler. I'll start by indexing a few medium-sized websites like Stack Overflow or Smashing Magazine. If that works, I'd like to start crawling the entire web. I'll respect robots.txt. I will save all HTML, PDF, Word, Excel, PowerPoint, Keynote, etc. documents (not exes, dmgs and so on, just documents) in a MySQL DB. Next to that, I'll have a second table containing all results and descriptions, and a table of words and the pages on which those words can be found (i.e. an index). How much HDD space do you think I need to save all the pages? Is it as low as 1 TB, or is it about 10 TB, 20? Maybe 30? 1000? Thanks
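Respecting robots.txt, as mentioned above, is something Python's standard library can help with. The following is a minimal sketch using `urllib.robotparser`; the robots.txt content, the "MyCrawler" user agent and the example.com URLs are made up for illustration, and a real crawler would download each site's own /robots.txt before fetching any page from it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, agent, url):
    """Return True if the rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in ("http://example.com/questions/1",
                "http://example.com/private/page.html"):
        print(url, allowed(rules, "MyCrawler", url))
```

Checking every URL against the parsed rules before downloading it also keeps disallowed pages out of the storage estimate.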

Moving from a static site to a CMS with new URLs and meta-data for pages

- by Chris J

Hi, I am in the process of rebuilding a site from static pages to a CMS which will use mod_rewrite to generate new page URLs. As part of this, our marketing people and I have decided to tidy up the descriptions, keywords and titles. For example, a page whose URL is currently "website-name/about_us.html" and whose title is "website-name - something not quite page specific" will change to "website-name/about-us/" with the title "about us - website-name", and may have a few keywords and the description changed. Our goal in updating the metadata is to improve our page rankings and to stay in line with some SEO best practices. Though our current page rankings are quite good in many respects, there is room for improvement. All of the pages will also have content changes (such as rearranged heading tags, a new menu on all pages, new content in the footer, and extra pieces of dynamic content relating to other pages). In the new site I plan to use 301 redirects from all the old URLs to the new URLs. My question is: what can I expect to happen to the page rankings in Google, in the short term and the long term? Will this be like launching a new site which has to build up trust over time, or will the original page rankings carry over?

New site not appearing in index after change of address, no feedback from Google Webmaster Tools

- by Duffy

Our change of address seems not to be taking effect. Here's the story so far: we're a web company and our product is called The New Hive. Our site used to be at thenewhive.com, but we decided to switch to newhive.com (dropping the "the"; it's cleaner). Here is the timeline of what I've tried.

Starting on July 29th:
- used 301 redirects for all pages (e.g. thenewhive.com/tag/art redirects to newhive.com/tag/art)

At this point we noticed that we had disappeared from search results for "The New Hive"; the first page of results used to be all links to our site plus a couple of news articles about the company. So on August 5th I:
- verified the new domain in Webmaster Tools (the old domain was already verified)
- submitted a change of address request via Webmaster Tools / Configuration / Change of Address

After another week, on August 13th, I:
- went to Webmaster Tools / Health / Fetch as Google
- fetched our homepage and a couple of sub-pages, all successfully
- clicked "Submit to Index" for the homepage

As of today (August 23rd) we're still not showing up in the index. We're getting no warnings or feedback of any kind from the dashboard, so I'm inclined to think something's broken with the dashboard rather than that something's wrong with our site from an SEO perspective.

From the dashboard:
- No new messages or recent critical issues.
- Crawl Errors: No data available.

From Health - Index Status:
- Total indexed: 0
- Ever crawled: 42,490
- Not selected: 12
- Blocked by robots: 0

I'm really at a loss here; any help would be appreciated.
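When debugging a move like this, it can help to spot-check that each old URL answers with a 301 whose Location header points at the same path on the new domain. The helper below is a small sketch of that check using only the Python standard library; `expected_target` and `redirect_ok` are hypothetical names, and the status code and Location value are assumed to come from whatever HTTP client performs the request.

```python
from urllib.parse import urlsplit, urlunsplit


def expected_target(old_url, new_host):
    """Return old_url with only its host swapped for new_host."""
    parts = urlsplit(old_url)
    return urlunsplit((parts.scheme, new_host, parts.path,
                       parts.query, parts.fragment))


def redirect_ok(old_url, status, location, new_host):
    """True if the response is a 301 pointing at the path-preserving target."""
    return status == 301 and location == expected_target(old_url, new_host)


if __name__ == "__main__":
    old = "http://thenewhive.com/tag/art"
    print(expected_target(old, "newhive.com"))
    # A temporary (302) redirect does not pass the check.
    print(redirect_ok(old, 302, "http://newhive.com/tag/art", "newhive.com"))
```

A check like this only confirms that the redirects are configured as intended; it says nothing about how quickly a search engine picks them up.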

SEO: Getting site to show in location-specific searches

- by willvv

I'm really new to this SEO world and I've been reading a lot to try to figure it out. We have a site, moodbond.com, that allows users to browse/create events anywhere, and we fill it with content from the main cities in the US. We would like it to show up for searches like "events in san francisco" or "what to do in new york"; however, since the site is not really location-specific, I'm not sure where to begin. I've been considering a couple of things; maybe you can help me decide whether these would be a good way to start or whether I should try something different.

1- Allow location-specific URLs (e.g. moodbond.com/browse/san-francisco), which could just show the main page centered on San Francisco.
2- Change the headers/title of the page so that it adapts automatically to the city being browsed (and change this dynamically as the user changes the location of the map).
3- Add internal links to different locations (e.g. a link in the footer of the page that says "Events in Seattle" and makes the site load events in that city). This would probably depend on implementing #1.

What do you guys think? Will any of these really help, or should I look for a different approach? Any advice is welcome. Thanks

Drupal site Instant Messaging [migrated]

- by pthurmond

I am trying to find a module, or a standalone solution that I can turn into a module, that will let me have an instant messaging system like Facebook's on a Drupal site that I am working on. I have never set up a chat system before. My particular requirements are rather stringent. It needs to be a solution where we host the chat server (if one is needed separately from the website itself). It must use the site's login state (it can't use an external system at all, which means no GTalk, Yahoo IM, or AIM). It also must be able to handle up to 1,000 users at any given time. I have looked through the Drupal community and I tried the DXMPP module, but it requires jQuery UI 1.8, which doesn't work with all of the other things that my site uses (such as Homebox). We do have a Jabber server already set up and ready to go. Does anyone have any thoughts or options here? Thanks! EDIT: We are using Drupal 6.

TypeError: coercing to Unicode: need string or buffer, User found

- by Clemens

Hi, I have to crawl last.fm for users (a university exercise). I'm new to Python and get the following error:

    Traceback (most recent call last):
      File "crawler.py", line 23, in <module>
        for f in user_.get_friends(limit='200'):
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", line 2717, in get_friends
        for node in _collect_nodes(limit, self, "user.getFriends", False):
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", line 3409, in _collect_nodes
        doc = sender._request(method_name, cacheable, params)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", line 969, in _request
        return _Request(self.network, method_name, params).execute(cacheable)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", line 721, in __init__
        self.sign_it()
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", line 727, in sign_it
        self.params['api_sig'] = self._get_signature()
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", line 740, in _get_signature
        string += self.params[name]
    TypeError: coercing to Unicode: need string or buffer, User found

I use the pylast library for crawling. What I want to do: I want to get a user's friends and the friends of those friends. The error occurs when I have a for loop inside another for loop. Here's the code:

    network = pylast.get_lastfm_network(api_key = API_KEY, api_secret = API_SECRET, username = username, password_hash = password_hash)
    user = network.get_user("vidarnelson")
    friends = user.get_friends(limit='200')
    i = 1
    for friend in friends:
        user_ = network.get_user(friend)
        print '#%d %s' % (i, friend)
        i = i + 1
        for f in user_.get_friends(limit='200'):
            print f

Any advice? Thanks in advance. Regards!
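The "User found" part of the error suggests a likely cause: get_friends() appears to return user objects rather than name strings, so passing one of them to get_user(), which expects a name, puts an object where the request signing code needs a string. The sketch below illustrates the friends-of-friends loop with that in mind. It deliberately uses a stand-in `FakeUser` class and a made-up friend graph instead of pylast, so every name in it is hypothetical; the point is only that each returned object can be used directly.

```python
class FakeUser:
    """Hypothetical stand-in for a user object; this is not the pylast class."""

    def __init__(self, name, graph):
        self.name = name
        self._graph = graph

    def get_friends(self, limit=50):
        # Return user objects (not plain strings), as the traceback suggests.
        names = self._graph.get(self.name, [])[:limit]
        return [FakeUser(name, self._graph) for name in names]

    def __str__(self):
        return self.name


def friends_of_friends(user, limit=200):
    """Map each friend's name to the names of that friend's friends."""
    result = {}
    for friend in user.get_friends(limit=limit):
        # `friend` is already a user object, so call get_friends() on it
        # directly instead of looking it up again by passing the object.
        result[str(friend)] = [str(f) for f in friend.get_friends(limit=limit)]
    return result


# Made-up friend graph used only for this illustration.
GRAPH = {"vidarnelson": ["alice", "bob"], "alice": ["carol"], "bob": []}

if __name__ == "__main__":
    print(friends_of_friends(FakeUser("vidarnelson", GRAPH)))
```

If a name string really is required somewhere, converting the object explicitly (for example with str()) before passing it on is the other option, assuming the object's string form is the user name.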

SharePoint Content and Site Editing Tips

- by Bil Simser

A few content management and site editing tips for power users on this bacon flavoured unicorn morning. The theme here is keep it clean!

Write "friendly" email addresses

Remember it's human beings reading your content. So seeing something like "If you have questions please send an email to [email protected]" breaks up the readability. Instead, take the simple steps of writing the content in plain English, then going back, highlighting the name and inserting a link (note: you might have to prefix the link with mailto:[email protected]). It makes for a friendlier looking page and hides the ugliness that sometimes appears in email addresses.

Use friendly column and list names

This is a big pet peeve of mine. When you first create a column or list with spaces in its name, the internal name is changed. The display name might be "My Amazing List of Animals with Large Testicles" but the internal (and link) name becomes "My_x0020_Amazing_x0020_List_x0020_of_x0020_Animals_x0020_with_x0020_Large_x0020_Testicles". What's worse is if you create a publishing page named "This Website is Fueled By a Dolphin's Spleen". Not only is it incorrect grammar, but the apostrophe wreaks havoc on both the internal name for the list (with lots of crazy hex codes) and the hyperlink (where everything is URL-encoded). Instead, create the list with a distinct and compact name, then go back and change the display name to whatever you want. The end result is a better formed name that is easier to use in scripts and code.

Keep your Views Clean

When you add a column to a list or create a new list, the default is to add it to the default view. Do everyone a favour and don't check this box! The default view of a list should be something similar to the Title field and nothing else. Keep it clean. If you want a default view that's different, go back and create one with all the fields, filtering and sorting columns you want and set it as default. It's a good idea to keep the original AllItems.aspx (note the lack of a space in the filename!) simple and unfiltered. It's also a good idea to keep your column count down in views. Don't let every column be added by default and don't add every column just because you can. Create separate views for distinct responsibilities and try to keep the number of columns down to a single screen to prevent horizontal scrolling.

Simple Navigation

The Quick Launch is a great tool for navigating around your site, but don't use the default of adding all lists to it. Uncheck that box and keep navigation simple. Create custom groupings that make sense: if "Documents and Lists" doesn't fit your site but "Reports and Notices" does, use that. Also hide internal lists from the Quick Launch. For example, if most users don't need to see all the lookup tables you might have on a site, don't show them. You can use audience filtering on the Quick Launch if you want to hide admin items from non-admin users, so consider that as an option.

Enjoy!

"Server not found" for live version of site

- by user1491819

I can access my local dev site on my local PC, e.g. http://mysite, but I cannot access the live site, http://www.mysite.com, even though it works fine on other PCs.

The live site gives this error in Firefox: Server Not Found. Pinging www.mysite.com gives the error: "Ping request could not find host www.mysite.com"

hosts file:

    127.0.0.1 mysite

I changed the hosts file to the following and rebooted:

    127.0.0.1 mysitedev

I'm running XP, and have cleared the DNS cache using:

    ipconfig /flushdns

I have verified that the live site is up using http://www.isup.me/, and the site loads fine on my phone. What could be preventing my local PC from accessing the live site?

Apache virtual host for drupal test site

- by bsreekanth

Hello, I am a programmer trying to launch my first website. With the help of various posts on SF and elsewhere, I set up an account with Linode and set up a slice (Debian, Apache, etc.). I have a Drupal site under development and would like to have a test site on the Linode server as well. I would like the setup to meet the following requirements. What is the best way to set up and protect the test site alongside the actual (production) site? Is a virtual host the answer? To protect the test site, is .htaccess authentication sufficient to prevent access by the public and by robots? I am also modifying the theme, database contents, etc., so having two sites under one Drupal installation may not be a good idea. What do you suggest? Thanks in advance. bsreekanth.

Read the article
Configure IIS to rewrite IP Address to Site Name

- by Bath Man

So I've started my first web site from home, and I'm trying to get it up and running and crawlable by Google and the like, but I can't figure out how to have my site name shown in the address bar instead of my IP address. I purchased a domain name for my site on GoDaddy and set it to redirect to my site. When you type in the domain name, you get redirected to http://0.0.0.0/default.aspx (not my real IP, obviously), and that stays in the user's address bar. To fix that temporarily, I've set up masking on GoDaddy, which keeps the domain name in the address bar but just shows my website in a frame. This is fine for users visiting the site, but an automated robot such as Googlebot cannot discover my content because of the frame. I've looked into ISAPI filters, server-side rewriting and the like, but I can't quite figure out how to make it do what I need. Any simple suggestions or links would be appreciated.

Read the article
Can I use a Google Appliance/Mini to crawl and index sites I don't own?

- by SkippyFire

Maybe this is a stupid question, but... I am working with this company and they said they needed to get "permission" to crawl other people's sites. They have a Google Search Appliance and some Google Minis and want to point them at other sites to aggregate content. The end result will be something like a targeted search engine. (All the indexed sites relate to a specific topic.)

The only things they will be doing are:
- Indexing content from the other sites/domains
- Providing search functionality on their own site that searches the indexed content (like Google, displaying summaries and not the full content)
- Providing links back to the original content in the search results

Their intent is not malicious in nature; it is to provide a single site/resource for people to reference on their given topic. Is there anything illegal or fishy about this process?

Read the article
Understanding Ajax crawling of search site

- by vacuum

I have a couple of questions about Ajax crawling of a site which is itself a kind of search engine. The base article explains the mechanism for making an AJAX application crawlable. All the material about HTML snapshots is clear and easy to implement, but I can't understand where Googlebot will get the URL mentioned in "the crawler finds a pretty AJAX URL" (i.e. www.example.com/ajax.html#key=value) to work with. The first thing that came to mind is breadcrumbs: in the sitemap we can specify pages with breadcrumbs on them, so the bot will go to these pages and get HTML snapshots from there. But I'm sure there are other ways to give the bot this "pretty AJAX URL". In our case, we have a simple search site where the user enters a keyword and presses "Find"; JavaScript executes an Ajax request, receives a JSON response and fills the page with results (without any refresh, of course). In this case, how do we make Googlebot crawl all the results in addition to the sitemap? Is there an example of the solution described in the article above?

Read the article
WSS 3.0 to SharePoint 2010: Tips for delaying the Visual Upgrade

- by Kelly Jones

My most recent project has been to migrate a bunch of sites from WSS 3.0 (SharePoint 2007) to SharePoint Server 2010. The users are currently working with WSS 3.0 and Office 2003, so the new ribbon based UI in 2010 will be completely new. My client wants to avoid the new SharePoint 2010 look and feel until they’ve had time to train their users, so we’ve been testing the upgrades by keeping them with the 2007 user interface.

Permission to perform the Visual Upgrade

One of the first things we noticed was the default permissions for who was allowed to switch the UI from 2007 to 2010. By default, site collection administrators and site owners can do this. Since we wanted to more tightly control the timing of the new UI, I added a few lines to the PowerShell script that we are using to perform the migration. This script creates the web application, sets the User Policy, and then does a Mount-SPContentDatabase to attach the old 2007 content database to the 2010 farm. I added the following steps after the Mount-SPContentDatabase step:

    #Remove the visual upgrade option for site owners
    # it remains for Site Collection administrators
    foreach ($sc in $WebApp.Sites){
        foreach ($web in $sc.AllWebs){
            #Visual Upgrade permissions for the site/subsite (web)
            $web.UIVersionConfigurationEnabled = $false;
            $web.Update();
        }
    }

These script steps loop through each Site Collection ($sc) in a particular web application ($WebApp), then loop through each subsite ($web) in the Site Collection and disable the Site Owner’s permission to perform the Visual Upgrade. This is equivalent to going to the Site Collection administrator settings page –> Visual Upgrade and selecting “Hide Visual Upgrade”. Since only IT people have Site Collection administrator privileges, this will allow IT to control the timing of the new 2010 UI rollout.

Newly created subsites

Our next issue was brought to our attention by SharePoint Joel’s blog post last week (http://www.sharepointjoel.com/Lists/Posts/Post.aspx?ID=524 ). In it, he lists some updates about the 2010 upgrade, and his fourth point was one that I hadn’t seen yet:

4. If a 2007 upgraded site has not been visually upgraded, the sites created underneath it will look like 2010 sites – While this is something I’ve been aware of, I think many don’t realize how this impacts common look and feel for master pages, and how it impacts good navigation and UI. As well depending on your patch level you may see hanging behavior in the list picker. The site and list creation Silverlight control in Internet Explorer is looking for resources that don’t exist in the galleries in the 2007 site, and hence it continues to spin and spin and eventually time out. The work around is to upgrade to SP1, or use Chrome or Firefox which won’t attempt to render the Silverlight control. When the root site collection is a 2007 site and has it’s set of galleries and the children are 2010 sites there is some strange behavior linked to the way that the galleries work and pull from the parent.

Our production SharePoint 2010 Farm has SP1 installed, as well as the December 2011 Cumulative Update, so I think the “hanging behavior” he mentions won’t affect us. However, since we want to control the roll out of the UI, we are concerned that new subsites will have the 2010 look and feel, no matter what the parent site has.

Ok, time to dust off my developer skills. I first looked into using feature stapling, but I couldn’t get that to work (although I’m pretty sure I had everything wired up correctly). Then I stumbled upon SharePoint 2010’s web events, a great way to handle this. Using Visual Studio 2010, I created a new SharePoint project and added a Web Event Receiver. In the Event Receiver class, I used the WebProvisioned method to check if the parent site is a 2007 site (UIVersion = 3), and if so, set the newly created site to 2007:

    /// <summary>
    /// A site was provisioned.
    /// </summary>
    public override void WebProvisioned(SPWebEventProperties properties)
    {
        base.WebProvisioned(properties);
        try
        {
            SPWeb curweb = properties.Web;
            if (curweb.ParentWeb != null)
            {
                //check if the parent website has the 2007 look and feel
                if (curweb.ParentWeb.UIVersion == 3)
                {
                    //since parent site has 2007 look and feel
                    // we'll apply that look and feel to the current web
                    curweb.UIVersion = 3;
                    curweb.Update();
                }
            }
        }
        catch (Exception)
        {
            //TODO: Add logging for errors
        }
    }

This event is part of a Feature that is scoped to the Site level (Site Collection). I added a couple of lines to my migration PowerShell script to activate the Feature for any site collections that we migrate.

Plan Going Forward

The plan going forward is to perform the visual upgrade after the users for a particular site collection have gone through 2010 training. If we need to do several site collections at once, we’ll use a PowerShell script to loop through each site collection to update the sites to 2010. If it’s just one or two, we’ll be using the “Update All Sites” button on the Visual Upgrade page for Site Collection Administrators. The custom code for newly created sites won’t need to be changed, since it relies on the UI version of the parent site. If the parent is 2010, then the new site will look 2010.

Read the article
Detecting 'stealth' web-crawlers

- by Jacco

What options are there to detect web-crawlers that do not want to be detected? (I know that listing detection techniques will allow the smart stealth-crawler programmer to make a better spider, but I do not think that we will ever be able to block smart stealth-crawlers anyway, only the ones that make mistakes.)

I'm not talking about the nice crawlers such as Googlebot and Yahoo! Slurp. I consider a bot nice if it:
- identifies itself as a bot in the user agent string
- reads robots.txt (and obeys it)

I'm talking about the bad crawlers, hiding behind common user agents, using my bandwidth and never giving me anything in return.

There are some trapdoors that can be constructed. Updated list (thanks Chris, gs):
- Adding a directory that is only listed (marked as disallow) in robots.txt
- Adding invisible links (possibly marked as rel="nofollow"?):
  - style="display: none;" on the link or its parent container
  - placed underneath another element with a higher z-index
- detect who doesn't understand CaPiTaLiSaTioN
- detect who tries to post replies but always fails the Captcha
- detect GET requests to POST-only resources
- detect the interval between requests
- detect the order of pages requested
- detect who (consistently) requests https resources over http
- detect who does not request image files (this, in combination with a list of user agents of known image-capable browsers, works surprisingly well)

Some traps would be triggered by both 'good' and 'bad' bots. You could combine those with a whitelist:
1. It triggers a trap.
2. It requests robots.txt?
3. It does not trigger another trap because it obeyed robots.txt.

One other important thing here is: please consider blind people using screen readers. Give people a way to contact you, or to solve a (non-image) Captcha to continue browsing.

What methods are there to automatically detect web crawlers trying to mask themselves as normal human visitors?

Update: The question is not "How do I catch every crawler?" The question is "How can I maximize the chance of detecting a crawler?" Some spiders are really good, and actually parse and understand HTML, XHTML, CSS, JavaScript, VBScript, etc. I have no illusions: I won't be able to beat them. You would, however, be surprised how stupid some crawlers are, the best example of stupidity (in my opinion) being to cast all URLs to lower case before requesting them. And then there is a whole bunch of crawlers that are just 'not good enough' to avoid the various trapdoors.
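
A few of the trapdoors listed above (the robots.txt-only directory, lower-cased URLs, GET requests to POST-only resources, and the interval between requests) can be combined into a simple per-client score. The Python below is only a sketch of that idea: the paths, the 0.5-second interval, the threshold, the client addresses and the `CrawlerScore` class are all hypothetical values chosen for illustration, and a real deployment would feed it from the web server's access log.

```python
from collections import defaultdict

# Hypothetical site data for illustration; none of it comes from the question.
HONEYPOT_PREFIX = "/trap/"                     # listed only as Disallow in robots.txt
KNOWN_PATHS = {"/Articles/Index.html", "/About.html"}  # real, case-sensitive URLs
LOWERED_PATHS = {p.lower() for p in KNOWN_PATHS}
POST_ONLY = {"/comment"}                       # resources that only accept POST
MIN_INTERVAL = 0.5                             # seconds between requests


class CrawlerScore:
    """Collects the names of the traps each client has triggered."""

    def __init__(self):
        self.flags = defaultdict(set)   # client -> set of trap names
        self.last_seen = {}             # client -> timestamp of previous request

    def observe(self, client, method, path, timestamp):
        flags = self.flags[client]
        if path.startswith(HONEYPOT_PREFIX):
            flags.add("honeypot")
        # A known URL requested in lower case only: the "CaPiTaLiSaTioN" trap.
        if path not in KNOWN_PATHS and path in LOWERED_PATHS:
            flags.add("lowercased-url")
        if method == "GET" and path in POST_ONLY:
            flags.add("get-on-post-only")
        previous = self.last_seen.get(client)
        if previous is not None and timestamp - previous < MIN_INTERVAL:
            flags.add("too-fast")
        self.last_seen[client] = timestamp

    def suspects(self, threshold=2):
        """Clients that triggered at least `threshold` distinct traps."""
        return sorted(c for c, f in self.flags.items() if len(f) >= threshold)


score = CrawlerScore()
score.observe("10.0.0.1", "GET", "/articles/index.html", 0.0)
score.observe("10.0.0.1", "GET", "/trap/secret.html", 0.1)
score.observe("10.0.0.2", "GET", "/About.html", 0.0)
print(score.suspects())
# -> ['10.0.0.1']
```

Requiring several distinct traps before flagging a client reflects the point made above that a single trap can also be triggered by well-behaved bots or by ordinary visitors.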

Read the article
Can google “see” this custom javascript code which displays links from an external site to mine

- by webmasters

I have JavaScript code on my site that displays links from another site. This is what I have in my source before the page loads:
<script language="JavaScript" type="text/javascript">showLink(1);</script>
This is what I copied from my source after the page has loaded:
<script language="JavaScript" type="text/javascript">showLink(1);</script><a rel="nofollow" target="_blank" class="anc" href="http://x5.external_site.net/sc/out.php?s=5483&o=http%3A%2F%2Fwww.bluetooth.com">Bluetooth Devices</a>
Can Google see this link?

Read the article
.htaccess: Redirect Hotlink Flash --> Site with embed Flash

- by user5571

I have some .php pages that embed .swf files. Other people are now linking directly to these .swf files. I don't want them to simply open the SWF; I want them to be redirected to the page where the Flash is embedded.

Data:
Site: www.example.com/1 (www.example.com/2, www.example.com/3 and so on)
Flash: www.example.com/flash/flash_NUMBER.swf

So for www.example.com/1:
Site: www.example.com/1
Flash: www.example.com/flash/flash_1.swf

I now want a user who types "www.example.com/flash/flash_1.swf" into the address bar to be redirected to www.example.com/1. The problem is that the Flash still needs to be accessible via www.example.com/1 (it is embedded in that page), and I can't get that working. The tool I would like to use for this is .htaccess with RewriteRule. I hope someone can help me out.

Read the article
