Search Results

Search found 241 results on 10 pages for 'crawling'.


Hiding php includes from search spiders?

- by 21stcn

Quick and simple question. I have 80+ html files which I want to be crawled. They are individual product pages. Each of these pages calls its content using php includes. These php include files are in a separate folder on the server and contain the core content for the individual product pages. I just wanted to ask, if I use robots.txt or .htaccess to prevent crawling of the directory that holds the php content files, will there be no issue crawling the html pages which include these files? What I want to achieve is have the html files indexed with the php content included in them, but I don't want visitors landing on the php content pages, nor have these php files indexed as duplicate content. Just clarification needed as to whether it is safe to block spiders from accessing the php folder, without this affecting the html files being indexed with the included content. Is this the best way to do things? Or should I just leave the content php files to be crawled?

Read the article
Google is not indexing my entire site despite having a sitemap

- by Anusha

I have an e-commerce website www.beyondtime.in. I have been constantly monitoring Googlebot crawling on my website and my webmaster account. Lately, I have found two issues that I have not been able to understand. 1.) The Google Bots have been only crawling www.beyondtime.in/telecom.php when the URL is not even valid. What needs to be done to let Google crawl other pages of the website as well? 2.) The second question is about the Google Webmaster account, where I've submitted my sitemap with 227 URLs. Out of that, only 156 have been indexed. None of the images of my website have been indexed by Google.

Read the article
Understanding the maximum hit-rate supported by a web-server

- by SNag

I would like to crawl a publicly available site (and one that's legal to crawl) for a personal project. From a brief trial of the crawler, I gathered that my program hits the server with a new HTTPRequest 8 times in a second. At this rate, as per my estimate, to obtain the full set of data I need about 60 full days of crawling. While the site is legal to crawl, I understand it can still be unethical to crawl at a rate that causes inconvenience to the regular traffic on the site. What I'd like to understand here is -- how high is 8 hits per second to the server I'm crawling? Could I possibly do 4 times that (by running 4 instances of my crawler in parallel) to bring the total effort down to just 15 days instead of 60? How do you find the maximum hit-rate a web-server supports? What would be the theoretical (and ethical) upper-limit for the crawl-rate so as to not adversely affect the server's routine traffic?
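
The arithmetic in the question can be checked with a short sketch. It assumes the crawl is purely rate-bound (total work = requests per second x duration) and that four parallel instances scale linearly, which the target server may not tolerate. The function name `crawl_days` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def crawl_days(total_requests: int, requests_per_second: float) -> float:
    """Days needed to issue total_requests at a steady request rate."""
    return total_requests / (requests_per_second * SECONDS_PER_DAY)

# Figures from the post: 8 requests/second for roughly 60 days.
total = 8 * SECONDS_PER_DAY * 60      # 41,472,000 requests in all

single = crawl_days(total, 8)         # one crawler instance
parallel = crawl_days(total, 8 * 4)   # four instances, assuming linear scaling

print(f"{total:,} requests: {single:.0f} days alone, {parallel:.0f} days with 4 instances")
```

This only confirms the 60-to-15-day estimate; it says nothing about what rate the server can absorb, which is the actual question.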

Read the article
How URL Redirection affects SEO?

- by Costa

The following paragraph is from Google's SEO guide: "Google is good at crawling all types of URL structures, even if they're quite complex, but spending the time to make your URLs as simple as possible for both users and search engines can help. Some webmasters try to achieve this by rewriting their dynamic URLs to static ones; while Google is fine with this, we'd like to note that this is an advanced procedure and if done incorrectly, could cause crawling issues with your site." What makes a URL rewriting implementation incorrect for Googlebot? I am using the ASP.NET 3.5 framework. Thanks

Read the article
How does URL Rewriting affect SEO?

- by Costa

The following paragraph is from Google's SEO guide: "Google is good at crawling all types of URL structures, even if they're quite complex, but spending the time to make your URLs as simple as possible for both users and search engines can help. Some webmasters try to achieve this by rewriting their dynamic URLs to static ones; while Google is fine with this, we'd like to note that this is an advanced procedure and if done incorrectly, could cause crawling issues with your site." What makes a URL rewriting implementation incorrect for Googlebot? I am using the ASP.NET 3.5 framework.

Read the article
SharePoint Search Problem: The crawler could not communicate with the server.

- by Clara Oscura

This one was not easy to solve ... Error: The crawler could not communicate with the server. Check that the server is available and that the firewall access is configured. Context: Some pages were not crawled (giving the above error) and, what is worse, all the sub-content of that site was not crawled either! (The pages were the homepages of the site.) Solution: The pages that could not be crawled due to this error contained a custom web part. This web part used default credentials for a given action. During crawling, the SP_Search account tried to perform this action but did not have the appropriate rights. This gave an error that stopped the crawling for the whole site. This blog helped me: http://patricklamber.blogspot.com/2010/04/why-might-moss-crawler-not-working.html

Read the article
Issue with sitemap in GWT

- by Anusha

I have an e-commerce website, www.beyondtime.in. I have been constantly monitoring Googlebot crawling on my website and my webmaster account. Lately, I have found two issues that I have not been able to understand and hence want your help. 1.) Googlebot has only been crawling one URL of my website, www.beyondtime.in/telecom.php, when that URL is not even valid. Kindly help me understand what needs to be done to let Google crawl the other pages of the website as well. 2.) The second question is about the Google Webmaster account, where I've submitted my sitemap with 227 URLs, but out of that only 156 have been indexed. Also, none of the images on my website have been indexed by Google. Kindly help me with this as well. Thanks

Read the article
Recursive function MultiThreading to perform one task at a time.

- by Ajay

Hi, I am writing a program to crawl websites. The crawl function is recursive and may take a long time to complete, so I used multithreading to crawl multiple websites. What I actually need is this: after it finishes crawling one website, it should start the next one (which should be waiting in a queue), instead of crawling multiple websites at the same time. I am using C# and ASP.NET.
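
One common way to get "one site at a time, in order" is a single worker thread draining a queue. The sketch below is in Python rather than the poster's C#, and `crawl` is a hypothetical placeholder for the real recursive crawl. It is meant only to illustrate the pattern, which carries over to a queue plus one background thread in .NET.

```python
import queue
import threading

def crawl(site: str) -> str:
    # Hypothetical stand-in for the real (recursive) crawl of one website.
    return f"crawled {site}"

def worker(jobs: queue.Queue, results: list) -> None:
    """Take sites off the queue one by one; the next starts only after the previous ends."""
    while True:
        site = jobs.get()
        if site is None:          # sentinel: no more work
            jobs.task_done()
            break
        results.append(crawl(site))
        jobs.task_done()

jobs = queue.Queue()
results = []

# A single worker thread keeps the caller responsive while crawls run sequentially.
thread = threading.Thread(target=worker, args=(jobs, results))
thread.start()

for site in ["a.example", "b.example", "c.example"]:
    jobs.put(site)
jobs.put(None)                    # tell the worker to stop after the last site
thread.join()

print(results)
```

Because there is exactly one consumer, sites are crawled in the order they were queued and never overlap.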

Read the article
Varnish / Apache redirecting to backend port 8080

- by deko

I'm running Varnish 2 with an Apache backend on port 8080 on the same machine. Everything is working fine, except for one problem: sometimes Apache(?) redirects to the backend port :8080, especially when I'm using htaccess. Users are shown port 8080 in the URL, and Google is crawling my site on the backend port as well, which is not desirable. I want Apache on 8080 to be accessible only to Varnish on localhost, and not to redirect to or display the backend port. What would be a quick way to prevent users from being directed to 8080 and to stop search engines from crawling the backend? Here is an example htaccess line: redirect /promotion /register.php?promotion=june which causes www.domain.com/promotion to redirect to www.domain.com:8080/register.php?promotion=june

Read the article
SharePoint 2010 Search - not search additional content sources

- by Chris W

I've got SP 2010 crawling a secondary intranet system that we'll run in parallel as part of a long-running migration to SharePoint when it releases. Whilst it's crawling the pages without problem, I can't see how to get the results to appear as part of the Quick Search results if the user does a search from the little search dialog box on the home page. Searches completed within a My Sites page list results from both the SharePoint installation and the external content source. Searches from the main search dialog only list results for SharePoint items. I tried adding the drop-down option to select the site to search, but this list only includes the name of the current site and doesn't offer an 'All Sites' scope option, which I think would include the content. What am I doing wrong?

Read the article
Web-Server directory permissions

- by MLS

Hello All, I would like some help understanding web-server directory permissions. Setup: Apache, CentOS, PHP, MySQL. For example, I have multiple sites in /var/www/html, in paths like /var/www/html/www_domainname_com. Inside each site I might have a path like /lib/mysql/ holding things like PHP connection code, database config, etc. What should my permissions be so that someone cannot just browse to that directory? Should I just .htaccess them? I have apache:apache as the owner of all my web directories. Can I prevent someone from crawling certain directories of my web server? I have a robots.txt, but what is to say the crawler obeys it? So, to sum up: 1. What is the best owner/permission set for my sensitive files that the web server, PHP, or MySQL needs, but that I don't want people browsing to? 2. Can I outright prevent crawling of portions of the site?

Read the article
Robots.txt practices with .htaccess redirections (inherits)

- by Jayhal

I have a question regarding how to write robots.txt files for many domains and subdomains with redirects in place. We have a hosting account with a primary domain and add-on domains. All of our domains and subdomains, including the primary domain, are redirected via htaccess 301s to their own subdirectories in the primary domain's root directory. I'm confused about how I would write the robots.txt for certain directories. First, I wanted to confirm I am right in understanding that for domains and subdomains, crawlers will look to the directory that acts as that URL's root directory for the crawling rules (robots.txt). Also, that a directory will not be affected by a robots.txt present in its parent directory if the directory has its own domain/subdomain, and that URL is the one being accessed by crawlers. (I am pretty sure, but I wanted to confirm I didn't have a fundamentally flawed understanding of robots.txt.) In the original root directory on the account (where the primary domain pointed before htaccess was put in place), what should the robots.txt contain? When crawlers look to crawl our primary domain, will they look to the original root directory for the robots.txt, or will they reference the file contained in the new subdirectory where all the primary domain's site files are located? If the former, what should the root's robots.txt include, if anything at all? Would I be right to include a simple 'disallow: /' for all agents, and then include more specific robots.txt files in each subdirectory with more specific instructions? Would that affect the crawling of the directory where the primary domain is now redirected? Any help is greatly appreciated. Thanks!
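
To make the effect of a blanket 'disallow: /' concrete, here is a small sketch using Python's standard `urllib.robotparser`. It only shows how a given robots.txt body is evaluated for a URL; it does not model how the hosting account maps domains to directories. The helper name `allowed`, the example.com hosts, and the sample rules are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Evaluate a robots.txt body (as text) for one URL and user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# Hypothetical rule sets: a blanket block, and a narrower per-site file.
block_all = "User-agent: *\nDisallow: /\n"
block_private = "User-agent: *\nDisallow: /private/\n"

print(allowed(block_all, "http://example.com/page.html"))
print(allowed(block_private, "http://sub.example.com/page.html"))
print(allowed(block_private, "http://sub.example.com/private/x.html"))
```

Crawlers request /robots.txt per host, so whichever file the server actually returns at that URL for a given domain or subdomain is the one whose rules apply to it.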

Read the article
Recovering from an incorrectly deployed robots.txt?

- by Doug T.

We accidentally deployed a robots.txt from our development site that disallowed all crawling. This has caused traffic to dip dramatically, and Google results to report: "A description for this result is not available because of this site's robots.txt – learn more." We corrected the robots.txt about 1.5 weeks ago, and you can see our robots.txt here. However, search results still report the same robots.txt message. The same appears to be true for Bing. We've taken the following actions: submitted the site to be recrawled through Google Webmaster Tools, and submitted a sitemap to Google (basically doing everything possible to say "Hey we're here! and we're crawlable!"). Indeed, a lot of crawl activity seems to be happening lately, but still no description is crawled. I noticed this question where the problem was specific to a 303 redirect back to a disallowed path. We are 301 redirecting to /blog, but crawling is allowed here. This redirect is due to a site redesign: WordPress paths for posts such as /2012/02/12/yadda yadda have been moved to /blog/2012/02/12. We 301 redirect to WordPress for /blog to keep our Google juice. However, the sitemap we submitted might have /blog URLs. I'm not sure how much this matters. We clearly want to preserve Google juice for URLs linked to us from before our redesign with the /2012/02/... URLs. So perhaps this has prevented some content from getting recrawled? How can we get all of our content, with links pointing to our site from both before and after the redesign, to report descriptions? How can we resolve this problem and get our search traffic back to where it used to be?

Read the article
SEO & Multilingual: would this be a good practice?

- by Younès

I am currently making a bilingual website and I'd like to get nice SEO results, of course. Here's my idea: The internal links would use the "www" subdomain so that people can share links regardless of their language. Their language is determined by the HTTP_ACCEPT_LANGUAGE PHP variable. So they would see http://www.site.com/mydocument/123 in their address bar and never see links like "http://fr.site.com/mydocument/123" or "http://en.site.com/mydocument/123". The user can always switch the page's language using links in the footer. The language-switching link would be http://fr.site.com/mydocument/123, and clicking on it would change the language stored in the user's session and redirect the user to http://www.site.com/mydocument/123. In the case of a crawling bot: I read that if the HTTP_ACCEPT_LANGUAGE variable is missing, then it's a crawling bot. So, in that case, we set the default language to English. Each page, as I mentioned earlier, has a link for the other language: on the page http://www.site.com/document/1323, the link http://fr.site.com/document/1323 can be seen by the bot and be crawled. What do you think about this practice? Would I get good SEO results for each language?
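
The language-selection step described above can be sketched as follows. This is a simplified, illustrative parser for the Accept-Language header value, written in Python rather than the poster's PHP; the function name `pick_language`, the supported-language tuple, and the English default are assumptions taken from the post's description. It only demonstrates falling back to a default when the header is absent and does not verify that a missing header identifies a bot.

```python
def pick_language(accept_language, supported=("en", "fr"), default="en"):
    """Pick a site language from an Accept-Language header value.

    Falls back to `default` when the header is missing or nothing matches.
    """
    if not accept_language:
        return default

    candidates = []
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0                      # q defaults to 1 when omitted
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0              # treat malformed weights as unusable
        primary = tag.strip().lower().split("-")[0]   # "fr-FR" -> "fr"
        candidates.append((quality, primary))

    # Highest weight first; sorted() is stable, so ties keep header order.
    for quality, primary in sorted(candidates, key=lambda c: -c[0]):
        if quality > 0 and primary in supported:
            return primary
    return default

print(pick_language(None))
print(pick_language("fr-FR,fr;q=0.9,en;q=0.8"))
print(pick_language("de,en;q=0.5"))
```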

Read the article
problem with crawl many url in .net: Server IP not ping. maybe bandwidth or http connection limit ex

- by Hamid

Hi all, I developed a web crawling service (a multi-threaded Windows service). It works fine, but sometimes my server's network stops responding: I can't ping the server's IP from the internet, but I can ping it through the other network card (local IP), which has no internet access. After I open the server with Remote Desktop and stop the crawling service, I can ping again. What is my problem? Is a bandwidth limit or a maximum connection limit being exceeded, or something else? How can I prevent this issue? Note: when this problem occurs, I open a browser to browse a website, but can't open any website! Could you please help me? Thanks in advance.

Read the article
Having problem with a crawl service in .net: Server not responding to IP ping. Is it bandwidth or ht

- by Hamid

Hi all, I developed a web crawling service (a multi-threaded Windows service). It works fine, but sometimes my server's network stops responding: I can't ping the server's IP from the internet, but I can ping it through the other network card (local IP), which has no internet access. After I open the server with Remote Desktop and stop the crawling service, I can ping again. What is my problem? Is a bandwidth limit or a maximum connection limit being exceeded, or something else? How can I prevent this issue? Note: when this problem occurs, I open a browser to browse a website, but can't open any website! Could you please help me? Thanks in advance.

Read the article
Do you need to crawl the whole internet to find backlinks of a URL?

- by Luca Matteis

Say I want to retrieve all the sites on the web that have a specific link on them. For example, I want to know all the backlinks made to my blog on other websites. There are services out there that do this, such as http://www.backlinkwatch.com/index.php, and I was wondering how they achieve this functionality. Is crawling the entire internet the only option, or are there third-party ways of doing this, say using Google?

Read the article
How to suppress PHPSESSID in URL for Googlebot?

- by Roque Santa Cruz

I use cookie-based sessions, and they work for normal interaction with our site. However, when Googlebot comes crawling, our PHP framework, Yii, needs to append ?PHPSESSID to each URL, which doesn't look that good in the SERP. Are there any ways to suppress this behavior? PS. I tried to utilize ini_set('session.use_only_cookies', '1');, but it does not work. PPS. To get an impression of the SERP, they look like this: http://www.google.com/search?q=site:wwwdup.uni-leipzig.de+inurl:jobportal

Read the article
Navigation Category page not indexed by google

- by dhananjai gaur

The navigation menu is in the form of categories on the website http://bankpo.in, and none of the category pages is being indexed by Google. I searched with the exact URL for a category, and related results are still shown instead of the category page. I have checked Google Webmaster Tools and there are no crawling errors or any other error messages, so I am really confused about what the problem might be. The website is not new and is continuously updated, so can anyone please tell me the reason for this?

Read the article
Getting a lot of backslash underscore errors from webmaster tools

- by Vermino

I'm using a WordPress site and I thought I had got all the kinks out of it. For some reason Webmaster Tools is crawling my website and showing a lot of 404 errors for "/_"-like additional pages that I've never created. I just can't figure out what is exposing these to Google's crawlers and then returning a 404. My robots.txt: http://www.redcherryshrimp.net/robots.txt My sitemap, created by the Yoast plugin: http://www.redcherryshrimp.net/sitemap_index.xml I have the Yoast (creates the sitemap) and Jetpack plugins installed.

Read the article
Is it possible to tell a search engine not to index a specific section of an HTML page? [closed]

- by Justin

Possible Duplicate: Preventing robots from crawling specific part of a page. I know you can use robots.txt to ignore entire pages or sections of your site, but is there a way to tell crawlers like Googlebot to ignore specific sections of an HTML page? I found this blog post that discusses one method, but it appears to work only for the Google Search Appliance, not Googlebot. Is there some method to do this for at least Google?

Read the article
Why Lazy Websites Get Massive Search Engine Traffic - Here's the Secret

Do you want thousands of free, targeted visitors crawling all over your website? Here's how to get them.

Read the article
When Canonicalization is an Issue

Although extremely hard to pronounce, canonicalization is a hot topic right now. If there are a lot of URLs that lead to pretty much the same page, you're going to make the search engines work extra hard and spend a lot more time crawling all the different URLs. Often, this means that they'll miss the important pages of your website because the time they spend crawling your site is limited or the crawl is too slow.
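
As a rough illustration of "many URLs, one page", the sketch below collapses a few common URL variants into a single form using Python's standard `urllib.parse`. The normalization rules (lowercase host, drop the fragment, drop an assumed set of ignorable parameters, sort the rest, strip a trailing slash) and the names `canonicalize` and `IGNORED_PARAMS` are illustrative assumptions, not rules any particular search engine is known to apply.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that do not change the page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "PHPSESSID"}

def canonicalize(url: str) -> str:
    """Map equivalent-looking URL variants onto one canonical string."""
    parts = urlsplit(url)
    # Keep only meaningful parameters, in a stable (sorted) order.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    # Treat "/shop/" and "/shop" as the same page, but keep the root "/".
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )

variants = [
    "HTTP://Example.com/Shop/?b=2&a=1&utm_source=x#top",
    "http://example.com/Shop?a=1&b=2",
]
print({canonicalize(u) for u in variants})
```

On a live site the same idea is usually expressed with redirects or a canonical link in the page, so that crawlers see one URL per page.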

Read the article
How Many Web Pages Should Be Indexed?

Search engines are crawling websites around the clock for unique web pages and content. Google has always been at the top in indexing the deep links of any website: Google indexed 26 million pages in 1998, and in the past 10 years Google has indexed over 1 trillion pages. This gives a fair idea of how big this cyber world is.

Read the article
What is the Fastest Way to Get Listed in the Search Engines?

Getting a site up and getting it listed is something that any website owner wants to have happen. Nobody likes to wait around until Google bot decides to come crawling your way.

Read the article
