Search Results

Search found 446 results on 18 pages for 'crawl'.

Page 3/18 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >

fastest way to crawl recursive ntfs directories in C++

- by Peter Parker

I have written a small crawler to scan and re-sort directory structures. It is based on dirent (which is a small wrapper around FindNextFileA). In my first benchmarks it is surprisingly slow: around 123473 ms for 4500 files (ThinkPad T60p, local Samsung 320 GB 2.5" HD).

    121481 files found in 123473 milliseconds

Is this speed normal? This is my code:

    int testPrintDir(std::string strDir, std::string strPattern = "*", bool recurse = true) {
        struct dirent *ent;
        DIR *dir;
        dir = opendir(strDir.c_str());
        int retVal = 0;
        if (dir != NULL) {
            while ((ent = readdir(dir)) != NULL) {
                if (strcmp(ent->d_name, ".") != 0 && strcmp(ent->d_name, "..") != 0) {
                    std::string strFullName = strDir + "\\" + std::string(ent->d_name);
                    std::string strType = "N/A";
                    bool isDir = (ent->data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) != 0;
                    strType = (isDir) ? "DIR" : "FILE";
                    if ((!isDir)) {
                        //printf ("%s <%s>\n", strFullName.c_str(), strType.c_str()); //ent->d_name);
                        retVal++;
                    }
                    if (isDir && recurse) {
                        retVal += testPrintDir(strFullName, strPattern, recurse);
                    }
                }
            }
            closedir(dir);
            return retVal;
        } else {
            /* could not open directory */
            perror("DIR NOT FOUND!");
            return -1;
        }
    }
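For comparison, here is a minimal sketch of the same recursive count in Python rather than the poster's C++. It assumes that `os.scandir` is an acceptable stand-in for the dirent wrapper: like the `dwFileAttributes` check above, it normally takes each entry's type from the directory listing itself instead of issuing a separate call per file. The name `count_files` is hypothetical, and unlike the original this version skips unreadable subdirectories instead of adding -1 to the total.

```python
import os


def count_files(root, recurse=True):
    """Count non-directory entries under root; return -1 if root cannot be opened."""
    try:
        entries = os.scandir(root)
    except OSError:
        # Mirrors the "could not open directory" branch of the original.
        return -1
    total = 0
    with entries:
        for entry in entries:
            # scandir never yields "." or "..", so no name check is needed.
            if entry.is_dir(follow_symlinks=False):
                if recurse:
                    sub = count_files(entry.path, recurse)
                    if sub > 0:
                        # Assumption: unreadable subdirectories are skipped.
                        total += sub
            else:
                total += 1
    return total
```

Timing this against the C++ version on the same tree may help show whether the slowdown comes from the wrapper or from the disk itself.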

Read the article
Does Google crawl AJAX content?

- by Doug

On the home page of my site I use JQuery's ajax function to pull down a list of recent activity of users. The recent activity is displayed on the page, and each line of the recent activity includes a link to the user profile of the user who did the activity. Will Google actually make the ajax call to pull down this info and use it in calculating page relevancy / link juice flow? I'm hoping that it does not because the user profile pages are not very Google index worthy, and I don't want all those links to the User profile pages diluting my home page's link juice flow away from other more important links.

Read the article
Adding more OR searches with CONTAINS Brings Query to Crawl

- by scolja

I have a simple query that relies on two full-text indexed tables, but it runs extremely slowly when I combine CONTAINS with any additional OR search. As seen in the execution plan, the two full-text searches crush the performance. If I query with just one of the CONTAINS predicates, or neither, the query is sub-second, but the moment you add OR into the mix the query becomes ill-fated. The two tables are nothing special: they're not overly wide (42 cols in one, 21 in the other; maybe 10 cols are FT indexed in each) and they don't contain very many records (36k recs in the bigger of the two). I was able to solve the performance problem by splitting the two CONTAINS searches into their own SELECT queries and then UNIONing the three together. Is this UNION workaround my only hope? Thanks.

    SELECT a.CollectionID
    FROM collections a
    INNER JOIN determinations b ON a.CollectionID = b.CollectionID
    WHERE a.CollrTeam_Text LIKE '%fa%'
       OR CONTAINS(a.*, '"*fa*"')
       OR CONTAINS(b.*, '"*fa*"')

Execution Plan (guess I need more reputation before I can post the image):

Read the article
Does the user agent in any regular browser contain 'bot' or 'crawl'?

- by Echo

Does the user agent in any regular browser contain 'bot' or 'crawl'? I check the user agent on my site to see whether a request is coming from a bot. If it is, I can do some small optimizations, since bots don't log in. (I don't change the content at all.) After adding checks for 30-40+ bots, I'm getting tired of adding them, so I was wondering whether I could just check if the user agent contains 'bot' or 'crawl'. I know that won't catch all bots, but it would catch a lot of them. But if it could cause any false positives, it would totally break the ability to add to cart, place an order, and log in.
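The check described here can be sketched in a few lines of Python. This is an illustration only: the pattern and the name `looks_like_bot` are assumptions, and the sketch does not answer whether some real browser's user agent happens to contain either substring.

```python
import re

# Case-insensitive substring test for the two markers named in the question.
BOT_PATTERN = re.compile(r"bot|crawl", re.IGNORECASE)


def looks_like_bot(user_agent):
    """Return True if the user agent string contains 'bot' or 'crawl'."""
    return bool(user_agent) and BOT_PATTERN.search(user_agent) is not None
```

Because a false positive would disable cart and login features, it seems safer to use such a check only for optimizations that are harmless when applied to a real visitor. Crawlers whose user agent contains neither substring (Yahoo! Slurp, for example) would still need explicit entries.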

Read the article
Massive Crawling requests from Google Apps Engine useragent

- by SilentPlayer

Hi friends, I'm badly affected by the 'Google AppEngine-Google' user agent: I'm receiving 5 or 6 requests per second on my HTTP server. This bot is crawling my site just like Googlebot does. The following is a sample entry from my access logs:

    72.14.192.3 - - [19/May/2010:01:27:06 +0000] "GET /some-url/etc-123.htm HTTP/1.1" 200 4707 "-" "AppEngine-Google; (+http://code.google.com/appengine; appid: harpy000)"

I have checked the IP address, and it is registered to Google Inc. Can anyone tell me where I can report abuse to Google Inc., or give me any information about this issue? Thank you!
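The user agent in that log entry carries an `appid`, which identifies the App Engine application making the requests. As a rough sketch (the regular expression and the name `count_appids` are assumptions based only on the single sample line above), the access log could be tallied per appid to see which application is responsible before reporting or blocking it:

```python
import re
from collections import Counter

# Matches the appid inside an "AppEngine-Google; (...; appid: NAME)" user agent.
APPID_RE = re.compile(r"AppEngine-Google;.*?appid:\s*([^\s)]+)")

SAMPLE_LINE = (
    '72.14.192.3 - - [19/May/2010:01:27:06 +0000] '
    '"GET /some-url/etc-123.htm HTTP/1.1" 200 4707 "-" '
    '"AppEngine-Google; (+http://code.google.com/appengine; appid: harpy000)"'
)


def count_appids(log_lines):
    """Count requests per App Engine appid found in access-log lines."""
    counts = Counter()
    for line in log_lines:
        match = APPID_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Calling `count_appids(open(path))` on a real log (with `path` being wherever the server writes it) would give the per-application totals.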

Read the article
Is there a good open source search engine including indexing bot which can be used to make up speci

- by Skuta

Hello, our application (C#/.NET) needs to run a lot of search queries. Google's limit of 50,000 per day is not enough. We need something that would crawl Internet websites by specific rules we set (for example, country domains), gather URLs, texts, keywords and website names, and create our own internal catalogue, so we wouldn't be limited to any massive external search engine like Google or Yahoo. Is there any free open source solution we could install on our server? No point in re-inventing the wheel.

Read the article
Is there any reason to allow Yahoo! Slurp to crawl my site?

- by James Skemp

I thought that a year or more ago Yahoo! would start using another search engine for results and stop using its own Slurp bot. However, on a couple of the sites I manage, Yahoo! Slurp continues to crawl pages and seems to ignore the Gone status code when it is returned (as it keeps coming back). Is there any reason why I wouldn't want to block Yahoo! Slurp via robots.txt or by IP (since it tends to ignore robots.txt in some cases anyway)? I've confirmed that when the bot does hit, it is from Yahoo! IPs, so I believe this is a legitimate instance of the bot. "Is Yahoo Search the same as Bing Search now?" is a related question, but I don't think it completely answers whether one should add a new block of the bot.

Read the article
How to fix Google 404 not found Crawl Errors?

- by Freeme

I was checking Google Webmaster Tools for my blog site to see if there was any indication of why my blog traffic dropped by half in one day, and I saw 43 Not Found crawl errors and 5 Not Found errors in the Sitemap. The 5 Not Found errors in the Sitemap were the links to categories. I guess I renamed the categories, and that's why Google can't find the links. As for the 43 other Not Found errors, I see blog post titles that contain an apostrophe or a period (e.g. McDonald's, O.N.E.). They weren't found by the Google crawler. Blog posts with /CachedYou at the end and blog posts with /www.example.com attached at the end weren't found by the Google crawler either. My question is: how do I correct those Not Found errors? Thanks

Read the article
Do search engines crawl PDFs and if so are there any rules to follow when making them

- by RandomBen

The website I am working on has a few hundred PDFs in it. I don't think I have ever seen any of them come back in a search, but they are linked to directly from our site. They are also full of keywords because they are product documents. Is there anything special we need to do to get Google or other search engines to crawl them? Are there any hard and fast rules for making PDFs that help Google like them more? For instance, should I run them through Ghostscript to clean up broken PDF tags that Adobe creates during generation?

Read the article
Is this Anti-Scraping technique viable with Crawl-Delay?

- by skibulk

I want to prevent web scrapers from abusing 1,000,000 pages on my website. I'd like to do this by returning a "503 Service Unavailable" error code for users that access an abnormal number of pages per minute. I don't want search engine spiders to ever receive the error. My inclination is to set a robots.txt crawl-delay which will ensure spiders access a number of pages per minute under my 503 threshold. Is this an appropriate solution? Do all major search engines support the directive? Could it negatively affect SEO? Are there any other solutions or recommendations?
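The relationship between the two numbers in this plan can be made concrete. A crawl-delay of N seconds implies at most 60/N requests per minute from a crawler that honours it, so the 503 threshold has to sit above that figure. The sketch below is an assumption-laden illustration, not a recommended design: `max_pages_per_minute` and `SlidingWindowLimiter` are hypothetical names, the caller supplies the timestamps, and it only helps with crawlers that actually obey the directive.

```python
from collections import deque


def max_pages_per_minute(crawl_delay_seconds):
    """Upper bound on requests per minute for a crawler honouring crawl-delay."""
    return 60.0 / crawl_delay_seconds


class SlidingWindowLimiter:
    """Return 503 once a client exceeds `limit` requests within the window."""

    def __init__(self, limit, window_seconds=60.0):
        self.limit = limit
        self.window = window_seconds
        self.hits = {}  # client id -> deque of request timestamps

    def status_for(self, client_id, now):
        queue = self.hits.setdefault(client_id, deque())
        # Drop timestamps that have aged out of the window.
        while queue and now - queue[0] >= self.window:
            queue.popleft()
        # Rejected requests are recorded too, so hammering keeps a client blocked.
        queue.append(now)
        return 503 if len(queue) > self.limit else 200
```

With a crawl-delay of 10 seconds, `max_pages_per_minute(10)` is 6, so a limiter created with a `limit` comfortably above 6 should leave a compliant spider alone while still catching much faster scrapers.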

Read the article
Data extract from website URL

- by user2522395

With the script below I am able to extract all the links of a particular website, but I need to know how to extract data from those links, especially email addresses and phone numbers where they are present. Please help me modify the existing script to get that result, or provide a full sample script if you have one.

    Imports System.Collections
    Imports System.Net
    Imports System.Text.RegularExpressions

    Private Sub btnGo_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles btnGo.Click
        'url must be in this format: http://www.example.com/
        Dim aList As ArrayList = Spider("http://www.qatarliving.com", 1)
        For Each url As String In aList
            lstUrls.Items.Add(url)
        Next
    End Sub

    Private Function Spider(ByVal url As String, ByVal depth As Integer) As ArrayList
        'aReturn is used to hold the list of urls
        Dim aReturn As New ArrayList
        'aStart is used to hold the new urls to be checked
        Dim aStart As ArrayList = GrabUrls(url)
        'temp array to hold data being passed to new arrays
        Dim aTemp As ArrayList
        'aNew is used to hold new urls before being passed to aStart
        Dim aNew As New ArrayList
        'add the first batch of urls
        aReturn.AddRange(aStart)
        'if depth is 0 then only return 1 page
        If depth < 1 Then Return aReturn
        'loops through the levels of urls
        For i = 1 To depth
            'grabs the urls from each url in aStart
            For Each tUrl As String In aStart
                'grabs the urls and returns non-duplicates
                aTemp = GrabUrls(tUrl, aReturn, aNew)
                'add the urls to be checked to aNew
                aNew.AddRange(aTemp)
            Next
            'swap urls to aStart to be checked
            aStart = aNew
            'add the urls to the main list
            aReturn.AddRange(aNew)
            'clear the temp array
            aNew = New ArrayList
        Next
        Return aReturn
    End Function

    Private Overloads Function GrabUrls(ByVal url As String) As ArrayList
        'will hold the urls to be returned
        Dim aReturn As New ArrayList
        Try
            'regex string used: thanks google
            Dim strRegex As String = "<a.*?href=""(.*?)"".*?>(.*?)</a>"
            'i used a webclient to get the source
            'web requests might be faster
            Dim wc As New WebClient
            'put the source into a string
            Dim strSource As String = wc.DownloadString(url)
            Dim HrefRegex As New Regex(strRegex, RegexOptions.IgnoreCase Or RegexOptions.Compiled)
            'parse the urls from the source
            Dim HrefMatch As Match = HrefRegex.Match(strSource)
            'used later to get the base domain without subdirectories or pages
            Dim BaseUrl As New Uri(url)
            'while there are urls
            While HrefMatch.Success = True
                'loop through the matches
                Dim sUrl As String = HrefMatch.Groups(1).Value
                'if it's a page or sub directory with no base url (domain)
                If Not sUrl.Contains("http://") AndAlso Not sUrl.Contains("www") Then
                    'add the domain plus the page
                    Dim tURi As New Uri(BaseUrl, sUrl)
                    sUrl = tURi.ToString
                End If
                'if it's not already in the list then add it
                If Not aReturn.Contains(sUrl) Then aReturn.Add(sUrl)
                'go to the next url
                HrefMatch = HrefMatch.NextMatch
            End While
        Catch ex As Exception
            'catch ex here. I left it blank while debugging
        End Try
        Return aReturn
    End Function

    Private Overloads Function GrabUrls(ByVal url As String, ByRef aReturn As ArrayList, ByRef aNew As ArrayList) As ArrayList
        'overloads function to check duplicates in aNew and aReturn
        'temp url arraylist
        Dim tUrls As ArrayList = GrabUrls(url)
        'used to return the list
        Dim tReturn As New ArrayList
        'check each item to see if it exists, so not to grab the urls again
        For Each item As String In tUrls
            If Not aReturn.Contains(item) AndAlso Not aNew.Contains(item) Then
                tReturn.Add(item)
            End If
        Next
        Return tReturn
    End Function
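One possible direction for the requested extension, sketched in Python rather than the poster's VB.NET: once a page's source has been downloaded (as `strSource` is above), regular expressions can pull out strings that look like email addresses or phone numbers. Both patterns and the name `extract_contacts` are assumptions. They are heuristics, so they will miss obfuscated addresses and can match unrelated digit runs such as dates or IDs.

```python
import re

# Heuristic patterns: deliberately simple, not a full address or number grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def extract_contacts(html):
    """Return (emails, phones) found in page source, de-duplicated and sorted."""
    emails = sorted(set(EMAIL_RE.findall(html)))
    # Normalise phone numbers by keeping only digits and a leading plus sign.
    phones = sorted({re.sub(r"[^\d+]", "", p) for p in PHONE_RE.findall(html)})
    return emails, phones
```

In the spider above, the equivalent step would run once per downloaded page, with the results stored alongside the URL they came from.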

Read the article
Can sharepoint search crawl items in a hidden list?

- by Donaldinio

I have had mixed results with this. If I have an item in a hidden list, search does not seem to crawl it. But if I make the list visible and crawl, the item gets indexed, and if I hide the list again and update the item, it gets crawled again! Does anyone know whether search is supposed to be able to find items in a hidden list or not? Thanks

Read the article
Do you need to crawl the whole internet to find backlinks of a URL?

- by Luca Matteis

Say I want to retrieve all the sites on the web that have a specific link on them. For example, I want to know all the backlinks made to my blog from other websites. There are services out there that do this, such as http://www.backlinkwatch.com/index.php, and I was wondering how they achieve this functionality. Is crawling the entire internet the only option, or are there third-party ways of doing this, say using Google?

Read the article
google changing crawl speed: doesn't seem to work. Why?

- by Olivier Pons

Three days ago I changed the Google crawl rate for my website to 2 requests per second. Google Webmaster Tools showed a message confirming that the new crawl rate had been taken into account. But after more than three days nothing has changed: there is still one request every ten seconds. My web server is very fast and can handle up to twenty simultaneous connections. My website is also brand new, which means Google is almost the only one crawling it. After more than 30000 successful requests (= no 404), I think there's something going on... or maybe this is just a bug? Has anyone ever had this problem?

Read the article
Tips & Tricks: How to crawl a SSL enabled Oracle E-Business Suite

- by Rajesh Ghosh

Oracle E-Business Suite can be integrated with Oracle Secure Enterprise Search (SES) for a superior end-user experience and enhanced data retrieval capabilities. Before end users can perform search operations, data has to be crawled and indexed into the Oracle SES server. However, if the Oracle E-Business Suite instance is on SSL, some additional configuration is needed in the Oracle SES server as well as in Oracle Search Modeler before a search object can be deployed and crawled. The process involves the following steps:

Step 1: Export the SSL certificate of Oracle E-Business Suite

Access the Oracle E-Business Suite instance from a web browser. You should be able to locate a security or certificate icon somewhere in the browser toolbar or status bar, depending on which browser you are using. Click on it and you should be able to view the certificate as well as export it to a local file. While exporting, make sure that you use the “DER encoded” format.

Step 2: Import the SSL certificate into the Oracle SES server’s Java key store

Oracle SES (10.1.8.4) by default ships a JDK under $ORACLE_HOME. The Oracle SES mid-tier uses this JDK to start the OC4J container services. In this step the Oracle E-Business Suite SSL certificate that was exported in step #1 has to be imported into the Oracle SES server’s Java key store. Perform the following:

2.1 Copy the certificate file onto the server where the Oracle SES server is running, under $ORACLE_HOME/jdk/jre/lib/security/cacerts. “ORACLE_HOME” points to the Oracle SES Oracle home.
2.2 Set the JAVA_HOME environment variable to $ORACLE_HOME/jdk.
2.3 Append $JAVA_HOME/bin to the PATH environment variable.
2.4 Issue the command: “keytool -import -keystore keystore.jks -trustcacerts -alias myOHS -file ebs.crt”. Please substitute “ebs.crt” with the name of the certificate file you copied in step #2.1.
2.5 The default key-store password is “changeit”. Enter it when prompted.

If successful, this process will end with a message saying “certificate successfully imported”.

Step 3: Import the SSL certificate into the Search Modeler Java key store

Unlike Oracle SES, Search Modeler is not shipped with a bundled JDK. If you are using standalone OC4J, then you actually use an external JDK to start the OC4J container services. If you are using an IAS instance, then the JDK comes bundled with the IAS installation. Perform the following:

3.1 Copy the certificate file onto the server where the Search Modeler application is running, under $JDK_HOME/jre/lib/security/cacerts. “JDK_HOME” points to the JDK directory, depending on whether you are using an external JDK or a bundled one.
3.2 Set the JAVA_HOME environment variable to the JDK directory.
3.3 Append $JAVA_HOME/bin to the PATH environment variable.
3.4 Issue the command: “keytool -import -keystore keystore.jks -trustcacerts -alias myOHS -file ebs.crt”. Please substitute “ebs.crt” with the name of the certificate file you copied in step #3.1.
3.5 The default key-store password is “changeit”. Enter it when prompted.

If successful, this process will end with a message saying “certificate successfully imported”.

Once you have completed the above steps successfully, you can deploy the search objects using Search Modeler and then start crawling them as well.

Read the article
How do I get Google to crawl my content when it's only displayed when you fill in a form?

- by Sarang Patil

I have a webpage with a form and a "results" section that is initially blank. When the user searches for items, a list pops up; he/she chooses one option from the list, and the corresponding results are displayed in the results section. I once decided to log the IP, URL and time of every visit to my page. One IP was 66.249.73.26, and a Google search told me that it is an IP of Googlebot. When I looked at the links this IP had visited, they looked like this: search?id=100, search?id=110, ..., search?id=200, ... and afterwards the id incremented in steps of 1, like 400, 401, ... But people search for strings, not numbers. Because Googlebot searches for numbers like this, I think the corresponding content is never displayed, and so my page content is never indexed even though it is rich. So my question is: in order to show Googlebot all the content the webpage has, should I list all the results on the index page and ask users to enter a string to filter the results?

Read the article
Does Nutch automatically crawl my site when new pages are added?

- by murali

Does Nutch crawl automatically when I add new pages to the website?

Read the article
Can I use a Google Appliance/Mini to crawl and index sites I don't own?

- by SkippyFire

Maybe this is a stupid question, but... I am working with a company that said they needed to get "permission" to crawl other people's sites. They have a Google Search Appliance and some Google Minis and want to point them at other sites to aggregate content. The end result will be something like a targeted search engine. (All the indexed sites relate to a specific topic.) The only things they will be doing are: indexing content from the other sites/domains; providing search functionality on their own site that searches the indexed content (like Google, displaying summaries and not the full content); and providing links in the search results back to the original content. Their intent is not malicious in nature; it is to provide a single site/resource for people to reference on their given topic. Is there anything illegal or fishy about this process?

Read the article
How to make googlebot to crawl a page? [closed]

- by mamadum

What is the best way of forcing Googlebot to crawl a page? I've set up Google Analytics and registered the site with Google Webmaster Tools. I've done a great deal of SEO work on the website with keywords and titles, and I took care of microdata. I submitted the site anonymously, and a couple of days ago I successfully fetched the site and submitted it for indexing, and still nothing. The last time Googlebot visited the site was almost 1 month ago, and the indexed content is now obsolete. Am I missing something, or is it just a slow process?

Read the article
Why does my Internet slow to a crawl unless I reboot my router every few days?

- by Lord Torgamus

A few weeks ago, I noticed that my Internet connection had slowed down to a crawl. I waited a few days hoping it would go away on its own, but it didn't get better. So I asked this question about how to make it faster. The problem went away after I updated to the latest firmware, so I didn't follow up too carefully. But every few days since then, my Internet has slowed down again. Unlike before, all I have to do to fix it is open the router administration page and press the "Reboot" button. Nothing else seems to work, though I'm sure there are options I haven't tried. If it makes a difference, my girlfriend and I both transfer large amounts of data fairly routinely for school (videoconferencing, downloading entire recorded lectures). The router is a Cisco/Linksys 160N V3 that's about a year old. Most of the time, it deals with just two standard Windows 7 laptops. The only thing I came across while searching for answers/dupes was this question, which seems similar superficially, but probably doesn't have the same root issue. Anyways, it's not resolved. What could be causing these slowdowns, and how can I get rid of them?

Read the article
Use SharePoint Search to crawl Project Server project metadata?

- by Kit Menke

Our environment consists of Project Server 2007 and MOSS 2007. We have around 750 projects and lots of "Enterprise Custom Fields" set up to track all of the metadata associated with a project. Our main requirement is to be able to search/filter/group/sort all of these projects by metadata in SharePoint. Our current process involves syncing this custom metadata into a SharePoint list (which requires a LOT of maintenance). Question: Is it possible to leverage SharePoint search to crawl/index these metadata fields in Project Server? How would I go about setting this up?

Read the article
sharepoint search is not working

- by Nikkho

Hi all, I have an issue with SharePoint search.

The situation

SharePoint is installed on a farm with 2 servers. A new app pool was created, and that app pool uses a domain account called moss_service. moss_service is in the administrators group on both servers and is also db_creator on the content database. When I first checked, the search's default content access account was a different account; I changed it to moss_service. I didn't do an IIS reset because this is a production server and they don't want frequent IIS resets. Strangely, in services.msc the "Office SharePoint Server Search" service was still using the old account (and apparently it's only running on one server; it is not running on the other). I then changed that to domain\moss_service with the password and reran the crawl.

How do I diagnose the issue

Basically, every time I change something I restart the crawl and then check the Event Viewer. Multiple things come up, but the following are the major ones:

The start address cannot be crawled. The password for the content access account cannot be decrypted because it was stored with different credentials. Re-type the password for the account used to crawl this content. (0x80042406)

Performance monitoring cannot be initialized for the gatherer object, because the counters are not loaded or the shared memory object cannot be opened. This only affects availability of the perfmon counters. Restart the computer.

Access is denied. Check that the Default Content Access Account has access to this content, or add a crawl rule to crawl this content. (0x80041205)

Crawl Logs Result

The crawl log is showing this: The password for the content access account cannot be decrypted because it was stored with different credentials. Re-type the password for the account used to crawl this content.

I tried changing it again in services.msc and reran the full crawl, but it still doesn't work. I have tried entering the account in the following ways: [email protected] and domain\moss_service

My questions are:

How do I fix this? Is this the right way to set up the search? Does the search account have to be a different domain account? It seems like one fix complicates the other; how do I set this right? Is it worth upgrading to SP2?

Read the article
Where's my memory?! Nginx + PHP-FPM front end webserver slows to a crawl...

- by incredimike

I'm not sure if I have a problem with a memory leak (as my hosting company suggests), or if we both need to read http://linuxatemyram.com. Maybe you clever people can help us out? This is a front-end webserver VM running essentially only nginx & php-fpm on RHEL 5.5. This server is powering Magento, a PHP eCommerce thingy. The server is running in a shared environment, but we're changing that soon. Anyway, after a reboot the server runs just fine, but within a day it will grind itself into nothingness. Pages will take literally 2 minutes to load, CPU spikes like crazy, etc. The console is even sluggish when I SSH in. It's like my whole server is being brought to its knees. I've also been monitoring the DB server via top and tcpdumping incoming traffic. The DB stays idle for a good portion of that "slow" load time. When I start seeing queries coming from the front-end server, the page loads soon afterward. Here are some stats from logging in during a slow-down, after restarting php-fpm:

    [mike@front01 ~]$ free -m
                 total       used       free     shared    buffers     cached
    Mem:          5963       5217        745          0        192        314
    -/+ buffers/cache:       4711       1252
    Swap:         4047          4       4042

    [mike@front01 ~]$ top
    top - 11:38:55 up 2 days,  1:01,  3 users,  load average: 0.06, 0.17, 0.21
    Tasks: 131 total,   1 running, 130 sleeping,   0 stopped,   0 zombie
    Cpu0  :  0.0%us,  0.3%sy,  0.0%ni, 99.3%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu1  :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:   6106800k total,  5361288k used,   745512k free,   199960k buffers
    Swap:  4144728k total,     4976k used,  4139752k free,   328480k cached

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    31806 apache    15   0  601m 120m  37m S  0.0  2.0   0:22.23 php-fpm
    31805 apache    15   0  549m  66m  31m S  0.0  1.1   0:14.54 php-fpm
    31809 apache    16   0  547m  65m  32m S  0.0  1.1   0:12.84 php-fpm
    32285 apache    15   0  546m  63m  33m S  0.0  1.1   0:09.22 php-fpm
    32373 apache    15   0  546m  62m  32m S  0.0  1.1   0:09.66 php-fpm
    31808 apache    16   0  543m  60m  35m S  0.0  1.0   0:18.93 php-fpm
    31807 apache    16   0  533m  49m  30m S  0.0  0.8   0:08.93 php-fpm
    32092 apache    15   0  535m  48m  27m S  0.0  0.8   0:06.67 php-fpm
     4392 root      18   0  194m  10m 7184 S  0.0  0.2   0:06.96 cvd
     4064 root      15   0  154m 8304 4220 S  0.0  0.1   3:55.57 snmpd
     4394 root      15   0  119m 5660 2944 S  0.0  0.1   0:02.84 EvMgrC
    31804 root      15   0  519m 5180  932 S  0.0  0.1   0:00.46 php-fpm
     4138 ntp       15   0 23396 5032 3904 S  0.0  0.1   0:02.38 ntpd
      643 nginx     15   0 95276 4408 1524 S  0.0  0.1   0:01.15 nginx
     5131 root      16   0 90128 3340 2600 S  0.0  0.1   0:01.41 sshd
    28467 root      15   0 90128 3340 2600 S  0.0  0.1   0:00.35 sshd
    32602 root      16   0 90128 3332 2600 S  0.0  0.1   0:00.36 sshd
     1614 root      16   0 90128 3308 2588 S  0.0  0.1   0:00.02 sshd
     2817 root       5 -10  7216 3140 1724 S  0.0  0.1   0:03.80 iscsid
     4161 root      15   0 66948 2340  800 S  0.0  0.0   0:10.35 sendmail
     1617 nicole    17   0 53876 2000 1516 S  0.0  0.0   0:00.02 sftp-server
    ...

Is there anything else I should be looking at, or any more information that might be useful? I'm just a developer, but the slowdowns on this system worry me and make it hard to do my work. Help me out, ServerFault!
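The arithmetic behind the linuxatemyram.com question can be checked directly against the numbers posted above. The sketch below only restates the `free -m` and `top` figures; the function name is hypothetical, and the worker total is an upper-bound style estimate because RES counts shared pages once per process and the process list is truncated.

```python
def used_excluding_cache(used_mb, buffers_mb, cached_mb):
    """Memory held by processes, i.e. 'used' minus reclaimable buffers and cache."""
    return used_mb - buffers_mb - cached_mb


# Values copied from the Mem: row of the free -m output above (in MB).
app_used_mb = used_excluding_cache(used_mb=5217, buffers_mb=192, cached_mb=314)

# RES column (MB) of the eight php-fpm workers visible in the top listing.
worker_res_mb = [120, 66, 65, 63, 62, 60, 49, 48]
workers_total_mb = sum(worker_res_mb)
```

Here `app_used_mb` is 4711, which matches the "-/+ buffers/cache" row, so buffers and cache explain only about 500 MB of the 5217 MB in use. The visible php-fpm workers add up to 533 MB, which suggests the remaining memory is held somewhere the truncated listing does not show.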

Read the article
The Sitemap Paradox

- by Jeff Atwood

We use a sitemap on Stack Overflow, but I have mixed feelings about it.

Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata. Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site.

Based on our two years' experience with sitemaps, there's something fundamentally paradoxical about the sitemap: Sitemaps are intended for sites that are hard to crawl properly. If Google can't successfully crawl your site to find a link, but is able to find it in the sitemap, it gives the sitemap link no weight and will not index it! That's the sitemap paradox -- if your site isn't being properly crawled (for whatever reason), using a sitemap will not help you!

Google goes out of their way to make no sitemap guarantees:

"We cannot make any predictions or guarantees about when or if your URLs will be crawled or added to our index"

"We don't guarantee that we'll crawl or index all of your URLs. For example, we won't crawl or index image URLs contained in your Sitemap."

"submitting a Sitemap doesn't guarantee that all pages of your site will be crawled or included in our search results"

Given that links found in sitemaps are merely recommendations, whereas links found on your own website proper are considered canonical ... it seems the only logical thing to do is avoid having a sitemap and make damn sure that Google and any other search engine can properly spider your site using the plain old standard web pages everyone else sees. By the time you have done that, and are getting spidered nice and thoroughly so Google can see that your own site links to these pages, and would be willing to crawl the links -- uh, why do we need a sitemap, again?
The sitemap can be actively harmful, because it distracts you from ensuring that search engine spiders are able to successfully crawl your whole site. "Oh, it doesn't matter if the crawler can see it, we'll just slap those links in the sitemap!" Reality is quite the opposite in our experience. That seems more than a little ironic considering sitemaps were intended for sites that have a very deep collection of links or complex UI that may be hard to spider. In our experience, the sitemap does not help, because if Google can't find the link on your site proper, it won't index it from the sitemap anyway. We've seen this proven time and time again with Stack Overflow questions. Am I wrong? Do sitemaps make sense, and we're somehow just using them incorrectly?

Read the article
Googlebot fetches my pages very frequent, rel-nofollow, meta-noindex or robots.txt-disallow

- by trante

Googlebot fetches pages on my site very frequently, and this slows down my website. I don't want Googlebot to crawl so often. I have already decreased the crawl rate in Google Webmaster Tools, but I'm also considering these three methods: (1) adding rel="nofollow" to links to my inner pages, so Googlebot won't crawl and index them; (2) adding the meta tag "noindex", so Google will remove the page from its index and won't fetch it again; (3) adding Disallow: /mySomeFolder/ to robots.txt, so Googlebot won't crawl those pages. I'm planning to use these methods for my 56,000 pages, except the 6-7 most important ones. Which method would you prefer, and what would be the advantages or disadvantages? Or won't it change my website speed at all?
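Of the three methods, the robots.txt rule is the one that directly tells a compliant crawler not to fetch a path; a "noindex" meta tag, by contrast, can only be seen after the page has been fetched. The effect of the proposed rule can be checked locally with Python's standard `urllib.robotparser`. This is a sketch under assumptions: the robots.txt content and the `www.example.com` URLs are illustrative, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt using the folder name from the question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /mySomeFolder/
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, `is_allowed("Googlebot", "http://www.example.com/mySomeFolder/page.html")` is False, while the same call for `http://www.example.com/index.html` is True. Pages that should stay crawlable need to live outside the disallowed folder.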

Read the article
