Search Results

Search found 241 results on 10 pages for 'crawling'.

Page 2/10 | < Previous Page | 1 2 3 4 5 6 7 8 9 10  | Next Page >

  • SharePoint Server Search Not Crawling

    - by tekiegreg
    Hi there, we recently moved some sites into a new farm. Everything seems fine, but for reasons I can't identify the search is not crawling the migrated content. We're getting this message in our crawl log for every document:
        http://xxx/sites/...announcements  The object was not found. (The item was deleted because it was either not found or the crawler was denied access to it.)
    Of course the first thing I suspected was the crawler access account, so I logged into SharePoint with that account and was able to access the URL just fine. I tried upping permissions (even all the way up to Admin), but to no avail. Thoughts?

    Read the article

  • Massive Crawling requests from Google App Engine user agent

    - by SilentPlayer
    Hi friends, I'm being hit hard by the 'AppEngine-Google' user agent, receiving 5-6 requests per second on my HTTP server. This bot is crawling my site just like Googlebot does. The following is a sample line from my access logs:
        72.14.192.3 - - [19/May/2010:01:27:06 +0000] "GET /some-url/etc-123.htm HTTP/1.1" 200 4707 "-" "AppEngine-Google; (+http://code.google.com/appengine; appid: harpy000)"
    I have checked the IP address; it is registered to Google Inc. Can anyone tell me where I can report abuse to Google Inc., or share any information about this issue? Thank you!
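
    A quick way to quantify this kind of traffic is to tally requests per user agent straight from the access log. A minimal sketch in Python, assuming an Apache combined log format and a log file named access.log (both assumptions; adjust to the actual setup):

        import re
        from collections import Counter

        # Tally requests per user agent from an Apache combined-format log.
        # The log path and format are assumptions; adjust to your setup.
        LOG_PATH = "access.log"
        line_re = re.compile(
            r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" '
            r'\d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
        )

        counts = Counter()
        with open(LOG_PATH) as f:
            for line in f:
                m = line_re.match(line)
                if m:
                    counts[m.group("ua")] += 1

        for ua, n in counts.most_common(10):
            print(n, ua)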

    Read the article

  • Ping and crawling not working, site still resolving

    - by Andrew Alexander
    OK, so we're trying to figure out why the site of one of our clients isn't being crawled by Google (we've ruled out robots.txt and meta tags). When we go to the site, by either IP address or domain name, the site resolves and everything works. However, Google is getting a 302 redirect (which it apparently isn't following for crawling), and when we ping the address, it times out (note: the site is still resolving in the browser throughout all of this). The site is built in ASP.NET (I assume C#), so my thought was that it was an errant redirect rule, or some other sort of server-side issue. We also thought it might be due to incorrect domain pointing (but pinging the IP doesn't work either, so that sort of rules that out). We're really not sure what is causing all of these errors, or even whether they have one single source. Does anyone have any ideas what could be going on? Do you need any more information? To boil it down in a TL;DR:
    * Site resolves in the browser, by both IP and domain name. No problems here.
    * Site not being crawled by Google (it gets a 302 it doesn't seem to follow); this is not due to robots.txt or meta tags.
    * Ping is not working for the IP address. This is very odd, because again, the IP address seems to work fine in the browser.
    * Our thoughts: a redirect rule issue, a domain pointing issue, or possibly some errant code - or some combination of the three.
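
    A quick way to see exactly what a crawler is being served is to request the page with a Googlebot-style user agent and redirects disabled, then look at the status code and Location header. A minimal sketch using only the Python standard library (the URL is a placeholder for the client's site):

        import urllib.error
        import urllib.request

        # Fetch a URL without following redirects and print the status code and
        # Location header, roughly as a crawler would see them.
        # "http://example.com/" is a placeholder for the client's site.
        class NoRedirect(urllib.request.HTTPRedirectHandler):
            def redirect_request(self, req, fp, code, msg, headers, newurl):
                return None  # refuse to follow redirects

        opener = urllib.request.build_opener(NoRedirect)
        req = urllib.request.Request(
            "http://example.com/",
            headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
        )
        try:
            resp = opener.open(req, timeout=10)
            print(resp.status, resp.headers.get("Location"))
        except urllib.error.HTTPError as e:
            # 3xx (and 4xx/5xx) responses surface here because redirects are refused
            print(e.code, e.headers.get("Location"))

    Comparing the output for a browser-like user agent against a Googlebot-like one would show whether the 302 depends on the user agent.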

    Read the article

  • Need to sanity-check my .htaccess, especially Limit GET POST line for Google repellent

    - by jose
    I need a sanity check on this .htaccess (from a WordPress site) I inherited from a 5 month+ old site. What's the symptom? Google + Bing crawl, but don't index any of the pages. Let me be clear: I'm not mad about "not ranking high." I think something is (accidentally) rejecting search engine indexing. I am not an expert on .htaccess, but one part especially looked funny, the Limit GET POST line. Is it not weird to have both Allow and Deny all, with no parameters? Also, I've ruled out robots.txt, but if I were you I'd want to see it, so here it is:
        User-agent: *
        Crawl-delay: 30
    And here's the more suspect .htaccess:
        # temp redirect wordpress content feeds to feedburner
        <IfModule mod_rewrite.c>
        RewriteEngine on
        RewriteCond %{HTTP_USER_AGENT} !FeedBurner [NC]
        RewriteCond %{HTTP_USER_AGENT} !FeedValidator [NC]
        RewriteRule ^feed/?([_0-9a-z-]+)?/?$ http://feeds.feedburner.com/anonymousblog [R=302,NC,L]
        </IfModule>
        # temp redirect wordpress comment feeds to feedburner
        <IfModule mod_rewrite.c>
        RewriteEngine on
        RewriteCond %{HTTP_USER_AGENT} !FeedBurner [NC]
        RewriteCond %{HTTP_USER_AGENT} !FeedValidator [NC]
        RewriteRule ^comments/feed/?([_0-9a-z-]+)?/?$ http://feeds.feedburner.com/anonymous_comments [R=302,NC,L]
        </IfModule>
        <IfModule mod_rewrite.c>
        RewriteEngine On
        RewriteBase /
        RewriteCond %{REQUEST_FILENAME} !-f
        RewriteCond %{REQUEST_FILENAME} !-d
        RewriteRule . /index.php [L]
        </IfModule>
        IndexIgnore .htaccess */.??* *~ *# */HEADER* */README* */_vti*
        <Limit GET POST>
        order deny,allow
        deny from all
        allow from all
        </Limit>
        <Limit PUT DELETE>
        order deny,allow
        deny from all
        </Limit>
        php_value memory_limit 32M
    Adding header by request:
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
        <meta name="robots" content="noindex,nofollow" />
        <meta name="description" content="buncha junk i've deleted." />
        <meta name="keywords" content="keywords i've deleted" />
        <meta name="viewport" content="width=device-width" />
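
    One check that pairs well with this kind of audit is to look at the robots directives a crawler actually receives: both the X-Robots-Tag response header and any robots meta tag in the returned HTML. A minimal sketch in Python (standard library only; the URL is a placeholder):

        import re
        import urllib.request

        # Print the robots directives a crawler would see: the X-Robots-Tag
        # header plus any <meta name="robots"> tags in the HTML.
        # "http://example.com/" is a placeholder for the site in question.
        req = urllib.request.Request(
            "http://example.com/", headers={"User-Agent": "Mozilla/5.0"}
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            print("X-Robots-Tag:", resp.headers.get("X-Robots-Tag"))
            html = resp.read().decode("utf-8", errors="replace")

        for m in re.finditer(r'<meta[^>]+name=["\']robots["\'][^>]*>', html, re.I):
            print(m.group(0))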

    Read the article

  • Ajax site not being crawled - have escaped fragment, what's wrong? [closed]

    - by Harry
    My site is anonkun.com. You can see that it's "ajax" and doesn't load much HTML. Here are some example pages:
        http://anonkun.com
        http://anonkun.com/?_escaped_fragment_=
        http://anonkun.com/stories/Dev-kun---FAQ/6ef881f8-cf48-4f87-a688-c585f23809c5
        http://anonkun.com/stories/Dev-kun---FAQ/6ef881f8-cf48-4f87-a688-c585f23809c5?_escaped_fragment_=
    As you can see, the original page has the meta fragment tag and the escaped-fragment version loads static HTML. Why am I not getting crawled? http://cl.ly/image/2n30212q0K2W Webmaster Tools shows that pages are being seen as duplicates, and Fetch as Google shows me the Ajax version of the source, not the static escaped-fragment version. What's wrong and how do I make this work? Thanks.
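
    One way to confirm what a crawler receives from each version is to fetch the hash-bang page and its _escaped_fragment_ counterpart and compare them. A minimal sketch in Python, assuming the pages listed above are still reachable:

        import urllib.request

        # Fetch one of the pages above and its _escaped_fragment_ counterpart
        # and compare what a crawler would receive from each.
        base = "http://anonkun.com/stories/Dev-kun---FAQ/6ef881f8-cf48-4f87-a688-c585f23809c5"

        def fetch(url):
            req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.read().decode("utf-8", errors="replace")

        ajax_html = fetch(base)
        static_html = fetch(base + "?_escaped_fragment_=")
        print("ajax length:", len(ajax_html), "snapshot length:", len(static_html))
        # Naive string check; the tag may be quoted or ordered differently on the real page.
        print("meta fragment present:", '<meta name="fragment" content="!"' in ajax_html)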

    Read the article

  • How to interpret number of URL errors in Google webmaster tools

    - by user359650
    Google recently made some changes to Webmaster Tools, which are explained here: http://googlewebmastercentral.blogspot.com/2012/03/crawl-errors-next-generation.html One thing I could not find out is how to interpret the number of errors over time. At the end of February we migrated our website and didn't implement redirect rules for some pages (quite a few, actually). Here is what we're getting from the Crawl errors report. What I don't know is whether the number of errors is cumulative over time (i.e. if Google's bots crawl your website on 2 different days and find 1 separate issue on each day, whether they will report 1 error for each day, or 1 for the 1st and 2 for the 2nd). Based on the Crawl stats we can see that the number of requests made by Google's bots doesn't increase. Therefore I believe the number of errors reported is cumulative, and that an error detected on one day is reported on subsequent days until the underlying problem is fixed and the page is crawled again (or you manually mark the error as fixed), because if Google doesn't make more requests to a website, there is no way it can check new pages and old pages at the same time. Q: Am I interpreting the number of errors correctly?

    Read the article

  • Extremely large spike in traffic on the 1st - 4th of every month from mobile browsers

    - by wsanville
    I've noticed that on the 1st - 4th of the recent months (since January), several sites I maintain are getting thousands of requests from mobile browsers, whereas throughout the rest of the month, the numbers are in the single or double digits. Has anybody else noticed this sort of behavior? I don't have the exact user agents logged, but my analysis software (WebTrends) reports the traffic as mostly iPhone/iPad/iPod, Android, and Blackberry.

    Read the article

  • adding tagged / dynamic pages to the sitemap

    - by sam
    I've got a blog that's been running for about a year. I've made about 200 posts, and there should be about 220 pages to index (additional pages for about / contact etc.). When I crawl the site I get 1900 pages, because of all the pages generated from the tags I've used in my posts; about 70% of these pages contain only one blog post. When submitting my sitemap to Google, should I exclude all pages with /tagged/ in the URL so I'm only submitting unique pages, or should I submit the full sitemap?
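
    Whatever the answer, excluding the tag archives from the sitemap is just a filter over the crawled URL list. A minimal sketch in Python, assuming the crawler's output is a plain list of URLs in crawled_urls.txt (both the filename and the format are assumptions):

        from xml.sax.saxutils import escape

        # Build a sitemap from a list of crawled URLs, skipping the thin
        # /tagged/ archive pages. Input file name and format are assumptions.
        with open("crawled_urls.txt") as f:
            urls = [line.strip() for line in f if line.strip()]

        unique_pages = [u for u in urls if "/tagged/" not in u]

        with open("sitemap.xml", "w") as out:
            out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for u in unique_pages:
                out.write("  <url><loc>%s</loc></url>\n" % escape(u))
            out.write("</urlset>\n")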

    Read the article

  • What measures can be taken to increase Google indexing speed for a given newly created page?

    - by knorv
    Consider a website with a large number of pages. New pages are published regularly. When publishing a new page, the website operator wants to get the newly created page indexed in Google as soon as possible, minimizing the time between publication and indexing. Consider the site http://www.example.com/ with hundreds of thousands of pages. The page http://www.example.com/something/important-page.html is created at, say, 12:00. How do I get important-page.html indexed as soon as possible after 12:00? Ideally within seconds or minutes. Or more generally: what options are available to try to get Google to index a specific newly created page as soon as possible?
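
    At the time this was asked, the usual levers were an up-to-date XML sitemap plus Google's sitemap ping endpoint (which has since been retired). A minimal sketch of the ping step in Python, assuming sitemap.xml has already been regenerated to include the new page:

        import urllib.parse
        import urllib.request

        # Ping Google with the sitemap URL after publishing a new page.
        # Google documented this endpoint at the time; it has since been
        # retired, so treat this as a historical illustration.
        SITEMAP_URL = "http://www.example.com/sitemap.xml"
        ping = "http://www.google.com/ping?sitemap=" + urllib.parse.quote(SITEMAP_URL, safe="")
        with urllib.request.urlopen(ping, timeout=10) as resp:
            print(resp.status)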

    Read the article

  • why isn't my website ranked by Alexa? [closed]

    - by arshen
    I created a website with WordPress and posted 10+ articles over a period of two months, but Alexa doesn't rank my website. I tried changing my theme, URL and other related things, and submitted my website URL manually to the Alexa dashboard. I get about 200 page views a day, but the site is still not ranked.
        My website URL: http://daskaht.ir
        Robots file: http://daskhat.ir/robots.txt
        Alexa page: www.alexa.com/siteinfo/daskhat.ir
        Domain whois: whois.domaintools.com/daskhat.ir
        Website SEO rank: www.woorank.com/en/www/daskhat.ir

    Read the article

  • How to crawl a web page with dynamic content added by JavaScript

    - by blunderboy
    I gather there is news that Google's bots can now understand our JavaScript code, which means it is possible to fully crawl a web page that has lazy loading enabled. I am using Apache Nutch to crawl websites, but I don't think it has the capability to fetch the URLs that JavaScript injects into the HTML page when the page is scrolled down. I see a lot of websites doing lazy loading for performance reasons. So can somebody please explain how I can crawl the data that appears in the HTML page on lazy load (i.e. on scrolling the page down)?
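
    One common workaround when the crawler cannot execute JavaScript is to open the page in a browser, watch the network tab while scrolling, and have the crawler call the XHR/JSON endpoint the page itself uses, paging through it directly. A minimal sketch of that idea in Python; the endpoint path, the "page" parameter, and the response shape are all hypothetical:

        import json
        import urllib.request

        # Instead of rendering JavaScript, call the JSON endpoint the page
        # requests while scrolling. The endpoint and parameter are hypothetical;
        # find the real ones in the browser's network tab.
        BASE = "http://example.com/api/items?page=%d"

        page = 1
        while True:
            with urllib.request.urlopen(BASE % page, timeout=10) as resp:
                items = json.loads(resp.read().decode("utf-8"))
            if not items:
                break  # no more lazy-loaded content
            for item in items:
                print(item.get("url"))
            page += 1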

    Read the article

  • Is this Anti-Scraping technique viable with Crawl-Delay?

    - by skibulk
    I want to prevent web scrapers from abusing the 1,000,000 pages on my website. I'd like to do this by returning a "503 Service Unavailable" error code to users that access an abnormal number of pages per minute. I don't want search engine spiders to ever receive the error. My inclination is to set a robots.txt Crawl-delay which will ensure spiders access a number of pages per minute below my 503 threshold. Is this an appropriate solution? Do all major search engines support the directive? Could it negatively affect SEO? Are there any other solutions or recommendations?
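
    For comparison, the server-side half of the idea is a per-client sliding-window counter that starts returning 503 above a threshold while exempting declared search-engine user agents. A minimal sketch in Python; the window, threshold, and user-agent whitelist are assumptions, and a real deployment should verify spiders (e.g. by reverse DNS) since the User-Agent header can be spoofed:

        import time
        from collections import defaultdict, deque

        # Per-IP sliding-window rate limit that exempts declared spiders.
        WINDOW_SECONDS = 60
        MAX_REQUESTS = 120          # assumption: tune to sit below the 503 threshold
        SPIDER_TOKENS = ("Googlebot", "bingbot", "Slurp")

        hits = defaultdict(deque)   # ip -> timestamps of recent requests

        def should_throttle(ip, user_agent):
            """Return True if this request should get a 503."""
            if any(token in user_agent for token in SPIDER_TOKENS):
                return False
            now = time.time()
            q = hits[ip]
            q.append(now)
            while q and q[0] < now - WINDOW_SECONDS:
                q.popleft()
            return len(q) > MAX_REQUESTS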

    Read the article

  • Will Google "forget" unlinked pages?

    - by Mystere Man
    If I remove all links to a page, but do not delete the page from the site (nor block it from being requested), will Google eventually "forget" about it when it reindexes the site? Assume, of course, that there are no other links to the page somewhere else externally. Or will Google continue to request the page, verify it in the index, and keep it around as long as it returns a valid page? Is this similar for Bing et al.?

    Read the article

  • What measures can be taken to make sure Google is aware of the existence of a newly created page?

    - by knorv
    Consider a website with a large number of pages. New pages are published regularly. When publishing a new page, the website operator wants to get the newly created page indexed in Google as soon as possible, minimizing the time between publication and indexing. Consider the site http://www.example.com/ with hundreds of thousands of pages. The page http://www.example.com/something/important-page.html is created at, say, 12:00. I want to get important-page.html indexed as soon as possible after 12:00, ideally within seconds or minutes. What options are available to try to get Google to index a specific newly created page as soon as possible?

    Read the article

  • Using Ajax #! for Google but site is not being crawled any more

    - by user28231
    We used to have all links in the normal way (that is, with query parameters), but we changed to Ajax loading because we installed a music player. Here's an example:
        Old links: http://stereofox.com/post.php?idPost=5326
        New links: http://stereofox.com/post.php#!idPost=5326
    You can get the snapshot by adding _escaped_fragment_= after the ?. All of these have the same content, and this image shows what has happened to the site since we changed the linking system.

    Read the article

  • search engine crawling frequency

    - by Aditya Pratap Singh
    I want to design a search engine for news websites, i.e. download article pages from these websites, index the pages, and answer search queries over the index. I want short pseudocode to find an appropriate crawling frequency: I do not want to crawl too often, because the website may not have changed, and I do not want to crawl too infrequently, because the index would then be out of date. Assume that the crawling code looks as follows:
        while(1) {
            sleep(sleep_interval);  // sleep for sleep_interval
            crawl(website);         // crawls the entire website
            // returns a % value of difference between the latest and previous crawls of the website
            diff = diff(currently_crawled_website, previously_crawled_website);
            sleep_interval = infer_sleep_interval(diff, sleep_interval);
        }
    I am looking for pseudocode for the infer_sleep_interval method:
        long infer_sleep_interval(int diff_percentage, long previous_sleep_interval) {
            ...
        }
    I want to design a method that adaptively alters the sleep interval based on the update frequency of the website.
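
    One plausible shape for infer_sleep_interval is multiplicative: shrink the interval when the last crawl found a lot of change, grow it when almost nothing changed, and clamp it to sane bounds. A sketch of that idea in Python; the thresholds, factors, and bounds are all assumptions to be tuned per site:

        # Adaptive re-crawl interval (AIMD-style sketch). All constants are
        # assumptions; tune them per site.
        MIN_INTERVAL = 15 * 60          # never crawl more often than every 15 minutes
        MAX_INTERVAL = 7 * 24 * 3600    # never wait longer than a week

        def infer_sleep_interval(diff_percentage, previous_sleep_interval):
            if diff_percentage > 20:
                # the site changed a lot: crawl twice as often
                interval = previous_sleep_interval / 2
            elif diff_percentage < 1:
                # essentially unchanged: back off
                interval = previous_sleep_interval * 2
            else:
                # moderate change: nudge toward roughly 10% change per crawl
                interval = previous_sleep_interval * (10.0 / diff_percentage)
            return int(min(MAX_INTERVAL, max(MIN_INTERVAL, interval)))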

    Read the article

  • guide on crawling the entire web?

    - by bohohasdhfasdf
    I just had this thought and was wondering if it's possible to crawl the entire web (just like the big boys!) on a single dedicated server (say a Core2Duo, 8 GB RAM, 750 GB disk, 100 Mbps). I've come across a paper where this was done, but I cannot recall its title; it was about crawling the entire web on a single dedicated server using some statistical model. Anyway, imagine starting with just around 10,000 seed URLs and doing an exhaustive crawl: is it possible? I need to crawl the web but am limited to a dedicated server. How can I do this? Is there an open source solution out there already? For example, see this real-time search engine: http://crawlrapidshare.com The results are extremely good and freshly updated - how are they doing this?
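
    Whatever the hardware answer turns out to be, the core loop such a crawler is built around is small: a frontier queue seeded with the start URLs, a visited set, and link extraction. A minimal sketch in Python; politeness delays, robots.txt, content deduplication, and persistence are omitted, and those are exactly the parts that make web-scale crawling hard:

        import urllib.parse
        import urllib.request
        from collections import deque
        from html.parser import HTMLParser

        class LinkParser(HTMLParser):
            """Collect absolute link targets from <a href> tags."""
            def __init__(self, base):
                super().__init__()
                self.base, self.links = base, []

            def handle_starttag(self, tag, attrs):
                if tag == "a":
                    for name, value in attrs:
                        if name == "href" and value:
                            self.links.append(urllib.parse.urljoin(self.base, value))

        def crawl(seeds, limit=1000):
            frontier, seen = deque(seeds), set(seeds)
            while frontier and len(seen) < limit:
                url = frontier.popleft()
                try:
                    with urllib.request.urlopen(url, timeout=10) as resp:
                        html = resp.read().decode("utf-8", errors="replace")
                except Exception:
                    continue  # dead links and unreadable responses are skipped
                parser = LinkParser(url)
                parser.feed(html)
                for link in parser.links:
                    if link.startswith("http") and link not in seen:
                        seen.add(link)
                        frontier.append(link)
            return seen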

    Read the article

  • wget crawling search results of news website

    - by kiltek
    I am trying to crawl the search results of a news website using wget. The name of the website is www.voanews.com. After typing in my search keyword and clicking search, it proceeds to the results. Then I can specify a "to" and a "from" date and hit search again. After this the URL becomes:
        http://www.voanews.com/search/?st=article&k=mykeyword&df=10%2F01%2F2013&dt=09%2F20%2F2013&ob=dt#article
    and the actual content of the results is what I want to download. To achieve this I created the following wget command:
        wget --reject=js,txt,gif,jpeg,jpg \
             --accept=html \
             --user-agent=My-Browser \
             --recursive --level=2 \
             www.voanews.com/search/?st=article&k=germany&df=08%2F21%2F2013&dt=09%2F20%2F2013&ob=dt#article
    Unfortunately, the crawler doesn't download the search results. It only gets into the upper link bar, which contains the "Home, USA, Africa, Asia, ..." links, and saves the articles they link to. It seems like the crawler doesn't check the search result links at all. What am I doing wrong, and how can I modify the wget command to download only the links in the search results list (and of course the sites they link to)?

    Read the article

  • Crawling engine architecture - Java/ Perl integration

    - by Bigtwinz
    Hi all, I am looking to develop a management and administration solution around our web-crawling Perl scripts. Right now our scripts are saved in SVN and are manually kicked off by SysAdmins/devs etc. Every time we need to retrieve data from new sources we have to create a ticket with business instructions and goals. As you can imagine, not an optimal solution. There are 3 consistent themes with this system:
    * the retrieval of data has a "conceptual structure", for lack of a better phrase, i.e. the retrieval of information follows a particular path
    * we are only looking for very specific information, so we don't really have to worry about extensive crawling for a while (think thousands to tens of thousands of pages vs. millions)
    * crawls are URL-based instead of site-based.
    As I enhance this alpha version to a more production-level beta, I am looking to add automation and management of the retrieval of data. Additionally, our other systems are Java (which I'm more proficient in) and I'd like to compartmentalize the Perl aspects so we don't have to lean heavily on outside help. I've evaluated the usual suspects (Nutch, Droid etc.), but the time spent on modifying those frameworks to suit our specific information retrieval can't be justified. So I'd like your thoughts regarding the following architecture. I want to create a solution which:
    * uses Java as the interface for managing and executing the Perl scripts
    * uses Java for configuration and data access
    * sticks with Perl for retrieval.
    An example use case would be:
    * a data analyst delivers us a requirement for crawling
    * a Perl developer creates the required script and uses this webapp to submit the script (which gets saved to the filesystem)
    * the script gets kicked off from the webapp with specific parameters ....
    The webapp should be able to create multiple threads of the Perl script to initiate multiple crawlers. So my questions are: what do you think? How solid is the integration between Java and Perl, specifically calling Perl from Java? Has anyone used such a system - one that is actually part Perl? The goal really is to not have a whole bunch of unorganized Perl scripts, and to put some management and organization on our information retrieval. Also, I know I can use Perl to do the web part of what we want, but as I mentioned before I'm trying to keep the Perl focused. But if that seems backwards, I'm not averse to making it an all-Perl solution. Open to any and all suggestions and opinions. Thanks

    Read the article

  • Web Crawling Sites with JavaScript or web forms

    - by Jojo
    Hello everyone, I have a web crawler application. It successfully crawls most common and simple sites. Now I've run into some types of websites where the HTML documents are dynamically generated through forms or JavaScript. I believe they can be crawled, I just don't know how. These websites do not show the actual HTML page; I mean, if I browse the page in IE or Firefox, the HTML code does not match what's actually displayed in IE or Firefox. These sites contain textboxes, checkboxes, etc., so I believe they are what are called "web forms". Actually I am not very familiar with web development, so correct me if I'm wrong. My question is: has anyone been in a similar situation and successfully solved these types of "challenges"? Does anyone know of a book or article regarding web crawling, particularly one that covers these advanced types of websites? Thanks.
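
    For the plain HTML form case (no JavaScript), a crawler can read the form's action URL and field names out of the <form> markup and submit the request itself. A minimal sketch in Python; the URL and field names below are hypothetical:

        import urllib.parse
        import urllib.request

        # Submit an HTML form the way a browser would: POST the form fields to
        # the form's action URL. The URL and field names are hypothetical;
        # read the real ones out of the page's <form> markup.
        form_action = "http://example.com/search"
        fields = {"query": "crawler", "category": "all"}

        data = urllib.parse.urlencode(fields).encode("ascii")
        req = urllib.request.Request(form_action, data=data)
        with urllib.request.urlopen(req, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        print(len(html))

    JavaScript-generated pages are a separate problem: there the crawler needs either a scriptable browser or a direct call to whatever endpoint the script fetches its data from.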

    Read the article

  • Asynchronous crawling F#

    - by jlezard
    When crawling web pages I need to be careful not to make too many requests to the same domain; for example, I want to put 1 s between requests. From what I understand, it is the time between requests that is important. So to speed things up I want to use async workflows in F#, the idea being to make requests at 1-second intervals while avoiding blocking while waiting for the response.
        let getHtmlPrimitiveAsyncTimer (uri : System.Uri) (timer:int) =
            async {
                let req = (WebRequest.Create(uri)) :?> HttpWebRequest
                req.UserAgent <- "Mozilla"
                try
                    Thread.Sleep(timer)
                    let! resp = (req.AsyncGetResponse())
                    Console.WriteLine(uri.AbsoluteUri + " got response")
                    use stream = resp.GetResponseStream()
                    use reader = new StreamReader(stream)
                    let html = reader.ReadToEnd()
                    return html
                with
                | _ as ex -> return "Bad Link"
            }
    Then I do something like:
        let uri1 = System.Uri "http://rue89.com"
        let timer = 1000
        let jobs = [| for i in 1..10 -> getHtmlPrimitiveAsyncTimer uri1 timer |]
        jobs
        |> Array.mapi (fun i job ->
            Console.WriteLine("Starting job " + string i)
            Async.StartAsTask(job).Result)
    Is this all right? I am very unsure about 2 things:
    * Does the Thread.Sleep thing work for delaying the request?
    * Is using StartAsTask a problem?
    I am a beginner (as you may have noticed) in F# (and in coding in general, actually), and everything involving threads scares me :) Thanks!!

    Read the article

  • Not crawling the same content twice

    - by sirrocco
    I'm building a small application that will crawl sites where the content is growing (like Stack Overflow), with the difference that the content, once created, is rarely modified. In the first pass I crawl all the pages on the site. But on the next pass, I don't want to re-crawl all of the site's paged content, just the latest additions. So if the site has 500 pages, and on the second pass it has 501 pages, I would only crawl the first and second pages. Would this be a good way to handle the situation? In the end, the crawled content will end up in Lucene, creating a custom search engine. So I would like to avoid crawling the same content multiple times. Any better ideas? EDIT: Let's say the site has a page, Results, that is accessed like so: Results?page=1, Results?page=2 ... etc. I guess that keeping track of how many pages there were at the last crawl and just crawling the difference would be enough. (Maybe using a hash of each result on the page - if I start running into the same hashes, I should stop.)
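
    The stopping rule described at the end can be made concrete by keeping a hash per result and ending the paged pass as soon as a page yields nothing new. A minimal sketch in Python; fetch_results(page) is a placeholder for whatever returns the result snippets on a given page, and the seen-hash store is simplified to an in-memory set:

        import hashlib

        def content_hash(text):
            return hashlib.sha1(text.encode("utf-8")).hexdigest()

        def incremental_crawl(fetch_results, seen_hashes):
            """Walk Results?page=1, 2, ... and stop when a page has nothing new."""
            page = 1
            while True:
                results = fetch_results(page)
                if not results:
                    break
                new = [r for r in results if content_hash(r) not in seen_hashes]
                if not new:
                    break  # everything on this page was crawled in a previous pass
                for r in new:
                    seen_hashes.add(content_hash(r))
                    # hand r off to the Lucene indexer here
                page += 1
            return seen_hashes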

    Read the article

  • Where does Googlebot start crawling?

    - by Nirmal
    Say I register a domain and develop it into a complete website. From where, and how, does Googlebot know that the new domain is up? Does it always start with the domain registry? If it starts with the registry, does that mean that anyone can have complete access to the registry's database? Thanks for any insight.

    Read the article

  • Prevent bot from crawling certain areas of site.

    - by Skoder
    Hey, I don't know much about SEO and how web spiders work, so forgive my ignorance here. I'm creating a site (using ASP.NET MVC) which has areas that display information retrieved from the database. The data is unique to the user, so there's no real server-side output caching going on. However, since the data can contain things the user may not wish to have displayed in search engine results, I'd like to prevent any spiders from accessing the search results page. Are there any special actions I should take to ensure that the search results directory isn't crawled? Also, would a spider even crawl a page that's dynamically generated, and would any actions preventing certain directories from being searched mess up my search engine rankings? Edit: I should add that I'm reading up on the robots.txt protocol, but it relies on co-operation from the web crawler. However, I'd also like to prevent any data-mining users who will ignore the robots.txt file. I appreciate any help!

    Read the article
