Search Results

Search found 446 results on 18 pages for 'crawl'.

Page 4 of 18

  • Java - HtmlUnit - Unable to save HTML to file (in some cases)

    - by Walter White
    Hi all, I am having intermittent issues saving the response HTML in HtmlUnit:

        Caused by: java.io.IOException: Unable to save file:C:\ccview\PP50773_4.0_walter\TSC_hca\Applications\HCA_J2EE\HCA\target\HtmlUnitTests\single\1\com\pnc\tsc\hca\ui\test\SiteCrawler\crawlSiteAsProvider\10.SiteCrawler.crawl.html
            at com.pnc.tsc.hca.ui.util.GetUtil.save(GetUtil.java:128)
            at com.pnc.tsc.hca.ui.util.GetUtil.add(GetUtil.java:75)
            at com.pnc.tsc.hca.ui.util.GetUtil.click(GetUtil.java:49)
            at com.pnc.tsc.hca.ui.test.SiteCrawler.crawl(SiteCrawler.java:87)
            at com.pnc.tsc.hca.ui.test.SiteCrawler.crawl(SiteCrawler.java:61)
            at com.pnc.tsc.hca.ui.test.SiteCrawler.crawl(SiteCrawler.java:63)
            at com.pnc.tsc.hca.ui.test.SiteCrawler.crawl(SiteCrawler.java:63)
            at com.pnc.tsc.hca.ui.test.SiteCrawler.crawl(SiteCrawler.java:63)
            at com.pnc.tsc.hca.ui.test.SiteCrawler.crawl(SiteCrawler.java:54)
            at com.pnc.tsc.hca.ui.test.SiteCrawler.crawlSiteAsProvider(SiteCrawler.java:50)
            ... 15 more
        Caused by: java.lang.RuntimeException: java.io.IOException: The system cannot find the path specified
            at com.gargoylesoftware.htmlunit.html.XmlSerializer.getAttributesFor(XmlSerializer.java:165)
            at com.gargoylesoftware.htmlunit.html.XmlSerializer.printOpeningTag(XmlSerializer.java:126)
            at com.gargoylesoftware.htmlunit.html.XmlSerializer.printXml(XmlSerializer.java:83)
            at com.gargoylesoftware.htmlunit.html.XmlSerializer.printXml(XmlSerializer.java:93)
            at com.gargoylesoftware.htmlunit.html.XmlSerializer.printXml(XmlSerializer.java:93)
            at com.gargoylesoftware.htmlunit.html.XmlSerializer.asXml(XmlSerializer.java:73)
            at com.gargoylesoftware.htmlunit.html.XmlSerializer.save(XmlSerializer.java:55)
            at com.gargoylesoftware.htmlunit.html.HtmlPage.save(HtmlPage.java:2259)
            at com.pnc.tsc.hca.ui.util.GetUtil.save(GetUtil.java:126)
            ... 24 more
        Caused by: java.io.IOException: The system cannot find the path specified
            at java.io.WinNTFileSystem.createFileExclusively(Native Method)
            at java.io.File.createNewFile(File.java:883)
            at com.gargoylesoftware.htmlunit.html.XmlSerializer.createFile(XmlSerializer.java:216)
            at com.gargoylesoftware.htmlunit.html.XmlSerializer.getAttributesFor(XmlSerializer.java:160)
            ... 32 more

    Now, the parent directory exists and some other files have already been written to the directory. Looking at the filename, I don't see anything that would stand out as a red flag indicating the filename is bad. What can I do to correct this error? Thanks, Walter

    Read the article

  • Web crawler update strategy

    - by superb
    I want to crawl useful resources (like background pictures) from certain websites. It is not a hard job, especially with the help of wonderful projects like Scrapy. The problem is that I don't just want to crawl a site ONE TIME; I also want to keep my crawl running long-term and pick up updated resources. So I want to know: is there any good strategy for a web crawler to get updated pages? Here's a coarse algorithm I've thought of. I divide the crawl process into rounds: in each round the URL repository gives the crawler a certain number of URLs (say 10,000) to crawl, and then the next round begins. The detailed steps are: (1) the crawler adds the start URLs to the URL repository; (2) the crawler asks the URL repository for at most N URLs to crawl; (3) the crawler fetches those URLs and updates certain information in the URL repository, such as the page content, the fetch time and whether the content has changed; (4) go back to step 2. To make this concrete, I still need to solve the following question: how do I decide the "freshness" of a web page, i.e. the probability that the page has been updated? Since that is an open question, hopefully it will bring some fruitful discussion here.
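    A rough sketch of the round-based scheduling described above, written in Go purely for illustration. The repository structure, the adaptive revisit interval and its bounds are assumptions, not something prescribed by the question; the idea is simply that pages which keep changing get revisited sooner, and static pages back off.

        package crawler

        import (
            "sort"
            "time"
        )

        // record tracks what we know about one URL between rounds.
        type record struct {
            URL       string
            NextFetch time.Time     // when the URL is due to be re-crawled
            Interval  time.Duration // current revisit interval
        }

        // Repository hands out batches of due URLs and adapts each
        // URL's revisit interval based on whether its content changed.
        type Repository struct {
            records map[string]*record
        }

        func NewRepository(seeds []string) *Repository {
            r := &Repository{records: map[string]*record{}}
            for _, u := range seeds {
                r.records[u] = &record{URL: u, NextFetch: time.Now(), Interval: time.Hour}
            }
            return r
        }

        // NextBatch returns at most n URLs that are due for (re-)crawling,
        // oldest due date first (step 2 of the round-based algorithm).
        func (r *Repository) NextBatch(n int) []string {
            var due []*record
            now := time.Now()
            for _, rec := range r.records {
                if !rec.NextFetch.After(now) {
                    due = append(due, rec)
                }
            }
            sort.Slice(due, func(i, j int) bool { return due[i].NextFetch.Before(due[j].NextFetch) })
            if len(due) > n {
                due = due[:n]
            }
            urls := make([]string, len(due))
            for i, rec := range due {
                urls[i] = rec.URL
            }
            return urls
        }

        // Update records the outcome of a fetch (step 3): if the page changed,
        // revisit it sooner next time; otherwise back off, within fixed bounds.
        func (r *Repository) Update(url string, changed bool) {
            rec := r.records[url]
            if changed {
                rec.Interval /= 2
                if rec.Interval < 10*time.Minute {
                    rec.Interval = 10 * time.Minute
                }
            } else {
                rec.Interval *= 2
                if rec.Interval > 30*24*time.Hour {
                    rec.Interval = 30 * 24 * time.Hour
                }
            }
            rec.NextFetch = time.Now().Add(rec.Interval)
        }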

    Read the article

  • Problem crawling many URLs in .NET: server IP not responding to ping. Maybe a bandwidth or HTTP connection limit is exceeded?

    - by Hamid
    Hi to all. I have developed a web crawling service (a multi-threaded Windows service). It works fine, but sometimes my server's network stops responding: I can't ping the server's IP from the internet, yet I can still ping the other network card (the local IP that has no internet access). After I open the server over Remote Desktop and stop the crawling service, I can ping it again. What is my problem? A bandwidth limit, a maximum-connection limit being exceeded, or something else? How can I prevent this issue? Note: when this problem occurs and I open a browser on the server, I can't open any website either. Could you please help me? Thanks in advance.
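    The asker's own guess (a connection limit being exhausted) suggests one common mitigation: cap how many requests the crawler has in flight at once. A minimal Go sketch of such a cap, purely as an illustration of the idea - the limit of 50 and the 30-second timeout are arbitrary assumed values, and this is not a diagnosis of the actual outage.

        package crawler

        import (
            "net/http"
            "time"
        )

        // maxConcurrent caps simultaneous outbound requests so the crawler
        // cannot exhaust the machine's connection table or saturate the uplink.
        const maxConcurrent = 50

        var slots = make(chan struct{}, maxConcurrent)

        // client reuses connections and gives up on slow servers instead of
        // letting sockets pile up indefinitely.
        var client = &http.Client{Timeout: 30 * time.Second}

        // Fetch performs one GET while holding a concurrency slot.
        // The caller must read and close resp.Body.
        func Fetch(url string) (*http.Response, error) {
            slots <- struct{}{}        // acquire a slot (blocks when 50 are in flight)
            defer func() { <-slots }() // release it when the request returns

            return client.Get(url)
        }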

    Read the article

  • Does Wicket hamper SEO or search engines ability to crawl?

    - by Nick
    We're coming from GWT projects, and because of SEO problems with GWT we're going to steer clear of it for our next project (mainly because SEO is a high priority this time). In choosing a new framework, I'm looking at Wicket and liking what I've seen so far. I've only done a few tutorials, but looking at the WAR layout from those tutorials, it seems most of the HTML pages are in the WEB-INF folder. Is this going to cause problems for SEO and for search engines crawling the site's files? Ideally, I'd like to use Wicket with some AJAX and deploy to Google App Engine.

    Read the article

  • Why does my performance slow to a crawl when I move methods into a base class?

    - by Juliet
    I'm writing different implementations of immutable binary trees in C#, and I wanted my trees to inherit some common methods from a base class. I have lots of binary tree data structures to implement, and I wanted to move some common methods into a base binary tree class. Unfortunately, classes which derive from the base class are abysmally slow, while non-derived classes perform adequately. Here are two nearly identical implementations of an AVL tree to demonstrate: AvlTree: http://pastebin.com/V4WWUAyT, DerivedAvlTree: http://pastebin.com/PussQDmN. The two trees have exactly the same code, but I've moved the DerivedAvlTree.Insert method into the base class. Here's a test app:

        using System;
        using System.Collections.Generic;
        using System.Diagnostics;
        using System.Linq;
        using Juliet.Collections.Immutable;

        namespace ConsoleApplication1
        {
            class Program
            {
                const int VALUE_COUNT = 5000;

                static void Main(string[] args)
                {
                    var avlTreeTimes = TimeIt(TestAvlTree);
                    var derivedAvlTreeTimes = TimeIt(TestDerivedAvlTree);
                    Console.WriteLine("avlTreeTimes: {0}, derivedAvlTreeTimes: {1}", avlTreeTimes, derivedAvlTreeTimes);
                }

                static double TimeIt(Func<int, int> f)
                {
                    var seeds = new int[] { 314159265, 271828183, 231406926, 141421356, 161803399, 266514414, 15485867, 122949829, 198491329, 42 };
                    var times = new List<double>();
                    foreach (int seed in seeds)
                    {
                        var sw = Stopwatch.StartNew();
                        f(seed);
                        sw.Stop();
                        times.Add(sw.Elapsed.TotalMilliseconds);
                    }
                    // throwing away top and bottom results
                    times.Sort();
                    times.RemoveAt(0);
                    times.RemoveAt(times.Count - 1);
                    return times.Average();
                }

                static int TestAvlTree(int seed)
                {
                    var rnd = new System.Random(seed);
                    var avlTree = AvlTree<double>.Create((x, y) => x.CompareTo(y));
                    for (int i = 0; i < VALUE_COUNT; i++)
                    {
                        avlTree = avlTree.Insert(rnd.NextDouble());
                    }
                    return avlTree.Count;
                }

                static int TestDerivedAvlTree(int seed)
                {
                    var rnd = new System.Random(seed);
                    var avlTree2 = DerivedAvlTree<double>.Create((x, y) => x.CompareTo(y));
                    for (int i = 0; i < VALUE_COUNT; i++)
                    {
                        avlTree2 = avlTree2.Insert(rnd.NextDouble());
                    }
                    return avlTree2.Count;
                }
            }
        }

    AvlTree inserts 5000 items in 121 ms; DerivedAvlTree inserts 5000 items in 2182 ms. My profiler indicates that the program spends an inordinate amount of time in BaseBinaryTree.Insert. Anyone who is interested can see the EQATEC log file I've created with the code above (you'll need the EQATEC profiler to make sense of the file). I really want to use a common base class for all of my binary trees, but I can't do that if performance will suffer. What causes my DerivedAvlTree to perform so badly, and what can I do to fix it?

    Read the article

  • Ruby: executing subclass code inside an inherited method

    - by AdamB
    I'm trying to be able to have a global exception capture where I can add extra information when an error happens. I have two classes, "crawler" and "amazon". What I want to do is be able to call "crawl", execute a function in amazon, and use the exception handling in the crawl function. Here are the two classes I have:

        require 'mechanize'

        class Crawler
          Mechanize.html_parser = Nokogiri::HTML

          def initialize
            @agent = Mechanize.new
          end

          def crawl
            puts "crawling"
            begin
              # execute code in Amazon class here?
            rescue Exception => e
              puts "Exception: #{e.message}"
              puts "On url: #{@current_url}"
              puts e.backtrace
            end
          end

          def get(url)
            @current_url = url
            @agent.get(url)
          end
        end

        class Amazon < Crawler
          # some code with errors
          def stuff
            page = get("http://www.amazon.com")
            puts page.parser.xpath("//asldkfjasdlkj").first['href']
          end
        end

        a = Amazon.new
        a.crawl

    Is there a way I can call "stuff" inside of "crawl" so I can use that exception handling over the entire stuff function? Is there a better way to accomplish this?

    Read the article

  • Can the .htaccess file slow down a website to a crawl? If so, are there better ways to solve these problems with different rewrite rules and such?

    - by Parimal
    Here is my .htaccess file:

        RewriteCond %{REQUEST_URI} ^/patients/billing/FAQ_billing\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/billing/getintouch\.html$
        RewriteRule ^patients/billing/(.*)\.html$ $1.php [L,NC]

        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/a\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/b\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/c\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/d\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/e\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/f\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/g\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/h\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/i\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/j\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/k\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/l\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/m\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/n\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/o\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/p\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/q\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/r\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/s\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/t\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/u\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/v\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/w\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/x\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/y\.html$ [OR]
        RewriteCond %{REQUEST_URI} ^/patients/findadoctor/z\.html$
        RewriteRule ^patients/findadoctor/(.*)\.html$ findadoctor.php?id=$1 [L,NC]

    There are lots of rules like that - around 250 lines. Please help me.
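    For what it's worth, long runs of near-identical RewriteCond lines like the block above can usually be collapsed into a single pattern. A hedged sketch of an equivalent rewrite, assuming the findadoctor rules really differ only in the single letter of the filename:

        # One rule instead of 26 RewriteCond lines: capture the single-letter page name directly.
        RewriteRule ^patients/findadoctor/([a-z])\.html$ findadoctor.php?id=$1 [L,NC]

        # Likewise, the billing block can match its two known pages in one condition.
        RewriteCond %{REQUEST_URI} ^/patients/billing/(FAQ_billing|getintouch)\.html$
        RewriteRule ^patients/billing/(.*)\.html$ $1.php [L,NC]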

    Read the article

  • Not crawling the same content twice

    - by sirrocco
    I'm building a small application that will crawl sites where the content is growing (like Stack Overflow); the difference is that content, once created, is rarely modified. In the first pass I crawl all the pages of the site. But on the next pass I don't want to re-crawl all of the paged content, just the latest additions. So if the site had 500 pages on the first pass and has 501 pages on the second, I would only crawl the first and second pages. Would this be a good way to handle the situation? In the end, the crawled content will end up in Lucene, creating a custom search engine, so I would like to avoid crawling the same content multiple times. Any better ideas? EDIT: Let's say the site has a page, Results, that is accessed like so: Results?page=1, Results?page=2, etc. I guess keeping track of how many pages there were at the last crawl and just crawling the difference would be enough (maybe using a hash of each result on the page - if I start running into the same hashes, I should stop).
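    A minimal sketch of the hash-based stop condition from the EDIT, written in Go; the fetchItems callback, the in-memory seen map and the SHA-1 fingerprint are illustrative assumptions rather than anything prescribed by the question.

        package crawler

        import (
            "crypto/sha1"
            "encoding/hex"
            "fmt"
        )

        // Dedup remembers a fingerprint of every result item seen in earlier crawls.
        type Dedup struct {
            seen map[string]bool
        }

        func NewDedup() *Dedup {
            return &Dedup{seen: map[string]bool{}}
        }

        // hash fingerprints one result item (e.g. title + URL of a listing entry).
        func (d *Dedup) hash(item string) string {
            sum := sha1.Sum([]byte(item))
            return hex.EncodeToString(sum[:])
        }

        // CrawlNewPages walks Results?page=1, Results?page=2, ... and stops as soon
        // as a whole page contains only items whose hashes were already seen,
        // i.e. we have reached content covered by a previous crawl.
        func (d *Dedup) CrawlNewPages(baseURL string, fetchItems func(url string) []string) {
            for page := 1; ; page++ {
                url := fmt.Sprintf("%s?page=%d", baseURL, page)
                items := fetchItems(url)
                if len(items) == 0 {
                    return // ran out of pages
                }
                newItems := 0
                for _, it := range items {
                    h := d.hash(it)
                    if !d.seen[h] {
                        d.seen[h] = true
                        newItems++
                        // index the new item into Lucene / the search backend here
                    }
                }
                if newItems == 0 {
                    return // everything on this page was already crawled earlier
                }
            }
        }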

    Read the article

  • Duplicate content issue after URL-change with 301-redirects

    - by David
    We have the following problem: we changed all URLs on our site from oldURL.html to newURL.html and set up 301 redirects (ca. 600 URLs). Google re-crawled our site and indexed all the new URLs (newURL.html), but didn't crawl the old URLs (oldURL.html) again, as there were no internal links pointing at them anymore after the URL change. This resulted in massive ranking drops, because (i) Google thought oldURL.html had exactly the same content as newURL.html, causing duplicate-content issues, and (ii) Google did not transfer the link juice from oldURL to newURL, because the 301 redirect was never noticed. We have now reset all internal links to the old URLs again, which then redirect to the new URLs, in the hope that Google will re-crawl those pages once internal links point at them. This is partially happening, but at a really low speed, so it would take multiple months for all the redirects to be noticed - I guess because Google thinks: "Aah, I already know oldURL.html, so no need to re-crawl it." Possible solutions we thought of are: submitting as many of the old URLs as possible via Webmaster Tools to manually trigger a crawl (we are doing that already), and submitting a sitemap with all the old URLs - but we're not sure that's a good idea, because Google does not seem to like 301 redirects in a sitemap. Both solutions are imperfect, and we cannot wait three months just to regain our old rankings. What are your ideas? Best, David

    Read the article

  • apache-memory-hacker-linux

    - by bibhudatta
    When we start the Linux system it takes only 435 MB of memory, and this is a 4 GB server. When we start the httpd service it takes 1000 MB, then gradually takes all the memory and the server crashes. Even when we stop Apache, it only releases about 200 MB of memory. What could the problem be? Can anyone tell me what these hackers are doing? I see them sending hits to my Apache, but I think they are doing it from this system. Below is the log. Please help me out with this.

        [root@host ~]# tail -20 /var/log/httpd/dostizone.com-combined.log
        180.76.5.143 - - [14/Nov/2011:02:30:16 +0530] "GET /blogs/10248/209403/nfl-panties-since-the-quality-of HTTP/1.1" 403 2298 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
        180.76.5.88 - - [14/Nov/2011:02:30:31 +0530] "GET /blogs/815/158725/new-jersey-attorney-search HTTP/1.1" 403 2290 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
        220.181.108.186 - - [14/Nov/2011:02:30:32 +0530] "GET / HTTP/1.1" 403 5043 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
        crawl-66-249-67-137.googlebot.com - - [14/Nov/2011:02:30:20 +0530] "GET /blogs/805/11279/supra-suprano-high-shoes HTTP/1.1" 200 30642 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
        crawl-66-249-68-51.googlebot.com - - [14/Nov/2011:02:30:37 +0530] "GET /blogs/10514/215084/oakland-raiders-sweatpants-tags HTTP/1.1" 403 2297 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
        220.181.94.237 - - [14/Nov/2011:02:30:12 +0530] "GET /profile/8509 HTTP/1.1" 200 236894 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
        220.181.94.237 - - [14/Nov/2011:02:30:43 +0530] "GET /mode-switch?return_url=%2Fblogs%2F8529%2F160217%2Fclimate-jordan-6 HTTP/1.1" 302 1 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
        crawl-66-249-68-51.googlebot.com - - [14/Nov/2011:02:30:44 +0530] "GET /blogs/390/61573/blackhawk-jerseys-from-the-you HTTP/1.1" 403 2293 "-" "SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)"
        124.115.0.159 - - [14/Nov/2011:02:30:24 +0530] "GET /blogs/693/46081/application/modules/Hecore/externals/scripts/core.js HTTP/1.1" 200 26869 "http://dostizone.com/blogs/693/46081/thomas-sabo-charms-hot-chilli" "Sosospider+(+http://help.soso.com/webspider.htm)"
        124.115.0.159 - - [14/Nov/2011:02:30:24 +0530] "GET /blogs/693/46081/application/modules/Activity/externals/scripts/core.js HTTP/1.1" 200 26873 "http://dostizone.com/blogs/693/46081/thomas-sabo-charms-hot-chilli" "Sosospider+(+http://help.soso.com/webspider.htm)"
        124.115.0.159 - - [14/Nov/2011:02:30:24 +0530] "GET /blogs/693/46081/application/modules/Hecore/externals/scripts/imagezoom/core.js HTTP/1.1" 200 26899 "http://dostizone.com/blogs/693/46081/thomas-sabo-charms-hot-chilli" "Sosospider+(+http://help.soso.com/webspider.htm)"
        180.76.5.153 - - [14/Nov/2011:02:30:50 +0530] "GET /blogs/10252/212268/cleveland-browns-authentic-jerse HTTP/1.1" 403 2298 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
        crawl-66-249-68-51.googlebot.com - - [14/Nov/2011:02:30:51 +0530] "GET /blogs/741/46260/chocolate-ugg-women-boots-1873 HTTP/1.1" 403 2293 "-" "SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)"
        124.115.1.7 - - [14/Nov/2011:02:30:40 +0530] "GET /blogs/682/97454/swarovski-jewellry-sale-articles HTTP/1.1" 200 25770 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
        crawl-66-249-68-51.googlebot.com - - [14/Nov/2011:02:30:56 +0530] "GET /blogs/779/60941/players-a-to-z-michael-cuddyer HTTP/1.1" 403 2293 "-" "SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)"
        crawl-66-249-68-51.googlebot.com - - [14/Nov/2011:02:31:01 +0530] "GET /blogs/469/58551/chicago-bears-news-there-exist HTTP/1.1" 403 2293 "-" "SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)"
        220.181.94.237 - - [14/Nov/2011:02:30:54 +0530] "GET /blogs/8529/160217/climate-jordan-6 HTTP/1.1" 200 30750 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
        180.76.5.59 - - [14/Nov/2011:02:31:05 +0530] "GET /blogs/815/158197/cheap-calgary-flames-jerseys HTTP/1.1" 403 2292 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
        crawl-66-249-68-51.googlebot.com - - [14/Nov/2011:02:31:06 +0530] "GET /mode-switch?return_url=%2Fblogs%2F387%2F45679%2Fhandbag-louis-vuitton-judy-mm-m4 HTTP/1.1" 403 2258 "-" "SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)"
        crawl-66-249-67-137.googlebot.com - - [14/Nov/2011:02:31:10 +0530] "GET /public/temporary/c83b731ecc556d7fd1a7732d9ac16ed6.png HTTP/1.1" 404 2305 "-" "Googlebot-Image/1

    Read the article

  • How to interpret number of URL errors in Google webmaster tools

    - by user359650
    Recently Google made some changes to Webmaster Tools, which are explained here: http://googlewebmastercentral.blogspot.com/2012/03/crawl-errors-next-generation.html. One thing I could not find out is how to interpret the number of errors over time. At the end of February we migrated our website and didn't implement redirect rules for some pages (quite a few, actually). Here is what we're getting from the Crawl errors report. What I don't know is whether the number of errors is cumulative over time (i.e. if Google's bots crawl your website on 2 different days and find 1 separate issue on each day, will they report 1 error for each day, or 1 for the 1st and 2 for the 2nd?). Based on the Crawl stats, we can see that the number of requests made by Google's bots doesn't increase. Therefore I believe the number of errors reported is cumulative, and that an error detected on one day is taken into account and reported on subsequent days until the underlying problem is fixed and the page is crawled again (or you manually mark the error as fixed) - because if Google doesn't make more requests to a website, there is no way it can check new pages and old pages at the same time. Q: Am I interpreting the number of errors correctly?

    Read the article

  • Crawler does not create custom crawled properties

    - by user173739
    These days I am facing a very strange problem. I have a development environment with MOSS 2007 SP2 and Windows Server 2008; search is configured and everything works great. I have started configuring a staging environment (MOSS 2007 SP2 with the June CU) and created a new farm and a new SSP. I deployed my changes with a package (wsp) and manually created the site collections, sub-webs, pages and so on. When a full crawl finishes, I see in the crawl log that all my pages have been successfully crawled, and when I use some test tools to query search, my pages are found. In the crawl log there are a few errors like http://mysite/sites/de/pages "The crawler could not communicate with the server. Check that the server is available and that the firewall access is configured correctly.", but all pages in this Pages library were indexed. The problem is that I use custom managed properties (mapped to custom crawled properties) in search queries, but the crawler didn't create crawled properties for all my new site columns. For example, for the site column IsAccent the crawler didn't create the crawled property ows_IsAccent. I'm sure that I have created pages for the specific content type, and all my crawl categories have "Automatically discover new properties when a crawl takes place" checked. In Site settings - Searchable columns I haven't selected any column as NoCrawl. I tried to export my managed and crawled properties from the dev environment to the staging environment, but all my managed properties came through empty; after that I recreated the SSP... the result was the same. I checked a specific page with tools like SharePoint Manager 2007 and U2U CAML Query Builder 2007: the content type is correct, and I can see the values of my custom site columns. Using U2U CAML Query Builder 2007 against a Pages library, in the Result tab I can see ows_IsAccent (my site column is IsAccent) and other site columns, but I can't find them in Crawled properties. Any ideas?

    Read the article

  • Multithreading a recursive function to perform one task at a time

    - by Ajay
    Hi, I am writing a program to crawl websites. The crawl function is recursive and may take a long time to complete, so I used multithreading to perform the crawl for multiple websites. What I actually need is that, after it finishes crawling one website, it moves on to the next one (which should be waiting in a queue) instead of crawling multiple websites at a time. I am using C# and ASP.NET.
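    The question is about C#/ASP.NET, but the pattern being described - a single worker draining a queue so that only one site is crawled at a time - is language-agnostic. A minimal sketch in Go, with the hypothetical crawlSite callback standing in for the recursive crawl function:

        package crawler

        import "sync"

        // SiteQueue feeds website URLs to a single worker goroutine, so only one
        // site is crawled at a time while callers can still enqueue work concurrently.
        type SiteQueue struct {
            jobs chan string
            wg   sync.WaitGroup
        }

        // NewSiteQueue starts one background worker that calls crawlSite for each
        // queued site, strictly one after another.
        func NewSiteQueue(crawlSite func(url string)) *SiteQueue {
            q := &SiteQueue{jobs: make(chan string, 100)}
            q.wg.Add(1)
            go func() {
                defer q.wg.Done()
                for url := range q.jobs {
                    crawlSite(url) // the (possibly recursive) crawl of one whole site
                }
            }()
            return q
        }

        // Enqueue adds a site to the queue and returns immediately.
        func (q *SiteQueue) Enqueue(url string) { q.jobs <- url }

        // Close waits for all queued sites to finish.
        func (q *SiteQueue) Close() {
            close(q.jobs)
            q.wg.Wait()
        }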

    Read the article

  • Valid robots.txt? [closed]

    - by psot
    I am waiting for Google to crawl my site and display the results in search. Is my robots.txt alright, and will it let Google, Bing etc. crawl my site? Thanks!

        User-agent: *
        Disallow: /cgi-bin/
        Disallow: /wp-admin/
        Disallow: /wp-includes/
        Disallow: /wp-content/
        Disallow: /build/
        Disallow: /css/
        Disallow: /trackback/
        Disallow: /comments
        Disallow: /assets/graphics/
        Disallow: /assets/visual/
        Disallow: /category/*/*
        Disallow: */trackback
        Disallow: */feed
        Disallow: */comments
        Disallow: /*?*
        Disallow: /*?

        User-agent: Slurp
        Disallow: /

        User-agent: Baiduspider
        Disallow: /

        User-agent: ia_archiver
        Disallow: /

        User-agent: duggmirror
        Disallow: /

        User-agent: Yandex
        Disallow: /

        Sitemap: http://example.com/sitemap.xml.gz

    Read the article

  • Why do 410 pages show as errors in Google Webmaster Tools?

    - by ElHaix
    To remove links from our site, we return a 410 code on the links we want removed and show "The page you requested was removed." In Webmaster Tools, I see all the 410 pages under Crawl Errors / Not Found. I'm worried that, because they appear in Crawl Errors, they could be negatively affecting SEO rankings. Is that the case, and if so, should I change the return code from 410 to something else?

    Read the article

  • Hiding a particular page so search engines do not index it

    - by user702325
    I have a page which I don't want search engines to index or crawl, and I am not sure what I should put in my robots.txt file to tell search engines not to crawl/index it. The page itself is generated dynamically and has no predefined template; all I know is its URL, which is predefined and will remain unchanged. The page lives at, say, www.mysite.com/my-nonindexable-page/. Please suggest what I should do to achieve this. I am using WordPress for my website.
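    For reference, a minimal sketch of the two usual mechanisms, using the URL from the question. Note that a robots.txt Disallow only stops compliant crawlers from fetching the page, while a noindex meta tag or X-Robots-Tag header (which requires the page to remain crawlable) is what keeps an already-discovered URL out of the index.

        # robots.txt - block crawling of the page (path taken from the question)
        User-agent: *
        Disallow: /my-nonindexable-page/

        # Alternatively (or additionally), keep the page crawlable and send a
        # noindex signal, either as an HTTP header from the server:
        #   X-Robots-Tag: noindex, nofollow
        # or as a meta tag in the page's <head>:
        #   <meta name="robots" content="noindex, nofollow">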

    Read the article

  • How do private-sale ecommerce sites handle their SEO?

    - by 142857
    On a private-sale ecommerce site, users need to sign up or sign in before they can access the pages of the website, so even if a user tries to navigate directly to a product page, he is redirected to sign in. I am wondering how these sites manage their SEO, since this would imply that Google can't crawl those pages either - or do they simply forgo the SEO benefit of allowing Google to crawl the product and catalogue pages?

    Read the article

  • In Google Webmaster Tools we have 3 sitemaps attributed to 1 domain

    - by Frank
    Thanks for your advice and help ahead of time. I have a website that has been on the internet for almost 10 years, created in Microsoft FrontPage, with over 900 pages. Currently it shows up in Google Webmaster Tools as 2 domains and 3 sitemaps: http://www.example.com, example.com and hostedsitemaps.com. Furthermore, since we were having a hard time placing the XML sitemap on our site (FrontPage issues), we hired pro-sitemaps.com to create, host and upload the XML file, which they did. Thus we have another site, hostedsitemaps.com, in our Webmaster Tools account for the site.
    hostedsitemaps.com shows: 900 URLs submitted, 800 indexed; crawl errors and search queries: no data available.
    http://www.example.com shows: 889 URLs submitted, 1 URL indexed; crawl errors: 14 soft 404, 796 not found; search queries: 8104.
    example.com shows: 889 URLs submitted, 1 URL indexed; crawl errors: 48 soft 404, 91 not found; search queries: 8104.
    My questions and need for help are as follows:
    1. Why are our domain-based sites in Webmaster Tools (example.com and http://www.example.com) showing only 1 URL indexed, while the hosted sitemap shows 800 indexed?
    2. Should we have 3 domains configured for this one domain in Google Webmaster Tools?
    3. Should we eliminate/delete the hosted sitemap from Webmaster Tools completely and take off that XML sitemap?
    4. Does having both example.com and http://www.example.com impact web ranking?
    5. Any other thoughts or help in this very complicated matter would be appreciated. Thanks.

    Read the article

  • Does a large number of internal broken links affect SEO?

    - by TheBigK
    We have a WordPress blog and had the Disqus plugin installed for several months. Around late August this year, the plugin created a ton of URLs that linked to non-existent locations on our website. For example, for the correct URL domain.com/correct-URL/, Disqus created domain.com/correct-URL/344322/ and domain.com/correct-URL/433466/, both of which throw 404. So essentially, Google found a LARGE number of broken links that pointed to unknown locations on our own domain. As the count of those 404 errors rose, our site suffered a massive drop in traffic and the crawl rate dropped to 10% of what it was earlier. I wish to know: can a large number of internal broken links (we have over 99k of them) cause rankings to drop? I've fixed the issue in one go by creating 301 redirects from each bad URL to the correct URL and removing Disqus. Google, however, only drops the count by ~1000 daily as I mark errors as 'fixed' in Google Webmaster Tools. Is there any way to speed this up? Should I set a custom crawl rate of 'Fast' in GWT to make Google crawl our website faster? I'd appreciate your input and shared experience.
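    If the broken URLs really are just the correct permalink with a numeric segment appended, as in the examples above, the 99k individual redirects can usually be collapsed into a single pattern. A hedged .htaccess sketch, assuming Apache with mod_rewrite and that no legitimate URL on the site ends in a purely numeric path segment:

        # Redirect /correct-URL/344322/ (and similar) back to /correct-URL/ with a 301.
        RewriteEngine On
        RewriteRule ^(.+)/[0-9]+/$ /$1/ [R=301,L]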

    Read the article

  • Is the Facebook Like JavaScript related to an increase in "Time spent downloading a page" in GWT?

    - by donaldthe
    Hi, I installed the Facebook Like button (JavaScript version) on my website on December 15th. Take a look at this report from Google Webmaster Central (Crawl stats: Googlebot activity in the last 90 days). The crawl stats are from Googlebot, which as far as I know doesn't execute JavaScript. Could the Facebook Like JavaScript code, "the XFBML version", be related to the large spike in "Time spent downloading a page"? (By the way, the huge spike in November was caused by a mistake where every image request was getting a 301.) I'm not sure what caused the spike to go down by half somewhere in December; it may have been related to a faulty setting in web.config. I'm at a loss as to what I can do about this, or even how to tell whether this is my problem or Googlebot's crawl problem. Here is the Facebook code I am using to create the Like button. It is right after the opening body tag:

        <div id="fb-root"></div>
        <script>
          window.fbAsyncInit = function() {
            FB.init({appId: 'xxxxx', status: true, cookie: true, xfbml: true});
          };
          (function() {
            var e = document.createElement('script');
            e.async = true;
            e.src = document.location.protocol + '//connect.facebook.net/en_US/all.js';
            document.getElementById('fb-root').appendChild(e);
          }());
        </script>

    and this creates the Like box: <fb:like show_faces="false"></fb:like>. If the JavaScript can't be the problem, any ideas on where to start looking would be appreciated.

    Read the article

  • robots.txt file with more restrictive rules for certain user agents

    - by Carson63000
    Hi, I'm a bit vague on the precise syntax of robots.txt, but what I'm trying to achieve is:
    - Tell all user agents not to crawl certain pages
    - Tell certain user agents not to crawl anything (basically, some pages with enormous amounts of data should never be crawled; and some voracious but useless search engines, e.g. Cuil, should never crawl anything)
    If I do something like this:

        User-agent: *
        Disallow: /path/page1.aspx
        Disallow: /path/page2.aspx
        Disallow: /path/page3.aspx

        User-agent: twiceler
        Disallow: /

    ...will it flow through as expected, with all user agents matching the first rule and skipping page1, page2 and page3, and twiceler matching the second rule and skipping everything?

    Read the article

  • Erlang OTP application design

    - by Toby Hede
    I am struggling a little coming to grips with the OTP development model as I convert some code into an OTP app. I am essentially making a web crawler, and I just don't quite know where to put the code that does the actual work. I have a supervisor which starts my worker:

        -behaviour(supervisor).

        -define(CHILD(I, Type), {I, {I, start_link, []}, permanent, 5000, Type, [I]}).

        init(_Args) ->
            Children = [
                ?CHILD(crawler, worker)
            ],
            RestartStrategy = {one_for_one, 0, 1},
            {ok, {RestartStrategy, Children}}.

    In this design, the crawler worker is then responsible for doing the actual work:

        -behaviour(gen_server).

        start_link() ->
            gen_server:start_link(?MODULE, [], []).

        init([]) ->
            inets:start(),
            httpc:set_options([{verbose_mode,true}]),
            % gen_server:cast(?MODULE, crawl),
            % ok = do_crawl(),
            {ok, #state{}}.

        do_crawl() ->
            % crawl!
            ok.

        handle_cast(crawl, State) ->
            ok = do_crawl(),
            {noreply, State};

    do_crawl spawns a fairly large number of processes and requests that handle the work of crawling via HTTP. The question, ultimately, is: where should the actual crawl happen? As can be seen above, I have been experimenting with different ways of triggering the actual work, but I'm still missing some concept essential for grokking the way things fit together. Note: some of the OTP plumbing is left out for brevity - the plumbing is all there and the system hangs together.

    Read the article

  • Trouble with the Go tour crawler exercise

    - by David Mason
    I'm going through the go tour and I feel like I have a pretty good understanding of the language except for concurrency. On slide 71 there is an exercise that asks the reader to parallelize a web crawler (and to make it not cover repeats but I haven't gotten there yet.) Here is what I have so far:

        func Crawl(url string, depth int, fetcher Fetcher, ch chan string) {
            if depth <= 0 {
                return
            }
            body, urls, err := fetcher.Fetch(url)
            if err != nil {
                ch <- fmt.Sprintln(err)
                return
            }
            ch <- fmt.Sprintf("found: %s %q\n", url, body)
            for _, u := range urls {
                go Crawl(u, depth-1, fetcher, ch)
            }
        }

        func main() {
            ch := make(chan string, 100)
            go Crawl("http://golang.org/", 4, fetcher, ch)
            for i := range ch {
                fmt.Println(i)
            }
        }

    The issue I have is where to put the close(ch) call. If I put a defer close(ch) somewhere in the Crawl method, then I end up writing to a closed channel in one of the spawned goroutines, since the method will finish execution before the spawned goroutines do. If I omit the call to close(ch), as is shown in my example code, the program deadlocks after all the goroutines finish executing, but the main thread is still waiting on the channel in the for loop since the channel was never closed.
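    One common way to resolve this particular deadlock (a sketch only, reusing the tour's fetcher and Fetcher and adding fmt and sync imports; it does not yet deduplicate URLs): have each Crawl call register its children in a sync.WaitGroup, and close the channel from a separate goroutine once the whole call tree has finished, so the range loop in main can terminate.

        import (
            "fmt"
            "sync"
        )

        func Crawl(url string, depth int, fetcher Fetcher, ch chan string, wg *sync.WaitGroup) {
            defer wg.Done() // signal that this call has finished
            if depth <= 0 {
                return
            }
            body, urls, err := fetcher.Fetch(url)
            if err != nil {
                ch <- fmt.Sprintln(err)
                return
            }
            ch <- fmt.Sprintf("found: %s %q", url, body)
            for _, u := range urls {
                wg.Add(1) // register each child before spawning it
                go Crawl(u, depth-1, fetcher, ch, wg)
            }
        }

        func main() {
            ch := make(chan string, 100)
            var wg sync.WaitGroup
            wg.Add(1)
            go Crawl("http://golang.org/", 4, fetcher, ch, &wg)
            go func() {
                wg.Wait() // every Crawl call has returned; no one will send again
                close(ch) // now the range over ch can terminate
            }()
            for line := range ch {
                fmt.Println(line)
            }
        }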

    Read the article
