crawl - Page 11 - Developer IT

Site monitoring tool to look for javascript errors

- by Agile Noob

I am currently working on a site that includes javascript code that we get from several different sources and need to run on the site I maintain. Every once and a while some of this code breaks without our knowing until its too late. Is there a monitoring tool that will crawl our site and look for javascript errors and report them or could this be incorporated into a selenium test somehow?

Read the article

how to store data crawled from website

- by Richard

I want to crawl a website and store the content on my computer for later analysis. However my OS file system has a limit on the number of sub directories, meaning storing the original folder structure is not going to work. Suggestions? Map the URL to some filename so can store flatly? Or just shove it in a database like sqlite to avoid file system limitations?

Read the article

Nutch search always returns 0 results

- by darbour

I have set up nutch 1.0 on a cluster. It has been setup and has successfully crawled, I copied the crawl directory using the dfs -copyToLocal and set the value of searcher.dir in the nutch-site.xml file located in the tomcat directory to point to that directory. Still when I try to search I receive 0 results. Any help would be greatly appreciated.

Read the article

XML Import how would you do it?

- by Rico

XML is used as one of our main integration points. it comes over by many clients at a time but too many clients importing at the same time can slow down our database to a crawl. Someone has to have solved a problem like this. I am basically using VB to parse through the data and import what i want and don't want. Is there a better way?

Read the article

Do not filter outlinks in Nutch?

- by sigpwned

I'm currently trying to perform a deep crawl within a small list of sites. To accomplish this, I updated conf/domain-urlfilter.txt with the domains of the sites I wish to scrape, which worked nicely. However, I found that not only were the links crawled at every step filtered, but the outlinks captured from each page crawled were filtered as well. Is there a way to avoid filtering captured outlinks while still filtering crawled URLs?

Read the article

Is there a good Python library that can parse C++?

- by csbrooks

Google didn't turn up anything that seemed relevant. I have a bunch of existing, working C++ code, and I'd like to use python to crawl through it and figure out relationships between classes, etc. EDIT: Just wanted to point out: I don't think I need or want to parse every bit of C++; I just need something smart enough to pick up on class, function and member variable declarations, and to skip over function definitions.

Read the article

How to get Type[] of all classes a class extends

- by Jeff Dahmer

You can do Type[] interfaces = typeof(MyClass).GetInterfaces(); to get a list of everything a class implements implements. I am wondering if there is anyway to crawl the "extends" tree to see all the base types a class inherits, i.e. abstract classes etc.?

Read the article

How does Comparison Sites work?

- by Vijay

Need your thinking on how does these Comparision Sites actually work. Sites like Junglee.com policybazaar.com and there are many like these which provides comaprision of products , fares etc. grabbed from different websites. I had read a little about it and what i found is-: These sites uses Feeds of the sites data. These sites uses APIs of the sites which are actually provided by those sites. And for some sites which do not have any of these two posibility then the Comparision sites uses web-crawler to crawl their data. This is what i have found out. If you think there is more things to it please do give your own views. But i want to know these for my learning purpose and a little for curiosity- how does they actually matches the crawled data , feeds, and other so that there is no duplicacy. What is the process or algorithms for it. And where should i go to learn these concepts. References for books , articles or anything else.

Read the article

Can't Delete Old Windows Directory

- by David Mullin

I got a new SSD drive for my computer, and have installed Windows on this drive. This left an old Windows directory on my old normal drive. I am now attempting to delete this old Windows directory, but am getting blocked by security. If I crawl down into each subdirectory, I can manually change the ownership and access rights for each file, but if I attempt to do it from the root directory, I get a "Failed to enumerate objects in the container. Access is denied" error. I have tried logging in as local Administrator, but this had the same effect. I figure that I am missing something stupid, but I just can't determine what it is.

Read the article

How do you find all the links to disavow for a Google reconsideration request? [duplicate]

- by QF_Developer

This question already has an answer here: How to identify spammy domains giving backlinks to my site (to submit in disavow links in WMT) 2 answers A few months ago I received the following notification on Google Webmaster for a website I look after. Unnatural links to your site—impacts links Google has detected a pattern of unnatural artificial, deceptive, or manipulative links pointing to pages on this site. Some links may be outside of the webmaster’s control, so for this incident we are taking targeted action on the unnatural links instead of on the site’s ranking as a whole. Learn more. The question here is, should we actively attempt to disavow these links given that the action is seemingly targeted to just a bunch of keywords? I've downloaded the inbound links sample from Google Webmaster and so far I've been through the disavow and reconsideration requests process 6 times, each taking 2-3 weeks only to be supplied just 2 more links that Google don't approve of. At this rate it will take me the rest of my natural life to cleanup all these spammy links! It seems disavowing is futile as they haven't implemented broad actions against the website as a whole and (from what I can gather) have already nullified the value of those offending links. Under the quoted statement above however is a reconsideration request button that seems to imply I should be actively doing something here? UPDATE 14th October -- I have since created a small .NET application that you can feed the CSV sample links file into from Google Webmaster. What this tool does is crawl all the links and looks for specific linking patterns as per some configurable match strings. I realised that many of the links that Google are taking issue with were created by a rogue SEO firm we hired several years ago. All the links are appended with 1 of 5 different descriptions. The application I built uses some regexes to isolate any link sources with these matching appendages and automatically builds the disavow txt file. In the end it had to come down to an algorithm as manually disavowing links on this scale would take weeks! I will post the app here once I've cleaned it up.

Read the article

Is Flash typically slow on Linux?

- by CSarnia

Specifically, I'm running Mint 8 (Helena). I'm extremely new to Linux, and was searching for a solution that was user-friendly and GUI oriented. The box won't be used for much other than web browsing and word processing. Anyway, it runs relatively smoothly, except for Youtube videos... especially full-screen, which runs at like 1 FPS, and even after closing, slows Firefox to a crawl until I restart it. I'd seen an xkcd comic on the matter, but regarded it as a joke until now. Is this actually a problem? Are there any remedies I can try to smooth the applications?

Read the article

Bandwidth Control on our Internet Connection

- by AlamedaDad

Hi all, I have Covad dual/bonded T1 service in our office coming through a Cisco 1841 and then through a Sonicwall 3060Pro/Enhanced SW firewall. The problem I'm looking for some input on is how to limit the amount of bandwidth any single user/PC can user for downloading a file from the Internet. It's become an issue that when one person happens to download let's say an ~300MB file, normal internet access for the other employees slows to a crawl. I've seen through MRTG that in fact usage of the circuit jumps to the full 3mb for the duration of the download and then drops. Is it possible to control this? I'm not familiar with QOS or the like so I'm not sure. Any help on this would be appreciated. Thanks...Michael

Read the article

Webmaster tools, Duplicate Meta Descriptions, and Short Meta Descriptions [closed]

- by Watsy91

Possible Duplicate: Do meta keywords have any impact on ranking algorithms? I am fairly new to the whole Webmaster Tools concept. I have been looking at all the different options, such as crawl errors, HTML improvements etc... I have been looking at the Duplicate Meta Descriptions and Short Meta Descriptions, was wondering if anyone could suggest ideas on how to go about improving this. It seems that all the Duplicates are from the URL Title and the short description. It would seem to me that most people would have information regarding the page with the same keywords as their titles. Heres an example of one: These are the ultimate hampers in taste, quality and value. Amongst this range of luxury hampers ar /food-hampers/food-hampers-over-100.html /thank-you-gifts/large-gifts-over-100.html To get to the point I just want to know do these things really matter? Would they have a real consequence on my sites rankings? My sites have been falling down the rankings since early this year and I have really started to look at Google Analytics and Webmaster tools to try and indicate certain problems. I have researched the Internet and it seems that some people don't bother and others do!! I know that Stackoverflow has 100s+ people who have went through the above and I would really appricate if they could give me some tips etc. Or in the END does it really matter?? :D

Read the article

Win 7 accessing large files uses 100% RAM

- by user181276

Running Win 7 64-bit SP1 with 8 GB RAM. I first noticed this problem when using the GUI to copy some large (5+ GB) files from one disk to another. What happens is the physical memory in use rises quite quickly to 100% and the system comes to a crawl. If I just start to access the file in a media player (it is a movie) the memory usage climbs up slowly but eventually reaches 100%. When copying the same files via XCOPY I do not have this problem. Using RAMMAP I see most of the memory usage is under "Mapped File" and is allocated under the "Active" column. If I select "Empty System Working Set" the RAM usage drops back down but then starts to climb back up. Any ideas on what I can check/test to eliminate this issue?

Read the article

printing foreign text for PHP on UBUNTU and CENTOS

- by hao

Hey guys, I am using domdocuments and using things like $div-nodeValue to obtain certain info from a web page. On my ubuntu machine when i do php crawl.php everything is displayed properly in Chinese (the page is in UTF-8). However on my CENTOS machine using the same code I get æ´å¤åå¸ when I print in the terminal. and when I save it to the database, the characters are also messed up. One thing I noticed is that when I do print $content, both systems display them properly.

Read the article

Robots.txt practices with .htaccess redirections (inherits)

- by Jayhal

I have a question regarding how to write robots.txt files for many domains and subdomains with redirects in place. We have a hosting account that enacts primary and add-on domains. All of our domains and subdomains, including the primary domain, is redirected via htaccess 301s to their own subdirectories in the primary domain's root directory. I'm confused about how I would write the robots.txt for certain directories. First, I wanted to confirm I am right in understanding that for domains and subdomains, crawlers will look to the directory that acts as that urls root directory for the crawling rules(robots.txt). Also, that a directory will not be affected by a robots.txt present in their parent directory if the directory has its own domain/subdomain, and that url is the one being accessed by crawlers. (Am pretty sure, but I wanted to confirm I didnt have a fundamentally flawed understanding of robots.txt) In the original root directory on the account(where the primary domain was directed before htaccess was put in place) what should the robots.txt contain? When crawlers look to crawl our primary domain, will they look to the original root directory for the robots.txt or will they reference the file contained in the new subdirectory where all the primary domain's site files are located? If so, what should the root's robot.txt include if anything at all. Would I be right to include a simple 'disallow: /' for all agents, and then include more specific robots.txt files in each subdirectory with more specific instructions. Would that affect the crawling of the directory where the primary domain is now redirected? Any help is greatly appreciated, Thanks!

Read the article

MS Bing web crawler out of control causing our site to go down

- by akaDanPaul

Here is a weird one that I am not sure what to do. Today our companies e-commerce site went down. I tailed the production log and saw that we were receiving a ton of request from this range of IP's 157.55.98.0/157.55.100.0. I googled around and come to find out that it is a MSN Web Crawler. So essentially MS web crawler overloaded our site causing it not to respond. Even though in our robots.txt file we have the following; Crawl-delay: 10 So what I did was just banned the IP range in iptables. But what I am not sure to do from here is how to follow up. I can't find anywhere to contact Bing about this issue, I don't want to keep those IPs blocked because I am sure eventually we will get de-indexed from Bing. And it doesn't really seem like this has happened to anyone else before. Any Suggestions? Update, My Server / Web Stats Our web server is using Nginx, Rails 3, and 5 Unicorn workers. We have 4gb of memory and 2 virtual cores. We have been running this setup for over 9 months now and never had an issue, 95% of the time our system is under very little load. On average we receive 800,000 page views a month and this never comes close to bringing / slowing down our web server. Taking a look at the logs we were receiving anywhere from 5 up to 40 request / second from this IP range. In all my years of web development I have never seen a crawler hit a website so many times. Is this new with Bing?

Read the article

GWT: reporting crawling errors for non existing links

- by pixeline

Google Webmaster Tools is reporting crawl errors for links that never existed, and if i check the "Linked from" tab for a given error link, it shows another that never existed. They all mention joomla/ which is not the cms used on this domain (it's wordpress fyi). Exampled: http://example.com/joomla/index.php/component/user/register Linked from: http://example.com/joomla/component/user/login?return=L2###### What is going on? UPDATE 1 I tried something: I provided one of the faulty urls to the "Fetch as Google" functionality. Instead of returning a 404, it returns a 301 to another Joomla page. HTTP/1.1 301 Moved Permanently Server: Apache/2.4.3 X-Powered-By: PHP/5.4.4-10 X-Pingback: http://example.com/xmlrpc.php Expires: Wed, 11 Jan 1984 05:00:00 GMT Cache-Control: no-cache, must-revalidate, max-age=0 Pragma: no-cache Set-Cookie: PHPSESSID=1fgr5v2oip39miibuptd51s8h0; path=/ Set-Cookie: woocommerce_items_in_cart=0; expires=Sat, 12-Jan-2013 11:44:01 GMT; path=/ Location: http://example.com/joomla/component/user/register Content-Type: text/html; charset=iso-8859-1 Content-Length: 387 Date: Sat, 12 Jan 2013 12:44:01 GMT Via: 1.1 varnish Connection: keep-alive Accept-Ranges: bytes Age: 0 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head> <title>301 Moved Permanently</title> </head><body> <h1>Moved Permanently</h1> <p>The document has moved <a href="http://example.com/joomla/component/user/register">here</a>.</p> <p>Additionally, a 301 Moved Permanently error was encountered while trying to use an ErrorDocument to handle the request.</p> </body></html>

Read the article

Is there a "Run Level Configurator" for Windows XP?

- by djangofan

I have been having trouble with something causing my Windows XP system to take 6-8 minutes to startup from a cold boot. Something is happening during startup that is causing the system to crawl. I have a lot of Linux experience , especially configuring run levels so that some programs start before others do. I also know how to do that on Windows XP but its really complex, and with 50 services I'd have to keep a giant spreadsheet to keep it all organized. Is there such a thing as a Windows XP Tool that "emulates" the Linux run-level editors that I can use to control the order than services start on my system?

Read the article

Handling SEO for Infinite pages that cause external slow API calls

- by Noam

I have an 'infinite' amount of pages in my site which rely on an external API. Generating each page takes time (1 minute). Links in the site point to such pages, and when a users clicks them they are generated and he waits. Considering I cannot pre-create them all, I am trying to figure out the best SEO approach to handle these pages. Options: Create really simple pages for the web spiders and only real users will fetch the data and generate the page. A little bit 'afraid' google will see this as low quality content, which might also feel duplicated. Put them under a directory in my site (e.g. /non-generated/) and put a disallow in robots.txt. Problem here is I don't want users to have to deal with a different URL when wanting to share this page or make sense of it. Thought about maybe redirecting real users from this URL back to the regular hierarchy and that way 'fooling' google not to get to them. Again not sure he will like me for that. Letting him crawl these pages. Main problem is I can't control to rate of the API calls and also my site seems slower than it should from a spider's perspective (if he only crawled the generated pages, he'd think it's much faster). Which approach would you suggest?

Read the article

Can .htaccess slow down a site?

- by Cody Sharp

I'm working with a client on an e-commerce website. I implemented clean URLs using .htaccess. I also used .htaccess to solve canonical issues such as redirecting www to non-www and removing index.php from the URL. The website recently began to slow down dramatically, sometimes not even loading. The site is hosted on GoDaddy, and when the client called GoDaddy they told him it was the .htaccess file slowing down the website. I find this highly unlikely because of my past experiences, but I'm not 100% sure. My thinking is that the client's website is most likely on a shared server with a busy neighborhood, thus slowing down the site. It's not always slow, but rather sporadic throughout the day, loading fast at some points and slow at other points in time. Can the .htaccess file slow down a website to a crawl? If so, are there better ways to solve these problems with different rewrite rules and such? Here is what the actual .htaccess file looks like: Options +FollowSymlinks RewriteEngine On RewriteBase / RewriteCond %{HTTP_HOST} ^www.example.net [NC] RewriteRule ^(.*)$ http://example.net/$1 [L,R=301] RewriteRule ^products/([0-9a-zA-Z\_\-]*)\.htm([l]?)$ index.php p=product&product_code=$1 [L] RewriteRule ^catalog/([0-9a-zA-Z\_\-]*)\.htm([l]?)$ index.php p=catalog&catalog_code=$1 [L] RewriteRule ^pages/([0-9a-zA-Z\_\-]*)\.htm([l]?)$ index.php?p=page&page_id=$1 [L] RewriteRule ^index\.htm([l]?)$ index.php?p=home [L] RewriteRule ^site_map\.htm([l]?)$ index.php?p=site_map [L] RewriteCond %{QUERY_STRING} ^p=home$ RewriteRule (.*) ? [R=permanent] I'm a .htaccess and regex novice, so any pointed out mistakes would also help. Thank you.

Read the article

Crawling a Content Folio

- by Kyle Hatlestad

Content Folios in WebCenter Content allow you to assemble, track, and access a logical group of documents and/or links. It allows you to manage them as just a list of items (simple folio) or organized as a hierarchy (advanced folio). The built-in UI in content server allows you to work with these folios, but publishing them or consuming them externally can be a bit of a challenge. The folios themselves are actually XML files that contain the structure, attributes, and pointers to the content items. So to publish this somewhere, such as a Site Studio page, you could perhaps use an XML parser to traverse the structure and create your output. But XML parsers are not always the easiest or most efficient to use. In order to more easily crawl and consume a Content Folio, Ed Bryant - Principal Sales Consultant, wrote a component to do just that. His component adds a service which does all the work for you and returns the folio structure as a simple resultset. So consuming and publishing that folio on a Site Studio page or in your portal using RIDC is a breeze! For example, let's take an advanced Content Folio example like this: If we look at the native file, the XML looks like this: But if we access the folio using the new service - http://server/cs/idcplg?IdcService=FOLIO_CRAWL&dDocName=ecm008003&IsPageDebug=1 - this is what the result set looks like (using the IsPageDebug parameter). Given this as the result set, it makes it very easy to consume and repurpose that folio. You can download a copy of the sample component here. Special thanks to Ed for letting me share this component!

Read the article

How to configure QoS on home router

- by Joril

I have a USR9105 router and I'd like to configure its QoS to prioritize web traffic (browser) over everything else (e.g. torrents). I'm confused by its interface though: "IP precedence" allows selecting a number from 0 to 7, while "IP type of service" can be one of "Normal Service", "Minimize cost", "Maximize reliability", "Maximize throughput" and "Minimize delay". How should I set it up? Is QoS the wrong solution to avoiding torrents slowing down browsing to a crawl? Should I set up a proxy and traffic shaping instead?

Read the article

How to configure QoS on home router

- by Joril

I have a USR9105 router and I'd like to configure its QoS to prioritize web traffic (browser) over everything else (e.g. torrents). I'm confused by its interface though: "IP precedence" allows selecting a number from 0 to 7, while "IP type of service" can be one of "Normal Service", "Minimize cost", "Maximize reliability", "Maximize throughput" and "Minimize delay". How should I set it up? Is QoS the wrong solution to avoiding torrents slowing down browsing to a crawl? Should I set up a proxy and traffic shaping instead?

Read the article

SharePoint Server Search Not Crawling

- by tekiegreg

Hi there, we recently moved some sites into a new farm, everything seems to be doing fine, but the search for reasons I can't identify are not crawling the migrated content. We're getting this message in our crawl log for every document: http://xxx/sites/...announcements The object was not found. (The item was deleted because it was either not found or the crawler was denied access to it.) Of course the first thing I suspected was the crawler access account, so I logged into SharePoint with the account and was able to access via that URL just fine. I tried upping permissions (even all the way up to Admin) but to no avail. Thoughts?

Search Results

Search found 446 results on 18 pages for 'crawl'.

Page 11/18 | < Previous Page | 7 8 9 10 11 12 13 14 15 16 17 18 | Next Page >

- by Agile Noob

- by Richard

- by darbour

- by Rico

- by sigpwned

- by csbrooks

- by Jeff Dahmer

- by Vijay

- by David Mullin

- by QF_Developer

- by CSarnia

- by AlamedaDad

- by Watsy91

- by user181276

- by hao

- by Jayhal

- by akaDanPaul

- by pixeline

- by djangofan

- by Noam

- by Cody Sharp

- by Kyle Hatlestad

- by Joril

- by Joril

- by tekiegreg

< Previous Page | 7 8 9 10 11 12 13 14 15 16 17 18 | Next Page >