nrpe - Page 2 - Developer IT

coordinating a script to run on only one of identical load-balanced servers

- by Amos Shapira

I have two identically configured CentOS 5 servers (possibly more in the future). I need to run a cron job on any one of them and that it'll run only on one of them. I know about RedHat Cluster Suite (we use it on other servers), but it's a too big a gun to use for this task, plus it doesn't really behave well for less than three nodes. Is there anything light-weight I can use for that? The servers can communicate with each other directly. I suppose I can develope something over ssh or nrpe (two server which are already installed on these servers), but I was wondering whether there is something already available.

Read the article

nagios: trouble using check_smtps command

- by ethrbunny

I'm trying to use this command to check on port 587 for my postfix server. Using nmap -P0 mail.server.com I see this: Starting Nmap 5.51 ( http://nmap.org ) at 2013-11-04 05:01 PST Nmap scan report for mail.server.com (xx.xx.xx.xx) Host is up (0.0016s latency). rDNS record for xx.xx.xx.xx: another.server.com Not shown: 990 closed ports PORT STATE SERVICE 22/tcp open ssh 25/tcp open smtp 110/tcp open pop3 111/tcp open rpcbind 143/tcp open imap 465/tcp open smtps 587/tcp open submission 993/tcp open imaps 995/tcp open pop3s 5666/tcp open nrpe So I know the relevant ports for smtps (465 or 587) are open. When I use openssl s_client -connect mail.server.com:587 -starttls smtp I get a connection with all the various SSL info. (Same for port 465). But when I try libexec/check_ssmtp -H mail.server.com -p587 I get: CRITICAL - Cannot make SSL connection. 140200102082408:error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol:s23_clnt.c:699: What am I doing wrong?

Read the article

High memory usage on the server - can't determine the process

- by HTF

I've noticed high memory usage on the server. Details: OS: CentOS 6.3 - x86_64 Web server: Nginx with PHP-FPM The server is generating PDF documents so the traffic is minimum. top: # top -b -n 1 -a top - 10:04:51 up 21 days, 18:57, 1 user, load average: 0.00, 0.00, 0.00 Tasks: 92 total, 1 running, 91 sleeping, 0 stopped, 0 zombie Cpu(s): 0.3%us, 0.2%sy, 0.0%ni, 99.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 3923092k total, 3720380k used, 202712k free, 133904k buffers Swap: 4194296k total, 12k used, 4194284k free, 147404k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 15855 www-data 20 0 199m 4952 2128 S 0.0 0.1 0:00.06 php-fpm 15853 www-data 20 0 199m 4940 2028 S 0.0 0.1 0:00.06 php-fpm 15850 www-data 20 0 199m 4928 2020 S 0.0 0.1 0:00.05 php-fpm 15851 www-data 20 0 199m 4888 2020 S 0.0 0.1 0:00.06 php-fpm 15852 www-data 20 0 199m 4852 2020 S 0.0 0.1 0:00.06 php-fpm 15857 www-data 20 0 198m 4716 2020 S 0.0 0.1 0:00.06 php-fpm 17553 root 20 0 97816 3860 2924 S 0.0 0.1 0:00.03 sshd 15849 root 20 0 198m 3460 1072 S 0.0 0.1 0:00.12 php-fpm 13441 nginx 20 0 65608 2968 1604 S 0.0 0.1 0:02.06 nginx 13440 nginx 20 0 65608 2964 1600 S 0.0 0.1 0:01.87 nginx 17561 root 20 0 105m 1944 1488 S 0.0 0.0 0:00.01 bash 1150 xfs 20 0 20980 1784 704 S 0.0 0.0 0:00.13 xfs 15863 root 20 0 179m 1424 1028 S 0.0 0.0 0:00.00 rsyslogd 1 root 20 0 19224 1360 1088 S 0.0 0.0 0:17.96 init 1201 nrpe 20 0 40928 1288 704 S 0.0 0.0 3:57.64 nrpe 13226 root 20 0 114m 1216 612 S 0.0 0.0 0:00.01 crond 6691 root 20 0 64068 1156 488 S 0.0 0.0 0:09.59 sshd 13439 root 20 0 65104 1128 292 S 0.0 0.0 0:00.00 nginx 19026 root 20 0 15040 1116 844 R 0.0 0.0 0:00.00 top 451 root 16 -4 11052 1096 316 S 0.0 0.0 0:00.02 udevd 1174 root 18 -2 11048 1064 288 S 0.0 0.0 0:00.00 udevd 1175 root 18 -2 11048 1064 288 S 0.0 0.0 0:00.00 udevd 1065 root 16 -4 93168 824 560 S 0.0 0.0 0:16.00 auditd 1165 root 20 0 4056 564 480 S 0.0 0.0 0:00.00 mingetty 1167 root 20 0 4056 564 480 S 0.0 0.0 0:00.00 mingetty 1169 root 20 0 4056 564 480 S 0.0 0.0 0:00.00 mingetty 1171 root 20 0 4056 564 480 S 0.0 0.0 0:00.00 mingetty 1163 root 20 0 4056 560 480 S 0.0 0.0 0:00.00 mingetty 1176 root 20 0 4056 560 480 S 0.0 0.0 0:00.00 mingetty 2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd 3 root RT 0 0 0 0 S 0.0 0.0 0:11.75 migration/0 4 root 20 0 0 0 0 S 0.0 0.0 44:30.28 ksoftirqd/0 5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0 6 root RT 0 0 0 0 S 0.0 0.0 0:03.51 watchdog/0 7 root RT 0 0 0 0 S 0.0 0.0 0:11.63 migration/1 8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1 9 root 20 0 0 0 0 S 0.0 0.0 11:35.50 ksoftirqd/1 10 root RT 0 0 0 0 S 0.0 0.0 0:03.34 watchdog/1 11 root 20 0 0 0 0 S 0.0 0.0 1:36.68 events/0 12 root 20 0 0 0 0 S 0.0 0.0 1:50.57 events/1 13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cgroup 14 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khelper 15 root 20 0 0 0 0 S 0.0 0.0 0:00.00 netns 16 root 20 0 0 0 0 S 0.0 0.0 0:00.00 async/mgr 17 root 20 0 0 0 0 S 0.0 0.0 0:00.00 pm 18 root 20 0 0 0 0 S 0.0 0.0 0:07.86 sync_supers 19 root 20 0 0 0 0 S 0.0 0.0 0:10.38 bdi-default 20 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kintegrityd/0 21 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kintegrityd/1 22 root 20 0 0 0 0 S 0.0 0.0 0:04.35 kblockd/0 23 root 20 0 0 0 0 S 0.0 0.0 0:04.18 kblockd/1 24 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kacpid 25 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kacpi_notify 26 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kacpi_hotplug 27 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ata/0 28 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ata/1 29 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ata_aux 30 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksuspend_usbd 31 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khubd 32 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kseriod 33 root 20 0 0 0 0 S 0.0 0.0 0:00.00 md/0 34 root 20 0 0 0 0 S 0.0 0.0 0:00.00 md/1 35 root 20 0 0 0 0 S 0.0 0.0 0:00.00 md_misc/0 36 root 20 0 0 0 0 S 0.0 0.0 0:00.00 md_misc/1 37 root 20 0 0 0 0 S 0.0 0.0 0:00.48 khungtaskd 38 root 20 0 0 0 0 S 0.0 0.0 1:07.52 kswapd0 39 root 25 5 0 0 0 S 0.0 0.0 0:00.00 ksmd 40 root 39 19 0 0 0 S 0.0 0.0 0:22.00 khugepaged 41 root 20 0 0 0 0 S 0.0 0.0 0:00.00 aio/0 42 root 20 0 0 0 0 S 0.0 0.0 0:00.00 aio/1 43 root 20 0 0 0 0 S 0.0 0.0 0:00.00 crypto/0 44 root 20 0 0 0 0 S 0.0 0.0 0:00.00 crypto/1 49 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthrotld/0 50 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthrotld/1 52 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kpsmoused 53 root 20 0 0 0 0 S 0.0 0.0 0:00.00 usbhid_resumer 83 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kstriped 233 root 20 0 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_0 234 root 20 0 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_1 321 root 20 0 0 0 0 S 0.0 0.0 0:00.00 virtio-blk 359 root 20 0 0 0 0 S 0.0 0.0 0:03.24 kdmflush 360 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kdmflush 380 root 20 0 0 0 0 S 0.0 0.0 0:20.64 jbd2/dm-0-8 381 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ext4-dio-unwrit 382 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ext4-dio-unwrit 694 root 20 0 0 0 0 S 0.0 0.0 0:00.00 vballoon 697 root 20 0 0 0 0 S 0.0 0.0 0:00.00 virtio-net 818 root 20 0 0 0 0 S 0.0 0.0 0:00.00 jbd2/vda1-8 819 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ext4-dio-unwrit 820 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ext4-dio-unwrit 851 root 20 0 0 0 0 S 0.0 0.0 0:06.96 kauditd 1013 root 20 0 0 0 0 S 0.0 0.0 0:15.45 flush-253:0 ps: # ps aux --sort -vsz | head USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND www-data 13213 0.0 0.1 204416 4772 ? S 08:28 0:00 php-fpm: pool default www-data 13214 0.0 0.1 204416 4776 ? S 08:28 0:00 php-fpm: pool default www-data 13215 0.0 0.1 204416 4832 ? S 08:28 0:00 php-fpm: pool default www-data 13216 0.0 0.1 204416 4776 ? S 08:28 0:00 php-fpm: pool default www-data 13218 0.0 0.1 204416 4956 ? S 08:28 0:00 php-fpm: pool default free: #free -m total used free shared buffers cached Mem: 3831 3530 300 0 130 143 -/+ buffers/cache: 3256 574 Swap: 4095 0 4095 When I stooped Nginx, PHP-FPM the memory usage was still the same. Could you help me to investigate what is consuming the memory on the system? Regards

Read the article

How to collect the performance data of a server during an unreachable/down period using Nagios?

- by gsc-frank

Some time services and host stop responding due to a poor server performance. I mean, if for some reason (could be lot of concurrency services access, a expensive backup execution on the server or whatever that consume tons of server resources) a server performance is very degraded, that could lead that the server isn't capable to establish any "normal network communication" (without trigger whatever standards timeouts defined for such communication). Knowing host's performance data (cpu, memory, ...) in case of available during that period (host is not down and despite of its performance degradation still allow plugins collect performance data) could be very useful for sysadmin to try to determine what cause the problem, or at least, if the host performance was good and don't interfered at all in the host/service down. This problem could be solved using remote active (NRPE) or remote passive (NSCA) if such remote solutions could store (buffered) perf data to be send to central Nagios server when host performance or network outage allow it. I read the doc of both solutions and can't find any reference to such buffer mechanism neither what happened in case that NSCA can't reach Nagios server. Any idea of how solve this lack of info? so useful for forensic analysis. EDIT: My questions isn about which tools I can use to debug perf problems or gather perf data to analysis, but is about how collect (using Nagios) host perf data even during a network outage for its posterior analysis (kind of forensic analysis). The idea is integrate such data to Nagios graphers like pnp4nagios and NagiosGrapther. I know that I could install tools like Cacti in each of my host, and have a kind of performance data collection redundancy, but I really want avoid that and try to solve all perf analysis requirements with one tools: Nagios

Read the article

Nagios DNX plugins

- by danneh3826

I'm toying with the idea of multiple Nagios instances setup to monitor our infrastructure. I've looked at all the various methods of distributed Nagios checks, and I think DNX comes out the closest. DNX handles failure of worker nodes, that's fine. What happens if the main DNX server fails though? Is there a way to replicate the server too? I'm using AWS EC2 primarily, so I can utilise Elastic Load Balancing for the web UI, but I need to be able to handle the AZ where the monitoring server is to fail over, and essentially for a second to pick up the checking load (active/passive, active/active, so long as it doesn't fail completely) The other thing I'm trying to solve is an issue with routing. What I'd like is to have multiple nodes report a fault before Nagios confirms it as critical. Not the NRPE checks, as they're pretty self explanitory, but things more like check_ping. I often have routing issues out of AWS to certain datacenters, so Nagios can often report bad/no ping/timeout as a critical issue, even though the machine in question is working fine. Would it be possible to have a setup where a worker complains a service check is critical, and have a second worker node (positioned in another datacenter/AZ) also report the service as critical before the Nagios central server issues a critical alert? I realise I might be asking a bit much (how far down the line do you go setting up failover systems before it starts to get ridiculous), however surely someone must have thought of this scenario when developing DNX?

Read the article

Monitoring Between EC2 Regions

- by ABrown

I'm working on a small EC2 project that involves a handful of servers in two different regions (US East and EU West). My first task is to implement a Nagios monitoring solution. Monitoring within a region is simple - I just use the private domain names/IPs, but I'm a little unsure of the best way to handle monitoring the second region without setting up a second Nagios install. The environment is fairly static, so I'm not going to be scripting the configuration with the EC2 tools just yet. As I see it, I have two options. Two Nagios installations (which is over-kill for the small number of servers I'm dealing with). Pros: I don't have to alter the group permissions nor do I have to pay for the traffic, redundancy in the monitoring solution - I could monitor the Nagios servers. Cons: two installations to deal with and I'd need to run another server instance. Have the single installation monitor both regions. Pros: one installation to deal with. Cons: slightly reduced security - security group will have to have NRPE (5666) opened for one source IP and also paying for a small amount of bandwidth at the Internet rate for data transfer between the regions. I guess my question is - how have others handled this problem and what are your recommendations? Thanks!

Read the article

Nagios plugin script not working as expected

- by Linker3000

I have modified an off-the-shelf Nagios plugin perl script to (in theory) return a one or zero according to the existence, or not, of a file on a remote linux server. The script runs a remote ssh session and logs in as the nagios user. The remote linux servers have private keys setup for that user, and on the bash command line the script works as expected, but when run as a plugin it always returns '1' (true) even if the file does not exist. Some help with the logic or a comment on why things are not working as expected within Nagios would be appreciated. I'd prefer to use this ssh login method rather than having to install nrpe on all the linux servers. To run from a command line (assuming remote server has a user called nagios with a valid private key): ./check_reboot_required -e ssh -H remote-servers-ip-addr -p 'filename-to-check' -v Ta. #! /usr/bin/perl -w # # # License Information: # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. # ############################################################################ use POSIX; use strict; use Getopt::Long; use lib "/usr/lib/nagios/plugins" ; use vars qw($host $opt_V $opt_h $opt_v $verbose $PROGNAME $pattern $opt_p $mmin $opt_e $opt_t $opt_H $status $state $msg $msg_q $MAILQ $SHELL $device $used $avail $percent $fs $blocks $CMD $RMTOS); use utils qw(%ERRORS &print_revision &support &usage ); sub print_help (); sub print_usage (); sub process_arguments (); $ENV{'PATH'}=''; $ENV{'BASH_ENV'}=''; $ENV{'ENV'}=''; $PROGNAME = "check_reboot_required"; Getopt::Long::Configure('bundling'); $status = process_arguments(); if ($status){ print "ERROR: processing arguments\n"; exit $ERRORS{'UNKNOWN'}; } $SIG{'ALRM'} = sub { print ("ERROR: timed out waiting for $CMD on $host\n"); exit $ERRORS{'WARNING'}; }; $host = $opt_H; $pattern = $opt_p; print "Pattern >" . $pattern . "< " if $verbose; alarm($opt_t); #$CMD = "/usr/bin/find " . $pattern . " -type f 2>/dev/null| /usr/bin/wc -l"; $CMD = "[ -f " . $pattern . " ] && echo 1 || echo 0"; alarm($opt_t); ## get cmd output from remote system if (! open (OUTPUT, "$SHELL $host $CMD|" ) ) { print "ERROR: could not open $CMD on $host\n"; exit $ERRORS{'UNKNOWN'}; } my $perfdata = ""; my $state = "3"; my $msg = "Indeterminate result"; # only first line is relevant in this iteration. while (<OUTPUT>) { my $result = chomp($_); $msg = $result; print "Shell returned >" . $result . "< length is " . length($result) . " " if $verbose; if ( $result == 1 ) { $msg = "Reboot required (NB: Result still not accurate)" . $result ; $state = $ERRORS{'WARNING'}; last; } elsif ( $result == 0 ) { $msg = "No reboot required (NB: Result still not accurate) " . $result ; $state = $ERRORS{'OK'}; last; } else { $msg = "Output received, but it was neither a 1 nor a 0" ; last; } } close (OUTPUT); print "$msg | $perfdata\n"; exit $state; ##################################### #### subs sub process_arguments(){ GetOptions ("V" => \$opt_V, "version" => \$opt_V, "v" => \$opt_v, "verbose" => \$opt_v, "h" => \$opt_h, "help" => \$opt_h, "e=s" => \$opt_e, "shell=s" => \$opt_e, "p=s" => \$opt_p, "pattern=s" => \$opt_p, "t=i" => \$opt_t, "timeout=i" => \$opt_t, "H=s" => \$opt_H, "hostname=s" => \$opt_H ); if ($opt_V) { print_revision($PROGNAME,'$Revision: 1.0 $ '); exit $ERRORS{'OK'}; } if ($opt_h) { print_help(); exit $ERRORS{'OK'}; } if (defined $opt_v ){ $verbose = $opt_v; } if (defined $opt_e ){ if ( $opt_e eq "ssh" ) { if (-x "/usr/local/bin/ssh") { $SHELL = "/usr/local/bin/ssh"; } elsif ( -x "/usr/bin/ssh" ) { $SHELL = "/usr/bin/ssh"; } else { print_usage(); exit $ERRORS{'UNKNOWN'}; } } elsif ( $opt_e eq "rsh" ) { $SHELL = "/usr/bin/rsh"; } else { print_usage(); exit $ERRORS{'UNKNOWN'}; } } else { print_usage(); exit $ERRORS{'UNKNOWN'}; } unless (defined $opt_t) { $opt_t = $utils::TIMEOUT ; # default timeout } unless (defined $opt_H) { print_usage(); exit $ERRORS{'UNKNOWN'}; } return $ERRORS{'OK'}; } sub print_usage () { print "Usage: $PROGNAME -e <shell> -H <hostname> -p <directory/file pattern> [-t <timeout>] [-v verbose]\n"; } sub print_help () { print_revision($PROGNAME,'$Revision: 0.1 $'); print "\n"; print_usage(); print "\n"; print " Checks for the presence of a 'reboot-required' file on a remote host via SSH or RSH\n"; print "-e (--shell) = ssh or rsh (required)\n"; print "-H (--hostname) = remote server name (required)"; print "-p (--pattern) = File pattern for find command (default = /var/run/reboot-required)\n"; print "-t (--timeout) = Plugin timeout in seconds (default = $utils::TIMEOUT)\n"; print "-h (--help)\n"; print "-V (--version)\n"; print "-v (--verbose) = debugging output\n"; print "\n\n"; support(); }

Developer IT