opsview - Developer IT

Opsview Notifications, how to report event duration

- by dotwaffle

At present, Opsview reports recoveries in the following format:

    RECOVERY: Internal Alarm is OK on host Ellie: SNMP OK - 0

    Service: Internal Alarm
    Host: Ellie
    Alias: Ellie
    Address: 1.2.3.4
    State: OK
    Comment: ()
    Date/Time: Mon Oct 5 14:57:53 BST 2009
    Additional Info: SNMP OK - 0

What I would ideally like to do is add a "duration" field, so that you can tell without scrolling back on a Blackberry how long the event has been at fault for. Is there an easy solution to this?
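As a rough illustration of what such a "duration" field could contain, here is a minimal Python sketch that turns two Unix timestamps (when the problem started and when it recovered) into a short human-readable string. The function name format_duration is hypothetical, and how a notification script would obtain the two timestamps from Opsview is an assumption that this sketch does not cover.

```python
def format_duration(start_epoch, end_epoch):
    """Return the elapsed time between two Unix timestamps as 'Xd Xh Xm Xs'.

    Hypothetical helper: where start_epoch and end_epoch come from
    (for example, values passed to a notification script) is assumed.
    """
    # Clamp to zero so that clock skew never produces a negative duration.
    seconds = max(0, int(end_epoch) - int(start_epoch))
    days, remainder = divmod(seconds, 86400)
    hours, remainder = divmod(remainder, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    # A problem that lasted one hour, two minutes and five seconds.
    print("Duration:", format_duration(0, 3725))
```

The resulting string could be appended to the notification body as one more "Duration:" line alongside the existing fields.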

Use 3rd party tools with opsview

- by hoberion

How can I use third-party tools like Nagstamon (Nagios monitoring) or the iPhone tools with Opsview? There seems to be no cgi-bin anymore (you get redirected to status/hostgroup).

Monitoring multiple sites on a single server using OpsView

- by Kev

We have several web servers. On each of these servers there can be ~250 web sites. I need to add an HTTP check for each site on each server. Each site has a reserved host header that we know can always be resolved, in the format:

    w10000.hostchecks.mycompany.com
    w10020.hostchecks.mycompany.com
    w11992.hostchecks.mycompany.com
    ...and so on...

What I want is a master ping check on the web server's main IP address and then separate HTTP checks for each of the sites on the server. If the master ping check fails, I want the HTTP checks to cease until the master ping check goes back to OK.

I had a stab at this and tried to do the following:

- Create a parent host that does a ping check on the server's main IP address (e.g. the server is named WEB0001).
- For each of the sites that reside on WEB0001:
  - Create a separate Host with a Primary Hostname of wXXXXX.hostchecks.mycompany.com
  - Make WEB0001 the parent host
  - Add a monitor (an HTTP check to a special URL that is mapped into each site using a virtual directory):

        -H $HOSTADDRESS$ -u /__hostcheck/IsAlive.aspx -w 5 -c 10 -p 80

However, I find that if I down the parent server (WEB0001), the HTTP checks seem to continue. Am I going about this completely the wrong way?

check_snmp warning & critical thresholds with negative values

- by Oesor

I'm querying some signal level values measured in dBm, and the SNMP agent on the remote device reports the values as negative numbers, e.g. -90 dBm. However, check_snmp seems to be incapable of dealing with negative numbers as part of its threshold values.

If I specify the values as part of a collection of OIDs, it accepts the syntax but converts the SNMP value to positive, thus always generating a WARNING/CRITICAL result:

    root@ops-00:/usr/local/nagios/libexec# ./check_snmp -H 192.168.1.100 -o DEVICE-MIB::AverageReceiveSNR.0,DEVICE-MIB::CurrentNoiseFloor.0 -w 10:,~:-85 -c 15:,~:-80 -vvvv
    /usr/bin/snmpget -t 1 -r 5 -m ALL -v 1 [authpriv] 192.168.1.100:161 DEVICE-MIB::AverageReceiveSNR.0 DEVICE-MIB::CurrentNoiseFloor.0
    DEVICE-MIB::AverageReceiveSNR.0 = INTEGER: 25
    DEVICE-MIB::CurrentNoiseFloor.0 = INTEGER: -97
    Processing line 1
      oidname: DEVICE-MIB::AverageReceiveSNR.0
      response: = INTEGER: 25
    Processing line 2
      oidname: DEVICE-MIB::CurrentNoiseFloor.0
      response: = INTEGER: -97
    SNMP CRITICAL - 25 *97* | DEVICE-MIB::AverageReceiveSNR.0=25 DEVICE-MIB::CurrentNoiseFloor.0=97

If I run it with a single OID, it gives me an error that the format is incorrect:

    root@ops-00:/usr/local/nagios/libexec# ./check_snmp -H 192.168.1.100 -o DEVICE-MIB::CurrentNoiseFloor.0 -w ~:-85 -c ~:-80 -vvvv
    Range format incorrect

And if I run it with no thresholds defined, it works properly and returns the right value. This makes the graphs correct, but it will never generate a notification when the value is out of range:

    root@ops-00:/usr/local/nagios/libexec# ./check_snmp -H 192.168.1.100 -o DEVICE-MIB::CurrentNoiseFloor.0 -vvvv
    /usr/bin/snmpget -t 1 -r 5 -m ALL -v 1 [authpriv] 192.168.1.100:161 DEVICE-MIB::CurrentNoiseFloor.0
    DEVICE-MIB::CurrentNoiseFloor.0 = INTEGER: -97
    Processing line 1
      oidname: DEVICE-MIB::CurrentNoiseFloor.0
      response: = INTEGER: -97
    SNMP OK - -97 | DEVICE-MIB::CurrentNoiseFloor.0=-97

What am I doing wrong here? How would I, for example, generate a CRITICAL when the noise floor is -80 dBm or higher, a WARNING when it's -85 to -80 dBm, and an OK when -85 dBm or lower? Do I have to write my own SNMP plugin when dealing with negative values?
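To make the intended threshold semantics concrete, here is a minimal Python sketch of the range format described in the Nagios plugin development guidelines ("start:end", "~" for negative infinity, a leading "@" to alert inside the range instead of outside it). It is not check_snmp's own parser and says nothing about why check_snmp rejects these values; the function names parse_range, alerts and evaluate are hypothetical.

```python
import math


def parse_range(spec):
    """Parse a Nagios-style range such as '10', '10:', '~:-85' or '@10:20'.

    Returns (start, end, inside), where inside=True means the alert
    fires when the value falls inside the range rather than outside it.
    """
    inside = spec.startswith("@")
    if inside:
        spec = spec[1:]
    if ":" in spec:
        start_text, end_text = spec.split(":", 1)
    else:
        # A bare number N means the range 0..N.
        start_text, end_text = "0", spec
    if start_text == "~":
        start = -math.inf
    elif start_text == "":
        start = 0.0
    else:
        start = float(start_text)
    end = math.inf if end_text == "" else float(end_text)
    if start > end:
        raise ValueError("range start must not exceed range end")
    return start, end, inside


def alerts(spec, value):
    """Return True if value should raise an alert for this range."""
    start, end, inside = parse_range(spec)
    in_range = start <= value <= end
    return in_range if inside else not in_range


def evaluate(value, warning, critical):
    """Map a value to OK/WARNING/CRITICAL; critical takes precedence."""
    if alerts(critical, value):
        return "CRITICAL"
    if alerts(warning, value):
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    for dbm in (-97, -82, -75):
        print(dbm, evaluate(dbm, "~:-85", "~:-80"))
```

Under these semantics the endpoints are inclusive, so with -w ~:-85 -c ~:-80 a reading of exactly -80 dBm is a WARNING rather than a CRITICAL, and exactly -85 dBm is OK.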

Cause of flapping UNKNOWN Nagios status?

- by jldugger

We run some Nagios service checks via OpsView, and one of our hosts is getting a strange response for SSH:

    UNKNOWN: Service results are stale

It happens regularly, but seems to go away as the system retries a second and third time. It started after a patch and reboot of the server in question last week. The system itself responds to SSH from the boxes I've tested with (which doesn't include the monitoring system, since I am not given access to it). /var/log/secure is full of lines like:

    sshd[15628]: Did not receive identification string from xxx.xxx.226.20

The timestamps are reliably five minutes apart, so this is pretty obviously the monitoring script disconnecting once it gets a login prompt. Does anyone know what might be causing this, or how to fix it? It's really frustrating to see this pop on and off the status page.

SNMP keeps crashing

- by jldugger

We're using OpsView/Nagios to monitor our servers. We've added the SNMP service to all our servers and deployed the configuration via GPO, but one win2k3 server seems to have a problem: the SNMP service crashes pretty regularly. The event log carries messages like:

    Event Type: Error
    Event Source: Service Control Manager
    Event Category: None
    Event ID: 7034
    Date: 6/11/2009
    Time: 7:11:49 PM
    User: N/A
    Computer: HOSTNAME
    Description: The SNMP Service service terminated unexpectedly. It has done this 2 time(s).

and also:

    Event Type: Error
    Event Source: Application Error
    Event Category: (100)
    Event ID: 1000
    Date: 6/11/2009
    Time: 7:11:18 PM
    User: N/A
    Computer: HOSTNAME
    Description: Faulting application snmp.exe, version 5.2.3790.3959, faulting module ntdll.dll, version 5.2.3790.3959, fault address 0x000417af.

Now, I could probably set it to simply restart on crash in perpetuity, but I think it's better to fix problems like this. Is this a known problem? If not, what should I do to diagnose it?

WmiPrvSE memory leak on Windows 2008 R2

- by MichaelGG

I've seen references to WmiPrvSE leaks on Windows 2008, but nothing about Windows 2008 R2. We're running R2 on top of Hyper-V (2008). We are also running NSClient++ for monitoring from Opsview. Over time, WmiPrvSE.exe starts to use a lot of memory, causing memory alerts (less than 10% free). The VM has 2GB, and WmiPrvSE consumes up to 500-600MB before I kill it. Killing the process doesn't seem to have any negative effect; it starts up again and I haven't noticed any problems. But after a day or two, it's back in the same situation. Any ideas on what to do? Resource Monitor doesn't show any disk or network IO by WmiPrvSE.exe, just slowly climbing private memory. Edited to add: We aren't running clustering or Windows System Resource Manager. The only regular WMI user I can think of is NSClient++, but we don't seem to have this problem on other servers.