monitoring

Mission: We have a cluster of 2 devices. From the Nagios server network, we can’t ping both devices at the same time. It will be possible to ping one device for a while and impossible to ping the other one until the contrary happens.
An alert will appear in Nagios if one of the device becomes unreachable but it’s going to be a false alarm as the other device is reachable and the cluster healthy…
We want an alarm only if both devices are unreachable.

Solution:
To do this I used a Nagios plugins called check_multiaddr.
I found it there: http://exchange.nagios.org/directory/Plugins/Others/check_multiaddr/details

How I deployed it:
– I uploaded the file check_multiaddr.pl to my Nagios server
– I copied it into my nagios plugins directory (/usr/lib/nagios/plugins/)
– I made it executable:

#chmod +x check_multiaddr.pl

– I tested the script:
I want to test the ping service (Example with options already set: check_ping -H $HOSTADDRESS$ -w 800.0,20% -c 999.0,60% -p 5)
My devices IP addresses are 10.2.0.2 and 10.2.0.4
So, here is the command I executed:

#./check_multiaddr.pl /usr/lib/nagios/plugins/check_ping -H 10.2.0.2,10.2.0
.4 -w 800.0,20% -c 999.0,60% -p 5

– I got a timeout error:

Timeout detected (9s - you can edit its duration in ./check_multiaddr.pl).

– I edited the check_multiaddr.pl file and changed the TIMEOUT value from 9 to 15 seconds:

my $TIMEOUT = 20;

– I executed the same command and this time it worked:

10.2.0.2: PING OK - Packet loss = 0%, RTA = 0.97 ms|rta=0.970000ms;800.000000;999.000000;0.000000 pl=0%;20;60;0

– Now I had to change the Nagios configuration. I added a new host to represent my cluster. And I typed the 2 IP addresses separated by a comma (where I usually enter 1 IP address)
– I also created 2 new commands based on my existing commands: one to check if the host is alive and the standard ping service.
So, a command which was before: $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
became: $USER1$/check_multiaddr.pl $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
– And of course I assigned this commands to the new host check command and to the ping service for this host.
– Then I reloaded Nagios to make it apply the new configuration
– At first I got timeout errors (again?):

(Host Check Timed Out)

But this time it wasn’t because of the script setting. I had to change the Host Check Timeout in Nagios main configuration file. I changed the value to 15 (it was set to 10). If you’re using Centreon to set the Nagios configuration, the parameter is in Nagios -> nagios.cfg > Logs Options
– I reloaded again and it worked :-)

Again, don’t hesitate to write your questions/comments in French, Portuguese, Italian, Spanish or Romanian.

Just Make It Work

my sysadmin diary

Cluster monitoring with Nagios