Installing a Nagios probe for ESXi

Context

To support the hosting service my company offers, I ended up managing an ESXi server. There is no reason not to monitor it; I would even say monitoring matters more here! It is easy to fall into the classic virtualization trap of loading the server with many VMs, as if its performance grew along with the load... :D

Prerequisites

Installing the required packages

# yum install openssl-devel binutils perl perl-Nagios-Plugin perl-Class-MethodMaker mod_perl libuuid uuid-perl perl-XML-LibXML perl-XML-LibXML-Common

Installing the vSphere SDK for Perl

Download the tarball: VMware-vSphere-Perl-SDK-5.1.0-780721.x86_64.tar.gz

$ tar xvfz VMware-vSphere-Perl-SDK-5.1.0-780721.x86_64.tar.gz

$ cd vmware-vsphere-cli-distrib

Two variables in the installer script (vmware-install.pl) need to be changed so that the SDK installs without trouble. Replace:

my $httpproxy =0;
my $ftpproxy =0;
with:
my $httpproxy =1;
my $ftpproxy =1;
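
If you would rather not edit the file by hand, the same change can be scripted. This is only a sketch: it assumes the two assignments live in vmware-install.pl, one per line, written as in the snippet above. The function name enable_proxy_flags is my own and is not part of the SDK.

```shell
# Flip $httpproxy and $ftpproxy from 0 to 1 in the installer script.
# Assumption: each assignment is on its own line, e.g. "my $httpproxy =0;".
# A backup of the original is kept as <file>.bak.
enable_proxy_flags() {
    installer="${1:-vmware-install.pl}"
    sed -i.bak -E \
        's/^([[:space:]]*my[[:space:]]+\$(httpproxy|ftpproxy)[[:space:]]*=[[:space:]]*)0;/\11;/' \
        "$installer"
}
```

Run enable_proxy_flags from inside vmware-vsphere-cli-distrib; any other variable set to 0 is left untouched.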

# ./vmware-install.pl

UUID problem

If you still hit an unresolved UUID dependency, do the following:

# yum install gcc

$ wget http://search.cpan.org/CPAN/authors/id/C/CF/CFABER/UUID-0.03.tar.gz

$ tar xvfz UUID-0.03.tar.gz

$ cd UUID-0.03

# perl Makefile.PL

# make

# make install

Then run ./vmware-install.pl again.
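
To confirm that the module really is visible to Perl before going back to the installer, you can simply try to load it. A minimal sketch; the helper name perl_module_available is hypothetical, and it only proves that the module loads, not that the installer will be happy with its version.

```shell
# Return 0 if the given Perl module can be loaded, non-zero otherwise.
perl_module_available() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}
```

For example: perl_module_available UUID && echo "UUID is installed"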

Installing the Nagios plugin

Download the plugin here: http://www.op5.org/community/plugin-inventory/op5-projects/check-esx-plugin

$ cd /usr/local/nagios/libexec/

$ wget -O check_vmware_api.pl 'http://git.op5.org/git/?p=nagios/op5plugins.git;a=blob_plain;f=check_vmware_api.pl;hb=HEAD'

# chown nagios:nagios check_vmware_api.pl

# chmod 755 check_vmware_api.pl

Run the command a first time; it prints the following:

$ ./check_vmware_api.pl --help

check_vmware_api.pl 0.7.0

This nagios plugin is free software, and comes with ABSOLUTELY NO WARRANTY.
It may be used, redistributed and/or modified under the terms of the GNU
General Public Licence (see http://www.fsf.org/licensing/licenses/gpl.txt).

VMWare Infrastructure plugin

Usage: check_vmware_api.pl -D <data_center> | -H <host_name> [ -C <cluster_name> ] [ -N <vm_name> ]
    -u <user> -p <pass> | -f <authfile>
    -l <command> [ -s <subcommand> ] [ -T <timeshift> ] [ -i <interval> ]
    [ -x <black_list> ] [ -o <additional_options> ]
    [ -t <timeout> ] [ -w <warn_range> ] [ -c <crit_range> ]
    [ -V ] [ -h ]

 -?, --usage
   Print usage information
 -h, --help
   Print detailed help screen
 -V, --version
   Print version information
 --extra-opts=[section][@file]
   Read options from an ini file. See http://nagiosplugins.org/extra-opts
   for usage and examples.
 -H, --host=<hostname>
   ESX or ESXi hostname.
 -C, --cluster=<clustername>
   ESX or ESXi clustername.
 -D, --datacenter=<DCname>
   Datacenter hostname.
 -N, --name=<vmname>
   Virtual machine name.
 -u, --username=<username>
   Username to connect with.
 -p, --password=<password>
   Password to use with the username.
 -f, --authfile=<path>
   Authentication file with login and password. File syntax :
   username=<login>
   password=<password>
 -w, --warning=THRESHOLD
   Warning threshold. See
   http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
   for the threshold format.
 -c, --critical=THRESHOLD
   Critical threshold. See
   http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
   for the threshold format.
 -l, --command=COMMAND
   Specify command type (CPU, MEM, NET, IO, VMFS, RUNTIME, ...)
 -s, --subcommand=SUBCOMMAND
   Specify subcommand
 -S, --sessionfile=SESSIONFILE
   Specify a filename to store sessions for faster authentication
 -x, --exclude=<black_list>
   Specify black list
 -o, --options=<additional_options>
   Specify additional command options (quickstats, ...)
 -T, --timestamp=<timeshift>
   Timeshift in seconds that could fix issues with "Unknown error". Use values like 5, 10, 20, etc
 -i, --interval=<sampling period>
   Sampling Period in seconds. Basic historic intervals: 300, 1800, 7200 or 86400. See config for any changes.
   Supports literval values to autonegotiate interval value: r - realtime interval, h<number> - historical interval specified by position.
   Default value is 20 (realtime). Since cluster does not have realtime stats interval other than 20(default realtime) is mandatory.
 -M, --maxsamples=<max sample count>
   Maximum number of samples to retrieve. Max sample number is ignored for historic intervals.
   Default value is 1 (latest available sample).
 --trace=<level>
   Set verbosity level of vSphere API request/respond trace
 -t, --timeout=INTEGER
   Seconds before plugin times out (default: 30)
 -v, --verbose
   Show details for command-line debugging (can repeat up to 3 times)
Supported commands(^ - blank or not specified parameter, o - options, T - timeshift value, b - blacklist) :
    VM specific :
        * cpu - shows cpu info
            + usage - CPU usage in percentage
            + usagemhz - CPU usage in MHz
            + wait - CPU wait time in ms
            + ready - CPU ready time in ms
            ^ all cpu info(no thresholds)
        * mem - shows mem info
            + usage - mem usage in percentage
            + usagemb - mem usage in MB
            + swap - swap mem usage in MB
            + swapin - swapin mem usage in MB
            + swapout - swapout mem usage in MB
            + overhead - additional mem used by VM Server in MB
            + overall - overall mem used by VM Server in MB
            + active - active mem usage in MB
            + memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
            ^ all mem info(except overall and no thresholds)
        * net - shows net info
            + usage - overall network usage in KBps(Kilobytes per Second)
            + receive - receive in KBps(Kilobytes per Second)
            + send - send in KBps(Kilobytes per Second)
            ^ all net info(except usage and no thresholds)
        * io - shows disk I/O info
            + usage - overall disk usage in MB/s
            + read - read latency in ms (totalReadLatency.average)
            + write - write latency in ms (totalWriteLatency.average)
            ^ all disk io info(no thresholds)
        * runtime - shows runtime info
            + con - connection state
            + cpu - allocated CPU in MHz
            + mem - allocated mem in MB
            + state - virtual machine state (UP, DOWN, SUSPENDED)
            + status - overall object status (gray/green/red/yellow)
            + consoleconnections - console connections to VM
            + guest - guest OS status, needs VMware Tools
            + tools - VMWare Tools status
            + issues - all issues for the host
            ^ all runtime info(except con and no thresholds)
    Host specific :
        * cpu - shows cpu info
            + usage - CPU usage in percentage
                o quickstats - switch for query either PerfCounter values or Runtime info
            + usagemhz - CPU usage in MHz
                o quickstats - switch for query either PerfCounter values or Runtime info
            ^ all cpu info
                o quickstats - switch for query either PerfCounter values or Runtime info
        * mem - shows mem info
            + usage - mem usage in percentage
                o quickstats - switch for query either PerfCounter values or Runtime info
            + usagemb - mem usage in MB
                o quickstats - switch for query either PerfCounter values or Runtime info
            + swap - swap mem usage in MB
                o listvm - turn on/off output list of swapping VM's
            + overhead - additional mem used by VM Server in MB
            + overall - overall mem used by VM Server in MB
            + memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
                o listvm - turn on/off output list of ballooning VM's
            ^ all mem info(except overall and no thresholds)
        * net - shows net info
            + usage - overall network usage in KBps(Kilobytes per Second)
            + receive - receive in KBps(Kilobytes per Second)
            + send - send in KBps(Kilobytes per Second)
            + nic - makes sure all active NICs are plugged in
            ^ all net info(except usage and no thresholds)
        * io - shows disk io info
            + aborted - aborted commands count
            + resets - bus resets count
            + read - read latency in ms (totalReadLatency.average)
            + write - write latency in ms (totalWriteLatency.average)
            + kernel - kernel latency in ms
            + device - device latency in ms
            + queue - queue latency in ms
            ^ all disk io info
        * vmfs - shows Datastore info
            + (name) - free space info for datastore with name (name)
                o used - output used space instead of free
                o breif - list only alerting volumes
                o regexp - whether to treat name as regexp
                o blacklistregexp - whether to treat blacklist as regexp
                b - blacklist VMFS's
                T (value) - timeshift to detemine if we need to refresh
            ^ all datastore info
                o used - output used space instead of free
                o breif - list only alerting volumes
                o blacklistregexp - whether to treat blacklist as regexp
                b - blacklist VMFS's
                T (value) - timeshift to detemine if we need to refresh
        * runtime - shows runtime info
            + con - connection state
            + health - checks cpu/storage/memory/sensor status
                o listitems - list all available sensors(use for listing purpose only)
                o blackregexpflag - whether to treat blacklist as regexp
                b - blacklist status objects
            + storagehealth - storage status check
                o blackregexpflag - whether to treat blacklist as regexp
                b - blacklist status objects
            + temperature - temperature sensors
                o blackregexpflag - whether to treat blacklist as regexp
                b - blacklist status objects
            + sensor - threshold specified sensor
            + maintenance - shows whether host is in maintenance mode
            + list(vm) - list of VMWare machines and their statuses
            + status - overall object status (gray/green/red/yellow)
            + issues - all issues for the host
                b - blacklist issues
            ^ all runtime info(health, storagehealth, temperature and sensor are represented as one value and no thresholds)
        * service - shows Host service info
            + (names) - check the state of one or several services specified by (names), syntax for (names):<service1>,<service2>,...,<serviceN>
            ^ show all services
        * storage - shows Host storage info
            + adapter - list bus adapters
                b - blacklist adapters
            + lun - list SCSI logical units
                b - blacklist LUN's
            + path - list logical unit paths
                b - blacklist paths
            ^ show all storage info
        * uptime - shows Host uptime
                o quickstats - switch for query either PerfCounter values or Runtime info
        * device - shows Host specific device info
            + cd/dvd - list vm's with attached cd/dvd drives
                o listall - list all available devices(use for listing purpose only)
    DC specific :
        * cpu - shows cpu info
            + usage - CPU usage in percentage
                o quickstats - switch for query either PerfCounter values or Runtime info
            + usagemhz - CPU usage in MHz
                o quickstats - switch for query either PerfCounter values or Runtime info
            ^ all cpu info
                o quickstats - switch for query either PerfCounter values or Runtime info
        * mem - shows mem info
            + usage - mem usage in percentage
                o quickstats - switch for query either PerfCounter values or Runtime info
            + usagemb - mem usage in MB
                o quickstats - switch for query either PerfCounter values or Runtime info
            + swap - swap mem usage in MB
            + overhead - additional mem used by VM Server in MB
            + overall - overall mem used by VM Server in MB
            + memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
            ^ all mem info(except overall and no thresholds)
        * net - shows net info
            + usage - overall network usage in KBps(Kilobytes per Second)
            + receive - receive in KBps(Kilobytes per Second)
            + send - send in KBps(Kilobytes per Second)
            ^ all net info(except usage and no thresholds)
        * io - shows disk io info
            + aborted - aborted commands count
            + resets - bus resets count
            + read - read latency in ms (totalReadLatency.average)
            + write - write latency in ms (totalWriteLatency.average)
            + kernel - kernel latency in ms
            + device - device latency in ms
            + queue - queue latency in ms
            ^ all disk io info
        * vmfs - shows Datastore info
            + (name) - free space info for datastore with name (name)
                o used - output used space instead of free
                o breif - list only alerting volumes
                o regexp - whether to treat name as regexp
                o blacklistregexp - whether to treat blacklist as regexp
                b - blacklist VMFS's
                T (value) - timeshift to detemine if we need to refresh
            ^ all datastore info
                o used - output used space instead of free
                o breif - list only alerting volumes
                o blacklistregexp - whether to treat blacklist as regexp
                b - blacklist VMFS's
                T (value) - timeshift to detemine if we need to refresh
        * runtime - shows runtime info
            + list(vm) - list of VMWare machines and their statuses
            + listhost - list of VMWare esx host servers and their statuses
            + listcluster - list of VMWare clusters and their statuses
            + tools - VMWare Tools status
                b - blacklist VM's
            + status - overall object status (gray/green/red/yellow)
            + issues - all issues for the host
                b - blacklist issues
            ^ all runtime info(except cluster and tools and no thresholds)
        * recommendations - shows recommendations for cluster
            + (name) - recommendations for cluster with name (name)
            ^ all clusters recommendations
    Cluster specific :
        * cpu - shows cpu info
            + usage - CPU usage in percentage
            + usagemhz - CPU usage in MHz
            ^ all cpu info
        * mem - shows mem info
            + usage - mem usage in percentage
            + usagemb - mem usage in MB
            + swap - swap mem usage in MB
                o listvm - turn on/off output list of swapping VM's
            + memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
                o listvm - turn on/off output list of ballooning VM's
            ^ all mem info(plus overhead and no thresholds)
        * cluster - shows cluster services info
            + effectivecpu - total available cpu resources of all hosts within cluster
            + effectivemem - total amount of machine memory of all hosts in the cluster
            + failover - VMWare HA number of failures that can be tolerated
            + cpufainess - fairness of distributed cpu resource allocation
            + memfainess - fairness of distributed mem resource allocation
            ^ only effectivecpu and effectivemem values for cluster services
        * runtime - shows runtime info
            + list(vm) - list of VMWare machines in cluster and their statuses
            + listhost - list of VMWare esx host servers in cluster and their statuses
            + status - overall cluster status (gray/green/red/yellow)
            + issues - all issues for the cluster
                b - blacklist issues
            ^ all cluster runtime info
        * vmfs - shows Datastore info
            + (name) - free space info for datastore with name (name)
                o used - output used space instead of free
                o breif - list only alerting volumes
                o regexp - whether to treat name as regexp
                o blacklistregexp - whether to treat blacklist as regexp
                b - blacklist VMFS's
                T (value) - timeshift to detemine if we need to refresh
            ^ all datastore info
                o used - output used space instead of free
                o breif - list only alerting volumes
                o blacklistregexp - whether to treat blacklist as regexp
                b - blacklist VMFS's
                T (value) - timeshift to detemine if we need to refresh


Copyright (c) 2008 op5


A quick test returns an error like this:
CHECK_VMWARE_API.PL CRITICAL - Server version unavailable at ...
Certificate verification is the problem. If you do not want to pass the certificate as a parameter, use this option:
--no-certificate-checking
or add this at the top of the Perl script:
$ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0;
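
Rather than patching the plugin, the same variable can be set in the environment the plugin is started from, since Perl exposes the process environment as %ENV. This is a sketch under that assumption; the wrapper name is hypothetical, and hostname verification is only turned off for the command it runs.

```shell
# Run any command with LWP's SSL hostname verification disabled.
# Only the wrapped command sees the variable; the calling shell is unchanged.
without_ssl_hostname_check() {
    env PERL_LWP_SSL_VERIFY_HOSTNAME=0 "$@"
}
```

For example: without_ssl_hostname_check ./check_vmware_api.pl -H XXX.XXX.XXX.XXX -u username -p password -l runtime -s status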

Configuring Nagios

We will store the ESXi login credentials in the etc/resource.cfg file, which must not be accessible through the CGIs:

$USER09$=username
$USER10$=password
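
The plugin's help output also lists a -f <authfile> option, which reads the credentials from a file containing username= and password= lines. As an alternative to the resource macros, here is a sketch that writes such a file so that only its owner can read it. The function name and the path in the example are hypothetical.

```shell
# Write an auth file in the format shown in the plugin's help:
#   username=<login>
#   password=<password>
# The file is created with owner-only permissions (mode 600).
write_esx_authfile() {
    authfile="$1"
    esx_user="$2"
    esx_pass="$3"
    (
        umask 077
        printf 'username=%s\npassword=%s\n' "$esx_user" "$esx_pass" > "$authfile"
    )
    chmod 600 "$authfile"
}
```

For example: write_esx_authfile /usr/local/nagios/etc/esx.auth username password, then chown the file to the nagios user and pass -f /usr/local/nagios/etc/esx.auth to the plugin in place of -u and -p.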

Next, define the commands:

# 'check_esx_cpu' command definition

define command{
        command_name check_esx_cpu
        command_line $USER1$/check_vmware_api.pl -H $HOSTADDRESS$ -u $USER09$ -p $USER10$ -l cpu -s usage -w $ARG1$ -c $ARG2$
        }
 
# 'check_esx_mem' command definition
define command{
        command_name check_esx_mem
        command_line $USER1$/check_vmware_api.pl -H $HOSTADDRESS$ -u $USER09$ -p $USER10$ -l mem -s usage -w $ARG1$ -c $ARG2$
        }
 
# 'check_esx_net' command definition
define command{
        command_name check_esx_net
        command_line $USER1$/check_vmware_api.pl -H $HOSTADDRESS$ -u $USER09$ -p $USER10$ -l net -s usage -w $ARG1$ -c $ARG2$
        }
 
# 'check_esx_runtime' command definition
define command{
        command_name check_esx_runtime
        command_line $USER1$/check_vmware_api.pl -H $HOSTADDRESS$ -u $USER09$ -p $USER10$ -l runtime -s status
        }
 
# 'check_esx_ioread' command definition
define command{
        command_name check_esx_ioread
        command_line $USER1$/check_vmware_api.pl -H $HOSTADDRESS$ -u $USER09$ -p $USER10$ -l io -s read -w $ARG1$ -c $ARG2$
        }
 
# 'check_esx_iowrite' command definition
define command{
        command_name check_esx_iowrite
        command_line $USER1$/check_vmware_api.pl -H $HOSTADDRESS$ -u $USER09$ -p $USER10$ -l io -s write -w $ARG1$ -c $ARG2$
        }
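
Each of these command lines can be tried by hand first. Nagios derives the service state from the plugin's exit status, following the standard plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The small helper below, a hypothetical convenience that is not part of the plugin, turns that status into a readable name.

```shell
# Map a Nagios plugin exit status to its state name.
nagios_state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo OUT_OF_RANGE ;;  # anything else is outside the plugin convention
    esac
}
```

For example: ./check_vmware_api.pl -H XXX.XXX.XXX.XXX -u username -p password -l cpu -s usage -w 80 -c 90; nagios_state_name $?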

Then the usual host configuration:

define host{
        use                     generic-host
        host_name               myesx1
        alias                   myesx1
        address                 XXX.XXX.XXX.XXX
        }

And the service definitions:

define service{
        use                                  generic-service
        host_name                        myesx1
        service_description            ESXi CPU Load
        check_command                check_esx_cpu!80!90
        }
 
define service{
        use                                  generic-service
        host_name                        myesx1
        service_description            ESXi Memory usage
        check_command                check_esx_mem!80!90
        }
 
define service{
        use                                  generic-service
        host_name                        myesx1
        service_description            ESXi Network usage
        check_command                check_esx_net!102400!204800
        }
 
define service{
        use                                  generic-service
        host_name                        myesx1
        service_description            ESXi Runtime status
        check_command                check_esx_runtime
        }
 
define service{
        use                                 generic-service
        host_name                       myesx1
        service_description           ESXi IO read
        check_command               check_esx_ioread!40!90
        }
 
define service{
        use                                 generic-service
        host_name                       myesx1
        service_description           ESXi IO write
        check_command               check_esx_iowrite!40!90
        }
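
Once the host and services are in place, Nagios can check the whole configuration with its -v option before you restart it. The sketch assumes a source install under /usr/local/nagios, consistent with the libexec path used earlier; adjust both paths if yours differ. The function name is hypothetical.

```shell
# Run the Nagios pre-flight check on the main configuration file.
# Both default paths are assumptions based on a /usr/local/nagios install.
verify_nagios_config() {
    nagios_bin="${1:-/usr/local/nagios/bin/nagios}"
    nagios_cfg="${2:-/usr/local/nagios/etc/nagios.cfg}"
    "$nagios_bin" -v "$nagios_cfg"
}
```

The function returns the exit status of the check, so something like verify_nagios_config && echo "configuration OK" can gate the restart.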

Conclusion

There you go: you now have basic monitoring for your ESX server! For finer-grained monitoring, have a look at this documentation: http://www.op5.com/how-to/monitoring-vmware-esx-3-x-esxi-vsphere-4-and-vcenter-server

Published by Slobberbone