Distributed monitoring with Nagios and Puppet

In the past I had a single Nagios3 server monitoring all my production servers. The configuration was
fully generated by Puppet with Naginator.
Even with its drawbacks (hard to set specific alert thresholds, appliances without Puppet, etc.),
this solution is very powerful: I never had to think about the monitoring configuration, because
thanks to Puppet I was always sure that every host in production was monitored by Nagios.
However, my needs evolved and I began to run into distributed monitoring problems:
four datacenters spread between Europe and the USA, with network outages between datacenters
raising a lot of false positive alerts.
I did not have any performance issues, as I have fewer than 200 hosts and 2K services.

I tried Shinken, really I tried: two years ago, and again these last few months.
I had to package it as a Debian package because all of our servers are
built unattended: the installation script was not an option for me.

On paper, Shinken was perfect:
* fully compatible with the Nagios configuration
* support for Shinken-specific parameters in Puppet (ie: poller_tag)
* built-in support for distributed monitoring with realms and poller_tag
* built-in support for HA
* built-in support for Livestatus
* very nice community and developers

In my experience:
* the configuration was not fully compatible (but adjustments were easy)
* Shinken uses a lot more RAM than Nagios (even if Jean Gabès took the time to write me a very long mail explaining this behavior)
* most important to me: the whole stack was, IMHO, not stable and robust enough for my use case. After a netsplit, daemons did not resynchronize; some modules crashed quite often without explanation; there were some problems with Pyro; etc.

In the end I was not confident enough in my monitoring POC, and I chose not to put it in production.

To be clear:
* I still believe that Shinken will, in the (near?) future, be one solution (or THE solution) to replace the old Nagios, but it was not ready
for my needs
* Some people are running Shinken in production, some on very big setups, without any problem. My experience should not
convince you not to try this product! You need to form your own opinion!

Having decided not to use Shinken, I had to find another solution.

I chose this architecture:
* one Nagios per datacenter for polling
* Puppet to manage the whole distributed configuration (it plays the role of the arbiter in Shinken)
* Livestatus + Check_MK Multisite to aggregate the monitoring views from all datacenters

Puppet tricks

We use a lot of custom Facts in Puppet, and we have a Fact “$::hosting”
which tells us in which datacenter a host is located.
In order to split our monitoring configuration between pollers, I use a dynamic target for all Puppet resources bound to a datacenter (hosts, services, hostescalation, serviceescalation):

Here is a simplified example of a host configuration as an Exported Resource:

        $puppet_conf_directory = '/etc/nagios3/conf.puppet.d'
        $host_directory        = "${puppet_conf_directory}/${::hosting}"

        @@nagios_host { $::fqdn:
            tag     => 'nagios',
            address => $::ipaddress_private,
            alias   => $::hostname,
            use     => 'generic-host',
            notify  => Service['nagios3'],
            target  => "${host_directory}/${::fqdn}-host.cfg",
        }
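On each poller, the exported resources then have to be collected. The post does not show that side, so here is a minimal sketch using the standard Puppet `<<| |>>` collector, matching the `tag => 'nagios'` used in the export above:

```puppet
# On every Nagios poller: collect ALL exported hosts and services,
# regardless of datacenter. Each resource is written to its own
# target file under /etc/nagios3/conf.puppet.d/<hosting>/ ; only
# nagios.cfg decides which directories a given poller actually reads.
Nagios_host    <<| tag == 'nagios' |>>
Nagios_service <<| tag == 'nagios' |>>
```

Since the collection is not filtered by datacenter, every poller materializes the full configuration tree; the selection happens later, in nagios.cfg.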

All resources common to every poller (contacts, contactgroups,
commands, timeperiods, etc.) are generated in one common directory
that all Nagios pollers read (ie: ‘/etc/nagios3/conf.puppet.d/common’).
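For illustration, such a common resource can simply point its target at that shared directory. A sketch (this particular check_ssh command definition is an assumption, not taken from the actual manifests):

```puppet
# Exported command, identical on every poller: it is written into
# the 'common' directory that every poller's nagios.cfg sources.
@@nagios_command { 'check_ssh':
    tag          => 'nagios',
    command_line => '/usr/lib/nagios/plugins/check_ssh $HOSTADDRESS$',
    target       => "${puppet_conf_directory}/common/check_ssh-command.cfg",
}
```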
Finally, in nagios.cfg, each poller reads the right directories for its datacenter:

# ie for nagios1:
cfg_dir=/etc/nagios3/conf.puppet.d/common
cfg_dir=/etc/nagios3/conf.puppet.d/hosting1
# for nagios2:
cfg_dir=/etc/nagios3/conf.puppet.d/common
cfg_dir=/etc/nagios3/conf.puppet.d/hosting2

I decided not to filter the exported resources by tag when collecting:
this lets me have exactly the same configuration files on every Nagios poller in “/etc/nagios3/conf.puppet.d”; only nagios.cfg changes between pollers.
In case of a problem on one of my pollers, I can very simply take over the monitoring of another datacenter just by sourcing one more directory in nagios.cfg.
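For example, if the poller of the second datacenter goes down, nagios1 can take over its zone with a single extra line (a sketch reusing the directory names from the snippet above):

```
# nagios1's nagios.cfg, temporarily also covering hosting2:
cfg_dir=/etc/nagios3/conf.puppet.d/common
cfg_dir=/etc/nagios3/conf.puppet.d/hosting1
cfg_dir=/etc/nagios3/conf.puppet.d/hosting2
```

Since Puppet already materializes every datacenter's directory on every poller, no configuration needs to be regenerated: a reload of Nagios is enough.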

With this configuration I have very simple distributed monitoring, thanks to Puppet once again :)

I will explain in a future blog post how to aggregate views with Livestatus and Check_MK Multisite.

  1. If the Shinken project endures and grows, it is very likely that it will become the absolute reference for advanced monitoring.

    I started using Shinken for monitoring on many client sites a few months ago, and the feedback I can give boils down to one sentence: still a bit young, but promising. The side effects are not that numerous; you have to do quite a few tests and keep an eye on the forum, and then it runs fine.

    It is up to us sysadmins to pass our remarks on to Jean Gabès in order to improve Shinken. Which I did not fail to do :)

    To come back to your article: Nagios and Puppet go well together, and it is perfect for big infrastructures if you follow the best practices of both products.

    • I agree with you that Shinken will eventually become a reference monitoring tool.

      Do you have a Shinken installation with all the modules distributed in failover / HA?

      Puppet and Nagios do indeed go well together, but one of the drawbacks is that the time Puppet takes to generate the Nagios configuration makes the runs very long!
