
Paving the road from CRITICAL to OK

Imagine you have an issue on your mail server. It’s an intricate issue, one of those whose origin may be difficult to track because it lies on the lowest levels of the system: the mail server runs on a legacy operating system inside a virtual machine on top of a modern host that attaches disk images from a clustered storage system, connected to a fast network and receiving user requests via load-balanced tunnels, and so on.

There are so many moving parts in this setup that figuring out what happened could take a lot of effort, so the first thing that comes to your mind is the metrics collected by Munin (aha! let’s find some correlation of events in those graphs). So, you open the Munin web pages and you discover that it’s been 3 days since the graphs were last updated. You think that maybe the state files have been updated anyway and you just need to rebuild the graphs… Bad luck, again.

How could this happen without anybody noticing? You think we need to add some Nagios checks for the Munin lock files, so you get a warning or critical alert if it’s been too long since the last update, right? Then you discover that the check is already there and it’s been complaining every single one of those 3 days that Munin wasn’t being updated. Well, it happens that Munin is not a core service; it’s only useful once in a while, though it can be very frustrating not to have it when you actually need it. Thus, nobody paid attention to that alert because nobody needed Munin at that time and there were other, more pressing alerts. So, what will happen next time?

Okay, you promise that next time you’ll pay more attention to those critical alerts and see that Munin needs an intervention. Are you happy with that? I’m not. I believe that errors caused by humans point to defects in the underlying system being operated. I also believe that repetitive work is a bad way to solve any issue. Automation makes IT better.

In this particular case, Munin sometimes fails to remove its locks for whatever reason (it may well be something caused by our configuration, but it’s not worth the trouble to debug it because it only happens a few times a year). When it happens, though, the Nagios checks send alerts and the person on call has to connect to the Munin server and just remove the locks. That’s it: just remove the locks. Therefore, I added an event handler to those Nagios checks that will remove the lock the next time it gets too old. The critical alert will still be triggered, and that’s good because it’s a critical event that should be recorded, but the system will take care of it immediately and very soon it will be OK again.
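Wiring that up in Nagios takes only a couple of pieces: an event_handler directive on the existing service and a small script that removes the lock once the check has settled into a hard CRITICAL state. This is a minimal sketch, not our exact configuration; the service name, the check command name, the script location and the lock path are placeholders you would adapt to your own setup:

define service {
  use                    generic-service
  host_name              munin-server
  service_description    Munin lock age
  check_command          check_munin_lock_age     ; the check that already exists
  event_handler_enabled  1
  event_handler          remove-stale-munin-lock
}

define command {
  command_name  remove-stale-munin-lock
  command_line  /usr/local/lib/nagios/eventhandlers/remove-stale-munin-lock $SERVICESTATE$ $SERVICESTATETYPE$
}

And the handler script itself:

#!/bin/sh
# remove-stale-munin-lock: called by Nagios on state changes. Only act once
# the check has reached a HARD CRITICAL state, so the alert still fires and
# gets recorded, and the stale lock is cleaned up right after.
STATE="$1"
STATETYPE="$2"

if [ "$STATE" = "CRITICAL" ] && [ "$STATETYPE" = "HARD" ]; then
    rm -f /var/run/munin/munin-update.lock    # adjust to your distribution's lock path
fi
exit 0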

Now I’m wondering how many of our critical alerts could be automated like this. I’m sure that not all of them will be that easy to automate, but what if most of them are? Well, I won’t start adding random event handlers everywhere. I’ll wait for the next issue that presents an automation opportunity, because I’m a huge fan of small steps. I hope you’ll do the same.

Reducing Bacula’s emails

Bacula is a professional, free-software backup management system (commercial support is also available if needed). In the default configuration it is customary to send an email for every job, stating whether it finished correctly or not. But in an environment with a pile of servers and several jobs per server, that means receiving a lot of emails every day which usually just say that everything went fine. In my case, that’s between 60 and 70 emails a day.

Fortunately, if you have a Nagios-compatible monitoring system, you can use a plugin that examines the Bacula director’s logs to see whether any error has occurred (there is a rough sketch of such a check after the examples below). So there is no need to keep receiving this daily flood of emails saying that the jobs finished fine. To achieve it, you only need to change the configuration of the Messages resource named Standard in /etc/bacula/bacula-dir.conf, replacing “mail” with “mail on error”:

Messages {
  Name = Standard
  mailcommand = "/usr/lib/bacula/bsmtp -h localhost -f \"\(Bacula\) \<%r\>\" -s \"Bacula: %t %e of %c %l\" %r"
  operatorcommand = "/usr/lib/bacula/bsmtp -h localhost -f \"\(Bacula\) \<%r\>\" -s \"Bacula: Intervention needed for %j\" %r"
  mail on error = bacula@example.com = all, !skipped
  operator = bacula@example.com = mount
  console = all, !skipped, !saved
  append = "/var/lib/bacula/log" = all, !skipped
  catalog = all
}

With this change, all jobs that use the Standard messages will only send email when an error occurs. But maybe you do want restore jobs to notify you when they finish correctly, so you don’t have to keep watching the logs, right? That is easy too: just create a new Messages resource for restore jobs with “mail” instead of “mail on error”, and point to it in the corresponding Job section:

Job {
  Name = "RestoreFiles"
  Type = Restore
  Client = bacula-fd
  Storage = Tape
  FileSet = "Full Set"
  Pool = Default
  Messages = Restore
  Where = /tmp/bacula-restores
}

Messages {
  Name = Restore
  mailcommand = "/usr/lib/bacula/bsmtp -h localhost -f \"\(Bacula\) \<%r\>\" -s \"Bacula: %t %e of %c %l\" %r"
  operatorcommand = "/usr/lib/bacula/bsmtp -h localhost -f \"\(Bacula\) \<%r\>\" -s \"Bacula: Intervention needed for %j\" %r"
  mail = bacula@example.com = all, !skipped
  operator = bacula@example.com = mount
  console = all, !skipped, !saved
  append = "/var/lib/bacula/log" = all, !skipped
  catalog = all
}
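As for the Nagios-compatible check mentioned above, there are ready-made plugins for this, but even a homemade one gets you most of the way: since the Standard messages already append everything to /var/lib/bacula/log, a small script can scan that log for jobs that did not terminate cleanly. This is only a rough sketch; the matched “Termination: … Error” string is an assumption and should be adjusted to whatever your director actually writes:

#!/bin/sh
# check_bacula_log (hypothetical sketch): go CRITICAL if the director's log
# contains jobs that terminated with an error. A real check would also keep
# track of which errors it has already reported.
LOG=/var/lib/bacula/log

if [ ! -r "$LOG" ]; then
    echo "BACULA UNKNOWN: cannot read $LOG"
    exit 3
fi

ERRORS=$(grep -c "Termination:.*Error" "$LOG")
if [ "$ERRORS" -gt 0 ]; then
    echo "BACULA CRITICAL: $ERRORS job(s) with errors in $LOG"
    exit 2
fi

echo "BACULA OK: no failed jobs in $LOG"
exit 0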

It’s a small change, but one that can significantly improve the productivity of a team of sysadmins with a pile of servers being backed up every day.