Paving the road from CRITICAL to OK

Imagine you have an issue on your mail server. It’s an intricate issue, one of those whose origin may be difficult to track because it lies in the lowest levels of the system: the mail server runs on a legacy operating system inside a virtual machine, on top of a modern host that attaches disk images from a clustered storage system, connected to a fast network and receiving user requests via load-balanced tunnels, etc.

There are so many parts in this setup that figuring out what happened could take a lot of effort, so the first thing that comes to mind is the metrics collected by Munin (aha! let’s find some correlation of events in those graphs). So, you open the Munin web pages and you discover that it’s been 3 days since the graphs were last updated. You think that maybe the state files were updated and you just need to rebuild the graphs… Bad luck, again.

How could this happen without anybody noticing? You think that you need to add some Nagios checks for the Munin lock files, so you get a warning or critical alert if it’s been too long since the last update, right? Then you discover that the check is already there and it’s been moaning every single one of those 3 days that Munin wasn’t being updated. Well, it happens that Munin is not a core service: it’s only useful once in a while, though it can be very frustrating not to have it when you actually need it. Thus, nobody paid attention to that alert because nobody needed Munin at that time and there were other, more pressing alerts. So, what will happen next time?
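To make the idea of such a check concrete, here is a minimal sketch of a Nagios-style plugin that looks at the age of a lock file. It is not the check we actually run: the function name, the thresholds and the decision to treat a missing lock as OK are assumptions made for illustration. The only convention relied on is the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import os
import sys
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_lock_age(lock_path, warn_seconds, crit_seconds, now=None):
    """Return (exit_code, message) based on how old the lock file is.

    Assumption: a missing lock file means no update is in progress,
    so it is reported as OK.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.stat(lock_path).st_mtime
    except FileNotFoundError:
        return OK, "OK - no lock file at %s" % lock_path
    except OSError as exc:
        return UNKNOWN, "UNKNOWN - cannot stat %s: %s" % (lock_path, exc)

    age = int(now - mtime)
    if age >= crit_seconds:
        return CRITICAL, "CRITICAL - lock is %d seconds old" % age
    if age >= warn_seconds:
        return WARNING, "WARNING - lock is %d seconds old" % age
    return OK, "OK - lock is %d seconds old" % age


def main(argv):
    """argv: [lock_path, warn_seconds, crit_seconds]."""
    if len(argv) != 3:
        print("UNKNOWN - usage: LOCK_PATH WARN_SECONDS CRIT_SECONDS")
        return UNKNOWN
    code, message = check_lock_age(argv[0], int(argv[1]), int(argv[2]))
    print(message)
    return code


# Only behave as a plugin when arguments were actually passed.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

The lock path passed on the command line depends on how Munin is installed, so it is left as a parameter rather than hard-coded.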

Okay, you promise that next time you’ll pay more attention to those critical alerts and see that Munin needs an intervention. Are you happy with that? I’m not. I believe that errors caused by humans reveal defects in the underlying system being operated. Also, I believe that repetitive work is a bad way to solve any issue. Automation makes IT better.

In this particular case, Munin sometimes fails to remove its locks for whatever reason (it may well be something caused by our configuration, but it’s not worth the trouble to debug because it only happens a few times a year). When it happens, though, the Nagios checks send alerts and the person on call has to connect to the Munin server and remove the locks. That’s it, just remove the locks. Therefore, I added an event handler to those Nagios checks that will remove the lock the next time it gets too old. The critical alert will be triggered anyway, and that’s good because it’s a critical event that should be recorded, but the system will take care of it immediately and very soon it will be OK again.
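For illustration, an event handler along these lines could look like the sketch below. This is not the handler I actually deployed; the function name, the arguments and the age threshold are hypothetical. It assumes the handler command passes the Nagios macros `$SERVICESTATE$` and `$SERVICESTATETYPE$` as the first two arguments, followed by the lock path and a maximum age in seconds.

```python
import os
import sys
import time


def handle_event(state, state_type, lock_path, max_age_seconds, now=None):
    """Remove a stale lock file; return True only if a file was removed.

    Acts only on a HARD CRITICAL state (an assumption: soft states are
    still being retried by Nagios) and re-checks the lock age itself, so
    a lock that belongs to a running update is never deleted.
    """
    if state != "CRITICAL" or state_type != "HARD":
        return False

    now = time.time() if now is None else now
    try:
        age = now - os.stat(lock_path).st_mtime
    except FileNotFoundError:
        return False  # lock already gone, nothing to do

    if age < max_age_seconds:
        return False  # lock is recent, leave it alone

    try:
        os.remove(lock_path)
    except FileNotFoundError:
        return False  # removed by someone else in the meantime
    return True


def main(argv):
    """argv: [state, state_type, lock_path, max_age_seconds]."""
    if len(argv) != 4:
        print("usage: STATE STATE_TYPE LOCK_PATH MAX_AGE_SECONDS")
        return 1
    removed = handle_event(argv[0], argv[1], argv[2], int(argv[3]))
    print("lock removed" if removed else "nothing to do")
    return 0


# Only run as a handler when arguments were actually passed.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

Checking the age again inside the handler is a deliberate choice: the alert still fires and gets recorded, but the handler refuses to touch a lock that looks fresh, so a misfire cannot interrupt a healthy update.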

Now I’m wondering how many of our critical alerts could be automated like this. I’m sure that not all of them will be that easy to automate, but what if most of them are? Well, I won’t start adding random event handlers everywhere. I’ll wait for the next issue that raises an automation opportunity because I’m a huge fan of small steps. I hope you’ll do the same.
