Wednesday, March 25, 2009

Of easy and painless systems monitoring

I'm not a systems administrator. I only have 8 servers to babysit, yet that used to be enough to be a time-consuming problem. You might not be a systems administrator either, or have many machines, services or websites to monitor, yet the fact remains that as IT professionals we need to keep a close eye on what's going on. I'm not talking about 99.999% uptime here, but 1% downtime is enough to make a lot of customers, clients and managers angry, especially since outages have a way of happening exactly when they shouldn't.
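To put those percentages in perspective, here is a quick back-of-the-envelope calculation. It's only a sketch: it assumes a 365-day year and treats every minute as equally important, which real service agreements rarely do.

```python
# Rough downtime budget for a given availability target.
# Assumption: a 365-day year, with no maintenance windows excluded.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability_percent):
    """Return how many minutes per year a service may be down."""
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0


for target in (99.0, 99.9, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes "
          f"({minutes / 60:.1f} hours) of downtime per year")
```

Under those assumptions, 1% downtime works out to roughly 87.6 hours a year, while five nines leaves a little over 5 minutes.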

What are your options? How much does it cost? What can you monitor? These are all questions I'll try to shed some light on. The solution I'm proposing today is one I have used myself for years. I'm not legally obliged to deliver five-nines availability, yet this is what I achieved with a total cost of 0. Yep, z.e.r.o. zero. El zilcho.

I'm not saying this will work for everybody, nor am I pretending to be an expert on the issue at hand, but I've learned a lot on the subject in a few years, so here it is.

Innovation comes from need

First things first. These are the questions you absolutely need to answer before putting your automated monitoring solution together.

  • If any abnormality is discovered, who will be reached and how?

Will you set up rounds of monitoring among your partners or employees? Do you have cell phones that can receive emails? Are you using pagers? If you can't answer these questions, stop now. Monitoring is apparently not for you: you either don't have the resources or the availability to do it.

  • What needs to be monitored?

There is a whole list of things that can be monitored. As a matter of fact, pretty much anything can: room and server temperatures, machine responses to arbitrary queries, query execution times, memory and disk usage, network throughput... the list is long. As I said, I'll show you how to monitor all these things, but you need to know what is critical and what is not. You can try to monitor everything, sure, but keep in mind that the whole point of automated systems monitoring is being reliable and saving time. Monitor everything and you'll fail at the latter. Make a list.

The tools of the trade

There are lots of monitoring systems out there. Most are expensive, and lots are hard to set up. I'm not interested in either of those categories, and I guess you aren't either. My tool stack is this :

  • Zenoss Core : Monitoring platform.

  • SNMP : Monitoring communications standard I use.

Zenoss Core is a free (as in free speech and free beer) open source monitoring platform. It's distributed under the GPL and available for download in a plethora of package formats: RPM, VMware virtual appliance, source tarball, zip files, whatever. This is the core of my monitoring solution.

I suggest installing it on a cheap commodity PC with CentOS. Everyone has one of those old machines lying around somewhere in a closet or a warehouse. Find one, plug it in and set up the OS of your choice (preferably CentOS, since it's binary compatible with Red Hat and FOSS). My preferred way to monitor is over the network via SNMP (I'll get to that later on), so the machine needs a network card. If you want it to send messages to pagers, you'll obviously need a modem. That's all the hardware you need.

Getting Zenoss up and running is pretty straightforward and well documented. CentOS is an RPM-based system, so one or two command lines usually do the trick. If you did choose CentOS, as I strongly suggested, check out chapter 5 of the installation guide. Installation should be a trivial matter and takes somewhere around two minutes. Literally.

SNMP - Simple Network Management Protocol

I mentioned SNMP as a monitoring communication standard. What is SNMP exactly? It's a simple way of querying appliances and getting information in the form of a numeric tree. This is where you need to do your homework. It is vital that you understand at least a little of what SNMP is and how it works. Don't be afraid though. It's widely used, well documented and simple enough to be understood in 30 minutes of research. Skip your favorite TV program tonight, and read a little. 30 minutes of reading will save you hours of problems down the road; your call.
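To make the "numeric tree" idea a bit more concrete, here is a small sketch. The three OIDs are standard entries from the MIB-II system group; the `describe` helper and the two lookup tables are my own illustration and not part of any SNMP library.

```python
# Each SNMP value lives at an address called an OID: a path of numbers
# from the root of the tree down to a leaf. These three are standard
# MIB-II "system" entries; the helper below is purely illustrative.
WELL_KNOWN_OIDS = {
    "1.3.6.1.2.1.1.1.0": "sysDescr (text description of the device)",
    "1.3.6.1.2.1.1.3.0": "sysUpTime (time since the agent started)",
    "1.3.6.1.2.1.1.5.0": "sysName (administrative name of the device)",
}

# Names of the first few branches, keyed by their full numeric path.
BRANCH_NAMES = {
    (1,): "iso",
    (1, 3): "org",
    (1, 3, 6): "dod",
    (1, 3, 6, 1): "internet",
    (1, 3, 6, 1, 2): "mgmt",
    (1, 3, 6, 1, 2, 1): "mib-2",
    (1, 3, 6, 1, 2, 1, 1): "system",
}


def describe(oid):
    """Walk an OID from the root and name each branch we recognize."""
    arcs = tuple(int(part) for part in oid.strip(".").split("."))
    path = []
    for depth in range(1, len(arcs) + 1):
        # Fall back to the raw number when the branch has no known name.
        name = BRANCH_NAMES.get(arcs[:depth], str(arcs[depth - 1]))
        path.append(name)
    return ".".join(path)


for oid, meaning in WELL_KNOWN_OIDS.items():
    print(oid, "->", describe(oid), "->", meaning)
```

Every device that speaks SNMP exposes its values somewhere in this tree, which is why a single tool can query a router and a Linux box the same way.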

SNMP is supported by a wide variety of devices: Windows servers, Linux servers, routers, switches, printers, UPSes, you name it. It's easy to activate and allows you to monitor pretty much anything.

On the downside, SNMP might not be the most secure monitoring solution, but it does the job fairly well. You can secure it further through firewalls and other such devices later on.

Installation on a Windows machine is easy. Go to the Add/Remove Programs utility, open the list of Windows components, look in the Management and Monitoring Tools and activate Simple Network Management Protocol. You can Google for it if you need more help. Once installed, it should appear as SNMP Service in the services management utility, which you can open by typing 'services.msc' in the Run dialog.

Linux servers need the Net-SNMP package. Again, I won't go through the details, since the installation process depends on your Linux flavor, but it's usually a trivial matter, and the package is usually included in your distro's base packages. Google for it if you need help. Zenoss offers a nice howto that might be of great help.

Whatever platform you set up, make sure to allow network traffic on UDP port 161. This is the usual port that SNMP agents listen on. It can be changed to something else in your SNMP agent's configuration.

Putting it all together

The hardest part is over. You can go grab a beer and celebrate. Don't celebrate too much though, because you're not done yet. We still need to fire up Zenoss, configure what we want to monitor and set up alerting rules. I'll go through the basic steps very quickly, but there is tons of documentation out there. I suggest using the Administration Guide as a helper.

Zenoss runs by default on HTTP port 8080. Navigate to

As of version 2.3, the default login is 'admin' and the default password is 'zenoss'. You're now taken to the dashboard. Pretty slick heh? Select 'Add device' from the menu on the left.

Fill the following fields.

  • Device name : This is either a DNS name or an IP address that points to the machine you want to monitor.

  • Device class path : Choose this value carefully. In the drop-down list, you need to select what best describes the device you want to monitor. If it's a Linux server, choose /Server/Linux. Guess what the best choice is for Windows servers?

  • SNMP community : This is the name of your SNMP community that you specified in your SNMP agent configuration. It's usually something like private or public, but some people like to use secret values.

Everything else can be ignored for now. As you master Zenoss, you'll find those extra fields most useful. Everything comes in good time, so keep focusing on the task at hand for now : press 'Add device'.

The next screen will slowly fill up with log messages. Zenoss is discovering your device and has a lot of values to check, so let it finish its job. If everything goes well, you'll see a link to navigate to the device at the end of the log window.

If you get error messages, you did something bad. Start by checking the troubleshooting section of the Zenoss howtos. Most issues can be solved with the information provided there. If the problem persists, there is still your good friend Google.

By default, Zenoss will monitor disk usage, memory usage, CPU usage, system load, network throughput and a few other things. Once you have a first device set up, you can tweak the alerting rules in the 'Settings' section of your Zenoss installation.
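An alerting rule boils down to comparing a measured value against a limit you chose. The sketch below shows the idea for disk usage on the local machine. The 90% limit and the function names are assumptions of mine, and this is not how Zenoss implements its rules internally.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return how full the filesystem holding `path` is, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_threshold(name, value, limit):
    """Return an alert message when value exceeds limit, else None."""
    if value > limit:
        return f"ALERT: {name} at {value:.1f}% (limit {limit:.1f}%)"
    return None


# Hypothetical rule: warn when the root filesystem is over 90% full.
message = check_threshold("disk usage", disk_usage_percent("/"), 90.0)
print(message or "disk usage is within limits")
```

Picking the limit is the part that takes judgment: too low and you get paged for nothing, too high and the alert arrives after the disk is already full.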

That's it for now. Get familiar with Zenoss and device discovery. This is your homework for this week. Next up, I'll tell you how to monitor IP services, temperatures and even more sexy stuff.