Friday, October 12, 2012

Why Monitoring Sucks

It doesn't matter what software you use: the fundamental model behind all current monitoring software is broken.

I never published the idea that originally prompted me to create this blog, but the basic tenet of that message still holds true.

"Push Considered Harmful"

The original idea I had was in the context of grid computing in the early 2000s. Early versions of the LHCG were based on the idea that a central scheduler could monitor the entire state of the "grid" and assign jobs to the hosts that had both the data and the available cycles. This failed badly, and the concept of "pilot jobs" invaded LHCG. In essence, pilot jobs flip the paradigm on its head and turn "push" architectures into "pull" ones. Instead of a node passively waiting for the central service to ask it for state, the central service should be waiting for nodes to notify it of their state.

No large-scale push architecture has ever actually worked outside a single organization. This is an important lesson for monitoring.

The reason "monitoring sucks" is that it is entirely based on a push model. All current monitoring software assumes that you have a model of your infrastructure that can be mapped onto existing hardware, either virtual or real. This is completely backwards.

There will be no improvement in monitoring software until this model is thrown out and a "pull" or peer-to-peer model is created. There is a reason Nagios still exists: everything that has tried to replace it uses the same "push" model.

What is needed is a subscribe model in which a node asks a monitoring service to "pull" service requests, with a finite time limit on the subscription. Similarly, there needs to be a subscribe service for alerts. Monitoring configuration needs to be a dynamic service; any software that reads a static file for configuration is doomed to suck.
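To make the subscribe idea concrete, here is a minimal sketch in Python. It is one possible reading of the paragraph above, not a description of any existing tool: the names LeaseRegistry, subscribe, active and ttl are all made up for illustration. The node announces itself and the checks it wants, the subscription carries a time limit, and it simply disappears if the node stops renewing it. Nothing is read from a static file.

```python
import time
from dataclasses import dataclass


@dataclass
class Lease:
    """One node's time-limited subscription (hypothetical structure)."""
    node: str
    checks: tuple
    expires_at: float


class LeaseRegistry:
    """Hypothetical in-memory registry: nodes add themselves, entries expire."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so expiry can be exercised without sleeping.
        self._clock = clock
        self._leases = {}

    def subscribe(self, node, checks, ttl):
        """Create or renew the lease for `node`, valid for `ttl` seconds."""
        lease = Lease(node, tuple(checks), self._clock() + ttl)
        self._leases[node] = lease
        return lease

    def active(self):
        """Return the sorted names of nodes whose lease has not expired."""
        now = self._clock()
        # Expired leases are dropped; nobody has to edit a config file.
        self._leases = {
            name: lease
            for name, lease in self._leases.items()
            if lease.expires_at > now
        }
        return sorted(self._leases)
```

A node would call subscribe("web01", ["http"], ttl=60) when it comes up and again before the 60 seconds run out. The set of monitored hosts is then whatever has recently asked to be monitored, rather than a model someone has to maintain by hand.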

The other fundamental problem in monitoring is the poor separation between alert and diagnostic services. For diagnostics, you want to monitor "everything"; for alerts, you should only be monitoring things that you can actually control. If it sends an alert, there needs to be a procedure to handle the alert.
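The separation can be sketched in a few lines of Python. This is only an illustration under my own assumptions: AlertRule, Monitor, the metric names and the "greater than threshold" rule are hypothetical. Every sample goes into the diagnostic record, but an alert can only exist if someone has written down what to do about it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert rule; it is rejected unless it names a procedure."""
    metric: str
    threshold: float
    procedure: str

    def __post_init__(self):
        # No procedure, no alert: refuse to construct the rule at all.
        if not self.procedure.strip():
            raise ValueError("an alert rule needs a procedure to handle it")


class Monitor:
    """Records everything for diagnostics; alerts only where a rule exists."""

    def __init__(self):
        self.diagnostics = []  # every sample, alertable or not
        self.rules = {}        # metric name -> AlertRule

    def add_rule(self, rule):
        self.rules[rule.metric] = rule

    def record(self, metric, value):
        """Store the sample; return an alert message or None."""
        self.diagnostics.append((metric, value))
        rule = self.rules.get(metric)
        if rule is not None and value > rule.threshold:
            return f"ALERT {metric}={value}: {rule.procedure}"
        return None
```

With a rule such as AlertRule("disk_used_pct", 90.0, "Rotate logs on the host"), a full disk produces an alert that carries its own procedure, while a metric with no rule is still recorded for later diagnosis but never pages anyone.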

Until monitoring services look a lot more like DHCP than DNS, they will continue to suck, regardless of the software used to implement them.