Modern Monitoring: Spikes, Heights and Virtual Hikes
Over the next few weeks I would like to share my experience with what is widely referred to as Modern Monitoring. I want to stress that, as with any new wave, I like to understand the ripples. This first section / chapter gives a short introduction to the “trendy concepts” … and provides a basis for more pragmatic implementation patterns and available solutions.
Now and Then (not so long ago …)
A decade ago, which isn’t that long in “man years” but quite a while in the cloud era, when I started off with monitoring, we only monitored what we really, really cared about, right? We didn’t measure!
Our main concern was: is it up, or is it down? That was the essence of the monolith: it was either alive or dead … We all know that this is the essence of monitoring, but nowadays, after “peeling off the interface”, what we all knew as a “server” no longer exists! The only distinguishable entity we can, or want to, measure is a service. Services have components, or consume and interact with other services. When something happens in your cache/memory layer, some caching occurs somewhere else in the system, and the first thing you know for certain is that you have a plumbing issue …
And all we get out of that is something like the following:
Figure #1: “A spike” - another useless piece of information?
Spikes, Heights …
Word games aside, all these Spikes & Heights have a way of sending us on long, dreadful “Virtual Hikes” … In a small, fixed-size datacenter the Spikes & Heights are arguably manageable, but when all you get is this type of data (important to note: “data” and not “information”) your Hike begins … and the following questions are asked:
- How do we recover from this?
- Who do we wake up?
- How critical is this incident? (can it wait for tomorrow?)
Since I returned to ‘practicing monitoring’ about two years ago, I have learned that until now my concepts (or rather misconceptions) of monitoring were very narrow, and in many cases the wrong approach. Don’t get me wrong, it got the job done! But the way it was done: creative? primitive? hackish?
I guess only someone who experienced this with me, or like me, can be an objective judge …
So, as we briefly covered the predicaments of the “prehistoric era of the web”, the even bigger issue was (and in many cases still is) that when a business needed a certain service up and running, the knowledge of what that service does, or its importance to the organization, was siloed and probably known only to the IT / OPS manager. And as a ‘sys/sysops engineer’, my approach to monitoring practices was based on the way the organization was behaving …
Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.
a.k.a “Conway’s Law.”
When I wanted to be super serious about my job, I found ways of ensuring that a service was up and running … The simplest example, which I keep giving, is:
curl -I www.tikalk.com 2>/dev/null | head -n 1 | cut -d$' ' -f2
The check above prints the HTTP status code of the response, which would basically give me an assurance that my service is up and ready to serve … Its scale/capacity/volume etc. were things we had to learn & measure over time, and in most cases these KPIs weren’t the ‘things’ guiding us through the monitoring definition process … So Spikes Happen!
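As a minimal sketch of how that one-liner could turn into an actual up/down answer, the snippet below wraps it in two small bash functions. The names `status_to_state` and `check_http`, the 5-second timeout, and the choice to treat any 2xx or 3xx code as “up” are all assumptions made for illustration, not something the original check defined.

```shell
#!/usr/bin/env bash

# Hypothetical helper: map an HTTP status code to "up" or "down".
# Assumption: 2xx and 3xx count as "up"; anything else counts as "down".
status_to_state() {
  case "$1" in
    2[0-9][0-9]|3[0-9][0-9]) echo "up" ;;
    *) echo "down" ;;
  esac
}

# Hypothetical helper: send a HEAD request and classify the result.
# -s silences the progress meter, -I sends HEAD, -m 5 caps the request at 5 seconds,
# -w '%{http_code}' prints only the status code (curl prints 000 if it cannot connect).
check_http() {
  local url="$1" code
  code=$(curl -s -o /dev/null -I -m 5 -w '%{http_code}' "$url")
  status_to_state "$code"
}

# Usage (needs network access):
#   check_http www.tikalk.com
```

Note that this still only answers “is it up or down?”; it says nothing about scale, capacity or volume, which is exactly the limitation described above.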
In a Perfect World
A world where we have time, money and no competition ;) …
What if (humour me for a minute) in a perfect world a Product Manager specified the capacity of a certain feature? Let’s all try to be optimistic, neutral and pessimistic about our requirements, and let’s say our scale for a certain app is:
- 20 - pessimistic
- 60 - neutral
- 90 - optimistic
And all at once we have a KPI!
Now let’s understand how we reach that KPI, and which processes/services/functions we need in order to achieve it. So the first big step is communication!
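To make the three capacity levels above a bit more concrete, here is a rough bash sketch that names the band a measured value falls into. The function name `classify_capacity`, the `below-target` label and the exact band boundaries are assumptions for illustration; the unit is whatever the Product Manager specified, and the value is assumed to be an integer.

```shell
#!/usr/bin/env bash

# Capacity levels taken from the list above (unit not specified in the post).
PESSIMISTIC=20
NEUTRAL=60
OPTIMISTIC=90

# Hypothetical helper: report which band a measured integer value falls into.
# Assumption: each level is the lower bound of its band.
classify_capacity() {
  local value="$1"
  if   [ "$value" -ge "$OPTIMISTIC" ];  then echo "optimistic"
  elif [ "$value" -ge "$NEUTRAL" ];     then echo "neutral"
  elif [ "$value" -ge "$PESSIMISTIC" ]; then echo "pessimistic"
  else echo "below-target"
  fi
}

# Usage:
#   classify_capacity 72
```

A measurement that lands in `below-target` would be a spike worth waking someone up for, while one in the neutral band is the kind of thing that can probably wait for tomorrow.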
Moving forward »
Many of my new conceptions of monitoring are inspired by this highly recommended book, which really focuses on, you guessed it again -> (Effective) Modern Monitoring & Alerting principles. I would love to hear your thoughts and experiences from both the Dev & Ops sides of the fence.
In my next post (hopefully a few days away) I would also like to discuss the correlation with Logging, and what in my personal opinion (influenced by many different solutions and practices, of course) is the right way to tackle and combine these two mandatory [IMO] organizational efforts.