All successful companies want to be able to answer the question, “but how are we really doing?” There are a ton of ways to define and measure the success of your infrastructure. I won’t pretend to know all about business metrics, but I can talk about IT ops. There are a lot of opinions on even this part of measuring performance, so YMMV.
I’m going to open (and close) this post by saying: staring at a dashboard all day long is NOT an effective or acceptable means of monitoring your infrastructure, nor is email-based monitoring, which amounts to the same thing. What a waste of someone’s time! If you care enough to measure and alarm on your infrastructure, then set up SMS-based notifications for critical alerts at the very least. The first time your MTTR for a major outage is increased because your oncall was tarrying in the bathroom instead of hurrying back to his terminal, you’ll understand. Talk about an awkward postmortem conversation. Additionally, think of all of the time your engineer could spend fixing production problems or working on automation if she wasn’t staring at a screen, scared to death of missing a critical issue.
What I think matters most
FYI- I include load balancers in the ‘Network’ bullet below, although my personal opinion is that those are systems devices and should be admin’d by systems engineers. 🙂
Site latency. You want to know how the front end is being received by your customers. There are other factors that come into play, of course; things like Flash objects rendering on the client side aren’t controllable or measurable. But having latency data, from both internal and external sources, will give you a more accurate depiction of how your site is behaving. If all you do is measure from internal sources, what happens when the border becomes unavailable? What if there are major routing issues with a specific provider in another part of the world? Say, a cable break off the coast of China? With distributed external monitoring, you may be able to change routing to avoid that carrier, rather than just seeing ingress traffic drop & wondering what the heck happened.
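As a sketch of the idea, here’s what a minimal latency probe and roll-up might look like in Python. This is an illustration only; a real setup would run probes from distributed external vantage points via a monitoring service, and the URL and percentile choices below are assumptions.

```python
import statistics
import time
import urllib.request

def measure_latency(url, timeout=5):
    """Round-trip time in ms for a single GET, or None if the probe fails."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
    except OSError:
        return None
    return (time.monotonic() - start) * 1000.0

def summarize(samples):
    """p50/p95 over successful probes, plus a count of failed probes."""
    ok = sorted(s for s in samples if s is not None)
    failures = len(samples) - len(ok)
    if not ok:
        return {"p50": None, "p95": None, "failures": failures}
    p95 = ok[min(len(ok) - 1, int(len(ok) * 0.95))]
    return {"p50": statistics.median(ok), "p95": p95, "failures": failures}
```

Run the probe from several vantage points (inside the datacentre and from external regions) and compare the summaries: a p95 that diverges only externally points at the border or a carrier, not your stack.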
Service errors/timeouts. The next step when figuring out root cause for site latency is to look at the logs on the front end webservers to see if there are performance or connectivity issues with the services that feed into page generation. (no, the next step isn’t always to blame the network!! get over it already!!) Applications are rarely standalone in any distributed environment. Depending on how deep and distributed your stack is, you may have to check tens or even hundreds of services individually without these logs. Just make sure that the important errors (fatals and timeouts, usually) are logged at the appropriate log level. Logging every little event, regardless of impact to customer experience or functionality, will only make the job of troubleshooting a site issue more difficult.
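One way to keep the important errors visible without logging every little event is to route severities to different handlers. A minimal sketch using Python’s `logging` module (the logger name and messages are made up; handler destinations are placeholders for whatever your log pipeline uses):

```python
import logging

logger = logging.getLogger("frontend")
logger.setLevel(logging.INFO)

# Everything goes to the general log for later debugging...
general = logging.StreamHandler()
general.setLevel(logging.INFO)
logger.addHandler(general)

# ...but only fatals and timeouts reach the high-signal stream troubleshooters watch.
alert_stream = logging.StreamHandler()
alert_stream.setLevel(logging.ERROR)
logger.addHandler(alert_stream)

logger.info("cache miss for key %s", "w-123")                     # noise: kept out of the alert stream
logger.error("timeout calling pricing service after %d ms", 500)  # surfaces immediately
```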
Service latency. If you don’t find any critical errors or timeouts in the front end logs, take a look at latency measurements between up- and downstream dependencies. Maybe services are just slow to respond. That could be because of packet loss in the network, system utilization issues, or something like query pileups on a backend database. Understanding where the bottleneck is occurring is a very good thing.
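Instrumenting each downstream call makes “where is the bottleneck?” answerable from data rather than guesswork. A hedged sketch of the idea (the function and dependency names are hypothetical; a real service would export these samples to its metrics system):

```python
import time
from collections import defaultdict

call_timings = defaultdict(list)  # dependency name -> list of latency samples (ms)

def timed_call(name, fn, *args, **kwargs):
    """Invoke a downstream dependency and record how long the call took."""
    start = time.monotonic()
    try:
        return fn(*args, **kwargs)
    finally:
        call_timings[name].append((time.monotonic() - start) * 1000.0)

def slowest_dependency():
    """The dependency with the highest mean latency recorded so far."""
    return max(call_timings, key=lambda n: sum(call_timings[n]) / len(call_timings[n]))
```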
Databases. First off: I’m not a database expert. That being said, database utilization, queue lengths, read/write operations and DB failover events are all metrics that just can’t be overlooked. Granted, you’ll probably see any major DB issues surfaced in upstream dependencies, but you absolutely have to have visibility into such a critical layer of the stack. To me, DBs have always seemed temperamental and prone to ‘weird’ behaviour. The more insight you can get to narrow down potential issues, the better.
Network. It’s great that software can/should have exponential backoff or graceful failure in the event of connectivity loss with upstream dependencies, but that’s slightly reactive and will impact the customer experience if a critical app is affected. Measuring and alarming on network connectivity/latency will allow you to drill down into issues more quickly. Looking at inter- and intra-[datacentre,rack,region] metrics is a decent starting place. But to really do network monitoring justice, you also need lower-level metrics to help drill down into root cause. Drops, link bounces, packet loss, and downed devices/interfaces are a few of the key metrics. There are soooooo many other things to monitor in a complex network (running out of TCAM space? really?); I won’t even attempt to enumerate all of them here. A solid network engineer can probably spout off 20 of them in the space of 30 seconds though.
Btw, it’s fairly simple when you’re talking about a network of maybe 100 boxes. But when you get over 50k of them behind innumerable devices with cross-datacentre dependencies, it’s a bit tougher to measure inter-switch issues from the service perspective, let alone alarm on them. (that’s a metric buttload of data!)
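Two of the low-level checks mentioned above, packet loss and link bounces, reduce to simple arithmetic over counters and state samples you’d poll from devices (e.g. via SNMP; the polling itself is out of scope here and the state labels are an assumption):

```python
def packet_loss_pct(sent, received):
    """Loss percentage from interface counters polled over an interval."""
    if sent == 0:
        return 0.0
    return (sent - received) / sent * 100.0

def link_bounces(states):
    """Count up->down transitions in a series of sampled interface states."""
    return sum(
        1 for prev, cur in zip(states, states[1:])
        if prev == "up" and cur == "down"
    )
```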
Basic server metrics. Servers and network devices do keel over, and applications can suck all the memory, CPU or disk I/O. Granted, the service should be built with hardware redundancy (servers, switches, network, power) and the application should be able to handle a failed machine with no impact once you’re in a truly distributed environment. But monitoring the machines obviously needs to happen- someone’s gotta know that a machine is dead so they can fix/replace it.
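For basic host health, the standard library already gets you surprisingly far. A minimal sketch, with assumed thresholds (Unix-only, because of `os.getloadavg`):

```python
import os
import shutil

def disk_pct_used(path="/"):
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0

def load_per_core():
    """1-minute load average normalized by core count (Unix only)."""
    return os.getloadavg()[0] / (os.cpu_count() or 1)

def check_host(disk_threshold=90.0, load_threshold=2.0):
    """Return the list of basic metrics currently breaching their thresholds."""
    alerts = []
    if disk_pct_used() > disk_threshold:
        alerts.append("disk")
    if load_per_core() > load_threshold:
        alerts.append("load")
    return alerts
```

A cron job or agent loop calling `check_host` and paging on a non-empty result is the crudest possible version of “someone’s gotta know that a machine is dead”.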
Page weight. This isn’t an obvious one, and there isn’t necessarily a direct correlation between page weight and latency. But if you don’t see timeouts, connectivity issues or server/device problems, performance degradation could be as simple as the site serving heavier pages. (sometimes you just have to bite the bullet & roll out a heavier page!)
Get some historical data on your alarms before you turn on alerting, so you know that you’ve set the proper thresholds. The last thing you want to do is turn on an alert and barrage a [most likely] already-overworked oncall. The exception is an alert stemming from a real-life high-impact event: if a condition is integral enough to cause an outage or noticeable customer impact, then it needs to have a monitor and alarm right away. You can always adjust the threshold for these one-offs as you go along. And before you turn on alarms, take care of the next section too. Seriously!
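One simple way to derive an initial threshold from that historical data is a high percentile plus some headroom, then tune from there. A sketch (the percentile and multiplier are assumptions, not a recommendation):

```python
def suggest_threshold(history, multiplier=1.25):
    """Initial alarm threshold: roughly the p99 of historical samples plus headroom."""
    if not history:
        raise ValueError("need historical samples before enabling the alert")
    ordered = sorted(history)
    p99 = ordered[min(len(ordered) - 1, int(len(ordered) * 0.99))]
    return p99 * multiplier
```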
How do you know what to do with the alerts? This deserves its own section here because it’s so important. If you’re ‘together’ enough to measure and alert, then you should also make sure that the folks who are supporting the alarms are well-informed about what to do with them… before the alerts start paging them. Runbooks are a completely separate topic, but in short, they should include simple step-by-step directions with clear examples and/or screenshots, escalation information, and links to architecture diagrams. They should be written on the premise that anyone with a basic understanding of the system/service/network can take care of the issue at 3am without any handoff. Or, at worst, the body of the doc should contain simple decision-tree-type information such as “if you see XXX, you probably want to start looking at YYY”.
There are myriad ways that “uptime” (one of my least favourite words in the tech world) or Availability can be measured. But what does “availability” even mean? Before you measure it, you’ll need to work with both technical and business partners to define it. Every business has slightly different goals, business models, architectures, etc, so there can’t really be one universal definition of how to measure whether it’s “up” or “down”. Think carefully about building critical processes or monitors around this measurement, though. For example, if you run an e-commerce site and measure availability by deviation from the number of orders received in a given time period, beware that some number of customers will be impacted before you can see & act on it. It’s fine to alarm on it, but alerting on latency or number of timeouts will enable much quicker MTTD and MTTR, since those metrics sit at least one level farther down the stack (order-rate deviation is a downstream symptom, while latency and timeouts point closer to the actual root cause). I’m not saying that you shouldn’t alarm on order rates in this case- just that you should always consider whether there’s a less reactive metric to key off of for critical ops coverage.
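The order-rate deviation check itself is easy to sketch; the point stands that by the time it fires, customers have already been impacted. A z-score version, with an assumed deviation threshold:

```python
import statistics

def order_rate_anomalous(history, current, z_threshold=3.0):
    """True if the current-interval order count deviates sharply from history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_threshold
```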
What about measuring the “Customer Experience”? Is that the same as or different from “Availability”? Short story: it depends on how you define it. You could measure TTI (time to interact), which is most likely going to be different. I know a few websites that do this, and if you can define the cut-off reliably (do background processes count? what exact functionality needs to be available to allow a customer to ‘interact’?), then I’m all for TTI as one measurement. Percentage of fatals in core services could also be included here. Granted, that’s more of a real-time metric and one that’s already mentioned above. But it’s also good for a weekly roll-up of site health.
Measuring vs Reporting, Dashboards vs Decks
Not every single thing you can measure should be reported or alarmed on. I should have put this as a disclaimer right at the top of this doc, really. There is such a thing as ‘paralysis by analysis’. It’s fine if a single service owner wants to review 50 or 60 different metrics, but there should be no more than 3-5 true health metrics. If you have more than 5 metrics (and I’d group ‘aggregate log fatals’ into one here as an example) that are top priority to measure, then you should continue working on your monitoring configs or maybe revisit the way your service responds to failures.
Dashboards are [near-]real-time and decks are compiled from historical information (summaries). As an Ops manager, if someone comes to me with a deck from last week and tells me we have a site-impacting issue, then 1) we’ll run the event and make sure it gets handled and 2) the owner of the service will be conducting a monitoring audit of their service in fairly short order. I have no problem with a discussion that begins with, “I see latency’s been increasing over the past two weeks (without breaching the alert threshold)- any ideas what might be up?”. In fact, that’s exactly the type of discussion that a deck should incite. It’s just the times when someone might say, “Hey- it looks like we had a 50% FATAL rate on this core service last week. Any ideas?” that I get a little perturbed.
Again, staring at a dashboard all day long is NOT an effective or acceptable means of monitoring your infrastructure, nor is email-based monitoring, which amounts to the same thing. While dashboards will provide fairly up-to-date visibility into the health of a service, any critical metrics that are being rendered in the dashboard should also have a corresponding alert.
Oh yeah- and remember to monitor the monitoring system. 😉