Event Management

Something blew up in your infrastructure and you have no idea what’s wrong or where to even start looking. A large percentage of your customer base is impacted, and the company is hemorrhaging money every minute the episode continues. Your senior management team is screaming in both ears, and your engineers are floundering in your peripheral vision, trying to find root cause. Sound familiar?

True Ops folks tend to thrive in this type of environment, but businesses don't. And engineers, regardless of whether they write software or maintain systems & networks, hate seeing the same events over and over again. Managing these events doesn't end when the event itself does. Doing it right takes copious amounts of training, automation, process innovation, consistency and follow-through. This is my 'take' on how to go about rolling out a new process.

This may seem like a lot of overhead (it's a lot of words), but the process itself is actually pretty simple. The effort is really in making the right process designs up front and in creating the proper tooling & training/drilling around it. It's a very iterative process; it took well over a year to solidify it, and we were constantly re-factoring it as we learned more about our evolving architecture. Most of what's described below is for Impact 1 events (site outages, etc) and doesn't necessarily apply to run-of-the-mill or day-to-day requests (over-rotating burns people out and diminishes the importance of those major events). Not all of this applies to a small, 20-person company either, although the responsibilities contained in the 'During an Event' section will apply to almost any sized team or event. Perhaps you'll need to combine roles or re-distribute responsibilities depending on the size of the team or event, but the process itself is pretty extensible. The examples follow distributed websites, since it's what I know, but the concepts themselves ought to apply to other architectures and businesses. (I also assume you're running conference calls, but the same applies if you run your events over IRC, Skype, etc.)

Culture Shift

If you're one of the few lucky people who work in a company where IT Operations garners as much attention and love as launching new features/products, then we're all jealous of you. 🙂 Engineers and business people alike would absolutely love to have 100 percent of the company's time focused on innovation. In my experience, any time I mention 'process', I receive looks of horror, dread and anger from engineering, including management. The knee-jerk reaction is to assume that a new procedure will only create more delay or will divert precious time from what 'truly matters'. Taking a measured approach to dispelling those rumors will pave the way to a successful rollout. It just takes a lot of discussion, supporting metrics, the ability to translate those metrics into meaningful improvement to the bottom line, a considered plan, and the willingness to partner with people rather than being prescriptive about it.

  • Act like a consultant. Even if you're a full-time employee who's 'grown up' in an organization, you should begin with a consultant mindset so you can objectively take stock of your current environment, solicit objective feedback, and define solid requirements based on your learnings. This can be difficult when you're swimming (drowning?) in the issues, and gathering input from people who are participants but not owners of the process will help immensely.
  • Use metrics. You have to know the current state of affairs before diving headlong into improvements or prioritizing the deliverables in your project. If you don't have a ticketing system or feature-rich monitoring system from which to gather metrics programmatically, then use a stop watch to capture the time it takes to run through each step of the current process. If all you have is anecdotal evidence to reference initially, then so be it. And if that's the case, gaining visibility into the process should be at the top of your priorities.
  • Be truly excited. Don't pay lip service to a change in process, and don't allow the leaders in your organization to do so either. The minute you sense resistance or hesitation in supporting the effort, intercept it and start a conversation. This is where the numbers come in handy. If the engineers tasked with following a new process are hearing grumblings from managers or co-workers, then it adds unnecessary roadblocks. To be sure, we encountered our fair share of resistance, which bred some frustration during our rollout. But we leaned on the fact that every improvement decreased the number of outage minutes, added to the bottom line and helped the stock price, even if the benefit was indirect. That's something that everyone can and should be excited about.
  • Incremental progress. Not everything included here has to (or can) happen overnight, or even in the first six months. I hate the saying, “done is better than perfect”, but sometimes it actually applies. I’ve included ideas on how to roll most of the process out in an incremental fashion while still getting consistent bang for the buck.
  • Continual refinement. No good process is one-size-fits-all-forever. Keep an open mind when receiving feedback, ensure that the process is extensible enough to morph over time, and continually revisit performance and gather input from participants. Architectures change, and the processes surrounding them must change as well.

Prepping for success

The following deliverables are fundamental to securing a solid Event Management process that’s as streamlined as possible. It will take time to address the majority of the research and work involved, but basing prioritization on the goals of the program and the biggest pain points will allow measurable progress from the outset.

Impact Definitions

You need to know the impact or severity level of the event before you know what process to run. The number of levels may vary, but make sure to decide on a number that is both manageable and covers the majority of issues in your environment. I have to admit that over time, my previous company moved to looking at events as “pageable with a concall” (sev1), “pageable without a call” (sev2) or “non-pageable” (sev3) offenses, rather than adhering to each specific impact definition. This isn’t right or wrong; the behavior reflected our environment. Although each organization is unique, here are some examples to consider:

Impact 1: Outage Condition: Customer-facing service or site is down. Percentage of service fatals breaches a defined threshold (whatever is acceptable to the business).
Sev1 tickets/events follow all processes below and have a very tight SLA to resolve, which triggers auto-escalation up the relevant management chain. The escalation time will depend on the types of events involved in a typical sev1, but we escalated through the management chain aggressively, beginning 15 minutes after the ticket was submitted. The setting that rotates the ticket to the secondary on-call (thus paging them) if the ticket isn't moved to the appropriate next state or updated should also be fairly tight (i.e., if a ticket isn't moved from 'assigned' to 'researching' within 15 minutes, it auto-reassigns to the secondary and pages the group manager).
Impact 2: Diminished Functionality: Customer-facing service or site is impaired. Percentage of service fatals breaches a defined threshold (whatever is acceptable to the business).
Sev2 tickets/events will page the correct on-call directly, with a moderately tight SLA to resolve which triggers auto-escalation up the relevant management chain. These tickets will also rotate to the secondary on-call and page the group manager if the ticket isn’t moved to the appropriate next state after the agreed-upon SLA.
Impact 3: Group Productivity Impaired: Tickets in this queue will most likely wind up covering issues that will either become sev1/sev2 if not addressed or are action items stemming from a sev1/sev2 issue. It may also cover a critical tool or function that is down and affecting an entire group’s productivity. These tickets don’t page the on-call, and the SLA to resolve is much more forgiving.
Impact 4: Individual Productivity Impaired/Backlog: This sev level was treated more like a project backlog, and while there are other products that cover bugs and project tasks, I like the idea of having everything related to workload in the same place. It's simpler to gather metrics and relate backlog tasks to break/fix issues.

Incremental progress

I will always recommend front-loading the sev1 definition and over-escalating initially. In my mind, it’s much better to page a few extra people in the beginning than it is to lose money because you didn’t have the proper sense of urgency or the correct people for an issue. If you can’t integrate automatic rotation of tickets into your current system, then add it into your checklist and make a conscious decision to watch the time and escalate when necessary.
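
If you can't wire this into your ticketing system natively yet, even a small periodic watcher approximates it. Here's a minimal sketch (Python, with tickets as plain dicts and made-up field names) of the escalate-at-15-minutes and rotate-to-secondary behavior described above; the 'page-…' aliases follow the naming convention covered later in this post.

```python
from datetime import datetime, timedelta, timezone

# Thresholds mirroring the 15-minute examples above; tune these to your own SLAs.
ESCALATE_AFTER = timedelta(minutes=15)
ROTATE_AFTER = timedelta(minutes=15)

def sev1_escalation_actions(ticket, now=None):
    """Return the escalation/rotation actions a watcher should take for one open
    sev1 ticket (a plain dict here). The 'page-...' aliases follow the naming
    convention described later in this post."""
    now = now or datetime.now(timezone.utc)
    age = now - ticket["created_at"]
    actions = []

    # Escalate up the management chain once the ticket is old enough.
    if age > ESCALATE_AFTER and not ticket.get("escalated"):
        actions.append(("page", f"page-{ticket['service']}-escalation",
                        f"sev1 #{ticket['id']} unresolved after {age}"))

    # Rotate to the secondary if the primary hasn't moved it out of 'assigned'.
    if ticket["state"] == "assigned" and age > ROTATE_AFTER:
        actions.append(("reassign", f"page-{ticket['service']}-secondary",
                        f"sev1 #{ticket['id']} not acknowledged by primary"))
        actions.append(("page", f"page-{ticket['service']}-escalation",
                        f"sev1 #{ticket['id']} rotated to secondary on-call"))
    return actions

# Example: a 20-minute-old sev1 ticket that no one has touched yet.
ticket = {"id": 444444, "service": "ordering", "state": "assigned",
          "created_at": datetime.now(timezone.utc) - timedelta(minutes=20)}
for action in sev1_escalation_actions(ticket):
    print(action)
```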

Tools and Visibility

Tools

It doesn’t take an entire platform of tools to run an event properly, although that certainly does help. The following tools are fairly important, however, so if you have to prioritize efforts in this arena, I’d start here.

  • Ticketing System. A flexible and robust ticketing system is an extremely important part of a solid Event Management process. It's your main communication method both during and after an event, and it's a primary source for metrics. If participants in an event are fumbling with the fundamental mechanism for communicating, then they're not concentrating on diagnosing and fixing the issue. There are many important features to consider, but extensibility, configurability and APIs into the tool are all critical to ensuring that whatever system you choose grows along with your processes and organization.
  • Engagement/Notification System. Ideally this will be tied into your ticketing system. If you have your tickets set up to page a group, then you ought to already have that information in the system. While our first-line support team utilized a full version of a homegrown event management application, we always wanted to provide a pared-down version of the same tool for service owners throughout the company. I certainly hope that’s happened by now, since the more distributed a company becomes, the more difficult it is to locate the right people for cross-functional issues which may not be sev1-worthy.
  • Sev1 Checklist. I'm a big proponent of checklists that can and should be used for every event. In the heat of battle, it's easy to overlook a step here and there, which can cause more work further into the event. Building a checklist into an overall Event Management application is a great way to track progress during an event, ensure each important step is covered and inform latecomers to the event of progress without interrupting the flow of the call or the troubleshooting discussions. Separate lists should be created for the front-line ops team, call leaders and resolvers. Each role owns different responsibilities, but everyone must understand the responsibilities of all three roles. (A rough sketch of what such a checklist might look like follows this list.)
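
As promised above, here's a rough sketch of what a per-role sev1 checklist might look like as structured data inside an event management tool. The step names are loosely drawn from the responsibilities described throughout this post; the field names and helper functions are purely illustrative, not any real tool's schema.

```python
from datetime import datetime, timezone

# One list per role, as suggested above; each step records who completed it and
# when, so latecomers can follow progress without interrupting the call.
CHECKLISTS = {
    "front-line ops": ["cut ticket", "initial engagement", "initial notification",
                       "subsequent notifications", "record on-call check-ins"],
    "call leader":    ["confirm impact", "verify escalations", "drive status updates",
                       "sign off on resolution", "assign post-event actions"],
    "resolver":       ["join call and announce team", "check service health",
                       "document findings in ticket", "verify the fix"],
}

def new_event_checklist(ticket_id):
    """Create an empty checklist record for a new sev1 event."""
    return {"ticket": ticket_id,
            "steps": {role: {step: None for step in steps}
                      for role, steps in CHECKLISTS.items()}}

def complete_step(checklist, role, step, who):
    """Mark a step done with a timestamp and the person who did it."""
    checklist["steps"][role][step] = {"by": who,
                                      "at": datetime.now(timezone.utc).isoformat()}

event = new_event_checklist(444444)
complete_step(event, "front-line ops", "cut ticket", "jdoe")
print(event["steps"]["front-line ops"]["cut ticket"])
```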

Incremental progress

Ticketing: If your system doesn't include features such as service-based groups, automatic submission of tickets, reporting/auditing or fundamental search functionality, start investing in either bolstering the current system or migrating to another one. Depending on the scope of that work, beg/borrow/steal development resources to create a hook into the backend data store to pull information related to your specific needs. (This is a vague statement, but every environment has different needs.)
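
For example, even a tiny script run against whatever ticket export you can pull will give you time-to-engage and time-to-resolve numbers to anchor the conversation. A minimal sketch, with made-up field names that you'd map onto your own system:

```python
from datetime import datetime
from statistics import mean

def minutes_between(start, end):
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 60

def sev1_timing_report(tickets):
    """Compute mean time-to-engage and time-to-resolve, in minutes, from whatever
    ticket export you can get. The field names are illustrative; map them onto
    your own system's export format."""
    engage = [minutes_between(t["created"], t["engaged"]) for t in tickets if t.get("engaged")]
    resolve = [minutes_between(t["created"], t["resolved"]) for t in tickets if t.get("resolved")]
    return {"tickets": len(tickets),
            "mean_minutes_to_engage": round(mean(engage), 1) if engage else None,
            "mean_minutes_to_resolve": round(mean(resolve), 1) if resolve else None}

# Two made-up sev1 tickets, just to show the shape of the input.
tickets = [
    {"created": "2012-06-01T02:10:00", "engaged": "2012-06-01T02:22:00", "resolved": "2012-06-01T03:05:00"},
    {"created": "2012-06-03T14:00:00", "engaged": "2012-06-03T14:09:00", "resolved": "2012-06-03T14:41:00"},
]
print(sev1_timing_report(tickets))
```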

Checklists: It's fine to start small with a binder of blank lists that the team can grab quickly. Anything is better than nothing! Include columns for timestamps, the name of the person who completed the action, the actual action and a place for notes at the very least. The facets of an event I would document initially are discovering the issue (goes without saying), cutting the ticket, initial engagement & notification, each subsequent notification, when on-calls join the call/chat, any escalations sent, root cause identified, service restored, and post mortem/action items assigned.

Visibility

  • Monitoring/Alerting. You have to be able to recognize that an event is going on before you can kick off a process. If you're really good, your monitoring will begin the process for you by auto-cutting a ticket based on a specific alarm and notifying the proper engagement/notification lists. That takes time, of course, but you should be able to build a solid list of alerts around sev1 conditions as you go along; automation like that is rarely built in a day. Almost every post mortem I've been in for a high-impact event has included a monitoring action item of this type; if those conversations are happening then you're bound to have fodder for monitoring and automation. I've chatted about monitoring and alerting in a previous post, so I won't regurgitate it here. (A rough sketch of an alarm-to-ticket hook follows this list.)
  • Changes in the Environment. Understanding what’s changed in your environment can significantly aid in narrowing the scope of diagnosing and troubleshooting events. Accumulating this data can be a huge task, and visualizing the amount of change within a large distributed, fast-paced, high-growth environment in an easily-digestible format is a bear. The visibility is well worth it, however, so if you don’t have a Change Management system or process, it’s a fantastic deliverable to put on a road map. CM is an entirely separate post though, so I won’t go into it here.
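
As mentioned in the Monitoring/Alerting bullet above, here's a rough sketch of the glue that turns a known sev1-worthy alarm into a ticket plus engagement. The alarm names are invented, and the ticketing/paging calls are stubbed out; none of this refers to a real monitoring API.

```python
# A handful of known sev1-worthy alarms mapped to an impact level and an
# engagement alias; grow this table out of your post mortems.
SEV1_ALARMS = {
    "ordering_fatal_rate": {"impact": 1, "engage": "page-ordering-primary"},
    "site_down":           {"impact": 1, "engage": "page-syseng-primary"},
}

def handle_alarm(alarm_name, detail, cut_ticket, send_page):
    """On a recognized sev1 condition, auto-cut a ticket and engage the on-call.
    `cut_ticket` and `send_page` stand in for your real ticketing and paging
    integrations; here they only need to be callables."""
    rule = SEV1_ALARMS.get(alarm_name)
    if rule is None:
        return None  # not a known sev1 condition; leave it for a human to triage
    ticket_id = cut_ticket(impact=rule["impact"], summary=f"[auto] {alarm_name}: {detail}")
    send_page(rule["engage"], f"sev1 #{ticket_id}, {detail}. Please join the concall.")
    return ticket_id

# Stub integrations so the sketch runs end to end.
def fake_cut_ticket(impact, summary):
    print(f"cut impact-{impact} ticket: {summary}")
    return 444444

def fake_send_page(alias, message):
    print(f"page {alias}: {message}")

handle_alarm("ordering_fatal_rate", "50% fatal rate in ordering",
             fake_cut_ticket, fake_send_page)
```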

Incremental progress

Changes: Start small by collating data such as software deployments for major services, a simple calendar of Change Management events (heck, even a spreadsheet will suffice in the beginning), and recent high-impact tickets (sev1/sev2). You can migrate into a heads-up type of display once you have the data and understand the right way to present it to provide high value without being overwhelming.
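
A first pass at that collation can literally be a script that merges a few exports into one time-ordered list. A minimal sketch, with made-up record shapes standing in for your deployment tool, change calendar and ticketing system:

```python
from datetime import datetime

def recent_changes(deployments, cm_events, high_impact_tickets, since):
    """Merge deployments, Change Management events and recent sev1/sev2 tickets
    into one time-ordered list for answering 'what changed?'. The record shapes
    are made up; substitute whatever exports you can actually get."""
    merged = (
        [(d["time"], "deploy", f"{d['service']} {d['version']}") for d in deployments] +
        [(c["time"], "change", c["description"]) for c in cm_events] +
        [(t["time"], "ticket", f"sev{t['sev']} #{t['id']}: {t['summary']}") for t in high_impact_tickets]
    )
    cutoff = datetime.fromisoformat(since)
    return sorted(e for e in merged if datetime.fromisoformat(e[0]) >= cutoff)

for when, kind, desc in recent_changes(
        deployments=[{"time": "2012-06-01T01:50:00", "service": "ordering", "version": "3.142"}],
        cm_events=[{"time": "2012-05-31T23:00:00", "description": "switch firmware upgrade in $DATA_CENTER"}],
        high_impact_tickets=[{"time": "2012-06-01T02:10:00", "sev": 1, "id": 444444, "summary": "50% fatal rate"}],
        since="2012-05-31T00:00:00"):
    print(when, kind, desc)
```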

Standardized On-Call Aliases

Once your company has more than one person supporting a product or service, you should create a standardized on-call alias for each group. Struggling to figure out who to page adds churn to engaging the proper people to fix an issue, and that's unacceptable, especially when the front-line team has a tight SLA to create a ticket with the proper information, host a call and herd the cats. For example, we used a format akin to "page-$SERVICE-primary" to reach the primary on-call for each major service. (page-ordering-primary, page-networking-primary, etc.) Ditto for each team's management/escalation rotation (page-$SERVICE-escalation). Managers change over time, and groups of managers can rotate through being an escalation contact. As a company grows, a front-line team can't be expected to remember that John is the new escalation point for ordering issues during a specific event.
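
A tiny helper that enforces the naming convention keeps alias sprawl (and typos at 3am) in check. A sketch, assuming the page-$SERVICE-$ROLE format described above:

```python
VALID_ROLES = {"primary", "secondary", "escalation"}

def oncall_alias(service, role="primary"):
    """Build the standardized paging alias for a service, e.g.
    oncall_alias('ordering') -> 'page-ordering-primary'."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown on-call role: {role}")
    return f"page-{service.strip().lower().replace(' ', '-')}-{role}"

print(oncall_alias("ordering"))                  # page-ordering-primary
print(oncall_alias("networking", "escalation"))  # page-networking-escalation
```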

Primary/Secondary/Management Escalation

When a group gets large enough to handle multiple on-call rotations, a secondary on-call rotation should be created for a few reasons. First, reducing the churn in finding the proper person to engage will decrease the mean time to engage/diagnose. Second, pages can be delayed/lost, engineers can sleep through events, etc. If you're in the middle of a high-impact event, losing money every minute, and restoring service hinges on engaging one person, then you're in a bad position. Finally, there are times when an event is just too large for one person to handle. For example, having a backup who can pore through logs while the primary is debugging an application will usually speed up MTTD/MTTR. Less context switching during a high-pressure call is a Good Thing. (see On-Call Responsibilities for expectations of on-calls).

Management escalation should be brought in if the root cause for a major outage lies in their court, if you’re unable to track down their primary or secondary on-call or if the person engaged in the call isn’t making sufficient progress. Managers should help find more resources to help with an issue and should also serve as a liaison between the resolvers ‘on the ground’ fixing the problem and senior management, where necessary. See Manager Responsibilities below.

Engagement vs Notification

There's a difference between engagement and notification during an event. Engagement is just that: the mechanism for calling in the right guns to diagnose and fix an issue. Notification is a summary of where you are in the process of diagnosing/fixing and should be sent to all interested parties, including senior management. Each of those messages should contain different information, and each audience group should also be managed differently.

Engagement

It’s my opinion that the list of folks who are engaged in fixing an issue should be controlled fairly tightly, else you risk the ‘peanut gallery’ causing the discussion to veer off track from the end goal of finding and resolving the root cause of the issue. At a previous company, we created engagement groups for each major bucket (ordering, networking, etc) and populated that with a particular set of aliases that would reach the on-calls of the groups typically involved/necessary in that type of event.

Engagement messages should contain ticket number and impact, contact information (conference call number, IRC channel, etc), and a brief description of the issue. If this is an escalation or out-of-band (engaging someone who isn’t on-call), include something to that effect in the message:

Plz join concall 888-888-8888, pin 33333. sev1 #444444, 50% fatal rate in $SERVICE. (John requests you)

Notification

Notification lists should be open subscription for anyone internally, but you should ensure that the right core set of people is on each list (VP of the product, customer service, etc). Even if a service isn’t directly related to the root cause of an issue, up- and downstream dependencies can impact it. Create notification lists for each major service group (networking, etc) so that people can be notified of problems with services that impact them, either directly or indirectly. The frequency of messages sent should be a part of the defined event management process, as should out-of-band notification practices for more sensitive events (communication with PR, legal, etc).

Notifications should include ticket number, brief description of the issue, who is engaged, whether root cause is known, ETA for fix and the current known impact. Be brief but descriptive with the message.

FYI: sev1 #444444, 50% fatal rate in $SERVICE. Networking, SysEng, $SERVICE engaged. Root cause: failed switch in $DATA_CENTER, ETA 20min

Incremental progress

Aliases: If you're just starting out or don't have an effective list management system, you can begin with a simple document or shared calendar containing who is responsible for each service. You can even go as simple as noting who the subject matter expert and group manager are for each team if the concept of an on-call doesn't exist yet, then build aliases as you canvass that information. Contacting each team to request that they update the doc when on-call responsibilities change probably won't be met with much resistance; you can sell it as, "if we know who to page, we won't page you in the middle of the night". Engineers should love that. If you utilize a system like IRC, it's fairly trivial to write bots that will allow 'checking in' as an on-call; storing that information in a flat file that can be read by another bot or script to engage them when necessary is a quick solution that doesn't require navigating to multiple places while spinning up a high-impact call.
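
The bot-plus-flat-file approach can be genuinely tiny. Here's a sketch of the storage half only (the IRC plumbing depends on whichever bot framework you use, so it's left out); the file location and command syntax are just placeholders:

```python
import json
from pathlib import Path

ONCALL_FILE = Path("/var/tmp/oncall.json")  # illustrative location for the flat file

def check_in(service, person):
    """Record that `person` is now the on-call for `service`; called by the IRC
    bot when someone types something like '!oncall ordering jdoe'."""
    data = json.loads(ONCALL_FILE.read_text()) if ONCALL_FILE.exists() else {}
    data[service] = person
    ONCALL_FILE.write_text(json.dumps(data, indent=2))

def who_is_oncall(service):
    """Read back by the engagement script/bot to figure out who to page."""
    if not ONCALL_FILE.exists():
        return None
    return json.loads(ONCALL_FILE.read_text()).get(service)

check_in("ordering", "jdoe")
print(who_is_oncall("ordering"))   # jdoe
```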

Engagement: Start with just using a standard template for both engagement and notification to get people used to the proper messaging. If you don't have a tool, then begin with either email or CLI (mail -s, anyone?), but make sure you add a copy of each message to the relevant ticket's work log so you have a timestamped record of who was contacted. Again, if you don't have an effective list management solution, create templates (and aliases, if you're running things from a command line).
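
In that spirit, here's a rough sketch of template-driven engagement/notification messages that also land in the ticket's work log. It assumes a local 'mail' command is available and stubs out the work-log call; swap both for whatever you actually run:

```python
import subprocess
from datetime import datetime, timezone

# Templates mirroring the engagement and notification examples earlier in the post.
ENGAGE_TMPL = "Plz join concall {bridge}, pin {pin}. sev{sev} #{ticket}, {summary}."
NOTIFY_TMPL = ("FYI: sev{sev} #{ticket}, {summary}. {engaged} engaged. "
               "Root cause: {root_cause}, ETA {eta}")  # notification path is symmetrical

def send_mail(recipients, subject, body):
    """Thin wrapper around the local 'mail' command (assumed to exist); swap in
    whatever actually reaches your pagers and mailing lists."""
    try:
        subprocess.run(["mail", "-s", subject, *recipients], input=body.encode(), check=False)
    except FileNotFoundError:
        print(f"(no local mail command; would have mailed {recipients}: {body})")

def log_to_ticket(ticket, recipients, body):
    """Stand-in for a ticketing work-log call: keep a timestamped record of
    exactly what was sent, and to whom."""
    stamp = datetime.now(timezone.utc).isoformat()
    print(f"[{stamp}] ticket {ticket}: sent to {', '.join(recipients)}: {body}")

def engage(recipients, **fields):
    body = ENGAGE_TMPL.format(**fields)
    send_mail(recipients, f"sev{fields['sev']} #{fields['ticket']}", body)
    log_to_ticket(fields["ticket"], recipients, body)

engage(["page-ordering-primary@example.com"], bridge="888-888-8888", pin="33333",
       sev=1, ticket=444444, summary="50% fatal rate in $SERVICE")
```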

During an Event

Leading an Event/Conference Call

“Call Leaders”. No matter how much process, automation, visibility and tooling you have, there are always those really gnarly events that need management by an Authority of some sort. Appointing a specific group of people who have deep understanding of the overall architecture and who own the proper mentality and clout within the organization to run those events will go a long way toward driving to root cause quickly and efficiently. Call Leaders should not be at the forefront of technical troubleshooting; they're on the call to maintain order and focus. These people should be well-respected, organized and knowledgeable. They also have to be a tad on the anal-retentive and overbearing side. Call Leaders are tasked with prioritizing, ensuring that appropriate escalation occurs, that progress is documented in the corresponding channel(s), that the correct communication flows to the proper people, that resolution of the issue is actually achieved & signed off on, and that post-event actions are assigned. As long as they don't over-rotate and step on the toes of the engineers who are fixing the issue, you're all good. Re-evaluating this core group every once in a while is a great thing to do. Depending on how frequently these leaders are engaged, burnout can be an issue. (Btw, for years, our front-line operations team served this function themselves. As we grew and became more distributed, we implemented the additional Call Leader tier, with the aim of focusing on better tooling and visibility to drive down the frequency with which that new tier was engaged.)

  • Documentation: While the front-line team should be adding ticket updates, the Call Leader is responsible for making sure that happens. If done properly (and in conjunction with updates to the status of an event in an Event Management tool), a Call Leader shouldn’t have to interrupt the flow of the call to brief newcomers about the state of the event, nor should they need to ask themselves, “now, about what time did that happen?” after the event is complete. It also allows interested parties outside of the flow of the call to follow along with the event without interrupting with those annoying, “what’s the status of this event?” questions.
  • Focus on resolution. Ask leading questions to focus service owners on resolving the immediate issue (see ‘Common Questions’ below). Once root cause of an issue has been discovered, engineers may have a tendency to dive directly into analysis before the customer experience has actually been restored. There’s plenty of time after an event to do that analysis.
  • Facilitate decision making. The more people participating in an event, the more difficult it can be to make the tough decisions (or sometimes even just the simple ones). Call Leaders should act as a facilitator and as a voice of reason when necessary. For example, the call on whether to roll back a critical software deployment supporting the launch of a new product isn't typically one you'd want an engineer to make. They don't need that stress along with trying to diagnose and fix a production issue. Since Call Leaders are typically tenured employees who understand the business, they should be able to engage the correct people and ask the proper questions to come to a decision quickly.
  • Escalate effectively. Pay attention to whether progress is being made on the call or whether anyone is struggling with either understanding the issue or handling the workload. Ask whether you can engage anyone else to help, but realize that engineers are notorious for not wanting to ask for help. Give it a few more minutes (this all depends on the urgency of the event), then ask, "who should I engage to help?". If an on-call doesn't offer a name, engage both the secondary on-call (if it exists) as well as the group manager. I usually say something along the lines of, "I'm going to grab John to help us understand this issue a bit better.", which is a fairly non-confrontational way of letting the on-call know that you're going to bring in additional resources.
  • Release unnecessary participants. No one likes to hang out on a call if they're not contributing to the resolution of the issue. Keeping the call pared down cuts down on unnecessary interrupts and also keeps on-calls happy. Prior to releasing anyone from the call, make sure that they have noted in the ticket that their service has passed health checks. (Remember to note in the ticket when the person dropped off the call for future reference!)
  • Running multiple conference calls. If you're managing an event that includes multiple efforts then it can be a good idea to split the call. Examples of this are a networking issue that spawns a data corruption issue, or an event with multiple symptoms and/or multiple possible triggers/root causes. Communication between the two camps can become unwieldy quickly, so if you don't have a secondary Call Leader, then utilize the group manager responsible for one of the issues. This necessitates a round of call leader training for primary managers, which ought to be completed in any case. This also makes it highly important that any proposed changes to the environment are added to your communication mechanism (ticket, IRC, etc) prior to making the change so that all parties involved in the event are aware. As you refine monitoring and visibility into the stack, those 'unknown root cause' events should happen less and less frequently.

Common Questions to Ask

Depending on the environment, there will be a subset of questions that you can always ask during an event to clarify the situation or guide the participants. These are a few that usually helped me when driving complex events in previous roles.

  1. What is the scope/impact of the event?
  2. What’s changed in the environment over the past X hours/days?
  3. What is the health of upstream and downstream dependencies for the service exhibiting outage symptoms?
  4. Is a rollback [of a deployment or change] relevant to consider?
  5. How complex is the issue? Are we close to finding root cause?
  6. Do we have everyone on the call we need?
  7. Is sufficient progress being made?
  8. How do we verify that root cause has been addressed?

Incremental progress

Use the front-line Ops team and managers if you don't have sufficient staff for a call leader rotation. Invest in creating and holding training sessions for all of the major participants in your typical events, regardless. Just providing them information on questions to ask and how to interact during an event will set the proper direction. (Remember to continue measuring your effectiveness and make adjustments often.)

Front-Line Ops Responsibilities

The front-line Ops team typically sees major issues first and is the nucleus of managing an event. The team is known as 'NOC', 'tier one', 'operators' or any number of other terms. Regardless of what they're called, they're at the heart of operations at any company, and they ought to feel trusted enough to be an equal partner in any event management process. They typically have a broad view of the site, have relationships with the major players in the company, and understand the services & tools extremely well. There's also some serious pressure on the team when push comes to shove, which comes with the following responsibilities.

  • SLAs. If you're dropping money or hurting your company's reputation every minute you're down, then it's vital that you define and adhere to SLAs for recognizing an event (owned by the monitoring application and service owner), submitting the tracking ticket, and engaging the appropriate people. The latter two responsibilities are owned by operations (or whoever is on the hook for ensuring site outages are recognized and addressed). I recommend keeping state within the trouble ticket about who you've engaged and why. We wound up building a feature into our event management tool that allowed resolvers to 'check in' to an event, which would add a timestamped entry into the tracking ticket (a rough sketch of that idea follows this list). This allowed anyone following along with the event, including tier-one support and the Call Leader (see below), to know who was actively engaged in the event at any given time. It also provided a leg up on building a post mortem timeline and correcting instances of late engagement by service owners.
  • Engagement and Notification. Ops should own the engagement and basic notification for each event. If you need to cobble together some shell scripts to do a 'mail -s' to a bunch of addresses in lieu of actual engagement lists to begin with, so be it! Just make sure it makes it into the ticket as quickly as possible so there's a timestamped record of when the engagement was sent. Ops is closest to the event and typically has a better understanding of which teams/individuals own pieces of the platform than anyone else. Call Leaders and service owners should request that someone be engaged into the event, rather than calling them directly. Not only does this allow other groups to focus on diagnosis/resolution, but it ensures that messages & the tracking of those messages are consistent. The exception to this should be more sensitive communication with senior management/PR/legal, which should be taken care of by the Call Leader, where relevant.
  • Documentation. Every person involved in an event should own portions of this. My opinion is that front-line ops should document who's been engaged, who's joined the event, who's been released from the event, any troubleshooting they've done themselves (links to graphs, alerts they've received, high-impact tickets cut around the same time), and contacts they've received from customer service, where applicable. Noting gaps as you go along ("we need a tool for that" or "missing monitoring here") will aid with identifying action items and creating the agenda for any required post mortem. Ops should also have an ear trained to the call at all times and should document progress if requested by the Call Leader or another service owner.
  • Aiding in troubleshooting. Each on-call is responsible for troubleshooting their own service, but there are times when the front-line Ops personnel see an issue from a higher level and can associate an issue in one service with an upstream or downstream dependency. Ops folks typically have a better grasp on systems fundamentals than software developers and can parse logs faster and more easily than their service owner counterparts. I'm a believer in 'doing everything you can', so if you have a front-line person who's able to go above and beyond while still taking care of their base responsibilities of engagement and notification, then why not encourage that?
  • Keeping call leaders honest. Sometimes even Call Leaders can get sidetracked by diving into root cause analysis prior to the customer experience being restored. Front-line Ops people should be following along with the event (they need to document and help troubleshoot anyway), and should partner with the Call Leader to ensure that service owners stay on track and focus remains on resolving the immediate issue.
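
As referenced in the SLAs bullet above, the 'check in' feature is mostly bookkeeping. A minimal sketch of the idea, with the event and its work log modeled as plain Python data; any resemblance to a real event management tool's API is coincidental:

```python
from datetime import datetime, timezone

def check_in(event, team, person):
    """Record that a resolver has joined the event; the timestamped entry doubles
    as post mortem timeline material and as an engagement-SLA audit trail."""
    event["worklog"].append({"at": datetime.now(timezone.utc).isoformat(),
                             "note": f"{person} ({team}) checked in to the event"})
    event["engaged"].add(team)

def check_out(event, team, person, health_note):
    """Release someone from the call, requiring a health-check note first."""
    event["worklog"].append({"at": datetime.now(timezone.utc).isoformat(),
                             "note": f"{person} ({team}) released: {health_note}"})
    event["engaged"].discard(team)

event = {"ticket": 444444, "worklog": [], "engaged": set()}
check_in(event, "networking", "jdoe")
check_out(event, "networking", "jdoe", "service passed health checks")
for entry in event["worklog"]:
    print(entry["at"], entry["note"])
```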

Incremental progress

This is a lot for a front-line team to cover, so pare down the responsibilities based on the organization’s needs. Engagement of the proper on-calls is imperative to reducing time to diagnose and resolve, so focus there first. If you have strong leaders to run and document events but still need to improve MTTD/MTTR, then concentrate the Ops team on providing on-calls with additional hands or visibility.

On-Call Responsibilities

A major goal of any IT Event Management process should be to enable engineers to act as subject matter experts and focus on diagnosing, resolving and preventing high-impact events. In exchange for this, on-calls should be asked to do only one thing: multi-task. 🙂

  • Be proactive. If you've discovered a sev1 condition, get a hold of the NOC/tier1/event management team or leader immediately. Submitting a ticket outside of the typical process will likely introduce delays or confusion in engagement.
  • Respond immediately. If you're engaged into a sev1 event, join it immediately and announce yourself & what team you're representing. A primary on-call should adhere to a tight SLA for engaging. Our SLA was 15 minutes from the time the page was sent to being online and on the conference call. This allowed time for the message to be received and for the on-call to log in. I'm not a fan of trying to define SLAs for actually resolving an issue; some problems are just really gnarly, especially once you're highly distributed, and it's just not controllable enough to measure and correct.
  • Take action. Immediately check the health of your service(s), rather than waiting for the Call Leader to task you with that.
  • Communicate. The worst thing to have on a conference call is silence when root cause is still unknown or when there isn't a clear plan to resolution. If you've found an anomaly, need assistance, are making progress, need to make a change to the environment or have determined that your service is healthy, make sure that the call is apprised of what you've found and that the ticket is updated with your findings.
  • Escalate. Don't be afraid to escalate to a secondary, manager or subject matter expert if appropriate. No one's going to think less of you. In fact, if you decrease the time to resolve the issue by escalating, you ought to be praised for it!
  • Restore service. Stay focused on restoring service. Leave root cause discussions until after the customer experience is reinstated unless it has direct bearing on actually fixing the issue.
  • Ask questions. If there's ever a question about ownership of a task, whether something's being/been looked at, what the symptoms are, etc., then ask the people on the call for clarification. Don't assume that everything is covered.
  • Offline conversations: These should be kept to a minimum to ensure that everyone is on the same page. It’s not just about knowing what changes are being made to the environment during troubleshooting, although you must understand this so that engineers don’t exacerbate the issue, trample on someone’s change, or cloud your understanding of just what change “made it all better”. Something as simple as an off-hand comment about a log entry can spur someone else on the call to think of an undocumented dependency, change to software, or any number of other things related to the event. There are times when spinning off a separate conversation to work through a messy & compartmentalized issue is a good thing. Check in with the Call Leader if you feel it’s a good idea to branch off into a separate discussion.

Troubleshooting Best Practices

Not all engineers have extensive experience in troubleshooting, so here are a few hints to help participants in an event.

  • Determine actual impact before diving headlong into diagnosing, where possible
  • Check the obvious
  • Start at the lowest level commensurate with the issue. For example, if monitoring or symptoms point to an issue that is contained in the database layer, it’s relevant to focus efforts there, rather than looking at front-end webserver logs.
  • Assume that something has changed in the environment until proven otherwise
  • Making changes to the environment:
    • don’t make more than one major change at the same time
    • keep track of the changes you’ve made
    • verify any changes made in support of troubleshooting
    • be prepared to roll back any change you’ve made
  • Ask “if…. then….” questions

Manager Responsibilities

  • Take Ops seriously. Support your team’s operational responsibilities. Contribute to discussions regarding new processes and tools, and encourage your team to do the same. Take operational overhead into account when building your project slate; carve out time for basic on-call duties, post-launch re-factoring, and addressing operational action items where possible.
  • Prepare your engineers. Make sure that anyone who joins the on-call rotation receives training on the architecture they support, the tools used in the company, and who their escalation contacts are (subject matter experts, usually), and is provided with relevant supporting documentation.
  • Reachability. As the management escalation, you should ensure that Ops has your contact information, or your escalation rotation's alias. You should also have offline contact information for each of your team members.
  • Protect your engineers. During a call, there may be times when multiple people are badgering your on-call for disparate information. As a manager, you should deflect and/or prioritize these requests so that your engineer can focus on diagnosing the issue and restoring service.
  • Assist the call leader. You may be called upon to help make tough decisions such as rolling back software in support of a critical launch. Be prepared and willing to have that conversation. You are also the escalation contact for determining what additional resources can/should be engaged, and you may be asked to run a secondary conference call/chat, where necessary.
  • Help maintain a sense of urgency. Efforts to find root cause can languish as the duration of an event lengthens. Keep your on-call motivated, and get them help if need be. Keep them focused on restoring the customer experience, and remove any road blocks quickly and effectively.
  • Post-event actions. If the root cause of the event resides in your service stack(s), you will be asked to own and drive post-event actions, which may include holding a post mortem, tracking action items, and addressing any follow-up communication where relevant.

Post-event Actions

For events with widespread impact, a post mortem should be held within 1-2 business days of the event. If you've documented the ticket properly, this will be fairly simple to prepare for. Either the group manager or the Call Leader will facilitate the meeting, which typically covers a brief description of the issue, major points in the timeline of the event, information on trigger, root cause & resolution, lessons learned and short- & long-term action items. Participants should include the on-call(s) and group manager(s), the call leader, and the member(s) of the Ops team at a minimum. It may also include senior management or members of disparate teams across the organization, depending on the type of event and outstanding actions.

Action items must have a clear owner and due date. Even if the owner is unsure of the root cause and therefore can’t provide an initial ETA on a complete fix, a ‘date for a date’ applies. Make sure to cover the ‘soft’ deliverables such as communicating learnings across the organization, building best practices, or performing audits or upgrades across the platform.
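
Even a flat list of records can enforce the 'clear owner and due date' rule. A small sketch, with invented field names and examples:

```python
from datetime import date

def new_action_item(description, owner, due, long_term=False):
    """Refuse to record an action item without an owner and a due date; when the
    real fix can't be scoped yet, the due date is the 'date for a date'."""
    if not owner or due is None:
        raise ValueError("every action item needs an owner and a due date")
    return {"description": description, "owner": owner,
            "due": due.isoformat(), "long_term": long_term, "done": False}

items = [
    new_action_item("Add alerting on $SERVICE fatal rate", "jdoe", date(2012, 6, 15)),
    new_action_item("Date for a date: redundancy plan for the $DATA_CENTER switch tier",
                    "networking-manager", date(2012, 6, 8), long_term=True),
]
overdue = [i for i in items if not i["done"] and date.fromisoformat(i["due"]) < date.today()]
print(f"{len(overdue)} overdue action item(s)")
```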

17 thoughts on “Event Management”

  1. Your “Escalate effectively” section could probably be expanded upon a lot more.

    I'm not quite sure that your characterization of engineers as not wanting help is entirely accurate. I'm sure that happens sometimes, but more commonly what I ran into is that an engineer gets an idea and starts down the path of trying to prove that it is the problem or trying to fix it, without necessarily fully connecting the dots. Since it is in their sphere of influence and it may be something that legitimately is an architectural problem, they wind up going down a rathole without looking for other possibilities or engaging other resources. It's kind of an "if you have a hammer" sort of problem. Outages can go on for hours while an engineer or two on one particular team focuses on a problem that isn't really causing the sev1 outage.

    On the other hand, there's nothing more demoralizing for the network engineering team than to simply be paged on every single sev1 call without anyone in a tier1/tier2 capacity actually able to determine that there's a networking issue. Instrumenting the infrastructure well enough so that troubleshooters in tier1/tier2 can get a pretty good handle on whether the problem is app, system, database or network can go a long way towards helping this. The habit of app devs to blame the network, the database, or the systems when it's really their code is also a problem that I don't really know how to fix, other than to be able to debug really well and resolve any circular firing squads that arise. At my last job we drew up one of those flowcharts where all the lines eventually led to "blame the network", which was sort of how troubleshooting went, very demoralizing.

    Hiring is also something that you missed? =)

    You need to have troubleshooters that don’t focus on voodoo and cargo cult debugging, but actually know how to use tools to debug issues. Part of avoiding the hours-long-outage-looking-at-the-wrong-problem issue is just in hiring people that can use tools to logically go from the symptoms of the problem back to finding the real root cause. In the unix world folks that know how to use tools like strace and read thread dumps and can explain what those tools are doing are the kinds of troubleshooters you should hire.

    Fixing bad process is much harder than implementing good process from scratch

    Once you’ve got a manager and a group of people who are getting paid salaries and have their jobs and their duties it becomes much harder to unwind bad policy decisions. If the manager of operations is actually the problem I think the only usable solution is to fire them, but change probably needs to come from outside of the organization (i.e. they should probably start recruiting you…). It may still be difficult to accomplish cultural change even from that position as people become used to certain dysfunctional separations of job duties. If the workplace isn’t already dynamic enough then change can be scary for career managers that have been there for too long.

    Other than those ‘nits’, though it sounds oddly familiar and sensible… =)

    1. Amen to all of that, G! I definitely wasn’t attempting to say that engineers don’t want to help- I think you’re correct that the “blockage” can usually be attributed to someone not asking the pertinent questions and not pursuing multiple possible root causes in parallel (with multiple resources, of course).

      Hiring would have been an interesting section- it’s not just about hiring the right operational personnel, but well-rounded engineers and managers who can keep their cool, see the trees for the forest, and direct resources in concert with the participants rather than being overbearing. Would absolutely love to hear about how you screen for the operational mind set in the software engineering world. I suppose starting with a semi-complex troubleshooting question would tell you quite a bit about someone’s ability to deconstruct an event. It’d also tell you a lot about someone’s thought process, how stubborn they are, how they deal with unexpected blockers, and a whole slew of other traits. We do it for Ops folks- dunno why we wouldn’t do it for devs too!

      Agreed- fixing a bad process with an existing culture is more difficult. When people get comfortable with something, you can run into an “if it ain’t broke” scenario pretty easily. That’s where metrics come in, of course. It helps if you can equate outage minutes to abandon rates (or decreased customer confidence, which is more difficult to measure) and on to the bottom line & your stock price. People or organizations who augur into the “way it’s always been” are holding the company back. That being said, I’ve yet to come across a process that was 100% broken, so perhaps you can win the more inflexible folks over by asking them to help craft a process that will contain the best parts of the current process. You’re right though- at some point you may need support from outside the org to get the adoption necessary.

      And isn't it always the network? Can't be applications that don't handle latency or packet loss well, don't have redundancy, don't play well with failover scenarios or any sort of backoff/graceful failure mechanisms, right? Huh…. 😉

      Thanks G!

  2. Thanks, interesting post.

    I’ve worked at a couple of big shops too; one runs identically to the process you describe.

    The other provides another perspective. There, large external products (or shared infrastructure services) each have a dedicated operations team who own and run its production systems, and maintain an oncall rotation. These teams run independently but – crucially – share an ethos, best practices and senior management separate to the development organization. In aggregate, they are the company’s experts on operating large-scale Internet services and work with developers to build and deploy scalable, reliable systems.

    I suppose you could say that the oncall engineer for any particular product or service rolls up the ‘front line’, ‘responder’ and often ‘call leader’ functions you identify.

    To give an example, customer-facing product P may find that it’s serving 500s, causing alerts to P-omg. She takes any short-term mitigation steps available (e.g. failing traffic over to a “good” datacenter at the cost of a little latency) and then digs into it. Realizing the problem may be in either of backend services Q or R, she pages in those folks. Q-omg confirms the issue is theirs, and maybe wakes some developers to help out if it’s really hairy. After the event, P-team request a postmortem from Q-team to understand the details of what went wrong and identify followup items. Rinse and repeat.

    This works well, but makes for different organizational structures, hiring and training too – e.g. the (potentially more junior) front-line operations team just doesn’t exist, and every member of every oncall rotation needs to be up to the tasks of difficult troubleshooting in their domain as well as communication and coordination. The resulting esprit de corps is something to behold though. :o)

    Cian

    1. Yeah, we were lucky in that we could afford a hybrid between the silo’d support structure you mention (only for some major services) and the centralized support model. Our front-line team was comprised of mid-level systems and network engineers, and we didn’t provide application support. Application groups provided their own on-call support, thankfully. (that’s a separate post that I’d like to write about). Having on-calls run their own events is one option for an organization which just isn’t staffed out to cover the front-line/call leader model. I would say that troubleshooting, documenting, engaging and coordinating is a lot to ask of an on-call engineer in a distributed environment where you may have 5 or 6 major service teams all trying to figure out inter-dependencies and upstream/downstream impacts. For non-sev1 issues, developers did indeed provide that function, but when everything hit the fan, we called in folks who could manage the event & free up the service owner for actual troubleshooting. In my experience, companies tend to hire for technical skills, whereas the ‘soft’ skills of multi-tasking and communication are often overlooked, which is a shame. Maybe if those skills were cultivated in CS departments we’d have a simpler time finding that well-rounded engineer who could handle ‘all of the above’ while still maintaining the Subject Matter Expertness required of the role!!

      Would love to see that model in action. If devs are asked to provide their own Event Management, I bet they came up with some awesome 'efficiencies' to lighten the load! 😉

      Steph

  3. The sad fact is that many organisations deprecate process because they figure that by hiring enough gnarly engineers with adrenaline habits, they'll just magically make it work. But we wouldn't know anything about that 🙂

    Nice writing Steph.

  4. I rarely leave a response, but I did some searching and wound up here Event Management Stephanie Dean….. And I do have 2 questions for you if you usually do not mind. Is it just me or does it look like some of these remarks look as if they are coming from brain dead folks? 😛 And, if you are writing at additional online sites, I would like to follow everything fresh you have to post. Would you make a list of every one of all your shared sites like your Facebook page, twitter feed, or linkedin profile?

    1. Heh… thanks Teddy! Yeah, most of what I write about seems like common sense, but I think I've re-learned something myself on every post, just by reiterating it or by reading people's comments. And for some people (new managers or engineers, people who've never had a mentor, etc.), it might be the first time they're hearing about some of these concepts. I rarely use other sites because it takes too much of my time to maintain them, but I'll usually post a quick heads-up to twitter each time I post (@sdean185). 🙂 Thanks!!

  5. Hi Stephanie — I’m a fellow ex-Amazonian, although I think I joined shortly after you’d left. I was pointed at this blog post when I asked on Twitter for good incident response practices outside of Amazon’s “Iron Curtain”. Thanks for writing it all up — this is solid gold stuff!
