Introducing a New Project (to Ops)

Everyone owns slightly (or drastically) differing opinions about how an IT project ought to be managed. I don’t believe in only one correct way to run a project. The best project teams I work with or observe adapt in various ways: project by project, team by team, customer by customer. Regardless of the method, I expect everyone involved in a project to share some core principles:

  • Adhere to Defined Roles
  • Adhere to Defined Processes
  • Communicate

I’ll focus on the introduction of a hypothetical project of some size and breadth to illustrate how these principles should be applied to make the most of the project team at the outset. The first few days or weeks are the most critical for setting the tone and laying the foundation for the remainder of the assignment. For purposes of this scenario, the production operations organization as a whole is involved in 20+ disparate projects this year, and every one of them is “high priority”. At best, we have two to three engineers focused on specialized areas, from networking to storage to infrastructure core services (LDAP, DNS, etc.).

The introduction of a project is vitally important to its success, yet without a formal process that introduction is marginal at best. Assuming there is no well-defined process, I can learn about new demands on my team in myriad ways, including:

  • hallway conversations, either directly or overheard
  • random mention in an email
  • second-hand through a member of any number of non-Ops teams
  • request for resources to help with a build-out within a ticket to an engineering queue
  • random entry in a project plan
  • buried in a response to a separate, unrelated query

None of these can or should be considered a “process”.

Introduction of the Project

Here is how this particular project makes its way through the Ops organization. At this point in the project, a commitment for delivery in two months’ time (40 working days) has already been made to the customer, and the PM has already furnished an update to the customer and senior management that the project is on track. No project plan or list of requirements has been furnished to our org.

During this same time frame, our team is on the hook for delivering two other high-priority projects. Our resource plans show that we’re already 25 man days in the hole without considering this new request, and various engineers have been asked to put in extra time to get us over the hump. We’ve also learned that the project was first introduced to the software development organization four weeks ago.

  • Initial contact with the team comes from three sources: an URGENT email to me from a PM asking for resources to build out a test infrastructure (this indicates it’s already an escalation), a second request from the same PM directly to an ops ticket queue asking for work to be done on the project, and a hallway conversation between one of the Ops managers and a developer on the project with a heads-up that development work is already underway.
  • Early estimates provided to Ops come from software developers who don’t have the time or background to understand our infrastructure or support services well enough to estimate what needs to be delivered.
  • In the next Ops Management meeting, we assign a senior engineering resource to participate in design conversations with the lead developers, and a resource to build out the test architecture once we’ve vetted the design. (The request to build out a test platform makes no sense when we don’t even understand whether the design itself will work in our infrastructure.)
  • The scheduling of meetings between the engineers and the facilitation of the discussions falls on our management team. One of our managers will now slip one or more of his own priority deliverables in order to take on this portion of the project.
  • After our engineers complete their design review discussions, it is apparent that the initial resource estimates based on the information we’ve received from the developers are woefully inadequate. The project requires 3 full man weeks of Network Engineering time for design and implementation, a storage engineer for a full month, four weeks of front-line Ops work for deployment of infrastructure, software and testing, and a week of a capacity planning engineer’s time.
  • The project also requires additional hardware which was not factored into our recent server purchase, so we must determine whether we can recoup machines from other projects or place an expedited buy with one of our vendors. If we determine that we need to “steal” hardware from another project, we add another two weeks of effort from our front-line Ops team.
  • In the middle of trying to work out whether we have enough resources to hit the already-promised deadline, I receive an escalation from the PM for resources to build out the test infrastructure, and am pinged by my boss on why we don’t already have a solid Ops project plan for the initiative. It’s kind of getting ridiculous at this point.

This project is clearly not on track, and due to lack of proactive communication and solid planning, Ops is the blocker right out of the gate. (And you wonder why Ops engineers tend to be surly!) The exercise above has taken five days to complete, which means we’re now down to 35 working days to meet the deadline committed to on our behalf. There is zero chance that we can deliver on this project without missing deliverables on at least one of our other high-priority projects.
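For those keeping score, here’s the back-of-the-envelope math as a quick sketch. The only number I’ve invented is the conversion of the storage engineer’s “full month” into roughly 20 working days; everything else comes from the estimates above.

```python
# Back-of-the-envelope tally of the commitment vs. the Ops estimates above.
# Assumes 5 working days per week and ~20 working days per month.

calendar_days_committed = 40                          # promised to the customer
calendar_days_elapsed = 5                             # burned just getting the project introduced
calendar_days_left = calendar_days_committed - calendar_days_elapsed   # 35

new_effort_days = {
    "network engineering": 3 * 5,    # 3 man weeks of design and implementation
    "storage engineering": 20,       # a full month (assumption: ~20 working days)
    "front-line ops": 4 * 5,         # infrastructure, software and testing
    "capacity planning": 5,          # 1 week
    "hardware reclamation": 2 * 5,   # only if we have to "steal" machines from another project
}

print(f"{calendar_days_left} working days left on the calendar")
print(f"{sum(new_effort_days.values())} man days of new Ops effort required")
print("...and we started 25 man days in the hole on our existing projects.")
```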

Lessons Learned

Based on the core principles listed in the introduction, we have a number of items of feedback for various parties involved in the project.

Adhere to Defined Roles

  1. The roles and responsibilities for everyone involved in the project weren’t clearly defined, including the correct points of contact and decision makers within Ops. This led to the PM contacting both me and an engineer for the same work, as well as a developer pinging another Ops manager.
  2. Because there was no single owner, we risked either duplication of effort or no effort, both of which are inauspicious ways to kick off a project.
  3. Ops’ absence in determining the delivery dates for the project meant that the customer received a promised completion date that the project team could not honour.

Adhere to Defined Processes

  1. Multiple processes were used for introduction of the project into Ops.
  2. Because there was no agreement on how to introduce the project into our organization, I received an escalation contact prior to an initial request for resources.
  3. No regular update was given over the first four weeks of the project; Ops might have been able to scramble well enough to hit the deadline had we known about the project earlier.

Communicate

  1. The lack of a clear project plan or requirements list forced us to spend time gleaning that information from various sources prior to making progress on planning or execution.
  2. Incorrect communication to the customer about our ability to meet the deadline put us under the gun to deliver from the outset.

Preferred Introduction Process

Here is how I would expect a project of any weight to come across our organization’s radar. I am not saying that adhering to this particular process will remove resource “crunches”, late-binding requests, or competing high-priority projects. It will, however, allow for control over how projects are prioritized and how resources are distributed across them.

  1. During initial conversations regarding a new project, the PM and lead developer answer a base set of operational questions provided by the Ops team. The list of questions will vary depending on the request type.
  2. Answers to the questions are submitted to the Ops management team to be discussed internally. That discussion yields basic resource estimates and the assignment of an operational engineer to participate in design reviews.
  3. Shortly afterward, a project review meeting is held with all stakeholders to discuss requirements, resourcing, timing, and roles & responsibilities within the project team. The ops organization should send the engineer(s) slated to work on the project, at least one manager who can represent the org, and the PM responsible for the operational work.
  4. A full project plan should be published following the meeting, with clear requirements, deliverables and milestones defined. Project plans and meeting notes should be posted to an easily-accessible file share.
  5. The PM should review the plan with the customer quickly, and any requested amendments should be re-visited with the project team immediately.
  6. After the review of customer feedback, the project plan is locked down, and any new requests are considered “scope creep” and added to the project’s backlog in priority order. Each high-priority request should be discussed within the project team and should include stakeholders in order to determine whether it can be accommodated prior to committing to it.
  7. Regular succinct project updates should be furnished to all stakeholders to avoid last-minute escalations when a milestone is in danger of slipping.

Initial Project Meetings

The first meeting in any project should be a simple high-level walk-through of the project with whoever can define the proper resources for the initiative. This should happen as soon as the project has been approved and really shouldn’t last longer than 20 or 30 minutes. All we care about is getting the correct people in the room for the full scoping exercise and understanding the priority of the project compared to the other initiatives on the slate.

The second meeting should be of longer duration and include all of the resources necessary for delivery. I would much rather schedule a half day for this one than to head into a large project without all of the critical information. The agenda should cover defining the roles and responsibilities of each team member and ensuring that every participant understands the requirements well enough to scope their deliverables and provide reasonable time estimates for milestones. Engineers expect very clearly-defined roles and responsibilities up front, and I can’t blame them. Worrying about who they have to talk to about specific portions of a project, who makes what decisions, what bits of information are important and which aren’t, or who owns specific pieces of the technology direction or implementation just reduces their ability to do their jobs properly. Some questions to consider for that second meeting:

  • Who is the primary customer, and have they provided sufficient requirements?
  • What are the requirements and end goal of the project?
  • What is the priority of the project compared to other current initiatives? (while this should be addressed in the initial meetings, the engineers should also understand the relative priorities)
  • What is the communication channel with the customer?
  • Who is the subject matter expert for each major piece of technology?
  • Who makes decisions on the technical direction of the project? What happens if there is a stalemate on making a decision?

And the ‘soft’ stuff:

  • What is the escalation process? What is or isn’t worthy of escalation? Where does the buck stop?
  • What is the project manager’s role in the project?
  • What is the expectation around scoping dates for deliverables and milestones? Is contingency time added into each deliverable? Does the project itself have padding to allow for slips?
  • How often are the engineers expected to give an update on specific deliverables versus the overall progress toward the next milestone?
  • What do various stakeholders want to know about on a regular basis, and how would they like to see that information?

Answers to these questions should all be documented in an easily-accessible and visible place in the project’s doc repository from the beginning. Quite a bit of the information can be defined on an org- or company-wide level and shared across projects. I would expect a link to the document to be included in the header of the project plan and of every project meeting agenda. Any new member of the project should review the information with the PM prior to any participation in the project. Every manager who has an engineer involved in the project should also be familiar with the info, particularly anything to do with resourcing.

Regular project meetings should be scheduled (scrum/agile, waterfall, whatever). Every person with an outstanding deliverable, anyone expected to provide technical support and any “decision maker” for outstanding questions should attend or provide an update prior to the meeting if they’re unavailable. I’m a fan of stakeholders staying away from these meetings unless a prioritization discussion needs to happen. Too many pointy-haired bosses detract from actual engineering progress.

the never-ending battle

Supporting Thoughts

Aside from the introduction of the project, there are many other facets of project management that are important for ensuring smooth delivery. Communication, defining roles and processes, and reporting at the right level are all integral to a project’s ongoing success.

Communication Is Most Important

The most important responsibility of a PM is not only to facilitate communication, but to communicate clearly and crisply herself. This is important for the other participants as well, but engineers are typically paid to focus on technical delivery first and foremost.

Don’t be afraid of people. A PM shouldn’t shy away from human interaction (face-to-face, video conference, phone) and at least 50% of her day should be filled with talking to people (I pulled that number out of thin air, but it seems pretty reasonable). Like it or not, you can’t get the entire context of why a deliverable is late or why an engineer is spending “so much time” on a particular deliverable from a tracking ticket.

Minimize interruptions. I would also expect a PM to strike a balance and be reasonable with the interruptions. If a deliverable is due in two weeks and yesterday’s update shows that it’s on track, I don’t see any reason to interrupt an engineer for yet-another-update. Everyone’s time is at a premium every day. Use discretion, common sense, and the guidelines that were set up in the initial meetings to determine whether that daily update is really necessary.

Be succinct. I don’t know too many people who enjoy reading a 3-page project update or listening to a 5-minute monologue, only to learn that the project is on track and there’s nothing to worry about. Verbal diarrhea during project review meetings is pretty terrible to sit through.

Call out important points explicitly in email. Use bold fonts, red colour or something similar to call attention to new action items or issues. I’m a fan of the ol’ TL;DR format. A bulleted list containing what I honestly need to know about at the top of an email makes the team more efficient and appreciative.

Project Roles and Responsibilities

I believe that a Project Manager who is expected to lead an Ops project should have a basic understanding of operational concepts. There are varied opinions on this, and some will say that if a PM asks the right questions, it doesn’t matter whether they understand systems or networking. I say that’s hogwash (because I’m old), and that if an Infrastructure PM doesn’t know enough about managing a system with cfengine/puppet/chef/etc, then they can’t guide the conversations, ask the right questions or help work out dependencies. At that point, they’re a Project Coordinator (big difference), and I shouldn’t be expecting them to drive a project on my team’s behalf.

PC/PM/PPM

I believe there’s a difference between a project coordinator, a project manager, and a program manager. Too often, these roles get confused or just plain aren’t defined, which leads to incorrect expectations from those involved in the project. Each organization needs to set their own job descriptions, but here are mine:

Project Coordinator: Project Coordinators are purely tactical. They get told what to do & how to do it, then it just magically gets handled. These are the people I prefer to work with. Part of it is a trust thing. I don’t dole out the responsibility for managing my teams’ resources easily, mostly due to the points I make later in this post. Part of it is that I’ve historically had strong technical leadership on my teams who can manage scoping and resourcing fairly well on their own.

[Technical] Project Manager: A really good project manager understands the resources and deliverables well enough to make suggestions about how to scope a project. They should also have enough knowledge to formulate considered, relevant questions to spur the right discussions. How does this project for X service impact its interaction with Y downstream service? I would not expect a TPM to know the answer, but I would expect a good one to be able to ask the question and grok the answer. I would also expect a solid TPM to call BS on time estimates that are just obviously way off base. The build-out of a new data centre is scoped at 3 days & you have no tools/automation to facilitate that? Dude.

[Technical] Program Manager: I confess to not having worked with too many honest-to-gosh program managers in my career, so I’ll just say that I would expect them to have a larger, more high-level view of really gnarly initiatives spanning multiple organizations, and to manage other project managers or coordinators who would take care of the lower-level responsibilities such as task tracking, updating project plans and keeping track of resources.

Other Project Team Roles

Engineer: Engineers should be focused on technical delivery as much as possible. There are a few other things that need to happen to facilitate that delivery.

  • Be realistic and take your time when asked for estimates on deliverables and milestones. Then pad those estimates, based on the project guidelines.
  • Know your audience. Keep updates less technical when necessary.
  • Learn about escalation – when, why, how. Escalation isn’t always a “bad thing”. Sometimes asking for assistance or feedback is the best thing you can do for yourself, the team and the project. If you’re already overloaded & don’t tell anyone about it, then we’ll continue to assume that you have the time to hit your deadlines or take on new work.
  • Being proactive in relaying progress, blockers and risks will enable the PM to remove those blockers and escalate issues before they become hair-on-fire scenarios.
  • Provide feedback into the process constructively. Yes, Ops Engineers have that well-understood culture of…. curmudgeon-itis. Let it go, and be productive in your discussions with PMs. They honestly are trying to make your life easier.
  • Be patient: not every PM or customer has perfect technical understanding. If they did, they’d be your engineering team mate.

Manager: First point of escalation. In Operations, managers are typically better-poised to assign and manage resources across multiple projects. Project managers aren’t usually deep enough into the scope of our work load across all projects or the specialized skill sets within the team to address this themselves. We do include the PM in the conversations so that he understands what’s involved in the process.

  • Listen to the PM when he escalates to you. If something doesn’t make sense, then ask the relevant questions to get to the heart of the issue & then use the interaction to refine the escalation processes.
  • Hold engineers accountable for hitting their deliverables.
  • Listen to your engineer when she tells you that she’s overloaded, that a deliverable or milestone is in danger of slipping, or that the team flat-out missed something in scoping exercises. Not listening just sets the team up for either heroic efforts or missed deadlines.

“Decision Maker”: When there are conflicts regarding the direction or prioritization of a project, the buck stops with this person. For some projects, there may be multiple decision makers for technical issues or cross-project prioritization. You are responsible for ensuring you have all of the relevant information to make a considered decision. Make sure you are updated on the progress of the project and any potential conflicts. If you don’t have enough information to make a specific decision, then be vocal and ask as many questions as necessary to make the call quickly.

Define the Escalation Process

This could have been included in the section above, but to me, escalation is such a critical component of managing a major project that it deserves to be called out separately. While you’re defining the key decision makers in the initial meetings referenced above, you might as well define the escalation process, including details about what’s worthy of escalation and what isn’t, the method for escalating an issue, where the buck stops, and how to communicate that an issue has been escalated. For example, depending on your culture, it might not be acceptable to notify the entire project team if one person’s deliverable slips. A portion of this information could (should?) be defined at an org or even company level. Then all you have to do is play fill-in-the-blank with the owners for your specific project.

Defining this up front encourages people to communicate in the most sensible and productive way from the beginning, teaches participants to have open, frank conversations to try to work out issues on their own (no one likes being escalated on), and it saves the ‘punch’ of escalating for things that are legitimately critical. As a manager, there are few things worse than realizing that you’ve missed a critical escalation because it’s gotten lost in the noise of a million other unimportant pings.

Reporting

Stakeholder Updates

Each audience involved in a project requires different types and amounts of communication. I honestly didn’t understand this until I became a manager of managers and had multiple teams and projects to keep track of. As an engineer, I wanted to drink from the proverbial fire hose and have access to every tidbit of information possible. Nowadays, I manage four teams spread across approximately 15 projects at any given time. On average, I have somewhere between 10 and 15 minutes to devote to any one project per day.

Project Team Communication

The most valuable information for project team members consists of updated project plans, progress against blockers they’ve raised, any changes to project scope, and changes to project timelines. I’m a fan of the daily scrum, where the team can all get on the same page in 15 minutes or less. It’s efficient when customized to your particular organization. Notes should be sent out after each meeting; simple bullet points should be sufficient.

Executive Summaries

There’s a difference between Executive Summaries and communication meant for a technical project team. If I’m asked to read and sift through a 3-page treatise in order to find the one embedded action item meant for myself or my team, it’s useless and a waste of my time. To boil it down, this is what I would expect:

  1. Is the project red/yellow/green based on the current milestone? If I see green, I need to be able to trust that things are on track and I can safely file it away.
  2. If the project is in jeopardy, give me a bulleted list of the blockers: who owns them, what the ECD (estimated completion date) is, and what I actually need to worry about. I don’t want to see more than two sentences for any one bullet point. Do I need to find another resource? Do I need to step in and help guide things along?
  3. Did we complete a major milestone on time? Did one of my team members go above and beyond to improve the design, help the project team, etc?

General Status Reports

Include positives like hitting milestones. Make the communication clear enough and at the right technical level for the least-technical stakeholder who honestly needs to understand what’s going on with the project. C-level people and managers in organizations not involved in the project might be in that group (it all depends on your own org though, of course).

Basic Ops Questionnaire

It’s always a good idea to make it as simple as possible for PMs or developers to provide the right ops-related information for an internally- or externally-driven project. If you don’t have a formal NPI (new project initiative) process where these questions are codified, a simple document should suffice. We’ve begun using a Google form open to anyone within the company for these types of requests. Some of the questions we expect answers for include:

  • What is the request? X customer would like us to host a few customer service tools in our data centre. We would like to extend Y service to serve global requests, rather than serving requests from just Z service.
  • Who are the main technical and project contacts?
  • Who will be administering the service? If said service is ‘hosted’ rather than an internal one.
  • Will you need authentication for your administrators?
  • Does any data need to be stored locally on the machines? If so, what types?
  • Is there an admin UI or set of management scripts to administer the app? Or just backend processing?
  • Is this a mission critical service? Do you require notification during scheduled/unscheduled events?
  • How will the service be monitored?
  • Do you need server/network redundancy? Honestly, I can’t imagine ever putting something into production that isn’t redundant, but some people just don’t want to pay for it!
  • Are there any specialized hardware requirements for computation, storage, etc?
  • Do you have capacity numbers through the next 12 months? If not, when do you expect to have them?
  • Is integration with other services internal or external to our infrastructure required?
  • What security considerations or concerns do you have, if any?
  • How will you handle code deployments?
  • How will you handle packaging?
  • When do you need the environment(s) in place?
  • Are you prepared to furnish operational run books?

This is by no means an exhaustive list, but it helps drive the proper conversations from the beginning, and it also helps Ops management determine which resources to devote to a particular project.
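If you want to go a step beyond a document or a shared form, the same list can live as data that tooling can check. Here’s a bare-bones sketch; the field names and the helper are mine and purely illustrative, but the point is that Ops can see at a glance what a PM still owes us before scoping starts.

```python
# Illustrative encoding of the Ops intake questionnaire. Field names are
# hypothetical; adapt them to your own NPI process.

OPS_INTAKE_QUESTIONS = [
    ("request_summary", "What is the request?"),
    ("contacts", "Who are the main technical and project contacts?"),
    ("service_admins", "Who will be administering the service?"),
    ("admin_auth", "Will you need authentication for your administrators?"),
    ("local_data", "Does any data need to be stored locally on the machines? If so, what types?"),
    ("admin_interface", "Is there an admin UI or set of management scripts, or just backend processing?"),
    ("mission_critical", "Is this a mission critical service? Do you require notification during events?"),
    ("monitoring", "How will the service be monitored?"),
    ("redundancy", "Do you need server/network redundancy?"),
    ("special_hardware", "Are there any specialized hardware requirements for computation, storage, etc.?"),
    ("capacity_forecast", "Do you have capacity numbers through the next 12 months?"),
    ("integrations", "Is integration with other internal or external services required?"),
    ("security", "What security considerations or concerns do you have, if any?"),
    ("deploy_and_packaging", "How will you handle code deployments and packaging?"),
    ("needed_by", "When do you need the environment(s) in place?"),
    ("runbooks", "Are you prepared to furnish operational run books?"),
]

def missing_answers(submission: dict) -> list[str]:
    """Return the questions a requester still owes Ops before scoping can start."""
    return [question for key, question in OPS_INTAKE_QUESTIONS if not submission.get(key)]
```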


IT Change Management

No IT Ops process incites eye-rolling more quickly or more often than IT Change Management (CM), which is why this post has taken me almost 3 months to finish off. I’ve been shuddering at the thought of opening up that can o’ worms. But it’s good for the business and good for engineers, so it’s gotta be done. Attaching your name to changes made in the environment isn’t a bad thing. In fact, a CM process can save your bacon (mmmm… bacon…) if you’ve been diligent.

The basic purposes of a CM process are to

  • enable engineers to manage change within the environment while minimizing impact
  • provide notification and documentation for potentially-affected partners/customers, as well as peer/dependent groups who may need to troubleshoot behind you
  • furnish checks and balances- an opportunity to have multiple sets of eyes looking at potentially-impacting changes to the environment

The amount of change in your architecture escalates as the environment becomes larger and more distributed. As the number of services in your stack increases, the dependencies between services may evolve into a convoluted morass, making it difficult to diagnose major issues when they occur. Tracking changes in a distributed environment is a critical factor to reducing MTTD/MTTR. Implementing a CM process and tool set is a great way to enable the necessary visibility into your architecture.
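To make that concrete, the pay-off of a change-record store is being able to answer “what changed in the last few hours?” with a query instead of a conference call. This is a minimal, hypothetical sketch; the records and field names are invented, and in real life the source would be your CM tool’s database or API.

```python
from datetime import datetime, timedelta

# Invented records; in practice these come from your CM tool's database or API.
change_log = [
    {"service": "payments", "summary": "config push",         "start": datetime(2011, 6, 1, 14, 0)},
    {"service": "network",  "summary": "core router upgrade", "start": datetime(2011, 6, 1, 16, 30)},
]

def recent_changes(log, hours=4, now=None):
    """Return the changes that started within the last `hours` hours - the first
    thing to check when an event kicks off."""
    now = now or datetime.now()
    cutoff = now - timedelta(hours=hours)
    return [c for c in log if c["start"] >= cutoff]

# e.g. during an event at 17:00 on 1 June:
# recent_changes(change_log, hours=4, now=datetime(2011, 6, 1, 17, 0))
# -> both records above, which is exactly the "compound outage" scenario to avoid
```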

Btw, don’t ever delay resolving a legitimate business-critical issue because of CM approval. Just do it and ask for approval afterward.

I don’t like CRBs

I really don’t have much to say about this. I just wanted to get it in writing up front. I don’t see a lot of value in “Change Review Boards”. It’s my opinion that every org within the company ought to own their own changes and the consequences of any failures stemming from them. Management was hired/promoted because they’re supposed to be able to run their own discrete units well. They ought to be trusted to do that without interference from someone who has no idea about the architecture or the technical facets of a change. Customer approvals (mostly internal; external where appropriate) and devotion to constant and thorough communication can obviate any perceived need for centralized oversight for changes. Avoiding a CRB also allowed us to move much faster, which is something almost every company craves and appreciates.

Why you need CM

If you’re reading this post, then hopefully you already have a passing notion of why CM is a critical component of a truly successful operation. Just prior to rolling out our new CM process and tool set at Amazon, we were besieged with outages, with the top three root causes being botched software deployments, roll-outs of incorrect configuration, and plain ol’ human error. Our architecture was fairly sizeable at that point, and we needed better communication and coordination regarding changes to it. While I obviously can’t provide hard stats, I can say that over the first three years with the new process, the number of outage minutes attributed to fallout from planned changes in the architecture was reduced by more than 50% while the architecture continued to grow and become more complex. Our CM process contributed mightily to this reduction, along with updated tooling and visibility.

Here are just a few discrete points about why you need CM (I’m sure there are a ton more I haven’t included).

  • Event Management. Okay, this is a very Ops-manager-focused point, I’ll admit. When you change stuff in the environment without us knowing about it, we tend to get a little testy, as do the business and your customers. MTTR lengthens substantially when you have to figure out what’s changed in order to help identify root cause. Controlling the number of changes in a dynamic environment can significantly reduce the number of “compound outages” you experience. These are some of the most difficult outages to diagnose, and therefore some of the longest-lived. (Deploying new code at the same time as a major network change? tsk tsk…. Which change actually triggered the event? Which event is a potential root cause?) You’ve probably been in a situation where a site or service is down and the triggering events and/or root causes are unknown. One of the first questions to ask during an event is, “what’s changed in the environment in the past X time period?”. Pinpointing that without a CM process and accompanying tool set can be nigh impossible, depending on the size/complexity of the environment and scope of your current monitoring and auditing mechanisms.
  • Coordination/Control. Controlled roll-outs of projects and products address quite a few critical points. In a smaller company, this is even more important, as the resources to support a launch are typically minimal. In a company of any size, too many potentially-impacting changes in the environment at the same time is a recipe for disaster (dependencies may change in multiple ways, killing one or more launches, etc.). Reducing the amount of change in your environment during high-visibility launches will help the company maintain as much stability as possible while the CEO announces your newest-greatest-thing-EVAR to the press. A little bit of control honestly isn’t a bad thing. I’ve never understood why the word ‘control’ has such a negative connotation. Must be why I’m an Ops manager.
  • Compliance. I’ve learned a bit by rolling through SOX and SAS70 compliance exercises. You need a mechanism to audit every single change to critical pieces of your architecture. Working out all of the bugs in the process prior to being required to adhere to these types of audits is definitely preferable. Granted, you may enjoy some leeway in your first audit to get your house in order, but why waste time & create a fire by procrastinating?

9 Basics

These nine points can make a huge difference in ensuring the successful adoption of something as potentially invasive to a company’s work flow as this process might be.

  • Automation. Anything to do with a new process that can be automated or scripted should be. This includes the tool set that enables the CM process itself. If you can create a template for common changes, do it. Take it a step further and build ‘1-click’ submission of change records. Move your change records through the approval hierarchy automatically so approvers don’t have to do so manually. There are myriad ways to streamline a process like CM to save engineer time and effort. (A rough sketch of the template idea appears after this list.)
  • Automation, part deux. This time I’m talking about carrying out the changes themselves. I know that there are varying schools of thought on this, and I’m by no means saying that automation cures all evils. But automation does reduce human error- especially when you’re completing tasks such as changing network routing or updating system configs across a fleet. The less chance for human error, the better. If you’ve watched Jon Jenkins’ talk on Amazon’s Operational efficiencies, you know that automation allows their developers to complete a deployment every 11.6 seconds to potentially thousands of machines with an outage rate of 0.0001%. Trust me- before we had Apollo, Ops spent a lot of time running conference calls for outages stemming from bad deployments.
  • Tiered approvals work. Not every single change in the environment requires a change record, but every change must be evaluated to ensure the proper coverage. Critical changes in the platform or infrastructure which have the potential to impact the customer experience just ought to require more oversight. As a shareholder and a customer, I know I appreciated the fact that we had multiple levels of reviews (peer/technical, management, internal customer) to catch everything from technical mistakes in the plan to pure timing issues (making a large-scale change to the network in the middle of the day? dude.) There are also many changes which have zero impact and which shouldn’t require numerous sets of eyes on them prior to carrying them out. Completing the 99th instance of the same highly-automated change which hasn’t caused an event of any kind in the last X months? Forgoing approvals seems appropriate. See “The Matrix” below for more information.
  • Err on the side of caution. This doesn’t necessarily require moving more slowly, but it’s a possibility. For changes that could potentially impact the customer experience, a slight delay may prevent a serious outage for your site/service. If you’re unsure whether your potentially-invasive change might conflict with another one that’s already scheduled, then delay it until the next ‘outage window’. Not 100% sure that the syntax on that command to make a wholesale change to your routing infrastructure is correct? Wait until you can get a second set of eyes on it. You’d much rather wait a day or two than cause an outage and subject yourself to outage concalls, post mortems, and ‘availability meetings’, guaranteed.
  • Trust. Reinforce that implementing a CM process has nothing to do with whether or not your engineers are trusted by the company. You hired them because they’re smart and trustworthy. It’s all about making sure that you preserve the customer experience and that you’re aware of everything that’s changing in your environment. Most engineers are pretty over-subscribed. Mistakes happen, and it’s everyone’s job to guard against them if at all possible. The process will just help you do that.
  • Hold “Dog & Pony Shows”. Our new CM process required many updates to most of our major service owner groups’ work flows. It wasn’t just about learning a new tool. We had new standards for managing a ‘tier 1’ service. When the time came to roll out the new process company-wide, we scheduled myriad training sessions across buildings and groups. We tracked attendance & ‘rewarded’ attendees with an icon to display on their intranet profile. This also provided us a way of knowing who was qualified to submit/perform potentially-impacting changes without having to look at roll-call sheets during an event. I always left room for Q&A, and re-built the presentation deck after each session to cover any questions that popped up. We received some fabulous feedback from engineers while initially defining the process, but the most valuable input we collected was after we were able to walk through the entire process and tool set in a room full of never-bashful developers.
  • Awesome tools teams are awesome. Build a special relationship with the team who owns developing and maintaining the tool set that supports your process. A tools team that takes the time to understand the process and how it applies to the various and disparate teams who might use the tools makes all the difference. Quick turn-around times on feature requests, especially at the beginning of the roll-out, will allow you to continue the momentum you’ve created and will show that you’re 1) listening to feedback and 2) willing and able to act on it.
  • Be explicit. Be as explicit as possible when documenting the process. Don’t leave room for doubt – you don’t want engineers to waste time trying to interpret the rules when they ought to be concentrating on ensuring that the steps and timeline are accurate. When it doesn’t make sense to be dictatorial, provide guidelines and examples at the very least.
  • Incremental roll-out. I always recommend an incremental roll-out for any new and potentially-invasive process. Doing so allows for concentration on a few key deliverables at any given time, easing users into the process gradually while using quick wins to gain their support, gathering feedback before, during and after the initial implementation, and measuring the efficacy of the program in a controlled fashion. Throwing a full process out into the wild to “see what sticks to the wall” isn’t efficient, nor does it instill user confidence in the process itself. In startup cultures, that might work for software development, but avoid asking engineers and managers to jump through untested process hoops while they’re expected to be agile.
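Here’s the rough sketch of the change-template idea promised above. Everything in it (template names, fields, the one-step helper) is hypothetical, since your CM tool will have its own API; the shape of the win is that the engineer supplies only what’s unique to this instance of the change.

```python
# Hypothetical sketch of templated change records with one-step submission.

COMMON_CHANGE_TEMPLATES = {
    "dns-record-update": {
        "tier": 3,
        "business_impact": "No customer impact expected; internal name resolution only.",
        "rollback_plan": "Re-apply the previous zone file from version control.",
        "approvals_required": ["peer"],
    },
    "core-router-upgrade": {
        "tier": 1,
        "business_impact": "Potential customer-facing impact during failover.",
        "rollback_plan": "Step-by-step reversal documented per device.",
        "approvals_required": ["peer", "manager", "vp", "customer"],
    },
}

def one_click_change(template_name: str, technician: str, scheduled_start: str) -> dict:
    """Build a complete change record from a template so the engineer only
    fills in what's unique to this instance of the change."""
    record = dict(COMMON_CHANGE_TEMPLATES[template_name])
    record.update({
        "template": template_name,
        "technician": technician,
        "scheduled_start": scheduled_start,
        "status": "submitted",  # routed automatically to the approvers listed above
    })
    return record
```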

The Matrix

I’m a firm believer in the flexibility of a stratified approach to CM. Not every single type of change needs a full review, 500 levels of approvals, etc. We as an organization (Amz Infrastructure) put a lot of thought into the levels and types of approvals required for each specific type of change- especially in the Networking space, where errors have the potential to cause widespread, long-lasting customer impact. We analyzed months of change records and high-impact tickets, and we took a good hard look at our tool set while coming up with a document that covered any exceptions to the “all network changes are tier-1, require three levels of approval and at least 5 business days’ notice” definition. Here’s a sanitized version of a matrixed approach:

Example CM Stratification
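As a purely illustrative stand-in (not the actual matrix), a stratified approach expressed as data might look something like this; the change types, approval chains and notice periods below are invented for the sake of the example.

```python
# Illustrative only. The idea is that the required scrutiny scales with the
# potential for customer impact.

CM_STRATIFICATION = {
    # change type                   approvals required                       minimum notice
    "core network change":          (["peer", "manager", "vp", "customer"],  "5 business days"),
    "tier-1 service deployment":    (["peer", "manager", "customer"],        "48 hours"),
    "host config update":           (["peer"],                               "24 hours"),
    "well-worn automated change":   ([],                                     "calendar entry only"),
}
```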

We set up a very simple process for adding new changes to the “exception list”. Engineers just sent their manager a message (and cc’d me) with the type of change they were nominating, the level of scrutiny they recommended and a brief justification. It was usually 3-4 sentences long. Then there’d be a brief discussion between the manager and me to make sure we were copacetic before adding it to the CM process document for their particular team. The last step was communicating that to the relevant team and clearing up any questions – typically in their weekly meeting. Voila!

For approvers

We created guidelines and checklists for reviewers and approvers for the ‘soft’ aspects of change records that weren’t immediately apparent by simply reading the document. We trusted the people involved in the approval process to use their own solid judgement where appropriate, since no two situations or changes are the same. Here are a few of the more major guidelines that I remember; each organization/environment combination will require their own set, of course.

  • Timing of submission. Our policy was to accept after-the-fact changes for sev1 tickets, and Emergent CMs for some sev2 tickets (see below, “Change Record”). Using inappropriately-defined sev2 tickets to circumvent the process was obviously grounds for rejection/rescheduling. The same applied to Emergent changes due to lack of proper project planning, which are rarely worthy of the emergent label.
  • Level of engineer. Ensure that the person responsible for the technical (peer) review owns the correct expertise (product or architectural), and that the technician of the change is of the proper level for the breadth and risk involved. Assuming that a junior engineer can make large architectural changes and then have the necessary competencies to troubleshoot any major fallout most likely won’t set them – or your customers – up for success.
  • Rejecting change records. We provided a few guidelines for gracefully rejecting a CM, including giving proper feedback. For example, rather than saying, “your business justification sucks”, you might say, “it’s unclear how this change provides benefit to the business”, or “what will happen if the change doesn’t happen?” (which were both questions included in our CM form).
  • Outage windows. Unless your change system enforces pre-defined outage windows, you’ll need to review the duration of the change to ensure that it complies. If a change bumps up against the window, you might want to ask the technician about the likelihood that the activity will run long, and request that that information be both added to the change record and communicated to affected customers.
  • Timeliness of approvals. This is more of a housekeeping tip, but still important. Engineers expend a lot of time and energy planning their changes, so the least the approvers can do is be timely with their reviews. Not only is it courteous, it helps the team hit the right notification period, the engineer doesn’t need to spend even more time coordinating with customers to reschedule, and the remainder of your change schedule doesn’t have to be pushed back to accommodate the delay.

Audits/Reporting

This was the biggest pain in my arse for months, I have to say- about four hours every Sunday in preparation for our org’s Metrics meetings during the week. We had expended so much effort in defining the process, as well as educating customers and engineers, and our teams had made a huge mind shift regarding their day-to-day work flow. We absolutely had to be able to report back on how much improvement we were seeing from those efforts. We measured outages triggered by Change events, adherence to the process, and quality of change records. Most of our focus was on quality, as we knew that quality preparation would lead to fewer issues carrying out the actual changes.

Completing an audit for each of the seven teams in Infrastructure entailed reviewing the quality of information provided in 8 separate fields (see below, ‘The Change Record’) for every Infra change record submitted (typically around 100-125 records/week). Beyond the quality of the information itself, the audit also included comparing against each team’s exception list to ensure the proper due diligence had occurred, comparing timestamps to audit the notice period, and examining whether the proper customers had been notified of and approved the change.
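A chunk of that audit is mechanical enough to script. Here’s a hypothetical sketch of the field and notice-period checks; the field names and the 48-hour figure are placeholders for whatever your own process requires, and the judgement calls still happen by hand.

```python
from datetime import timedelta

REQUIRED_FIELDS = [
    "tier", "scheduled_start", "scheduled_end", "business_impact",
    "timeline", "rollback_plan", "approvals", "contacts",
]

def audit_record(record: dict, required_notice_hours: int = 48) -> list[str]:
    """Return audit findings for a single change record. Purely illustrative -
    the real audit also checked exception lists and customer approvals by hand."""
    findings = [f"missing or empty field: {name}"
                for name in REQUIRED_FIELDS if not record.get(name)]
    approved_at = record.get("fully_approved_at")
    start = record.get("scheduled_start")
    if approved_at and start and start - approved_at < timedelta(hours=required_notice_hours):
        findings.append(f"less than {required_notice_hours} hours' notice between final approval and start")
    return findings
```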

Sure wish I had an example of one of the graphs on CM that we added to the weekly Infrastructure metrics deck. They were my favourites. 🙂

Over the first 8 weeks of tracking, our teams increased their quality scores by more than 100% (some teams had negative scores when we began). Outage minutes attributed to networking decreased by approximately 30% within the first 6 months. We also had coverage and tracking for changes made by a couple of teams which had previously never submitted change records, including an automation team which owned tier-1 support tools.

Notification, aka “Avalanche of Email”

To be perfectly frank, we never really figured out how to completely combat this. We did build a calendar that was easy to read and readily-available. We also had an alternate mechanism for getting at that same information if a large-scale event occurred and the main GUI wasn’t reachable, which is typically when you need a CM calendar the most. Targeted notification lists did help. For example, each service might have a ‘$SERVICE-change-notify@’ list (or some variant) for receiving change records related to one particular service. Over-notification is a tough challenge- especially when there are thousands of changes submitted each day in the environment. If anyone has a good solution, I’d love to hear about it!
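For what it’s worth, here’s a tiny, hypothetical sketch of the targeted-list idea; the alias format, domain and the example services are stand-ins for whatever your mail system and architecture actually use.

```python
def notify_lists(affected_services: list[str], domain: str = "example.com") -> list[str]:
    """Build the per-service '$SERVICE-change-notify@' aliases for a change record.
    Purely illustrative; the alias format is a stand-in."""
    return [f"{service}-change-notify@{domain}" for service in affected_services]

# e.g. a storage change touching two dependent services:
# notify_lists(["payments", "ordering"])
# -> ['payments-change-notify@example.com', 'ordering-change-notify@example.com']
```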

The Change Record

Yes, it took some time for an engineer to complete a change record perfectly- especially for ‘tier-1 services’, which necessitated more thorough information. Our first version of the form did include auto-completion for information specific to the submitter and technician. We also added questions into the free-text fields within the CM form to draw out the required information and prevent the back-and-forth between the submitter and approvers that would otherwise have resulted. ‘V2’ provided the ability to create templates based on specific fields, which saved our engineers quite a bit of time per record.

Here are some of the more important fields that ought to be added to a change form. They don’t comprise all of the input required- just the major points. (A bare-bones sketch of such a record follows the list.)

  • Tiers/Levels. Most environments do have various ‘tiers’, or levels of importance to the health of the site/service the company is providing. For example, if you’re a commerce site, chances are your Payments platform is held to a 5-9’s type of availability figure. These services ought to be held to a very high standard when it comes to touching the environment. On the flip side, a service such as Recommendations may not be as important to the base customer experience and therefore might not need to be held to such tight requirements. Grab your stakeholders (including a good cross-section of end users of the process) to define these tiers up front.
  • Start/End Time. This kind of goes without saying. It’s the field that should be polled when building an automated change calendar or when people are attempting to not trample on each other’s changes. Once the dust has settled, you can refine this to include fields for Scheduled Start/End and Actual Start/End Time. This will allow gathering more refined metrics about how long changes actually do take to complete, as well as how well teams adhere to their schedules. Setting the Actual Start time would move the change into a ‘Work in Progress’ state and send notification that the change had started. Setting the Actual End would move the record to the Resolved state.
  • Business Impact. Since not everyone viewing a change was able to glean whether their service or site would be impacted, we provided engineers with drop-down selections for broad options such as ‘one or more customer-facing sites impacted’ or ‘only internal sites impacted’. We followed that with a free-text field with questions that would draw out more details about actual impact. The answers were based on “worst-case scenario” (see my point above about erring on the side of caution), but engineers typically added a phrase such as ‘highly unlikely’ where warranted to quell any unwarranted fears from customers, reviewers and approvers.
  • Emergent/Non-Emergent. This was just a simple drop-down box. Any change record which hadn’t been fully approved 48 hours prior to the Scheduled Start time (when the record appeared on the CM schedule and the general populace was notified) was marked as Emergent, which garnered closer attention and review. This did not include after-the-fact change records submitted in support of high-severity issues. It was a simple way to audit and gather metrics, and it also offered customers and senior management a quick way to see high-priority, must-have changes.
  • Timeline. This should be an explicit, step-by-step process, including exact commands, hostnames, and environments. Start the timeline at 00:00 to make it simpler. Scheduled start times can change multiple times depending on scheduling, and having to adjust this section every time is a pain. Timelines must always include a monitoring step before, during and after the change to ensure that the service isn’t behaving oddly prior to the change, that you haven’t caused an outage condition during the change (unless it’s expected) and that the environment has recovered after the work is complete. If you have a front-line Ops team who can help you monitor, that’s a bonus! Just don’t sign them up for the work without consulting them first.
  • Rollback Plan. The rollback plan must also be an explicit, step-by-step process. Using “repeat the timeline in reverse” isn’t sufficient if someone else unfamiliar with your change is on-call and must roll it back at 4am two days after the change. Include exact commands in the plan and call out any gotchas in-line. And remember to add a post-change monitoring step.
  • Approvals. We opted for four types of approvals to allow focus on the most important facets of the process. Over time, we utilized stratification to dial back the involvement required of our management team and the inherent delays that came along with that. Every level of approver had the ability to reject a change record, setting it to a Rejected state and assigning it back to the submitter of the change record for updates.
    • Peer review. Our peer reviewers typically focused on the technical aspects of the change, which included ensuring that the timeline and roll-back plans covered all necessary steps in the proper order, and that pre- and post-change monitoring steps existed.
    • Manager review. Managers typically audited the ‘administrative’ details: proper customer approval, overlap with other critical changes already scheduled, and whether the verbiage in the fields (especially business impact) was easily understood by the wider, non-technical audience.
    • VP review. High-risk, high-visibility changes were usually reviewed by the VP or an approved delegate. VPs typically concentrated on the potential for wider impact, such as interference with planned launches. They were the last step in the approval process and had final say on postponing critical changes for various reasons (amount of outage minutes accrued vs risk of change, not enough dialogue with customers/peers on major architectural changes, etc).
    • Customer approval. We dealt with internal customers, typically software development teams, and we worked closely with each of our major customers to define the proper contacts for coordination/approval. Engineers were required to give customers at least 48 hours’ notice to raise questions or objections. Some network changes touched most of the company; for those, VP review and approval covered the customer approval requirement, and we used our Availability meeting to announce them and discuss with the general service owner community if time permitted.

    None of these roles should be filled by the technician of the change itself. Conflict of interest. 😉

  • Contact information. We required contact information, including page aliases, for the submitter, technician, and the resolver group responsible for supporting any fallout from the change. Standard engagement alias formatting applied. Information for all approvers was also captured in the form.
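And here’s the bare-bones sketch of a change record promised above, pulling those fields together in one place. The names and defaults are mine rather than any particular tool’s; the Emergent check and the state transitions follow the definitions described in the bullets.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class ChangeRecord:
    """Illustrative change record covering the fields discussed above."""
    title: str
    tier: int                    # 1 = most critical to the customer experience
    scheduled_start: datetime
    scheduled_end: datetime
    business_impact: str         # worst-case description, in plain language
    timeline: list[str]          # explicit step-by-step commands, relative to 00:00
    rollback_plan: list[str]     # explicit steps, never "reverse the timeline"
    technician: str
    submitter: str
    resolver_group: str
    approvals: dict[str, Optional[str]] = field(
        default_factory=lambda: {"peer": None, "manager": None, "vp": None, "customer": None}
    )
    actual_start: Optional[datetime] = None
    actual_end: Optional[datetime] = None
    fully_approved_at: Optional[datetime] = None

    def is_emergent(self, notice: timedelta = timedelta(hours=48)) -> bool:
        """Emergent if the record wasn't fully approved `notice` ahead of the scheduled start."""
        if self.fully_approved_at is None:
            return True
        return self.fully_approved_at > self.scheduled_start - notice

    def status(self) -> str:
        """Derive the record's state from the actual start/end fields."""
        if self.actual_end:
            return "Resolved"
        if self.actual_start:
            return "Work in Progress"
        return "Scheduled"
```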

huh? it’s only been two weeks?!?

This post is all about what I’ve learned in my first two weeks as Director of LiveOps at Demonware. The role of a manager should always be to enable the organization to increase the level of production while maintaining sanity and without having to horizontally scale the team. (‘buzzword bingo’, anyone?) In a year, this blog will be filled with examples of how we as a management team accomplished that: all of the challenges, wins, missteps, etc. we’ve made on our way to fulfilling our destiny as the premier Operations team in the gaming industry.

When I joined DW on June 1, I had no idea what to expect. Yes, I’ve been in Ops for longer than I care to admit. But gaming is a fairly foreign world to me – I can watch someone play a game all day, and I fare fairly well with games targeted at 4-year-olds. That’s where my experience in the game industry stops. That being said, here are some of my initial impressions after spending three days in Vancouver with the team & working from home in Seattle (silly work permit process….) for a few more days.

  • Operations is Operations. Yes, the technologies might differ drastically between companies, but the same challenges, issues and solutions exist when trying to enable a high-performing team to ‘level-up’: process, standardization, automation and tooling
  • I’m extremely humbled that Demonware selected me to guide their highly-capable LiveOps team. Seriously.
  • I wonder at the amount of work the company has been able to churn out with such a small but able staff
  • I’m incredibly excited by the positive attitude and collaborative inter- and intra-team spirit. Even the surliest of engineers kick ass and take names
  • I instantly fell in love with my highly-technical, over-taxed, mostly junior management team. I expect that I will learn just as much from them as I will teach them.

Most importantly, I realize that while the amount of work produced by our engineers reflects a very high-performing organization, we’re at a breaking point. The deliverables in the [currently-being-drafted] short- and long-term Operations road maps far outstrip the processes and resources available. More so than any team I’ve managed previously, and I’ve had to deal with some pretty gnarly resource constraints.

State of the DW LiveOps Union

We build and maintain backend services for Activision/Blizzard games such as Call of Duty – services such as leaderboards and matchmaking. (pretty sweet, right?) Our work load is mostly dictated by the road maps of third-party game studios, and while the work is cyclical, not every game requires the same features or infrastructure. Currently, LiveOps is the tail being wagged, with late-binding requests generating a make-or-break race to hit the hard holiday shopping deadlines.

Engineering and Operations were both re-structured just a few months ago to better reflect the work load. This seems to have gone well for the SDE world, where structures based on services make a lot of sense. We’re still working through the transition in Operations- these exercises typically take much longer to shake out in our more interrupt-driven, diverse realm.

We’re very, very lucky to have fantastic support from DW senior management. (and I’m not just saying that because my boss will most likely be reading this post at some point) It’s only been two weeks, but I feel ‘mind meld’ coming on, and that’s only happened one other time in my career. Our management understands the value that a world-class Operations team provides to the company. It’s a rare occurrence, in my experience, and I plan to take full advantage of it. 🙂

LiveOps is a technically high-performing team, and…. entertaining. It’s filled with some of the most driven, intelligent and open engineers I’ve worked with. The company has done a fantastic job of hiring for culture as well as technical skill, and that really does make all the difference. Prima donnas can suck the life out of an Ops team.

We’re just beginning to think about Scale-with-a-capital-S. It’s a rare and exciting time in the life of an adolescent company. I thank my lucky stars that I’ve been fortunate enough to have experienced scaling challenges and seen some amazing solutions to them at Amazon and Facebook. I feel like my time at both of those companies was the best prep I could have ever had for the challenges we’re now facing.

My Dirty Little Assessment

First off, I can’t give enough credit to The First 90 Days for providing me a solid framework for approaching the assessment of my new organization. I’m learning to take my time to focus on observing and building relationships, rather than jumping in and making lightly-considered/rash decisions just to try to make my mark. The book’s common sense is forcing me to focus on defining a few quick-strike wins to build momentum and credibility. If you’re ever faced with transitioning into a new role, read this. It’s bible-worthy IMO, even though none of the concepts are particularly foreign. Now on to what I think I might be blogging about over the next year…

Have I Mentioned We’re Hiring?

Hiring is one of our top priorities. First of all, we have a great recruiting team, and the people Demonware has hired are fabulous. Just like Amazon and FB, we’ve placed as much emphasis on culture fit as on technical acumen. Like it or not, though, the work doesn’t stop coming in just because we’re being selective in our hiring process. To help fill our roles more quickly, we’ll be re-factoring job descriptions and working with recruiting to update our processes: base technical pre-screen questions (to save our phone screeners time and headaches), more timely and descriptive feedback, and using our engineers’ penchant for social networking to get the word out.

“Traditional” Ops Processes

Demonware is just coming out of their startup phase, and it seems that a common denominator in companies at this stage in their progression is lack of mature processes (makes sense). We actually have a great start- it’s all about streamlining and improving upon what we already have. Process should be an enabler, not a hindrance. People who balk at this idea or think that ‘process’ is a four-letter word obviously haven’t seen it implemented the right way. Just sayin’. Here are a couple of deliverables that we’ve talked about as a management team that are on my personal road map:

  • Event Management: We already have a decent (not perfect) Event Management process documented, and we follow it most of the time. We also have a fantastic start on an incredible tool set that covers the basics of notification and engagement. The information we need exists, but we still need to tie it all together. We also need to remove more of the human element in the process (notice I said we follow it most of the time, just like most other shops). In the middle of an event, engineers just want to fix the issue, rather than concentrating on following the process. And, of course, we could always tighten our post-event actions to ensure that we’re lengthening MTBF.
    These are important things to address, but the most important deliverable for this point is the ability to measure the effectiveness of the process (MTTD, MTTR, MTBF). We honestly won’t know how to take this a step further until we know how we’re currently doing.
  • Change Management: We’re in the same boat with CM as we are with Event Management. Good process that’s well-documented, but no way to measure its effectiveness, the time spent per change, or the number of planned vs. emergent changes, and no solid way to track customer impact/fallout programmatically. This isn’t to say that we don’t pay attention to this- we definitely do. We just need to make it much easier to get at the data we need quickly, and we need to build on that data to reduce our susceptibility to fallout.
  • Monitoring/Alerting: We monitor A LOT of stuff, and we have the basics covered pretty well. The next step is to refine our monitoring configurations to pare down the noise. We must be able to definitively say that yes, we’re monitoring the right stuff at the proper thresholds, that the correct personnel are notified for the right alarms, and that we’re able to measure our effectiveness at reducing the number of alarms through everything from code re-factoring to architecture standardization.
  • Operational Acceptance Criteria (OAC): Ops teams routinely complain about stuff being ‘thrown over the fence’ for them to support. OAC is a great way to ensure that before the team signs off on a new support request, it’s actually supportable. Providing a well-designed OAC checklist to customers will not only address that, but it will oftentimes spawn different design decisions that will make a service/stack more extensible and reliable. Theo Schlossnagle says it’s about “putting more ops into dev”, rather than the inverse. Can’t argue with Theo, right? 🙂

Streamlining

We have to make our own lives simpler. That’s just a given for any Ops team, regardless of how long the team or company has existed or how successful they are. Now that we’re starting to hunker down, we need to begin approaching Operations as a business unit, just like every other organization. It sounds like an awful concept to engineers, but once the framework is in place, those same engineers are grateful that they can depend on the way work flows into and out of the team, that there are clear escalation paths, and so on.

  • Planning and Prioritization: It’s the same with most Ops teams, but the resounding feedback from our team is that “we never have time to get to the stuff we really need to do”. We need to answer the questions, “what is it that is taking up your time currently?” and “what exactly should we be doing instead, and why?”. Prioritizing work in the Ops world is typically tougher than in the engineering world due to the interrupt-driven, break/fix nature of the role. There’s no reason you can’t just make an “Operational Interrupts” line item in your road map, assign it the proper resource level, and devote the remainder of the team’s time to the projects which pop the stack in terms of business value.
  • Communication/Partnering: The more of a partnership you can cultivate with engineering and senior management, the easier it gets. We already work well with both sets of customers, but this will always be a focus for us. Reviewing road maps and priorities to make sure we’re all on the same page, participating in design reviews (so that Ops has a seat at the table before a service launches), and consistently setting and resetting expectations will all make our lives easier as Ops personnel.

Event Management

Something blew up in your infrastructure and you have no idea what’s wrong or where to even start looking. A large percentage of your customer base is impacted, and the company is hemorrhaging money every minute the episode continues. Your senior management team is screaming in both ears, and your engineers are floundering in your peripheral vision, trying to find root cause. Sound familiar?

True Ops folks tend to thrive in this type of environment, but businesses don’t. And engineers, regardless of whether they write software or maintain systems & networks, hate seeing the same events over and over again. Managing these events doesn’t just last for the duration of the event itself. To do it right, it takes copious amounts of training, automation, process innovation, consistency and follow-through. This is my ‘take’ on how to go about rolling out a new process.

This may seem like a lot of overhead (it’s a lot of words), but the process itself is actually pretty simple. The effort is really in making the right process design decisions up front and in creating the proper tooling & training/drilling around it. It’s a very iterative process; it took well over a year to solidify it, and we were constantly re-factoring it as we learned more about our evolving architecture. Most of what’s described below is for Impact 1 events (site outages, etc) and doesn’t necessarily apply to run-of-the-mill or day-to-day requests (over-rotating burns people out and diminishes the importance of those major events). Not all of this applies to a small, 20-person company either, although the responsibilities contained in the ‘During an Event’ section will apply to almost any sized team or event. You may need to combine roles or re-distribute responsibilities depending on the size of the team or event, but the process itself is pretty extensible. The examples follow distributed websites, since that’s what I know, but the concepts themselves ought to apply to other architectures and businesses. (I also assume you’re running conference calls, but the same applies if you run your events over IRC, Skype, etc.)

Culture Shift

If you’re one of the few lucky people who work in a company where IT Operations garners as much attention and love as launching new features/products, then we’re all jealous of you. 🙂 Engineers and business people alike would absolutely love to have 100 percent of the company’s time focused on innovation. In my experience, any time I mention ‘process’, I receive looks of horror, dread and anger from engineering, including management. The knee-jerk reaction is to assume that a new procedure will only create more delay or will divert precious time from what ‘truly matters’. Taking a measured approach to dispelling those fears will pave the way to a successful roll-out. It just takes a lot of discussion, supporting metrics, the ability to translate those metrics into meaningful improvement to the bottom line, a considered plan, and the willingness to partner with people rather than being prescriptive about it.

  • Act like a consultant. Even if you’re a full-time employee who’s ‘grown up’ in an organization, you should begin with a consultant mindset so you can objectively take stock of your current environment, solicit candid feedback, and define solid requirements based on your learnings. This can be difficult when you’re swimming (drowning?) in the issues, and gathering input from people who are participants in but not owners of the process will help immensely.
  • Use metrics. You have to know the current state of affairs before diving headlong into improvements or prioritizing the deliverables in your project. If you don’t have a ticketing system or feature-rich monitoring system from which to gather metrics programmatically, then use a stopwatch to record the time it takes to run through each step of the current process. If all you have is anecdotal evidence to reference initially, then so be it. And if that’s the case, gaining visibility into the process should be at the top of your priorities.
  • Be truly excited. Don’t pay lip service to a change in process, and don’t allow the leaders in your organization to do so either. The minute you sense resistance or hesitation in supporting the effort, intercept it and start a conversation. This is where the numbers come in handy. If the engineers tasked with following a new process are hearing grumblings from managers or co-workers, then it adds unnecessary roadblocks. To be sure, we encountered our fair share of resistance, which bred some frustration during our roll-out. But we leaned on the fact that every improvement decreased the number of outage minutes, added to the bottom line and helped with the stock price- even if the benefit was indirect. That’s something that everyone can and should be excited about.
  • Incremental progress. Not everything included here has to (or can) happen overnight, or even in the first six months. I hate the saying, “done is better than perfect”, but sometimes it actually applies. I’ve included ideas on how to roll most of the process out in an incremental fashion while still getting consistent bang for the buck.
  • Continual refinement. No good process is one-size-fits-all-forever. Keep an open mind when receiving feedback, ensure that the process is extensible enough to morph over time, and continually revisit performance and gather input from participants. Architectures change, and the processes surrounding them must change as well.

Prepping for success

The following deliverables are fundamental to securing a solid Event Management process that’s as streamlined as possible. It will take time to address the majority of the research and work involved, but basing prioritization on the goals of the program and the biggest pain points will allow measurable progress from the outset.

Impact Definitions

You need to know the impact or severity level of the event before you know what process to run. The number of levels may vary, but make sure to decide on a number that is both manageable and covers the majority of issues in your environment. I have to admit that over time, my previous company moved to looking at events as “pageable with a concall” (sev1), “pageable without a call” (sev2) or “non-pageable” (sev3) offenses, rather than adhering to each specific impact definition. This isn’t right or wrong; the behavior reflected our environment. Although each organization is unique, here are some examples to consider:

Impact 1: Outage Condition: Customer-facing service or site is down. Percentage of service fatals breaches a defined threshold (whatever is acceptable to the business).
Sev1 tickets/events follow all processes below and have a very tight SLA to resolve, which triggers auto-escalation up the relevant management chain. The escalation time will depend on the types of events involved in a typical sev1, but we escalated through the management chain aggressively, beginning at 15 minutes after the ticket was submitted. The rule for rotating the ticket to the secondary on-call if it isn’t moved to the appropriate next state or updated (thus paging them) should also be fairly tight (i.e., if a ticket isn’t moved from ‘assigned’ to ‘researching’ within 15 minutes, it auto-reassigns to the secondary and pages the group manager). A rough sketch of that auto-escalation rule follows the impact definitions below.
Impact 2: Diminished Functionality: Customer-facing service or site is impaired. Percentage of service fatals breaches a defined threshold (whatever is acceptable to the business).
Sev2 tickets/events will page the correct on-call directly, with a moderately tight SLA to resolve which triggers auto-escalation up the relevant management chain. These tickets will also rotate to the secondary on-call and page the group manager if the ticket isn’t moved to the appropriate next state after the agreed-upon SLA.
Impact 3: Group Productivity Impaired: Tickets in this queue will most likely wind up covering issues that will either become sev1/sev2 if not addressed or are action items stemming from a sev1/sev2 issue. It may also cover a critical tool or function that is down and affecting an entire group’s productivity. These tickets don’t page the on-call, and the SLA to resolve is much more forgiving.
Impact 4: Individual Productivity Impaired/Backlog: This sev level was treated more like a project backlog, and while there are other products that cover bugs and project tasks, I like the idea of having everything related to work load in the same place. It’s simpler to gather metrics and relate backlog tasks to break/fix issues.
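
To make that auto-escalation rule concrete, here’s a minimal sketch in Python. The ticket fields and the page() helper are hypothetical stand-ins for whatever your ticketing and paging systems actually expose; only the 15-minute SLA comes from the example above.

    from dataclasses import dataclass
    from datetime import datetime, timedelta

    STATE_SLA = timedelta(minutes=15)   # 'assigned' -> 'researching' within 15 minutes

    @dataclass
    class Sev1Ticket:                   # hypothetical, minimal ticket shape
        ticket_id: int
        state: str                      # 'assigned', 'researching', 'resolved', ...
        last_state_change: datetime
        assignee: str
        secondary_oncall: str
        group_manager: str

    def page(alias, message):
        # stand-in for a real paging mechanism (SMS gateway, pager rotation, etc.)
        print(f"PAGE {alias}: {message}")

    def check_escalation(ticket, now=None):
        """If a sev1 ticket sits in 'assigned' past the SLA, rotate it to the
        secondary on-call and page the group manager, per the rule above."""
        now = now or datetime.utcnow()
        if ticket.state == "assigned" and now - ticket.last_state_change > STATE_SLA:
            ticket.assignee = ticket.secondary_oncall
            page(ticket.secondary_oncall, f"sev1 #{ticket.ticket_id} rotated to you")
            page(ticket.group_manager, f"sev1 #{ticket.ticket_id} not picked up within 15min")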

Incremental progress

I will always recommend front-loading the sev1 definition and over-escalating initially. In my mind, it’s much better to page a few extra people in the beginning than it is to lose money because you didn’t have the proper sense of urgency or the correct people for an issue. If you can’t integrate automatic rotation of tickets into your current system, then add it into your checklist and make a conscious decision to watch the time and escalate when necessary.

Tools and Visibility

Tools

It doesn’t take an entire platform of tools to run an event properly, although that certainly does help. The following tools are fairly important, however, so if you have to prioritize efforts in this arena, I’d start here.

  • Ticketing System. A flexible and robust ticketing system is an extremely important part of a solid Event Management process. It’s your main communication method both during and after an event, and it’s a primary source for metrics. If participants in an event are fumbling with the fundamental mechanism for communicating, then they’re not concentrating on diagnosing and fixing the issue. There are many important features to consider, but extensibility, configurability and APIs into the tool are all critical to ensuring that whatever system you choose grows along with your processes and organization.
  • Engagement/Notification System. Ideally this will be tied into your ticketing system. If you have your tickets set up to page a group, then you ought to already have that information in the system. While our first-line support team utilized a full version of a homegrown event management application, we always wanted to provide a pared-down version of the same tool for service owners throughout the company. I certainly hope that’s happened by now, since the more distributed a company becomes, the more difficult it is to locate the right people for cross-functional issues which may not be sev1-worthy.
  • Sev1 Checklist. I’m a big proponent of checklists that can and should be used for every event. In the heat of battle, it’s easy to overlook a step here and there, which can cause more work farther into the event. Building a checklist into an overall Event Management application is a great way to track progress during an event, ensure each important step is covered and inform latecomers to the event of progress without interrupting the flow of the call or the troubleshooting discussions. Separate lists should be created for the front-line ops team, call leaders and resolvers. Each role owns different responsibilities, but everyone must understand the responsibilities of all three roles.

Incremental progress

Ticketing: If your system doesn’t include features such as service-based groups, automatic submission of tickets, reporting/auditing or fundamental search functionality, start investing in either bolstering the current system or migrating to another one. Depending on the scope of that work, beg/borrow/steal development resources to create a hook into the backend data store to pull information related to your specific needs. (this is a grey statement, but every environment has different needs).

Checklists: It’s fine to start small with a binder of blank lists that the team can grab quickly. Anything is better than nothing! Include columns for timestamps, the name of the person who completed the action, the actual action and a place for notes at the very least. The facets of an event I would document initially are discovering the issue (goes without saying), cutting the ticket, initial engagement & notification, each subsequent notification, when on-calls join the call/chat, any escalations sent, root cause identified, service restored, and post mortem/action items assigned.
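
If you later want to move those paper checklists into a tool, the structure is simple. Here’s a rough sketch of the columns above as Python data; the names are purely illustrative.

    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import List

    @dataclass
    class ChecklistEntry:
        action: str                     # e.g. "sev1 ticket cut", "networking on-call joined"
        completed_by: str               # name of the person who completed the action
        notes: str = ""
        timestamp: datetime = field(default_factory=datetime.utcnow)

    @dataclass
    class EventChecklist:
        ticket_id: int
        entries: List[ChecklistEntry] = field(default_factory=list)

        def record(self, action, completed_by, notes=""):
            self.entries.append(ChecklistEntry(action, completed_by, notes))

    # The facets listed above become timestamped rows as the event unfolds.
    checklist = EventChecklist(ticket_id=444444)
    checklist.record("issue discovered via latency alarm", "NOC")
    checklist.record("sev1 ticket cut", "NOC")
    checklist.record("initial engagement & notification sent", "NOC")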

Visibility

  • Monitoring/Alerting. You have to be able to recognize that an event is going on before you can kick off a process. If you’re really good, your monitoring will begin the process for you by auto-cutting a ticket based on a specific alarm and notifying the proper engagement/notification lists (a rough sketch of that kind of hook follows this list). That takes time, of course, but you should be able to build a solid list of alerts around sev1 conditions as you go along- automation like that is rarely built in a day. Almost every post mortem I’ve been in for a high-impact event has included a monitoring action item of this type; if those conversations are happening then you’re bound to have fodder for monitoring and automation. I’ve chatted about monitoring and alerting in a previous post, so I won’t regurgitate it here.
  • Changes in the Environment. Understanding what’s changed in your environment can significantly aid in narrowing the scope of diagnosing and troubleshooting events. Accumulating this data can be a huge task, and visualizing the amount of change within a large distributed, fast-paced, high-growth environment in an easily-digestible format is a bear. The visibility is well worth it, however, so if you don’t have a Change Management system or process, it’s a fantastic deliverable to put on a road map. CM is an entirely separate post though, so I won’t go into it here.
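
As promised in the monitoring/alerting bullet above, here’s a rough sketch of an alarm-driven auto-cut. Everything in it (the alert payload fields, create_ticket(), send_engagement(), the alias names) is a hypothetical stand-in for your own monitoring and ticketing hooks, not any particular product’s API.

    # Map known sev1 alarm conditions to the on-call alias that should be engaged.
    SEV1_ALARMS = {
        "frontend.fatal_rate": "page-frontend-primary",
        "network.border_down": "page-networking-primary",
    }

    def create_ticket(severity, summary):
        # stand-in for the ticketing system's API; returns a ticket number
        print(f"cut sev{severity} ticket: {summary}")
        return 444444

    def send_engagement(alias, message):
        # stand-in for paging/engagement (see 'Engagement vs Notification' below)
        print(f"ENGAGE {alias}: {message}")

    def handle_alert(alert):
        """Auto-cut a sev1 ticket for known high-impact alarms and engage the
        right on-call alias, so the process starts before a human notices."""
        alias = SEV1_ALARMS.get(alert["check"])
        if alias is None:
            return  # not a known sev1 condition; leave it to normal alerting
        ticket_id = create_ticket(1, f"{alert['check']} breached: {alert['value']}")
        send_engagement(alias, f"sev1 #{ticket_id}: {alert['check']} breached, please join the call")

    handle_alert({"check": "frontend.fatal_rate", "value": "52%"})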

Incremental progress

Changes: Start small by collating data such as software deployments for major services, a simple calendar of Change Management events (heck, even a spreadsheet will suffice in the beginning), and recent high-impact tickets (sev1/sev2). You can migrate into a heads-up type of display once you have the data and understand the right way to present it to provide high value without being overwhelming.

Standardized On-Call Aliases

Once your company has more than one person supporting a product or service, you should create a standardized on-call alias for each group. Adding churn to the process of engaging the proper people to fix an issue by struggling to figure out who to page is unacceptable- especially when the front-line team has a tight SLA to create a ticket with the proper information, host a call and herd the cats. For example, we used a format akin to “page-$SERVICE-primary” to reach the primary on-call for each major service. (page-ordering-primary, page-networking-primary, etc.) Ditto for each team’s management/escalation rotation (page-$SERVICE-escalation). Managers change over time, and groups of managers can rotate through being an escalation contact. As a company grows, a front-line team can’t be expected to remember that John is the new escalation point for ordering issues during a specific event.
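
The convention is trivial to encode, which is exactly the point: no one has to remember who John is at 3am. A minimal sketch; the exact format is just whatever your mail or paging system supports.

    def page_alias(service, role="primary"):
        """Build a standardized on-call alias per the convention above,
        e.g. page-ordering-primary or page-networking-escalation."""
        assert role in ("primary", "secondary", "escalation"), f"unknown role: {role}"
        return f"page-{service}-{role}"

    print(page_alias("ordering"))                  # page-ordering-primary
    print(page_alias("networking", "escalation"))  # page-networking-escalation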

Primary/Secondary/Management Escalation

When a group gets large enough to handle multiple on-call rotations, a secondary on-call rotation should be created, for a few reasons. First, reducing the churn in finding the proper person to engage will decrease the mean time to engage/diagnose. Second, pages can be delayed/lost, engineers can sleep through events, etc. If you’re in the middle of a high-impact event, losing money every minute, and restoring service hinges on engaging one person, then you’re in a bad position. Lastly, there are times when an event is just too large for one person to handle. For example, having a backup who can pore through logs while the primary is debugging an application will usually speed up MTTD/MTTR. Less context switching during a high-pressure call is a Good Thing. (see On-Call Responsibilities for expectations of on-calls).

Management escalation should be brought in if the root cause for a major outage lies in their court, if you’re unable to track down their primary or secondary on-call or if the person engaged in the call isn’t making sufficient progress. Managers should help find more resources to help with an issue and should also serve as a liaison between the resolvers ‘on the ground’ fixing the problem and senior management, where necessary. See Manager Responsibilities below.

Engagement vs Notification

There’s a difference between engagement and notification during an event. Engagement is just that- it’s the mechanism for calling in the right guns to diagnose and fix an issue. Notification is a summary of where you’re at in the process of diagnosing/fixing and should be sent to all interested parties, including senior management. Each of those messages should contain different information and each audience group should also be managed differently.

Engagement

It’s my opinion that the list of folks who are engaged in fixing an issue should be controlled fairly tightly, else you risk the ‘peanut gallery’ causing the discussion to veer off track from the end goal of finding and resolving the root cause of the issue. At a previous company, we created engagement groups for each major bucket (ordering, networking, etc) and populated each with a set of aliases that would reach the on-calls of the groups typically involved in (or necessary for) that type of event.

Engagement messages should contain ticket number and impact, contact information (conference call number, IRC channel, etc), and a brief description of the issue. If this is an escalation or out-of-band (engaging someone who isn’t on-call), include something to that effect in the message:

Plz join concall 888-888-8888, pin 33333. sev1 #444444, 50% fatal rate in $SERVICE. (John requests you)
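
Templating the message keeps it consistent no matter who sends it during a 3am scramble. A minimal sketch that mirrors the example above; the parameter names are just placeholders.

    def engagement_message(ticket_id, severity, description, concall, pin, requested_by=""):
        """Build an engagement page: contact info, ticket number/impact, a brief
        description, and an optional note for out-of-band engagement."""
        msg = (f"Plz join concall {concall}, pin {pin}. "
               f"sev{severity} #{ticket_id}, {description}.")
        if requested_by:
            msg += f" ({requested_by} requests you)"
        return msg

    print(engagement_message(444444, 1, "50% fatal rate in $SERVICE",
                             "888-888-8888", "33333", requested_by="John"))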

Notification

Notification lists should be open subscription for anyone internally, but you should ensure that the right core set of people is on each list (VP of the product, customer service, etc). Even if a service isn’t directly related to the root cause of an issue, up- and downstream dependencies can impact it. Create notification lists for each major service group (networking, etc) so that people can be notified of problems with services that impact them, either directly or indirectly. The frequency of messages sent should be a part of the defined event management process, as should out-of-band notification practices for more sensitive events (communication with PR, legal, etc).

Notifications should include ticket number, brief description of the issue, who is engaged, whether root cause is known, ETA for fix and the current known impact. Be brief but descriptive with the message.

FYI: sev1 #444444, 50% fatal rate in $SERVICE. Networking, SysEng, $SERVICE engaged. Root cause: failed switch in $DATA_CENTER, ETA 20min

Incremental progress

Aliases: If you’re just starting out or don’t have an effective list management system, you can begin with a simple document or shared calendar containing who is responsible for each service. You can even go as simple as noting who the subject matter expert and group manager are for each team if the concept of an on-call doesn’t exist yet, then build aliases as you canvass that information. Contacting each team to request that they update the doc when on-call responsibilities change probably won’t be met with much resistance- you can sell it as, “if we know who to page, we won’t page you in the middle of the night”. Engineers should love that. If you utilize a system like IRC, it’s fairly trivial to write bots that allow ‘checking in’ as an on-call; storing that information in a flat file that can be read by another bot or script to engage them when necessary is a quick solution that doesn’t require navigating to multiple places while spinning up a high-impact call.

Engagement: Start with just using a standard template for both engagement and notification to get people used to the proper messaging. If you don’t have a tool, then begin with either email or CLI (mail -s, anyone?), but make sure you add a copy of each message sent to the relevant ticket’s work log so you have a timestamped record of who was contacted. Again, if you don’t have an effective list management solution, create templates (and aliases, if you’re running things from a commandline).
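
Pulling those two incremental ideas together, here’s a rough sketch of the ‘quick solution’: read whoever last checked in as on-call from a flat file, send a templated message with ‘mail -s’, and drop a timestamped copy into the ticket’s work log. The file path, its format and the append_to_worklog() helper are all assumptions for illustration, not a description of any real tool.

    import subprocess
    from datetime import datetime

    ONCALL_FILE = "/var/run/oncall.txt"   # hypothetical flat file: "<service> <email>" per line

    def current_oncall(service):
        """Look up whoever last 'checked in' as on-call for a service."""
        with open(ONCALL_FILE) as f:
            oncalls = dict(line.split(None, 1) for line in f if line.strip())
        return oncalls[service].strip()

    def append_to_worklog(ticket_id, entry):
        # stand-in for the ticketing system; keeps a timestamped record of each contact
        print(f"[{datetime.utcnow().isoformat()}Z] ticket #{ticket_id}: {entry}")

    def engage(service, ticket_id, message):
        """Send a templated engagement mail via the CLI and log it to the ticket."""
        recipient = current_oncall(service)
        subject = f"sev1 #{ticket_id} - please join the call"
        subprocess.run(["mail", "-s", subject, recipient], input=message, text=True, check=True)
        append_to_worklog(ticket_id, f"engagement sent to {recipient}: {message}")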

During an Event

Leading an Event/Conference Call

“Call Leaders”. No matter how much process, automation, visibility and tooling you have, there are always those really gnarly events that need management by an Authority of some sort. Appointing a specific group of people who have a deep understanding of the overall architecture and who have the proper mentality and clout within the organization to run those events will go a long way toward driving to root cause quickly and efficiently. Call Leaders should not be at the forefront of technical troubleshooting; they’re on the call to maintain order and focus. These people should be well-respected, organized and knowledgeable. They also have to be a tad on the anal-retentive and overbearing side. Call Leaders are tasked with prioritizing, ensuring that appropriate escalation occurs, that progress is documented in the corresponding channel(s), that the correct communication flows to the proper people, that resolution of the issue is actually achieved & signed off on, and that post-event actions are assigned. As long as they don’t over-rotate and step on the toes of the engineers who are fixing the issue, you’re all good. Re-evaluating this core group every once in a while is a great thing to do. Depending on how frequently these leaders are engaged, burnout can be an issue. (Btw, for years, our front-line operations team served this function themselves. As we grew and became more distributed, we implemented the additional Call Leader tier, with the aim of focusing on better tooling and visibility to drive down the frequency with which that new tier was engaged.)

  • Documentation: While the front-line team should be adding ticket updates, the Call Leader is responsible for making sure that happens. If done properly (and in conjunction with updates to the status of an event in an Event Management tool), a Call Leader shouldn’t have to interrupt the flow of the call to brief newcomers about the state of the event, nor should they need to ask themselves, “now, about what time did that happen?” after the event is complete. It also allows interested parties outside of the flow of the call to follow along with the event without interrupting with those annoying, “what’s the status of this event?” questions.
  • Focus on resolution. Ask leading questions to focus service owners on resolving the immediate issue (see ‘Common Questions’ below). Once root cause of an issue has been discovered, engineers may have a tendency to dive directly into analysis before the customer experience has actually been restored. There’s plenty of time after an event to do that analysis.
  • Facilitate decision making. The more people participating in an event, the more difficult it can be to make the tough decisions (or sometimes even just the simple ones). Call Leaders should act as a facilitator and as a voice of reason when necessary. For example, the call on whether to roll back a critical software deployment supporting the launch of a new product isn’t typically one you’d want an engineer to make. They don’t need that stress along with trying to diagnose and fix a production issue. Since Call Leaders are typically tenured employees who understand the business, they should be able to engage the correct people and ask the proper questions to come to a decision quickly.
  • Escalate effectively. Pay attention to whether progress is being made on the call or whether anyone is struggling with either understanding the issue or handling the work load. Ask whether you can engage anyone else to help, but realize that engineers are notorious for not wanting to ask for help. Give it a few more minutes (this all depends on the urgency of the event), then ask, “who should I engage to help?”. If an on-call doesn’t offer a name, engage both the secondary on-call (if it exists) as well as the group manager. I usually say something along the lines of, “I’m going to grab John to help us understand this issue a bit better.”, which is a fairly non-confrontational way of letting the on-call know that you’re going to bring in additional resources.
  • Release unnecessary participants. No one likes to hang out on a call if they’re not contributing to the resolution of the issue. Keeping the call pared down reduces unnecessary interrupts and also keeps on-calls happy. Prior to releasing anyone from the call, make sure that they have noted in the ticket that their service has passed health checks. (remember to note in the ticket when the person dropped off the call for future reference!)
  • Running multiple conference calls. If you’re managing an event that includes multiple efforts, then it can be a good idea to split the call. Examples of this are a networking issue that spawns a data corruption issue, or an event with multiple symptoms and/or multiple possible triggers/root causes. Communication between the two camps can become unwieldy quickly, so if you don’t have a secondary Call Leader, then utilize the group manager responsible for one of the issues. This necessitates a round of call leader training for primary managers, which ought to be completed in any case. This also makes it highly important that any proposed changes to the environment are added to your communication mechanism (ticket, IRC, etc) prior to making the change so that all parties involved in the event are aware. As you refine monitoring and visibility into the stack, those ‘unknown root cause’ events should happen less and less frequently.

Common Questions to Ask

Depending on the environment, there will be a subset of questions that you can always ask during an event to clarify the situation or guide the participants. These are a few that usually helped me when driving complex events in previous roles.

  1. What is the scope/impact of the event?
  2. What’s changed in the environment over the past X hours/days?
  3. What is the health of upstream and downstream dependencies for the service exhibiting outage symptoms?
  4. Is a rollback [of a deployment or change] relevant to consider?
  5. How complex is the issue? Are we close to finding root cause?
  6. Do we have everyone on the call we need?
  7. Is sufficient progress being made?
  8. How do we verify that root cause has been addressed?

Incremental progress

Use the front-line Ops team and managers if you don’t have sufficient staff for a call leader rotation. Invest in creating and holding training sessions for all of the major participants in your typical events, regardless. Just providing them with information on questions to ask and how to interact during an event will set the proper direction. (Remember to continue measuring your effectiveness and make adjustments often.)

Front-Line Ops Responsibilities

The front-line Ops team typically sees major issues first and is the nucleus of managing an event. The team is known as ‘NOC’, ‘tier one’, ‘operators’ or any number of other terms. Regardless of what they’re called, they’re at the heart of operations at any company, and they ought to feel trusted enough to be an equal partner in any event management process. They typically have a broad view of the site, have relationships with the major players in the company, and understand the services & tools extremely well. There’s also some serious pressure on the team when push comes to shove, including the following responsibilities.

  • SLAs. If you’re dropping money or hurting your company’s reputation every minute you’re down, then it’s vital that you define and adhere to SLAs for recognizing an event (owned by the monitoring application and service owner), submitting the tracking ticket, and engaging the appropriate people. The latter two responsibilities are owned by operations (or whoever is on the hook for ensuring site outages are recognized and addressed). I recommend keeping state within the trouble ticket about who you’ve engaged and why. We wound up building a feature into our event management tool that allowed resolvers to ‘check in’ to an event, which would add a timestamped entry into the tracking ticket (a rough sketch of this kind of check-in follows this list). This allowed anyone following along with the event- including tier-one support and the Call Leader (see below)- to know who was actively engaged in the event at any given time. It also provided a leg up on building a post mortem timeline and correcting instances of late engagement by service owners.
  • Engagement and Notification. Ops should own the engagement and basic notification for each event. If you need to cobble together some shell scripts to do a ‘mail -s’ to a bunch of addresses in lieu of actual engagement lists to begin with, so be it! Just make sure it makes it into the ticket as quickly as possible so there’s a timestamped record of when the engagement was sent. Ops is closest to the event and typically has a better understanding of which teams/individuals own pieces of the platform than anyone else. Call Leaders and service owners should request that someone be engaged into the event, rather than calling them directly. Not only does this allow other groups to focus on diagnosis/resolution, but it ensures that messages & the tracking of those messages are consistent. The exception to this should be more sensitive communication with senior management/PR/legal, which should be taken care of by the Call Leader, where relevant.
  • Documentation. Every person involved in an event should own portions of this. My opinion is that front-line ops should document who’s been engaged, who’s joined the event, who’s been released from the event, any troubleshooting they’ve done themselves (links to graphs, alerts they’ve received, high-impact tickets cut around the same time), and contacts they’ve received from customer service, where applicable. Noting things as you go along (“we need a tool for that” or “missing monitoring here”) will aid with identifying action items and creating the agenda for any required post mortem. Ops should also have an ear trained to the call at all times and should document progress if requested by the Call Leader or another service owner.
  • Aiding in troubleshooting. Each on-call is responsible for troubleshooting their own service, but there are times when the front-line Ops personnel see an issue from a higher level and can associate an issue in one service with an upstream or downstream dependency. Ops folks typically have a better grasp on systems fundamentals than software developers and can parse logs faster & more easily than their service owner counterparts. I’m a believer in ‘doing everything you can’, so if you have a front-line person who’s able to go above and beyond while still taking care of their base responsibilities of engagement and notification, then why not encourage that?
  • Keeping call leaders honest. Sometimes even Call Leaders can get sidetracked by diving into root cause analysis prior to the customer experience being restored. Front-line Ops people should be following along with the event (they need to document and help troubleshoot anyway), and should partner with the Call Leader to ensure that service owners stay on track and focus remains on resolving the immediate issue.
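
Here’s the sort of ‘check in’ feature mentioned in the SLAs bullet above, sketched very roughly. In our case it lived inside the event management tool and wrote timestamped entries to the tracking ticket; the in-memory roster below is just a stand-in to show the idea.

    from datetime import datetime

    engaged = {}   # person -> (team, joined_at); stand-in for the real tool's data store

    def check_in(ticket_id, person, team):
        """Record that a resolver has joined the event."""
        engaged[person] = (team, datetime.utcnow())
        print(f"ticket #{ticket_id}: {person} ({team}) checked in")

    def release(ticket_id, person):
        """Record that a resolver was released from the event."""
        team, joined_at = engaged.pop(person)
        print(f"ticket #{ticket_id}: {person} ({team}) released (engaged since {joined_at:%H:%M}Z)")

    def whos_engaged():
        """Answer 'who is actively on this event right now?' for tier one and the Call Leader."""
        return sorted(engaged)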

Incremental progress

This is a lot for a front-line team to cover, so pare down the responsibilities based on the organization’s needs. Engagement of the proper on-calls is imperative to reducing time to diagnose and resolve, so focus there first. If you have strong leaders to run and document events but still need to improve MTTD/MTTR, then concentrate the Ops team on providing on-calls with additional hands or visibility.

On-Call Responsibilities

A major goal of any IT Event Management process should be to enable engineers to act as subject matter experts and focus on diagnosing, resolving and preventing high-impact events. In exchange for this, on-calls should be asked to do only one thing: multi-task. 🙂

  • Be proactive If you’ve discovered a sev1 condition, get a hold of the NOC/tier1/event management team or leader immediately. Submitting a ticket outside of the typical process will likely introduce delays or confusion in engagement.
  • Respond immediately If you’re engaged into a sev1 event, join it immediately and announce yourself & what team you’re representing. A primary on-call should adhere to a tight SLA for engaging. Our SLA was 15 minutes from the time the page was sent to being online and on the conference call. This allowed time for the message to be received and for the on-call to log in. I’m not a fan of trying to define SLAs for actually resolving an issue- some problems are just really gnarly, especially once you’re highly distributed, and it’s just not controllable enough to measure and correct.
  • Take action Immediately check the health of your service(s), rather than waiting for the Call Leader to task you with that.
  • Communicate The worst thing to have on a conference call is silence when root cause is still unknown or when there isn’t a clear plan to resolution. If you’ve found an anomaly, need assistance, are making progress, need to make a change to the environment or have determined that your service is healthy, make sure that the call is apprised of what you’ve found and that the ticket is updated with your findings.
  • Escalate Don’t be afraid to escalate to a secondary, manager or subject matter expert if appropriate. No one’s going to think less of you. In fact, if you decrease the time to resolve the issue by escalating, you ought to be praised for it!
  • Restore service Stay focused on restoring service. Leave root cause discussions until after the customer experience is reinstated unless it has direct bearing on actually fixing the issue.
  • Ask questions If there’s ever a question about ownership of a task, whether something’s being/been looked at, what the symptoms are, etc., then ask the people on the call for clarification. Don’t assume that everything is covered.
  • Offline conversations: These should be kept to a minimum to ensure that everyone is on the same page. It’s not just about knowing what changes are being made to the environment during troubleshooting, although you must understand this so that engineers don’t exacerbate the issue, trample on someone’s change, or cloud your understanding of just what change “made it all better”. Something as simple as an off-hand comment about a log entry can spur someone else on the call to think of an undocumented dependency, change to software, or any number of other things related to the event. There are times when spinning off a separate conversation to work through a messy & compartmentalized issue is a good thing. Check in with the Call Leader if you feel it’s a good idea to branch off into a separate discussion.

Troubleshooting Best Practices

Not all engineers have extensive experience in troubleshooting, so here are a few hints to help participants in an event.

  • Determine actual impact before diving headlong into diagnosing, where possible
  • Check the obvious
  • Start at the lowest level commensurate with the issue. For example, if monitoring or symptoms point to an issue that is contained in the database layer, it’s relevant to focus efforts there, rather than looking at front-end webserver logs.
  • Assume that something has changed in the environment until proven otherwise
  • Making changes to the environment:
    • don’t make more than one major change at the same time
    • keep track of the changes you’ve made
    • verify any changes made in support of troubleshooting
    • be prepared to roll back any change you’ve made
  • Ask “if…. then….” questions

Manager Responsibilities

  • Take Ops seriously. Support your team’s operational responsibilities. Contribute to discussions regarding new processes and tools, and encourage your team to do the same. Take operational overhead into account when building your project slate; carve out time for basic on-call duties, post-launch re-factoring, and addressing operational action items where possible.
  • Prepare your engineers. Make sure that anyone who joins the on-call rotation receives training on the architecture they support, the tools used in the company, and who their escalation contacts are (subject matter experts, usually), and is provided with relevant supporting documentation.
  • Reachability As the management escalation, you should ensure that Ops has your contact information, or your escalation rotation’s alias. You should also have offline contact information for each of your team members.
  • Protect your engineers During a call, there may be times when multiple people are badgering your on-call for disparate information. As a manager, you should deflect and/or prioritize these requests so that your engineer can focus on diagnosing the issue and restoring service.
  • Assist the call leader You may be called upon to help make tough decisions such as rolling back software in support of a critical launch. Be prepared and willing to have that conversation. You are also the escalation contact for determining what additional resources can/should be engaged, and you may be asked to run a secondary conference call/chat, where necessary.
  • Help maintain a sense of urgency It’s possible that efforts to find root cause languish as the duration of an event lengthens. Keep your on-call motivated, and get them help if need be. Keep them focused on restoring the customer experience, and remove any road blocks quickly and effectively.
  • Post-event actions. If the root cause of the event resides in your service stack(s), you will be asked to own and drive post-event actions, which may include holding a post mortem, tracking action items, and addressing any follow-up communication where relevant.

Post-event Actions

For events with widespread impact, a post mortem should be held within 1-2 business days of the event. If you’ve documented the ticket properly, this will be fairly simple to prepare for. Either the group manager or the Call Leader will facilitate the meeting, which typically covers a brief description of the issue, major points in the timeline of the event, information on trigger, root cause & resolution, lessons learned and short- & long-term action items. Participants should include the on-call(s) and group manager(s), the call leader, and the member(s) of the Ops team at a minimum. It may also include senior management or members of disparate teams across the organization, depending on the type of event and outstanding actions.

Action items must have a clear owner and due date. Even if the owner is unsure of the root cause and therefore can’t provide an initial ETA on a complete fix, a ‘date for a date’ applies. Make sure to cover the ‘soft’ deliverables such as communicating learnings across the organization, building best practices, or performing audits or upgrades across the platform.

Operational acceptance criteria

I have to admit that when I started creating my first Ops Acceptance Criteria (OAC), I had very little knowledge of what it entailed- I just knew it needed to happen. So I scoured the e-interweb for examples and, perhaps not-so-surprisingly, found a plethora of OAC docs which probably should have been confidential information. 🙂 I adopted general concepts from some of those docs to cobble together one of my own. As with everything else I’ll ever write about here, there isn’t a one-size-fits-all solution; every OAC must match the current environment and be reviewed/refined consistently to ensure it doesn’t become a fossil two days after publishing. But here’s a generic outline of what we created for my first IT Ops team, with the confidential/specialized parts omitted, of course. I think it’s a decent outline of what ought to be covered for a typical first-line support team, although the categorization could probably stand a refresh.

Warranty Period

Before a task/service/etc can be considered ‘handed off’, the operations team will shadow the engineering oncall (or designate) for X period of time, and roles will then be reversed, with the engineering team shadowing the operational oncall for X period of time. During the warranty period, the following checklist will be utilized and all relevant points signed off on. When all relevant issues are addressed, the operational handoff will be considered complete. One engineer and one member of the operational support staff will be paired to maintain consistency throughout the audit. These two people are responsible for ensuring a smooth handoff.

Implementation/Technology

  • Must reside on currently supported O/S versions and hardware platforms, as defined by whoever is charged with defining them.
  • Must be stable prior to handoff to the operational support team. For at least the final week of the warranty period, we make sure the service/architecture doesn’t spawn operational interrupts due to poor monitoring configuration, improper implementation or poor design.
  • Configuration/data/tuneables separate from code (where applicable)
  • Performance characteristics understood and documented

Adherence to standards

  • Default and unique configurations identified and documented
  • Deviations from standard redundancy model identified and documented
  • Any IT Security requirements met and documented

Documentation

  • Operational support documentation has been furnished utilizing the standard Ops Doc template(s).
  • Relevant tagging/categorization has been applied to documentation (where applicable, for ease of oncall duties)
  • Bottlenecks/known choke points documented
  • List of clients (where applicable) documented
  • List of dependencies documented

Operational Procedures

  • RMA (return merchandise authorization) process, including vendor SLAs, documented
  • Hardware ‘spares’ are identified and available as required
  • Vendor contact information documented
  • Routine Maintenance procedures documented
  • Satisfactory high-level code review occurs (major components and software package dependencies, where applicable)
  • Upgrade processes defined and reviewed with operational team (includes test suite as well as expected behaviour)
  • Disaster Recovery procedures tested and documented

Supportability

  • Log retention policy/location/rotation documented
  • Integration with current appropriate tool set. Must be compatible and tested/validated on the platform
  • Permissions are managed via the accepted mechanism for servers, services and/or network devices.
  • Naming convention documented and adheres to standards

Change Management

  • Exceptions to the accepted CM policies and procedures are approved by the engineering team and operations management and are listed in the proper section of the relevant CM policy document(s).
  • Operations team must be an approver/reviewer for major revisions of software or major upgrades/changes in architecture
  • Operational documentation must be provided to and reviewed by the operations team prior to requesting resources for CM completion

Monitoring

  • Alarming with standard notification processes
  • Thresholds have been tested
  • Performance and health monitoring is in place

Ticketing

  • Ticket impact levels are defined for identified failure modes
  • Operational ticket assignments match the task/request in support of customer experience and metrics.
  • Auto-submitted tickets from the system include link to operational documentation
  • Auto-submitted tickets have a de-duping mechanism, where appropriate (a rough sketch of one approach follows this list)
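
For the de-duping point above, one common approach (sketched very roughly) is to key auto-submitted tickets on the host and check, and fold repeats into the open ticket instead of cutting a new one. The open_tickets store and create_ticket() call are hypothetical.

    open_tickets = {}   # (host, check) -> ticket_id; one open ticket per alarm source

    def create_ticket(summary):
        # stand-in for the ticketing system; returns a ticket number
        print(f"cut ticket: {summary}")
        return 1000 + len(open_tickets)

    def submit_alarm(host, check, detail):
        """Cut a ticket for a new alarm, or fold a repeat into the existing one."""
        key = (host, check)
        if key in open_tickets:
            print(f"duplicate alarm folded into ticket #{open_tickets[key]}: {detail}")
            return open_tickets[key]
        ticket_id = create_ticket(f"{check} on {host}: {detail}")
        open_tickets[key] = ticket_id
        return ticket_id

    def clear_alarm(host, check):
        """Drop the key once the underlying ticket has been resolved."""
        open_tickets.pop((host, check), None)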

Escalation

  • What engineering escalation alias is paired with the service?
  • Escalation path to engineering queue and criteria defined
  • SLA for escalation defined/documented

Known Opportunities for Improvement

List any automation or standardization opportunities for this task/product, including links to the relevant action/tracking items.

Support Transition

Whiteboard sessions/”chalk talks” have been given to all operational support personnel prior to handoff. These can be held by the operational or engineering team, whichever makes the most sense. Support staff in all locations should attend these sessions first-hand (and in person, if possible).

Measuring and alerting on your infrastructure

All successful companies want to be able to answer the question, “but how are we really doing?”. There are a ton of ways to define and measure the success of your infrastructure. I won’t pretend to know all about business metrics, but I can talk about IT ops. There are a lot of opinions on even this part of measuring performance, so YMMV.

I’m going to open (and close) this post by saying, staring at a dashboard all day long is NOT an effective or acceptable means of monitoring your infrastructure, nor is email-based monitoring, which amounts to the same thing. What a waste of someone’s time! If you care enough to measure and alarm on your infrastructure, then set up sms-based notifications for critical alerts at the very least. The first time your MTTR for a major outage is increased because your oncall was tarrying in the bathroom instead of hurrying back to his terminal, you’ll understand. Talk about an awkward post mortem conversation. Additionally, think of all of the time that your engineer could focus on fixing production problems or working on automation if she wasn’t staring at a screen, scared to death of missing a critical issue.

What I think matters most

FYI- I include load balancers in the ‘Network’ bullet below, although my personal opinion is that those are systems devices and should be admin’d by systems engineers. 🙂

Site latency. You want to know how the front end is being received by your customers. There are other factors that come into play, of course. Things like flash objects rendering on the client side aren’t controllable or measurable. But having latency data, from both internal and external sources, will give you a more accurate depiction of how your site is behaving. If all you do is measure from internal sources, what happens when the border becomes unavailable? What if there are major routing issues with a specific provider in another part of the world? Say, a cable break off the coast of China? With distributed external monitoring, you may be able to change routing to avoid that carrier, rather than just seeing ingress traffic drop & wondering what the heck happened.

Service errors/timeouts. The next step when figuring out root cause for site latency is to look at the logs on the front end webservers to see if there are performance or connectivity issues for/to the services that feed into page generation. (no, the next step isn’t always to blame the network!! get over it already!!) Applications are rarely standalone in any distributed environment. Depending on how deep and distributed your stack is, you may have to check tens or even hundreds of services individually without these logs. Just make sure that the important errors (fatals and timeouts, usually) are logged at the appropriate log level. Logging every little event, regardless of impact to customer experience or functionality, will only make the job of troubleshooting a site issue more difficult.
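
The number worth alarming on here is usually a rate over a window rather than a raw count of errors. A toy sketch, assuming a log format where the severity sits in the third whitespace-separated field; adjust to whatever your services actually emit.

    def fatal_rate(log_lines, severity_field=2):
        """Return the fraction of logged requests marked FATAL or TIMEOUT.

        Assumes one request per line with the severity in the given
        whitespace-separated field, e.g.:
          2011-10-02T03:14:08Z ordering FATAL upstream timeout after 2000ms
        """
        total = bad = 0
        for line in log_lines:
            fields = line.split()
            if len(fields) <= severity_field:
                continue
            total += 1
            if fields[severity_field] in ("FATAL", "TIMEOUT"):
                bad += 1
        return bad / total if total else 0.0

    sample = [
        "2011-10-02T03:14:07Z ordering INFO rendered page in 120ms",
        "2011-10-02T03:14:08Z ordering FATAL upstream timeout after 2000ms",
    ]
    print(f"{fatal_rate(sample):.0%}")   # 50%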

Service latency. If you don’t find any critical errors or timeouts in the front end logs, take a look at latency measurements between up- and downstream dependencies. Maybe services are just slow to respond. This could be because of packet loss in the network, system utilization issues, or maybe it’s something like pileups on a backend database. Understanding where the bottleneck is occurring is a very good thing.

Databases. Point #1: I’m not a database expert. That being said, database utilization, queue lengths, read/write operations and DB failover events are all metrics that really just can’t be overlooked. Granted, you’ll probably see any major DB issues surfaced in upstream dependencies, but you absolutely have to have visibility into such a critical layer of the stack. To me, DBs have always seemed temperamental and prone to ‘weird’ behaviour. The more insight you can get to narrow down potential issues, the better.

Network. It’s great that software can/should have exponential backoff or graceful failure in the event of connectivity loss with upstream dependencies, but that’s slightly reactive and will impact the customer experience if a critical app is affected. Measuring and alarming on network connectivity/latency will allow you to drill down into issues more quickly. Looking at inter- and intra-[datacentre,rack,region] metrics is a decent starting place. But to really do network monitoring justice, you also need lower-level metrics to help drill down into root cause. Monitoring drops, link bounces, packet loss, and downed devices/interfaces are a few of the key metrics. There are soooooo many other things to monitor in a complex network (running out of TCAM space? really?); I won’t even attempt to enumerate all of them here. A solid network engineer can probably spout off 20 of them in the space of 30 seconds though.
Btw, it’s fairly simple when you’re talking about a network of maybe 100 boxes. But when you get over 50k of them behind innumerable devices with cross-datacentre dependencies, it’s a bit tougher to measure inter-switch issues from the service perspective, let alone alarm on them. (that’s a metric buttload of data!)

Basic server metrics. Servers and network devices do keel over, and applications can suck all the memory, CPU or disk I/O. Granted, the service should be built with hardware redundancy (servers, switches, network, power) and the application should be able to handle a failed machine with no impact once you’re in a truly distributed environment. But monitoring the machines obviously needs to happen- someone’s gotta know that a machine is dead so they can fix/replace it.

Page weight. This isn’t an obvious one, and there isn’t necessarily a direct correlation between page weight and latency. But if you don’t see timeouts, connectivity issues or server/device problems, performance degradation could be as simple as the site serving heavier pages. (sometimes you just have to bite the bullet & roll out a heavier page!)

Alerting

Get some historical data on your alarms before you turn on alerting so you know that you’ve set the proper thresholds. The last thing you want to do is turn on an alert and barrage a [most likely] already-overworked oncall. Exceptions should include alerts stemming from a real-life high-impact event: if a condition is integral enough to cause an outage or noticeable customer impact, then it needs to have a monitor and alarm. You can always adjust the threshold for these one-offs as you go along. And before you turn on alarms, take care of the next section too. Seriously!
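
A quick way to sanity-check a threshold before the pager goes live is to replay it against history and count how often it would have fired. A rough sketch, assuming you can pull a list of historical samples for the metric; the 99th percentile is just an example starting point, not a recommendation.

    def suggest_threshold(samples, percentile=99.0):
        """Pick a starting threshold at roughly the given percentile of
        historical samples, so the alert rarely fires on a normal day."""
        ordered = sorted(samples)
        index = min(len(ordered) - 1, int(len(ordered) * percentile / 100))
        return ordered[index]

    def would_have_fired(samples, threshold):
        """How many historical samples would have paged someone at this threshold?"""
        return sum(1 for value in samples if value >= threshold)

    history = [0.01, 0.02, 0.02, 0.03, 0.05, 0.02, 0.01, 0.04, 0.30]  # e.g. daily fatal rates
    threshold = suggest_threshold(history)
    print(threshold, would_have_fired(history, threshold))   # 0.3 1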

Documentation

How do you know what to do with the alerts? This deserves its own section here because it’s so important. If you’re ‘together’ enough to measure and alert, then you should also be ready to make sure that the folks who are supporting the alarms are well-informed about what to do with them… before the alerts start paging them. Run books are a completely separate topic, but in short, they should include simple step-by-step directions with clear examples and/or screen shots, escalation information, and links to architecture diagrams. They should be created on the premise that anyone with a basic understanding of the system/service/network can take care of the issue at 3am without any handoff. Or, at worst, the body of the doc should contain simple decision-tree-type information such as “if you see XXX, you probably want to start looking at YYY”.

Measuring “Availability”

There are myriad ways that “uptime” (one of my least favourite words in the tech world) or Availability can be measured. But what does “availability” even mean? Before you measure it, you’ll need to work with both technical and business partners to define it. Every business has slightly different goals, business models, architectures, etc, so there can’t really be one universal definition of what “up” or “down” means. Think carefully about building critical processes or monitors around this measurement, though. For example, if you run an e-commerce site and measure availability by deviation from the number of orders received in a given time period, beware that some number of customers will be impacted before you can see & act on it. It’s fine to alarm on it, but alerting on latency or the number of timeouts will enable much quicker detection and recovery (MTTD and MTTR), since those metrics sit at least one level farther down the stack- a direct technical symptom rather than a lagging business-level signal. I’m not saying that you shouldn’t alarm on order rates in this case- just that you should always consider whether there’s a less reactive metric to key off of for critical ops coverage.
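To make the order-rate example concrete, here's a minimal sketch that compares the current window's order count against the same window in previous weeks; all the numbers are invented, and as noted above this is a lagging signal, not a substitute for lower-level alerts.

```python
# Minimal sketch: flag a large drop in orders vs. a baseline built from the
# same time window on previous weeks.
def order_rate_deviation(current_orders, baseline_orders):
    """Return fractional drop vs. the historical baseline (0.25 == 25% down)."""
    baseline = sum(baseline_orders) / len(baseline_orders)
    if baseline == 0:
        return 0.0
    return (baseline - current_orders) / baseline

# orders seen 14:00-14:15 today vs. the same window over the last four weeks
current = 310
same_window_previous_weeks = [420, 450, 395, 430]

drop = order_rate_deviation(current, same_window_previous_weeks)
if drop > 0.20:  # alarm if we're more than 20% below the baseline
    print(f"ALARM: orders down {drop:.0%} vs. baseline")
```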

What about measuring the “Customer Experience”? Is that the same as, or different from, “Availability”? Short story: it depends on how you define it. You could measure TTI (time to interact), which is most likely going to be different. I know a few websites that do this, and if you can define the cut-off reliably (do background processes count? what exact functionality needs to be available to allow a customer to ‘interact’?), then I’m all for TTI as one measurement. Percentage of fatals in core services could also be included here. Granted, that’s more of a real-time metric and one that’s already mentioned above, but it’s also good for a weekly roll-up of site health.

Measuring vs Reporting, Dashboards vs Decks

Not every single thing you can measure should be reported or alarmed on. I should have put this as a disclaimer right at the top of this doc, really. There is such a thing as ‘paralysis by analysis’. It’s fine if a single service owner wants to review 50 or 60 different metrics, but there should be no more than 3-5 true health metrics. If you have more than 5 top-priority metrics (and I’d count something like ‘aggregate log fatals’ as a single metric here), then you should keep working on your monitoring configs, or maybe revisit the way your service responds to failures.

Dashboards are [near-]real-time and decks are compiled from historical information (summaries). As an Ops manager, if someone comes to me with a deck from last week and tells me we have a site-impacting issue, then 1) we’ll run the event and make sure it gets handled and 2) the owner of the service will be conducting a monitoring audit of their service in fairly short order. I have no problem with a discussion that begins with, “I see latency’s been increasing over the past two weeks (without breaching the alert threshold)- any ideas what might be up?”. In fact, that’s exactly the type of discussion that a deck should incite. It’s just the times when someone might say, “Hey- it looks like we had a 50% FATAL rate on this core service last week. Any ideas?” that I get a little perturbed.

Again, staring at a dashboard all day long is NOT an effective or acceptable means of monitoring your infrastructure, nor is email-based monitoring, which amounts to the same thing. While dashboards will provide fairly up-to-date visibility into the health of a service, any critical metrics that are being rendered in the dashboard should also have a corresponding alert.

Oh yeah- and remember to monitor the monitoring system. 😉
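One common way to do that (my example, not a prescription) is a dead-man's-switch style check: the monitoring system touches a heartbeat on every run, and something completely independent alarms if that heartbeat goes stale. A minimal sketch, with hypothetical paths and timings:

```python
# Minimal sketch: alarm if the monitoring system hasn't checked in recently.
# This checker must run on infrastructure independent of the monitoring stack.
import os
import time

HEARTBEAT_FILE = "/var/run/monitoring/heartbeat"  # hypothetical path
MAX_AGE_SECONDS = 300  # the monitor should check in at least every 5 minutes

def heartbeat_is_fresh(path=HEARTBEAT_FILE, max_age=MAX_AGE_SECONDS):
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        return False  # heartbeat file missing entirely
    return age <= max_age

if not heartbeat_is_fresh():
    print("ALARM: monitoring system has not checked in recently")
```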

Building a successful front-line Ops team

Okay, this has turned into a mini-novel, but I think these are all important points. There are a ton of others to consider, but I don’t want this to turn into War and Peace. 🙂 While the examples are ops-focused, I suppose most, if not all, of these points can apply to team building in general.

#1: Understand your team’s work load and type
Before you build a team, you really need to know what you’re trying to address by hiring one. How can you build a balanced and ‘right-sized’ team if you don’t know what your work load looks like? Say you’re hired and your mandate is to hire four senior systems engineers to fill out a new front-line operations team. You have just 500 machines, fewer than 20 incoming requests per day, 60% of your work load is deploying software (running a series of scripts) and another 25% is mostly rebooting machines. Doesn’t hiring four senior engineers seem like a bit much?

#2: Hire for the level of work you currently have
It can be really tempting to over-hire for the role(s) you’re trying to fill- especially if you’re given the green light to “go hire rock stars”. But what happens if you hire a rock star but don’t have the senior-level work to keep them challenged? So in the scenario above, what if you hire a couple of junior engineers who can cover the ‘grunt work’ but who have a lot of potential, and supplement that with one senior engineer who can code improvements in the deployment system and create some tooling around host reboots? Face it: if you hire a senior systems engineer to reboot machines all day with no time allowed to automate that work away, he/she won’t stick around for long. A typical onboarding process can take four weeks or longer to complete, and it usually takes a new hire three months (-ish) to feel moderately comfortable in their role. That’s a lot of time to invest in someone just to do it all over again, not to mention the disruption in team cohesion/morale and the group’s road map.

#3: Hire engineers who can help themselves but who still have runway to grow
This goes hand-in-hand with my points above. Don’t forget that if you’re hiring a front-line team, you’re most likely hiring more junior engineers who need to have a development plan. They need work that’s challenging, but not *too* challenging. It’s a delicate balance, but hiring someone who’s a novice at system administration to reboot machines- and only reboot machines- for their entire career isn’t doing anyone any favours. You’re hiring ‘go-getters’ though, so you may have to rein them in or point them in the right direction.

#4: Hire for heterogeneity
I’m a strong proponent of not building a team of people who all have the same skill set, for a few reasons. First, you could wind up with gaping holes in knowledge (i.e. all systems engineers with no networking experience). Second, each person, by virtue of having a different skill set, will have built-in opportunities to mentor the other people on the team. Third, people who don’t have a given skill set will have an opportunity to learn something new. Fourth, it’s good to have differing perspectives so you can explore all the options when solving an issue. Fifth, you need room for people to grow and get promoted at varying rates. Lastly, it’s my personal opinion that a modicum of disparity in skill sets is healthy- it keeps people on their toes and promotes healthy competition if managed properly. (reward not only the technical achievements but the ‘softer’ skills like process improvements and customer service, and you should wind up with healthy peer pressure)

#5: Have a detailed training plan for your n00bs
It’s not just about technical training, although understanding the architecture and the technologies the company employs is very important. New team members need to understand the tools, processes, and major players/teams involved in each of the large buckets of work/requests they’ll come across. Putting some thought into the plan up front will actually decrease the time it takes to get someone up to speed. Start with the fundamentals; explain typical requests and accompany that discussion with hands-on training for each task. Pair each new hire with a mentor who’s completed the task before. Knowledge transfers on specific architecture components by subject matter experts may also be helpful, but time those sessions so that new hires aren’t inundated with information they can’t immediately apply.

#6: Measure your work load/effectiveness/progress
If you’ve taken care of point #1, then you should be able to measure your effectiveness. For an Ops team, this shouldn’t just cover basic ticket metrics like the number resolved and time to resolve. It should include the type of issues coming into your queue (success is a downward trend in tickets relative to your drivers, which could be the number of machines you support, the number of changes in the environment, etc), improvement in basic measurements like site/service latency or errors/timeouts, and customer satisfaction. That last metric is pretty difficult to measure without canvassing customers directly, but proxy metrics like the number of tickets reopened for the same issue or total contacts per issue are a start. My favourite metrics are the ‘types of issues coming into the queue’, and I’ve had some success in prioritizing our team’s work load to drive down those interrupts that hog the greatest amount of time or effort. Having short- and mid-term goals for efficiency gains, and reviewing progress against them in team meetings, shows 1) that team members’ work is making a difference and 2) that the ‘crap’ work is going the way of the Dodo, freeing up the team to work on increasingly challenging issues, which means they get more growth opportunities. It’s a virtuous cycle. It does take some time to train up new hires, and if you’re starting from scratch with a new team, you may not see great strides in your metrics to begin with, so temper your goals according to where your team is in its evolution.
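A minimal sketch of the "tickets relative to your drivers" idea: normalise weekly ticket counts by fleet size so that growth in the environment doesn't hide (or fake) progress. All numbers here are invented for illustration.

```python
# Minimal sketch: tickets per 100 machines per week as a trend metric.
weekly = [
    {"week": 1, "tickets": 240, "machines": 4000},
    {"week": 2, "tickets": 235, "machines": 4300},
    {"week": 3, "tickets": 228, "machines": 4700},
]

for row in weekly:
    per_100_machines = row["tickets"] / row["machines"] * 100
    print(f'week {row["week"]}: {per_100_machines:.1f} tickets per 100 machines')
```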

#7: Sell the team to your partners and customers
If this is a new team, you’re probably going to have to play the role of promoter, even if everyone in the entire company is on board with your charter. Point #6 will certainly enable this, and as long as you’re making progress toward your goals and showing value to the rest of the organization, this should just feel like part and parcel of your job as the leader of your team.