No IT Ops process incites eye-rolling more quickly or more often than IT Change Management (CM), which is why this post has taken me almost 3 months to finish off. I’ve been shuddering at the thought of opening up that can o’ worms. But it’s good for the business and good for engineers, so it’s gotta be done. Attaching your name to changes made in the environment isn’t a bad thing. In fact, a CM process can save your bacon (mmmm… bacon…) if you’ve been diligent.
The basic purposes of a CM process are to
- enable engineers to manage change within the environment while minimizing impact
- provide notification and documentation for potentially-affected partners/customers, as well as peer/dependent groups who may need to troubleshoot behind you
- furnish checks and balances- an opportunity to have multiple sets of eyes looking at potentially-impacting changes to the environment
The amount of change in your architecture escalates as the environment becomes larger and more distributed. As the number of services in your stack increases, the dependencies between services may evolve into a convoluted morass, making it difficult to diagnose major issues when they occur. Tracking changes in a distributed environment is a critical factor in reducing MTTD/MTTR. Implementing a CM process and tool set is a great way to enable the necessary visibility into your architecture.
Btw, don’t ever delay resolving a legitimate business-critical issue because of CM approval. Just do it and ask for approval afterward.
I don’t like CRBs
I really don’t have much to say about this. I just wanted to get it in writing up front. I don’t see a lot of value in “Change Review Boards”. It’s my opinion that every org within the company ought to own their own changes and the consequences of any failures stemming from them. Management was hired/promoted because they’re supposed to be able to run their own discrete units well. They ought to be trusted to do that without interference by someone who has no idea about the architecture or the technical facets of a change. Customer approvals (internal, mostly- external where appropriate) and devotion to constant and thorough communication can circumvent any perceived need for centralized oversight for changes. Avoiding a CRB element also allowed us to move much faster, which is something almost every company craves and appreciates.
Why you need CM
If you’re reading this post, then hopefully you already have a passing notion of why CM is a critical component of a truly successful operation. Just prior to rolling out our new CM process and tool set at Amazon, we were besieged with outages, the top three root causes being botched software deployments, roll-outs of incorrect configuration, and plain ol’ human error. Our architecture was fairly sizeable at that point, and we needed better communication and coordination regarding changes to it. While I obviously can’t provide hard stats, I can say that over the first three years with the new process, the number of outage minutes attributed to fallout from planned changes in the architecture was reduced by more than 50% while the architecture continued to grow and become more complex. Our CM process contributed mightily to this reduction, along with updated tooling and visibility.
Here are just a few discrete points about why you need CM (I’m sure there are a ton more I haven’t included).
- Event Management. Okay, this is a very Ops-manager-focused point, I’ll admit. When you change stuff in the environment without us knowing about it, we tend to get a little testy, as do the business and your customers. MTTR lengthens substantially when you have to figure out what’s changed in order to help identify root cause. Controlling the number of changes in a dynamic environment can significantly reduce the number of “compound outages” you experience. These are some of the most difficult outages to diagnose, and therefore some of the longest-lived. (Deploying new code at the same time as a major network change? tsk tsk… Which change actually triggered the event? Which event is a potential root cause?) You’ve probably been in a situation where a site or service is down and the triggering events and/or root causes are unknown. One of the first questions to ask during an event is, “what’s changed in the environment in the past X time period?” Pinpointing that without a CM process and accompanying tool set can be nigh impossible, depending on the size/complexity of the environment and scope of your current monitoring and auditing mechanisms.
- Coordination/Control. Controlled roll-outs of projects and products address quite a few critical points. In a smaller company, this is even more important, as the resources to support a launch are typically minimal. In any sized company, too many potentially-impacting changes in the environment at the same time is a recipe for disaster (dependencies may change in multiple ways, killing one or more launches, etc). Reducing the amount of change in your environment during high-visibility launches will help the company maintain as much stability as possible while the CEO announces your newest-greatest-thing-EVAR to the press. A little bit of control honestly isn’t a bad thing. I’ve never understood why the word ‘control’ has such a negative connotation. Must be why I’m an Ops manager.
- Compliance. I’ve learned a bit by rolling through SOX and SAS70 compliance exercises. You need a mechanism to audit every single change to critical pieces of your architecture. Working out all of the bugs in the process prior to being required to adhere to these types of audits is definitely preferable. Granted, you may enjoy some leeway in your first audit to get your house in order, but why waste time & create a fire by procrastinating?
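The “what’s changed in the past X time period?” question from the Event Management point is easy to answer once changes are recorded somewhere queryable. Here is a minimal sketch, assuming change records live in a simple in-memory list; the `ChangeRecord` fields and the `changes_in_window` helper are hypothetical names for illustration, not part of any particular CM tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeRecord:
    # Hypothetical minimal record; a real CM tool would carry many more fields.
    change_id: str
    service: str
    started_at: datetime
    summary: str


def changes_in_window(records, now, window):
    """Return records that started within `window` before `now`, newest first."""
    cutoff = now - window
    recent = [r for r in records if cutoff <= r.started_at <= now]
    return sorted(recent, key=lambda r: r.started_at, reverse=True)


# Made-up sample data for the example.
now = datetime(2024, 1, 10, 12, 0)
records = [
    ChangeRecord("CM-1", "network", datetime(2024, 1, 10, 11, 30), "router ACL update"),
    ChangeRecord("CM-2", "payments", datetime(2024, 1, 9, 8, 0), "code deployment"),
    ChangeRecord("CM-3", "network", datetime(2024, 1, 10, 9, 15), "config push"),
]

recent = changes_in_window(records, now, timedelta(hours=4))
for r in recent:
    print(r.change_id, r.service, r.started_at.isoformat(), r.summary)
```

Sorting newest first puts the most likely triggers at the top of the list, which is usually where you want to start looking during an event.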
These nine points can make a huge difference in ensuring the successful adoption of something as potentially invasive to a company’s work flow as this process might be.
- Automation. Anything to do with a new process that can be automated or scripted should be. This includes the tool set that enables the CM process itself. If you can create a template for common changes, do it. Take it a step further and build ‘1-click’ submission of change records. Move your change records through the approval hierarchy automatically so approvers don’t have to do so manually. There are myriad ways to streamline a process like CM to save engineer time and effort.
- Automation, part deux. This time I’m talking about carrying out the changes themselves. I know that there are varying schools of thought on this, and I’m by no means saying that automation cures all evils. But automation does reduce human error- especially when you’re completing tasks such as changing network routing or updating system configs across a fleet. The less chance for human error, the better. If you’ve watched Jon Jenkins’ talk on Amazon’s Operational efficiencies, you know that automation allows their developers to complete a deployment every 11.6 seconds to potentially thousands of machines with an outage rate of 0.0001%. Trust me- before we had Apollo, Ops spent a lot of time running conference calls for outages stemming from bad deployments.
- Tiered approvals work. Not every single change in the environment requires a change record, but every change must be evaluated to ensure the proper coverage. Critical changes in the platform or infrastructure which have the potential to impact the customer experience just ought to require more oversight. As a shareholder and a customer, I know I appreciated the fact that we had multiple levels of reviews (peer/technical, management, internal customer) to catch everything from technical mistakes in the plan to pure timing issues (making a large-scale change to the network in the middle of the day? dude.) There are also many changes which have zero impact and which shouldn’t require numerous sets of eyes on them prior to carrying them out. Completing the 99th instance of the same highly-automated change which hasn’t caused an event of any kind in the last X months? Forgoing approvals seems appropriate. See “The Matrix” below for more information.
- Err on the side of caution. This doesn’t necessarily require moving more slowly, but it’s a possibility. For changes that could potentially impact the customer experience, a slight delay may prevent a serious outage for your site/service. If you’re unsure whether your potentially-invasive change might conflict with another one that’s already scheduled, then delay it until the next ‘outage window’. Not 100% sure that the syntax on that command to make a wholesale change to your routing infrastructure is correct? Wait until you can get a second set of eyes on it. You’d much rather wait a day or two than cause an outage and subject yourself to outage concalls, post mortems, and ‘availability meetings’, guaranteed.
- Trust. Reinforce that implementing a CM process has nothing to do with whether or not your engineers are trusted by the company. You hired them because they’re smart and trustworthy. It’s all about making sure that you preserve the customer experience and that you’re aware of everything that’s changing in your environment. Most engineers are pretty over-subscribed. Mistakes happen, and it’s everyone’s job to guard against them if at all possible. The process will just help you do that.
- Hold “Dog & Pony Shows”. Our new CM process required many updates to most of our major service owner groups’ work flows. It wasn’t just about learning a new tool. We had new standards for managing a ‘tier 1’ service. When the time came to roll out the new process company-wide, we scheduled myriad training sessions across buildings and groups. We tracked attendance & ‘rewarded’ attendees with an icon to display on their intranet profile. This also provided us a way of knowing who was qualified to submit/perform potentially-impacting changes without having to look at roll-call sheets during an event. I always left room for Q&A, and re-built the presentation deck after each session to cover any questions that popped up. We received some fabulous feedback from engineers while initially defining the process, but the most valuable input we collected was after we were able to walk through the entire process and tool set in a room full of never-bashful developers.
- Awesome tools teams are awesome. Build a special relationship with the team who owns developing and maintaining the tool set that supports your process. A tools team that takes the time to understand the process and how it applies to the various and disparate teams who might use the tools makes all the difference. Quick turn-around times on feature requests, especially at the beginning of the roll-out, will allow you to continue the momentum you’ve created and will show that you’re 1) listening to feedback and 2) can and will act on the feedback.
- Be explicit. Be as explicit as possible when documenting the process. Don’t leave room for doubt – you don’t want engineers to waste time trying to interpret the rules when they ought to be concentrating on ensuring that the steps and timeline are accurate. When it doesn’t make sense to be dictatorial, provide guidelines and examples at the very least.
- Incremental roll-out. I always recommend an incremental roll-out for any new and potentially-invasive process. Doing so allows for concentration on a few key deliverables at any given time, easing users into the process gradually while using quick wins to gain their support, gathering feedback before, during and after the initial implementation, and measuring the efficacy of the program in a controlled fashion. Throwing a full process out into the wild to “see what sticks to the wall” isn’t efficient, nor does it instill user confidence in the process itself. In startup cultures, that might work for software development, but avoid asking engineers and managers to jump through untested process hoops while they’re expected to be agile.
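The first Automation point mentions templates for common changes. A minimal sketch of what that could look like, assuming a change record is just a dictionary and that timeline steps use `{placeholder}` fields; the template name, field names, and `change_from_template` helper are all made up for illustration.

```python
import copy

# Hypothetical template store: one entry per common, well-understood change.
TEMPLATES = {
    "config-push": {
        "tier": 2,
        "business_impact": "only internal sites impacted",
        "timeline": [
            "00:00 confirm monitors for {service} are healthy",
            "00:05 push config version {version} to {service}",
            "00:15 confirm monitors for {service} are still healthy",
        ],
        "rollback": [
            "00:00 push config version {previous_version} to {service}",
            "00:10 confirm monitors for {service} are healthy",
        ],
    },
}


def change_from_template(name, **params):
    """Copy a template and fill its placeholders, leaving the template untouched."""
    record = copy.deepcopy(TEMPLATES[name])
    for section in ("timeline", "rollback"):
        record[section] = [step.format(**params) for step in record[section]]
    record["template"] = name
    return record


record = change_from_template(
    "config-push", service="recommendations", version="42", previous_version="41"
)
for step in record["timeline"]:
    print(step)
```

A missing parameter raises a `KeyError` instead of producing a half-filled record, which is the behaviour you want before anything gets submitted for approval.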
I’m a firm believer in the flexibility of a stratified approach to CM. Not every single type of change needs a full review, 500 levels of approvals, etc. We as an organization (Amz Infrastructure) put a lot of thought into the levels and types of approvals required for each specific type of change- especially in the Networking space, where errors have the potential to cause widespread, long-lasting customer impact. We analyzed months of change records and high-impact tickets, and we took a good hard look at our tool set while coming up with a document that covered any exceptions to the “all network changes are tier-1, require three levels of approval and at least 5 business days’ notice” definition. Here’s a sanitized version of a matrixed approach:
We set up a very simple process for adding new changes to the “exception list”. Engineers just sent their manager a message (and cc’d me) with the type of change they were nominating, the level of scrutiny they recommended and a brief justification. It was usually 3-4 sentences long. Then there’d be a brief discussion between myself and the manager to make sure we were copacetic before adding it to the CM process document for their particular team. Last step was communicating that to the relevant team and clearing up any questions – typically in their weekly meeting. Voila!
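To illustrate the idea of a default rule plus an exception list, here is a minimal sketch in code. Only the default (three levels of approval, at least 5 business days’ notice) comes from the text above; the assumption that the three levels are peer, manager and VP, and every entry in `EXCEPTIONS`, are hypothetical and are not the actual matrix.

```python
# Default from the text: three levels of approval and 5 business days' notice.
# Naming the three levels peer/manager/vp is an assumption for this example.
DEFAULT_POLICY = {"approvals": ("peer", "manager", "vp"), "notice_business_days": 5}

# Hypothetical exception list: change types nominated for reduced scrutiny.
EXCEPTIONS = {
    "replace-failed-hardware": {"approvals": ("peer",), "notice_business_days": 1},
    "automated-acl-push": {"approvals": (), "notice_business_days": 0},
}


def policy_for(change_type):
    """Return the exception policy if one exists, else the strict default."""
    return EXCEPTIONS.get(change_type, DEFAULT_POLICY)


for change_type in ("core-router-upgrade", "replace-failed-hardware", "automated-acl-push"):
    policy = policy_for(change_type)
    print(change_type, policy["approvals"], policy["notice_business_days"])
```

Falling back to the strictest policy for anything not on the list keeps the safe behaviour as the default; a change type only gets lighter treatment once someone has explicitly nominated and justified it.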
We created guidelines and checklists for reviewers and approvers for the ‘soft’ aspects of change records that weren’t immediately apparent by simply reading the document. We trusted the people involved in the approval process to use their own solid judgement where appropriate, since no two situations or changes are the same. Here are a few of the more major guidelines that I remember; each organization/environment combination will require its own set, of course.
- Timing of submission. Our policy was to accept after-the-fact changes for sev1 tickets, and Emergent CMs for some sev2 tickets (see below, “Change Record”). Using inappropriately-defined sev2 tickets to circumvent the process was obviously grounds for rejection/rescheduling. The same applied to Emergent changes due to lack of proper project planning, which are rarely worthy of the emergent label.
- Level of engineer. Ensure that the person responsible for the technical (peer) review has the correct expertise (product or architectural), and that the technician of the change is of the proper level for the breadth and risk involved. Assuming that a junior engineer can make large architectural changes and then has the necessary competencies to troubleshoot any major fallout most likely won’t set them – or your customers – up for success.
- Rejecting change records. We provided a few guidelines for gracefully rejecting a CM, including giving proper feedback. For example, rather than saying, “your business justification sucks”, you might say, “it’s unclear how this change provides benefit to the business”, or “what will happen if the change doesn’t happen?” (which were both questions included in our CM form).
- Outage windows. Unless your change system enforces pre-defined outage windows, you’ll need to review the duration of the change to ensure that it complies. If a change bumps up against the window, you might want to ask the technician about the likelihood that the activity will run long, and request that that information be both added to the change record and communicated to affected customers.
- Timeliness of approvals. This is more of a housekeeping tip, but still important. Engineers expend a lot of time and energy planning their changes, so the least the approvers can do is be timely with their reviews. Not only is it courteous, it helps the team hit the right notification period, the engineer doesn’t need to spend even more time coordinating with customers to reschedule, and the remainder of your change schedule doesn’t have to be pushed back to accommodate the delay.
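The outage-window review described above can be partly mechanized. A minimal sketch, assuming the window and the change are both expressed as start/end datetimes; the `window_check` name and the 30-minute margin that flags a change as “tight” are arbitrary choices for the example.

```python
from datetime import datetime, timedelta


def window_check(change_start, change_end, window_start, window_end,
                 margin=timedelta(minutes=30)):
    """Classify a change as 'outside', 'tight', or 'ok' relative to a window."""
    if change_start < window_start or change_end > window_end:
        return "outside"
    # Inside the window, but finishing so close to the end that an overrun
    # would spill out of it: worth a conversation with the technician.
    if window_end - change_end < margin:
        return "tight"
    return "ok"


# Made-up window: 02:00 to 06:00 on a Sunday morning.
window_start = datetime(2024, 1, 14, 2, 0)
window_end = datetime(2024, 1, 14, 6, 0)

print(window_check(datetime(2024, 1, 14, 2, 0), datetime(2024, 1, 14, 4, 0),
                   window_start, window_end))
print(window_check(datetime(2024, 1, 14, 3, 0), datetime(2024, 1, 14, 5, 45),
                   window_start, window_end))
print(window_check(datetime(2024, 1, 14, 5, 0), datetime(2024, 1, 14, 6, 30),
                   window_start, window_end))
```

A “tight” result doesn’t reject anything; it just prompts the reviewer to ask how likely the work is to run long and to make sure that answer lands in the change record.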
This was the biggest pain in my arse for months, I have to say- about four hours every Sunday in preparation for our org’s Metrics meetings during the week. We had expended so much effort in defining the process, as well as educating customers and engineers, and our teams had made a huge mind shift regarding their day-to-day work flow. We absolutely had to be able to report back on how much improvement we were seeing from those efforts. We measured outages triggered by Change events, adherence to the process, and quality of change records. Most of our focus was on quality, as we knew that quality preparation would lead to fewer issues carrying out the actual changes.
Completing an audit for each of the seven teams in Infrastructure entailed reviewing the quality of information provided in 8 separate fields (see below, ‘The Change Record’) for every Infra change record submitted (typically around 100-125 records/week). Steps outside of just the quality of information provided included comparing against each team’s exception list to ensure the proper due diligence had occurred, comparing timestamps to audit the notice period, and examining whether the proper customers had been notified of and approved the change.
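Some of these audit steps lend themselves to scripting. Here is a minimal sketch, assuming each change record is a dictionary; the three field names stand in for the eight audited fields (which aren’t enumerated here), and the `approved_at`, `scheduled_start` and `customer_approved` keys, the 48-hour default and the `audit_record` helper are hypothetical. Judging the quality of what was written still needs a human.

```python
from datetime import datetime, timedelta

# Hypothetical stand-ins for the audited free-text fields.
REQUIRED_FIELDS = ("business_impact", "timeline", "rollback_plan")


def audit_record(record, required_notice=timedelta(hours=48)):
    """Return a list of findings for one change record (empty list = clean)."""
    findings = []
    for name in REQUIRED_FIELDS:
        if not str(record.get(name, "")).strip():
            findings.append(f"missing field: {name}")
    # Compare timestamps to audit the notice period.
    notice = record["scheduled_start"] - record["approved_at"]
    if notice < required_notice:
        findings.append("insufficient notice")
    if not record.get("customer_approved", False):
        findings.append("no customer approval")
    return findings


# Made-up record: empty rollback plan and only 24 hours of notice.
record = {
    "business_impact": "only internal sites impacted",
    "timeline": "00:00 check monitors; 00:05 apply change; 00:20 check monitors",
    "rollback_plan": "",
    "approved_at": datetime(2024, 1, 8, 10, 0),
    "scheduled_start": datetime(2024, 1, 9, 10, 0),
    "customer_approved": True,
}
print(audit_record(record))
```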
Sure wish I had an example of one of the graphs on CM that we added to the weekly Infrastructure metrics deck. They were my favourites. 🙂
Over the first 8 weeks of tracking, our teams increased their quality scores by more than 100% (some teams had negative scores when we began). Outage minutes attributed to networking decreased by approximately 30% within the first 6 months. We also had coverage and tracking for changes made by a couple of teams which had previously never submitted change records, including an automation team which owned tier-1 support tools.
Notification, aka “Avalanche of Email”
To be perfectly frank, we never really figured out how to completely combat this. We did build a calendar that was easy to read and readily-available. We also had an alternate mechanism for getting at that same information if a large-scale event occurred and the main GUI wasn’t reachable, which is typically when you need a CM calendar the most. Targeted notification lists did help. For example, each service might have a ‘$SERVICE-change-notify@’ list (or some variant) for receiving change records related to one particular service. Over-notification is a tough challenge- especially when there are thousands of changes submitted each day in the environment. If anyone has a good solution, I’d love to hear about it!
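A minimal sketch of the targeted-list idea, assuming each change record names the services it affects. The `notify_lists` helper is hypothetical, and the mail domain is deliberately a parameter, with `example.com` used purely as a placeholder.

```python
def notify_lists(services, domain):
    """Build one '<service>-change-notify@<domain>' address per affected service."""
    addresses = []
    for service in services:
        address = f"{service.lower()}-change-notify@{domain}"
        # Skip duplicates so a service listed twice is only notified once.
        if address not in addresses:
            addresses.append(address)
    return addresses


print(notify_lists(["Payments", "payments", "dns"], "example.com"))
```

Sending a change record only to the lists for the services it touches, instead of one company-wide alias, is what keeps each subscriber’s volume roughly proportional to the services they care about.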
The Change Record
Yes, it took some time for an engineer to complete a change record perfectly- especially for ‘tier-1 services’, which necessitated more thorough information. Our first version of the form did include auto-completion for information specific to the submitter and technician. We also added questions into the free-text fields within the CM form to draw out the required information to prevent the back-and-forth between the submitter and approvers which might have resulted. ‘V2’ provided the ability to create templates based on specific fields, which saved our engineers quite a bit of time per record.
Here are some of the more important fields that ought to be added to a change form. They don’t comprise all of the input required- just the major points.
- Tiers/Levels. Most environments do have various ‘tiers’, or levels of importance to the health of the site/service the company is providing. For example, if you’re a commerce site, chances are your Payments platform is held to a 5-9’s type of availability figure. These services ought to be held to a very high standard when it comes to touching the environment. On the flip side, a service such as Recommendations may not be as important to the base customer experience and therefore might not need to be held to such tight requirements. Grab your stakeholders (including a good cross-section of end users of the process) to define these tiers up front.
- Start/End Time. This kind of goes without saying. It’s the field that should be polled when building an automated change calendar or when people are attempting to not trample on each other’s changes. Once the dust has settled, you can refine this to include fields for Scheduled Start/End and Actual Start/End Time. This will allow gathering more refined metrics about how long changes actually do take to complete, as well as how well teams adhere to their schedules. Setting the Actual Start time would move the change into a ‘Work in Progress’ state and send notification that the change had started. Setting the Actual End would move the record to the Resolved state.
- Business Impact. Since not everyone viewing a change was able to glean whether their service or site would be impacted, we provided engineers with drop-down selections for broad options such as ‘one or more customer-facing sites impacted’ or ‘only internal sites impacted’. We followed that with a free-text field with questions that would draw out more details about actual impact. The answers were based on “worst-case scenario” (see my point above about erring on the side of caution), but engineers typically added a phrase such as ‘highly unlikely’ where warranted to quell any unwarranted fears from customers, reviewers and approvers.
- Emergent/Non-Emergent. This was just a simple drop-down box. Any change record which hadn’t been fully approved 48 hours prior to the Scheduled Start time (when the record appeared on the CM schedule and the general populace was notified) was marked as Emergent, which garnered closer attention and review. This did not include after-the-fact change records submitted in support of high-severity issues. It was a simple way to audit and gather metrics, and it also offered customers and senior management a quick way to see high-priority, must-have changes.
- Timeline. This should be an explicit, step-by-step process, including exact commands, hostnames, and environments. Start the timeline at 00:00 to make it simpler. Scheduled start times can change multiple times depending on scheduling, and having to adjust this section every time is a pain. Timelines must always include a monitoring step before, during and after the change to ensure that the service isn’t behaving oddly prior to the change, that you haven’t caused an outage condition during the change (unless it’s expected) and that the environment has recovered after the work is complete. If you have a front-line Ops team who can help you monitor, that’s a bonus! Just don’t sign them up for the work without consulting them first.
- Rollback Plan. The rollback plan must also be an explicit, step-by-step process. Using “repeat the timeline in reverse” isn’t sufficient if someone else unfamiliar with your change is on-call and must roll it back at 4am two days after the change. Include exact commands in the plan and call out any gotchas in-line. And remember to add a post-change monitoring step.
- Approvals. We opted for four types of approvals to allow focus on the most important facets of the process. Over time, we utilized stratification to dial back the involvement required of our management team and the inherent delays that came along with that. Every level of approver had the ability to reject a change record, setting it to a Rejected state and assigning it back to the submitter of the change record for updates.
- Peer review. Our peer reviewers typically focused on the technical aspects of the change, which included ensuring that the timeline and roll-back plans covered all necessary steps in the proper order, and that pre- and post-change monitoring steps existed.
- Manager review. Managers typically audited all of the ‘administrative’ information such as proper customer approval, overlap with other critical changes already scheduled, and that the verbiage in the fields (especially business impact) was easily understood by the wider, non-technical audience.
- VP review. High-risk, high-visibility changes were usually reviewed by the VP or an approved delegate. VPs typically concentrated on the potential for wider impact, such as interference with planned launches. They were the last step in the approval process and had final say on postponing critical changes for various reasons (amount of outage minutes accrued vs risk of change, not enough dialogue with customers/peers on major architectural changes, etc).
- Customer approval. We dealt with internal customers, typically software development teams, and we worked closely with each of our major customers to define the proper contacts for coordination/approval. Engineers were required to give customers at least 48 hours’ notice to raise questions or objections. Some network changes touched most of the company; in those cases, VP review and approval would cover the customer approval requirement, and we would use our Availability meeting to announce them & discuss with the general service owner community if time permitted.
None of these roles should be filled by the technician of the change itself. Conflict of interest. 😉
- Contact information. We required contact information, including page aliases, for the submitter, technician, and the resolver group responsible for supporting any fallout from the change. Standard engagement alias formatting applied. Information for all approvers was also captured in the form.
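The state changes and approval rules described in this section can be sketched as a small state machine. This is an illustration under assumptions, not the actual tool: ‘Work in Progress’, Resolved and Rejected come from the text, while the ‘Submitted’ and ‘Approved’ states, the class and its method names are hypothetical.

```python
# Allowed transitions; 'Submitted' and 'Approved' are assumed state names.
ALLOWED = {
    "Submitted": {"Approved", "Rejected"},
    "Rejected": {"Submitted"},  # resubmission after updates (not shown below)
    "Approved": {"Work in Progress"},
    "Work in Progress": {"Resolved"},
    "Resolved": set(),
}


class ChangeRecord:
    def __init__(self, submitter, technician, required_roles):
        self.submitter = submitter
        self.technician = technician
        self.assignee = submitter
        self.pending = list(required_roles)  # approvals still outstanding
        self.state = "Submitted"

    def _move(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state

    def approve(self, role, person):
        # Conflict of interest: the technician may not fill any approver role.
        if person == self.technician:
            raise ValueError("technician cannot approve their own change")
        if self.state != "Submitted" or role not in self.pending:
            raise ValueError(f"approval from {role} is not expected right now")
        self.pending.remove(role)
        if not self.pending:
            self._move("Approved")

    def reject(self):
        # Any approver can reject; the record goes back to the submitter.
        self._move("Rejected")
        self.assignee = self.submitter

    def set_actual_start(self):
        self._move("Work in Progress")

    def set_actual_end(self):
        self._move("Resolved")


cr = ChangeRecord("alice", "bob", ["peer", "manager"])
cr.approve("peer", "carol")
cr.approve("manager", "dave")
cr.set_actual_start()
cr.set_actual_end()
print(cr.state)
```

Keeping the transitions in one table makes it hard to start work on a change that hasn’t cleared every required approval, and hooking notifications onto `_move` would be a natural place for the “change has started” message.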