Holding 1:1s

I recently conducted a short management training session on holding 1:1 meetings, and I realized that I really should post that content somewhere. The practice is such an important one for the organization. I touched on this in my post about performance reviews a few months ago, but it definitely deserves its own spotlight.

Purpose

1:1 meetings allow both the employee and their direct manager to prepare for performance reviews all year long. It’s a way of guarding against having to spend an obscene amount of time trying to recall what an employee’s goals were at the beginning of the year, what she has accomplished, and what her hurdles were. There’s less risk of forgetting major milestones when they are consistently discussed and documented. There should be no surprises or last-minute remembrances when the actual formal review is delivered.

There are a few other benefits to holding these meetings, aside from the regular focus on the employee’s career progression:

  • it’s an opportunity to provide her timely feedback, both positive and constructive, outside of her documented goals
  • the meetings provide a regular forum to talk about personal and personnel issues, to the extent that he is comfortable talking about them
  • it’s time set aside to talk about the environment as it relates to the employee, the team, the org, the company, and even the industry
  • it’s a great opportunity to receive feedback on your own performance as a manager

Ensuring the efficacy of 1:1s is a shared responsibility between both participants. If the right level of care is taken in preparing for these discussions, the potential to strengthen your employees and your team can be immeasurable!

Preparation

Preparing for structured 1:1 meetings takes time, effort and attention, especially if it’s a new discipline for your team. It’s time well spent, however, and encouraging the proper habits up front makes the ongoing discussions more focused and productive.

The Initial Meeting

Schedule a long enough time slot for your first meeting to address all of the basics. This ensures that you are both clear on the ground rules and are beginning from the same point of view. An hour or hour-and-a-half should be sufficient. Just be sure to emphasize that subsequent 1:1s don’t necessarily need to take that long.

  • Use the first meeting to set goals, if the employee doesn’t already have them. Employees rarely have time to think about their own career, or they’ve never been taught how to think about it to begin with. One of your main responsibilities is the care & feeding of your directs. What does he want to do in the next one, three and five years? Often, employees don’t actually know where they want to be that far in the future. Start by probing what interests him, what he likes doing, who he admires. At the same time, explore what he likes least about certain roles and why (sometimes a person’s perception of a job doesn’t reflect reality). Then, build a plan around allowing him to explore a few roles within the company that might interest him.
  • Review the 1:1 document you will use to track your discussions together. Send this out prior to the meeting so the employee has time to digest it and prepare clarifying questions if necessary. If you have time, fill out a portion of it with them during the meeting to give them an idea of where to start.
  • Decide on the proper frequency for your meetings with the employee. His level, performance, autonomy, and length of time in the organization and on the team, along with the demands on your own time, are all factors that inform how frequently you should meet. If 1:1s are a new phenomenon for the team, I would start everyone on a weekly recurring meeting until both participants are comfortable with the routine. I would never pare a meeting schedule back further than every two weeks, however. Even if the meetings last ten or fifteen minutes, you both still need the opportunity to re-focus on his career to make sure everything is going according to plan.
  • Commit to your scheduled time. This is tough, especially if you have a more junior team who needs a weekly cadence or if you have a particularly large team. Interruptions to the calendar are always a risk; do a quick prioritization check before rescheduling a 1:1 to accommodate a late-binding request though. The more frequently you reschedule, the less apt the employee is to believe that you truly care about her performance.
  • Set clear expectations about the regular goals of the meeting, and make sure that your 1:1 doc format supports those goals.

Ongoing Prep Work

Most of us don’t have a lot of ‘free’ time nowadays (I can say that word; I’m old). But putting effort into making sure you’re prepared for each discussion will allow you as a manager to get the most out of the meeting and will reinforce the fact that you really do care about the employee. Devoting ten to fifteen minutes to gathering your thoughts prior to the meeting is definitely a worthwhile way to spend your time as a people manager.

  • Read previous 1:1 docs/notes to refresh your memory of what has been covered and what needs to be followed up on. If possible, carve out time in your schedule to do this at least 24 hours prior to your 1:1 to give yourself time to follow up on any actions that may still need attention.
  • Read the updated 1:1 doc from your employee (see below). Make notes where applicable.
  • Read the ‘fuzzy folder’. Most managers keep an email folder for each of their directs that contains communication about deliverables, performance, etc. I call it a ‘warm fuzzy’ folder because I like the thought of it only containing positive messages. 🙂 If there is anything new and noteworthy, add a note to your copy of the 1:1 doc.

The more frequently you prepare for the meeting, the less time this prep work will command. Once you’ve completed your own checklist, you’ll be ready to have a productive conversation.

The Doc

I’ve uploaded a copy of my own 1:1 doc template for reference. I haven’t had to update it much in the past ten years, although I did create a slightly different one for 1:1s with managers, and I’ve moved to using a private Google Form to make it easier. It isn’t pretty, but it allows us to cover all of the topics mentioned in the first section without wasting real estate.

I believe that the employee should make time to fill this out themselves and send it off to the manager ahead of the meeting. Owning the content forces them to think about their career outside of one recurring discussion, and it gives them the control over where and how to focus the meeting. The responsibility for making the most of the chat is shared between both participants, but that control is a pretty important factor in showing that the meeting is meant to help the employee, not the manager.

  • Progress Against Goals: This is a fairly self-explanatory section. Define the frequency of reviewing this based on the length of the goal, level of employee, and the consistency of progress made against them.
  • Opportunities: This section is broad by design, and could cover anything in the environment, the team, the org, the company, or the industry. Basically, it’s a chance for the employee to tell you where he believes more effort should be focused. I would rarely let someone get away with saying they have no new ideas for opportunities for improvement within the environment. It’s a challenge to them to keep making their job, team and/or company better.
  • Project Status: I use this section more for managers than employees. I get enough project status reports to choke a horse, so unless there is a pressing escalation I need to know about, this is the area I spend the least amount of time on. (no animals were harmed in the writing of that sentence, btw)
  • What Have You Learned? If someone isn’t learning as a part of their job, it might be a sign that they aren’t challenged in their current role or need a fire lit underneath them. Even if their comment is, “I learned I need a day off”, it will give you insight into their mind set and open up a discussion if it’s not already covered elsewhere in the doc.
  • What Do You Need From Me? Get your team used to using you as a resource. Show that you care about making the most of their experience on your team. Ask for ideas on your areas for improvement, blind spots, etc. It’ll only make you better as a manager, and it may provide valuable insight into the dynamics of the team and the employee’s role in it.

You can always add a separate section into your own template to track previous action items or any number of more discrete subsets of the information above. Keep it focused on the employee’s career and the health of the team, rather than tactical project topics if possible.

The Meeting Itself

I only have a few recommendations about the meeting itself. Most of the work is in the preparation and follow-through.

  • Keep the mood and tone consistent with the team’s culture and your relationship with the employee. Introducing formality into an informal environment can hurt the process of gaining the trust necessary to have an open and honest conversation.
  • Create an environment where you can concentrate on the meeting. Close any applications aside from the one you use to take notes in, and set your mobile phone to vibrate (or just turn it off if that’s possible).
  • Keep focused on the agenda. If you run long, make sure to schedule a follow-up meeting to address any important points you haven’t gotten to within the next 24-48 hours if possible. (keep the momentum!)
  • Review your notes together at the end of the session. Canvass the employee for feedback to make sure you leave the meeting with the same understanding.
  • Send your notes to the employee as quickly as possible after the meeting.

That’s it! Short and sweet.

Having Tough Conversations

Constructive feedback helps an employee improve, and should never be called negative. It’s the same reason that performance reviews should cover areas for improvement, rather than weaknesses. This is more than just semantics. It sets the tone for every subsequent conversation you have with the employee. This topic is obviously worthy of many books, but here are some short points I’ve picked up over the years that have helped with having tougher or more critical discussions.

  • Be honest. No one likes to be lied to, and she should appreciate the fact that you’re delivering the message in a straightforward manner like an adult.
  • Be objective. Separate the person from the conversation, and be prepared with concrete examples. Make sure to include the consequences of their actions. (e.g. “failure to deliver your portion of the release caused three of your team mates to work overtime”) Keep the feedback anonymous if possible, though. If you need to bring someone else’s name into it, either represent it as your own feedback (if you’ve noticed it yourself) or consider inviting the other person in for some facilitated discussion. If you’re not trained or experienced, bring in your HR representative to help.
  • Recognize perception issues. You should have enough information about your employee’s performance to ascertain whether feedback can be attributed to her performance or someone’s perception of her performance.
  • Deliver the message clearly. Avoid the old “sandwich” message, such as, “You’re doing great! You could work on the timeliness of your code submissions, but overall your performance is solid.” Does he really need to work on timeliness? If so, then be clear about it.
  • Take ownership for delivering the message. Don’t play the “good cop” role and blame someone else for the constructive feedback. If legitimate feedback is given to you by someone else, delivering it is in the best interests of your employee. If you disagree, it may be clear that there is a perception issue that you still need to address. Either way, you are the authority for your team, and you should own the message. Just don’t be a jerk about it.
  • Don’t pressure the employee for a response, especially when broaching a new topic. But don’t let her go away and fester for days or even weeks either. The end goal is to coach her, and you need to give her enough time to participate in a constructive conversation.
  • Give the employee ownership in the actions. Increase his investment in the solution by allowing him to help define the action plan, where relevant.

Following Up

It’s not enough to just bring up a performance issue and then assume that the employee will take care of it.

  • Follow up regularly. If you provide constructive feedback to the employee, it is your responsibility to maintain focus and make sure she is making progress toward addressing the issue. Following up shows you are invested in her improvement and that you expect her to hold up her end of the bargain.
  • Solicit ongoing feedback from any affected parties to make sure the plan is working. Provide timely updates to the employee so you can work on correcting his course quickly if possible.
  • Call out positive progress as reinforcement. I don’t know too many people who thrive on a consistently negative message. If she’s doing well, then make sure she hears that. She’ll be more receptive to feedback the next time around if you’ve shown that the process is meant to make her more effective and happy in her role.
  • Recognize when to bring in HR. Use your HR partner liberally, if only for a sounding board prior to having the conversation. If you are uncomfortable at all with driving the discussion on your own, ask him to give you guidance or attend the meeting along with you. A manager should always be present during performance-related conversations, but it doesn’t hurt to have someone more experienced there to help guide the meeting. If you’re unsure about whether HR ought to be involved, just ask. They should be willing and able to guide you through it.

This process does take time, but it’s so much better to have a tough discussion at the first sign of an issue than to wait until it’s out of control or affecting the team or your customers.

Headcount Planning and Management

Headcount planning and management across a team or an organization is a complex process. In a changing environment, current and forecasted headcount should be revisited regularly in light of many factors, including changes in strategy, level and type of work load, the changing work force landscape, and goals & growth plans for individual team members.

This post might be short, but the process itself requires a good deal of time and considered thought. Here are the main factors or questions I tend to ask myself and the team when reviewing current and future states. Additional ones will always pop up during the evaluation exercises, of course. And keep in mind that the process doesn’t just apply to increasing headcount. Like it or not, contraction is also sometimes a necessity if you’re running your organization like a business that wants to succeed. I’ll focus on growing the org here though. It’s a happier thought.

Understand your current work force

In order to evaluate your headcount plans and make any decisions on whether to potentially change course, you need to understand your current state.

  • Do you have a plethora of specialists? Mostly generalists? A combination of both?
  • Is your staff a mix of experience levels? Or are you heavy on seniors or juniors?
  • What are the goals and career path for each member of the staff?
  • Are any key people planning on moving on within the next 12 months?
  • Where are your human single points of failure?
  • If you have a global team, are you balanced across sites appropriately? Does it matter if teams are imbalanced?

This exercise may take considerable effort the first time you complete it. Every time you revisit it, it becomes easier, assuming extensive time hasn’t passed and the organization hasn’t changed drastically. Discuss it with your peers and customers to get their perspectives, then document it so you don’t have to reinvent the wheel down the road.

Analyze your current work load

The second piece of the pie (mmmm… pie…) is to analyze your work load. I’m not talking about knowing what your work load looks like, but actually understanding it. Throw your metrics from projects, tickets and interrupts into a spreadsheet, then use a handy-dandy pivot table to slice and dice the data if that works for you. What’s the profile of the work for your team?

  • If you lead a technical team, is the majority of the work technical? Or are ‘soft’ responsibilities such as planning and project management usurping time from actual engineering?
  • Are there any staggering imbalances? For example, do you have 20 senior engineers on your team, but 20 junior engineers’ worth of work?
  • Does a large percentage of your work load fall onto the shoulders of a small portion of your team?
  • Is there “crap work” that should be automated away or just removed? Do you have the right skill set to address those efforts?
  • Does all of the work your team is doing belong in your team? Or should it be distributed to other teams in the organization/company?

Really, you just need to answer the simple question, does your current staff actually serve your current organization well? Depending on the number of surprises you encounter during these first two exercises, you may need to refactor your work load or quickly re-organize the current staff to alleviate immediate concerns. Just make sure that you’re not making changes for the sake of change, and that the moves you make will serve the needs of the organization for a while to come.
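If a spreadsheet isn’t your thing, the same slicing and dicing can be roughed out in a few lines of Python. This is only a sketch of the pivot-table idea above: the engineers, categories and hours are made up for illustration, and in practice the rows would come from whatever exports your project, ticket and interrupt tracking tools give you.

```python
from collections import defaultdict

# Hypothetical work log: (engineer, category, hours). Invented for illustration;
# real rows would come from your project, ticket and interrupt tracking exports.
WORK_LOG = [
    ("alice", "engineering", 30),
    ("alice", "project management", 10),
    ("alice", "interrupts", 30),
    ("bob", "engineering", 12),
    ("bob", "interrupts", 8),
    ("carol", "engineering", 6),
    ("carol", "planning", 4),
]


def pivot_hours(rows, key_index):
    """Sum hours grouped by the column at key_index (0 = engineer, 1 = category)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)


def shares(totals):
    """Convert absolute totals into fractions of the overall total."""
    overall = sum(totals.values())
    return {key: value / overall for key, value in totals.items()}


by_engineer = shares(pivot_hours(WORK_LOG, 0))
by_category = shares(pivot_hours(WORK_LOG, 1))

# Largest share first, so imbalances are the first thing you see.
for label, table in (("engineer", by_engineer), ("category", by_category)):
    for name, share in sorted(table.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{label} {name}: {share:.0%} of logged hours")
```

With this invented data, one engineer carries most of the logged hours and interrupts eat a big chunk of the team’s time, which are exactly the kinds of imbalances the questions above are meant to surface.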

Understand your forecasted work load

Now it’s time to force yourself to think longer-term. Where should your team or organization be in 12, 24 or 36 months in terms of headcount and skill set? This requires visibility and foresight into the company’s and industry’s direction. If you don’t have a well-defined 3-year plan, it’s okay. Make solid estimates through the length of the well-codified information you have, then extrapolate from there, keeping an eye on potential changes or demands in the industry regularly.

  • Do you have the requisite skill sets and the proper number of people to cover the major initiatives on the horizon?
  • Do you have redundancy and contingency in technical and leadership skill sets?
  • How many of your current employees are likely to remain in their current roles – or even within your organization – in the next 1-3 years?
  • Will efficiencies or other environmental factors play a role in the required skill level in the future?
  • Are there new developments in the industry which could change the mix of skill sets in the future?

So now you know what resources you might need in order to address your forecasted work load. Now how are you going to facilitate the potential growth and the ongoing management of it?

Understand the Recruiting Landscape

Say you’ve now figured out that you need to double your work force over the next 12-18 months to hit your mid-term targets. How do you know if that headcount ramp is realistic?

  • What does the recruiting landscape look like? For example, much has been said lately about the ongoing dearth of system administrators. How will that impact your ability to cover the forecasted needs for the team?
  • Do you have the resources to recruit and interview enough candidates to hit your targets?
  • Can you create your own recruiting flywheel? Does it make sense in light of your forecasted work load? If you have the opportunity to hire junior engineers and/or managers and grow them within your organization, it’s a viable option for various reasons: it’s an opportunity for mentoring for your managers and senior engineers, it shows the org believes in career progression, and it gives you a chance to positively influence a new manager prior to them moving into a position of greater authority or influence.
  • Does it make more sense to augment your staff more quickly with contractors or interns for specific roles? If utilizing interns is a viable alternative to bolstering your work force, then great! Give them challenging and meaningful work so that they want to come back and work for you afterward. It’s a great way to kickstart that recruiting flywheel.

There is substantial investment in just getting candidates in the door and through the interviewing process. The same can be said of managing interns or contractors on an ongoing basis. If your org can’t handle that additional work load along with the responsibilities already on your plate, then it might be time to re-think your forecasts.
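To put a rough number on that investment, here is a back-of-the-envelope sketch of the interviewing load behind a hiring target. Every input is a hypothetical placeholder (the hiring target, the offer and acceptance rates, the panel size and the interview length); swap in figures from your own recruiting history before drawing any conclusions.

```python
import math
from fractions import Fraction


def onsites_needed(hires, offer_rate, accept_rate):
    """Onsite interviews needed to end up with `hires` accepted offers."""
    return math.ceil(hires / (offer_rate * accept_rate))


def interviewer_hours(onsites, interviewers_per_onsite, hours_each):
    """Total interviewer time consumed, excluding phone screens and debriefs."""
    return onsites * interviewers_per_onsite * hours_each


# Hypothetical inputs: replace with your own recruiting history.
hires = 10
offer_rate = Fraction(1, 4)    # assume 1 in 4 onsite candidates gets an offer
accept_rate = Fraction(4, 5)   # assume 4 in 5 offers are accepted

onsites = onsites_needed(hires, offer_rate, accept_rate)
hours = interviewer_hours(onsites, interviewers_per_onsite=4, hours_each=Fraction(3, 2))

print(f"{onsites} onsite interviews, {float(hours):.0f} interviewer hours")
```

Fractions are used instead of floats only to keep the rounding exact. The point is less the precise figure than the order of magnitude: if the interviewer hours come out larger than your team can spare alongside the road map, the ramp probably isn’t realistic.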

It’s not just about hiring engineers

Planning for growth isn’t just about how many engineers you need and how many you can realistically hire. There has to be a framework to support the people you bring in.

  • What changes must be made to your leadership staff to accommodate the growth? Do you need more managers? Does your current management staff own the proper knowledge or experience to lead a larger organization, which inherently has different challenges?
  • Do you have sufficient manpower to train your new hires while still making progress against the road map?
  • Do you need to consider creating another level of leadership to accommodate expansion? Perhaps you don’t need more managers, but rather team or technical leads to act as a buffer for more of the day-to-day activities. If there are changes or additions to be made, make sure those are fed back into your headcount estimates. Keep in mind that if you promote a more senior engineer into a lead role, you will lose some or most of their productivity, and they will most likely be replaced by someone either more junior or less familiar with the environment.
  • Will the infrastructure, facilities and HR processes & tools support a significant increase in headcount? If not, who owns making sure those grow along with the organization?
  • If there is an increase in projects along with headcount, do you have sufficient capability and staffing to accommodate the management of those projects?

Once you’ve answered these questions, ask yourself again, is your headcount ramp realistic? If not, what are you going to do about it? It might be time to gather your information into a coherent argument based on the numbers and deliver the tough message to senior management that the forecasted work load just isn’t reasonable. Otherwise, happy hunting!

“Rock Stars”

While this isn’t a post about hiring specifically, I’ll add a word about hiring exceptional people. “Rock stars” don’t just exist at the most senior levels. Juniors who come in with a great fire & attitude are at least as valuable, since they may have a longer life span within the organization, have a fresh perspective, and may be a bit more malleable. Regardless of their skill level, every person in the organization needs to be continually challenged and given growth opportunities. If you can’t cover that, then don’t hire them. There’s too much investment in hiring, training and coaching to make that mistake.

Introducing a New Project (to Ops)

Everyone owns slightly (or drastically) differing opinions about how an IT project ought to be managed. I don’t believe in only one correct way to run a project. The best project teams I work with or observe adapt in various ways: project by project, team by team, customer by customer. Regardless of the method, I expect everyone involved in a project to share some core principles:

  • Adhere to Defined Roles
  • Adhere to Defined Processes
  • Communicate

I’ll focus on the introduction of a hypothetical project of some size and breadth to illustrate how these principles should be applied to make the most of the project team at the outset. The first few days or weeks are the most critical for setting the tone and laying the foundation for the remainder of the assignment. For purposes of this scenario, the production operations organization as a whole is involved in 20+ disparate projects this year, and every one of them is “high priority”. At best, we have two to three engineers focused on specialized areas, from networking to storage to infrastructure core services (LDAP, DNS, etc.).

The introduction of a project is vitally important to its success, but it can be marginal at best without a formal process. Assuming there is no well-defined process, I can learn about new demands on my team in myriad ways, including

  • hallway conversations, either directly or overheard
  • random mention in an email
  • second-hand through a member of any number of non-Ops teams
  • request for resources to help with a build-out within a ticket to an engineering queue
  • random entry in a project plan
  • buried in a response to a separate, unrelated query

None of these can or should be considered a “process”.

Introduction of the Project

Here is how this particular project makes its way through the Ops organization. At this point in the project, a commitment for delivery in two months’ time (40 working days) has already been made to the customer, and the PM has already furnished an update to the customer and senior management that the project is on track. No project plan or list of requirements has been furnished to our org.

During this same time frame, our team is on the hook for delivering two other high-priority projects. Our resource plans show that we’re already 25 man days in the hole without considering this new request, and various engineers have been asked to put in extra time to get us over the hump. We’ve also learned that the project was first introduced to the software development organization four weeks ago.

  • Initial contact with the team comes from three sources: an URGENT email to me from a PM asking for resources to build out a test infrastructure (this indicates it’s already an escalation), a second request from the same PM directly to an ops ticket queue asking for work to be done on the project, and a hallway conversation between one of the Ops managers and a developer on the project with a heads-up that development work is already underway.
  • Early estimates provided to Ops come from software developers who don’t have the time or background to understand our infrastructure or support services well enough to estimate what needs to be delivered.
  • In the next Ops Management meeting, we assign a senior engineering resource to participate in design conversations with the lead developers, and a resource to build out the test architecture once we’ve vetted the design. (The request to build out a test platform makes no sense when we don’t even understand whether the design itself will work in our infrastructure.)
  • The scheduling of meetings between the engineers and the facilitation of the discussions falls on our management team. One of our managers will now slip one or more of his own priority deliverables in order to take on this portion of the project.
  • After our engineers complete their design review discussions, it is apparent that the initial resource estimates based on the information we’ve received from the developers are woefully inadequate. The project requires 3 full man weeks of Network Engineering time for design and implementation, a storage engineer for a full month, four weeks of front-line Ops work for deployment of infrastructure, software and testing, and a week of a capacity planning engineer’s time.
  • The project also requires additional hardware which was not factored into our recent server purchase, so we must determine whether we can recoup machines from other projects or place an expedited buy with one of our vendors. If we determine that we need to “steal” hardware from another project, we add another two weeks of effort from our front-line Ops team.
  • In the middle of trying to work out whether we have enough resources to hit the already-promised deadline, I receive an escalation from the PM for resources to build out the test infrastructure, and am pinged by my boss on why we don’t already have a solid Ops project plan for the initiative. It’s kind of getting ridiculous at this point.

This project is clearly not on track, and due to lack of proactive communication and solid planning, Ops is the blocker right out of the gate. (and you wonder why Ops engineers tend to be surly!) The exercise above has taken five days to complete, which means we’re now down to 35 working days to meet the deadline committed to on our behalf. There is zero chance that we can deliver on this project without missing deliverables on at least one of our other high-priority projects.
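The arithmetic behind that conclusion can be sketched quickly. The figures come from the scenario above; the conversions are my own assumptions (five working days per week, and a “full month” as 20 working days, in line with two months being 40 working days), and the “engineers full-time” number ignores task dependencies and specialization, so treat it as a floor rather than a plan.

```python
# Assumptions: five working days per week, and a "full month" is 20 working
# days (consistent with two months being 40 working days in the scenario).
DAYS_PER_WEEK = 5
DAYS_PER_MONTH = 20

# New demand from the design review, in person-days.
new_demand = {
    "network engineering": 3 * DAYS_PER_WEEK,
    "storage engineering": 1 * DAYS_PER_MONTH,
    "front-line ops": 4 * DAYS_PER_WEEK,
    "capacity planning": 1 * DAYS_PER_WEEK,
}
hardware_reallocation = 2 * DAYS_PER_WEEK  # only if we take machines from another project

existing_deficit = 25        # person-days we were already short
working_days_left = 40 - 5   # deadline window minus the five days spent scoping

base_demand = sum(new_demand.values())
worst_case_demand = base_demand + hardware_reallocation

for label, demand in (("base", base_demand), ("worst case", worst_case_demand)):
    shortfall = existing_deficit + demand
    # Average number of engineers needed full-time for the rest of the window
    # just to absorb the new work, ignoring dependencies between tasks.
    engineers = demand / working_days_left
    print(f"{label}: {demand} person-days of new work, "
          f"{shortfall} person-days short overall, "
          f"about {engineers:.1f} engineers full-time until the deadline")
```

Under these assumptions the new project adds 60 to 70 person-days on top of a team that was already 25 in the hole, which is why something else has to slip.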

Lessons Learned

Based on the core principles listed in the introduction, we have a number of items of feedback for various parties involved in the project.

Adhere to Defined Roles

  1. The roles and responsibilities for everyone involved in the project weren’t clearly defined, including determining the correct points of contact and decision makers within Ops. This led to the PM contacting both me and an engineer for the same work, as well as a developer pinging another Ops manager.
  2. Because there was no single owner, we risked either duplication of effort or no effort, both of which are inauspicious ways to kick off a project.
  3. Ops’ absence in determining the delivery dates for the project meant that the customer received a promised completion date that the project team could not honour.

Adhere to Defined Processes

  1. Multiple processes were used for introduction of the project into Ops.
  2. Because there was no agreement on how to introduce the project into our organization, I received an escalation contact prior to an initial request for resources.
  3. No regular update was given over the first four weeks of the project; Ops might have been able to scramble well enough to hit the deadline had we known about the project earlier.

Communicate

  1. The lack of a clear project plan or requirements list forced us to spend time gleaning that information from various sources prior to making progress on planning or execution.
  2. Incorrect communication to the customer about our ability to meet the deadline put us under the gun to deliver from the outset.

Preferred Introduction Process

Here is how I would expect a project of any weight to come across our organization’s radar. I am not saying that adhering to this particular process will remove resource “crunches”, late-binding requests, or competing high-priority projects. It will, however, allow for control over how projects are prioritized and how resources are distributed across them.

  1. During initial conversations regarding a new project, the PM and lead developer answer a base set of operational questions provided by the Ops team. The list of questions will vary depending on the request type.
  2. Answers to the questions are submitted to the Ops management team to be discussed internally. That discussion yields basic resource estimates and the assignment of an operational engineer to participate in design reviews.
  3. Shortly afterward, a project review meeting is held with all stakeholders to discuss requirements, resourcing, timing, and roles & responsibilities within the project team. The ops organization should send the engineer(s) slated to work on the project, at least one manager who can represent the org, and the PM responsible for the operational work.
  4. A full project plan should be published following the meeting, with clear requirements, deliverables and milestones defined. Project plans and meeting notes should be posted to an easily-accessible file share.
  5. The PM should review the plan with the customer quickly, and any requested amendments should be re-visited with the project team immediately.
  6. After the review of customer feedback, the project plan is locked down, and any new requests are considered “scope creep” and added to the project’s backlog in priority order. Each high-priority request should be discussed within the project team and should include stakeholders in order to determine whether it can be accommodated prior to committing to it.
  7. Regular succinct project updates should be furnished to all stakeholders to avoid last-minute escalations when a milestone is in danger of slipping.

Initial Project Meetings

The first meeting in any project should be a simple high-level walk-through of the project with whoever can define the proper resources for the initiative. This should happen as soon as the project has been approved and really shouldn’t last longer than 20 or 30 minutes. All we care about is getting the correct people in the room for the full scoping exercise and understanding the priority of the project compared to the other initiatives on the slate.

The second meeting should be of longer duration and include all of the resources necessary for delivery. I would much rather schedule a half day for this one than to head into a large project without all of the critical information. The agenda should cover defining the roles and responsibilities of each team member and ensuring that every participant understands the requirements well enough to scope their deliverables and provide reasonable time estimates for milestones. Engineers expect very clearly-defined roles and responsibilities up front, and I can’t blame them. Worrying about who they have to talk to about specific portions of a project, who makes what decisions, what bits of information are important and which aren’t, or who owns specific pieces of the technology direction or implementation just reduces their ability to do their jobs properly. Some questions to consider for that second meeting:

  • Who is the primary customer, and have they provided sufficient requirements?
  • What are the requirements and end goal of the project?
  • What is the priority of the project compared to other current initiatives? (while this should be addressed in the initial meetings, the engineers should also understand the relative priorities)
  • What is the communication channel with the customer?
  • Who is the subject matter expert for each major piece of technology?
  • Who makes decisions on the technical direction of the project? What happens if there is a stalemate on making a decision?

And the ‘soft’ stuff:

  • What is the escalation process? What is or isn’t worthy of escalation? Where does the buck stop?
  • What is the project manager’s role in the project?
  • What is the expectation around scoping dates for deliverables and milestones? Is contingency time added into each deliverable? Does the project itself have padding to allow for slips?
  • How often are the engineers expected to give an update on specific deliverables versus the overall progress toward the next milestone?
  • What do various stakeholders want to know about on a regular basis, and how would they like to see that information?

Answers to these questions should all be documented in an easily-accessible and visible place in the project’s doc repository from the beginning. Quite a bit of the information can be defined on an org- or company-wide level and shared across projects. I would expect a link to the document to be included in the header of the project plan and of every project meeting agenda. Any new member of the project should review the information with the PM prior to any participation in the project. Every manager who has an engineer involved in the project should also be familiar with the info, particularly anything to do with resourcing.

Regular project meetings should be scheduled (scrum/agile, waterfall, whatever). Every person with an outstanding deliverable, anyone expected to provide technical support and any “decision maker” for outstanding questions should attend or provide an update prior to the meeting if they’re unavailable. I’m a fan of stakeholders staying away from these meetings unless a prioritization discussion needs to happen. Too many pointy-haired bosses detract from actual engineering progress.

the never-ending battle

Supporting Thoughts

Aside from the introduction of the project, there are many other facets of project management that are important for ensuring smooth delivery. Communication, defining roles and processes, and reporting at the right level are all integral to a project’s ongoing success.

Communication Is Most Important

The most important responsibility of a PM is to not only facilitate communication, but to have clear, crisp communication herself. This is important for the other participants as well, but engineers are typically paid to focus on technical delivery first and foremost.

Don’t be afraid of people. A PM shouldn’t shy away from human interaction (face-to-face, video conference, phone) and at least 50% of her day should be filled with talking to people (I pulled that number out of thin air, but it seems pretty reasonable). Like it or not, you can’t get the entire context of why a deliverable is late or why an engineer is spending “so much time” on a particular deliverable from a tracking ticket.

Minimize interruptions. I would also expect a PM to strike a balance and be reasonable with the interruptions. If a deliverable is due in two weeks and yesterday’s update shows that it’s on track, I don’t see any reason to interrupt an engineer for yet-another-update. Everyone’s time is at a premium every day. Use discretion, common sense, and the guidelines that were set up in the initial meetings to determine whether that daily update is really necessary.

Be succinct. I don’t know too many people who enjoy reading a 3-page project update or listening to a 5-minute monologue, only to learn that the project is on track and there’s nothing to worry about. Verbal diarrhea during project review meetings is pretty terrible to sit through.

Call out important points explicitly in email. Use bold fonts, red colour or something similar to call attention to new action items or issues. I’m a fan of the ol’ TL;DR format. A bulleted list containing what I honestly need to know about at the top of an email makes the team more efficient and appreciative.

Project Roles and Responsibilities

I believe that a Project Manager who is expected to lead an Ops project should have a basic understanding of operational concepts. There are varied opinions on this, and some will say that if a PM asks the right questions, it doesn’t matter whether they understand systems or networking. I say that’s hogwash (because I’m old), and that if an Infrastructure PM doesn’t know enough about managing a system with cfengine/puppet/chef/etc, then they can’t guide the conversations, ask the right questions or help work out dependencies. At that point, they’re a Project Coordinator (big difference), and I shouldn’t be expecting them to drive a project on my team’s behalf.

PC/PM/PPM

I believe there’s a difference between a project coordinator, a project manager, and a program manager. Too often, these roles get confused or just plain aren’t defined, which leads to incorrect expectations from those involved in the project. Each organization needs to set their own job descriptions, but here are mine:

Project Coordinator: Project Coordinators are purely tactical. They get told what to do & how to do it, then it just magically gets handled. These are the people I prefer to work with. Part of it is a trust thing. I don’t dole out the responsibility for managing my teams’ resources easily, mostly due to the points I make later in this post. Part of it is that I’ve historically had strong technical leadership on my teams who can manage scoping and resourcing fairly well on their own.

[Technical] Project Manager: A really good project manager understands the resources and deliverables well enough to make suggestions about how to scope a project. They should also have enough knowledge to formulate considered, relevant questions to spur the right discussions. How does this project for X service impact its interaction with Y downstream service? I would not expect a TPM to know the answer, but I would expect a good one to be able to ask the question and grok the answer. I would also expect a solid TPM to call BS on time estimates that are just obviously way off base. The build-out of a new data centre is scoped at 3 days & you have no tools/automation to facilitate that? Dude.

[Technical] Program Manager: I confess to not having worked with too many honest-to-gosh program managers in my career, so I’ll just say that I would expect them to have a larger, more high-level view of really gnarly initiatives spanning multiple organizations, and to manage other project managers or coordinators who would take care of the lower-level responsibilities such as task tracking, updating project plans and keeping track of resources.

Other Project Team Roles

Engineer: Engineers should be focused on technical delivery as much as possible. There are a few other things that need to happen to facilitate that delivery.

  • Be realistic and take your time when asked for estimates on deliverables and milestones. Then pad those estimates, based on the project guidelines.
  • Know your audience. Keep updates less technical when necessary.
  • Learn about escalation – when, why, how. Escalation isn’t always a “bad thing”. Sometimes asking for assistance or feedback is the best thing you can do for yourself, the team and the project. If you’re already overloaded & don’t tell anyone about it, then we’ll continue to assume that you have the time to hit your deadlines or take on new work.
  • Be proactive about relaying progress, blockers and risks. This enables the PM to remove those blockers and escalate issues before they become hair-on-fire scenarios.
  • Provide feedback into the process constructively. Yes, Ops Engineers have that well-understood culture of…. curmudgeon-itis. Let it go, and be productive in your discussions with PMs. They honestly are trying to make your life easier.
  • Be patient: not every PM or customer has perfect technical understanding. If they did, they’d be your engineering team mate.

Manager: First point of escalation. In Operations, managers are typically better-poised to assign and manage resources across multiple projects. Project managers aren’t usually deep enough into the scope of our work load across all projects or the specialized skill sets within the team to address this themselves. We do include the PM in the conversations so that he understands what’s involved in the process.

  • Listen to the PM when he escalates to you. If something doesn’t make sense, then ask the relevant questions to get to the heart of the issue & then use the interaction to refine the escalation processes.
  • Hold engineers accountable for hitting their deliverables.
  • Listen to your engineer when she tells you that she’s overloaded, that a deliverable or milestone is in danger of slipping, or that the team flat-out missed something in scoping exercises. Not listening just sets the team up for either heroic efforts or missed deadlines.

“Decision Maker”: When there are conflicts regarding the direction or prioritization of a project, the buck stops with this person. For some projects, there may be multiple decision makers for technical issues or cross-project prioritization. You are responsible for ensuring you have all of the relevant information to make a considered decision. Make sure you are updated on the progress of the project and any potential conflicts. If you don’t have enough information to make a specific decision, then be vocal and ask as many questions as necessary to make the call quickly.

Define the Escalation Process

This could have been included in the section above, but to me, escalation is such a critical component of managing a major project that it deserves to be called out separately. While you’re defining the key decision makers in the initial meetings referenced above, you might as well define the escalation process, including details about what’s worthy of escalation and what isn’t, the method for escalating an issue, where the buck stops, and how to communicate that an issue has been escalated. For example, depending on your culture, it might not be acceptable to notify the entire project team if one person’s deliverable slips. A portion of this information could (should?) be defined at an org or even company level. Then all you have to do is play fill-in-the-blank with the owners for your specific project.

Defining this up front encourages people to communicate in the most sensible and productive way from the beginning, teaches participants to have open, frank conversations to try to work out issues on their own (no one likes being escalated on), and it saves the ‘punch’ of escalating for things that are legitimately critical. As a manager, there are few things worse than realizing that you’ve missed a critical escalation because it’s gotten lost in the noise of a million other unimportant pings.

Reporting

Stakeholder Updates

Each audience involved in a project requires different types and amounts of communication. I honestly didn’t understand this until I became a manager of managers and had multiple teams and projects to keep track of. As an engineer, I wanted to drink from the proverbial fire hose and have access to every tidbit of information possible. Nowadays, I manage four teams spread across approximately 15 projects at any given time. On average, I have somewhere between 10 and 15 minutes to devote to any one project per day.

Project Team Communication

The most valuable information project team members receive is updated project plans, progress against blockers they’ve raised, any changes to project scope, and changes to project timelines. I’m a fan of the daily scrum, where the team can all get on the same page in 15 minutes or less. It’s efficient when customized to your particular organization. Notes should be sent out after each meeting; simple bullet points should be sufficient.

Executive Summaries

There’s a difference between Executive Summaries and communication meant for a technical project team. If I’m asked to read and sift through a 3-page treatise in order to find the one embedded action item meant for myself or my team, it’s useless and a waste of my time. To boil it down, this is what I would expect:

  1. Is the project red/yellow/green based on the current milestone? If I see green, I need to be able to trust that things are on track and I can safely file it away.
  2. If the project is in jeopardy, give me a bulleted list of the blockers, who owns them, what the ECD (estimated completion date) is, and what I actually need to worry about. I don’t want to see more than two sentences for any one bullet point. Do I need to find another resource? Do I need to step in and help guide things along?
  3. Did we complete a major milestone on time? Did one of my team members go above and beyond to improve the design, help the project team, etc?
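If your updates are generated from a tracking tool, the three points above map fairly naturally onto a small data structure. Here is a minimal sketch in Python, assuming a hypothetical `ExecSummary` record; the class names, field names and output layout are illustrative only and are not taken from any particular tool.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Blocker:
    description: str  # keep this to a sentence or two
    owner: str
    ecd: date  # estimated completion date


@dataclass
class ExecSummary:
    project: str
    status: str  # "red", "yellow" or "green", based on the current milestone
    blockers: list[Blocker] = field(default_factory=list)
    wins: list[str] = field(default_factory=list)  # milestones hit, kudos

    def render(self) -> str:
        if self.status not in ("red", "yellow", "green"):
            raise ValueError(f"unknown status: {self.status}")
        # The status goes first so a green project can be filed away at a glance.
        lines = [f"{self.project}: {self.status.upper()}"]
        # Blockers are only listed when the project is off track.
        if self.status != "green":
            for b in self.blockers:
                lines.append(
                    f"  - {b.description} (owner: {b.owner}, ECD: {b.ecd.isoformat()})"
                )
        for w in self.wins:
            lines.append(f"  + {w}")
        return "\n".join(lines)
```

Calling `render()` on a green project with no wins yields a single line, which is the whole point: the reader should be able to trust the colour and move on.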

General Status Reports

Include positives like hitting milestones. Make the communication clear enough and at the right technical level for the least-technical stakeholder who honestly needs to understand what’s going on with the project. C-level people and managers in organizations not involved in the project might be in that group (it all depends on your own org though, of course).

Basic Ops Questionnaire

It’s always a good idea to make it as simple as possible for PMs or developers to provide the right ops-related information for an internally- or externally-driven project. If you don’t have a formal NPI (new project initiative) process where these questions are codified, a simple document should suffice. We’ve begun using a Google form open to anyone within the company for these types of requests. Some of the questions we expect answers for include:

  • What is the request? X customer would like us to host a few customer service tools in our data centre. We would like to extend Y service to serve global requests, rather than serving requests from just Z service.
  • Who are the main technical and project contacts?
  • Who will be administering the service? If said service is ‘hosted’ rather than an internal one.
  • Will you need authentication for your administrators?
  • Does any data need to be stored locally on the machines? If so, what types?
  • Is there an admin UI or set of management scripts to administer the app? Or just backend processing?
  • Is this a mission critical service? Do you require notification during scheduled/unscheduled events?
  • How will the service be monitored?
  • Do you need server/network redundancy? Honestly, I can’t imagine ever putting something into production that isn’t redundant, but some people just don’t want to pay for it!
  • Are there any specialized hardware requirements for computation, storage, etc?
  • Do you have capacity numbers through the next 12 months? If not, when do you expect to have them?
  • Is integration with other services internal or external to our infrastructure required?
  • What security considerations or concerns do you have, if any?
  • How will you handle code deployments?
  • How will you handle packaging?
  • When do you need the environment(s) in place?
  • Are you prepared to furnish operational run books?

This is by no means an exhaustive list, but it helps drive the proper conversations from the beginning, and it also helps Ops management determine which resources to devote to a particular project.
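If the questionnaire lives in a form or a document, a quick completeness check before the request reaches Ops management can save a round trip. The following is a rough sketch in Python, assuming the answers arrive as a simple dictionary; the keys in `REQUIRED_QUESTIONS` and the `unanswered` helper are hypothetical, and the question subset is only a sample of the list above.

```python
# Hypothetical subset of the questionnaire; the keys are illustrative only.
REQUIRED_QUESTIONS = {
    "request": "What is the request?",
    "contacts": "Who are the main technical and project contacts?",
    "mission_critical": "Is this a mission critical service?",
    "monitoring": "How will the service be monitored?",
    "needed_by": "When do you need the environment(s) in place?",
}


def unanswered(answers: dict[str, str]) -> list[str]:
    """Return the text of each required question that has no non-blank answer."""
    return [
        question
        for key, question in REQUIRED_QUESTIONS.items()
        if not answers.get(key, "").strip()
    ]
```

An empty result means the request is ready for the internal Ops discussion; anything else goes back to the PM or lead developer with the list of open questions.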

Happy Dance! It’s Performance Review Time!!

I love performance review time. People assume I’m crazy when I say that. Oftentimes, as an Individual Contributor (IC), it’s the one time each year when there’s a focus on what you’ve accomplished over the past year. As a manager, it’s incredibly satisfying to look back and realize just how much your team has delivered.

I completely understand that reviews can be stressful. Typically, I get that knot in my stomach when I don’t exactly know where I stand in terms of my performance. Some people freak out because they aren’t being objective about a flaw and don’t see it as an opportunity for improvement. The combination of those can lead to people assuming they won’t receive a raise, they’ll be demoted, maybe even lose their job. It’s a dangerous and unproductive way to approach something so potentially charged as a performance review.

There are many different ways to lighten the emotional and intellectual load of reviews. Just search for how to write a performance appraisal online and you’ll get so much information back that it’s overwhelming. I stick to a few simple things that don’t require much effort but allow me to really enjoy this time of year.

Prepare All Year

Ongoing preparation makes filling out a performance review so much simpler and less stressful. Otherwise, it’s like cramming for an exam in school, except the stakes are your career progression.

  • Keep a record of what you’ve accomplished. I use a ‘fuzzy folder’ in email, both for myself and for my directs, containing highlights (delivering a project or taking a leadership role in some way) as well as constructive feedback and instances where I or my directs have had… challenges. It gives me a balanced retrospective of performance and it makes it much easier to identify and write feedback on significant events from the past year.
  • Insist on goals. Ideally, IC goals will be built around the goals and values of the company and the organization. But even if those aren’t available or well-defined, it’s possible to create meaningful S.M.A.R.T. goals around making your environment better, saving the company money, growing your skill sets or many other categories. As a manager, identify these categories for the IC and have them create their goals so they have more ownership in the successful completion of them. As an IC, gain your manager’s support for the goals you’ve created and then revisit them frequently so you can make sure your performance stays on track.
  • Use your 1:1s wisely. Whether you’re a manager or an IC, use regular 1:1s throughout the year to ensure that there are no surprises during the discussion portion of the review. I use a regular 1:1 doc so I have a record of the discussions throughout the year, and so I can remember my deliverables. My directs are responsible for filling them out and sending them to me the day before our meeting so I have time to come prepared to make it as beneficial a discussion as possible. It really does help- especially if your 1:1s are infrequent.

Providing Written Feedback on Yourself

  • Be as objective and balanced as possible. Remember that no one is perfect, and include both strengths and areas for improvement, even if it isn’t requested. Your manager ought to be gathering feedback from your partners, customers and peers, not all of whom will have purely positive things to say. It’s a pretty uncomfortable discussion when an IC’s review is completely glowing; everyone has blind spots or holes they need to fill. Your forward-looking goals and development plan also won’t support and drive your growth as much as they should if your ego gets in the way.
  • Don’t short-change yourself. Providing balanced feedback also means that you should cover the positive aspects of your year. Major deliverables, times when you showed leadership, went above-and-beyond… they should all make it into your review. I strive for a 60/40 split between positive and constructive points, both when I’m reviewing myself and my directs. It’s always nice to look back at your review and read about what you’ve done right every now and again.
  • Be quantitative in support of subjective topics. Any time you address a ‘soft’ skill (communication, organization, etc), make sure you include examples in support of your claims. Metrics are the best (can’t argue with numbers!), but if that’s not possible it’s still better to give slightly anecdotal evidence that can at least be followed up on.
  • Take your time. Provide as much feedback about your performance as you feel necessary. You’re selling yourself through your review, and the amount of time and thought you put into it directly reflects on the ownership you have in your career. Or at least that’s how I feel about it. 🙂 Even with the ongoing preparation above, I typically spend about 6 hours writing my self review- about an hour per page.
  • Stay focused. Agree on the message you would like to convey in each paragraph, section, etc. and then stick to it. It makes it easier to give feedback as a manager, and it helps guide the discussion of the review itself.

Providing Written Feedback on Your Directs

  • Avoid matching length. As a manager, receiving a self review that’s just a few sentences long is depressing. We all want our people to show that they’re invested in their careers, of course. If an IC submits a short review, take the opportunity to set an example by providing more feedback. If you have the opportunity, pass the review back to the person and ask them to put a bit more thought into it and re-submit it.
  • Avoid matching tone. Be objective to ward off being emotional in reviews. Most people invest a lot of themselves into their careers, and people who have had a challenging year can become fairly defensive, or even offensive. Keep the tone of your responses consistent and non-confrontational so you can focus on improvement. Use facts and anonymous quotes from peer and customer feedback to support your message.
  • Minimize the surprises. You should be using regular 1:1s to ensure that there are no surprises come review time. Sometimes, however, you may receive new constructive feedback from someone from out of the blue. If that’s the case, make sure you state that in your write-up so that it’s on record.
  • Be firm and direct. Make decisive statements, and stand behind them. As a manager, you’re responsible for making decisions every day, and your credibility hinges on not being wishy-washy. Treat performance reviews the same way you treat the rest of your job. If you have the proper supporting feedback, this should be fairly easy. (also, see below, “Avoid mixed messages”)
  • Solicit balanced feedback. Choose peers/partners who will give a well-rounded view of the IC. Managers don’t see everything that goes on day-to-day, and supporting evidence from external parties helps round out feedback. Even short five-minute ‘interviews’ with people around the office are a great way to gather this input. Just be sure to send your write-up to the interviewee for approval prior to using an anonymous version in a review.

Discussing the Review

First off, I hate it when managers approach this as “delivering” a review. It automatically puts the manager in the driver’s seat and doesn’t breed collaboration or offer the IC the opportunity to own their own career. Running through a performance review ought to be a discussion. The points below apply to both ICs and managers and can promote a dialogue, rather than one person’s stream of consciousness.

  • Be open-minded. Be open to admitting you may have missed something and to changing your mind. You should still be prepared to ‘agree to disagree’ on one or two points, but make sure that you’re really listening during the conversation and that you’re being objective about the issue before accepting an impasse.
  • Be rational. Take your time and think before you speak. Don’t let the discussion devolve into an emotional match. Feel free to just say, “I feel like this is getting rather [defensive/heated/etc]. Do you mind if we come back to this part once we’ve both had time to cool down and consider it?” Just make sure that you take the time to revisit the discussion, rather than leaving it unresolved.
  • Discuss the facts. Focus on actual events whenever possible. This goes hand-in-hand with being rational. There are times when this isn’t feasible, such as reviewing 3rd-party constructive feedback or evaluating performance around the softer skills (“plays well with others”). If you can isolate those instances that could be touchy, the chat will be much less stressful. Supporting evidence for those discussions will help attenuate the situation too.
  • Avoid mixed messages. It’s a common management failing to deliver criticism wrapped in happy thoughts. We’ve all done it. As a manager, concentrate on delivering constructive feedback in a straightforward manner so everyone is on the same page. Avoid the ol’ “You’re doing great! You could really improve your communication skills, but you’ve definitely delivered some great stuff.” Either as an IC or a manager, ask clarifying questions if you’re unsure about the message being conveyed so you can reach a common understanding.
  • Allow time to digest the feedback. Just as a manager expects an IC’s review to be submitted well before the 1:1 discussion, an IC should expect to have some time to view and digest their manager’s feedback prior to accepting it. If you run into contentious places during the discussion, it may make sense to schedule another follow-up conversation a day or two later just to ensure that both people are ready to move on and focus on the year ahead. If you’re an IC, you should request this if it’s not volunteered and you feel that it’s warranted.

IT Change Management

No IT Ops process incites eye-rolling more quickly or more often than IT Change Management (CM), which is why this post has taken me almost 3 months to finish off. I’ve been shuddering at the thought of opening up that can o’ worms. But it’s good for the business and good for engineers, so it’s gotta be done. Attaching your name to changes made in the environment isn’t a bad thing. In fact, a CM process can save your bacon (mmmm… bacon…) if you’ve been diligent.

The basic purposes of a CM process are to

  • enable engineers to manage change within the environment while minimizing impact
  • provide notification and documentation for potentially-affected partners/customers, as well as peer/dependent groups who may need to troubleshoot behind you
  • furnish checks and balances- an opportunity to have multiple sets of eyes looking at potentially-impacting changes to the environment

The amount of change in your architecture escalates as the environment becomes larger and more distributed. As the number of services in your stack increases, the dependencies between services may evolve into a convoluted morass, making it difficult to diagnose major issues when they occur. Tracking changes in a distributed environment is a critical factor in reducing MTTD/MTTR. Implementing a CM process and tool set is a great way to enable the necessary visibility into your architecture.

Btw, don’t ever delay resolving a legitimate business-critical issue because of CM approval. Just do it and ask for approval afterward.

I don’t like CRBs

I really don’t have much to say about this. I just wanted to get it in writing up front. I don’t see a lot of value in “Change Review Boards”. It’s my opinion that every org within the company ought to own their own changes and the consequences of any failures stemming from them. Management was hired/promoted because they’re supposed to be able to run their own discrete units well. They ought to be trusted to do that without interference by someone who has no idea about the architecture or the technical facets of a change. Customer approvals (internal, mostly- external where appropriate) and devotion to constant and thorough communication can circumvent any perceived need for centralized oversight for changes. Avoiding a CRB element also allowed us to move much faster, which is something almost every company craves and appreciates.

Why you need CM

If you’re reading this post, then hopefully you already have a passing notion of why CM is a critical component of a truly successful operation. Just prior to rolling out our new CM process and tool set at Amazon, we were besieged with outages, with the top three root causes being botched software deployments, roll-outs of incorrect configuration, and plain ol’ human error. Our architecture was fairly sizeable at that point, and we needed better communication and coordination regarding changes to it. While I obviously can’t provide hard stats, I can say that over the first three years with the new process, the number of outage minutes attributed to fallout from planned changes in the architecture was reduced by more than 50% while the architecture continued to grow and become more complex. Our CM process contributed mightily to this reduction, along with updated tooling and visibility.

Here are just a few discrete points about why you need CM (I’m sure there are a ton more I haven’t included).

  • Event Management. Okay, this is a very Ops-manager-focused point, I’ll admit. When you change stuff in the environment without us knowing about it, we tend to get a little testy, as do the business and your customers. MTTR lengthens substantially when you have to figure out what’s changed in order to help identify root cause. Controlling the number of changes in a dynamic environment can significantly reduce the number of “compound outages” you experience. These are some of the most difficult outages to diagnose, and therefore some of the longest-lived. (Deploying new code at the same time as a major network change? tsk tsk…. Which change actually triggered the event? Which event is a potential root cause?) You’ve probably been in a situation where a site or service is down and the triggering events and/or root causes are unknown. One of the first questions to ask during an event is, “what’s changed in the environment in the past X time period?”. Pinpointing that without a CM process and accompanying tool set can be nigh impossible, depending on the size/complexity of the environment and scope of your current monitoring and auditing mechanisms.
  • Coordination/Control. Controlled roll-outs of projects and products address quite a few critical points. In a smaller company, this is even more important, as the resources to support a launch are typically minimal. In any sized company, too many potentially-impacting changes in the environment at the same time is a recipe for disaster (dependencies may change in multiple ways, killing one or more launches, etc). Reducing the amount of change in your environment during high-visibility launches will help the company maintain as much stability as possible while the CEO announces your newest-greatest-thing-EVAR to the press. A little bit of control honestly isn’t a bad thing. I’ve never understood why the word ‘control’ has such a negative connotation. Must be why I’m an Ops manager.
  • Compliance. I’ve learned a bit by rolling through SOX and SAS70 compliance exercises. You need a mechanism to audit every single change to critical pieces of your architecture. Working out all of the bugs in the process prior to being required to adhere to these types of audits is definitely preferable. Granted, you may enjoy some leeway in your first audit to get your house in order, but why waste time & create a fire by procrastinating?
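
The “what’s changed in the environment in the past X time period?” question from the Event Management point boils down to a time-window filter over change records. Here is a minimal sketch in Python, assuming a hypothetical ChangeRecord shape; the field names and sample data are made up for illustration and don’t come from any particular CM tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeRecord:
    # Hypothetical fields; a real CM tool will have its own schema.
    change_id: str
    service: str
    started_at: datetime
    summary: str


def recent_changes(records, event_time, lookback):
    """Return changes that started within `lookback` before `event_time`, newest first."""
    window_start = event_time - lookback
    hits = [r for r in records if window_start <= r.started_at <= event_time]
    return sorted(hits, key=lambda r: r.started_at, reverse=True)


# Invented sample data, for illustration only.
records = [
    ChangeRecord("CM-1", "network", datetime(2011, 6, 1, 9, 0), "routing update"),
    ChangeRecord("CM-2", "website", datetime(2011, 6, 1, 13, 30), "code deployment"),
    ChangeRecord("CM-3", "payments", datetime(2011, 5, 30, 22, 0), "config push"),
]

outage_began = datetime(2011, 6, 1, 14, 0)
suspects = recent_changes(records, outage_began, timedelta(hours=6))
print([r.change_id for r in suspects])  # ['CM-2', 'CM-1']
```

Sorting newest-first is a judgment call: the most recent change is often the first suspect, but in a compound outage you’d still want to walk the whole list.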

9 Basics

These nine points can make a huge difference in ensuring the successful adoption of something as potentially invasive to a company’s work flow as this process might be.

  • Automation. Anything to do with a new process that can be automated or scripted should be. This includes the tool set that enables the CM process itself. If you can create a template for common changes, do it. Take it a step further and build ‘1-click’ submission of change records. Move your change records through the approval hierarchy automatically so approvers don’t have to do so manually. There are myriad ways to streamline a process like CM to save engineer time and effort.
  • Automation, part deux. This time I’m talking about carrying out the changes themselves. I know that there are varying schools of thought on this, and I’m by no means saying that automation cures all evils. But automation does reduce human error- especially when you’re completing tasks such as changing network routing or updating system configs across a fleet. The less chance for human error, the better. If you’ve watched Jon Jenkins’ talk on Amazon’s Operational efficiencies, you know that automation allows their developers to complete a deployment every 11.6 seconds to potentially thousands of machines with an outage rate of 0.0001%. Trust me- before we had Apollo, Ops spent a lot of time running conference calls for outages stemming from bad deployments.
  • Tiered approvals work. Not every single change in the environment requires a change record, but every change must be evaluated to ensure the proper coverage. Critical changes in the platform or infrastructure which have the potential to impact the customer experience just ought to require more oversight. As a shareholder and a customer, I know I appreciated the fact that we had multiple levels of reviews (peer/technical, management, internal customer) to catch everything from technical mistakes in the plan to pure timing issues (making a large-scale change to the network in the middle of the day? dude.) There are also many changes which have zero impact and which shouldn’t require numerous sets of eyes on them prior to carrying them out. Completing the 99th instance of the same highly-automated change which hasn’t caused an event of any kind in the last X months? Forgoing approvals seems appropriate. See “The Matrix” below for more information.
  • Err on the side of caution. This doesn’t necessarily require moving more slowly, but it’s a possibility. For changes that could potentially impact the customer experience, a slight delay may prevent a serious outage for your site/service. If you’re unsure whether your potentially-invasive change might conflict with another one that’s already scheduled, then delay it until the next ‘outage window’. Not 100% sure that the syntax on that command to make a wholesale change to your routing infrastructure is correct? Wait until you can get a second set of eyes on it. You’d much rather wait a day or two than cause an outage and subject yourself to outage concalls, post mortems, and ‘availability meetings’, guaranteed.
  • Trust. Reinforce that implementing a CM process has nothing to do with whether or not your engineers are trusted by the company. You hired them because they’re smart and trustworthy. It’s all about making sure that you preserve the customer experience and that you’re aware of everything that’s changing in your environment. Most engineers are pretty over-subscribed. Mistakes happen, and it’s everyone’s job to guard against them if at all possible. The process will just help you do that.
  • Hold “Dog & Pony Shows”. Our new CM process required many updates to most of our major service owner groups’ work flows. It wasn’t just about learning a new tool. We had new standards for managing a ‘tier 1’ service. When the time came to roll out the new process company-wide, we scheduled myriad training sessions across buildings and groups. We tracked attendance & ‘rewarded’ attendees with an icon to display on their intranet profile. This also provided us a way of knowing who was qualified to submit/perform potentially-impacting changes without having to look at roll-call sheets during an event. I always left room for Q&A, and re-built the presentation deck after each session to cover any questions that popped up. We received some fabulous feedback from engineers while initially defining the process, but the most valuable input we collected was after we were able to walk through the entire process and tool set in a room full of never-bashful developers.
  • Awesome tools teams are awesome. Build a special relationship with the team who owns developing and maintaining the tool set that supports your process. A tools team that takes the time to understand the process and how it applies to the various and disparate teams who might use the tools makes all the difference. Quick turn-around times on feature requests, especially at the beginning of the roll-out, will allow you to continue the momentum you’ve created and will show that you’re 1) listening to feedback and 2) can and will act on the feedback.
  • Be explicit. Be as explicit as possible when documenting the process. Don’t leave room for doubt – you don’t want engineers to waste time trying to interpret the rules when they ought to be concentrating on ensuring that the steps and timeline are accurate. When it doesn’t make sense to be dictatorial, provide guidelines and examples at the very least.
  • Incremental roll-out. I always recommend an incremental roll-out for any new and potentially-invasive process. Doing so allows for concentration on a few key deliverables at any given time, easing users into the process gradually while using quick wins to gain their support, gathering feedback before, during and after the initial implementation, and measuring the efficacy of the program in a controlled fashion. Throwing a full process out into the wild to “see what sticks to the wall” isn’t efficient, nor does it instill user confidence in the process itself. In startup cultures, that might work for software development, but avoid asking engineers and managers to jump through untested process hoops while they’re expected to be agile.

The Matrix

I’m a firm believer in the flexibility of a stratified approach to CM. Not every single type of change needs a full review, 500 levels of approvals, etc. We as an organization (Amz Infrastructure) put a lot of thought into the levels and types of approvals required for each specific type of change- especially in the Networking space, where errors have the potential to cause widespread, long-lasting customer impact. We analyzed months of change records and high-impact tickets, and we took a good hard look at our tool set while coming up with a document that covered any exceptions to the “all network changes are tier-1, require three levels of approval and at least 5 business days’ notice” definition. Here’s a sanitized version of a matrixed approach:

Example CM Stratification

We set up a very simple process for adding new changes to the “exception list”. Engineers just sent their manager a message (and cc’d me) with the type of change they were nominating, the level of scrutiny they recommended and a brief justification. It was usually 3-4 sentences long. Then there’d be a brief discussion between the manager and me to make sure we were copacetic before adding it to the CM process document for their particular team. Last step was communicating that to the relevant team and clearing up any questions – typically in their weekly meeting. Voila!
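
The stratified approach and the exception list can be sketched as a lookup with a strict default. This is only an illustration in Python: the default mirrors the “three levels of approval and at least 5 business days’ notice” definition above, but the exception entries, change-type names and function names are all hypothetical and are not the original matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scrutiny:
    approval_levels: int       # how many review levels must sign off
    notice_business_days: int  # minimum notice before the change starts


# Default from the definition above: tier-1, three levels, 5 business days.
DEFAULT = Scrutiny(approval_levels=3, notice_business_days=5)

# Hypothetical exception list; the change types and values are invented
# for illustration and are not the original matrix.
EXCEPTIONS = {
    "add-vlan-to-access-port": Scrutiny(approval_levels=1, notice_business_days=2),
    "automated-acl-refresh": Scrutiny(approval_levels=0, notice_business_days=0),
}


def required_scrutiny(change_type):
    """Look up the scrutiny for a change type, falling back to the strict default."""
    return EXCEPTIONS.get(change_type, DEFAULT)


def nominate_exception(change_type, scrutiny, justification):
    """Add a nominated change type to the exception list; a justification is required."""
    if not justification.strip():
        raise ValueError("a brief justification is required")
    EXCEPTIONS[change_type] = scrutiny


print(required_scrutiny("core-router-upgrade"))    # falls back to DEFAULT
print(required_scrutiny("automated-acl-refresh"))  # uses the exception entry
```

Anything not explicitly on the list gets the strictest treatment, which matches the “err on the side of caution” basic: a forgotten entry costs some waiting time rather than an outage.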

For approvers

We created guidelines and checklists for reviewers and approvers for the ‘soft’ aspects of change records that weren’t immediately apparent by simply reading the document. We trusted the people involved in the approval process to use their own solid judgement where appropriate, since no two situations or changes are the same. Here are a few of the more major guidelines that I remember; each organization/environment combination will require their own set, of course.

  • Timing of submission. Our policy was to accept after-the-fact changes for sev1 tickets, and Emergent CMs for some sev2 tickets (see below, “Change Record”). Using inappropriately-defined sev2 tickets to circumvent the process was obviously grounds for rejection/rescheduling. The same applied to Emergent changes due to lack of proper project planning, which are rarely worthy of the emergent label.
  • Level of engineer. Ensure that the person responsible for the technical (peer) review owns the correct expertise (product or architectural), and that the technician of the change is of the proper level for the breadth and risk involved. Assuming that a junior engineer can make large architectural changes and then have the necessary competencies to troubleshoot any major fallout most likely won’t set them – or your customers – up for success.
  • Rejecting change records. We provided a few guidelines for gracefully rejecting a CM, including giving proper feedback. For example, rather than saying, “your business justification sucks”, you might say, “it’s unclear how this change provides benefit to the business”, or “what will happen if the change doesn’t happen?” (which were both questions included in our CM form).
  • Outage windows. Unless your change system enforces pre-defined outage windows, you’ll need to review the duration of the change to ensure that it complies. If a change bumps up against the window, you might want to ask the technician about the likelihood that the activity will run long, and request that that information be both added to the change record and communicated to affected customers.
  • Timeliness of approvals. This is more of a housekeeping tip, but still important. Engineers expend a lot of time and energy planning their changes, so the least the approvers can do is be timely with their reviews. Not only is it courteous, it helps the team hit the right notification period, the engineer doesn’t need to spend even more time coordinating with customers to reschedule, and the remainder of your change schedule doesn’t have to be pushed back to accommodate the delay.

Audits/Reporting

This was the biggest pain in my arse for months, I have to say- about four hours every Sunday in preparation for our org’s Metrics meetings during the week. We had expended so much effort in defining the process, as well as educating customers and engineers, and our teams had made a huge mind shift regarding their day-to-day work flow. We absolutely had to be able to report back on how much improvement we were seeing from those efforts. We measured outages triggered by Change events, adherence to the process, and quality of change records. Most of our focus was on quality, as we knew that quality preparation would lead to fewer issues carrying out the actual changes.

Completing an audit for each of the seven teams in Infrastructure entailed reviewing the quality of information provided in 8 separate fields (see below, ‘The Change Record’) for every Infra change record submitted (typically around 100-125 records/week). Steps outside of just the quality of information provided included comparing against each team’s exception list to ensure the proper due diligence had occurred, comparing timestamps to audit the notice period, and examining whether the proper customers had been notified of and approved the change.

Sure wish I had an example of one of the graphs on CM that we added to the weekly Infrastructure metrics deck. They were my favourites. 🙂

Over the first 8 weeks of tracking, our teams increased their quality scores by more than 100% (some teams had negative scores when we began). Outage minutes attributed to networking decreased by approximately 30% within the first 6 months. We also had coverage and tracking for changes made by a couple of teams which had previously never submitted change records, including an automation team which owned tier-1 support tools.

Notification, aka “Avalanche of Email”

To be perfectly frank, we never really figured out how to completely combat this. We did build a calendar that was easy to read and readily-available. We also had an alternate mechanism for getting at that same information if a large-scale event occurred and the main GUI wasn’t reachable, which is typically when you need a CM calendar the most. Targeted notification lists did help. For example, each service might have a ‘$SERVICE-change-notify@’ list (or some variant) for receiving change records related to one particular service. Over-notification is a tough challenge- especially when there are thousands of changes submitted each day in the environment. If anyone has a good solution, I’d love to hear about it!

The Change Record

Yes, it took some time for an engineer to complete a change record perfectly- especially for ‘tier-1 services’, which necessitated more thorough information. Our first version of the form did include auto-completion for information specific to the submitter and technician. We also added questions into the free-text fields within the CM form to draw out the required information to prevent the back-and-forth between the submitter and approvers which might have resulted. ‘V2’ provided the ability to create templates based on specific fields, which saved our engineers quite a bit of time per record.

Here are some of the more important fields that ought to be added to a change form. They don’t comprise all of the input required- just the major points.

  • Tiers/Levels. Most environments do have various ‘tiers’, or levels of importance to the health of the site/service the company is providing. For example, if you’re a commerce site, chances are your Payments platform is held to a 5-9’s type of availability figure. These services ought to be held to a very high standard when it comes to touching the environment. On the flip side, a service such as Recommendations may not be as important to the base customer experience and therefore might not need to be held to such tight requirements. Grab your stakeholders (including a good cross-section of end users of the process) to define these tiers up front.
  • Start/End Time. This kind of goes without saying. It’s the field that should be polled when building an automated change calendar or when people are attempting to not trample on each other’s changes. Once the dust has settled, you can refine this to include fields for Scheduled Start/End and Actual Start/End Time. This will allow gathering more refined metrics about how long changes actually do take to complete, as well as how well teams adhere to their schedules. Setting the Actual Start time would move the change into a ‘Work in Progress’ state and send notification that the change had started. Setting the Actual End would move the record to the Resolved state.
  • Business Impact. Since not everyone viewing a change was able to glean whether their service or site would be impacted, we provided engineers with drop-down selections for broad options such as ‘one or more customer-facing sites impacted’ or ‘only internal sites impacted’. We followed that with a free-text field with questions that would draw out more details about actual impact. The answers were based on “worst-case scenario” (see my point above about erring on the side of caution), but engineers typically added a phrase such as ‘highly unlikely’ where warranted to quell any unwarranted fears from customers, reviewers and approvers.
  • Emergent/Non-Emergent. This was just a simple drop-down box. Any change record which hadn’t been fully approved 48 hours prior to the Scheduled Start time (when the record appeared on the CM schedule and the general populace was notified) was marked as Emergent, which garnered closer attention and review. This did not include after-the-fact change records submitted in support of high-severity issues. It was a simple way to audit and gather metrics, and it also offered customers and senior management a quick way to see high-priority, must-have changes.
  • Timeline. This should be an explicit, step-by-step process, including exact commands, hostnames, and environments. Start the timeline at 00:00 to make it simpler. Scheduled start times can change multiple times depending on scheduling, and having to adjust this section every time is a pain. Timelines must always include a monitoring step before, during and after the change to ensure that the service isn’t behaving oddly prior to the change, that you haven’t caused an outage condition during the change (unless it’s expected) and that the environment has recovered after the work is complete. If you have a front-line Ops team who can help you monitor, that’s a bonus! Just don’t sign them up for the work without consulting them first.
  • Rollback Plan. The rollback plan must also be an explicit, step-by-step process. Using “repeat the timeline in reverse” isn’t sufficient if someone else unfamiliar with your change is on-call and must roll it back at 4am two days after the change. Include exact commands in the plan and call out any gotchas in-line. And remember to add a post-change monitoring step.
  • Approvals. We opted for four types of approvals to allow focus on the most important facets of the process. Over time, we utilized stratification to dial back the involvement required of our management team and the inherent delays that came along with that. Every level of approver had the ability to reject a change record, setting it to a Rejected state and assigning it back to the submitter of the change record for updates.
    • Peer review. Our peer reviewers typically focused on the technical aspects of the change, which included ensuring that the timeline and roll-back plans covered all necessary steps in the proper order, and that pre- and post-change monitoring steps existed.
    • Manager review. Managers typically audited all of the ‘administrative’ information such as proper customer approval, overlap with other critical changes already scheduled, and that the verbiage in the fields (especially business impact) was easily understood by the wider, non-technical audience.
    • VP review. High-risk, high-visibility changes were usually reviewed by the VP or an approved delegate. VPs typically concentrated on the potential for wider impact, such as interference with planned launches. They were the last step in the approval process and had final say on postponing critical changes for various reasons (amount of outage minutes accrued vs risk of change, not enough dialogue with customers/peers on major architectural changes, etc).
    • Customer approval. We dealt with internal customers, typically software development teams, and we worked closely with each of our major customers to define the proper contacts for coordination/approval. Engineers were required to give customers at least 48 hours’ notice to raise questions or objections. In the case of some network changes, we touched most of the company. VP review and approval would cover the customer approval requirement, and we would use our Availability meeting to announce them & discuss with the general service owner community if time permitted.

    None of these roles should be filled by the technician of the change itself. Conflict of interest. 😉

  • Contact information. We required contact information, including page aliases, for the submitter, technician, and the resolver group responsible for supporting any fallout from the change. Standard engagement alias formatting applied. Information for all approvers was also captured in the form.
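
A few of the rules in the fields above are mechanical enough to sketch in code: the 48-hour Emergent cutoff, the state changes triggered by Actual Start/End, rejection back to the submitter, and the rule that the technician can’t fill an approval role. The Python below is a rough sketch under assumptions, not our actual tool: the class, method names and the ‘Submitted’/‘Approved’ state names are hypothetical, and after-the-fact records for high-severity issues aren’t modeled.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

EMERGENT_CUTOFF = timedelta(hours=48)  # from the Emergent/Non-Emergent rule above


@dataclass
class ChangeRecord:
    submitter: str
    technician: str
    scheduled_start: datetime
    state: str = "Submitted"  # hypothetical initial state name
    fully_approved_at: Optional[datetime] = None
    actual_start: Optional[datetime] = None
    actual_end: Optional[datetime] = None
    approvers: list = field(default_factory=list)

    def approve(self, approver, when, final=False):
        # No approval role may be filled by the technician of the change.
        if approver == self.technician:
            raise ValueError("technician cannot approve their own change")
        self.approvers.append(approver)
        if final:
            self.fully_approved_at = when
            self.state = "Approved"  # hypothetical state name

    def reject(self):
        # Any approver can reject; the record goes back to the submitter.
        self.state = "Rejected"
        return self.submitter

    def is_emergent(self, as_of):
        # Emergent: not fully approved at least 48 hours before Scheduled Start.
        # If approval is still pending, measure from `as_of` instead.
        approved = self.fully_approved_at if self.fully_approved_at is not None else as_of
        return self.scheduled_start - approved < EMERGENT_CUTOFF

    def set_actual_start(self, when):
        self.actual_start = when
        self.state = "Work in Progress"

    def set_actual_end(self, when):
        self.actual_end = when
        self.state = "Resolved"


# Invented example: final approval lands only 24 hours before the scheduled start.
cr = ChangeRecord("alice", "bob", scheduled_start=datetime(2011, 6, 10, 2, 0))
cr.approve("carol", datetime(2011, 6, 9, 2, 0), final=True)
print(cr.is_emergent(datetime(2011, 6, 9, 3, 0)))  # True
cr.set_actual_start(datetime(2011, 6, 10, 2, 5))
print(cr.state)  # Work in Progress
```

Keeping Scheduled and Actual times as separate fields is what makes the later metrics possible: how long changes really take, and how well teams stick to their schedules.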

huh? it’s only been two weeks?!?

This post is all about what I’ve learned in my first two weeks as Director of LiveOps at Demonware. The role of a manager should always be to enable the organization to increase the level of production while maintaining sanity and without having to horizontally scale the team. (‘buzzword bingo’, anyone?) In a year, this blog will be filled with examples of how we as a management team accomplished that: all of the challenges, wins, missteps, etc. we’ve made on our way to fulfilling our destiny as the premier Operations team in the gaming industry.

When I joined DW on June 1, I had no idea what to expect. Yes, I’ve been in Ops for longer than I care to admit. But gaming is a fairly foreign world to me – I can watch someone play a game all day, and I fare fairly well with games targeted at 4-year-olds. That’s where my experience in the game industry stops. That being said, here are some of my initial impressions after spending three days in Vancouver with the team & working from home in Seattle (silly work permit process….) for a few more days.

  • Operations is Operations. Yes, the technologies might differ drastically between companies, but the same challenges, issues and solutions exist when trying to enable a high-performing team to ‘level-up’: process, standardization, automation and tooling
  • I’m extremely humbled that Demonware selected me to guide their highly-capable LiveOps team. Seriously.
  • I wonder at the amount of work the company had been able to churn out with such a small but able staff
  • I’m incredibly excited by the positive attitude and collaborative inter- and intra-team spirit. Even the surliest of engineers kick ass and take names
  • I instantly fell in love with my highly-technical, over-taxed, mostly junior management team. I expect that I will learn just as much from them as I will teach them.

Most importantly, I realize that while the amount of work produced by our engineers reflects a very high-performing organization, we’re at a breaking point. The deliverables in the [currently-being-drafted] short- and long-term Operations road maps far outstrip the processes and resources available. More so than any team I’ve managed previously, and I’ve had to deal with some pretty gnarly resource constraints.

State of the DW LiveOps Union

We build and maintain backend services for Activision/Blizzard games such as Call of Duty – services such as leaderboards and matchmaking. (pretty sweet, right?) Our work load is mostly dictated by the road maps of third-party game studios, and while the work is cyclical, not every game requires the same features or infrastructure. Currently, LiveOps is the tail being wagged, with late-binding requests generating a make-or-break race to hit the hard holiday shopping deadlines.

Engineering and Operations were both re-structured just a few months ago to better reflect the work load. This seems to have gone well for the SDE world, where structures based on services make a lot of sense. We’re still working through the transition in Operations- these exercises typically take much longer to shake out in our more interrupt-driven, diverse realm.

We’re very, very lucky to have fantastic support from DW senior management. (and I’m not just saying that because my boss will most likely be reading this post at some point) It’s only been two weeks, but I feel a ‘mind meld’ coming on, and that’s only happened one other time in my career. Our management understands the value that a world-class Operations team provides to the company. It’s a rare occurrence, in my experience, and I plan to take full advantage of it. 🙂

LiveOps is a technically high-performing team, and…. entertaining. It’s filled with some of the most driven, intelligent and open engineers I’ve worked with. The company has done a fantastic job of hiring for culture as well as technical skill, and that really does make all the difference. Prima donnas can suck the life out of an Ops team.

We’re just beginning to think about Scale-with-a-capital-S. It’s a rare and exciting time in the life of an adolescent company. I thank my lucky stars that I’ve been fortunate enough to experience scaling challenges and seen some amazing solutions to them at Amazon and Facebook. I feel like my time at both of those companies was the best prep I could have ever had for the challenges we’re now facing.

My Dirty Little Assessment

First off, I can’t give enough credit to The First 90 Days for providing me a solid framework for approaching the assessment of my new organization. I’m learning to take my time to focus on observing and building relationships, rather than jumping in and making lightly-considered/rash decisions just to try to make my mark. The book’s common sense is forcing me to focus on defining a few quick-strike wins to build momentum and credibility. If you’re ever faced with transitioning into a new role, read this. It’s bible-worthy IMO, even though none of the concepts are particularly foreign. Now on to what I think I might be blogging about over the next year…

Have I Mentioned We’re Hiring?

Hiring is one of our top priorities. First of all, we have a great recruiting team, and the people who Demonware has hired are fabulous. Just like Amazon and FB, we’ve placed just as much emphasis on culture fit as technical acumen. Like it or not, the work doesn’t stop coming in just because we’re being selective in our hiring process though. To help fill our roles more quickly, we’ll be re-factoring job descriptions, and working with recruiting on updating our processes to include base technical pre-screen questions (to save our phone screeners time and headaches), more timely and descriptive feedback, and using our engineers’ penchant for social networking to get the word out.

“Traditional” Ops Processes

Demonware is just coming out of their startup phase, and it seems that a common denominator in companies at this stage in their progression is lack of mature processes (makes sense). We actually have a great start- it’s all about streamlining and improving upon what we already have. Process should be an enabler, not a hindrance. People who balk at this idea or think that ‘process’ is a four-letter word obviously haven’t seen it implemented the right way. Just sayin’. Here are a couple of deliverables that we’ve talked about as a management team that are on my personal road map:

  • Event Management: We already have a decent (not perfect) Event Management process documented, and we follow it most of the time. We also have a fantastic start on an incredible tool set that covers the basics of notification and engagement. The information we need exists, but we still need to tie it all together. We also need to remove more of the human element in the process (notice I said we follow it most of the time, just like most other shops). In the middle of an event, engineers just want to fix the issue, rather than concentrating on following the process. And, of course, we could always tighten our post-event actions to ensure that we’re lengthening MTBF.
    These are important things to address, but the most important deliverable for this point is the ability to measure the effectiveness of the process (MTTD, MTTR, MTBF). We honestly won’t know how to take this a step farther until we know how we’re currently doing.
  • Change Management: We’re in the same boat with CM as we are with Event Management. Good process that’s well-documented, but no way to measure the effectiveness of it, the time spent per change, number of planned vs. emergent changes, or a solid way to track customer impact/fallout programmatically. This isn’t to say that we don’t pay attention to this- we definitely do. We just need to make it much easier to get at the data we need quickly, and we need to build on that data to improve upon our susceptibility to fallout.
  • Monitoring/Alerting: We monitor A LOT of stuff, and we have the basics covered pretty well. The next step is to refine our monitoring configurations to pare down the noise. We must be able to definitively say that yes, we’re monitoring the right stuff at the proper thresholds, that the correct personnel are notified for the right alarms, and that we’re able to measure our effectiveness at reducing the number of alarms through everything from code re-factoring to architecture standardization.
  • Operational Acceptance (OAC): Ops teams routinely complain about stuff being ‘thrown over the fence’ for them to support. OAC is a great way to ensure that before the team signs off on a new support request, it’s actually supportable. Providing a well-designed OAC checklist to customers will not only address that, but it will oftentimes spawn different design decisions that will make a service/stack more extensible and reliable. Theo Schlossnagel says it’s about “putting more ops into dev”, rather than the inverse. Can’t argue with Theo, right? 🙂
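
Measuring the effectiveness of Event Management (MTTD, MTTR, MTBF) mostly requires three timestamps per event. Here’s a minimal sketch in Python. The Incident fields and sample data are hypothetical, and the definitions are assumptions to settle with your own team: MTTR here runs from the start of impact to resolution, and MTBF is the mean gap between one event’s resolution and the next one’s start, so it needs at least two events.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    began: datetime     # when impact started
    detected: datetime  # when monitoring or a human noticed
    resolved: datetime  # when impact ended


def mean(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    return mean([i.detected - i.began for i in incidents])


def mttr(incidents):
    # Assumption: measured from start of impact to resolution.
    return mean([i.resolved - i.began for i in incidents])


def mtbf(incidents):
    # Assumption: mean gap between one incident's resolution and the next one's start.
    ordered = sorted(incidents, key=lambda i: i.began)
    gaps = [b.began - a.resolved for a, b in zip(ordered, ordered[1:])]
    return mean(gaps)


# Invented sample data, for illustration only.
incidents = [
    Incident(datetime(2011, 6, 1, 10, 0), datetime(2011, 6, 1, 10, 10), datetime(2011, 6, 1, 11, 0)),
    Incident(datetime(2011, 6, 8, 10, 0), datetime(2011, 6, 8, 10, 20), datetime(2011, 6, 8, 10, 30)),
]

print(mttd(incidents))  # 0:15:00
print(mttr(incidents))  # 0:45:00
print(mtbf(incidents))  # 6 days, 23:00:00
```

Whatever definitions you pick, write them down and keep them stable; the trend over time matters more than the absolute numbers.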

Streamlining

We have to make our own lives simpler. That’s just a given for any Ops team, regardless of how long the team or company has existed or how successful they are. Now that we’re starting to hunker down, we need to begin approaching Operations as a business unit, just like every other organization. It sounds like an awful concept to engineers, but once the framework is in place, those same engineers are grateful that they can depend on the way work flows into and out of the team, there are clear escalation paths, etc.

  • Planning and Prioritization: It’s the same with most Ops teams, but the resounding feedback from our team is that “we never have time to get to the stuff we really need to do”. We need to answer the questions, “what is it that is taking up your time currently?” and “what exactly should we be doing instead, and why?”. Prioritizing work in the Ops world is typically tougher than in the engineering world due to the interrupt-driven, break/fix nature of the role. There’s no reason you can’t just make an “Operational Interrupts” line item in your road map, assign it the proper resource level, and devote the remainder of the team’s time toward the projects which pop the stack in terms of business value.
  • Communication/Partnering: The more of a partnership you can cultivate with engineering and senior management, the easier it gets. We already work well with both sets of customers, but this will always be a focus for us. Reviewing road maps and priorities to make sure we’re all on the same page, participating in design reviews (so that Ops has a seat at the table before a service launches), and consistently setting and resetting expectations will all make our lives easier as Ops personnel.

Event Management

Something blew up in your infrastructure and you have no idea what’s wrong or where to even start looking. A large percentage of your customer base is impacted, and the company is hemorrhaging money every minute the episode continues. Your senior management team is screaming in both ears, and your engineers are floundering in your peripheral vision, trying to find root cause. Sound familiar?

True Ops folks tend to thrive in this type of environment, but businesses don’t. And engineers, regardless of whether they write software or maintain systems & networks, hate seeing the same events over and over again. Managing these events doesn’t just last for the duration of the event itself. To do it right, it takes copious amounts of training, automation, process innovation, consistency and follow-through. This is my ‘take’ on how to go about rolling out a new process.

This may seem like a lot of overhead (it’s a lot of words), but the process itself is actually pretty simple. The effort is really in making the right process designs up front and in creating the proper tooling & training/drilling around it. It’s a very iterative process; it took well longer than a year to solidify it, and we were constantly re-factoring it as we learned more about our evolving architecture. Most of what’s described below is for Impact 1 events (site outages, etc) and doesn’t necessarily apply to run-of-the-mill or day-to-day requests (over-rotating burns people out and diminishes the importance of those major events). Not all of this applies to a small, 20-person company either, although the responsibilities contained in the ‘During an Event’ section will apply to almost any sized team or event. Perhaps you’ll need to combine roles or re-distribute responsibilities depending on the size of the team or event, but the process itself is pretty extensible. The examples follow distributed websites, since it’s what I know, but the concepts themselves ought to apply to other architectures and businesses. (I also assume you’re running conference calls, but the same applies if you run your events over IRC, skype, etc).

Culture Shift

If you’re one of the few lucky people who work in a company where IT Operations garners as much attention and love as launching new features/products, then we’re all jealous of you. 🙂 Engineers and business people alike would absolutely love to have 100 percent of the company’s time focused on innovation. In my experience, any time I mention ‘process’, I receive looks of horror, dread and anger from engineering, including management. The knee-jerk reaction is to assume that a new procedure will only create more delay or will divert precious time from what ‘truly matters’. Taking a measured approach to dispelling those rumors will pave the way to a successful roll out. It just takes a lot of discussion, supporting metrics, the ability to translate those metrics into meaningful improvement to the bottom line, a considered plan, and the willingness to partner with people rather than being prescriptive about it.

  • Act like a consultant. Even if you’re a full-time employee who’s ‘grown up’ in an organization, you should begin with a consultant mindset so you can objectively take stock of your current environment, solicit objective feedback, and define solid requirements based on your learnings. This can be difficult when you’re swimming (drowning?) in the issues, and gathering input from people who are participants but not owners of the process will help immensely.
  • Use metrics. You have to know the current state of affairs before diving headlong into improvements or prioritizing the deliverables in your project. If you don’t have a ticketing system or feature-rich monitoring system from which to gather metrics programmatically, then use a stop watch to codify the time it takes to run through each step of the current process. If all you have is anecdotal evidence to reference initially, then so be it. And if that’s the case, gaining visibility into the process should be at the top of your priorities.
  • Be truly excited. Don’t pay lip service to a change in process, and don’t allow the leaders in your organization to do so either. The minute you sense resistance or hesitation in supporting the effort, intercept it and start a conversation. This is where the numbers come in handy. If the engineers tasked with following a new process are hearing grumblings from managers or co-workers, then it adds unnecessary roadblocks. To be sure, we encountered our fair share of resistance which bred some frustration during our roll-out. But we used the fact that every improvement decreased the number of outage minutes, added to the bottom line and helped with the stock price- even if it was an indirect benefit. That’s something that everyone can and should be excited about.
  • Incremental progress. Not everything included here has to (or can) happen overnight, or even in the first six months. I hate the saying, “done is better than perfect”, but sometimes it actually applies. I’ve included ideas on how to roll most of the process out in an incremental fashion while still getting consistent bang for the buck.
  • Continual refinement. No good process is one-size-fits-all-forever. Keep an open mind when receiving feedback, ensure that the process is extensible enough to morph over time, and continually revisit performance and gather input from participants. Architectures change, and the processes surrounding them must change as well.

Prepping for success

The following deliverables are fundamental to securing a solid Event Management process that’s as streamlined as possible. It will take time to address the majority of the research and work involved, but basing prioritization on the goals of the program and the biggest pain points will allow measurable progress from the outset.

Impact Definitions

You need to know the impact or severity level of the event before you know what process to run. The number of levels may vary, but make sure to decide on a number that is both manageable and covers the majority of issues in your environment. I have to admit that over time, my previous company moved to looking at events as “pageable with a concall” (sev1), “pageable without a call” (sev2) or “non-pageable” (sev3) offenses, rather than adhering to each specific impact definition. This isn’t right or wrong; the behavior reflected our environment. Although each organization is unique, here are some examples to consider:

Impact 1: Outage Condition: Customer-facing service or site is down. Percentage of service fatals breaches a defined threshold (whatever is acceptable to the business).
Sev1 tickets/events follow all processes below and have a very tight SLA to resolve, which triggers auto-escalation up the relevant management chain. The escalation time will depend on the types of events involved in a typical sev1, but we escalated through the management chain aggressively, beginning at 15 minutes after the ticket was submitted. The additional setting of rotating the ticket to the secondary on-call if the ticket isn’t moved to the appropriate next state or updated (thus paging them) should also be fairly tight (i.e., if a ticket isn’t moved from ‘assigned’ to ‘researching’ within 15min, the ticket will auto-reassign to the secondary and page the group manager).
Impact 2: Diminished Functionality: Customer-facing service or site is impaired. Percentage of service fatals breaches a defined threshold (whatever is acceptable to the business).
Sev2 tickets/events will page the correct on-call directly, with a moderately tight SLA to resolve which triggers auto-escalation up the relevant management chain. These tickets will also rotate to the secondary on-call and page the group manager if the ticket isn’t moved to the appropriate next state after the agreed-upon SLA.
Impact 3: Group Productivity Impaired: Tickets in this queue will most likely wind up covering issues that will either become sev1/sev2 if not addressed or are action items stemming from a sev1/sev2 issue. It may also cover a critical tool or function that is down and affecting an entire group’s productivity. These tickets don’t page the on-call, and the SLA to resolve is much more forgiving.
Impact 4: Individual Productivity Impaired/Backlog: This sev level was treated more like a project backlog, and while there are other products that cover bugs and project tasks, I like the idea of having everything related to work load in the same place. It’s simpler to gather metrics and relate backlog tasks to break/fix issues.
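The sev1 rotation rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not a feature of any particular ticketing system: the `Ticket` fields, the state names and the `check_rotation` helper are all hypothetical, and the 15-minute window is simply the example figure quoted above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# Hypothetical window, taken from the 15-minute example in the sev1 description.
ROTATION_WINDOW = timedelta(minutes=15)


@dataclass
class Ticket:
    severity: int
    state: str            # e.g. "assigned" or "researching"
    assignee: str         # "primary" or "secondary" on-call
    last_update: datetime


def check_rotation(ticket: Ticket, now: datetime) -> List[str]:
    """Return who to page if a sev1 ticket has sat in 'assigned' too long."""
    stalled = (
        ticket.severity == 1
        and ticket.state == "assigned"
        and now - ticket.last_update >= ROTATION_WINDOW
    )
    if not stalled:
        return []
    # Rotate to the secondary on-call and page the group manager as well.
    ticket.assignee = "secondary"
    ticket.last_update = now
    return ["secondary", "manager"]
```

If your ticketing system can’t do this for you, a script along these lines run every minute against open sev1 tickets would be one way to approximate it.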

Incremental progress

I will always recommend front-loading the sev1 definition and over-escalating initially. In my mind, it’s much better to page a few extra people in the beginning than it is to lose money because you didn’t have the proper sense of urgency or the correct people for an issue. If you can’t integrate automatic rotation of tickets into your current system, then add it into your checklist and make a conscious decision to watch the time and escalate when necessary.

Tools and Visibility

Tools

It doesn’t take an entire platform of tools to run an event properly, although that certainly does help. The following tools are fairly important, however, so if you have to prioritize efforts in this arena, I’d start here.

  • Ticketing System. A flexible and robust ticketing system is an extremely important part of a solid Event Management process. It’s your main communication method both during and after an event, and it’s a primary source for metrics. If participants in an event are fumbling with the fundamental mechanism for communicating, then they’re not concentrating on diagnosing and fixing the issue. There are many important features to consider, but extensibility, configurability and APIs into the tool are all critical to ensuring that whatever system you choose grows along with your processes and organization.
  • Engagement/Notification System. Ideally this will be tied into your ticketing system. If you have your tickets set up to page a group, then you ought to already have that information in the system. While our first-line support team utilized a full version of a homegrown event management application, we always wanted to provide a pared-down version of the same tool for service owners throughout the company. I certainly hope that’s happened by now, since the more distributed a company becomes, the more difficult it is to locate the right people for cross-functional issues which may not be sev1-worthy.
  • Sev1 Checklist. I’m a big proponent of checklists that can and should be used for every event. In the heat of battle, it’s easy to overlook a step here and there, which can cause more work farther into the event. Building a checklist into an overall Event Management application is a great way to track progress during an event, ensure each important step is covered and inform latecomers to the event of progress without interrupting the flow of the call or the troubleshooting discussions. Separate lists should be created for the front-line ops team, call leaders and resolvers. Each role owns different responsibilities, but everyone must understand the responsibilities of all three roles.

Incremental progress

Ticketing: If your system doesn’t include features such as service-based groups, automatic submission of tickets, reporting/auditing or fundamental search functionality, start investing in either bolstering the current system or migrating to another one. Depending on the scope of that work, beg/borrow/steal development resources to create a hook into the backend data store to pull information related to your specific needs. (this is a grey statement, but every environment has different needs).

Checklists: It’s fine to start small with a binder of blank lists that the team can grab quickly. Anything is better than nothing! Include columns for timestamps, name of the person who completed the action, the actual action and a place for notes at the very least. The facets of an event I would document initially are discovering the issue (goes without saying), cutting the ticket, initial engagement & notification, each subsequent notification, when on-calls join the call/chat, any escalations sent, root cause identified, service restored, and post mortem/action items assigned.
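Once the binder gets old, the same four columns translate directly into a small data structure. Here’s a rough sketch, assuming nothing about your tooling: `ChecklistEntry`, `EventChecklist` and the action names are all made up for illustration. The `minutes_between` helper shows how timestamped entries double as a source of metrics.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ChecklistEntry:
    timestamp: datetime
    who: str
    action: str
    notes: str = ""


@dataclass
class EventChecklist:
    ticket: str
    entries: List[ChecklistEntry] = field(default_factory=list)

    def record(self, who: str, action: str, notes: str = "",
               when: Optional[datetime] = None) -> ChecklistEntry:
        """Append one timestamped row; defaults to the current time."""
        entry = ChecklistEntry(when or datetime.now(), who, action, notes)
        self.entries.append(entry)
        return entry

    def minutes_between(self, first_action: str, second_action: str) -> float:
        """Minutes from the first row matching one action to the first matching another."""
        start = next(e for e in self.entries if e.action == first_action)
        end = next(e for e in self.entries if e.action == second_action)
        return (end.timestamp - start.timestamp).total_seconds() / 60
```

For example, recording "ticket cut" and "service restored" rows gives you time-to-restore for free once the event is over.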

Visibility

  • Monitoring/Alerting. You have to be able to recognize that an event is going on before you can kick off a process. If you’re really good, your monitoring will begin the process for you by auto-cutting a ticket based on a specific alarm and notifying the proper engagement/notification lists. That takes time, of course, but you should be able to build a solid list of alerts around sev1 conditions as you go along- automation like that is rarely built in a day. Almost every post mortem I’ve been in for a high-impact event has included a monitoring action item of this type; if those conversations are happening then you’re bound to have fodder for monitoring and automation. I’ve chatted about monitoring and alerting in a previous post, so I won’t regurgitate it here.
  • Changes in the Environment. Understanding what’s changed in your environment can significantly aid in narrowing the scope of diagnosing and troubleshooting events. Accumulating this data can be a huge task, and visualizing the amount of change within a large distributed, fast-paced, high-growth environment in an easily-digestible format is a bear. The visibility is well worth it, however, so if you don’t have a Change Management system or process, it’s a fantastic deliverable to put on a road map. CM is an entirely separate post though, so I won’t go into it here.

Incremental progress

Changes: Start small by collating data such as software deployments for major services, a simple calendar of Change Management events (heck, even a spreadsheet will suffice in the beginning), and recent high-impact tickets (sev1/sev2). You can migrate into a heads-up type of display once you have the data and understand the right way to present it to provide high value without being overwhelming.

Standardized On-Call Aliases

Once your company has more than one person supporting a product or service, you should create a standardized on-call alias for each group. Adding churn to the process of engaging the proper people to fix an issue by struggling to figure out who to page is unacceptable- especially when the front-line team has a tight SLA to create a ticket with the proper information, host a call and herd the cats. For example, we used a format akin to “page-$SERVICE-primary” to reach the primary on-call for each major service. (page-ordering-primary, page-networking-primary, etc.) Ditto for each team’s management/escalation rotation (page-$SERVICE-escalation). Managers change over time, and groups of managers can rotate through being an escalation contact. As a company grows, a front-line team can’t be expected to remember that John is the new escalation point for ordering issues during a specific event.
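A tiny helper can keep the naming convention consistent wherever aliases are generated. This is a sketch only: the function name and the normalization rule are my own assumptions, and the "secondary" role is extrapolated from the primary/escalation pattern above rather than something we necessarily named that way.

```python
import re

# "primary" and "escalation" come from the examples above; "secondary" is assumed.
ROLES = ("primary", "secondary", "escalation")


def oncall_alias(service: str, role: str = "primary") -> str:
    """Build an alias such as 'page-ordering-primary'."""
    if role not in ROLES:
        raise ValueError(f"unknown role: {role!r}")
    # Normalize the service name so aliases stay predictable.
    slug = re.sub(r"[^a-z0-9]+", "-", service.lower()).strip("-")
    if not slug:
        raise ValueError("service name is empty")
    return f"page-{slug}-{role}"
```

The point is that the front-line team only needs to know the service name and the role, never the person.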

Primary/Secondary/Management Escalation

When a group gets large enough to handle multiple on-call rotations, a secondary on-call rotation should be created for at least a couple of reasons. First, reducing the churn in finding the proper person to engage will decrease the mean time to engage/diagnose. Secondly, pages can be delayed/lost, engineers can sleep through events, etc. If you’re in the middle of a high-impact event, losing money every minute, and restoring service hinges on engaging one person, then you’re in a bad position. Lastly, there are times when an event is just too large for one person to handle. For example, having a backup who can pore through logs while the primary is debugging an application will usually speed up MTTD/MTTR. Less context switching during a high-pressure call is a Good Thing. (see On-Call Responsibilities for expectations of on-calls).

Management escalation should be brought in if the root cause for a major outage lies in their court, if you’re unable to track down their primary or secondary on-call or if the person engaged in the call isn’t making sufficient progress. Managers should help find more resources to help with an issue and should also serve as a liaison between the resolvers ‘on the ground’ fixing the problem and senior management, where necessary. See Manager Responsibilities below.

Engagement vs Notification

There’s a difference between engagement and notification during an event. Engagement is just that- it’s the mechanism for calling in the right guns to diagnose and fix an issue. Notification is a summary of where you’re at in the process of diagnosing/fixing and should be sent to all interested parties, including senior management. Each of those messages should contain different information and each audience group should also be managed differently.

Engagement

It’s my opinion that the list of folks who are engaged in fixing an issue should be controlled fairly tightly, else you risk the ‘peanut gallery’ causing the discussion to veer off track from the end goal of finding and resolving the root cause of the issue. At a previous company, we created engagement groups for each major bucket (ordering, networking, etc) and populated that with a particular set of aliases that would reach the on-calls of the groups typically involved/necessary in that type of event.

Engagement messages should contain ticket number and impact, contact information (conference call number, IRC channel, etc), and a brief description of the issue. If this is an escalation or out-of-band (engaging someone who isn’t on-call), include something to that effect in the message:

Plz join concall 888-888-8888, pin 33333. sev1 #444444, 50% fatal rate in $SERVICE. (John requests you)

Notification

Notification lists should be open subscription for anyone internally, but you should ensure that the right core set of people is on each list (VP of the product, customer service, etc). Even if a service isn’t directly related to the root cause of an issue, up- and downstream dependencies can impact it. Create notification lists for each major service group (networking, etc) so that people can be notified of problems with services that impact them, either directly or indirectly. The frequency of messages sent should be a part of the defined event management process, as should out-of-band notification practices for more sensitive events (communication with PR, legal, etc).

Notifications should include ticket number, brief description of the issue, who is engaged, whether root cause is known, ETA for fix and the current known impact. Be brief but descriptive with the message.

FYI: sev1 #444444, 50% fatal rate in $SERVICE. Networking, SysEng, $SERVICE engaged. Root cause: failed switch in $DATA_CENTER, ETA 20min
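Both message formats are easy to template so that nobody is composing them from scratch at 3am. The sketch below reproduces the two sample messages above; the function and parameter names are hypothetical and you’d adapt the wording to your own environment.

```python
from typing import List


def engagement_message(bridge: str, pin: str, sev: int, ticket: str,
                       summary: str, requested_by: str = "") -> str:
    """Short page asking someone to join the call."""
    msg = f"Plz join concall {bridge}, pin {pin}. sev{sev} #{ticket}, {summary}."
    if requested_by:
        # Flag escalations and out-of-band engagements explicitly.
        msg += f" ({requested_by} requests you)"
    return msg


def notification_message(sev: int, ticket: str, summary: str,
                         engaged: List[str], root_cause: str = "",
                         eta: str = "") -> str:
    """Status summary for interested parties."""
    cause = f"Root cause: {root_cause or 'unknown'}"
    if eta:
        cause += f", ETA {eta}"
    return (f"FYI: sev{sev} #{ticket}, {summary}. "
            f"{', '.join(engaged)} engaged. {cause}")
```

Leaving `root_cause` empty yields "Root cause: unknown", which keeps early notifications honest about what isn’t known yet.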

Incremental progress

Aliases: If you’re just starting out or don’t have an effective list management system, you can begin with a simple document or shared calendar containing who is responsible for each service. You can even go as simple as noting who the subject matter expert and group manager are for each team if the concept of an on-call doesn’t exist yet, then build aliases as you canvass that information. Contacting each team to request that they update the doc when on-call responsibilities change probably won’t be met with much resistance- you can sell it as, “if we know who to page, we won’t page you in the middle of the night”. Engineers should love that. If you utilize a system like IRC, it’s fairly trivial to write bots that will allow ‘checking in’ as an on-call; storing that information in a flat file that can be read by another bot or script to engage them when necessary is a quick solution that doesn’t require navigating to multiple places while spinning up a high-impact call.
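The flat-file idea is about as simple as it sounds. Here’s a minimal sketch of the storage half, leaving out the IRC bot itself; the JSON layout, file name and function names are assumptions for illustration, not what we actually ran.

```python
import json
from pathlib import Path
from typing import Optional


def check_in(path: Path, service: str, nick: str) -> None:
    """Record `nick` as the current on-call for `service`."""
    roster = json.loads(path.read_text()) if path.exists() else {}
    roster[service] = nick
    path.write_text(json.dumps(roster, indent=2, sort_keys=True))


def who_is_oncall(path: Path, service: str) -> Optional[str]:
    """Look up the current on-call, or None if nobody has checked in."""
    if not path.exists():
        return None
    return json.loads(path.read_text()).get(service)
```

A bot would call `check_in` when someone announces themselves, and the engagement script would call `who_is_oncall` before paging. Note that this sketch does no file locking, so it assumes a single writer.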

Engagement: Start with just using a standard template for both engagement and notification to get people used to the proper messaging. If you don’t have a tool, then begin with either email or CLI (mail -s, anyone?), but make sure you add a copy of each message sent to the relevant ticket’s work log so you have a timestamped record of who was contacted. Again, if you don’t have an effective list management solution, create templates (and aliases, if you’re running things from a commandline).

During an Event

Leading an Event/Conference Call

“Call Leaders”. No matter how much process, automation, visibility and tooling you have, there are always those really gnarly events that need management by an Authority of some sort. Appointing a specific group of people who have deep understanding of the overall architecture and who own the proper mentality and clout within the organization to run those events will go a long way toward driving to root cause quickly and efficiently. Call Leaders should not be at the forefront of technical troubleshooting; they’re on the call to maintain order and focus. These people should be well-respected, organized and knowledgeable. They also have to be a tad on the anal-retentive and overbearing side. Call Leaders are tasked with prioritizing, ensuring appropriate escalation occurs, progress is documented in the corresponding channel(s), the correct communication flows to the proper people, resolution of the issue is actually achieved & signed off on, and post-event actions are assigned. As long as they don’t over-rotate and step on the toes of the engineers who are fixing the issue, you’re all good. Re-evaluating this core group every once in a while is a great thing to do. Depending on how frequently these leaders are engaged, burnout can be an issue. (Btw, for years, our front-line operations team served this function themselves. As we grew and became more distributed, we implemented the additional Call Leader tier, with the aim of focusing on better tooling and visibility to drive down the frequency with which that new tier was engaged.)

  • Documentation: While the front-line team should be adding ticket updates, the Call Leader is responsible for making sure that happens. If done properly (and in conjunction with updates to the status of an event in an Event Management tool), a Call Leader shouldn’t have to interrupt the flow of the call to brief newcomers about the state of the event, nor should they need to ask themselves, “now, about what time did that happen?” after the event is complete. It also allows interested parties outside of the flow of the call to follow along with the event without interrupting with those annoying, “what’s the status of this event?” questions.
  • Focus on resolution. Ask leading questions to focus service owners on resolving the immediate issue (see ‘Common Questions’ below). Once root cause of an issue has been discovered, engineers may have a tendency to dive directly into analysis before the customer experience has actually been restored. There’s plenty of time after an event to do that analysis.
  • Facilitate decision making. The more people participating in an event, the more difficult it can be to make the tough decisions (or sometimes even just the simple ones). Call Leaders should act as a facilitator and as a voice of reason when necessary. For example, making the call on whether to roll back a critical software deployment supporting the launch of a new product isn’t typically something you’d want an engineer to make. They don’t need that stress along with trying to diagnose and fix a production issue. Since Call Leaders are typically tenured employees who understand the business, they should be able to engage the correct people and ask the proper questions to come to a decision quickly.
  • Escalate effectively. Pay attention to whether progress is being made on the call or whether anyone is struggling with either understanding the issue or handling the work load. Ask whether you can engage anyone else to help, but realize that engineers are notorious for not wanting to ask for help. Give it a few more minutes (this all depends on the urgency of the event), then ask, “who should I engage to help?”. If an on-call doesn’t offer a name, engage both the secondary on-call (if it exists) as well as the group manager. I usually say something along the lines of, “I’m going to grab John to help us understand this issue a bit better.”, which is a fairly non-confrontational way of letting the on-call know that you’re going to bring in additional resources.
  • Release unnecessary participants. No one likes to hang out on a call if they’re not contributing to the resolution of the issue. Keeping the call pared down helps with unnecessary interrupts and also keeps on-calls happy. Prior to releasing anyone from the call, make sure that they have noted in the ticket that their service has passed health checks. (Remember to note in the ticket when the person dropped off the call for future reference!)
  • Running multiple conference calls. If you’re managing an event that includes multiple efforts then it can be a good idea to split the call. Examples of this are a networking issue that spawns a data corruption issue, or an event with multiple symptoms and/or multiple possible triggers/root causes. Communication between the two camps can become unwieldy quickly, so if you don’t have a secondary Call Leader, then utilize the group manager responsible for one of the issues. This necessitates a round of call leader training for primary managers, which ought to be completed in any case. This also makes it highly important that any proposed changes to the environment are added to your communication mechanism (ticket, IRC, etc) prior to making the change so that all parties involved in the event are aware. As you refine monitoring and visibility into the stack, those ‘unknown root cause’ events should happen more and more infrequently.

Common Questions to Ask

Depending on the environment, there will be a subset of questions that you can always ask during an event to clarify the situation or guide the participants. These are a few that usually helped me when driving complex events in previous roles.

  1. What is the scope/impact of the event?
  2. What’s changed in the environment over the past X hours/days?
  3. What is the health of upstream and downstream dependencies for the service exhibiting outage symptoms?
  4. Is a rollback [of a deployment or change] relevant to consider?
  5. How complex is the issue? Are we close to finding root cause?
  6. Do we have everyone on the call we need?
  7. Is sufficient progress being made?
  8. How do we verify that root cause has been addressed?

Incremental progress

Use front-line Ops team and managers if you don’t have sufficient staff for a call leader rotation. Invest in creating and holding training sessions for all of the major participants in your typical events, regardless. Just providing them information on questions to ask and how to interact during an event will set the proper direction. (Remember to continue measuring your effectiveness and make adjustments often.)

Front-Line Ops Responsibilities

The front-line Ops team typically sees major issues first and is the nucleus of managing an event. The team is known as ‘NOC’, ‘tier one’, ‘operators’ or any number of other terms. Regardless of what they’re called, they’re at the heart of operations at any company, and they ought to feel trusted enough to be an equal partner in any event management process. They typically have a broad view of the site, have relationships with the major players in the company, and understand the services & tools extremely well. There’s also some serious pressure on the team when push comes to shove, including the following responsibilities.

  • SLAs. If you’re dropping money or hurting your company’s reputation every minute you’re down, then it’s vital that you define and adhere to SLAs for recognizing an event (owned by the monitoring application and service owner), submitting the tracking ticket, and engaging the appropriate people. The two latter responsibilities are owned by operations (or whoever is on the hook for ensuring site outages are recognized and addressed). I recommend keeping state within the trouble ticket about who you’ve engaged and why. We wound up building a feature into our event management tool that allowed resolvers to ‘check in’ to an event, which would add a timestamped entry into the tracking ticket. This allowed anyone following along to the event, including tier-one support and the Call Leader (see above), to know who was actively engaged in the event at any given time. It also provided a leg up on building a post mortem timeline and correcting instances of late engagement by service owners.
  • Engagement and Notification. Ops should own the engagement and basic notification for each event. If you need to cobble together some shell scripts to do a ‘mail -s’ to a bunch of addresses in lieu of actual engagement lists to begin with, so be it! Just make sure it makes it into the ticket as quickly as possible so there’s a timestamped record of when the engagement was sent. Ops is closest to the event and typically has a better understanding of what teams/individuals own pieces of the platform than anyone else. Call Leaders and service owners should request that someone be engaged into the event, rather than calling them directly. Not only does this allow other groups to focus on diagnosis/resolution, but it ensures that messages & the tracking of those messages are consistent. The exception to this should be more sensitive communication with senior management/PR/legal, which should be taken care of by the Call Leader, where relevant.
  • Documentation. Every person involved in an event should own portions of this. My opinion is that front-line ops should document who’s been engaged, who’s joined the event, who’s been released from the event, any troubleshooting they’ve done themselves (links to graphs, alerts they’ve received, high-impact tickets cut around the same time), and contacts they’ve received from customer service, where applicable. Adding action items as you go along (“we need a tool for that” or “missing monitoring here”) will aid with identifying action items and creating the agenda for any required post mortem. Ops should also have an ear trained to the call at all times and should document progress if requested by the Call Leader or another service owner.
  • Aiding in troubleshooting. Each on-call is responsible for troubleshooting their own service, but there are times when the front-line Ops personnel see an issue from a higher level and can associate an issue in one service with an upstream or downstream dependency. Ops folks typically have a better grasp on systems fundamentals than software developers and can parse logs faster & easier than their service owner counterparts. I’m a believer in ‘doing everything you can’, so if you have a front-line person who’s able to go above and beyond while still taking care of their base responsibilities of engagement and notification, then why not encourage that?
  • Keeping call leaders honest. Sometimes even Call Leaders can get sidetracked by diving into root cause analysis prior to the customer experience being restored. Front-line Ops people should be following along with the event (they need to document and help troubleshoot anyway), and should partner with the Call Leader to ensure that service owners stay on track and focus remains on resolving the immediate issue.

Incremental progress

This is a lot for a front-line team to cover, so pare down the responsibilities based on the organization’s needs. Engagement of the proper on-calls is imperative to reducing time to diagnose and resolve, so focus there first. If you have strong leaders to run and document events but still need to improve MTTD/MTTR, then concentrate the Ops team on providing on-calls with additional hands or visibility.

On-Call Responsibilities

A major goal of any IT Event Management process should be to enable engineers to act as subject matter experts and focus on diagnosing, resolving and preventing high-impact events. In exchange for this, on-calls should be asked to do only one thing: multi-task. 🙂

  • Be proactive. If you’ve discovered a sev1 condition, get a hold of the NOC/tier1/event management team or leader immediately. Submitting a ticket outside of the typical process will likely introduce delays or confusion in engagement.
  • Respond immediately. If you’re engaged into a sev1 event, join it immediately and announce yourself & what team you’re representing. A primary on-call should adhere to a tight SLA for engaging. Our SLA was 15min from the time the page was sent to be online and on the conference call. This allowed time for the message to be received and for the on-call to log in. I’m not a fan of trying to define SLAs for actually resolving an issue- some problems are just really gnarly, especially once you’re highly-distributed, and it’s just not controllable enough to measure and correct.
  • Take action. Immediately check the health of your service(s), rather than waiting for the Call Leader to task you with that.
  • Communicate. The worst thing to have on a conference call is silence when root cause is still unknown or when there isn’t a clear plan to resolution. If you’ve found an anomaly, need assistance, are making progress, need to make a change to the environment or have determined that your service is healthy, make sure that the call is apprised of what you’ve found and that the ticket is updated with your findings.
  • Escalate. Don’t be afraid to escalate to a secondary, manager or subject matter expert if appropriate. No one’s going to think less of you. In fact, if you decrease the time to resolve the issue by escalating, you ought to be praised for it!
  • Restore service. Stay focused on restoring service. Leave root cause discussions until after the customer experience is reinstated unless it has direct bearing on actually fixing the issue.
  • Ask questions. If there’s ever a question about ownership of a task, whether something’s being/been looked at, what the symptoms are, etc., then ask the people on the call for clarification. Don’t assume that everything is covered.
  • Offline conversations: These should be kept to a minimum to ensure that everyone is on the same page. It’s not just about knowing what changes are being made to the environment during troubleshooting, although you must understand this so that engineers don’t exacerbate the issue, trample on someone’s change, or cloud your understanding of just what change “made it all better”. Something as simple as an off-hand comment about a log entry can spur someone else on the call to think of an undocumented dependency, change to software, or any number of other things related to the event. There are times when spinning off a separate conversation to work through a messy & compartmentalized issue is a good thing. Check in with the Call Leader if you feel it’s a good idea to branch off into a separate discussion.
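
If you want to track the engagement SLA described above (15 minutes from the page being sent to the on-call being on the conference call), the check itself is tiny. Here is a minimal sketch in Python; the function name, the constant and the sample timestamps are all made up for illustration and aren’t from any particular paging tool.

```python
from datetime import datetime, timedelta

# Assumption: the 15-minute engagement window described above.
ENGAGEMENT_SLA = timedelta(minutes=15)


def met_engagement_sla(paged_at: datetime, joined_at: datetime,
                       sla: timedelta = ENGAGEMENT_SLA) -> bool:
    """Return True if the on-call joined the call within `sla` of the page."""
    # An on-call who was already on the call before the page (negative
    # elapsed time) trivially meets the SLA.
    return joined_at - paged_at <= sla


# Hypothetical timestamps: paged at 03:00, joined the call at 03:12.
paged = datetime(2024, 1, 1, 3, 0)
joined = datetime(2024, 1, 1, 3, 12)
print(met_engagement_sla(paged, joined))
```

Note that this only measures time to engage, not time to resolve, which is in keeping with the point above about not putting an SLA on resolution.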

Troubleshooting Best Practices

Not all engineers have extensive experience in troubleshooting, so here are a few hints to help participants in an event.

  • Determine actual impact before diving headlong into diagnosing, where possible
  • Check the obvious
  • Start at the lowest level commensurate with the issue. For example, if monitoring or symptoms point to an issue that is contained in the database layer, it’s relevant to focus efforts there, rather than looking at front-end webserver logs.
  • Assume that something has changed in the environment until proven otherwise
  • Making changes to the environment:
    • don’t make more than one major change at the same time
    • keep track of the changes you’ve made
    • verify any changes made in support of troubleshooting
    • be prepared to roll back any change you’ve made
  • Ask “if… then…” questions
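
The guidance on making changes (one at a time, tracked, verified, and ready to roll back) can be sketched as a small change log. This is only an illustration under my own assumptions: the `Change` and `ChangeLog` names are hypothetical, the "environment" is just a dictionary, and automatically rolling back a change that fails verification is one possible policy rather than a rule.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Change:
    description: str
    apply: Callable[[], None]     # makes the change
    rollback: Callable[[], None]  # undoes the change
    verify: Callable[[], bool]    # confirms the change took effect


@dataclass
class ChangeLog:
    applied: List[Change] = field(default_factory=list)

    def make(self, change: Change) -> bool:
        """Apply a single change, verify it, and undo it if verification fails."""
        change.apply()
        if change.verify():
            self.applied.append(change)  # keep track of what was changed
            return True
        change.rollback()
        return False

    def rollback_all(self) -> None:
        """Undo every recorded change, most recent first."""
        while self.applied:
            self.applied.pop().rollback()


# Hypothetical environment: a single configuration value.
config = {"pool_size": 10}
log = ChangeLog()
log.make(Change(
    description="raise pool_size from 10 to 20",
    apply=lambda: config.update(pool_size=20),
    rollback=lambda: config.update(pool_size=10),
    verify=lambda: config["pool_size"] == 20,
))
log.rollback_all()  # config is back to its original value
```

Because `make` takes exactly one change at a time, it’s always clear which change "made it all better", and rolling back in reverse order avoids trampling on an earlier change.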

Manager Responsibilities

  • Take Ops seriously. Support your team’s operational responsibilities. Contribute to discussions regarding new processes and tools, and encourage your team to do the same. Take operational overhead into account when building your project slate; carve out time for basic on-call duties, post-launch re-factoring, and addressing operational action items where possible.
  • Prepare your engineers. Make sure that anyone who joins the on-call rotation receives training on the architecture they support, the tools used in the company, and who their escalation contacts are (subject matter experts, usually), and is provided with relevant supporting documentation.
  • Reachability. As the management escalation, you should ensure that Ops has your contact information, or your escalation rotation’s alias. You should also have offline contact information for each of your team members.
  • Protect your engineers. During a call, there may be times when multiple people are badgering your on-call for disparate information. As a manager, you should deflect and/or prioritize these requests so that your engineer can focus on diagnosing the issue and restoring service.
  • Assist the call leader. You may be called upon to help make tough decisions such as rolling back software in support of a critical launch. Be prepared and willing to have that conversation. You are also the escalation contact for determining what additional resources can/should be engaged, and you may be asked to run a secondary conference call/chat, where necessary.
  • Help maintain a sense of urgency. It’s possible that efforts to find root cause languish as the duration of an event lengthens. Keep your on-call motivated, and get them help if need be. Keep them focused on restoring the customer experience, and remove any roadblocks quickly and effectively.
  • Post-event actions. If the root cause of the event resides in your service stack(s), you will be asked to own and drive post-event actions, which may include holding a post mortem, tracking action items, and addressing any follow-up communication where relevant.

Post-event Actions

For events with widespread impact, a post mortem should be held within 1-2 business days of the event. If you’ve documented the ticket properly, this will be fairly simple to prepare for. Either the group manager or the Call Leader will facilitate the meeting, which typically covers a brief description of the issue, major points in the timeline of the event, information on trigger, root cause & resolution, lessons learned and short- & long-term action items. At a minimum, participants should include the on-call(s) and group manager(s), the Call Leader, and the member(s) of the Ops team. The meeting may also include senior management or members of disparate teams across the organization, depending on the type of event and outstanding actions.

Action items must have a clear owner and due date. Even if the owner is unsure of the root cause and therefore can’t provide an initial ETA on a complete fix, a ‘date for a date’ applies. Make sure to cover the ‘soft’ deliverables such as communicating learnings across the organization, building best practices, or performing audits or upgrades across the platform.
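
If you track action items in a tool or a script, the "clear owner and due date" rule is easy to enforce. The sketch below is a hypothetical data model of my own (the `ActionItem` class and its field names aren’t from any real ticketing system): an item counts as trackable when it has an owner and either a due date or a ‘date for a date’.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    summary: str
    owner: str
    due: Optional[date] = None            # when the fix is expected
    date_for_date: Optional[date] = None  # when an ETA will be provided

    def is_trackable(self) -> bool:
        """True if the item has an owner and a due date or a 'date for a date'."""
        has_owner = bool(self.owner.strip())
        has_date = self.due is not None or self.date_for_date is not None
        return has_owner and has_date


# Hypothetical item: root cause still unknown, so only a 'date for a date' is set.
item = ActionItem(
    summary="Find root cause of connection pool exhaustion",
    owner="db-oncall",
    date_for_date=date(2024, 1, 5),
)
print(item.is_trackable())
```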