I vowed to take new risks when I started at Dropbox about a month ago, so when my Illustrious Leader volunteered my name as a potential panel member at SRECon, I decided to suck it up and go for it. Le sigh. I’ve never been great at speaking in front of large groups of people (or people in general). I want to make sense and provide value when I do speak, however, so I’m brain-dumping in the hopes that I’ll remember at least a few salient points. No guarantees though. ¯\_(ツ)_/¯
Panel topic: discuss the various approaches to structuring, building and maintaining teams dedicated to reliability across multiple types of organizations and infrastructure.
FWIW, I wrote a post back in 2012 about growing an existing team. Some of that’s repeated here. I also care less about the type of infrastructure (hosted, in-house, etc.) because I believe each model can apply to almost any infrastructure, provided some creativity is involved. I’m either still naive or a real veteran. #takeyourpick
How do we build teams that adapt to differing needs [over time]? How can we hire effectively for each combination of org and infrastructure?
First off, as the leader of the team, you have to be flexible and open-minded. There is no single formula for hiring a team based on the type of organization and/or infrastructure. While you can adopt/adapt portions of your past experience to each new role, you will (thankfully) always wind up with something different.
Before You Make Your Org Decisions
- Before you hire anyone, understand who your customers are, what they expect, and where the company is headed over the longer term w.r.t. infrastructure and your organization. You don’t want to hire a core of systems-focused engineers with Python skills if your main responsibilities include providing deep-dive code reviews and troubleshooting for production services written in Java.
- Get data to support any anecdotes from the conversations above. It’s not uncommon for people to grossly over- or under-state their needs, even if they’re in “ops” themselves. Don’t guess. Hiring and training n00bs is a huge investment, and hiring people before you have an understanding of where you’re headed can lead to a larger time suck and quick attrition if you make the wrong assumptions.
- Determine the metrics that will tell you whether you’re successful at any point in time (technical debt reduction, number of outage minutes, number of internal customers/teams the team supports, etc.). Re-visit these regularly to make sure you’re still on track, and don’t be afraid to admit that changes based on those metrics need to be made (that will almost invariably happen). Make smaller, more frequent adjustments rather than waiting until a wholesale re-structuring is necessary.
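For what it’s worth, here’s a minimal sketch of how one of those metrics (outage minutes) could be turned into a number you can re-visit each period. The function name, the 30-day period and the sample figures are all made up for illustration; plug in whatever your own outage tracking actually records.

```python
MINUTES_PER_DAY = 24 * 60


def availability_pct(period_days: int, outage_minutes: float) -> float:
    """Return availability as a percentage of the period (hypothetical helper)."""
    total_minutes = period_days * MINUTES_PER_DAY
    if total_minutes <= 0:
        raise ValueError("period_days must be positive")
    if not 0 <= outage_minutes <= total_minutes:
        raise ValueError("outage_minutes must fall within the period")
    return 100.0 * (total_minutes - outage_minutes) / total_minutes


# Made-up sample data: outage minutes recorded over three 30-day periods.
sample_periods = [("period 1", 130.0), ("period 2", 75.0), ("period 3", 43.2)]

for name, minutes in sample_periods:
    print(f"{name}: {availability_pct(30, minutes):.3f}% available")
```

Tracking the same calculation period over period is what makes the “smaller, more frequent adjustments” possible: you see the trend before it turns into a re-org.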
Hiring Effectively for Each Combo
- Regardless of the type of infrastructure or the size of the team, hire people who can help themselves through automation and standardization. This doesn’t necessarily mean “build a framework for All The Things”. Start with small scripts that can be cobbled together into something larger when the time comes. Build momentum by tying these efforts to actual metrics that the team and org can get behind and be proud of.
- Hire people who are open to new experiences and learning new skills. My worst hiring decisions have come from bringing someone on board who couldn’t or wouldn’t grow with the team and the changing landscape. Managing someone out is drastically more difficult than justifying keeping a req open until you find the Right Person™.
- Don’t be afraid to hire juniors. When you do, make sure they have a tight mentoring relationship with a more senior engineer, a solid onboarding plan and a clear growth path.
Adapting to the Changing Landscape
- Accept that you can’t plan out more than 6-8 months in any great detail if you’re in a high-growth situation. Keep one eye at least 12 months into the future, and regularly communicate what you think you see to your team/partners/customers so you can gather feedback, adjust strategy, etc.
- Have a structure in place for ongoing education/learning, and allow engineers time to take advantage of it. Make sure your customers/partners understand that N% of each team member’s time will be devoted to learning new skills that will make them more efficient and productive over the long term.
- Create a hiring ‘flywheel’ within the company if possible. Partnering with other teams (NOC, datacenter ops, etc.) opens up another avenue and gives you more control over the type and quality of candidates for your role(s) over time. Some things to consider if you’re partnering with other teams:
  - make sure there’s a tight communication loop between the teams from the beginning
  - include those teams in your education program
  - create ‘pair programming’ or mentoring relationships with those teams
  - follow through on your commitment to the flywheel when the time comes
- Unfortunately, not everyone on your team will be able to – or want to – change along with the role over time. Attrition happens. The best you can do for your team is to be honest about the challenges and provide everyone the opportunity to learn and grow. It’s okay if someone doesn’t want to take advantage of that though, so don’t beat yourself up over it.
What are the different organizational models?
Again, this should be underpinned by who your customers are, their expectations, your commitment to them, the direction of the company, the profile of the workload, and the team you’re assembling. This obviously isn’t an exhaustive list, and I’m sure I’ll miss many pros/cons to each. But that’s why there are multiple people on the panel.
Centralized “Infrastructure” Team, De-Centralized Service Ownership
Steph’s Def: Systems and Network Engineering, sometimes tiered, with “sysadmins” or NOC engineers comprising the first tier. More traditional (“so 7 years ago”) model, with less development work. Software development teams wholly support their own service(s) in production, although Infra may be responsible for overall availability/reliability.
| Pros | Cons |
|---|---|
| Devs “eat their own dog food” so they’re closer to the impact their changes make on production. The same benefit can be had from the “devops” option below. | Can lead to multiple disparate operational processes and tooling/automation solutions if solid solutions aren’t already in place. |
| Infrastructure engineers may have more opportunity to hone Subject Matter Expertise in their specific area(s). | Risk of inadequate prioritization of operational issues vs iterating on “product” |
| Built-in recruiting ‘flywheel’ with more junior engineers, as mentioned in the section above. | Requires a strong commitment from all teams to maintain operational excellence, and solid, consistent communication between Infrastructure and Development orgs. |
| There’s still a larger pool of “Infrastructure-only” engineers to draw from today than hybrid dev/ops engineers (at the level we’re talking about). | Can create a “black box” or “us vs them” mentality between dev & ops if the latter team is expected to support/troubleshoot production services. |
|Access to production may need to be restricted to a smaller subset of people, depending on your compliance profile.|
Steph’s Def: SREs are embedded within each development team, dotted-line reporting to the dev manager, direct-line to an SRE manager. Oncall duties are shared. SREs submit code and troubleshoot code-related issues alongside the devs, and devs are able to handle basic systems ops tasks.
Make sure your workload supports this structure. If you hire a bunch of senior-level engineers with software development experience, then ask them to babysit sick boxes all day, you’ll have a pretty high self-imposed attrition rate. As a leader, understand that making a distributed team an actual team, with shared knowledge and similar service levels and offerings, is hard. That effort must be shepherded by a strong leadership team (managers and engineers) who are all pulling in the same direction and communicating effectively.
| Pros | Cons |
|---|---|
| Better chance of creating a lasting partnership between dev & ops if team members are sitting together. | Consider how likely you’ll be to actually find people with this level of expertise who are interested in the position. You’ll most likely need to invest heavily in the education and growth of team members to “get there”. |
| Greater number of internal career options/paths for team members. | Must have a strong, communicative management team to pull together the embedded team members. Or just accept that you’ll provide inconsistent service offerings across the company. |
| Provides more fuel for promoting operational staff in dev-focused companies. | Everyone must be able to handle the ongoing conflict between operations and new product/service development in a mature way. |
Steph’s Def: Traditional sysadmins and more development-focused ops engineers together on the same SRE team. This feels like more of a transitional state or temporary stop-gap, rather than a long-term solution.
| Pros | Cons |
|---|---|
| Easier to tailor your support offering to each customer’s situation and needs. | Can create confusion across supported dev teams, particularly when teams compare what they’re getting out of your team. “Why do they have someone submitting code, and all our ops engineer does is maintain Puppet configs all day?” |
| Functional structure when transitioning into a “devopsy” kind of structure over time. | Managing multiple career paths within the same org structure can be difficult for less experienced managers or organizations. (This matters, believe me.) |
| SysAdmins who want to learn dev skills have a built-in pool of mentors to work with. | In a dev-focused organization, sysadmins may be seen as “low man on the totem pole”, with corresponding hurt feelings, etc. |
No SRE or Ops Team
Steph’s Def: All engineers are responsible for everything in production, from system configuration to writing/deploying code to maintaining production services. Hey, it’s an option. Don’t shoot the messenger. If it’s a small shop running in a hosted AWS-like environment with none of the bells & whistles needed for managing larger, more complex infrastructure, then this is probably an okay solution? (I dunno. It physically hurts me to say that.) I don’t have any pros/cons here, other than to say that once the infrastructure needs outgrow a basic implementation, you should probably look for a few operationally-savvy engineers quickly, because you’re probably already behind the curve on scaling, tooling/automation and standardization.
Okay, that’s it! Now at least I have some talking points for next week. If you’re around, stop by the Dropbox booth to say hello!