Making software and software development teams. Mostly the people parts.

Downtime Prevention Technique: Exit if environment variables are not found

I've written about the concept of assertively throwing errors before. This is a specific example:

If the application, service, or script you're developing requires an environment variable to do the things it does (fulfill its contract and meet its SLAs, in enterprise-y terms), then prefer a loop at the earliest possible stage of runtime startup that iterates through the expected environment variables, builds an array of any that are null / undefined / not found, and, if that array is not empty, exits 1 with a console barf of all of the missing environment variables.
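A minimal sketch of that startup loop in Python (the variable names in REQUIRED_ENV_VARS are hypothetical placeholders; substitute whatever your service actually needs):

```python
import os
import sys

# Hypothetical list -- substitute the variables your service requires.
REQUIRED_ENV_VARS = ["DATABASE_URL", "API_KEY", "REDIS_HOST"]

def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

def assert_env_or_die(required):
    """Call this at the earliest stage of startup."""
    missing = missing_env_vars(required)
    if missing:
        for name in missing:
            print(f"FATAL: required environment variable {name} is not set",
                  file=sys.stderr)
        # Non-zero exit keeps a sane deploy pipeline from flipping traffic.
        sys.exit(1)
```

Then the first line of your entrypoint is just `assert_env_or_die(REQUIRED_ENV_VARS)`.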

Coupled with a proper deployment pipeline that checks for a live instance before flipping or rotating traffic to the new instance, this pattern completely prevents a whole host of downtimes of the form: it worked on my machine and in the staging environment(s), but failed when deployed to prod because we forgot to set the environment variable / didn't have permission to set (or even check) the environment variable / fat-fingered or copy-pasta'd the environment variable.

In my experience, this problem bites the first deployment of something like 5-10% of any group's new features that require an environment variable, especially as groups grow and deployments are increasingly handled by less experienced engineers.

The Two Dimensional Web Skills Stack

A mental model useful for:

  • An engineering leader reasoning about their team and strengths / weaknesses / gaps
  • An individual contributor trying to position themselves in the job market or choose which skills to develop

The vertical dimension is the one most are familiar with: Browser, mobile or desktop applications in the Front-End, service code running in different deployment contexts on the Back-End, and Persistence aka database(s). And then in sufficiently complex systems, layers of the above with persistence either interspersed (edge caches) or distributed (local storage in the browser, an in-mem cache on the web server, etc).

The horizontal dimension is the Software Development Life Cycle (SDLC) stretched longitudinally. Run it left to right, like a timeline: starting with learning and planning (usually in collaboration with a Product Manager) and design (visual, UX, and/or technical), and running all the way through operating, wiring together, maintaining, and eventually sunsetting and tearing down.

Running this exercise on any group shows something that should be obvious, but isn't: the Full Full Stack Engineer who can operate in most if not all quadrants of the two-dimensional spectrum, and be very proficient in at least one of them, is rare and has become an extremely valuable commodity.

50% in one vertical hemisphere is awesome as well: the front-end specialist who can sketch and design and understands every bit of how browser apps are tested, deployed, secured, and observed; the back-end specialist who can anticipate future needs, run migrations, reason about whether code should run in a container or serverless (and the associated abstractions), support full-stack stack traces, secure things within the network topology, etc.

100% vertical but no width has gotten tough, at least in a Software As A Team Sport context. You have to hide the engineer who can't collaborate with the PM and just wants to throw their code over the fence to a DevOps team that worries about the fiddly bits of actually running this stuff.

If I were starting out my career in tech today, I'd pick either front-end or back-end, learn the basic tech, and then dig in on the heart of the SDLC... design through deployment through operating. In 3-5 years you can either learn the other half of the stack or go the DevOps route.

For building an engineering team today, I would (and do) value the horizontal axis (SDLC coverage) over the vertical (full tech stack capability) and try to get that coverage in at least half to two thirds of my team.

Andy Grove's High Output Management and the Three High Impact Activities

Andy Grove's High Output Management is a classic. My first mentor insisted I read it when I was building my first team, and I'm happy to see it still relevant and referenced today.

While the book has a massive number of gems, the wisdom can be summarized for fast consumption by bulleting out Grove's opinion of the three highest-impact activities a manager can focus on:

  • Gathering information
  • Making decisions
  • Influencing others

I'm not sure those map perfectly to my opinion of the highest-leverage activities a leader can focus on in 2022, but they're close. Real close. You won't lose with that as your go-forward "how do I organize and plan my week" framework.

There's a Medium post out there that summarizes the book nicely if you don't have time for the full read. But this is one you should really make time to read in full.

Roll Back or Fail Forward

How to decide between rolling back or failing forward (or fix-in-place) as a policy for a given experience in a given environment.

Rolling back is generally the more mature habit. It enables everyone to take a breath, identify and fix problems (ideally with test coverage) in a low-pressure environment. Failing forward, on the other hand, is cowboy coding. And it's fast, and if it works it generally lets teams move on to the next problem much quicker and gets value in the hands of users faster.

Often I've found myself in situations where an engineering squad is making an argument for why they should fail forward in a given environment.

For me, as an engineering leader, I use the following framework:

If, in a given incident, users and stakeholders of the experience would be fine with an order of magnitude (10x) the downtime you've taken, then failing forward is fine.

Example 1: A production retail site. A typical incident might be 12 minutes long. 12 × 10 = 120 minutes. During that time the business would lose an estimated $50k in sales. That's not acceptable, therefore the policy should be to roll back failed changes.

Example 2: A non-prod environment is used by a vertical squad (PM+Design+Engs) to demo changes to stakeholders once a week. Same typical incident time but in this case, 120 minutes of downtime is fine unless of course it happens to overlap with the weekly demo. Failing forward is an acceptable policy in this case, trusting the team to manage around that weekly demo (policy can be written such that the team gets autonomy to choose between rolling back or fixing in place by accepting the accountability of managing around their own schedule... this is probably a separate post).
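The rule of thumb behind both examples reduces to a one-line check. A sketch in Python (the minute figures below are just the assumptions from the examples above):

```python
def fail_forward_ok(typical_incident_minutes, tolerable_downtime_minutes):
    """Fail forward only if stakeholders could live with 10x the typical incident."""
    return typical_incident_minutes * 10 <= tolerable_downtime_minutes

# Example 1: retail prod, 12-minute typical incident, well under 120 minutes
# of downtime is tolerable -> roll back.
# Example 2: weekly-demo environment, ~120 minutes is tolerable -> fail
# forward is an acceptable policy.
```

The hard part, of course, is getting an honest number for what stakeholders would actually tolerate; the arithmetic is the easy half of the framework.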

A counterpoint is when you have a team that's too aggressive for your current state. A while back I was on a team that had been acquired, and many of us were still acting like it was a startup even after revenue in some streams had 100xed. That team needed a strong counterpoint to its previous behaviors, and so "never fail forward in any environment" became our tacit policy.

The tacitness there is important. As always, the best policy is no policy. A team that expects to be told what to do in every situation needs either more capability or more confidence or both. A team that requires it is functionally useless other than as a feature factory. First principles and a strong values-based decision making culture, on the other hand, obviate the need for explicit policy in every space. 


Late 2022

Director of Engineering at AllStripes Research, a Series B healthcare tech startup working on advancing treatments for rare disease. Coming up on two years now... still a newb in health tech, but coming around, and still lots to learn every day.

Living in Portland, Oregon for 13 years now... long enough to love the place, not nearly long enough to shed the Californian label.

What's happened recently: Mom reached the end of her adventure. Those who know me personally likely know how close Mom and I were, and how committed I was to helping her have high-quality time at the end of her journey. She's home now, and I'm finding in a strange way that I can appreciate her more now that I'm removed from the day-to-day management of her care.

What's exciting: Despite it being my third decade now, I still find myself excited by tech and the opportunities it creates. So many of us would be working in factories just a generation or two ago. And now in the After Times, with so many remote opportunities, you can be in Columbus, OH or Greenville, SC or Wichita, KS (to name a few tertiary cities I've spent time in during the last decade) and have access to the best jobs in the world. Tech isn't perfect - far from it - but I remain adamant that the good far outweighs the bad.

What am I digging in on:

  • Healthcare and health tech. I've been in sponge mode for nearly two years now, not wanting to write much because I was still learning, and also experiencing the journey as caregiver for my mom during her final years.
  • Quantified self. Starting on January 1, 2020, I've been recording every day what I do at a high level, what I focus on, and a rating of my day from -2 to 2. Recently added an Apple Watch for additional health metrics, and pumped to start getting quarterly testing added to the mix. Where it's all going is a set of outcomes and the inputs put into the system to hit those outcomes...
  • Travel. Or getting excited about travel again. I've really gotten to know Portland so well over the last 3 years. Now it's time to get into the world again...