
Houston, we have a problem

Monologue

Apollo 13 is one of my favorite movies. An intriguing story unfolding in space, a lovable protagonist, and a gripping, nail-biting finish. It has it all.

There is one particular scene I love the most, though. Yes, it involves the engineers. It involves solving a problem under pressure!

Sound familiar? Isn’t it like fixing a production bug directly in the production environment? Engineers across industries face such firefighting situations all the time. Often, one cannot plan for such disasters ahead of time. You just get the coffee going, gather around a table, collaborate, apply your brains, and solve the problem.

The Problem

The three astronauts are forced to move into the Lunar Module after an explosion cripples the Command Module. But the Lunar Module is designed for two astronauts, not three. With each breath, the extra astronaut raises the CO2 level and overloads the CO2 scrubbers in the Lunar Module.

There are plenty of scrubbers for the Command Module, but there are no backups for the Lunar Module. Why can’t they just plug the Command Module filters into the Lunar Module?

They are different shapes and sizes. The Lunar Module uses round scrubbers, while the Command Module uses square ones.

Solution

The engineers gather in a conference room and dump a pile of hardware onto a table. This is everything available to the astronauts, and it is all the engineers can use to build a CO2 filter. The goal? Fit a square peg into a round hole.

And they not only build the filter, they also talk the astronauts through the deployment steps, and it works!

My Takeaways

  • As engineers, we need to solve the problem with whatever limited resources are available, without complaining!
  • Use creativity and ingenuity.
  • In firefighting situations, teamwork is extremely important. It might sound like a cliché, but getting that final code review done for the SQL query that you are about to run in production is super important!
  • Natural leaders typically shine in such situations. Backing each other and making decisions while remaining calm is extremely important.
  • Managers need to give their engineers enough freedom in situations like this. Just stay out of the way and let the engineers do their job. Don’t ask stupid questions, and don’t try to run a postmortem in the middle of the incident. That will come later.
  • Documenting the hotfix procedure and, if possible, testing it in another environment will help. It boosts confidence in the fix being deployed. Staging (a mirror of production) is a good candidate for trying out such fixes.
  • While production issues rarely give advance warning, having a contingency plan is super important, e.g. running two web servers instead of one. This way, if one goes down, there is a backup.
  • Communication is key, both internally and externally (e.g. to end users or clients). Keeping people in the dark does not help.
  • Trusting each other and trusting the process is another key, I feel.
  • Duct tape can solve any integration issues.

PS: The idea for this post came from a podcast I love. Shout-out to the Space episode of This Developer’s Life.