Design To Fail
Never build your software in the belief it will run correctly. It won't.
Modern software design always tries to make the software we run as close to stable as possible. The deadlines and limitations every team faces make it next to impossible to ship something bug free. More often than not, a 1.0.0 release becomes a 1.0.1 within the same week, or even the same day, the version is available. The infrastructure that runs said software is no different.
In this post of #StrikingABalance, I'll be focusing on the work that goes into infrastructure design and what can be done to make it more manageable. Infrastructure should always be designed as something you are willing to let crash. The whole idea of treating your services as cattle rather than pets comes from this precise notion: letting things die, but having the automation to bring them back to life.
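To make that "bring them back to life" idea concrete, here is a minimal sketch of a watchdog that polls a health endpoint and restarts a service when it stops answering. The service name, health URL, and systemd unit are hypothetical, purely for illustration; your own tooling will look different.

```python
# A minimal self-healing sketch. The service name, endpoint, and restart
# command are assumptions for illustration, not a real setup.
import subprocess
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8080/healthz"  # assumed health endpoint
SERVICE = "web-api"                           # assumed systemd unit name
FAILURE_THRESHOLD = 3

failures = 0
while True:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            healthy = resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        healthy = False

    failures = 0 if healthy else failures + 1
    if failures >= FAILURE_THRESHOLD:
        # The service is considered dead: bring it back instead of paging a human.
        subprocess.run(["systemctl", "restart", SERVICE], check=False)
        failures = 0

    time.sleep(10)
```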
Designing to fail means designing your software and infrastructure to gracefully manage themselves. Ever had that moment when you get a call at 2:00 AM because something is on fire in production? If you design your system to fail, that call becomes about as common as lightning striking twice.
So how do you go about designing to fail? You could go the route of always adding tests. Similar to unit and integration tests in software design, you add unit and integration tests to your infrastructure design. But the tests then need to record what goes wrong: a way to track your infrastructure's failure events. Those tracked events would then trigger automation tooling, which you most likely have built yourself, to carry out the response you want for your infrastructure.
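Here is a hedged sketch of that event-to-automation idea: failures get recorded as structured events, and each event type is mapped to a remediation routine. The event names, services, and handlers are hypothetical stand-ins, not tied to any real tool.

```python
# Failure events trigger the automation you want, instead of a pager.
# Event types, service names, and handlers are illustrative assumptions.
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("infra-events")

def restart_service(event):
    log.info("restarting %s", event["service"])            # placeholder remediation

def scale_out(event):
    log.info("adding capacity for %s", event["service"])   # placeholder remediation

# The "response you want" lives in this mapping.
HANDLERS = {
    "health_check_failed": restart_service,
    "latency_budget_exceeded": scale_out,
}

def record_failure(event_type, service):
    """Record what went wrong, then trigger the matching automation."""
    event = {
        "type": event_type,
        "service": service,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    log.info("failure event: %s", json.dumps(event))
    handler = HANDLERS.get(event_type)
    if handler:
        handler(event)

record_failure("health_check_failed", "checkout-api")
```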
At the same time, due to the almost certainly countless moving parts in your infrastructure, you probably cannot test the whole platform in isolation. It may simply take too long, cost too much, or just not be feasible with what you have available. That means you need to learn how to let infrastructure fail in production. You end up with load balancers, canary deployments, release rollbacks, injected load tests, automated service rerouting, event-triggered automation, vertical database scaling, horizontal service scaling, and countless other practices and paradigms.
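As one example of letting things fail in production, here is a simplified canary evaluation. The `error_rate`, `promote`, and `roll_back` helpers are stand-ins for your own metrics and deploy tooling; the numbers are made up. The point is that a bad release gets rolled back automatically instead of waiting for the 2:00 AM call.

```python
# A simplified canary check under assumed helpers and fake metrics.

def error_rate(version: str) -> float:
    # Stand-in for a query against your metrics system (errors / requests).
    sample = {"v1.0.0": 0.002, "v1.0.1": 0.047}
    return sample.get(version, 0.0)

def promote(version: str) -> None:
    print(f"promoting {version} to 100% of traffic")

def roll_back(version: str) -> None:
    print(f"rolling back {version}; the stable release keeps the traffic")

def evaluate_canary(stable: str, canary: str, tolerance: float = 0.01) -> None:
    # If the canary's error rate is noticeably worse, undo it automatically.
    if error_rate(canary) > error_rate(stable) + tolerance:
        roll_back(canary)
    else:
        promote(canary)

evaluate_canary("v1.0.0", "v1.0.1")
```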
There is no single right way to design to fail. What you need above all else is to be comfortable with letting the infrastructure crash, have a way to learn from those crashes, and then build practices that catch those crashes and recover before you have to get involved. So when those bugs creep up, you don't need to be ready, because your infrastructure is built to fail.