Resilient thinking after the CrowdStrike incident
Now that the dust is settling on the CrowdStrike outage, it's worth looking at some of the technical things we should be doing to prevent this in the future. Naturally, there is the question of why the QA process never picked up the issue in the content update, but that is not the focus of this article; here we are looking at what we can implement to mitigate similar issues in the future.
Firstly, it's worth saying a big thank you to all the technical folk out there who worked to get systems back up and running. They probably didn't choose the software, but they are responsible for ensuring it is up and working. These are the folks who never hear a “thank you” when everything is working perfectly! As someone who works in IT, doing the work behind the scenes to ensure systems are “up” and “performing as intended”, I understand and appreciate the effort that went into recovering the computers affected by this incident.
Thank you! You do an amazing job! 👍
Resilience
Before we look at some ideas to prevent or mitigate these types of issues in the future, let's just remind ourselves of how resilience is deployed in modern computing.
Network
When thinking about mission-critical systems, we are always taught to ensure there is resilience across multiple layers. The most obvious and basic is multiple internet links from separate carriers. This requires some magical routing to work, and is now generally well established. We access sites and services across the world, circuits go up and down, and still the traffic flows. OK, it may be a little slower, but it gets there and we don't notice.
Virtualisation
The next layer of resilience is computing power. We run our services on multiple servers, distributed across multiple physical hosts, across multiple racks. If a rack of physical servers goes down, or needs maintenance, our service continues working. This is now also a very well-established design pattern in computing architecture.
Application
Within the virtualisation layer, further resilience is applied: our application runs on multiple operating system instances, spread over multiple physical hosts, in multiple racks. Now, if our application crashes on one virtual server, another instance picks up the slack. Yet again, this is now a very common design with technologies such as containerisation and application load balancers.
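As a concrete (if simplified) picture of this, here is a minimal Python sketch of health-checked load balancing; the instance names are invented and the health probes are simulated rather than real HTTP checks:

```python
import itertools

# Hypothetical application instances spread across hosts and racks.
# A real load balancer would probe an HTTP health endpoint on each
# instance; here the probe results are simulated with a dictionary.
INSTANCES = {
    "app-vm-01 (rack A)": True,   # healthy
    "app-vm-02 (rack A)": True,   # healthy
    "app-vm-03 (rack B)": False,  # crashed instance from the scenario above
    "app-vm-04 (rack B)": True,   # healthy
}

def healthy_instances():
    """Return only the instances that pass their health probe."""
    return [name for name, up in INSTANCES.items() if up]

def balancer():
    """Cycle requests across whatever is currently healthy."""
    return itertools.cycle(healthy_instances())

if __name__ == "__main__":
    pool = balancer()
    for request_id in range(6):
        print(f"request {request_id} -> {next(pool)}")
    # app-vm-03 never receives traffic, so users don't notice the crash.
```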
Cloud
So we have all our wonderful network routing, and our application is load balanced across multiple virtualised servers. Everything is wonderful, but it's hosted in my office. The next part of resiliency is “The Cloud”. The cloud involves using all the above resilience techniques, but in a highly resilient and managed set of data centres across the globe. Now we can have our application hosted on multiple virtual machines, across multiple racks, in multiple data centres, all with multiple network connections.
This is it now, isn't it? What more do I need?
Multi-Cloud
OK, so what if my region in my cloud provider goes down? That's fine, I have my application in two regions. But what if my cloud provider suffers a major outage in their networking? Yes, it's rare, but it does happen. On exactly the same day as the CrowdStrike outage, Microsoft had an outage in their Central US region (ID: MO821132), which impacted the majority of their M365 services. Hang on a moment: surely Microsoft have their services geo-distributed? Yes, they do, but there will no doubt be some backend services that are only hosted in a specific region, and this incident highlighted that.
The three major cloud providers don't make this particularly easy, although there are signs it is getting easier. For example, Microsoft Defender for Cloud can monitor AWS and Google Cloud workloads for security issues. Azure Traffic Manager (a DNS load balancer) can route traffic between endpoints hosted anywhere on the public internet.
(I'd like to provide examples of where AWS and Google allow multi-cloud connectivity, but my day job is mostly focused on Azure, so that is my area of expertise!)
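To illustrate the kind of priority routing a DNS load balancer such as Traffic Manager performs, here's a rough, cloud-agnostic Python sketch; the endpoint URLs are invented, and a real service would do this at the DNS layer with proper health probes:

```python
from urllib.request import urlopen

# Hypothetical endpoints for the same application hosted in two clouds.
# Priority routing: always prefer the primary unless its probe fails.
ENDPOINTS = [
    {"name": "azure-primary", "url": "https://app.example-azure.net/health", "priority": 1},
    {"name": "aws-secondary", "url": "https://app.example-aws.net/health", "priority": 2},
]

def is_healthy(url, timeout=3):
    """Very rough health probe: any HTTP 200 within the timeout counts."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # DNS failure, timeout, connection refused, HTTP error, etc.
        return False

def pick_endpoint(endpoints):
    """Return the highest-priority endpoint that passes its probe."""
    for endpoint in sorted(endpoints, key=lambda e: e["priority"]):
        if is_healthy(endpoint["url"]):
            return endpoint
    return None  # nothing answered: a total outage

if __name__ == "__main__":
    chosen = pick_endpoint(ENDPOINTS)
    print("route traffic to:", chosen["name"] if chosen else "nowhere (all endpoints down)")
```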
Of course, it is possible to join networks together with Virtual Private Networks, mesh networking, or leased lines (such as AWS Direct Connect / Microsoft ExpressRoute), but these can be costly. Multi-cloud wouldn't have helped any specific company with this outage; however, you can see that it's another aspect of resiliency that is becoming more popular.
Improvements
So what improvements can we make beyond the above to help us weather the storm when similar issues occur in the future?
1. Critical Vendors
What about vendors? Why do we insist on having just Acme Corporation provide all our endpoint security services? As we've seen with CrowdStrike, one dodgy update that slipped through QA can have a dramatic effect. How about having two vendors for your endpoint security services? Everything else in our tech stack above has been made resilient, so why not security services? In the example of our application being hosted on four virtual machines, two VMs would run one vendor's security software and the other two would run another's.
One issue that may be preventing this is the management platform. For this to work, IT departments would require separate management platforms for each of the endpoint security products they run. There is a gap in the market here for an open-source platform that can provide basic management of such software, reporting on elements such as:
- are my hosts up?
- when were they last updated?
- push out the latest update
Of course, this would probably need some co-operation between vendors, but it's a nice thought that one day we could do this? 🤔
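As a thought experiment, the reporting side of such a platform could be as simple as normalising each vendor's status feed into one schema. A minimal Python sketch follows; both vendor adapters are entirely made up and hard-coded, where a real tool would call each vendor's API:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class EndpointStatus:
    """Vendor-neutral view of a protected host."""
    hostname: str
    vendor: str
    is_up: bool
    last_updated: datetime

# In reality each adapter would call the vendor's own API; here the
# responses are hard-coded purely to show the normalisation idea.
def fetch_from_vendor_a():
    return [EndpointStatus("web-01", "VendorA", True,
                           datetime(2024, 7, 18, 6, 0, tzinfo=timezone.utc))]

def fetch_from_vendor_b():
    return [EndpointStatus("web-02", "VendorB", False,
                           datetime(2024, 7, 17, 22, 0, tzinfo=timezone.utc))]

def report(statuses):
    """Answer the three questions above: up? last updated? needs a push?"""
    for s in statuses:
        age_hours = (datetime.now(timezone.utc) - s.last_updated).total_seconds() / 3600
        flag = "PUSH UPDATE" if age_hours > 24 else "ok"
        print(f"{s.hostname:8} {s.vendor:8} up={s.is_up} "
              f"updated {age_hours:.0f}h ago [{flag}]")

if __name__ == "__main__":
    report(fetch_from_vendor_a() + fetch_from_vendor_b())
```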
2. Delays
Back in the days of Windows NT, I recall we would never apply a service pack as soon as it was released. This practice still continues in some areas, for example never installing the latest OS in production as soon as it's released. At a number of organisations I've worked at, I've recently been introducing a delay to patching in production.
For example: following Patch Tuesday, all Development servers get patched on the following Monday. Then, one week later, all Test servers get patched. And the following week (exactly two weeks after Patch Tuesday), Production servers get patched.
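As a quick illustration, those ring dates can be derived from Patch Tuesday (the second Tuesday of the month). A small Python sketch; the exact offsets are my reading of the schedule above rather than a fixed rule:

```python
from datetime import date, timedelta

def patch_tuesday(year, month):
    """Second Tuesday of the month (Microsoft's Patch Tuesday)."""
    first = date(year, month, 1)
    # weekday(): Monday is 0, Tuesday is 1
    first_tuesday = first + timedelta(days=(1 - first.weekday()) % 7)
    return first_tuesday + timedelta(days=7)

def patch_rings(year, month):
    """Ring dates loosely matching the example schedule described above."""
    pt = patch_tuesday(year, month)
    dev = pt + timedelta(days=6)     # the following Monday
    test = dev + timedelta(days=7)   # one week later
    prod = test + timedelta(days=7)  # the week after that
    return {"patch_tuesday": pt, "dev": dev, "test": test, "prod": prod}

if __name__ == "__main__":
    for ring, day in patch_rings(2024, 7).items():
        print(f"{ring:14} {day:%A %d %B %Y}")
```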
Another routine I introduced at another company: after Patch Tuesday, we would auto-deploy to internal servers on the Wednesday, and then to production servers on the Friday and Saturday.
Of course, these schedules may not suit every business, but you get the idea: even a short delay of a day can help, and gives the technical teams room for critical thinking should anything untoward happen.
For endpoint security which relies on signature/definition/content updates (of some type), it's a little more difficult: we, the technical users, assume these are small updates that can be applied without a second thought. But yet again, could these updates be delayed by just one day? Do they really need to be installed the minute they are released?
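Even a 24-hour soak for content updates could be expressed as a very simple gate. A hypothetical Python sketch; the soak period and the release timestamp are illustrative only:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy: hold any new content/definition update for a
# soak period before it is allowed onto production endpoints.
SOAK_PERIOD = timedelta(hours=24)

def ready_to_deploy(released_at, now=None, soak=SOAK_PERIOD):
    """True once the update has been public for the full soak period,
    i.e. long enough for a bad update to have been noticed and pulled."""
    now = now or datetime.now(timezone.utc)
    return now - released_at >= soak

if __name__ == "__main__":
    release_time = datetime.now(timezone.utc) - timedelta(hours=3)  # illustrative
    print("deploy to production now?", ready_to_deploy(release_time))
```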
3. Pragmatic Deployments
This is of course very dependent on the environment, but does the endpoint security software have to be installed on every server in the environment? For example, in this incident there were many photos in the media showing the BSOD (Blue Screen of Death) error. I'm making assumptions here, but are these servers just displaying information for the general public? Could other security prevention techniques be deployed instead, such as network isolation, server hardening (e.g. disabling ALL unrequired services), or robust user access control?
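On the hardening point, even a simple audit of running services against an agreed baseline goes a long way. A sketch in Python, assuming the third-party psutil package on a Windows host; the allowlist below is invented purely for illustration:

```python
import psutil  # third-party package; win_service_iter() is Windows-only

# Hypothetical hardening baseline for a public display / kiosk server:
# any running service not on this list is a candidate for disabling.
ALLOWED_SERVICES = {
    "Dhcp", "Dnscache", "EventLog", "LanmanWorkstation",
    "W32Time", "Winmgmt",  # example names only
}

def unexpected_services():
    """Return running services that the baseline doesn't expect."""
    findings = []
    for service in psutil.win_service_iter():
        if service.status() == "running" and service.name() not in ALLOWED_SERVICES:
            findings.append((service.name(), service.display_name()))
    return findings

if __name__ == "__main__":
    for name, display_name in unexpected_services():
        print(f"review/disable: {name} ({display_name})")
```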
Final Thoughts
There is a lot of finger-pointing at Microsoft, but in my eyes this is wrong: the Microsoft shared responsibility model outlines that the customer is responsible for any services/applications deployed on their virtual machines (IaaS).
There is also further criticism saying that we should all move to Linux or Mac (though it largely went unnoticed that CrowdStrike had a very similar issue with Debian Linux in April 2024). We all have a choice to make about which operating system or application we use; however, we should always assume that there will be issues in the future and plan appropriately to mitigate them.
No technology or process is perfect.
By looking holistically and pragmatically at the systems we support and their dependencies, we can hopefully reduce the impact of future outages similar to the one we have all just suffered.