Global Software Outage Caused by a Failed Update
In an unprecedented event, we are witnessing what may be the largest global software-induced outage in recent history. Airports, hospitals, pharmacy services, flight operators, train services, TV broadcasters, supermarkets, and numerous other critical services have been severely impacted. This blog post aims to provide an overview of the situation, its causes, and the implications for software deployment practices.
The Root Cause
The outage has primarily affected Windows machines that use CrowdStrike for endpoint protection. CrowdStrike, a cybersecurity company valued at $80 billion and holding approximately 22% market share in Windows endpoint protection, pushed out a software update that caused widespread system crashes.
The update, which targets software running at the kernel level, appears to have been deployed globally and simultaneously across all client machines. This approach, reminiscent of a "YOLO (You Only Live Once) deploy," has had catastrophic consequences because a kernel-level fault can take down the entire operating system, and many of the affected machines run critical services.
The Impact
The repercussions of this outage are far-reaching:
- Transportation disruptions: Airports and train services are experiencing significant delays and cancellations.
- Healthcare setbacks: Hospitals and pharmacy services are struggling to maintain operations.
- Retail challenges: Supermarkets are facing difficulties in processing transactions and managing inventory.
- Media interruptions: TV broadcasters are experiencing technical difficulties.
The outage is large enough that its effects may register in global GDP figures.
The Fix and Its Challenges
CrowdStrike has advised a manual fix for the issue, which is both time-consuming and labor-intensive. The process involves:
- Booting the affected machine in safe mode
- Deleting a specific file
- Rebooting the machine
This procedure must be repeated by hand on every impacted Windows machine, so recovery time grows with the size of the affected fleet.
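To make the middle step concrete, here is a minimal Python sketch of the "delete a specific file" step only. It is an illustration, not CrowdStrike's published procedure: the directory and filename pattern are hypothetical parameters that would have to come from the vendor's guidance, and it assumes a Python runtime is available in the recovery environment, which may well not be the case on a machine booted into safe mode. It defaults to a dry run so nothing is deleted until the matches have been reviewed.

```python
from pathlib import Path


def remove_matching_files(directory, pattern, dry_run=True):
    """Delete regular files in `directory` whose names match the glob `pattern`.

    Both arguments are placeholders: take the real location and filename
    pattern from the vendor's advisory. Returns the list of matched paths.
    """
    matches = sorted(p for p in Path(directory).glob(pattern) if p.is_file())
    for path in matches:
        if dry_run:
            # Report only; leave the file in place.
            print(f"would delete {path}")
        else:
            path.unlink()
            print(f"deleted {path}")
    return matches
```

Running it once with `dry_run=True` and checking the printed list before repeating with `dry_run=False` keeps an operator from removing the wrong files. Booting into safe mode and rebooting afterwards still have to be done on each machine.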
Lessons Learned: The Importance of Staged Rollouts
The most perplexing aspect of this incident is the apparent lack of a staged rollout or canary deployment strategy. For a cybersecurity vendor of CrowdStrike's caliber, bypassing such crucial deployment practices is both surprising and concerning.
Staged rollouts are essential for several reasons:
- Risk mitigation: They allow for early detection of issues before widespread deployment.
- Impact assessment: Problems can be identified and addressed with minimal disruption to the user base.
- Rollback capability: In case of issues, changes can be easily reverted without affecting the entire user base.
The fallout from this incident underscores the critical importance of never skipping staged rollouts, especially for software that runs on critical infrastructure.
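The idea behind a staged rollout can be sketched in a few lines of Python. This is a simplified illustration, not a description of any vendor's actual pipeline: `deploy`, `rollback`, and `is_healthy` are hypothetical callbacks supplied by the caller, and the stage fractions are arbitrary example values. The sketch also assumes a host stays reachable after a bad update so that it can be reverted remotely, which, as this incident shows, may not hold when the fault crashes the machine at boot.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class RolloutResult:
    completed: bool                 # True if every stage passed its health check
    updated_hosts: List[str]        # hosts left running the new version
    halted_at_stage: Optional[int] = None  # index of the failing stage, if any


def staged_rollout(
    hosts: Sequence[str],
    stage_fractions: Sequence[float],   # cumulative, e.g. [0.01, 0.10, 0.50, 1.0]
    deploy: Callable[[str], None],      # hypothetical: push the update to one host
    rollback: Callable[[str], None],    # hypothetical: revert one host
    is_healthy: Callable[[str], bool],  # hypothetical: post-update health probe
    max_failure_rate: float = 0.0,
) -> RolloutResult:
    updated: List[str] = []
    start = 0
    for stage_index, fraction in enumerate(stage_fractions):
        # Each stage extends the rollout up to a cumulative share of the fleet,
        # always adding at least one host while any remain.
        end = min(len(hosts), max(start + 1, round(len(hosts) * fraction)))
        batch = list(hosts[start:end])
        if not batch:
            continue
        for host in batch:
            deploy(host)
        updated.extend(batch)

        failures = [h for h in batch if not is_healthy(h)]
        if len(failures) / len(batch) > max_failure_rate:
            # Stop before the next, larger stage and revert what was touched.
            for host in updated:
                rollback(host)
            return RolloutResult(False, [], stage_index)
        start = end
    return RolloutResult(True, updated, None)


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(1000)]
    touched = []
    result = staged_rollout(
        fleet, [0.01, 0.10, 0.50, 1.0],
        deploy=touched.append,
        rollback=lambda host: None,
        is_healthy=lambda host: False,  # simulate an update that breaks every host
    )
    print(result.completed, result.halted_at_stage, len(touched))
```

In the demo, an update that breaks every host it reaches is stopped at the first stage after touching 10 of 1,000 machines, which is the risk-mitigation property described above. A real system would also wait between stages and draw on richer health signals than a single probe.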
Implications for CrowdStrike and the Industry
This incident will likely have significant repercussions for CrowdStrike's business and reputation. Customers may question the reliability of a security vendor that inadvertently caused such widespread disruption to the very systems it was meant to protect.
Moreover, this event serves as a stark reminder to the entire software industry about the potential consequences of rushed or improperly tested deployments. It highlights the need for robust deployment strategies, especially for software that runs at the kernel level, where a single fault can bring down the whole machine.
Conclusion
As we continue to monitor the situation and its resolution, this incident serves as a powerful case study in the importance of careful, staged software deployments. It reminds us that even for routine updates, the potential for widespread disruption should never be underestimated. Moving forward, it's crucial for all software companies, especially those dealing with critical infrastructure, to reinforce their commitment to safe and responsible deployment practices.