As the dust settles following the massive Windows BSOD outages caused by CrowdStrike in July 2024, the question now is how to prevent this from happening again. Microsoft convened a summit with members of its Microsoft Virus Initiative (MVI, of which CrowdStrike is a member) to discuss a problem that has no simple solution.
The CrowdStrike incident
In simple terms, in February 2024 CrowdStrike introduced a new InterProcess Communication (IPC) Template Type with Falcon sensor version 7.1. The Template Type defined 21 input fields. CrowdStrike’s rapid response mechanism uses content delivered via Channel Files, and the content interpreter for Channel File 291 was supplied with only 20 input values to match against.
On July 19, 2024, two additional IPC Template Instances were deployed. This required a comparison against the 21st value when only 20 were expected. In CrowdStrike’s words, “The attempt to access the 21st value produced an out-of-bounds memory read beyond the end of the input data array and resulted in a system crash.”
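The mismatch can be illustrated with a minimal sketch. The names below (`match_template`, and the use of `None` as a wildcard criterion) are hypothetical and are not taken from CrowdStrike’s code. Python raises an exception rather than reading out-of-bounds memory, so the sketch only shows the kind of check that rejects a template before it reads an input that was never supplied.

```python
from typing import Optional, Sequence


def match_template(criteria: Sequence[Optional[str]], inputs: Sequence[str]) -> bool:
    """Return True if every non-wildcard criterion equals the matching input.

    A criterion of None is treated as a wildcard (an assumption for this sketch).
    Raises ValueError if a criterion refers to an input that was not supplied.
    """
    for index, expected in enumerate(criteria):
        if expected is None:
            continue  # wildcard: no input value needs to be read
        if index >= len(inputs):
            # Refuse explicitly rather than reading past the end of the inputs.
            raise ValueError(
                f"criterion {index + 1} needs an input, "
                f"but only {len(inputs)} were supplied"
            )
        if inputs[index] != expected:
            return False
    return True
```

In this sketch, a 21-field template whose last criterion is a wildcard matches 20 inputs without trouble, while a template that compares against the 21st field is rejected with an error instead of reading past the end of the data.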
From a technical perspective, Microsoft was as much a victim of this incident as the endpoints that suffered the BSOD; it had no direct involvement. The CrowdStrike kernel driver had been signed by the Microsoft Windows Hardware Quality Labs (WHQL) after a full evaluation. The cause of the crash was not the driver per se, but the content passed to the driver from outside the kernel.
“That’s something Microsoft would never have seen. It traversed Microsoft. It’s not documented. Microsoft doesn’t know what’s in that file. It’s a binary code that only CrowdStrike knows how to interpret,” explained David Weston, Microsoft’s VP of enterprise and OS security.
The MVI Summit
While there was no way Microsoft could have prevented this incident, the OS firm is keen to prevent anything similar from happening in the future. Microsoft convened the MVI summit, held on September 10, 2024, to discuss what could be learned from the past and developed for the future. There were two related but separate issues on the agenda: access to the Windows kernel, and software testing prior to deployment.
The kernel
The advantage of having a driver within the kernel for third party security providers is clear: greater security for themselves (and by extension, the users) and better performance. The disadvantage is that the damage done by a failure in the kernel is more extensive and harder to reverse.
“The difference in the severity of issues between kernel mode and user mode,” explained Weston, “is that if you crash in the kernel, you take down the whole machine. If you crash an app in user mode, we can generally recover it.” This is an argument for maximizing the use of user mode and minimizing the use of kernel mode. It would benefit Microsoft’s own Windows customers, but Weston further suggests that some of the third party software vendors would also welcome the opportunity to employ a user mode component. “Microsoft is now investing in a capability to do that.”
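The recoverability Weston describes can be shown by analogy with ordinary processes. The sketch below is not a kernel or driver example, and `run_isolated` is a hypothetical helper: it runs a snippet in a separate user-mode process, so that a crash in the child is reported to the caller as an exit code rather than taking the caller down with it.

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 30.0) -> int:
    """Run a Python snippet in a separate process and return its exit code.

    A crash in the child process does not crash the caller; the caller
    only sees a non-zero exit code and can decide how to recover.
    """
    completed = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return completed.returncode
```

Calling `run_isolated("import os; os.abort()")` returns a non-zero exit code while the calling program keeps running. A fault in kernel mode has no equivalent supervisor to fall back on.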
This has already raised several concerns. Is Microsoft intending to increase user mode as an option, or is it intending to phase out third party kernel drivers? Notably, ESET (one of the MVI summit attendees) commented at the time, “It remains imperative that kernel access remains an option for use by cybersecurity products.”
Pressed on this, Weston admitted that some vendors are concerned that Microsoft may kick them out of the kernel. “Can user mode framework be as good as the access they currently have in terms of performance, etcetera? These are valid concerns. But at this point, we have no plans to revoke kernel access from anyone. It doesn’t mean that can’t change in the future, but we have no plans to do that. Our goal is to create an equivalent, and an option, for user mode.”
While ‘to kernel or not to kernel’ may be the issue that catches attention, Weston believes it is the smaller part of a two-part problem. Of greater importance is software testing prior to deployment – and the use of safe deployment practices (SDP).
Safe Deployment Practices
“Whether your security product is in the kernel or operating as an app,” explained Weston, “you can still destroy the machine or make it unavailable. If you’re operating as an app and you delete the wrong file, you can cause the machine not to boot. That alone proves the argument that effective SDP is the better ROI in terms of protecting an incident, because whether you’re in kernel or user mode, you must have SDP to avoid accidental outage.”
SDPs are not a new idea. In 2004, USENIX published a paper out of Utrecht University titled ‘A Safe and Policy-Free System for Software Deployment’. Its opening line reads, “Existing systems for software deployment are neither safe nor sufficiently flexible.” The problem has yet to be fully solved, and solving it is an important part of Microsoft’s plans to limit future outages.
This was discussed at some length at the MVI summit. “We face a common set of challenges in safely rolling out updates to the large Windows ecosystem, from deciding how to do measured rollouts with a diverse set of endpoints to being able to pause or rollback if needed. A core SDP principle is gradual and staged deployment of updates sent to customers,” comments Weston in a blog on the summit.
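The principle of gradual, staged deployment with the ability to pause or roll back can be sketched as follows. This is a minimal illustration under stated assumptions, not any vendor’s actual process: the function `staged_rollout`, the notion of “rings” of endpoints, and the `deploy`, `is_healthy` and `rollback` callbacks are all hypothetical.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy ring by ring; if a ring is unhealthy, roll back and stop.

    Returns True if every ring was updated and stayed healthy,
    False if the rollout was halted and reverted.
    """
    updated: List[str] = []
    for ring in rings:
        for endpoint in ring:
            deploy(endpoint)
            updated.append(endpoint)
        # Check the whole ring before the update reaches a wider audience.
        if not all(is_healthy(endpoint) for endpoint in ring):
            for endpoint in reversed(updated):
                rollback(endpoint)
            return False
    return True
```

With rings ordered from a small canary group to the full fleet, a faulty update is caught while it affects only the endpoints updated so far, and later rings never receive it.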
“This rich discussion at the Summit will continue as a collaborative effort with our MVI partners to create a shared set of best practices that we will use as an ecosystem going forward,” he blogged. Separately, he expanded to SecurityWeek: “We discussed ways to de-conflict the various SDP approaches being used by our partners, and to bring everything together as a consensus on the principles of SDP. We want everything to be transparent, but then we want to enforce this standard as a requirement for working with Microsoft.”
Agreeing on and requiring a minimum set of safe deployment practices from partners is one thing; ensuring that those partners employ the agreed SDP is another. “Technical enforcement would be a challenge,” he said. “Transparency and accountability seem to be the best methodology for now.”
Microsoft is not without teeth. If it finds that a partner has ignored the SDP, it can refuse to sign that partner’s kernel drivers. “It’s the same way we work with root certificate agencies today. We have a standard, and if you don’t abide by that security standard, we can remove you, which would impact your business significantly.” At the same time, the insistence on transparency would show customers when a provider is not being honest with them. “We think that level of enforcement is pretty effective,” he said.
“My TLDR,” Weston told SecurityWeek, “is that SDP is the best tool we have in the toolbox for stopping outages. Kernel mode, user mode – not saying those are invalid, just saying those are a much smaller part of the problem. SDP can help prevent outages both inside and outside of the kernel.”