Jon Masters’ Post


Computer Architect #ArmServers @Google | Previously @Red Hat, @NUVIA | Author of several Linux programming books

In case anyone is still looking for a super simplified explanation for what happened with that CrowdStrike issue today:

Your Windows computer has different types of software. The most critical piece that controls the hardware (chips, memory, etc.) is called a "kernel". The kernel can be extended to support new hardware using "drivers". Usually, these are for something you added to your computer (e.g. a GPU/graphics card).

But some drivers can be used to do different things. One thing that is popular (but wrong, in my opinion) is to do "cyber" "security" or anti-virus type stuff by extending the kernel using a driver. This is needed in order to intercept certain basic operations (like opening files or network connections to other computers) and monitor them for "compliance" and the like. Using a driver, you can (for example) have code that monitors every file that is opened and screens it for malware. It plugs in at such a low level that anything opening a file on the computer will be detected. That's why drivers are used.

The problem is that drivers extend the Operating System kernel and the OS kernel is not a forgiving environment. While normal software having a bug might "crash" and need to be restarted, bugs in OS code will cause the whole computer to stop running (blue screen).

It appears as if in this case, the driver itself was ok, but an updated file was provided that was used to tell the driver what to do. A bug in the driver (couldn't handle corrupted config files) meant it tried to access a bad memory location and crashed, taking down the machine. Because this driver is loaded so early during system bootup, it can cause a "boot loop" where it loads, crashes the machine, the machine restarts, and the process repeats. Until someone manually boots in a special recovery mode and deletes the bad update.

Anyway, that's the super simplified explanation folks haven't given you all day.

Eric Curtin

Principal Software Engineer working on Red Hat In-Vehicle Operating System

1mo

A lot of Linux systems and other OSes have automated protection against this early boot loop problem: Android, ChromeOS, and rpm-ostree based systems with greenboot do, which prevents this in theory. When a system upgrade is applied, you atomically switch root into the new updated rootfs on reboot. You store a boot counter somewhere, let's say as a GPT partition attribute (some Android devices store the counter there), and let's say it's set to 7. Every time a boot is attempted via the bootloader, that boot counter is decremented. If a boot completes successfully, the counter stops and the boot is marked as healthy. But if the counter goes to zero, you roll back to the old rootfs with the old software, so you have a bootable system. RHEL for Edge and Red Hat In-Vehicle OS work like this. In fact, any modern built-from-scratch OS should work like this, at least for core components (things like CrowdStrike that have kernel drivers are in this bucket) 👆 Legacy OSes like Windows have an excuse though: when they were developed, nobody was doing A/B system upgrades with automated rollbacks.
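The boot-counter scheme described in this comment can be sketched roughly as follows. This is a minimal illustration only, not how Android, greenboot, or any real bootloader implements it: the `BootState` class, the slot names "A" and "B", and both function names are hypothetical, and a real system would keep this state in persistent storage such as the GPT attribute mentioned above.

```python
from dataclasses import dataclass


@dataclass
class BootState:
    """Hypothetical persistent boot state (a real system stores this on disk)."""
    active_slot: str = "B"    # slot holding the freshly applied update
    fallback_slot: str = "A"  # slot holding the old, known-good rootfs
    tries_left: int = 7       # the boot counter from the comment above
    healthy: bool = False     # set once a boot completes successfully


def choose_slot(state: BootState) -> str:
    """Called on every boot attempt; returns the slot to boot from."""
    if state.healthy:
        # The counter has stopped: keep booting the slot marked healthy.
        return state.active_slot
    if state.tries_left == 0:
        # Counter exhausted without a healthy boot: roll back to the old rootfs.
        state.active_slot, state.fallback_slot = state.fallback_slot, state.active_slot
        state.healthy = True
        return state.active_slot
    state.tries_left -= 1
    return state.active_slot


def mark_boot_successful(state: BootState) -> None:
    """Called late in boot, once the system is judged to be working."""
    state.healthy = True
```

With the default counter of 7, seven consecutive failed attempts boot slot "B", and the eighth call to `choose_slot` swaps the slots and boots "A". If `mark_boot_successful` is called at any point before that, the counter is no longer decremented.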

Jeffrey Chamberlain

Principal Engineer at Intel Corporation

1mo

I have to agree with your parenthetical point: is this level of under-the-hood access to the machine's internals through an OS driver even worth the risk/reward tradeoff? It is a conversation that needs to be had. It is at least arguable that the risk this kind of error poses is equal to, if not greater than, the risk mitigation that this level of driver access provides to the overall security solution.

Irvan Krantzler (he/him)

Leading software teams to accomplish great things

1mo

It sure seems like they should have figured out that a driver with a bug or a compromised driver would certainly make them vulnerable. So rethinking that will most certainly happen. I’m curious about why they didn’t catch it during a slow rollout or something of that ilk. Why did the entire world have to be disrupted? Knowing how hard this all is, it’s easy to be a Monday morning quarterback. But I would like to understand that part of it, because the scope was what really surprised me.

David Baeumler

Marketing Director at Red Hat Inc. | Creative brand & product storytelling

1mo

We should have made a video together about this.

Tim Ocock

Interim CTO and product delivery leader

1mo

This is a terrible layman's explanation. Why is it even necessary to introduce the concept of a kernel for laymen to explain this? Here's a better explanation: "Anti-virus software needs full access to the whole computer to spot and catch viruses. Unfortunately that means if there's a bug in the anti-virus itself, it can crash the whole system. Since anti-virus software updates regularly, a new update came that had exactly that kind of bug in it, so every computer running that anti-virus crashed."

Roberto Avanzi

Security and Cryptology Architect, Research Fellow of the CRI, University of Haifa — Security Engineering Veteran — Designer of QARMA, co-submitter of Kyber (FIPS 203: ML-KEM)

1mo

Well, no. The driver was not ok. If a malformed config file can cause it to crash, then the driver was defective. Software should always properly sanitize any input. The driver did not do it.
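To make the point about sanitizing input concrete, here is a minimal sketch of a parser that checks every length before it reads. The file layout (a 4-byte magic value, a 2-byte entry count, then 4-byte entries), the `CFG1` magic, and the `parse_config` name are all invented for illustration; this is not the actual channel file format, which is not described in this thread.

```python
import struct

# Hypothetical layout, invented for this example.
HEADER = struct.Struct("<4sH")  # 4-byte magic, 2-byte little-endian entry count
ENTRY = struct.Struct("<I")     # one 4-byte little-endian value per entry
MAGIC = b"CFG1"


def parse_config(blob: bytes) -> list[int]:
    """Return the entries, or raise ValueError rather than read past the data."""
    if len(blob) < HEADER.size:
        raise ValueError("file too short to contain a header")
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("unexpected magic value")
    expected_size = HEADER.size + count * ENTRY.size
    if len(blob) != expected_size:
        # The declared count must match the bytes actually present.
        raise ValueError("entry count does not match file size")
    return [
        ENTRY.unpack_from(blob, HEADER.size + i * ENTRY.size)[0]
        for i in range(count)
    ]
```

A caller can catch `ValueError`, keep using the previous known-good configuration, and log the rejected file. In kernel code the equivalent of the missing check is an out-of-bounds read, which is what takes the machine down.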

Berenice Mann PhD, FCIM, Chartered Marketer

Director of Marketing | Head of Marketing | Experienced, accomplished marketing leader with 20 years' experience | B2B | Technology | Product Marketing | Brand Strategy

1mo

Isn't the real problem that it seems to have been released without proper testing? On all current OSs. Surely testing on *just one* real computer would have shown the issue up instantly.

D. Scott Bonomi

Playing Bridge and waiting for the right gig

1mo

Anyone playing in kernel space should have an automated rollback if the system fails to boot. Keep a copy of the last successful file set, and if you get some number of boot failures, roll back to the last good boot set and flag the update as a bad file. Mark it in the history as Failed instead of Installed, and if another attempt is made to load that update, report immediate failure. I am sure the amateur wizards in Redmond have never considered such an option. I work in the embedded space, and a remote update cannot leave a system in an unusable state. I do recall being told to fix an issue with the statement "do not allow us to make any boat anchors", where the only possible use for a system that failed to boot was as an anchor.
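The history-mark idea in this comment could look something like the sketch below. The `UpdateManager` class, its method names, and the threshold of three failed boots are assumptions made for the example, not taken from any particular embedded update framework.

```python
from enum import Enum
from typing import Dict, Optional


class Status(Enum):
    INSTALLED = "Installed"
    FAILED = "Failed"


class UpdateManager:
    """Hypothetical tracker for the last good file set and failed updates."""

    def __init__(self, max_boot_failures: int = 3) -> None:
        self.max_boot_failures = max_boot_failures
        self.history: Dict[str, Status] = {}
        self.last_good: Optional[str] = None
        self.current: Optional[str] = None
        self._failures = 0

    def apply(self, update_id: str) -> None:
        """Stage an update, refusing one already marked Failed."""
        if self.history.get(update_id) is Status.FAILED:
            raise RuntimeError(f"update {update_id} previously failed to boot")
        self.current = update_id
        self._failures = 0

    def report_boot(self, succeeded: bool) -> None:
        """Record the outcome of one boot attempt with the current file set."""
        if self.current is None:
            return
        if succeeded:
            self.history[self.current] = Status.INSTALLED
            self.last_good = self.current
            self._failures = 0
            return
        self._failures += 1
        if self._failures >= self.max_boot_failures:
            # Too many failed boots: mark the update Failed and roll back.
            self.history[self.current] = Status.FAILED
            self.current = self.last_good
            self._failures = 0
```

After three failed boots on a new update, `current` points back at the last good file set, and a second `apply` of the same update raises immediately instead of bricking the device again.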

James Cuff

unix whisperer | hpc apprentice | advisor

1mo

Great write up. New info today. “On Windows systems, Channel Files reside in the following directory: C:\Windows\System32\drivers\CrowdStrike\ and have a file name that starts with “C-”. Each channel file is assigned a number as a unique identifier. The impacted Channel File in this event is 291 and will have a filename that starts with “C-00000291-” and ends with a .sys extension. Although Channel Files end with the SYS extension, they are not kernel drivers. Channel File 291 controls how Falcon evaluates named pipe execution on Windows systems. Named pipes are used for normal, interprocess or intersystem communication in Windows.” https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
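Going only by the naming rule quoted above (a file name that starts with "C-", then the channel number, and ends in .sys), a small sketch for locating the files of a given channel might look like this. The `find_channel_files` helper is hypothetical, and the eight-digit zero padding is inferred from the single "C-00000291-" example in the quote. It only lists matches and does not delete anything.

```python
from pathlib import Path
from typing import List


def find_channel_files(directory: Path, channel: int = 291) -> List[Path]:
    """List files named like C-<8-digit channel>-*.sys in the given directory."""
    # Zero-pad the channel number to eight digits, e.g. 291 -> "00000291".
    pattern = f"C-{channel:08d}-*.sys"
    return sorted(directory.glob(pattern))
```

For example, `find_channel_files(Path(r"C:\Windows\System32\drivers\CrowdStrike"))` would return any Channel File 291 variants present in the directory named in the quote.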

Considering the abundance of computing resources (number of cores per processor), maybe it is time to switch to hybrid kernel architectures like macOS and iOS (probably based on NeXTSTEP from the NeXT workstations by Steve Jobs) for desktops and workstations. Hybrid kernels are safer: core functionality like memory management, process/thread management, etc. is run in Ring 0, and device drivers, file system services, and the TCP/UDP IP stack are run in a separate memory space (not the user space, in between). This is much safer IMHO. It is not necessary to start from scratch; FreeBSD and the Mach kernel could be used as a starting point.
