CrowdStrike has released an RCA concerning the latest outage
CrowdStrike has released an RCA (root cause analysis) about the recent major outage.
According to its introduction, the document is less technical and uses generalized terminology. I would not agree with that. The document is extensive... once you have fought your way through a protective layer of bullshit. It seems they are using a lot of buzzwords to substantially inflate the explanation of a rather trivial problem.
The problems are fairly straightforward and true classics.
The actual cause is that a template parser only captured 20 parameters instead of 21. The 21st was apparently not used until now. However, the validator assumed 21 parameters and did not check the length of the array. The attempt to read the 21st, non-existent element of that array resulted in an out-of-bounds memory read, which crashed the systems.
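To make the missing check concrete, here is a minimal sketch in Python. It is not CrowdStrike's code: the names `matches`, `EXPECTED_FIELDS` and `WILDCARD` are made up for illustration, and the wildcard convention is an assumption. In Python an out-of-range index raises an exception instead of reading invalid memory, so this only shows where the length check belongs, not the crash itself.

```python
from typing import Sequence

EXPECTED_FIELDS = 21  # number of parameters the template defines, per the summary above
WILDCARD = "*"        # hypothetical marker meaning "match anything"


def matches(criteria: Sequence[str], inputs: Sequence[str]) -> bool:
    """Return True if every non-wildcard criterion equals the input at the same position.

    Both lengths are checked before anything is indexed, so a 20-element
    input can never be read at position 21.
    """
    if len(criteria) != EXPECTED_FIELDS:
        raise ValueError(
            f"template has {len(criteria)} criteria, expected {EXPECTED_FIELDS}"
        )
    if len(inputs) != len(criteria):
        raise ValueError(
            f"got {len(inputs)} inputs for {len(criteria)} criteria"
        )
    return all(c == WILDCARD or c == v for c, v in zip(criteria, inputs))
```

With a check like this, a 20-element input is rejected with an error that can be handled, instead of being read past its end.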
In addition, the automated tests only covered the case actually in use and not the general case, and there was no staged rollout, which would have helped a lot in this scenario.
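The difference between testing the used case and the general case can be sketched as follows. Again this is an illustration under assumptions, not the real test suite: `evaluate` is a deliberately naive matcher without a length check, and `fields_that_fail` is a hypothetical test helper that probes each of the 21 fields in turn with a non-wildcard value.

```python
from typing import List, Sequence

TEMPLATE_FIELDS = 21  # parameters the template defines
WILDCARD = "*"        # hypothetical marker meaning "match anything"


def evaluate(criteria: Sequence[str], inputs: Sequence[str]) -> bool:
    """Naive matcher with no length check (stands in for the buggy path)."""
    for i, criterion in enumerate(criteria):
        # Wildcards short-circuit, so inputs[i] is only read for real criteria.
        if criterion != WILDCARD and inputs[i] != criterion:
            return False
    return True


def fields_that_fail(inputs: Sequence[str]) -> List[int]:
    """Probe every field with a concrete criterion and report out-of-range reads."""
    failing = []
    for i in range(TEMPLATE_FIELDS):
        criteria = [WILDCARD] * TEMPLATE_FIELDS
        criteria[i] = "probe"
        try:
            evaluate(criteria, inputs)
        except IndexError:
            failing.append(i)
    return failing


inputs = [f"value{i}" for i in range(20)]  # only 20 values supplied

# The "used case": all wildcards, which never touches the missing element.
used_case_ok = evaluate([WILDCARD] * TEMPLATE_FIELDS, inputs)

# The general case: every field probed once.
broken = fields_that_fail(inputs)
```

Here `used_case_ok` is `True` while `broken` is `[20]`: the all-wildcard test passes, and only a test that puts a real value into the last field exposes the problem.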
In the second part of the document, CrowdStrike promises improvements and explains what they intend to change in the process or have already changed. That all sounds good. Unfortunately, it must also be said that this problem was absolutely avoidable. It is the result of several separate failures, and it could likely have been discovered and prevented at any one of them.