|
|
|
|
|
by nullwasamistake
2536 days ago
|
|
The main problem is that their Regex library doesn't have a recrusion limit. I'm honestly amazed they've been able to scale Lua scripts to the point they can use it as a global WAF. Knowing this, it may be easy to create attacks against their filters. My takeaway is that it's time to move to a custom solution using a more flexible language. A simple async watchdog on total rule execution time would have prevented this. When running tons of Regex rules I'm amazed they didn't have this |
|
For example my company (nowhere near the scale of Cloudflare) does progressive deployments. New code is deployed only to a handful machines first, and then as the hours pass and checks remain green it propagates to the rest of the server fleet. Full deployment takes 24 hours. We never had code breaking changes in production in the past 3 years. And before that, us breaking things was the most common occurence for production issues. Of course that's not the only thing we do, good test practices, code reviews etc.
The second thing, is separation of monitoring and production. If production going down takes down the monitoring systems too, you will have a very hard time figuring out what's wrong. Cloudflare says "We had difficulty accessing our own systems because of the outage". That sounds very bad.
I 'd wager there are many wrong things at play here other than "regex is hard". But I guess HN loves cloudflare way too much to ask the hard questions.