I think that is small-think. The technical solution is only part of the problem and scaling up all systems to meet the .1% case seldom makes sense. They were smart to save 5-10%.
Eh.... On the flip side, processing and storing some simple text forms should be able to handle 1000s of simultaneous users on one box.
So, probably like most software of this nature, the reason it's not scaling is simply because the people who made it probably weren't the greatest engineers on the block.
These are the same kinds of assumptions that lead engineers to think they can build a [any product] clone in a weekend. It's unlikely that the problem or constraints are nearly as simple as one may think.
Consider: single auth across all the state's services, external APIs, identity verification, address verification, employer ID verification, federal/military ID verification, income/tax verification, phone verification, bank account information, translation into multiple languages, accessibility features, etc. Also, there's probably a lot of legacy infrastructure and process.
Also, if "ability to burst to 10x normal filings per week that might happen once every 40 years" wasn't in the spec, I think they were right not to engineer for it.
Admittedly it's a value call. My thought is generally if it's a small incremental cost that greatly increases the robustness then you should go for it. But - sometimes the money or time just isn't there. I'm bothered more by the people not even wanting to have the discussion than by those who do a summary analysis and decide it's not worth it.
The 0.1% case happens. And if it’s going to seriously wreck lives when it happens then you should solve for it. Does Instagram need to handle the 0.1% case? No. But the unemployment website should.
So, probably like most software of this nature, the reason it's not scaling is simply because the people who made it probably weren't the greatest engineers on the block.