Hacker News
by luma 2270 days ago
Further, unemployment benefits are managed by the states, and those states are running web services which typically see a few hundred hits a day. They are now trying to process tens of thousands of new records each day, and at least in MI the service is absolutely not up to the task.

My wife managed to get her filing completed a little after 1am this morning. She was the only one of her 20 coworkers to successfully file; the rest are still trying to get the state web site to work today, while more people pile in.

These numbers are going to get much, much worse.

7 comments

Somewhere there is an architect saying "I told you so!" I can almost guarantee the requirement was to handle several hundred requests per day, an architect pointed out if we get deluged then we won't be able to handle it, so maybe they were able to get them to allow for one or two thousand requests per day.

Now of course we don't know what the architecture of this system is and what the deltas in cost would have been to allow this to scale-out more - but I do know that all too often the more robust solution giving you much greater protection and lower cost down the road is often discarded if it costs even just 5%-10% more. Then the day comes when the people making these decisions get caught flat-footed and they try to blame everyone but themselves. It doesn't always happen like this - but it happens a lot.

This reminds me of an old story about an engineer who took initiative and automated the accounts receivable process at his company, now they get paid 25% faster! He shows his boss and gets a promotion.

He decides to do it again, this time with accounts payable, and is promptly fired.

I think that is small-think. The technical solution is only part of the problem and scaling up all systems to meet the .1% case seldom makes sense. They were smart to save 5-10%.
Eh.... On the flip side, processing and storing some simple text forms should be able to handle 1000s of simultaneous users on one box.

So, probably like most software of this nature, the reason it's not scaling is simply because the people who made it probably weren't the greatest engineers on the block.

These are the same kinds of assumptions that lead engineers to think they can build a [any product] clone in a weekend. It's unlikely that the problem or constraints are nearly as simple as one may think.

Consider: single auth across all the state's services, external APIs, identity verification, address verification, employer ID verification, federal/military ID verification, income/tax verification, phone verification, bank account information, translation into multiple languages, accessibility features, etc. Also, there's probably a lot of legacy infrastructure and process.

Also, if "ability to burst to 10x normal filings per week that might happen once every 40 years" wasn't in the spec, I think they were right not to engineer for it.

Admittedly it's a value call. My thought is generally if it's a small incremental cost that greatly increases the robustness then you should go for it. But - sometimes the money or time just isn't there. I'm bothered more by the people not even wanting to have the discussion than by those who do a summary analysis and decide it's not worth it.
That's a fair point. My comment comes from being in too many meetings where people want Twitter scale for conference-room-sized user bases.

It sometimes borders on sealioning.

The 0.1% case happens. And if it’s going to seriously wreck lives when it happens then you should solve for it. Does Instagram need to handle the 0.1% case? No. But the unemployment website should.
Unemployment forms being delayed by a day or two to deal with poor queuing will not "wreck lives".
Yes but for every architect there's an antiarchitect saying YAGNI!!1
You spelled pragmatist wrong.
Wow, just found what my state (CO) is doing to help manage the influx. Talk about a low-tech workaround.

>IMPORTANT NOTICE: Because of the high volume of claims, we are asking that you help us help you and our greater community.

>If you need to file an unemployment claim and your last name begins with the letter A - M, file a claim on Sunday, Tuesday, Thursday, or after 12 noon on Saturday.

>If you need to file an unemployment claim and your last name begins with the letter N - Z, file a claim on Monday, Wednesday, Friday or before 12 noon on Saturday.
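
For anyone who wants the rule above in executable form, here is a minimal sketch. The function name `may_file_now` is made up for illustration, and two details are assumptions the notice does not settle: exactly 12:00 on Saturday is treated as "after noon", and names that do not start with A-Z are never blocked.

```python
from datetime import datetime

# datetime.weekday(): Monday=0 ... Sunday=6
A_TO_M_DAYS = {6, 1, 3}   # Sunday, Tuesday, Thursday (from the notice)
N_TO_Z_DAYS = {0, 2, 4}   # Monday, Wednesday, Friday (from the notice)
SATURDAY = 5

def may_file_now(last_name: str, now: datetime) -> bool:
    """Return True if the notice's schedule lets this person file at `now`."""
    initial = last_name.strip()[:1].upper()
    if not ("A" <= initial <= "Z"):
        return True  # assumption: the notice says nothing about these names
    first_half = initial <= "M"
    if now.weekday() == SATURDAY:
        after_noon = now.hour >= 12  # assumption: 12:00 sharp counts as "after"
        return after_noon if first_half else not after_noon
    return now.weekday() in (A_TO_M_DAYS if first_half else N_TO_Z_DAYS)
```

It only expresses the schedule; nothing in the notice says the site enforces it.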

Ooooh, like gas rationing in the 70s.
> Wow, just found what my state (CO) is doing to help manage the influx. Talk about a low-tech workaround.

> Ooooh, like gas rationing in the 70s.

I wasn't born back then, but I heard about that being based on license plates from the older guys at a few car meetups, and I had the same thought when I heard that on CPR.

Odd, but it could work if you have total compliance; let's see how that pans out.

Distributed load balancing!
well as long as it works...
As a developer I immediately thought of the power of queues. Twenty people trying to submit the same form at once does not work for everyone, but a queue processing one person at a time might allow the twenty people to submit within a short time. It is flattening the curve! If I was contracted to fix this ASAP, I would set up an nginx front-end proxy config that doesn't allow more than X sessions and suggest a time in the future when they could try again.
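
A rough sketch of that gate, written in Python rather than nginx config so the logic is easy to follow. `AdmissionGate`, its parameters, and the wait estimate are all hypothetical; in nginx itself the capping half would probably lean on the `limit_conn` directive, and the "try again at" hint would go out as a 503 with a `Retry-After` header.

```python
import threading

class AdmissionGate:
    """Admit at most `max_sessions` at once; tell everyone else when to retry."""

    def __init__(self, max_sessions: int, avg_session_seconds: int):
        self._lock = threading.Lock()
        self._max = max_sessions
        self._avg = avg_session_seconds   # assumed average time to finish the form
        self._active = 0
        self._waiting = 0                 # people turned away and not yet served

    def try_enter(self):
        """Return (200, 0) if admitted, else (503, suggested_wait_seconds)."""
        with self._lock:
            if self._active < self._max:
                self._active += 1
                return 200, 0
            self._waiting += 1
            # Crude estimate: each batch of `max_sessions` takes one average session.
            batches_ahead = (self._waiting + self._max - 1) // self._max
            return 503, batches_ahead * self._avg

    def leave(self):
        """Call when a session finishes, which frees one slot."""
        with self._lock:
            self._active = max(0, self._active - 1)
            self._waiting = max(0, self._waiting - 1)
```

Staggering the suggested wait is the point: if everyone is told "try again in a minute", they all come back in the same minute.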
Having worked on this type of application in the past they should find a new company to work with if they can't handle this traffic. We were handling hundreds of requests per second with ease 10 years ago. That was with MySQL and the app running on the same server.

It doesn't take many resources to show the user a form, validate it, and save to a DB.

A bunch of armchair developers seem to have been summoned to tell the Federal government how to handle form submissions for an extremely security and privacy intense application using their fancy modern techniques.

You are talking about comparing a basic web form with an application for unemployment benefits, which must go into a federal tax database and be processed using what I assume is a garbage mainframe system.

It not only needs to be validated, it needs to securely store records, be able to compare them, and hook up to the system that handles payments, etc.

They can't just circumvent it and dump it into some silly Amazon or MySQL database and call it a day. That would require the employees to basically copy and paste that data into the actual warehouse, and considering they have 3+ million to go through as it is, making it easy for them to process is just as important as allowing people to submit.

For the time being the correct response is a queue gate.

Stop being silly.

Yep, USDS and 18F folks would have to agree with you here. The arcane crap that we have to deal with in payment and government information systems is beyond frustrating and makes it extremely tough. I read an article about having to fix a multi-decade Cisco router bug to get CI/CD and automated deployments working after USDS / 18F started setting up faster deployments but still needed to figure out how to deal with legacy stateful DB connections.

The reality of government paperwork systems on the backend is much, much closer to this hell and is part of why so many like myself ran screaming from public sector because when you see so many peers doing so well at FAANGS, why would you subject yourself to something that resists change and wants to keep it the same way? https://www.washingtonpost.com/news/federal-eye/wp/2014/03/2...

The point is that backend pain shouldn't stop you from accepting it on the front end and putting it into a queue. Making the problem of getting the application through backend systems the states' to deal with, not the applicants'.
So are you applying for the 44k/yr job to fix it? No, most of us are not.
> The point is that backend pain shouldn't stop you from accepting it on the front end and putting it into a queue.

What if the backend rejects the form? The user's already moved on before their form made it through the queue. So then you're stuck re-implementing all the validations the backend needs in order to give the user feedback (which you may not even be able to do) or trying to get the user to come back later to try again.

> Making the problem of getting the application through backend systems the states' to deal with, not the applicants'.

Reducing permanent staff involved in processing applications is probably one of the main reasons the automated system was built in the first place. If they still have to do that, then you might as well just replace the frontend with a printable PDF.

I also completely agree with this sentiment. A gov't form could be an unsightly complex beast that can't be re-architected, sometimes, ever.
They could, they just don't want to pay for it. The government has no interest in being known for easily handling a huge spike of traffic during a crisis. They can just take the lower road, get by with less, and say 'try again later'. There are no repercussions here because it's the government.

Hence mainframe maintainers should really move to charging $1 million/year in a decade or two.

They aren't choosing to have crap infrastructure, their infrastructure is intentionally defunded as part of a political campaign to engender distrust in government functions and increase privatization. Government is incompetent because if it is, it's easy to justify selling off the country to the incredibly wealthy so they can get wealthier.
So, legitimately, how do we make it so the government does have repercussions? I see a lot of people making jokes about guillotines and nooses, but is there no better way?
Unemployment benefits are handled by the states, not by the federal government.
In Canada, it isn't. The parent's username suggests they're talking about Canada.

https://www.canada.ca/en/services/benefits/ei.html

In the US, though, it's by-state.

GP was talking about GGP's suggestion, which was about a US state (Michigan).
None of those comments really help explain why there are bottlenecks.

If the form has to go into a mainframe, well, just set up an asynchronous queue.

Validation is the problem. If someone thinks they’ve successfully applied, rejecting them asynchronously is often worse than not letting them apply in the first place.
The government generally has no problem with rejecting filed claims after review.
It's "armchair" to say "you get what you deserve if your entire system depends on garbage"?

...okay, whatever you say.

I called it garbage, but really neither of us know.
> It doesn't take many resources to show the user a form, validate it, and save to a DB.

I bet that's what the previous developers thought.

What happens if you need to validate the form data against an external service that's coming and going due to the traffic spike?

What if your database is rejecting transactions occasionally?

What happens when your backup process locks all the database tables?

How do you reject duplicate form submissions from people hammering the submit button? Do you query the database to find previous submissions?

What happens when a scriptkiddie decides it'd be fun to DDOS the site? How do you differentiate good traffic from bad traffic?

What do you do when the cloud provider runs out of space and you can't scale up any more (https://news.ycombinator.com/item?id=22691926)?

You need to think of all of these things and many, many more to run a robust online service that can handle spikes hundreds of times bigger than the usual level. It's really not straightforward or simple.
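
To make one of those questions concrete, here is a minimal sketch of rejecting duplicate submissions by fingerprinting the form contents. Everything here is hypothetical (`SubmissionStore`, the field names in the usage note, the confirmation format), and the in-memory dict stands in for what would more plausibly be a unique constraint in the real database.

```python
import hashlib
import json
from typing import Dict, Tuple

class SubmissionStore:
    """Remember a fingerprint of each accepted form so repeats are harmless."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}  # fingerprint -> confirmation id

    @staticmethod
    def fingerprint(form: dict) -> str:
        # Sort keys so the same answers hash the same regardless of field order.
        canonical = json.dumps(form, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def submit(self, form: dict) -> Tuple[str, bool]:
        """Return (confirmation_id, is_new). A repeat gets the original id back."""
        key = self.fingerprint(form)
        if key in self._seen:
            return self._seen[key], False
        confirmation = f"C{len(self._seen) + 1:06d}"
        self._seen[key] = confirmation
        return confirmation, True
```

Someone hammering the submit button with `{"name": "x", "last4": "1234"}` gets the same confirmation id every time instead of a pile of duplicate rows. It does nothing for the other items on the list.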

Or it's a much simpler problem that they didn't make it semi-fast because it didn't need to be semi-fast.

When "hundreds of times the usual level" is still only 50 page loads per second, and 10 milliseconds of CPU per page would be extreme overkill for anything written in a reasonable way, it actually is straightforward.
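
The arithmetic behind that claim, as a tiny sketch (the function name is made up; the 50 loads per second and 10 ms figures are the ones quoted above):

```python
def cores_needed(requests_per_second: float, cpu_ms_per_request: float) -> float:
    """CPU-seconds of work arriving per wall-clock second, i.e. busy cores."""
    return requests_per_second * cpu_ms_per_request / 1000.0

# 50 page loads/s at 10 ms of CPU each keeps half of one core busy.
print(cores_needed(50, 10))
```

This counts CPU time only; time spent waiting on disks or other services is a separate budget.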

It's not just CPU though, but IO - I've worked with horrible enterprise systems before that had response times measured in seconds.
Even 5 seconds will work if the actions can overlap. If it can't do things in parallel then we have issues much more fundamental than "performance", and there's no defending it as a competent system.

(That is not to say it's necessarily the devs' fault.)
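
A quick sketch of why overlap matters, using Little's law (requests in flight = arrival rate x time per request). The function names are illustrative; the 50 per second figure is the one quoted earlier in the thread and the 5 seconds is from this comment.

```python
def in_flight(arrivals_per_second: float, seconds_per_request: float) -> float:
    """Little's law: average number of requests being handled at once."""
    return arrivals_per_second * seconds_per_request

def serial_throughput(seconds_per_request: float) -> float:
    """Best-case requests per second if only one request runs at a time."""
    return 1.0 / seconds_per_request

print(in_flight(50, 5))        # requests that must overlap to keep up
print(serial_throughput(5))    # what a strictly one-at-a-time system manages
```

With overlap, 50 per second at 5 seconds each means roughly 250 requests in progress at once, which is an ordinary number of open connections. Without overlap the ceiling is 0.2 per second, no matter how fast the hardware is.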

All of those items are manageable. Some are simple setup or programming errors, some require a bit of added complexity but are normal in modern web apps.
Completely agree with the sentiment. I think most often it is an inadequate default configuration that bottlenecks somewhere and never got tested with more than a handful of users at a time. Going to a hundred highlights some bugs, going to 1000 others. On the other hand, I have worked on a project for USDA, and they had 10 year old servers running 15 year old software and did not allow any system administration, while the system admins were some unknown government employees, completely inaccessible.

I have had to build a Python distribution completely in home/user-space in some cases, working on conservatively managed servers.

Usually it's not so much the form that causes things to fall down but some validation step that they are trying to do synchronously, that might have to access an IBM mainframe, and things time out. When you're getting a few an hour, it's not a big deal.
At this point introducing a new company could cause more problems than it solves, and I think it's understandable to not be prepared for a volume of jobless claims that is almost an order of magnitude more than at any point in US history.
Put the web form (plain static assets JS/CSS/HTML) on a globally accessible CDN. Then use SQS intake for each unemployment application form. Then firehose it out, wherever it needs to go, at a rate which you can realistically deal with.
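
A minimal sketch of the intake-then-drain shape described above. An in-memory `queue.Queue` stands in for a durable queue such as SQS, and `accept`, `drain`, and `send_to_backend` are hypothetical names; a real version would need persistence, retries, and a way to report backend rejections to the applicant.

```python
import queue
import time

intake = queue.Queue()  # in-memory stand-in for a durable queue (assumption)

def accept(form: dict) -> int:
    """Front end: enqueue and return right away with the current queue depth."""
    intake.put(form)
    return intake.qsize()

def drain(send_to_backend, max_per_second: float, budget: int) -> int:
    """Forward up to `budget` forms, no faster than `max_per_second`."""
    interval = 1.0 / max_per_second
    sent = 0
    while sent < budget:
        try:
            form = intake.get_nowait()
        except queue.Empty:
            break  # nothing left to forward
        send_to_backend(form)
        sent += 1
        time.sleep(interval)  # pace the slow backend
    return sent
```

The front end only ever calls `accept`, so its speed no longer depends on the backend; `drain` runs separately at whatever rate the backend can take.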

Queuing access to the form itself and telling someone to wake up at 4:52 AM so they can then merely access the static assets is a less-than-desirable user experience.

It is more desirable than a 504, and it's the first thing I would do in 15 minutes with zero context. If I can get more context, of course something like your solution is more desirable, depending on the issue. It would take some time to figure out whether it is necessary to bring in AWS or just a database connection pooler, or whatever.
Even typeform/Google Forms would be better suited for the task.
>Even typeform/Google Forms would be better suited for the task.

And now you've given a private company access to market-moving unemployment data. And a million other issues, especially legal ones.

The technology part in and of itself isn't that difficult, it's all of the constraints (and, often, mountains of laws) that are the bigger issue.

In related news on Queue-it: https://tech.eu/brief/queue-it-funding/
Solid company
The matching UK system has a (huge) queue in it: https://www.computerweekly.com/news/252480546/Huge-queues-fo...
Is there a human factor in processing these?
Ocado (the IaaS for online supermarkets company, and, in the UK, online-only supermarket itself) has done this in response to the increased demand, and makes you wait in a 'virtual queue' (virtual relative to what in America you call a 'line-up', but we call a 'queue', at a physical supermarket) before you can place or edit your order.
> If I was contracted to fix this ASAP

You’re assuming that the people who built it in the first place (or the people that may or may not be contracted to fix it later) know or care. Remember, this is government contracting we’re talking about - lowest bidder wins. How do you win the lowest bid? By doing it as cheap and quick as you can. That means hiring inexperienced/cheap developers who can build something that looks like it will work for far less money than you can build something that actually will.

Secondhand story:

I briefly interned with a state judiciary's IT department around 2015 and got to get lunch with the CIO. He described to me how most court filings in the state had been manual prior to 2008, when the mortgage crisis hit and judges in the tax courts got _slammed_ with cases surrounding foreclosures. This, in turn, drove a need to develop a platform to automate the process of filing a case. It started with the tax court and gradually expanded to automate filings for other court divisions as well (e.g. Family, Civil).

I wouldn't be shocked if the revelation of "holy shit no one can file for unemployment" drove such an investment. I honestly think the next generation of politicians should take a page from product owners by isolating some shitty process that they'd have jurisdiction over, and finding some way to automate it. Bonus points if it's right before a watershed moment: imagine if someone had considered the problem you described prior to the coronavirus epidemic.

I mean, you can tell the numbers are extremely inaccurate via just a simple, cursory glance at the report.

Pennsylvania reported 378k claims.

California reported... 186k claims.

Yesterday, California's governor said they've received more than 1 million claims since March 13th (so, over a 12 day period from the 13th to the 25th). This DOL report covers March 14th through the 21st.

Are we to believe that the remaining 800k+ people all filed on March 13th, or March 22nd through the 25th?

But there's more. Utah reported an increase of only 9 claims compared to the week before. They went from 1,305 to 1,314.

Then, New York, where more than half of Covid-19 cases in the US are, reported only 80k?
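
The gap can be checked directly from the figures quoted above. The variable names are illustrative, and the governor's "more than 1 million" is treated as exactly 1,000,000, so the result is a lower bound.

```python
ca_governor_total = 1_000_000   # "more than 1 million" since March 13th (lower bound)
ca_dol_week = 186_000           # California in the DOL report, March 14th-21st

unaccounted = ca_governor_total - ca_dol_week
share_in_report = ca_dol_week / ca_governor_total

utah_increase = 1_314 - 1_305

print(unaccounted)              # claims the report week does not cover
print(f"{share_in_report:.1%}") # portion of the governor's figure in the report
print(utah_increase)
```

So at least 814,000 California claims, over 80% of the governor's figure, would have to fall outside the report week.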

As of 3/21. Lockdown in California began in the evening of 3/19. Still not enough time for the numbers to react.
"Are we to believe that the remaining 800k+ people all filed on March 13th, or March 22nd through the 25th?"

I could believe it over the 22-25 stretch.

Especially since "filed" may here mean when the paperwork was finally able to snake its way into some particular system.
They also tend to have some... interesting features dictated by the state UI office. When I'd applied in Wisconsin about 6 years back the site stopped accepting form submissions outside business hours.

I assume some less computer-literate higher up thought that someone needed to be around to actually accept the form, same as in-person submissions.

I confirmed that the state's unemployment website has slowed to a crawl (if it is working at all).