by pocketheyman 4285 days ago
Kind of interesting: according to the case file, the PACER records were being pulled en masse during normal court hours (typically when courts are also accessing the PACER database). A user noticed that PACER was running slowly and notified PACER. It looks like they investigated, shut the PACER system down, and were able to determine that the requests were coming from an Amazon Web Hosting account linked to Swartz.

I find this interesting because it wasn't some flag on the PACER system screaming "HEY SOMEONE IS DOWNLOADING THESE EVERY TWO SECONDS" but instead was noticed because some law clerk was irritated at how slow the server was at responding.


This is similar to how many breaches and DDoS attacks are discovered. Lots of companies have absolutely no controls to detect the most basic of flooding or spidering behavior.
First off: Totally true.

Secondly: Devil's advocate, but it is a "hard problem." It is easy to look for behaviour on the system, but it is very hard to look for patterns of behaviour.

I mean, let's say that some of your users are normal court clerks. It wouldn't be unusual to see them sit around and pull tons of records all day, every day. So how do you tell normal requests en masse from unusual requests en masse?

If I were in charge of protecting such a system, I wouldn't even attempt to detect this (too hard). Instead, I would make it impossible to get records sequentially (e.g. 1, 2, 3...9999999) by giving each record a unique, randomly generated token (a UUID/GUID).

So in order for someone to obtain every single record, they would either need to conduct a "real" break-in and steal the files, or search on every possible criterion (which, for them, becomes a huge hassle/problem).
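
A rough sketch of the random-token idea, in Python. `RecordStore` and its methods are made-up names for illustration and have nothing to do with how PACER actually works. It only shows that once records are keyed by a random UUID, guessing "1, 2, 3..." gets you nothing.

```python
import uuid


class RecordStore:
    """Toy store that addresses records by random token, not sequence number."""

    def __init__(self):
        self._by_token = {}

    def add(self, document):
        # uuid4() is a randomly generated UUID, so tokens carry no ordering.
        token = str(uuid.uuid4())
        self._by_token[token] = document
        return token

    def get(self, token):
        # Unknown tokens (including guessed sequential IDs) return None.
        return self._by_token.get(token)


store = RecordStore()
tokens = [store.add(f"filing #{n}") for n in range(1, 4)]

print(store.get(tokens[0]))  # the first filing, fetched by its token
print(store.get("1"))        # a sequential guess finds nothing
```

The search interface still has to hand out tokens for matching records, so this only forces a bulk downloader through search. It doesn't replace access control.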

PS - Most DDoS attacks these days are against layer 3 (network), since a layer 3 attack is far harder to defeat (it can literally crash a lot of network hardware). While layer 7 (software) DDoS attacks still exist, they're often conducted by less formidable adversaries and they're much easier to stop. For example, you can return a JavaScript redirect instead of the normal page: most browser users won't notice, but it will defeat a targeted attack until the attackers re-target (and you could rename it every 10 minutes).

So, here's a story I heard recently.

The person involved wanted to create a local archive of records. An index of material was possible to obtain, but rapid sequential requests resulted in an IP block preventing further access.

Modestly restructuring the requests, issuing them in random sequence with a significant (several minutes) and randomized delay between them, eventually succeeded in retrieving the material.

If that had failed, a distributed set of requests could have been attempted.

When I've faced high (to the point of service-degrading) levels of traffic, I've found that tools which let me aggregate requests by similar attributes, including requests coming from a defined network space (CIDR or ASN), can be quite useful. Reading such patterns just from eyeball scans of logs is pretty bloody difficult, and tools to assist in this are ... poorly developed.
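
For what it's worth, a minimal sketch of the CIDR part of that aggregation using Python's standard `ipaddress` module. The function name, the /24 prefix, and the sample addresses (documentation ranges) are all assumptions for illustration. ASN grouping would need an external IP-to-ASN data source and isn't shown.

```python
import ipaddress
from collections import Counter


def count_by_network(ips, prefix_len=24):
    """Count requests per enclosing network (IPv4, fixed prefix length)."""
    counts = Counter()
    for ip in ips:
        # strict=False masks off the host bits, e.g. 203.0.113.5/24 -> 203.0.113.0/24
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        counts[str(net)] += 1
    return counts


# Hypothetical client addresses pulled from an access log.
sample_ips = ["203.0.113.5", "203.0.113.77", "203.0.113.200", "198.51.100.9"]
top = count_by_network(sample_ips).most_common(1)
print(top)
```

Three hosts that each look quiet on their own show up as one busy /24 once grouped, which is the sort of thing that is hard to see by eyeballing logs.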

>Reading such patterns just from eyeball scans of logs is pretty bloody difficult, and tools to assist in this are ... poorly developed.

There's some enterprise software out there designed for use cases like this, but it's typically very expensive. There are also other issues, like the storage requirements of fully logging request headers and bodies if you really want to see the big picture.

Simple IP rate limiting will stop the majority of would-be scrapers/scanners in their tracks though, especially if there's so much material that a scrape could take days or weeks to finish with a random delay of 3 or more minutes per request.

I do network security for a large company, so I'm not talking completely out of my ass when I say you can have alerting in place to at least detect the most obvious behavior. There are also tools and even entirely inline appliances (look at RSA Silvertail) designed specifically to look for automated behavior against a web server.

Someone clever enough will be able to get around it, but it's really not hard to detect automated scanning or scraping behavior, especially if they're not delaying their requests in any way.

Stopping a layer 3/4 DDoS is another matter entirely. They're quite easy to detect but quite hard to mitigate yourself; you need your upstream provider to mitigate it for you.

Also, using JavaScript interstitials against layer 7 attacks (like Cloudflare and Incapsula do in their default mode) will stop script kiddies, but they're not hard to get around if you know what you're doing. So you'd either have to, as you say, change the method every few minutes... or just use a CAPTCHA.

Do you find it credible that one request every 2-3 seconds could create a noticeable load?
It depends on the pattern of the requests. If they're requesting different URLs each time, for example, then it could go under the radar for a period of time. If it's a resource where normally someone would not request more than ~15 articles in an hour (like what it might be for PACER), you can have alerting for when more than 50 articles are accessed in an hour.
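
Something like the following is what I mean by that kind of alert. It's a sketch only: the event format, account names, and function name are made up, and it uses fixed clock-hour buckets, so a burst straddling an hour boundary could slip through. The 50-per-hour threshold is the example figure from above.

```python
from collections import defaultdict

ALERT_THRESHOLD = 50  # distinct articles per account per hour (example figure)


def users_over_threshold(events, threshold=ALERT_THRESHOLD):
    """events: iterable of (user, epoch_seconds, article_id) tuples."""
    seen = defaultdict(set)  # (user, hour bucket) -> distinct article ids
    for user, ts, article in events:
        seen[(user, int(ts // 3600))].add(article)
    return sorted({user for (user, _), arts in seen.items() if len(arts) > threshold})


# One request every 3 seconds for 5 minutes: 100 distinct documents.
scraper = [("acct-a", i * 3, f"doc-{i}") for i in range(100)]
# A person pulling 15 documents over roughly an hour.
clerk = [("acct-b", i * 240, f"doc-{i}") for i in range(15)]

flagged = users_over_threshold(scraper + clerk)
print(flagged)  # ['acct-a']
```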

Generally speaking, though, limiting your request rate to that degree will help you evade detection.

That really depends on the system, its provisioning, and how typical traffic patterns correspond to storage.

Some systems respond far better to random queries, hitting data in different places, often on separate spindles or storage devices.

Others prefer sequential requests, avoiding random seeks across heads.

And there are systems whose performance degrades spectacularly even under light load.

He was only making 1 request per 3 seconds. Must have been a really slow system.
"A user" unnamed "noticed" and complained to another unnamed person, etc, etc.

Yeah. Sure.

It probably wasn't Swartz's fault the servers were slow, but even if the slowness was unrelated he probably popped up in one of the queries they used to diagnose the system.

Or it was a conspiracy. Parallel construction has me freaked out too. I just don't think that it's the most likely explanation.

In light of the breezily incurious language about the person who supposedly "noticed" the supposed "slowness," I'd say it's one of those casual lies told to make a case look better and neater, and maybe hide someone who would be a troublesome witness.