Self-Hosted JA4 to combat AI bots

	Self-Hosted JA4 to combat AI bots (blog.miloslavhomer.cz)
	3 points by ArcHound 7 days ago

2 comments

mmarian 7 days ago

As a learning exercise - great. As an actual mitigation technique - even JA4s can be rotated pretty easily these days by motivated actors. Rotation patterns might still work (for now :D)

arbol 6 days ago

In combination with other signals, JA4s are useful. You learn to spot obviously incorrect ones because Chrome always looks different from Safari, which looks different from Firefox. Captcha solvers have their own unique JA4s based on whatever scripting language they're using (Python / Rust / Node). As another commenter pointed out, browsers have unique sets of headers like priority, DNT. So yes, it won't stop dedicated attackers, but it is worth implementing as a coarse filter.

mmarian 6 days ago

If someone invests time/money in using a captcha solver, they're already dedicated enough and will easily get around a JA4 signature block.

Maybe there's some one-off exercise where this is useful, but it's very rare and I've seen people waste so much time with the whack a mole JA4 block just because they like the intellectual challenge.

arbol 6 days ago

It's not hard to set up JA4 monitoring and I think it's valid as a coarse filter. There are various plugins for nginx/node.

> I've seen people waste so much time with the whack a mole JA4 block just because they like the intellectual challenge

You just store the JA4 on requests and build a catalogue of known JA4s over time using statistics. Outlier JA4s you treat with suspicion by default and challenge. It shouldn't be manual.
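
A minimal sketch of that idea, assuming the JA4 string is already logged per request. The function names, the placeholder fingerprint strings and the 1% threshold are all illustrative assumptions, not taken from any particular plugin:

```python
from collections import Counter


def build_catalogue(ja4_log):
    """Count how often each JA4 fingerprint appears in the request log."""
    return Counter(ja4_log)


def is_outlier(ja4, catalogue, min_share=0.01):
    """Return True when a fingerprint's share of traffic is below min_share.

    min_share is an assumed threshold; tune it against your own traffic.
    """
    total = sum(catalogue.values())
    if total == 0:
        return True  # nothing learned yet, so challenge by default
    return catalogue.get(ja4, 0) / total < min_share


# Placeholder fingerprints, not real JA4 values.
log = ["fp_common"] * 199 + ["fp_rare"]
catalogue = build_catalogue(log)
suspicious = [fp for fp in catalogue if is_outlier(fp, catalogue)]
```

Anything in `suspicious`, plus any fingerprint never seen before, would get a challenge rather than an outright block.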

> If someone invests time/money in using a captcha solver, they're already dedicated enough and will easily get around a JA4 signature block.

Obviously, not for the regular user but captcha solvers are also blockable:

- proxy detection
- detection by running DNS server and capturing real IP over UDP request
- abnormal TLS handshake latency
- repeat behaviour at scale
- rendering captcha on a fake origin instead of in the real page

ArcHound 7 days ago

This is the sad conclusion of the next part. JA4 is a great supplement, it can squeeze some additional info, but for a motivated attacker it can be avoided.

Now the question of how motivated noisy AI scrapers are is still open. Even a solution that cuts down 50 percent of the dumbest scraping attempts will still provide much-needed relief to a struggling site.

mmarian 7 days ago

I'm curious, which site struggles are you envisaging? In my experience, JA4 is used as a hammer for which the nail must be found; simpler solutions oftentimes work better.

ArcHound 7 days ago

I think we agree that JA4 is situational. It really saved me when investigating a credential stuffing attack: random logins with a random chance of success, spread across many ASNs, all with the same fingerprint.
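
For illustration, that kind of pattern can be surfaced by grouping login attempts by fingerprint and counting distinct ASNs. This is a rough sketch rather than the query used in that investigation; the event shape, the `min_asns` threshold and the fingerprint strings are assumptions, and the ASNs come from the range reserved for documentation:

```python
from collections import defaultdict


def fingerprints_spanning_asns(events, min_asns=3):
    """Map each JA4 seen from at least min_asns distinct ASNs to that ASN count.

    events is an iterable of (ja4, asn) pairs, one per login attempt.
    """
    asns_by_ja4 = defaultdict(set)
    for ja4, asn in events:
        asns_by_ja4[ja4].add(asn)
    return {
        ja4: len(asns)
        for ja4, asns in asns_by_ja4.items()
        if len(asns) >= min_asns
    }


# Placeholder fingerprints and documentation ASNs.
attempts = [
    ("fp_stuffer", 64496),
    ("fp_stuffer", 64497),
    ("fp_stuffer", 64498),
    ("fp_stuffer", 64496),
    ("fp_browser", 64499),
]
spread = fingerprints_spanning_asns(attempts)
```

A single fingerprint showing up from many unrelated networks is the signal here; one fingerprint from one ASN is just a normal user.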

From my experience, there are all kinds of levels of bots. Add them all together and they can produce a ridiculous load on a site (especially a fragile one that you have to secure anyway). So I look at the volume, trying to block anything stupid I can get away with.

It is a game of whack-a-mole. It also can cut down the overall traffic to a fraction of the original, which has tangible infra cost benefits.

And yes, captcha works better in a lot of cases. Fortunately I'm not selling JA4, I'm just curious.

And yes, IP rate limits and ASN checks work really well in plenty of cases. Side note: I got a high-throughput free offline asn-checker too! https://blog.miloslavhomer.cz/asn-check/

mmarian 7 days ago

I agree JA4 is situational; but the # of use cases is smaller than most people think. Like you said, Captcha works better; would've stopped the credential stuffing. Managed DDoS services (Cloudflare et al) + rate limits are better at DDoS.

Cool ASN project, but doesn't IPInfo already offer this for free: https://ipinfo.io/lite ?

ArcHound 7 days ago

Back in the day I couldn't find a downloadable DB for offline checks, which is very much needed when looking at approx 10k different IPs. Even with an offline DB I might need to create this tree structure so that I can process the data fast.
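
To show why a lookup structure matters at that volume, here is a small sketch of an offline IP-to-ASN lookup. Note that it swaps the tree for a sorted list with binary search, which is simpler and still O(log n) per lookup. It assumes IPv4 only and non-overlapping prefixes; the `AsnTable` name and the data are made up, using documentation address ranges and documentation ASNs:

```python
import bisect
import ipaddress


class AsnTable:
    """Offline IPv4 -> ASN lookup over non-overlapping prefixes."""

    def __init__(self, prefixes):
        # prefixes: iterable of (cidr_string, asn) pairs
        rows = []
        for cidr, asn in prefixes:
            net = ipaddress.ip_network(cidr)
            rows.append((int(net.network_address), int(net.broadcast_address), asn))
        rows.sort()
        self._rows = rows
        self._starts = [row[0] for row in rows]

    def lookup(self, ip):
        """Return the ASN whose prefix contains ip, or None if there is none."""
        addr = int(ipaddress.ip_address(ip))
        # Find the last prefix that starts at or before addr.
        i = bisect.bisect_right(self._starts, addr) - 1
        if i < 0:
            return None
        _start, end, asn = self._rows[i]
        return asn if addr <= end else None


table = AsnTable([("198.51.100.0/24", 64497), ("192.0.2.0/24", 64496)])
```

With the table built once from a downloaded DB, checking 10k IPs is 10k calls to `table.lookup(ip)` with no network round trips.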

Bender 7 days ago

This is a good write-up. Is your blog running JA4 right now?

ArcHound 7 days ago

Hello again! Yes it is. If you have an exotic client, I'm here for it :D

Bender 7 days ago

Nice. I was more curious about the clients using HTTP/2: what percentage of them is JA4 detecting as bots that spoof all the other headers a browser sends? That is the missing piece in my blog write-up as I don't do SSL fingerprinting. I am trying to see what percentage are getting through my very crude methods.

ArcHound 5 days ago

ok, so I've parsed some logs. I do see the ALPNs pointing to http2, but I don't capture all of the headers. The only thing I capture is the user-agent, which is the major spoof anyway.

Now, to differentiate between spoofed and non-spoofed headers, I need to check the "valid" JA4 signature for a given browser and then proclaim that the rest of them are wrong. The "valid" JA4 signature can be observed, but I've found that sometimes browsers tweak their handshake a bit, so it's not 100% consistent.
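
Roughly, that check could look like the sketch below. It keeps a set of observed JA4s per browser family rather than a single value, since the handshake is not 100% consistent. The family detection is deliberately crude, and the `KNOWN_JA4` contents are placeholders, not real fingerprints:

```python
def claimed_family(user_agent):
    """Guess the browser family a user-agent claims to be (crude substring check)."""
    ua = user_agent.lower()
    if "firefox/" in ua:
        return "firefox"
    if "chrome/" in ua:  # checked before Safari: Chrome UAs also contain "safari/"
        return "chrome"
    if "safari/" in ua:
        return "safari"
    return None


def verdict(user_agent, ja4, known):
    """Return 'match', 'mismatch', or 'unknown' for a (user-agent, JA4) pair."""
    family = claimed_family(user_agent)
    if family is None or family not in known:
        return "unknown"
    return "match" if ja4 in known[family] else "mismatch"


# Placeholder values; real entries would come from observed browser traffic.
KNOWN_JA4 = {
    "chrome": {"fp_chrome_a", "fp_chrome_b"},
    "firefox": {"fp_firefox_a"},
}
```

A "mismatch" means the user-agent claims a browser whose observed fingerprints don't include this JA4, which is the spoofing case; "unknown" just means there's no baseline yet.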

The JA4 DB was recently taken down. I've requested full access, but got no response (as expected). There might be some issues in getting those valid headers for the browsers, since the hardware and software vary a lot (PC, Mac, Android, iPhone, in all kinds of versions and browsers).

I was hoping for a quick win to share, but it doesn't look that way and I'll have to do it properly. That should be my next post on JA4.

As a quick note, approx 30% of traffic claims to use http2 and approx 60% of that traffic has a non-bot user-agent (you know, along the lines of "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/149.0.7827.102 Safari/537.36"). I suspect the majority of those are spoofed, as I know how many readers I have on my blog.

ArcHound 7 days ago

I'll get back to you on this, I'll need to parse some logs. I should have at least the ALPNs.
