by portInit 1199 days ago
It's a tricky question. Part of it is looking at APIs as the main source of data for scheduled queries/data feeds.

Crul operates somewhat like a text-only browser when interacting with a single page at a time, but things get more challenging once you expand and open up multiple tabs. We have the concept of domain policies, which let you control how quickly or slowly you access a given domain. There are also some Puppeteer-level options that could be relevant, including a headful toggle.
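
To make the domain policy idea concrete, here is a minimal sketch of per-domain throttling in Python. It is not Crul's implementation: the `DomainThrottle` class, `set_policy`, and the delay values are hypothetical names chosen for illustration, and a real policy would likely cover more than a fixed delay.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Hypothetical helper: enforce a minimum delay between hits to one domain."""

    def __init__(self, default_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.default_delay = default_delay  # seconds, used when no policy is set
        self.delays = {}                    # domain -> delay override in seconds
        self.last_hit = {}                  # domain -> time of the latest request
        self.clock = clock                  # injectable so the logic can be tested
        self.sleep = sleep

    def set_policy(self, domain, delay):
        """Register a per-domain delay (an assumed, simplified 'policy')."""
        self.delays[domain] = delay

    def wait(self, url):
        """Block until the URL's domain may be hit again; return seconds waited."""
        domain = urlparse(url).netloc
        delay = self.delays.get(domain, self.default_delay)
        now = self.clock()
        last = self.last_hit.get(domain)
        pause = 0.0 if last is None else max(0.0, last + delay - now)
        if pause > 0:
            self.sleep(pause)
        # Record when this request is actually allowed to go out.
        self.last_hit[domain] = now + pause
        return pause
```

Calling `throttle.wait(url)` before each fetch slows down access to a strict domain while leaving other domains on the default delay.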

We have not invested too much time into this yet as we focused on getting the core functionality working. We think there are use cases (particularly with APIs) that don't run into this problem, but if it comes up more often we'll come up with some options.


In my experience, bot detection is moving more towards looking at network activity and IP reputation. Using a proxy will go a long way: it's easy to implement, and the cost can easily be passed on to the customer.
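
As a rough illustration of how little code the proxy part takes, here is a sketch using only Python's standard library. The proxy address is a placeholder and `build_proxied_opener` is a hypothetical helper name; a commercial proxy service would normally also need credentials and rotation, which are left out.

```python
import urllib.request


def build_proxied_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS traffic through proxy_url."""
    # proxy_url is an assumption, e.g. "http://proxy.example:8080"
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Usage (performs a real network request, so it is left commented out):
# opener = build_proxied_opener("http://proxy.example:8080")
# html = opener.open("https://example.com", timeout=10).read()
```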
In my experience, the thing that makes me actually have to lug out my ol' headless browser is that a lot of websites are starting to implement obfuscated cryptographic puzzles in their JS, making it really difficult to emulate without just running it in a browser.
Starting to? That's been going on for at least the last 3-4 years. Akamai tends to rely on that more than Cloudflare does, and in my recent experience Akamai is winning that game: browser emulation alone, headless or not, isn't going to bypass Akamai. Recently I have seen browser emulation not be effective at all for bypassing bot detection.

The only kind of emulation where I have seen success is mobile and in that case you need to run a device emulator.

In my experience it's starting to move towards "use a headless browser with patched attributes" or nothing else will work.

edit: I have quite a bit of experience with Akamai and other vendors =)