by sunshadow 954 days ago
These days, I'm not even using Go for scraping that much, as webpage changes drive me crazy and JS code evaluation is a lifesaver, so I moved to TypeScript+Playwright. (The Crawlee framework is cool, though not strictly necessary.)

It's been 8+ years since I started scraping. I even wrote a popular Go web scraping framework previously: https://github.com/geziyor/geziyor

My favorite stack as of 2023: TypeScript+Playwright+Crawlee (optional). If you're serious about scraping, you should learn JavaScript, so Playwright should be a good fit.

Note: There are niche cases where a lower-level language would be required (C++, Go, etc.), but probably in fewer than 5% of cases.

3 comments

How does that help you mitigate when a site changes? If you’re fetching some value in a given <div> under a long XPATH and they decide to change that path?
You don't use XPath or CSS selectors at all (unless you have no choice). You rely on more generic stuff, e.g., "the button that has 'Sign in' on it":

    await page.getByRole('button', { name: 'Sign in' }).click();
See playwright locators: https://playwright.dev/docs/locators
I started putting data-testid attributes in my web app for automated testing using playwright. Prevents me from breaking my own script but it sure would make me more scrapable if anyone cared. Well.. I guess I only do it on inputs, not the rendered page which is what scrapers care most about.
Unless you start a war against scrapers, you don't need to worry about that, as I'll always find a way to scrape your site as long as it's valuable to 'me'. Even if it requires a real browser + OCR :)
Oh I know I couldn't prevent it. But if you wanted to scrape me, you'd have to pay the monthly subscription because everything is behind a pay wall/login. And then you'd only have access to data you entered because it's just that kind of app :-)
This is where you just train an LLM so you can write:

'get button named "sign in" and click'

Then on the back end, it generates your example code.

Adept is doing it.
Don't know about the poster, but I try to find divs and buttons in a fuzzy way. Usually via element text. Sometimes it mitigates changes, sometimes it doesn't. It's a guessing game. Especially when they start using shadow elements or iframes in the page. If I'm looking for something specific like a price or dimensions, I can sometimes get away with it by collecting dollar amounts or X x Y x Z from the raw text.
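A rough sketch of that last trick (collecting dollar amounts or X x Y x Z dimensions from raw text) could look like the TypeScript below. The function names and regex patterns are my own assumptions for illustration, not part of any library, and they will miss plenty of real-world formats (other currencies, units glued to numbers, and so on):

```typescript
// Hypothetical helpers: names and patterns are illustrative assumptions.

/** Collect dollar amounts such as "$1,299.99" or "$5" from raw page text. */
function extractDollarAmounts(text: string): number[] {
  // Either comma-grouped digits (1,299) or plain digits, with optional cents.
  const pattern = /\$\s?(\d{1,3}(?:,\d{3})+|\d+)(\.\d{1,2})?/g;
  const amounts: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const whole = match[1].replace(/,/g, ""); // drop thousands separators
    const fraction = match[2] ?? "";          // ".99" or nothing
    amounts.push(Number(whole + fraction));
  }
  return amounts;
}

/** Collect "X x Y x Z" triples (separator "x", "X" or "×") from raw page text. */
function extractDimensions(text: string): Array<[number, number, number]> {
  const num = "(\\d+(?:\\.\\d+)?)";
  const sep = "\\s*[x×]\\s*";
  const pattern = new RegExp(num + sep + num + sep + num, "gi");
  const results: Array<[number, number, number]> = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    results.push([Number(match[1]), Number(match[2]), Number(match[3])]);
  }
  return results;
}

// Usage on a made-up snippet of page text:
const sample = "Price: $1,299.99 (was $1,499). Size: 10.5 x 4 x 2 in.";
console.log(extractDollarAmounts(sample)); // [ 1299.99, 1499 ]
console.log(extractDimensions(sample));    // [ [ 10.5, 4, 2 ] ]
```

In Playwright, the raw text could come from something like `await page.innerText('body')`. Which amount is the actual price is still a guess you have to make yourself.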
iframes have been a pain in the butt to scrape. I see them more and more in websites now.
Have you seen Crul??

I love the JS flow, but I thought crul was an interesting newer tool!!

But I agree, you gotta get in there and it’s easier with JS

Crul looks nice, though you cannot imagine how many startups I've seen fail doing something very similar. I wouldn't rely on it. The problem is complex: humans generate messy pages.
Thank you for the positive acknowledgment and insightful observation. As one of the creators of Crul, I fully understand the challenges inherent in this intricate business and software domain. Our initial emphasis on the browser abstraction layer, predating APIs such as SOAP, REST, GraphQL, etc., serves as a data driver and stateless cluster for interpreting DOM nodes. While we initially lacked programmatic extensibility for custom browser control, as you rightly pointed out, addressing complex edge cases often requires such a feature. Looking ahead, we are exploring the possibility of opening up the core, starting with "Krull," the browser cluster. We welcome feedback to gauge interest in this development.
Can you add a link to it?
I'm sorry to hear that your searches for that very specific name didn't provide the information you were looking for

Its Show HN: https://news.ycombinator.com/item?id=34970917

tfl: https://www.crul.com/

What are some examples of needing a lower-level language?