
by sunshadow 954 days ago

These days, I'm not even using Go for scraping that much, as webpage changes drive me crazy and JS code evaluation is a lifesaver, so I moved to TypeScript+Playwright. (The Crawlee framework is cool, though not strictly necessary.)

It's been 8+ years since I started scraping. I even wrote a popular Go web scraping framework previously (https://github.com/geziyor/geziyor).

My favorite stack as of 2023: TypeScript+Playwright+Crawlee (optional). If you're serious about scraping, you should learn JavaScript, so Playwright should be a good fit.

Note: There are niche cases where a lower-level language would be required (C++, Go, etc.), but probably only <5%.

3 comments

hipadev23 954 days ago

How does that help you mitigate when a site changes? What if you’re fetching some value in a given <div> under a long XPath and they decide to change that path?

sunshadow 954 days ago

You don't use XPath or CSS selectors at all (except if you don't have a choice). You rely on more generic stuff, e.g., "the button that has 'Sign in' on it":

    await page.getByRole('button', { name: 'Sign in' }).click();

See Playwright locators: https://playwright.dev/docs/locators

8n4vidtmkvmk 954 days ago

I started putting data-testid attributes in my web app for automated testing using Playwright. It prevents me from breaking my own script, but it sure would make me more scrapable if anyone cared. Well... I guess I only do it on inputs, not the rendered page, which is what scrapers care most about.

sunshadow 954 days ago

Unless you start a war against scrapers, you don't need to worry about that, as I'll always find a way to scrape your site as long as it's valuable to 'me'. Even if it requires a real browser + OCR :)

erhaetherth 954 days ago

Oh, I know I couldn't prevent it. But if you wanted to scrape me, you'd have to pay the monthly subscription because everything is behind a paywall/login. And then you'd only have access to data you entered because it's just that kind of app :-)

latchkey 954 days ago

This is where you just train an LLM so you can write:

'get button named "sign in" and click'

Then on the back end, it generates your example code.

bluecrab 953 days ago

Adept is doing it.

nurettin 954 days ago

Don't know about the poster, but I try to find divs and buttons in a fuzzy way. Usually via element text. Sometimes it mitigates changes, sometimes it doesn't. It's a guessing game. Especially when they start using shadow elements or iframes in the page. If I'm looking for something specific like a price or dimensions, I can sometimes get away with it by collecting dollar amounts or X x Y x Z from the raw text.
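A rough sketch of the "collect dollar amounts or X x Y x Z from the raw text" idea, in TypeScript to match the stack discussed above. The function names `extractDollarAmounts` and `extractDimensions` and the exact regex patterns are illustrative assumptions, not part of Playwright or any other library; real pages will likely need looser or stricter patterns.

```typescript
// Hypothetical helpers (not from any library): fuzzy extraction from raw page text.

// Collects dollar amounts such as "$5", "$1,299.99", or "$ 40.5".
function extractDollarAmounts(text: string): number[] {
  const pattern = /\$\s?(\d{1,3}(?:,\d{3})+|\d+)(\.\d{1,2})?/g;
  const amounts: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    // Drop thousands separators, then re-attach the optional fraction.
    const whole = match[1].replace(/,/g, "");
    const fraction = match[2] !== undefined ? match[2] : "";
    amounts.push(Number(whole + fraction));
  }
  return amounts;
}

// Collects "X x Y x Z" triples such as "10 x 20.5 x 3" or "4×5×6".
function extractDimensions(text: string): number[][] {
  const num = "(\\d+(?:\\.\\d+)?)";
  const sep = "\\s*[x×]\\s*";
  const pattern = new RegExp(num + sep + num + sep + num, "gi");
  const results: number[][] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    results.push([Number(match[1]), Number(match[2]), Number(match[3])]);
  }
  return results;
}
```

With Playwright, the input text could come from something like `await page.locator('body').innerText()`. This only survives markup changes, not wording changes, so it stays a guessing game as described above.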

aynyc 953 days ago

iframes have been a pain in the butt to scrape. I see them more and more on websites now.

mikercampbell 954 days ago

Have you seen Crul??

I love the JS flow, but I thought crul was an interesting newer tool!!

But I agree, you gotta get in there and it’s easier with JS

sunshadow 954 days ago

Crul looks nice, though you cannot imagine how many startups I've seen fail doing something very similar to Crul. I wouldn't rely on it. The problem is complex: humans generate messy pages.

docyes 950 days ago

Thank you for the positive acknowledgment and insightful observation. As one of the creators of Crul, I fully understand the challenges inherent in this intricate business and software domain. Our initial emphasis on the browser abstraction layer, predating APIs such as SOAP, REST, GraphQL, etc., serves as a data driver and stateless cluster for interpreting DOM nodes. While we initially lacked programmatic extensibility for custom browser control, as you rightly pointed out, addressing complex edge cases often requires such a feature. Looking ahead, we are exploring the possibility of opening up the core, starting with "Krull," the browser cluster. We welcome feedback to gauge interest in this development.

reyostallenberg 954 days ago

Can you add a link to it?

mdaniel 954 days ago

I'm sorry to hear that your searches for that very specific name didn't provide the information you were looking for

Its Show HN: https://news.ycombinator.com/item?id=34970917

tfl: https://www.crul.com/

gymbeaux 953 days ago

What are some examples of needing a lower-level language?