Hacker News
Show HN: MrScraper AI – Dead simple web scraper (powered by AI) (mrscraper.com)
38 points by buffer_overflow 1116 days ago
I've decided to test a new approach in my web scraping app.

What do you think?

10 comments

It's just a landing page with a video and an email signup form. No usable product. Is this kind of thing allowed on Show HN threads?
Maybe the tagline could be “extract anything from the web without the selectors” or something. Then put your AI stuff in the subtitle.

Either way, the grammar doesn’t quite work right now.

Definitely a cool idea though!

Web scraping needs to be 100% deterministic. My only question is if/how you’ve achieved that.

You’re not only sacrificing simplicity for determinism but also stability. What good is being deterministic when the underlying web page keeps changing all the time, breaking your selectors? This seems like a more stable approach.
However, if the selectors break you can notice that quite easily.
That’s true. Perhaps using the LLM approach you could extract a deterministic selector, and notify the users if it changes in some meaningful way.
> Web scraping needs to be 100% deterministic

Says who?

Well, evidently the potential customer above. Deterministic can mean multiple things, though. For example, say you are scraping the main headline on a page with the selector .headline, and then .headline goes away and the content becomes <div class="t1110 C373">My cool fashions!</div>

Some would say the crawler should pick up the new version of the title, but others would say it should warn you that the selector no longer works.

If the crawler determines via AI what the headline is, and after crawling 10,000 pages it turns out the crawler got the headline wrong, then you might be soured on the idea of AI making this kind of decision for you and be more amenable to being warned. But then you have to do a lot more work with your crawler than you might otherwise want to do.
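To make the "warn me" option concrete, here is a minimal sketch using only Python's standard library. It checks whether any element still carries the class the selector depends on and returns a warning if not. The names `ClassFinder`, `selector_still_matches`, and `check_page` are made up for illustration, and the "before" markup is an assumed example; only the "after" markup comes from the comment above.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any element on the page carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # The class attribute is a space-separated list of class names.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.found = True


def selector_still_matches(html, css_class):
    finder = ClassFinder(css_class)
    finder.feed(html)
    return finder.found


def check_page(html, css_class):
    """Return a warning string when the selector is gone, else None."""
    if selector_still_matches(html, css_class):
        return None
    return f"selector .{css_class} no longer matches; review before trusting a fallback"


# Hypothetical "before" markup, and the "after" markup from the comment.
before = '<h1 class="headline">My cool fashions!</h1>'
after = '<div class="t1110 C373">My cool fashions!</div>'
print(check_page(before, "headline"))
print(check_page(after, "headline"))
```

Whether the crawler should then fall back to an AI guess or stop and wait for a human is a policy choice; the check itself stays deterministic either way.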

Seems very expensive, 1 token = 1kb of data.

Then adding an unknown amount of OpenAI tokens on top of that.

Nice work. I’m trying something similar with https://kadoa.com/playground
I think you should run your homepage copy through the AI. It's clear that a non-native writer wrote it.

In particular, the hero, the beta message, and some of the FAQs. The main features section already looks like the AI wrote it.

What kinds of things are you using it for yourself?

The UI looks really nice and straightforward. Congrats on shipping!

I find it encouraging that the homepage isn't AI generated, maybe the code also isn't, and the project may live more than a month before requiring a rewrite!
Would you run the LLM extractor across every page? Especially for larger scale projects, such as scraping entire product catalogues, this sounds very expensive. Maybe you could use the AI to generate selectors from examples that can then be applied to all other pages of the same structure?
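A rough sketch of that "infer once, apply everywhere" idea, in plain Python. The expensive step runs on a single example page and yields a selector that is then applied cheaply to every other page with the same structure. Here `infer_selector` is only a stand-in for the LLM call (it just looks for the element holding a known value), and every name and the sample markup are assumptions for illustration, not anything MrScraper is known to do.

```python
from html.parser import HTMLParser


class _Elements(HTMLParser):
    """Records (tag, class attribute, direct text) for each closed element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open elements as [tag, class_attr, text]
        self.found = []

    def handle_starttag(self, tag, attrs):
        self.stack.append([tag, dict(attrs).get("class") or "", ""])

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][2] += data

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, discarding unclosed ones (e.g. <br>).
        while self.stack:
            open_tag, cls, text = self.stack.pop()
            if open_tag == tag:
                self.found.append((open_tag, cls, text.strip()))
                break


def elements(html):
    parser = _Elements()
    parser.feed(html)
    return parser.found


def infer_selector(example_html, example_value):
    """Stand-in for the LLM step: find the classed element holding a known value."""
    for tag, cls, text in elements(example_html):
        if text == example_value and cls:
            return (tag, cls)
    return None


def apply_selector(selector, html):
    """Cheap, deterministic extraction reused across same-structure pages."""
    for tag, cls, text in elements(html):
        if (tag, cls) == selector:
            return text
    return None


example = '<div><span class="price">$10</span><span class="name">Hat</span></div>'
selector = infer_selector(example, "$10")
pages = ['<div><span class="price">$25</span></div>',
         '<p><span class="price">$7</span></p>']
prices = [apply_selector(selector, page) for page in pages]
```

With this split, the per-page cost is just parsing, and the model would only be consulted again when `apply_selector` starts returning `None`.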
This will definitely take the burden of choosing selectors off clients (mostly non-technical people). I've had a scraping service business recently for this specific reason. I hope AI can be a great middle player here. Let's see how it turns out. Good luck Kai.
I think that you're building an unethical business, and should not get advice or publicity from this community.

Doubly so given you already had a very high visibility Show HN for this just a few months ago.

I don't think this community agrees with you that web scraping is inherently unethical.

In fact I think many (most?) in this community would argue that web scraping is an almost fundamental feature of the web itself, and that attempts at preventing it are more unethical than scraping.

I doubt the objections are about scraping per se, but about unethical scraping where no consideration is given to etiquette.

"proxy rotation" in the first line does not bode well for ethics.

What's wrong with proxy rotation? Big Tech attempts to prevent any scraping of their content whatsoever. In the context of that web, proxy rotation is table stakes.
Without any documentation about how etiquette will be respected and sites won't be hammered, it's fair to be sceptical about the ethics.

Not every crawled site is "big tech" or even commercial.
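For what "etiquette" could look like in practice, here is a small sketch built on Python's standard `urllib.robotparser`: honor robots.txt rules and leave a delay between requests to the same site. `PoliteGate`, the `example-bot` user agent, the one-second default delay, and the sample robots.txt are assumptions for illustration; nothing here describes how MrScraper behaves.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Decides whether a URL may be fetched and how long to wait first."""

    def __init__(self, robots_txt, user_agent, default_delay=1.0):
        self.rules = RobotFileParser()
        self.rules.parse(robots_txt.splitlines())
        self.user_agent = user_agent
        # Prefer the site's own Crawl-delay if it declares one.
        self.delay = self.rules.crawl_delay(user_agent) or default_delay
        self.last_request = None

    def allowed(self, url):
        return self.rules.can_fetch(self.user_agent, url)

    def wait_time(self, now=None):
        """Seconds to sleep before the next request to this site."""
        now = time.monotonic() if now is None else now
        if self.last_request is None:
            return 0.0
        return max(0.0, self.last_request + self.delay - now)

    def record(self, now=None):
        self.last_request = time.monotonic() if now is None else now


# Sample robots.txt content, made up for this sketch.
robots = "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n"
gate = PoliteGate(robots, "example-bot")
```

A crawler would call `gate.allowed(url)` before each fetch, sleep for `gate.wait_time()`, and call `gate.record()` afterwards.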

I don't disagree, but I also don't think "proxy rotation" immediately implies sites will be hammered either.
Ahh yes, the guy who worked at LinkedIn, I presume?
I think I understand pagination — but can you elaborate on proxy rotation?

> combines the practicality of language models with the powerful features of a traditional scraper such as pagination and proxy rotation

When scraping websites, it’s often necessary to change your IP address to bypass the website’s anti-scraping measures. To achieve this, there are proxy services out there that are designed with web scraping in mind, so it’s easy to programmatically change your IP address from within a scraper program.
It sends your requests over a lot of proxies so your scraper does not get rate limited or blocked by IP address.
You basically switch out the proxy you use to scrape. Services by Google or others can identify scrapers because they'll use the same proxy to request pages.
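As a rough illustration of the replies above, round-robin rotation can be sketched with the standard library alone. The proxy addresses are placeholders, and `plan_requests` and `rotating_openers` are names invented for this example; a real scraper would get its proxy list from a provider and would likely add retries and error handling.

```python
import itertools
import urllib.request

# Placeholder proxy endpoints; a real list would come from a proxy provider.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
]


def plan_requests(urls, proxies):
    """Pair each URL with the next proxy in the rotation, without fetching."""
    rotation = itertools.cycle(proxies)
    return [(url, next(rotation)) for url in urls]


def rotating_openers(proxies):
    """Yield (proxy, opener) pairs endlessly, switching proxy each time."""
    for proxy in itertools.cycle(proxies):
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        yield proxy, urllib.request.build_opener(handler)


plan = plan_requests(["/page/1", "/page/2", "/page/3", "/page/4"], PROXIES)
```

Each opener from `rotating_openers` exposes the usual `open(url)` method, so consecutive page requests leave from different IP addresses instead of all coming from one.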
Cool project but isn’t the first rule of proxy rotation not to talk about proxy rotation?
Shh! Don't talk about it