Show HN: MrScraper AI – Dead simple web scraper (powered by AI)


	Show HN: MrScraper AI – Dead simple web scraper (powered by AI) (mrscraper.com)
	38 points by buffer_overflow 1163 days ago
	I've decided to test a new approach in my web scraping app. What do you think?

10 comments

JSavageOne 1163 days ago

It's just a landing page with a video and an email signup form. No usable product. Is this kind of thing allowed on Show HN threads?


motoxpro 1163 days ago

Maybe the tag line could be “extract anything from the web without the selectors” or something. Then put your ai stuff in the subtitle.

Either way, the grammar doesn’t quite work right now.

Definitely a cool idea though!

Web scraping needs to be 100% deterministic. My only question is if/how you’ve achieved that.


brap 1163 days ago

You’re not only sacrificing simplicity for determinism but also stability. What good is being deterministic when the underlying web page keeps changing all the time, breaking your selectors? This seems like a more stable approach.


tomschwiha 1163 days ago

However, if the selectors break, you can notice that quite easily.


brap 1163 days ago

That’s true. Perhaps using the LLM approach you could extract a deterministic selector, and notify the users if it changes in some meaningful way.


qup 1163 days ago

> Web scraping needs to be 100% deterministic

Says who?


bryanrasmussen 1163 days ago

Well, evidently the potential customer above. Deterministic can mean multiple things, though. For example, say you are scraping the main headline on a page with a selector, and the class is .headline. Then .headline goes away and the content becomes <div class="t1110 C373">My cool fashions!</div>

Some can say that the crawler should get the new version of the title, but others can say it should warn you that the selector no longer works.

If the crawler determines via AI what the headline is, and after crawling 10000 pages it turns out the crawler got the headline wrong, then you might be soured on the idea of AI making this kind of decision for you and be more amenable to being warned. But then you have to do a lot more work with your crawler than you might otherwise want to do.
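The "warn me when the selector stops matching" behaviour described above could be sketched roughly as follows. This is only an illustration using Python's standard `html.parser`; the names `ClassTextExtractor`, `SelectorBroken`, and `extract_headline` are made up for this example and are not part of MrScraper or any other tool mentioned in the thread.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID = {"br", "img", "hr", "input", "meta", "link"}


class ClassTextExtractor(HTMLParser):
    """Collects the text of the first element that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while we are inside the matched element
        self.done = False   # True once the matched element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID:
            return
        if self.depth:
            self.depth += 1
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


class SelectorBroken(Exception):
    """Raised when the configured class matches nothing on the page."""


def extract_headline(html, css_class="headline"):
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts).strip()
    if not text:
        # Fail loudly instead of guessing what the headline might be.
        raise SelectorBroken(f"no element with class {css_class!r} found")
    return text
```

With this sketch, a page that still uses `.headline` returns its text, while the renamed `<div class="t1110 C373">` case raises `SelectorBroken`, which a crawler could turn into a notification rather than a silent wrong answer.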


tmikaeld 1163 days ago

Seems very expensive, 1 token = 1kb of data.

Then adding an unknown amount of OpenAI tokens on top of that.


t_a_v_i_s 1163 days ago

Nice work. I’m trying something similar with https://kadoa.com/playground


qup 1163 days ago

I think you should run your homepage copy through the AI. It's clear that a non-native writer wrote it.

In particular, the hero, the beta message, and some of the FAQs. The main features section already looks like the AI wrote it.

What kinds of things are you using it for yourself?

The UI looks really nice and straightforward. Congrats on shipping!


lionkor 1163 days ago

I find it encouraging that the homepage isn't AI generated, maybe the code also isn't, and the project may live more than a month before requiring a rewrite!


EveYoung 1163 days ago

Would you run the LLM extractor across every page? Especially for larger scale projects, such as scraping entire product catalogues, this sounds very expensive. Maybe you could use the AI to generate selectors from examples that can then be applied to all other pages of the same structure?
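The "derive a selector once from an example, then reuse it on every page of the same structure" idea might look something like the sketch below. Note the assumptions: the AI step is replaced here by a plain text match against a known example value, all function names are hypothetical, and nothing here reflects how MrScraper actually works. The point is only that the expensive step runs once per page template instead of once per page.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they are skipped when tracking nesting.
VOID = {"br", "img", "hr", "input", "meta", "link"}


class ElementCollector(HTMLParser):
    """Records (class list, direct text) for every element that gets closed."""

    def __init__(self):
        super().__init__()
        self.stack = []     # open elements as (tag, classes, text parts)
        self.elements = []  # closed elements as (classes, text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append((tag, classes, []))

    def handle_endtag(self, tag):
        # Ignore stray end tags that match nothing currently open.
        if tag not in [open_tag for open_tag, _, _ in self.stack]:
            return
        while self.stack:
            open_tag, classes, parts = self.stack.pop()
            self.elements.append((classes, "".join(parts).strip()))
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][2].append(data)


def collect(html):
    parser = ElementCollector()
    parser.feed(html)
    parser.close()
    return parser.elements


def infer_class(example_html, expected_text):
    """Stand-in for the AI step: find a class unique to the element
    whose text equals the value we know is on the example page."""
    elements = collect(example_html)
    for classes, text in elements:
        if text == expected_text:
            for css_class in classes:
                if sum(css_class in other for other, _ in elements) == 1:
                    return css_class
    return None


def apply_class(html, css_class):
    """Cheap per-page step: reuse the stored class, no model call needed."""
    for classes, text in collect(html):
        if css_class in classes:
            return text
    return None  # the caller can treat this as "selector broke, re-infer"
```

For a catalogue, `infer_class` would run on one product page and `apply_class` on all the others; a `None` result is the signal to fall back to the expensive extractor for that page.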


saasxyz 1163 days ago

This will definitely take away the burden of clients (mostly non-technical people) having to choose the selectors. I've had a scraping service business recently for this specific reason. I hope AI can be a great middle player here. Let's see how it turns out. Good luck Kai.


jsnell 1163 days ago

I think that you're building an unethical business, and should not get advice or publicity from this community.

Doubly so given you already had a very high visibility Show HN for this just a few months ago.


fastball 1163 days ago

I don't think this community agrees with you that web scraping is inherently unethical.

In fact I think many (most?) in this community would argue that web scraping is an almost fundamental feature of the web itself, and that attempts at preventing it are more unethical than scraping.


mellosouls 1163 days ago

I doubt the objections are about scraping per se, but about unethical scraping where no consideration is given to etiquette.

"proxy rotation" in the first line does not bode well for ethics.


fastball 1163 days ago

What's wrong with proxy rotation? Big Tech attempts to prevent any scraping of their content whatsoever. In the context of that web, proxy rotation is table stakes.


mellosouls 1163 days ago

Without any documentation about how etiquette will be respected and sites won't be hammered, it's fair to be sceptical about the ethics.

Not every crawled site is "big tech" or even commercial.
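For readers wondering what "respecting etiquette" could mean in code, one common baseline is to honour a site's robots.txt and its crawl delay. The sketch below uses Python's standard `urllib.robotparser`; the function name `polite_plan`, the user agent string, and the one-second default delay are assumptions for illustration, and nothing in the thread says MrScraper does or does not do this.

```python
import urllib.robotparser


def polite_plan(robots_txt, user_agent, urls, default_delay=1.0):
    """Return (allowed_urls, delay_seconds) for one site.

    robots_txt is the already-downloaded text of the site's robots.txt.
    URLs the site disallows are dropped, and the site's Crawl-delay is
    used when present, otherwise default_delay.
    """
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())

    delay = parser.crawl_delay(user_agent)
    if delay is None:
        delay = default_delay

    allowed = [url for url in urls if parser.can_fetch(user_agent, url)]
    return allowed, float(delay)


if __name__ == "__main__":
    robots = "User-agent: *\nCrawl-delay: 5\nDisallow: /private/\n"
    pages = ["https://example.com/a", "https://example.com/private/x"]
    print(polite_plan(robots, "example-bot", pages))
```

A crawler would then sleep for the returned delay between requests to that host, regardless of how many proxies it has available.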


fastball 1162 days ago

I don't disagree, but I also don't think "proxy rotation" immediately implies sites will be hammered either.


fakedang 1163 days ago

Ahh yes, the guy who worked at LinkedIn, I presume?


d4rkp4ttern 1163 days ago

I think I understand pagination — but can you elaborate on proxy rotation?

> combines the practicality of language models with the powerful features of a traditional scraper such as pagination and proxy rotation


gymbeaux 1163 days ago

When scraping websites, it’s often necessary to change your IP address to bypass the website’s anti-scraping measures. To achieve this, there are proxy services out there that are designed with web scraping in mind, so it’s easy to programmatically change your IP address from within a scraper program.
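As a rough illustration of what "programmatically changing your IP address" can look like, here is a minimal round-robin sketch built on Python's standard `urllib.request`. The `ProxyRotator` class and `fetch` helper are hypothetical names, and the proxy URLs a caller passes in would come from whatever proxy service they use; real services usually ship their own clients and rotation logic.

```python
import itertools
import urllib.request


class ProxyRotator:
    """Hands out proxies from a fixed list in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("need at least one proxy")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._cycle)

    def opener(self):
        """Build a urllib opener that routes traffic through the next proxy."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return proxy, urllib.request.build_opener(handler)


def fetch(rotator, url, timeout=10):
    """Fetch url through the next proxy; returns (proxy used, response body)."""
    proxy, opener = rotator.opener()
    with opener.open(url, timeout=timeout) as response:
        return proxy, response.read()
```

Each call to `fetch` goes out through a different proxy from the list, so consecutive requests do not share a source IP address.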


jimvdv 1163 days ago

It sends your requests over a lot of proxies so your scraper does not get rate limited or blocked by IP address.


cyanydeez 1163 days ago

You basically switch out the proxy you use to scrape. Services by Google or others can identify scrapers because they use the same proxy to request pages.


henriquez 1163 days ago

Cool project but isn’t the first rule of proxy rotation not to talk about proxy rotation?


imdsm 1163 days ago

Shh! Don't talk about it
