Hacker News
by unionpivo 588 days ago
Because nowadays, more than ever, the content you need is in silos.

Your Facebooks/Twitters/Instagram/Stack Overflow/Reddit ... And they all have limited, expensive APIs and bulk-scraping detection. Sure, you can cobble together something that will work for a while, but you can't run a business on that.

Additionally, most paywalled sites (like news) explicitly whitelist Google and Bing, and if someone creates a new site, they do the same. As an upstart, you would have to reach out to them to get them to whitelist you, and you would need to do it not only in the USA but globally.
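
One way such whitelisting can show up is in a site's robots.txt (sites may also enforce it server-side, which the comment does not specify). A minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `NewIndexBot` crawler name, and the `news.example` URL are all made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the two incumbent crawlers are allowed,
# every other user agent is disallowed.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""

def allowed_agents(robots_txt, agents, url):
    """Return {agent: may_fetch} for each crawler name, per the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}

# "NewIndexBot" stands in for an upstart search engine's crawler.
result = allowed_agents(
    ROBOTS_TXT,
    ["Googlebot", "Bingbot", "NewIndexBot"],
    "https://news.example/article/1",
)
print(result)
```

With rules like these, a new crawler that honors robots.txt is locked out until each site adds it by name.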

Another problem is Cloudflare and other CDNs/web firewalls, so even trying to index a mom-and-pop blog site could be problematic. And most of the mom-and-pop blogs are nowadays on some blogging platform that is just another silo.

Now that I think about it, Cloudflare might be in a good position to do it.

The AI hype and the scraping for content to feed the models have increased the difficulty for anyone new to start a new index.

3 comments

This is the best (and saddest) answer. LLMs break the social contract of the internet, we're in a feudalisation process.

The decentralized nature of the internet was amazing for businesses, and monopolization could ruin the space and slow innovation down significantly.

> LLMs break the social contract of the internet

The legal concept of fair use has been and is being challenged, and will be tested in court. Is the Golden Age of Fair Use Over? Maybe [0].

[0] https://blog.mojeek.com/2024/05/is-the-golden-age-of-fair-us...

While LLMs have accelerated it, it was already the case that silos were blocking non-Google and non-Bing search engines before LLMs. LLMs have only made existing problems of the web worse, but they were problems before LLMs too, and banning LLMs won't fix the core issues of silos and misinformation.

You're thinking too much by the rules. You can absolutely scrape them anyway. Probably the biggest relevant factor is CGNAT and other technologies that make you blend in with a crowd. If I run a scraper on my cellphone hotspot, the site can't block me without blocking a quarter of all cellphones in the country.

If the site is blocking less aggressively and only has a per-IP rate limit, buy a subscription to one of those VPNs (it doesn't matter if they're "actually secure" or not - you can borrow their IP addresses either way). If the site is extremely aggressive, you can outsource to the slightly grey market for residential proxy services, at fifty cents to several dollars per gigabyte, so make sure that fits in your business plan.
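
To see whether per-gigabyte pricing fits a business plan, a back-of-the-envelope estimate helps. This is only a sketch: the crawl volume (100 million pages a month), the average page size (100 KB), and the $5 upper price are assumptions picked for illustration; only the fifty-cent lower bound comes from the comment above.

```python
def monthly_proxy_cost(pages, avg_page_kb, price_per_gb):
    """Estimate monthly proxy spend in dollars (decimal units: 1 GB = 1,000,000 KB)."""
    gigabytes = pages * avg_page_kb / 1_000_000
    return gigabytes * price_per_gb

# Hypothetical crawl: 100 million pages a month at 100 KB each (10,000 GB).
PAGES = 100_000_000
AVG_PAGE_KB = 100

low = monthly_proxy_cost(PAGES, AVG_PAGE_KB, 0.50)   # cheap end of the range
high = monthly_proxy_cost(PAGES, AVG_PAGE_KB, 5.00)  # assumed "several dollars"
print(f"${low:,.0f} to ${high:,.0f} per month")
```

Under these assumptions the bandwidth alone runs from $5,000 to $50,000 a month, and a web-scale index would need far more than 100 million pages.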

There's an upper bound to a website's aggressiveness in blocking, before they lose all their users, which tops out below how aggressive you can be in buying a whole bunch of SIM cards, pointing a directional antenna at McDonald's, or staying a night at every hotel in the area to learn their wi-fi passwords.

> You're thinking too much by the rules. You can absolutely scrape them anyway. Probably the biggest relevant factor is CGNAT and other technologies that make you blend in with a crowd. If I run a scraper on my cellphone hotspot, the site can't block me without blocking a quarter of all cellphones in the country.

I am familiar with most of that, and there is a BIG difference between trying to find a workaround for one site that you scrape occasionally and trying to find workarounds for all of the sites.

Big sites will definitely put entire ISPs behind annoying CAPTCHAs that are designed to stop exactly this (if you ever wonder why you sometimes get CAPTCHAs that seem slow to load, have long animations, or other annoying slow things, that is why).

And once you start making enough money to employ all the people you need for doing that consistently, they will find a jurisdiction or 3 where they can sue you.

Also, good luck finding residential/mobile ISPs that will stand by and not try to throttle you after a while.

You definitely can get away with doing all of that for a while, but you absolutely can't build a sustainable business on that.

There are many rationalizations to not try.

And JavaScript/dynamic content. Entrenched search engines have had a long time to optimize scraping for complex sites.