I find it wild that "at scale" we can bypass anti-bot measures, but just "normal" internet use (e.g. a non-Google browser or a VPN) will throw a million captchas at you.
When you’re not at scale, what you’re seeing is a tiny fraction of the potential captchas that can be thrown at you. Normally “we have seen this cookie before” or “this browser does not have webdriver fingerprints” is sufficient to not get a captcha.
The big issue you sidestep when not at scale is IP reputation: you can come from a single residential IP with a good reputation.
Mandatory captchas for simply viewing a page are rare - most are saved for high impact actions like account creation.
When this does happen for a simple page view, AI is extremely good at solving basic captchas - especially basic “click the box” captchas.
If you don’t want to pay for AI, there are decaptcha services where someone in Southeast Asia solves the captcha for fractions of a penny. Save the cookies after a successful solve and you’re probably good in the future.
If you don’t want to pay for someone to solve a simple check-the-box captcha, a little bit of attention and some properly simulated clicking (i.e. not a JavaScript-injected event) will often work. Just don’t click literally the exact middle; fuzz the coordinates and you’re good.
Because there's been a string of bad actors including OpenAI with incredibly inefficient scrapers.
Previously captchas were just for limiting spam, but I actually looked at our system logs and about half of the traffic was badly behaved scrapers.
In the logs I see these scrapers hitting every link on the page. If you have a collection page, they hit every filter option, then each pagination button, the different sort orders, etc. For people running something like Forgejo, they will hit every commit.
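To make the filter/sort explosion concrete, here is a minimal sketch of what a less naive crawler could do: collapse URLs that differ only in presentation parameters before queueing them. The parameter names in IGNORED_PARAMS are hypothetical (every site names these differently), and pagination is deliberately kept so the unfiltered listing is still fully walked.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical names for presentation-only parameters; a real site
# would need its own list. "page" is kept on purpose.
IGNORED_PARAMS = {"sort", "order", "filter"}


def canonical_url(url: str) -> str:
    """Drop presentation-only query parameters and sort the rest."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # The fragment is dropped because it never reaches the server.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def dedupe(urls):
    """Return canonical URLs in first-seen order, without duplicates."""
    seen = set()
    unique = []
    for url in urls:
        canonical = canonical_url(url)
        if canonical not in seen:
            seen.add(canonical)
            unique.append(canonical)
    return unique
```

With this, every sort order and filter combination of the same collection page maps to one fetch instead of dozens.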
If you have expensive-to-compute pages, they're getting hit by these incredibly naive bots that don't respect robots.txt or discriminate in what they fetch.
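For reference, respecting robots.txt is not hard. A minimal sketch using Python's standard urllib.robotparser follows; the robots.txt content, the "example-bot" user agent and the example.com URLs are made up for illustration, and a real crawler would fetch the file from the site instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt for illustration: keep bots off the expensive
# per-commit pages and ask for a pause between requests.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /commits/
Crawl-delay: 10
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str, agent: str = "example-bot") -> bool:
    """True if the rules allow this (hypothetical) user agent to fetch the URL."""
    return parser.can_fetch(agent, url)


parser = build_parser(EXAMPLE_ROBOTS)
commit_ok = may_fetch(parser, "https://example.com/commits/abc123")
readme_ok = may_fetch(parser, "https://example.com/README")
delay = parser.crawl_delay("example-bot")  # seconds to wait between requests
```

Here commit_ok is False and readme_ok is True, and a polite bot would also sleep for the crawl delay between requests.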
The problem is that the web as we know it (useful, human-curated information that's put out there to help people) is also over. It's been totally overrun with AI slop. Even before AI could be used to create propaganda on a scale that we could only dream about 5 years ago, it's been declining under the weight of SEO sweatshops for a good 10 years. Meanwhile the actually decent content, the individual hobbyists who are just sharing their knowledge, have largely left under the weight of comment spam and DDoS attacks and doxxing.
So if another search engine does arise, it won't find anything useful, because the useful content on the web has been buried under slop, and largely removed. Your best bet today is a curated directory, sorta like the original Yahoo, where you allowlist the web to only real sites, download them, and make them searchable. I think this is actually Kagi's approach. But the open web as we knew and loved it is dead.
Bing has been better than Google for some time. Again, it's embarrassing for them to sacrifice market share for paid results and an intermediate-form AI fad that will turn into the same paid result funnel.
E.g. for a two-keyword search, Google and DDG return results containing a keyword that is similar to the first one (one that is more popular at the moment, so I understand why they do this) and that have no relation whatsoever to the second. Any search engine that manages to actually show results related to both of my input terms gets the "better" award from me.
Also, usually, as soon as they realize they have a not-total-shit product, they immediately start to screw it up completely. So if Bing actually ends up being better, it won't be long until they replace every good part of it with something ridiculous. I don't know how Microsoft does it, but they are so incredibly good at that.
DuckDuckGo uses the Bing index/backend. I’ve had it as default for 5-8 years. Probably once a day I’ll add the !g to pop it over to Google. Works great. I search a lot, many different types of queries. When I pop over to Google it’s usually a Boolean query looking for a needle in a haystack (that one comment somewhere where someone is using the same combination of two or three rare items together).
While there are good options like DuckDuckGo, Mojeek, and Ecosia, there are plenty of (better) alternatives where you're not the product [1] that I'd recommend looking into!
I'm sure there's a niche for a product for search nerds. Something that leans into inverted indexes like the classic Lexis/Nexis search. But it's got to have Google-like coverage.
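For anyone who hasn't seen one, a minimal sketch of an inverted index with strict AND semantics is below. It is a toy under stated assumptions (the class name, the tokenizer and the sample documents are all made up for illustration), not how Lexis/Nexis or any real engine is implemented, but it shows why "every term must match" is cheap once the index exists.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Toy index: maps each term to the set of document ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search_all(self, query):
        """Return ids of documents containing every query term (Boolean AND)."""
        terms = tokenize(query)
        if not terms:
            return set()
        # Intersect the smallest posting sets first to keep the work small.
        sets = sorted((self.postings.get(term, set()) for term in terms), key=len)
        result = set(sets[0])
        for postings in sets[1:]:
            result &= postings
        return result


# Made-up sample documents.
index = InvertedIndex()
index.add(1, "Forgejo instance hammered by scrapers")
index.add(2, "Captcha in front of a Forgejo commit view")
index.add(3, "Captcha solving services")
```

A query like index.search_all("forgejo captcha") only returns documents that contain both terms, which is exactly the behaviour the two-keyword complaint above is asking for.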
Niche + Google-like coverage is not very economically viable. To store and update a search index of that size requires a lot of resources, and being niche means you don’t have a lot of resources.
Very few of the smaller search engines actually do their own indexing for exactly this reason.
It’s possible, but they would need to be massive to even start making a dent in Google’s market share. And Google hasn’t blocked larger search engines from using their index.
I've been using Startpage as my default search engine for a while now for any search where I actually need information and not sales or marketing bullshit.
When I use Google, usually from my phone, I am reminded of why I don't use Google on desktop.
With the announcement of this move by them, I just manually removed Google as an address bar search engine option in all my browsers on desktop and mobile.
There’s not much room to squeeze in when your competitors hold the keys to 15 million top websites.