by ivraatiems 38 days ago
Kind of Google to create a market opening for its competitors like this. I hope Kagi, Bing, and DuckDuckGo are taking notes.

reCaptcha is a pretty strong wall to allow only Google to index websites, especially now that you need device verification. Throw in Cloudflare too.

There’s not much room to squeeze in when your competitors hold the keys to 15 million top websites.

I write a lot of scrapers. Both of those are pretty trivial to bypass at scale.
What about not at scale?

I find it wild that "at scale" we can bypass anti-bot measures, but just "normal" internet use (i.e., a non-Google browser or a VPN) will get a million captchas thrown at you.

CGNAT is pretty bad too.

Not at scale, what you’re seeing is a tiny, tiny fraction of the potential captchas that can be thrown at you. Normally “we have seen this cookie before” or “this browser does not have webdriver fingerprints” is sufficient to not get a captcha.

The big issue you sidestep when you're not at scale is IP reputation: you can come from a single residential IP with a good reputation.

Mandatory captchas for simply viewing a page are rare - most are saved for high impact actions like account creation.

When this does happen for a simple page view, AI is extremely good at solving basic captchas - especially basic “click the box” captchas.

If you don’t want to pay for AI, there are decaptcha services where someone in Southeast Asia solves the captcha for fractions of a penny. Save the cookies after a successful solve and you’re probably good in the future.

If you don’t want to pay for someone to solve a simple check-the-box captcha, a little bit of attention and some properly simulated clicking (i.e., not a JavaScript-injected event) will often work. Just don’t click literally the exact middle; fuzz the coordinates and you’re good.

> reCaptcha is a pretty strong wall to allow only Google to index websites

Why would website authors _want_ to prevent crawling by other search engines?

Because there's been a string of bad actors including OpenAI with incredibly inefficient scrapers.

Previously, captcha was just for limiting spam, but I actually looked at our system logs and about half of the traffic was badly behaved scrapers.

In the logs I see these scrapers hitting every link on the page. If you have a collection page, then they hit every filter option, each pagination button, the different sort orders, etc. For people running something like Forgejo, they will hit every commit.

If you have expensive-to-compute pages, they're getting hit by these incredibly naive bots that don't respect robots.txt or discriminate in what they request.
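For contrast, here is a minimal sketch of what a well-behaved crawler could do before requesting any of those filter, sort, or commit URLs. It uses Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleBot` user agent, the `example.com` URLs, and the helper names are all made up for illustration; nothing here is taken from a real site or a real scraper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only (not from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /commits/
Crawl-delay: 10
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text, without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to request the URL."""
    return parser.can_fetch(user_agent, url)


parser = make_parser(ROBOTS_TXT)

# Hypothetical candidate URLs a naive bot might enqueue from one page.
candidates = [
    "https://example.com/about",
    "https://example.com/search?sort=asc&page=2",
    "https://example.com/commits/abc123",
]

# A polite crawler drops the disallowed URLs instead of hitting them all.
to_crawl = [u for u in candidates if may_fetch(parser, "ExampleBot", u)]

# Seconds to wait between requests, if the site declares a Crawl-delay.
delay = parser.crawl_delay("ExampleBot")
```

With these made-up rules only the `/about` URL survives the filter, and the crawler would also know to wait between requests. In a real crawler the rules would be fetched per host and cached rather than hard-coded.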

The problem is that the web as we know it (useful, human-curated information that's put out there to help people) is also over. It's been totally overrun with AI slop. Even before AI could be used to create propaganda on a scale that we could only dream about 5 years ago, it had been declining under the weight of SEO sweatshops for a good 10 years. Meanwhile the people behind the actually decent content, the individual hobbyists who are just sharing their knowledge, have largely left under the weight of comment spam, DDoS attacks, and doxxing.

So if another search engine does arise, it won't find anything useful, because the useful content on the web has been buried under slop, and largely removed. Your best bet today is a curated directory, sorta like the original Yahoo, where you allowlist the web to only real sites, download them, and make them searchable. I think this is actually Kagi's approach. But the open web as we knew and loved it is dead.

Curators will become desirable again. The Devil Wears Prada 3.
My literal first thought was "do I seriously need to use bing now?".
Bing has been better than Google for some time. Again, it's embarrassing for them to sacrifice market share for paid results and an intermediate-form AI fad that will turn into the same paid result funnel.
Bing is surprisingly not too bad. I don't use it anymore, but it's been providing better results than Google for some time.
I hear people cite other search engines as "better" all the time. Better how?
E.g., for a two-keyword search, Google and DDG return results that match a keyword similar to my first one (but currently more popular, so I understand why they do this) and that have no relation whatsoever to the second. Any search engine that manages to actually show results related to both of my input terms gets the "better" award from me.
There is a 99% chance (IMO) that Microsoft is going to go the same route as Google here
Microsoft has already gone down this road some three years ago...

https://blogs.microsoft.com/blog/2023/02/07/reinventing-sear...

Also, usually, as soon as they realize they have a not-total-shit product, they immediately start to screw it up completely. So if Bing actually ends up being better, it won't be long until they replace every good part of it with something ridiculous. I don't know how Microsoft does it, but they are so incredibly good at that.
DuckDuckGo uses the Bing index/backend. I’ve had it as my default for 5-8 years. Probably once a day I’ll add the !g to pop it over to Google. Works great. I search a lot, with many different types of queries. When I pop over to Google it’s usually a Boolean query looking for a needle in a haystack (that one comment somewhere where someone is using the same combination of two or three rare items together).
While there are good options like DuckDuckGo, Mojeek, and Ecosia, there are also plenty of (better) alternatives where you're not the product [1] that I'd recommend looking into!

[1]: https://alternativeto.net/software/google-search/?license=co...

I've been using Brave for years. And I'm in the process of moving off of Gmail. Why Bing, of all engines?
I'm sure there's a niche for a product for search nerds. Something that leans into inverted indexes like the classic Lexis/Nexis search. But it's got to have Google-like coverage.
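For anyone who hasn't met the term, an inverted index maps each term to the set of documents containing it, which makes a strict "every term must match" query a simple set intersection. Below is a toy sketch of that idea. The three documents, the tokenizer, and the function names are all made up for illustration, and this is not a claim about how Lexis/Nexis or any real engine is implemented.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of document ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_all(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of documents containing every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical toy corpus, for illustration only.
docs = {
    1: "Forgejo commit history pagination",
    2: "Bing index used by DuckDuckGo",
    3: "Forgejo index rebuild",
}
index = build_index(docs)

both_terms = search_all(index, "forgejo index")
one_term = search_all(index, "forgejo")
no_match = search_all(index, "forgejo yandex")
```

Here the two-term query matches only document 3, because a document has to contain both terms to be returned. The hard part the replies point at is not this lookup but storing and refreshing the postings for a web-sized corpus.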
Niche + Google-like coverage is not very economically viable. To store and update a search index of that size requires a lot of resources, and being niche means you don’t have a lot of resources.

Very few of the smaller search engines actually do their own indexing for exactly this reason.

I wonder if the same coverage as before is now more economically feasible. The internet has gotten .. smaller, lately.
Kagi relies on Google search.
But the results are still 1000x better than Google's. Something is being done there.
True in large part, but they've been diversifying their providers in the expectation that Google will shut everybody out.
The thing is, they were diversifying with Russia's Yandex... Which is worse for some.
Yes, yes, they use Yandex a little bit. They use lots of providers.

At this point with Google contributing so much to the Trump administration, I'm not sure which is worse.

Sure but we are talking about the UI here, not the index being used
But if Kagi manages to become a serious competitor in the search engine space, Google will cut them off from their index. Why wouldn't they?
Last I read, Kagi is using data from 3rd-party scraping of Google results, because buying directly from Google comes with onerous limitations:

- Must not alter the order of Google's search results
- Must not alter the appearance or placement of Google-inserted ads

It’s possible, but they would need to be massive to even start making a dent in Google's market share. And Google hasn’t blocked larger search engines from using its index.
They mostly use Bing, at least from my testing.
I've been using Startpage as my default search engine for a while now for any search where I actually need information and not sales or marketing bullshit.

When I use Google, usually from my phone, I am reminded of why I don't use Google on desktop.

With the announcement of this move by them, I just manually removed Google as an address bar search engine option in all my browsers on desktop and mobile.

Cloudflare seems like they have the capability to take this on.

Human-produced content should be separated from sites primarily hosting slop. That seems solvable?