
	by ivraatiems 38 days ago
	Kind of Google to create a market opening for its competitors like this. I hope Kagi, Bing, and DuckDuckGo are taking notes.

7 comments

data-ottawa 38 days ago

reCaptcha is a pretty strong wall to allow only Google to index websites, especially now that you need device verification. Throw in Cloudflare too.

There’s not much room to squeeze in when your competitors hold the keys to 15 million top websites.

xmcp123 37 days ago

I write a lot of scrapers. Both of those are pretty trivial to bypass at scale.

HDBaseT 37 days ago

What about not at scale?

I find it wild that "at scale" we can bypass anti-bot measures, but just "normal" internet use (i.e. a non-Google browser or a VPN) will throw a million captchas at you.

CGNAT is pretty bad too.

xmcp123 37 days ago

Not at scale, what you’re seeing is a tiny, tiny fraction of the potential captchas that can be thrown at you. Normally “we have seen this cookie before” or “this browser does not have webdriver fingerprints” is sufficient to not get a captcha.

The big issue you sidestep when not at scale is IP reputation: you can come from a single residential IP with a good reputation.

Mandatory captchas for simply viewing a page are rare - most are saved for high impact actions like account creation.

When this does happen for a simple page view, AI is extremely good at solving basic captchas - especially basic “click the box” captchas.

If you don’t want to pay for AI, there are decaptcha services where someone in Southeast Asia solves the captcha for fractions of a penny. Save the cookies after a successful solve and you’re probably good in the future.

If you don’t want to pay for someone to solve a simple check-the-box captcha, a little bit of attention and some properly simulated clicking (i.e. not a JavaScript-injected event) will often work. Just don’t click literally the exact middle; fuzz the coordinates and you’re good.

einpoklum 37 days ago

> reCaptcha is a pretty strong wall to allow only Google to index websites

Why would website authors _want_ to prevent crawling by other search engines?

data-ottawa 37 days ago

Because there's been a string of bad actors including OpenAI with incredibly inefficient scrapers.

Previously captcha was just for spam limiting, but I actually looked at our system logs and about half of the traffic was badly behaved scrapers.

In the logs I see these scrapers hitting every link on the page. If you have a collection page, they hit every filter option, then each pagination button, the different sort orders, etc. For people running something like Forgejo, they will hit every commit.

If you have expensive-to-compute pages, they're getting hit by these incredibly naive bots that don't respect robots.txt or discriminate in what they fetch.
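
For contrast, here is a minimal sketch of what a better-behaved crawler could do before fetching anything: honor robots.txt and collapse URLs that differ only in sort or filter parameters. The parameter names in `NOISE_PARAMS`, the `example-bot` user agent, and both function names are assumptions made up for illustration. Only `urllib.parse` and `urllib.robotparser` from the Python standard library are real.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Hypothetical parameters that only re-sort or re-filter the same listing.
NOISE_PARAMS = {"sort", "order", "filter"}


def canonicalize(url: str) -> str:
    """Drop noise parameters and the fragment so equivalent pages collapse."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NOISE_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def plan_crawl(links, robots_lines, user_agent="example-bot"):
    """Return the URLs worth fetching: allowed by robots.txt and not duplicates."""
    rules = RobotFileParser()
    rules.parse(robots_lines)  # robots.txt content, already split into lines
    seen = set()
    plan = []
    for link in links:
        url = canonicalize(link)
        if url in seen or not rules.can_fetch(user_agent, url):
            continue
        seen.add(url)
        plan.append(url)
    return plan
```

With a robots.txt that disallows `/commits/`, a listing page linked under several sort orders would be planned once and the commit pages not at all. A real crawler would also need rate limiting and caching, which this sketch leaves out.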

nostrademons 38 days ago

The problem is that the web as we know it (useful, human-curated information that's put out there to help people) is also over. It's been totally overrun with AI slop. Even before AI could be used to create propaganda on a scale that we could only dream about 5 years ago, it had been declining under the weight of SEO sweatshops for a good 10 years. Meanwhile the people behind the actually decent content, the individual hobbyists who are just sharing their knowledge, have largely left under the weight of comment spam, DDoS attacks, and doxxing.

So if another search engine does arise, it won't find anything useful, because the useful content on the web has been buried under slop, and largely removed. Your best bet today is a curated directory, sorta like the original Yahoo, where you allowlist the web to only real sites, download them, and make them searchable. I think this is actually Kagi's approach. But the open web as we knew and loved it is dead.

ares623 37 days ago

Curators will become desirable again. The Devil Wears Prada 3.

torben-friis 38 days ago

My literal first thought was "do I seriously need to use bing now?".

Supermancho 38 days ago

Bing has been better than Google for some time. Again, it's embarrassing for them to sacrifice marketshare for paid results and an intermediate-form AI fad that will turn into the same paid result funnel.

mrweasel 38 days ago

Bing is surprisingly not too bad. I don't use it anymore, but it's been providing better results than Google for some time.

RyanOD 38 days ago

I hear people cite other search engines as "better" all the time. Better how?

edelbitter 38 days ago

E.g. for a two-keyword search, Google & DDG return results containing a keyword similar to the first one (but, at the moment, more popular, so I understand why they do this) and with no relation whatsoever to the second. Any search that manages to actually show results related to both of my input terms gets the "better" award from me.

sphars 38 days ago

There is a 99% chance (IMO) that Microsoft is going to go the same route as Google here

vitorsr 37 days ago

Microsoft has already gone down this road some three years ago...

https://blogs.microsoft.com/blog/2023/02/07/reinventing-sear...

endofreach 37 days ago

Also, usually, as soon as they realize they have a not-total-shit product, they immediately start to screw it up completely. So if Bing actually ends up being better, it won't be long until they replace every good part of it with something ridiculous. I don't know how Microsoft does it, but they are so incredibly good at that.

tedd4u 38 days ago

DuckDuckGo uses the Bing index/backend. I’ve had it as default for 5-8 years. Probably once a day I’ll add the !g to pop it over to Google. Works great. I search a lot, with many different types of queries. When I pop over to Google it’s usually a Boolean query looking for a needle in a haystack (that one comment somewhere where someone is using the same combination of two or three rare items together).

BrunoBernardino 38 days ago

While there are good options like DuckDuckGo, Mojeek, and Ecosia, there are plenty of (better) alternatives where you're not the product [1] that I'd recommend looking into!

[1]: https://alternativeto.net/software/google-search/?license=co...

unselect5917 37 days ago

I've been using Brave for years. And I'm in the process of moving off of gmail. Why bing of all engines?

Zigurd 38 days ago

I'm sure there's a niche for a product for search nerds. Something that leans into inverted indexes like the classic Lexis/Nexis search. But it's got to have Google-like coverage.
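
For anyone unfamiliar with the term, here is a toy sketch of an inverted index with strict AND matching, where a document is returned only if it contains every query term. It is an illustration under simplifying assumptions (whitespace tokenization, lowercase matching, everything in memory), and the function names are hypothetical. It is not a description of how Lexis/Nexis or any real engine works.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Return ids of documents containing every term in the query (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # .get avoids creating empty entries for unknown terms in the defaultdict.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)
```

The strictness is the point: a query with one unmatched term returns nothing at all, where a popularity-driven engine would quietly substitute something else.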

cortesoft 38 days ago

Niche + Google-like coverage is not very economically viable. To store and update a search index of that size requires a lot of resources, and being niche means you don’t have a lot of resources.

Very few of the smaller search engines actually do their own indexing for exactly this reason.

edelbitter 38 days ago

I wonder if the same coverage as before is now more economically feasible. The internet has gotten .. smaller, lately.

raincole 38 days ago

Kagi relies on Google search.

hootz 38 days ago

But the results are still 1000x better than Google's. Something is being done there.

baggachipz 38 days ago

True in large part, but they've been diversifying their providers in the expectation that Google will shut everybody out.

akazantsev 37 days ago

The thing is, they were diversifying with Russia's Yandex... Which is worse for some.

baggachipz 37 days ago

Yes yes, they use Yandex for a little bit. They use lots of providers.

At this point with Google contributing so much to the Trump administration, I'm not sure which is worse.

dgellow 38 days ago

Sure but we are talking about the UI here, not the index being used

raincole 38 days ago

But if Kagi manages to become a serious competitor in the search engine space, Google will cut them off from their index. Why wouldn't they?

IIsi50MHz 37 days ago

Last I read, Kagi is using data from 3rd-party scraping of Google results, because buying directly from Google comes with onerous limitations:

- Must not alter the order of Google's search results
- Must not alter the appearance or placement of Google-inserted ads

dgellow 38 days ago

It’s possible, but they would need to be massive to even start making a dent in Google's market share. And Google hasn’t blocked larger search engines from using their index.

AndroTux 38 days ago

They mostly use Bing, at least from my testing.

xerox13ster 38 days ago

I've been using Startpage as my default search engine for a while now for any search where I actually need information and not sales or marketing bullshit.

When I use google, usually from my phone, I am reminded of why I don't use google on desktop.

With the announcement of this move by them, I just manually removed google as an address bar search engine option in all my browsers on desktop and mobile.

kylehotchkiss 38 days ago

Cloudflare seems like they have the capability to take this on.

Human-produced content should be separated from sites primarily hosting slop. That seems solvable?
