by boardwaalk 1234 days ago
I know it’s easy to throw stones from the outside, but Google’s results are so compromised it seems like it’s a good time to get back in.

As just one example, I searched for a unique error message in code that exists on GitHub, is in a fairly popular repo, and is not new, and Google just could not find it. That seems like a very basic failure.

15 comments

I was just searching for an old friend of mine whose last name happens to be a substring of another common last name. I tried everything: quotes, + signs, - signs, middle initials, middle names, cities we lived in together, etc.

Every single returned link after the first 3 had the superstring version of the name and not the correct name. It turns out that this returns endless results for a fairly well known singer, not my friend.

So not only did I not get the results I was looking for, I got tons of results that were objectively wrong.

Then suddenly, about 6 pages into those results, I started getting ones for the correct last name, but now the first name is a mess.

This happened on Google, DDG, Baidu, Sogou, Haosou, Dogpile, the current Yahoo search, Bing, and to some extent on Yandex. Naver was worse, Daum totally worthless with incorrect results.

Utterly worthless.

The thing is, my friend's name is surprisingly unique; there are probably fewer than 20 people in the world with that specific name. It's like the search engine's desire to fill the screen with worthless garbage results has overpowered the need to supply the 2 or 3 that are actually correct, even if the quantity is a little disappointing.

I would honestly pay at minimum $10 a month to a search engine startup that focuses on the top 10k, then top 100k Alexa sites, and does good indexing of top sites. If I google something programming related, give me all the stackoverflow you find relevant. I don't even care about image search, that can come later. I think the world has room for a search engine competitor, I'm just not sure what it would look like, but I hope someone is working on something that isn't just a repeat of hot garbage.
Just a suggestion (I'm not a subscriber but am investigating the service): Kagi might come close to that. You can up- or down-rank sites to boost or reduce their visibility in your searches. It would take some time to get going, but eventually I think one would end up with a much better and almost curated result set.

Generally though these days I'm trying to distance myself from Google so if anyone has any other search engine suggestions (beyond the usual DDG, Bing, Yandex, Neeva) I'm very open.

I think this is okay so long as you can toggle it off, since it applies userbase bias, which isn't guaranteed to be neutral. Because it is bias in aggregate, anything remotely ideological in nature would be skewed one way or the other.

I do like this idea however.

My understanding is that it's just for you, not based on what every user of Kagi is choosing. Looking in the settings, though, I'm not sure if you can disable this feature on an ad hoc basis, which would be useful now and then if you want to get a fresh set of results and get out of your own biases.
Negating some string with a `-` prefix works in my experience. But haven't come across a case with superstrings, can't think of an example to try too.
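To make the operators mentioned in this thread concrete, here is a minimal sketch of how a quoted phrase, a `site:` restriction, and `-` exclusions combine into one query string. The names `build_query` and `search_url` and the example surnames are made up for illustration; only the operator syntax comes from the comments here, and nothing is implied about how faithfully any engine honors it.

```python
from urllib.parse import urlencode


def build_query(phrase, site=None, exclude=()):
    """Compose a query from an exact phrase, an optional site: filter, and -terms."""
    parts = []
    if site:
        parts.append(f"site:{site}")
    parts.append(f'"{phrase}"')                   # quotes request an exact match
    parts.extend(f"-{term}" for term in exclude)  # a leading '-' negates a term
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    """URL-encode the query into the q parameter of a search URL."""
    return f"{base}?{urlencode({'q': query})}"


# Hypothetical substring/superstring surnames, not the ones from this thread.
print(build_query("Jane Smith", exclude=["Smithson"]))
print(search_url(build_query("memeplex", site="news.ycombinator.com")))
```

Whether the `-Smithson` term actually suppresses the superstring results is exactly what is in dispute above; the sketch only shows what is being asked of the engine.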
Try neeva?
Nope, same shitty results.

I've basically decided that my friend's name is going to be my search engine quality test from now on, the results are so spectacularly terrible.

All I want is something to crawl the web, suppress SEO spam, and let me "search" on things exactly as I've quoted them. Like we used to in 1998.

I've found that with Google I have to use verbatim mode to get it to do anything vaguely sensible.

Synonymization is the worst thing to ever happen to search, and it keeps getting more and more aggressive.

Would you mind providing details like the search query and link to the page you expect to be found?

To test your hypothesis, I did a basic search for exact matches on "we do not synchronize on the update of the broker node" and Google returned 2 search results in 240ms:

- https://github.com/a0x8o/kafka/blob/master/core/src/main/sca...

- https://jar-download.com/artifacts/org.apache.kafka/kafka_2....

Both contain exactly the source code from GitHub that I was looking for. You'll notice that the first result is actually a0x8o's fork of apache/kafka. Google states that some entries very similar to the 2 already displayed were omitted, and I'm able to remove that filter. With that filter removed, I can see the same document indexed from apache/kafka on GitHub.

There's nothing I can do or promise directly, but I can assure you that Google takes the quality of our search results very seriously. If you believe we're not delivering quality results, I strongly encourage you to click that "Send Feedback" link at the bottom of your results so that our teams can act upon your feedback.

Disclosure: I work on Search at Google.

Disclaimer: The words, views, and opinions expressed in this post are my own. They are not representative nor do they represent my employer in any capacity.

I don't know how common this is, but in my 12 years using this site this is the first time I've seen a Google employee address a customer regarding a product they work on.

Congrats and hope Google takes advantage of HN, similar to how startups use this forum to engage with users - it is now a meme that Google Search is unusable so there must be something to learn from the audience.

I will use the send feedback button tomorrow as you suggest.

Thank you for the kind words. Long time HN member here like you (going on 11 years) that recently started working on Search as a SWE.

Yes, that meme is very common. I hope I can contribute positively to these discussions by offering an outlet for feedback, and humanizing our organization. Google’s Search organization is large, so it’s certainly not monolithic, but we’re staffed with a bunch of normal, hardworking, genuine human beings like most companies, that care about the impact we’re having.

I’m happy you’ve found some value in our discussion. :)

There are cases where Google doesn't return anything close to all known exact matches.

1. Most large classic forums using vbulletin. Try picking any rare word or phrase with less than 100 total matches via the forum's search tool and compare to the Google verbatim results.

2. This very site, searching for an uncommon word such as "memeplex" returns hundreds of unique results according to hn.algolia.com, but only 65 according to Google via site:news.ycombinator.com "memeplex".

3. Fanfiction sites such as fanfiction.net. Try randomly picking an obscure 'fandom' with only a few hundred stories, and search for the name of one of the main protagonists. It will only retrieve a small fraction of all the existing stories that mention the protagonist's name.

EDIT: I originally had another example involving macrumors.com but then realized there was a formatting mistake in the search query.

Hey Michael, I appreciate the effort you put into describing a few examples:

1. If you could link to specific examples and queries that’d be super helpful for someone like me that’s not active on the forums you’re describing.

2. Algolia is a fuzzy matching search engine. Searching for memeplex [1] returns matches like “megaplex”, “memepher”, “meeples”, etc. Unchecking typo tolerance in the settings returns < 100 results in line with Google’s results.

3. Again, if you could link to specific examples and queries that’d be helpful.

[1]: https://hn.algolia.com/?dateRange=all&page=19&prefix=true&qu...
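To illustrate why typo-tolerant matching inflates the count, here is a small sketch using plain Levenshtein edit distance. The tolerance of two edits and the word list are assumptions for the example, and this is not Algolia's actual matching logic, only the general idea of fuzzy matching.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def fuzzy_hits(query, words, max_typos=2):
    """Words within max_typos edits of the query (assumed tolerance)."""
    return [w for w in words if edit_distance(query, w) <= max_typos]


# Sample words: the near-misses named above plus one unrelated word.
words = ["memeplex", "megaplex", "memepher", "meeples", "complex"]
print(fuzzy_hits("memeplex", words))               # fuzzy: near-misses count as hits
print(fuzzy_hits("memeplex", words, max_typos=0))  # exact matches only
```

With a tolerance of two edits, "megaplex", "memepher", and "meeples" all count as hits for "memeplex"; with a tolerance of zero, only the exact word remains.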

Thanks for looking into it Denzel.

I'm not quite sure what you mean in 2. I see every exact match highlighted in a rectangular box in a different color. Do you not see that on your end?

Just counting the exact matches, there are well over 100 unique results.

On the other hand, why does one of the richest companies on earth, which can afford to hire the smartest people on earth, resort to unpaid volunteers on a site like HN to fix their product?

Don't they use their own tools? Is there an internal search engine that everyone uses at Google? Are they trying to gaslight us pretending there's no problem? Can't they hire a hundred people to use Google search and report what they found?

Sure, props to that person for engaging with the userbase, but we're not talking about an obscure bug here. Every day there are dozens of complaints about Google search on HN alone. Surely we're talking about low-hanging fruit in terms of bug reproduction.

You should try to assume good faith when engaging with a person who’s just like yourself.

First, we can agree that Google Search is attempting to solve an astronomically hard problem. Like, mind-bogglingly hard. Indexing the entire web and serving quality results to unstructured queries from billions of users every day in under one second is no small feat.

Second, Google is not monolithic. We employ more people than most cities have citizens. Furthermore, many more people than our current staff have come and gone over our 20+ year existence. It’s better to think of Google as an organic entity than a rigid command-and-control hierarchy. Are you able to think of a city in the world that does everything perfectly? I certainly can’t, and yet, there are cities that are better and those that are worse for any set of criteria that one may care about. As it is with large companies like Google.

Third, while there’s an objective element to search result quality, there’s also a significant amount of subjectivity. Your idea of quality results may differ from another person’s idea.

Check out Paul Haahr’s talk on “Improving Search over the Years” [1]. He summarizes our work best when he says that things that look easy on the outside can take a lot of work to implement.

As it was with our state-of-the-art automatic synonym system that works on any written language in our corpus. (More details in his presentation.) This system is a transparent workhorse from the user’s perspective.

Here’s a simple example you can compare between Google, Bing, and DuckDuckGo: “united flight formations”.

Two of those search engines will show a bunch of things about United Airlines as top results because that’s what you would expect to get when you’re only focused on matching terms. Only one of those search engines understands the meaning behind the query and returns everything to do with formation flying as the top results.

If you use our products and you mostly enjoy our products, it’s in your best interest to give feedback when you feel we’re not serving your needs. You’ll find that most of our products, Search included, have open feedback channels that we do review and act upon.

[1]: https://youtu.be/DeW-9fhvkLM

I'm sure you're a real person deserving of respect and love. If I say that the search results are terrible, it's not a comment on your humanity or that of your colleagues. People have genuine problems with Google and a reasonable expectation, based on experience, that they won't get any joy by trying to appeal to big G. You can say that you're just flesh and blood, but don't discount the well-founded displeasure of users.
> Google states that some entries very similar to the 2 already displayed were omitted, and I'm able to remove that filter.

I've definitely seen that sort of thing before, but there is no such link there at the moment, at least not when searching from my iPhone, whether or not I'm in desktop mode. I just see a large error box that says "It looks like there aren't many great matches for your search" followed by the link to the a0x8o fork.

By the way, the a0x8o result highlights a serious problem with search results: the GitHub URL is strangely modified. Instead of showing the full URL or even a prefix leading up to it, Google is selecting parts of the URL, showing "https://github.com > src > transaction" on mobile and "https://github.com > kafka > coordinator > transaction" when I request the desktop site. In neither case is it obvious that the content isn't the canonical source from Apache. I've noticed this middle-out truncation for GH URLs before but I'm not sure when it started.

How often do people use the send feedback button? How many of the reports are looked at?
I’m not authorized to disclose data that’s not public knowledge.

What I can say is that we have a feedback process in place for Google Search that we use to improve our product. You can send feedback and check the box to allow our teams to contact you if you’re interested in a follow up. Of course, given our scale, we’re not able to follow up on every bit of feedback but that doesn’t mean we don’t review or act upon that feedback in some way.

Yes, I remember several years ago --- more like 8 now(!) --- easily finding results in GitHub repos whenever I've needed to look up error codes and such. Now even site:github.com doesn't (and if you try too hard, you get the hellban for a while).

Another extremely noticeable degradation is in finding part numbers, IC markings, service manuals (NOT the useless user manual), schematics, and the like. Anything that proponents of right-to-repair would be extremely interested in, to the extent that I wonder if there's been some sort of conscious effort being made by certain interests to eliminate or limit such information.

Then there's the niche-but-legal adult content. I won't go into too much detail about that, but suffice to say it used to be far easier to find.

It's been 5 years since this notorious item here, and I've only seen Google get worse: https://news.ycombinator.com/item?id=16153840

Sundar Pichai has so McKinsey-fied and MBA-fied Google that at this point Google search seems like an A/B test to deliver the best-targeted ad. You're probably better off using any other search engine, including Yahoo.
I have the feeling that whatever you're talking about is explicitly not crawlable.
Yeah, the GitHub robots.txt is surprisingly restrictive:

https://github.com/robots.txt

   User-agent: *

   Disallow: /*/pulse
   Disallow: /*/tree/
That "/*/tree" rule means that search engine crawlers are allowed to hit the README file of a repo but effectively NONE of the other files in it.

Which means that if you keep your project documentation on GitHub in a docs/ folder it won't be indexed!

You need to publish it to a separate site via GitHub Pages, or use https://readthedocs.org/

(Side note: I just noticed https://github.com/ekansa/Open-Context-Data is explicitly listed in the robots.txt for GitHub - the only repo that gets a mention like that. I'd love to know the story behind that!)

That repo apparently used to be the largest on GitHub: https://news.ycombinator.com/item?id=5912922. I bet Google was repeatedly scraping the entire thing and putting too much strain on their servers at the time it was added. It's been 10 years, what are the odds nobody at GitHub today remembers why it was added?

Also, very relatable to see a decade old "I'll update this shortly" comment that was never updated. We all have a few of those.

It appears that the creator of the repo actually confirmed this: https://twitter.com/ekansa/status/1137052076062650368
/*/tree is only for directory listings. File contents will be under a /blob/ path, e.g. https://github.com/facebook/react/blob/main/AUTHORS, and should be, AFAIK, indexable.

(mandatory disclaimer: I'm a GitHub employee, not speaking on behalf of the company)
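For anyone who wants to check the tree-versus-blob distinction themselves, here is a rough sketch of wildcard rule matching. It assumes the common convention that `*` matches any run of characters, a trailing `$` anchors the end, and a rule otherwise matches as a prefix; it only covers the two rules quoted above, not GitHub's full robots.txt, and `is_blocked` is a made-up helper rather than any crawler's real logic.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path rule into a regex.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end of the path, otherwise the rule is a prefix match.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


# Only the two Disallow rules quoted in this thread.
DISALLOW = ["/*/pulse", "/*/tree/"]


def is_blocked(path: str, rules=DISALLOW) -> bool:
    """True if any Disallow rule matches the path."""
    return any(rule_to_regex(rule).match(path) for rule in rules)


for path in (
    "/facebook/react",                    # repo landing page
    "/facebook/react/tree/main/packages", # directory listing
    "/facebook/react/blob/main/AUTHORS",  # file contents
):
    print(path, "blocked" if is_blocked(path) else "allowed")
```

Under these assumptions the directory listing is blocked while the repo page and the `/blob/` file are allowed, which matches the distinction drawn above. It says nothing about how a crawler would discover the `/blob/` URLs in the first place.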

I asked about this on the support forum a while ago and never got a satisfactory response: https://github.com/community/community/discussions/20958
If they can't hit `/*/tree` is there a way to know the URLs of the files?
Direct links from crawlable pages
Sure, clone the git repo.
GitHub would not be happy with Google cloning all repos, and many of them at a high frequency, in order to circumvent a robots.txt restriction.
There's also two users:

    Disallow: /account-login
    Disallow: /Explodingstuff/
The first for obvious reasons, the second probably because they've uploaded nothing of substance besides a copy of WannaCry.
A public git repository is definitely crawlable. Google seems to have given up actively going out of their way to index things that are hard to crawl as they got so big and important it was easier to just tell people "thou must do X or we won't index you and you want to be indexed", but increasingly the content I want to find is in weird little silos.
Curious: if I had the list of repos, is there anything besides licensing (i.e. rate limiting or throttling) that forbids me from running `while read -r url; do git clone "$url" data; ./train data; rm -rf ./data; done`? Similar question: the code search across all repos in the GitHub UI gets throttled pretty fast, so what do people do? (Not suggesting in a hundred(?) years that anyone do the while loop for this, though ;))
That doesn’t change anything regarding the actual point of the comment.
Your idea for a search competitor is to ignore robots.txt?
or an advertising competitor that ignores DNT!

oh wait

Yes, clearly that's the best possible interpretation of what I said. /s
This seems perfectly logical to some people:

Google ignores robots.txt: "Google is evil! They're trespassing on my webserver!"

Google follows robots.txt: "Google search results suck! They're not indexing GitHub!"

It's just a text file.
Seems like the giants that were nearly synonymous with "Internet" - Google and Amazon, are rapidly deteriorating and creating a massive market opportunity.

Pure speculation, but innovative companies at first, they started over-hiring and bloating, using questionable interviewing techniques (puzzles, Leetcode), taking on thousands of employees who were just there to game the system, coast, and collect the check.

It just looks like they stopped caring.

and it just straight up ignores keywords even when there's matches containing all of them. google has become so much worse, and yes part of it is that there's a ton of spam, which is also a problem, but it has also gotten worse in other respects too
> As just one example, I searched for a unique error message in code that exists on GitHub, is in a fairly popular repo, and is not new, and Google just could not find it. That seems like a very basic failure.

I have recently almost completely stopped using Google's search engine because I am very often offered zero search results for simple queries (usually involving quotes, though). It's so bad I can't even believe it.

Note: I've been a Google search user since it started... Gmail since Beta, etc...

At one point, I thought that maybe they started punishing ad-block users excessively.

Right now searching on Google is way worse than Yahoo, Altavista or Ask Jeeves.

Not only do we get tons of ads back as first results, Google keeps rewriting the queries into whatever "helpful" nonsense.

> but Google’s results are so compromised

I read this constantly here (echo chamber) and I can't help but feel it's a little biased/overdramatic.

Honestly, I'd be all for using DDG exclusively, but I find myself doing !g (their Google redirect operator) when I don't find what I want on DDG, and it's almost always the top result on Google. And this happens daily.
Most tech companies have ruined their products now. They’ll have 10,000 engineers and 15 iterations of the UI but you try and buy a hard drive and it’s a box with an SD card taped inside.

It’s time for competitors to start wiping them out.

Why do they even bother with the SD card?
Seems like the worst time, unless they're doing so with ChatGPT and the like. What regular search lacks is context and a natural way of refining queries by adding context that doesn't always work well with keywords.
Not only have they become compromised from a technical standpoint, for some searches in particular, the results have been modified to be heavily politically biased and woke.
Don’t downvote them until you check out Frank Zappa’s discography.
My guess is that there is now so much ML black-box shit going on in the search algo that no one can reasonably tell you why it returns what it does.
Do you think Google cares that it's losing its edge? How do they not know it's getting worse?