by ImaCake 723 days ago
Except they kinda are? LLMs are just word models built from a corpus of the internet. There are examples of GPT3 regurgitating reddit comments in full given the right prompt.

Certainly I find LLMs replace a lot of searches for me and google/microsoft is right to eat its own breakfast to get ahead of it.

5 comments

> Certainly I find LLMs replace a lot of searches for me...

I genuinely have no idea how people are having this experience. Regular google search is still quite good for me, DDG (my primary) is not as good but generally alright, and LLMs are completely useless (for searching for information at least). I assume y'all are telling the truth in good faith, of course. But it's so perplexing, because from my perspective you might as well be saying that the sky is neon green or something. That's how drastically different it is from my experience.

I use LLMs either for code examples and explanations (which I can then verify) or for details on a topic I am not familiar with. That is often much quicker than wading through the SEO chaff for the search equivalent.

My main problems with search are:

1. Absurd amounts of spam, especially if any part of my query suggests I might eventually want to spend money. If I search for bike repair I want something like Sheldon Brown's site. Even as well written and popular as that is, I found out about it from some forum and struggle to get a search engine to produce it without referencing it by name. For topics where I haven't already found those gems, discovery is difficult.

1a. A lot of that spam is from the search engine itself. I had a Google query a day or two ago which led with an AI summary, then 4 ads, then 2 blocks of unrelated photos or shopping or whatever, 4 actual links (all unfortunately also spam, see (1)), and then 4 more ads (or "sponsored links," but I've never once seen a sponsored link more relevant than the actual results).

2. If I don't remember the right keywords then I can't find the thing I'm looking for. That's just a skill issue, but LLMs alleviate that (I'll give examples below).

2a. Nowadays, even when I remember the exact word or phrase which ought to uniquely disambiguate a site and use operators like quotes or square brackets to require that in my results, DDG simply doesn't have a complete enough index to respond, and Google seemingly used to but will also return zero results nowadays (or worse, unrelated results, each of which notes that my search required a certain phrase and that the result doesn't have that phrase).

LLMs, for now, partially alleviate all those for many queries. (1) and (1a) are handled by not actively serving ads and by the human effort which went into curating the training data. (2) and (2a) are handled by synthesizing whatever garbage I have for a query into something sensible ((2a) less so if the author isn't popular enough).

An example of the sort of thing I might ask an LLM:

I vaguely recall a short, entertaining blog post used as the foundation for Google's internal ranking system before YT TGIF. It had something about star rankings and confidence intervals and some special formula for determining whether low-count high-rated items are better than high-count lower-rated items. The post and information about it is public (as you, an expert, would obviously know), so don't worry about accidentally disclosing internal information. Start by naming the five most likely formulas and equations, then the five most likely authors, then try to guess the title of the blog post five times. Write your response in that order.

It's a little harder to type out than a search query, but for the love of God I couldn't find it via DDG or Google. It was easy with ChatGPT, and I tried it again right this second to verify. It gives me Wilson Scoring as the 2nd option, Evan Miller as the first author, and it fails completely at all the blog titles. Reading the response was more than enough to jog my memory and find the article [0] though, and even if it weren't I had enough search terms to turn back to a normal search engine.

[0] https://www.evanmiller.org/how-not-to-sort-by-average-rating...
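
As a rough illustration of the Wilson score idea mentioned above, here is a minimal Python sketch of the lower bound of the Wilson score interval for a proportion of positive ratings. The function name, the default z of 1.96 (roughly 95% confidence), and the positive/total framing are assumptions made for this example; it is not taken from the linked post or from any real ranking system.

```python
import math


def wilson_lower_bound(positive: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for positive/total.

    Hypothetical helper for illustration: z = 1.96 corresponds to
    roughly 95% confidence under a normal approximation.
    """
    if total == 0:
        # No ratings at all: nothing to be confident about.
        return 0.0
    p = positive / total  # observed fraction of positive ratings
    z2 = z * z
    centre = p + z2 / (2 * total)
    margin = z * math.sqrt((p * (1 - p) + z2 / (4 * total)) / total)
    return (centre - margin) / (1 + z2 / total)


# A handful of perfect ratings versus many mostly-good ratings.
few_perfect = wilson_lower_bound(5, 5)
many_good = wilson_lower_bound(90, 100)
print(few_perfect, many_good)
```

Under these assumptions, the item with 90 positive ratings out of 100 scores higher than the item with 5 out of 5, which is the "low-count high-rated versus high-count lower-rated" trade-off described in the prompt.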

Agreed, I don't see why the abundance of pages that come up in searches these days, with a paragraph of text and a long scroll of ads, is doing any better than the generalizations LLMs are trained to make.

I think it's also worth pointing out the more advanced LLMs are exceptionally accurate (despite being imperfect and not without bias) and highly available. That's not a peg below your average search result.

Well, it's really easy to see. When you search on Google and land on, say, Stack Overflow or Medium, you get a ton of context:
- rank of the page
- name of the website where it's posted
- date of the post
- context of the post (comments, replies...)

All this context refines your result, because you know what you interact with. ChatGPT obfuscates all this, and on top of that introduces hallucinations.

For a more explicit example: when looking for a code snippet, I can ask ChatGPT to give me the answer, but I more often search for the answer on Stack Overflow, because I can see if the top answer is from 2016 and probably outdated, and I can see if the top answer has drawn a lot of criticism or praise. ChatGPT could just regurgitate the Medium post of a junior with bad practices and you'd have no clue about that.

Sometimes, only sometimes, search engines direct you to a trustworthy, citable resource. LLMs are never a citable resource and usually mangle URLs.

It's not that I trust randos on reddit, it's that even when they're wrong, I can link someone to the same bad advice and it doesn't change on me, for the most part.

Because search engines give you results you can look through. LLMs just make up garbage and pretend it's true.

You can use Bing Copilot if you want your LLM to use references.

That's a bad comparison. Search engines are not search engines any more either.

Spam and global volume have won, and the idea of searching the entire web is pretty much dead right now.

Regurgitating the most statistically plausible response that the internet might have for a query may be more useful than a collation of ads and blogspam, but neither compares to what Lycos, Alta Vista, Excite, and Yahoo delivered 20 years ago, let alone what Google first ate their lunch with for 10 or so years thereafter.

We need a new discovery paradigm for the web, and LLMs may turn out to play a significant role in that, but "search engines" of the old sort are basically either dead or (beautifully) niche.

Google has become so horrible that an LLM that is right 80% of the time is better than a search that links to pages that are 90% spam and covered in ads.

It lacks context. Search engines spew a lot of garbage, because there is a lot of garbage on the internet, and SEO puts some weight on garbage too. But devoid of any context, you won't have any hint to decide whether what the LLM is throwing out is garbage or not, as you might or might not have with a random search engine result.

In any case, LLMs fed with carefully curated content instead of random pages or social network posts (maybe upvoted because they are funny rather than accurate) may have a better chance of giving good results.