HN Mirror


by bunderbunder 751 days ago

The entire field of information retrieval is still here. This was touched on by the O'Reilly article on lessons learned working with LLMs that hit the HN front page yesterday [1], in their section on RAG.

My sense is that you can currently break the whole thing down into two groups: the proverbial grownups in the room are typically building pipelines that are still doing it basically how the top-performing systems did in the '90s, with a souped-up keyword and metadata search engine for the initial pass and an embedding model for catching some stuff it misses and/or result ranking. This isn't how most general-purpose search engines work, but it's likely how the ones you don't particularly mind using work. Web search, for example.
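
To make the two-stage shape concrete, here is a minimal sketch, not anyone's production pipeline. The keyword pass is a bare TF-IDF overlap score standing in for a real search engine, and `toy_embed` is a hypothetical stand-in for an embedding model (hashed character trigrams). The names `hybrid_search`, `first_pass_k` and `final_k` are made up for illustration, and this only covers the reranking use of embeddings, not the "catching stuff it misses" part.

```python
import math
import re
import zlib
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_scores(query, docs):
    """First pass: simple TF-IDF overlap (stand-in for a real keyword engine)."""
    doc_counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    scores = []
    for counts in doc_counts:
        score = 0.0
        for term in set(tokenize(query)):
            df = sum(1 for c in doc_counts if term in c)
            if df and term in counts:
                score += counts[term] * math.log(1 + n / df)
        scores.append(score)
    return scores


def toy_embed(text, dims=64):
    """Hypothetical embedding: hashed character trigrams, NOT a real model."""
    vec = [0.0] * dims
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        vec[zlib.crc32(padded[i:i + 3].encode("utf-8")) % dims] += 1.0
    return vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query, docs, first_pass_k=3, final_k=2):
    """Keyword search picks candidates; embedding similarity reranks them."""
    kw = keyword_scores(query, docs)
    candidates = sorted(
        (i for i in range(len(docs)) if kw[i] > 0),
        key=lambda i: kw[i],
        reverse=True,
    )[:first_pass_k]
    q_vec = toy_embed(query)
    reranked = sorted(
        candidates,
        key=lambda i: cosine(q_vec, toy_embed(docs[i])),
        reverse=True,
    )
    return reranked[:final_k]
```

The point of the split is that the cheap, exact first pass bounds what the embedding step can get wrong: a document with no keyword match never reaches the reranker in this sketch.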

And then there's the proverbial internet comments section, which wants to skip past all the boring labor-intensive oldschool stuff, and instead just begin and end with approximate nearest neighbors search using an off-the-shelf embedding model. The primary advantage to this approach - and I should admit here that I've tried it myself - is that you can bodge it together over a weekend and have the blog post up by Monday.
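
For contrast, the weekend version looks roughly like the sketch below. It does exact brute-force nearest-neighbor search rather than an approximate index (an ANN library would trade a little accuracy for speed at scale), and `toy_embed` is again a hypothetical stand-in for an off-the-shelf embedding model, not a real one.

```python
import math
import zlib


def toy_embed(text, dims=64):
    """Hypothetical embedding: hashed character trigrams, NOT a real model."""
    vec = [0.0] * dims
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        vec[zlib.crc32(padded[i:i + 3].encode("utf-8")) % dims] += 1.0
    return vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, docs, k=3):
    """Embed everything, rank by cosine similarity, return (index, score) pairs."""
    q_vec = toy_embed(query)
    scored = [(i, cosine(q_vec, toy_embed(d))) for i, d in enumerate(docs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Note that this always returns k results, however poor the match, because there is no keyword or metadata filter to say "nothing relevant here".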

I guess what I'm getting at is, the people producing content on the Internet and the people producing effective software aren't necessarily the same people. I mean, heck, look at me, I'm only here to type this comment because I'm slacking off at work today.

1: https://www.oreilly.com/radar/what-we-learned-from-a-year-of...


pradn 751 days ago

Your comment makes a lot of sense to me.

What I wonder though is - we're a year and a half into the LLM craze and we still don't see a really good information processing system for them. Yes, there are chatbots, some that let you throw in images and PDFs.

But what we need is more like a ground-up rethink of these UIs. We need to invent the "desktop" of LLMs.

But the keys here, I think, are that

a) the LLMs are only part of the solution. A chat interface is immature and not enough.

b) external information is brought in by the user, and augmented by a universe of knowledge given by the provider

c) being overly general is probably a trap. Yes, LLMs can talk about everything - but why not solve a concrete vertical?

Semantic search helps with a part of this, but is just one component.


bunderbunder 750 days ago

Also, frankly, I don't think a chat interface is good UX. People are having fun with it right now because it's novel. But human-human interaction doesn't use natural language because it's somehow ideal; we rely on it due to hardware limitations. We don't have the same set of limitations in human-computer interaction. And we also have a lot of history (as in, literally all of history) demonstrating that, even when talking to each other, humans quickly start straying away from pure natural language interaction whenever their communication is modulated by a technology that allows for additional options.

You can even see some of this play out a bit over the course of the web's nearly 30-year history. 20 years ago, informational websites tended to be brief, highly structured, and minimally chatty. Nowadays, people produce walls of text that you have to dig through to find the actual content. Why the change? Search engine optimization. Which I'd argue is an example of essentially the same folks who give us AI basically dragging us back to a world where natural language dominates. Not because it's actually better for anyone, but because it's what they can more easily build a one-size-fits-all algorithm around.


pradn 750 days ago

Part of the reason why LLM summaries are so attractive IS a UI problem. The economics of the web have led to every publisher stuffing their websites with ads. No one wants that. It's much nicer to see a clean paragraph of text.

But we clearly have an ouroboros situation. If publishers lose views, they lose money and the ability to craft good information. That means less new info to incorporate into LLMs.

LLM training over the internet corpus has really been a massive heist. Pulling the wool over publishers' eyes, undercutting their business, hoarding the information.

But it's really unavoidable at this point. Everything has been democratized: compute on cloud platforms, data via Common Crawl, OSS algorithms and tool-kits. No one can put a stop to this, and there are powerful economic incentives to actually get some benefit out of the hundreds of billions that have been poured in already.
