| HN Mirror

by lambda 224 days ago

I'm a significant genAI skeptic.

I periodically ask them questions about topics that are subtle or tricky, and somewhat niche, that I know a lot about, and find that they frequently provide extremely bad answers. There have been improvements on some topics, but there's one benchmark question that I have that just about every model I've tried has completely gotten wrong.

Tried it on LMArena recently, got a comparison between Gemini 2.5 flash and a codenamed model that people believe was a preview of Gemini 3 flash. Gemini 2.5 flash got it completely wrong. Gemini 3 flash actually gave a reasonable answer; not quite up to the best human description, but it's the first model I've found that actually seems to mostly correctly answer the question.

So, it's just one data point, but at least for my one fairly niche benchmark problem, Gemini 3 Flash has successfully answered a question that none of the others I've tried have (I haven't actually tried Gemini 3 Pro, but I'd compared various Claude and ChatGPT models, and a few different open weights models).

So, guess I need to put together some more benchmark problems, to get a better sample than one, but it's at least now passing an "I can find the answer to this in the top 3 hits in a Google search for a niche topic" test better than any of the other models.

Still a lot of things I'm skeptical about in all the LLM hype, but at least they are making some progress in being able to accurately answer a wider range of questions.

7 comments

prettyblocks 224 days ago

I don't think tricky niche knowledge is the sweet spot for genai and it likely won't be for some time. Instead, it's a great replacement for rote tasks where a less than perfect performance is good enough. Transcription, ocr, boilerplate code generation, etc.

lambda 224 days ago

The thing is, I see people use it for tricky niche knowledge all the time; using it as an alternative to doing a Google search.

So I want to have a general idea of how good it is at this.

I found something that was niche, but not super niche; I could easily find a good, human written answer in the top couple of results of a Google search.

But until now, all LLM answers I've gotten for it have been complete hallucinated gibberish.

Anyhow, this is a single data point, I need to expand my set of benchmark questions a bit now, but this is the first time that I've actually seen progress on this particular personal benchmark.

ozim 224 days ago

That’s riding the hype machine and throwing the baby out with the bathwater.

Get an API and try to use it for classification of text or classification of images. If you have an Excel file with 10k somewhat random-looking entries that you want to classify, or filter down to the 10 that matter to you, use an LLM.

Get it to do audio transcription. You can now just talk and it will take notes for you at a level that was not possible earlier without training on someone's voice; it can handle anyone’s voice.

Fixing up text is of course also big.

Data classification is easy for an LLM. Data transformation is a bit harder but still great. Creating new data is hard, so when answering questions where it has to generate stuff from thin air it will hallucinate like a madman.

The tasks that LLMs are good at are used in the background by people creating actually useful software on top of LLMs, but those uses are not seen by the general public, who only see a chat box.
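A minimal sketch of the classification use case described above, assuming the model is reachable through some `llm(prompt) -> str` callable. That callable, the label set, and `fake_llm` are all hypothetical stand-ins, not any vendor's API. The one design point it tries to show is constraining the model to a fixed label list and rejecting anything off-list, so a made-up label cannot slip into the results.

```python
from typing import Callable, Iterable

LABELS = ("important", "not_important")  # hypothetical label set


def classify_rows(
    rows: Iterable[str],
    llm: Callable[[str], str],  # assumption: any function mapping prompt -> reply
    labels: tuple[str, ...] = LABELS,
) -> list[tuple[str, str]]:
    """Ask `llm` for one label per row; replies outside `labels` become 'unknown'."""
    results = []
    for row in rows:
        prompt = (
            "Classify the entry into exactly one of: "
            + ", ".join(labels)
            + ".\nAnswer with the label only.\nEntry: "
            + row
        )
        answer = llm(prompt).strip().lower()
        results.append((row, answer if answer in labels else "unknown"))
    return results


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call, so the sketch runs offline.
    entry = prompt.rsplit("Entry: ", 1)[1]
    return "Important" if "outage" in entry else "not_important"
```

Swapping `fake_llm` for a real client call is the only change needed to try this against an actual model; rows that come back as `unknown` are the ones worth a manual look.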

illiac786 224 days ago

But people using the wrong tool for a task is nothing new. Using Excel as a database (still happening today), etc.

Maybe the scale is different with genAI and there are some painful learnings ahead of us.

mikepurvis 224 days ago

And Google themselves obviously believe that too as they happily insert AI summaries at the top of most serps now.

ComputerGuru 224 days ago

Or maybe Google knows most people search inane, obvious things?

coldtea 224 days ago

Or more likely Google couldn't give a rat's arse whether those AI summaries are good or not (except to the degree that people don't flee it), and what it cares about is that they keep users on Google itself, instead of clicking off to other sources.

After all, it's the same search engine team that didn't care about its search results - its main draw - actively going to shit for over a decade.

vitorgrs 224 days ago

Google AI Overview gets obvious things wrong a lot of the time, so... lol

They probably use an old Flash Lite model, something super small, and just summarize the search...

mikepurvis 224 days ago

Those summaries would be far more expensive to generate than the searches themselves so they're probably caching the top 100k most common or something, maybe even pre-caching it.

katzenversteher 224 days ago

I also use niche questions a lot but mostly to check how much the models tend to hallucinate. E.g. I start asking about rank badges in Star Trek which they usually get right and then I ask about specific (non existing) rank badges shaped like strawberries or something like that. Or I ask about smaller German cities and what's famous about them.

I know that without the ability to search it's very unlikely the model actually has accurate "memories" about these things; I just hope one day they will actually know that their "memory" is bad or nonexistent and they will tell me so instead of hallucinating something.

Europas 224 days ago

I'm waiting for properly adjusted specific LLMs. An LLM trained on so much trustworthy generic data that it is able to understand/comprehend me and different languages but always talks to a fact database in the background.

I don't need an LLM to have a trillion parameters if I just need it to be a great user interface.

Someone is probably working on this somewhere, or will, but let's see.

ozim 224 days ago

Second this.

Basically, making sense of unstructured data is super cool. I can get 20 people to write an answer the way they feel like it and a model can convert it to structured data - something I would otherwise have to spend time on, or I would have to make a form with mandatory fields that annoy the audience.

I am already building useful tools with the help of models. Asking tricky or trivia questions is fun and games. There are much more interesting ways to use AI.

DeathArrow 224 days ago

Well, I used Grok to find information I forgot about, like product names, films, books and various articles on different subjects. Google search didn't help, but putting the LLM to work did the trick.

So I think LLMs can be good for finding niche info.

DrewADesign 224 days ago

Yeah, but tests like that deliberately prod the boundaries of its capability rather than how well it does what it’s good at.

andai 224 days ago

So this is an interesting benchmark, because if the answer is actually in the top 3 google results, then my python script that runs a google search, scrapes the top n results and shoves them into a crappy LLM would pass your benchmark too!

Which also implies that (for most tasks), most of the weights in an LLM are unnecessary, since they are spent on memorizing the long tail of Common Crawl... but maybe memorizing infinite trivia is not a bug but actually required for the generalization to work? (Humans don't have far transfer though... do transformers have it?)
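For reference, the "search, scrape the top n, shove into an LLM" script mentioned above might look roughly like this. It is a sketch, not the commenter's actual script: `search`, `fetch`, and `llm` are hypothetical callables passed in by the caller, so no particular search API, scraper, or model client is assumed.

```python
from typing import Callable


def answer_from_search(
    question: str,
    search: Callable[[str, int], list[str]],  # hypothetical: (query, n) -> result URLs
    fetch: Callable[[str], str],              # hypothetical: URL -> page text
    llm: Callable[[str], str],                # hypothetical: prompt -> completion
    n: int = 3,
    max_chars: int = 4000,
) -> str:
    """Fetch the top `n` results and ask the model to answer from them only."""
    urls = search(question, n)
    # Truncate each page so the combined context stays within a rough budget.
    pages = [fetch(url)[:max_chars] for url in urls]
    context = "\n\n".join(
        f"[{i + 1}] {url}\n{page}" for i, (url, page) in enumerate(zip(urls, pages))
    )
    prompt = (
        "Answer the question using only the sources below. "
        "If they do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```

With this shape, the benchmark mostly measures whether the model can read and explain the retrieved pages, not whether it memorized the topic.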

lambda 224 days ago

I've tried doing this query with search enabled in LLMs before, which is supposed to effectively do that, and even with that they didn't give very good answers. It's a very physical kind of thing, and it's easy to conflate with other similar descriptions, so they would frequently just conflate various different things and give some horrible mash-up answer that wasn't about the specific thing I'd asked about.

andai 224 days ago

So it's a difficult question for LLMs to answer even when given perfect context?

Kinda sounds like you're testing two things at the same time then, right? The knowledge of the thing (was it in the training data and was it memorized?) and the understanding of the thing (can they explain it properly even if you give them the answer in context).

jve 224 days ago

Counter point about general knowledge that is documented/discussed in different spots on the internet.

Today I had to resolve performance problems for some SQL Server statement. Been doing it for years, know the regular pitfalls, sometimes have to find the "right" words to explain to the customer why X is bad and such.

I described the issue to GPT5.2, gave the query, the execution plan and asked for help.

It was spot on: high-quality responses with actionable items and explanations of why this or that is bad, how to improve it, and why SQL Server in particular may have generated such a query plan. I could instantly validate the response given my experience in the field. I even answered the customer with some parts of ChatGPT's response, given how well it explained things. However, I did mention that to the customer, and I did tell them I approve the answer.

I asked a high-quality question and received a high-quality answer. And I am happy that I found out about a SQL Server flag with which I can influence a particular decision. But the suggestion was not limited to that; there were multiple points given that would help.

fragmede 224 days ago

Even the most magical wonderful auto-hammer is gonna be bad at driving in screws. And, in this analogy I can't fault you because there are people trying to sell this hammer as a screwdriver. My opinion is that it's important to not lose sight of the places where it is useful because of the places where it isn't.

pretzellogician 224 days ago

Funny, I grew up using what's called a "hand impact screwdriver"... turns out a hammer can be used to drive in screws!

TeodorDyakov 224 days ago

Hi. I am curious what was the benchmark question? Cheers!

Turskarama 224 days ago

The problem with publicly disclosing these is that if lots of people adopt them they will become targeted to be in the model and will no longer be a good benchmark.

lambda 224 days ago

Yeah, that's part of why I don't disclose.

Obviously, the fact that I've done Google searches and tested the models on these means that their systems may have picked up on them; I'm sure that Google uses its huge dataset of Google searches and search index as inputs to its training, so Google has an advantage here. But, well, that might be why Google's new models are so much better: they're actually taking advantage of some of this massive dataset they've had for years.

grog454 224 days ago

This thought process is pretty baffling to me, and this is at least the second time I've encountered it on HN.

What's the value of a secret benchmark to anyone but the secret holder? Does your niche benchmark even influence which model you use for unrelated queries? If LLM authors care enough about your niche (they don't) and fake the response somehow, you will learn on the very next query that something is amiss. Now that query is your secret benchmark.

Even for niche topics it's rare that I need to provide more than 1 correction or knowledge update.

nl 224 days ago

I have a bunch of private benchmarks I run against new models I'm evaluating.

The reason I don't disclose isn't generally that I think an individual person is going to read my post and update the model to include it. Instead it is because if I write "I ask the question X and expect Y" then that data ends up in the train corpus of new LLMs.

However, one set of my benchmarks is a more generalized type of test (think a parlor-game type thing) that actually works quite well. That set is the kind of thing that could be learnt via reinforcement learning very well, and just mentioning it could be enough for a training company or data provider company to try it. You can generate thousands of verifiable tests - potentially with verifiable reasoning traces - quite easily.
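To illustrate what "thousands of verifiable tests" can mean in practice, here is a sketch of a generator paired with a programmatic checker. The letter-counting task is an assumed stand-in chosen for this example; it is not the parlor-game benchmark described above, which is not disclosed.

```python
import random


def make_case(rng: random.Random) -> dict:
    """Build one puzzle whose correct answer is computed, not hand-written."""
    text = "".join(rng.choice("abcde") for _ in range(rng.randint(10, 30)))
    letter = rng.choice("abcde")
    return {
        "prompt": f"How many times does '{letter}' appear in '{text}'? "
                  "Answer with a number only.",
        "answer": text.count(letter),
    }


def make_cases(n: int, seed: int = 0) -> list[dict]:
    # A fixed seed makes the suite reproducible; a new seed yields a fresh suite.
    rng = random.Random(seed)
    return [make_case(rng) for _ in range(n)]


def check(case: dict, model_output: str) -> bool:
    """Verify a reply mechanically: it must be a bare number equal to the answer."""
    reply = model_output.strip()
    return reply.isdigit() and int(reply) == case["answer"]
```

Because both the prompt and the check are generated, the same few lines could just as easily produce a reward signal for reinforcement learning, which is the concern raised in the comment.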

grog454 224 days ago

Ok, but then your "post" isn't scientific by definition since it cannot be verified. "Post" is in quotes because I don't know what you're trying to do, but you're implying some sort of public discourse.

For fun: https://chatgpt.com/s/t_694361c12cec819185e9850d0cf0c629

eru 224 days ago

I didn't see anyone claiming any 'science'? Did I miss something?

nl 224 days ago

As ChatGPT said to you:

> A secret benchmark is: Useful for internal model selection

That's what I'm doing.

Turskarama 224 days ago

The point is that it's a litmus test for how well the models do with niche knowledge _in general_. The point isn't really to know how well the model works for that specific niche. Ideally of course you would use a few of them and aggregate the results.

akoboldfrying 224 days ago

I actually think "concealing the question" is not only a good idea, but a rather general and powerful idea that should be much more widely deployed (but often won't be, for what I consider "emotional reasons").

Example: You are probably already aware that almost any metric that you try to use to measure code quality can be easily gamed. One possible strategy is to choose a weighted mixture of metrics and conceal the weights. The weights can even change over time. Is it perfect? No. But it's at least correlated with code quality -- and it's not trivially gameable, which puts it above most individual public metrics.
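A rough sketch of the concealed-weights idea, under some assumptions made for this example: every metric is already normalized to the range 0 to 1 with higher meaning better, and the weights are derived from a secret seed. The class name and the metric names in the usage note are hypothetical.

```python
import random


class ConcealedScore:
    """Weighted mixture of metrics whose weights are derived from a secret seed."""

    def __init__(self, metric_names, seed):
        rng = random.Random(seed)
        # The +0.1 floor keeps every weight strictly positive, so no metric
        # can be ignored outright by someone gaming the score.
        raw = [rng.random() + 0.1 for _ in metric_names]
        total = sum(raw)
        self._weights = {name: w / total for name, w in zip(metric_names, raw)}

    def score(self, metrics):
        # Assumes each metric value is normalized to [0, 1], higher is better.
        return sum(self._weights[name] * metrics[name] for name in self._weights)
```

Usage would be something like `ConcealedScore(["coverage", "lint", "review"], seed=SECRET).score(metrics)`; rotating the seed periodically gives the "weights can change over time" variant.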

grog454 224 days ago

It's hard to have any certainty around concealment unless you are only testing local LLMs. As a matter of principle I assume the input and output of any query I run in a remote LLM is permanently public information (same with search queries).

Will someone (or some system) see my query and think "we ought to improve this"? I have no idea since I don't work on these systems. In some instances involving random sampling... probably yes!

This is the second reason I find the idea of publicly discussing secret benchmarks silly.

grog454 224 days ago

I learned in another thread there is some work being done to avoid contamination of training data during evaluation of remote models using trusted execution environments (https://arxiv.org/pdf/2403.00393). It requires participation of the model owner.

theshrike79 224 days ago

Because it encompasses the very specific way I like to do things. It's not of use to the general public.

kridsdale3 224 days ago

If they told you, it would be picked up in a future model's training run.

jacobn 224 days ago

Don't the models typically train on their input too? I.e. submitting the question also carries a risk/chance of it getting picked up?

I guess they get such a large input of queries that they can only realistically check and therefore use a small fraction? Though maybe they've come up with some clever trick to make use of it anyway?

nl 224 days ago

OpenAI and Anthropic don't train on your questions if you have pressed the opt-out button and are using their UI. LMArena is a different matter.

jerojero 224 days ago

they probably don't train on inputs from testing grounds.

you don't train on your test data because you need to have that to compare if training is improving or not.

energy123 224 days ago

Given they asked it on LMArena, yes.

lambda 224 days ago

Yeah, probably asking on LMArena makes this an invalid benchmark going forward, especially since I think Google is particularly active in testing models on LMArena (as evidenced by the fact that I got their preview for this question).

I'll need to find a new one, or actually put together a set of questions to use instead of just a single benchmark.

_heimdall 224 days ago

Is that an issue if you now need a new question to ask?

Marazan 224 days ago

Here's my old benchmark question and my new variant:

"When was the last time England beat Scotland at rugby union"

new variant "Without using search when was the last time England beat Scotland at rugby union"

It is amazing how bad ChatGPT is at this question and has been for years now across multiple models. It's not that it gets it wrong - no shade, I've told it not to search the web so this is _hard_ for it - but how badly it reports the answer. Starting from the small stuff - it almost always reports the wrong year, wrong location and wrong score - that's the boring facts stuff that I would expect it to stumble on. It often creates details of matches that didn't exist, cool standard hallucinations. But even within the text it generates itself it cannot keep it consistent with how reality works. It often reports draws as wins for England. It frequently states the team that it just said scored most points lost the match, etc.

It is my ur-example for when people challenge my assertion that LLMs are stochastic parrots or fancy Markov chains on steroids.
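The self-contradictions described above (draws reported as wins, the higher-scoring side reported as the loser) can be checked mechanically once a reply has been reduced to two scores and a claimed winner. A small sketch follows; extracting those fields from free text is assumed to happen elsewhere, and the function name is made up for this example.

```python
def is_consistent(score_a, score_b, claimed_winner,
                  team_a="England", team_b="Scotland"):
    """Return True if the claimed winner matches the reported scores.

    `claimed_winner` is a team name, or None when the reply claims a draw.
    """
    if score_a == score_b:
        expected = None          # equal scores can only be a draw
    elif score_a > score_b:
        expected = team_a
    else:
        expected = team_b
    return claimed_winner == expected
```

This only tests whether the reply agrees with itself; it says nothing about whether the reported match ever took place.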

arisAlexis 224 days ago

can you give us an example of this niche knowledge? I highly doubt there is knowledge that is not inside some internet training material.

vitaflo 224 days ago

I also have my own tricky benchmark that up till now only Deepseek has been able to answer. Gemini 3 Pro was the second. Every other LLM fails horribly. This is the main reason I started looking at G3pro more seriously.