| HN Mirror



	by mobilejdral 463 days ago
	I have several complex genetic problems that I give to LLMs to see how well they do. They have to reason through them to solve them. Last September one started getting close, and November was the first time an LLM was able to solve one. These are not problems that can be solved in one shot; (so far) they require long reasoning. Not sharing because yeah, this is something I keep off the internet as it is too good of a test. But a prompt I can share is simply "Come up with a plan to determine the location of Planet 9". I have received some excellent answers from that.

4 comments

tlb 463 days ago

There are plenty of articles online (and surely in OpenAI's training set) on this topic, like https://earthsky.org/space/planet-nine-orbit-map/.

Answer quality is a fair test of regurgitation and whether it's trained on serious articles or the Daily Mail clickbait rewrite. But it's not a good test of reasoning.


TZubiri 463 days ago

Recursive challenges are probably ones where the difficulty is not really representative of real challenges.

Could you answer a question of the type "what would you answer if I asked you this question?"

What I'm going after is that you might find questions that are impossible to resolve.

That said, if the only unanswerables you can find are recursive, is that a signal the AI is smarter than you?


mopierotti 463 days ago

The recursive one that I have actually been really liking recently, and I think is a real enough challenge is: "Answer the question 'What do you get when you cross a joke with a rhetorical question?'".

I append my own version of a chain-of-thought prompt, and I've gotten some responses that are quite satisfying and frankly enjoyable to read.


mopierotti 463 days ago

Here is an example of one such response in image form: https://imgur.com/a/Kgy1koi


econ 462 days ago

It needs a bit more reasoning as it does find the answer but doesn't notice it found it.

The answer is: A trick question.


mopierotti 462 days ago

Yeah. In the example I shared, my charitable interpretation would be that it's identifying the trick question as "a setup" where the punch line is the confusion the audience experiences. And in a meta sense, that would also describe the form of the entire chat.


econ 462 days ago

To state the obvious in case it wasn't: A trick question can be both a joke and a rhetorical question.


acrooks 463 days ago

Claude responded “Nothing.”


genewitch 463 days ago

"That look on your face, apparently"


latentsea 463 days ago

> what would you answer if I asked you this question?

I don't know.


namaria 463 days ago

If you have been giving the LLMs these problems, there is a nonzero chance that they have already been used in training.


rovr138 463 days ago

This depends heavily on how you use these and how you have things configured: whether you're using the API or the web UIs, and which plan you're on. On anything team or enterprise, training on your data is disabled by default. On personal plans it can be disabled.

Here are OpenAI's and Anthropic's pages on this:

https://help.openai.com/en/articles/5722486-how-your-data-is...

https://privacy.anthropic.com/en/articles/10023580-is-my-dat...

https://privacy.anthropic.com/en/articles/7996868-is-my-data...

And obviously, that doesn't include self-hosted models.


namaria 463 days ago

How do you know they adhere to this in all cases?

Do you just completely trust them to comply with self-imposed rules when there is no way to verify, let alone enforce, compliance?


blagie 463 days ago

They probably don't, but it's still a good protection if you treat it as a more limited one. If you assume:

[ ] Don't use

Doesn't mean "don't use," but "don't get caught," it still limits a lot of types of uses and sharing (any with externalities sufficient they might get caught). For example, if personal data was being sold by a data broker and being used by hedge funds to trade, there would be a pretty solid legal case.


namaria 463 days ago

> it still limits a lot of types of uses and sharing (any with externalities sufficient they might get caught)

I don't understand what you mean.

> For example, if personal data was being sold by a data broker and being used by hedge funds to trade

It's pretty easy to buy data from data brokers. I routinely get spam on many channels. I assume that my personal data is being commercialized often. Don't you think that already happens frequently?

I honestly would not put anything into a text box on the internet that I don't assume will become public information.

A few months ago some guy found discarded storage devices full of medical data for sale in Belgium. No data that is recorded on media you do not control is safe.


gvhst 463 days ago

SOC-2 auditing, which both Anthropic and OpenAI have done, does provide some verification.


diggan 463 days ago

That's interesting, how do I get access to those audits/reports given I'm just an end-user?


rovr138 463 days ago

You can fill out the form here: https://trust.openai.com/


namaria 463 days ago

The audit performed by a private entity called "Insight Assurance"?

Why do you trust it?


rovr138 463 days ago

Oh, so now EVERYTHING is fake unless personally verified by you in a bunker with a Faraday cage and a microscope?

You're free to distrust everything. However, the idea that “I don’t trust it so it must be invalid” isn’t a solid argument. It’s just your personal incredulity. You asked if there’s any verification and SOC-2 is one. You might not like it, but it's right there.

Insight Assurance is a firm doing these standardized audits. These audits carry actual legal and contractual risk.

So, yes, be cautious. But being cautious is different than 'everything is false, they're all lying'. In this scenario, NOTHING can be true unless *you* personally have done it.


golergka 463 days ago

What area is this problem from? What areas in general did you find useful for creating such benchmarks?

Maybe instead of sharing (and leaking) these prompts, we can share methods to create them.


mobilejdral 463 days ago

Think questions where there is a ton of existing medical research, but no clear answer yet. There are a dozen Alzheimer's questions you could ask, for example, which would require it to pull half a dozen contradictory sources into a plausible hypothesis. If you have studied Alzheimer's extensively, it is trivial to evaluate the responses. A question about Alzheimer's is one of my go-to questions. I am testing its ability to reason.


henryway 463 days ago

Can God create something so heavy that he can’t lift it?


abc-1 463 days ago

https://chatgpt.com/share/680ae04a-e360-8004-88fc-8426e8e700...


viraptor 463 days ago

There's so much text on this already that it's unlikely to even engage any reasoning. Or specifically, if you got a few existing answers from philosophy mashed together, you wouldn't be able to tell that apart from reasoning anyway.
