by mobilejdral 416 days ago
I have several complex genetic problems that I give to LLMs to see how well they do. They have to reason through each one to solve it. Last September they started getting close, and November was the first time an LLM was able to solve one. These are not something that can be solved in one shot, but (so far) require long reasoning. Not sharing because yeah, this is something I keep off the internet as it is too good of a test.

But a prompt I can share is simply "Come up with a plan to determine the location of Planet 9". I have received some excellent answers from that.

4 comments

There are plenty of articles online (and surely in OpenAI's training set) on this topic, like https://earthsky.org/space/planet-nine-orbit-map/.

Answer quality is a fair test of regurgitation and whether it's trained on serious articles or the Daily Mail clickbait rewrite. But it's not a good test of reasoning.

Recursive challenges are probably ones where the difficulty is not really representative of real challenges.

Could you answer a question of the type "what would you answer if I asked you this question?"

What I'm going after is that you might find questions that are impossible to resolve.

That said, if the only unanswerables you can find are recursive, that's a signal the AI is smarter than you?

The recursive one that I have actually been really liking recently, and I think is a real enough challenge is: "Answer the question 'What do you get when you cross a joke with a rhetorical question?'".

I append my own version of a chain-of-thought prompt, and I've gotten some responses that are quite satisfying and frankly enjoyable to read.

Here is an example of one such response in image form: https://imgur.com/a/Kgy1koi
It needs a bit more reasoning as it does find the answer but doesn't notice it found it.

The answer is: A trick question.

Yeah. In the example I shared, my charitable interpretation would be that it's identifying the trick question as "a setup" where the punch line is the confusion the audience experiences. And in a meta sense, that would also describe the form of the entire chat.
To state the obvious in case it wasn't: A trick question can be both a joke and a rhetorical question.
Claude responded “Nothing.”
"That look on your face, apparently"
> what would you answer if I asked you this question?

I don't know.

If you have been giving the LLMs these problems, there is a non-zero chance that they have already been used in training.
This depends heavily on how you use these and how you have things configured: whether you're using the API or the web UI, and which plan you're on. On anything team or enterprise, training on your data is disabled by default. On personal plans it can be disabled.

Here are OpenAI's and Anthropic's policies:

https://help.openai.com/en/articles/5722486-how-your-data-is...

https://privacy.anthropic.com/en/articles/10023580-is-my-dat...

https://privacy.anthropic.com/en/articles/7996868-is-my-data...

and obviously, that doesn't include self-hosted models.

How do you know they adhere to this in all cases?

Do you just completely trust them to comply with self-imposed rules when there is no way to verify, let alone enforce, compliance?

They probably don't, but it's still a good protection if you treat it as a more limited one. If you assume:

[ ] Don't use

Doesn't mean "don't use," but "don't get caught," it still limits a lot of types of uses and sharing (any with externalities sufficient they might get caught). For example, if personal data was being sold by a data broker and being used by hedge funds to trade, there would be a pretty solid legal case.

> it still limits a lot of types of uses and sharing (any with externalities sufficient they might get caught)

I don't understand what you mean

> For example, if personal data was being sold by a data broker and being used by hedge funds to trade

It's pretty easy to buy data from data brokers. I routinely get spam on many channels. I assume that my personal data is being commercialized often. Don't you think that already happens frequently?

I honestly would not put anything into a text box on the internet that I don't assume is becoming public information.

A few months ago some guy found discarded storage devices full of medical data for sale in Belgium. No data that is recorded on media you do not control is safe.

SOC-2 auditing, which both Anthropic and OpenAI have done, does provide some verification.
That's interesting, how do I get access to those audits/reports given I'm just an end-user?
You can fill out the form here: https://trust.openai.com/
The audit performed by a private entity called "Insight Assurance"?

Why do you trust it?

Oh, so now EVERYTHING is fake unless personally verified by you in a bunker with a Faraday cage and a microscope?

You're free to distrust everything. However, the idea that “I don’t trust it so it must be invalid” isn’t a solid argument. It’s just your personal incredulity. You asked if there’s any verification and SOC-2 is one. You might not like it, but it's right there.

Insight Assurance is a firm doing these standardized audits. These audits carry actual legal and contractual risk.

So, yes, be cautious. But being cautious is different than 'everything is false, they're all lying'. In this scenario, NOTHING can be true unless *you* personally have done it.

What area is this problem from? What areas in general did you find useful for creating such benchmarks?

Maybe instead of sharing (and leaking) these prompts, we can share methods to create one.

Think questions where there is a ton of existing medical research, but no clear answer yet. For example, there are a dozen Alzheimer's questions you could ask that would require it to pull half a dozen contradictory sources into a plausible hypothesis. If you have studied Alzheimer's extensively, it is trivial to evaluate the responses. A question about Alzheimer's is one of my go-to questions. I am testing its ability to reason.
Can God create something so heavy that he can’t lift it?
There's so much text on this already, it's unlikely to be even engaging any reasoning. Or specifically, if you got a few existing answers from philosophy mashed together, you wouldn't be able to tell it apart from reasoning anyway.