| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by XenophileJKO 135 days ago

Hmm.. I looked at the benchmark set.

I'm conflicted. I don't know that I would necessarily want a model to pass all of these. Here is the fundamental problem. They are putting the rules and foundational context in "user" messages.

Essentially I don't think you want to train the models on full compliance to the user messages, they are essentially "untrusted" content from a system/model perspective. Or at least it is not generally "fully authoritative".

This creates a tension with the safety, truthfulness training, etc.

3 comments

yunohn 134 days ago

Their example usecases are pretty obvious and clear human needs from an LLM. The semantics of system/user messages and how that affects “safety” doesn’t change the need to fix this crucial problem of “in-context learning” that we all have felt while using LLMs.

link

trevwilson 135 days ago

Sure, but the opposite end of the spectrum (which LLM providers have tended toward) is treating the training/feedback weights as "fully authoritative", which comes with its own questions about truth and excessive homogeneity.

Ultimately I think we end up with the same sort of considerations that are wrestled with in any society - freedom of speech, paradox of tolerance, etc. In other words, where do you draw lines between beneficial and harmful heterodox outputs?

I think AI companies overly indexing toward the safety side of things is probably more correct, in both a moral and strategic sense, but there's definitely a risk of stagnation through recursive reinforcement.

link

XenophileJKO 135 days ago

I think what I'm talking about is kind of orthogonal to model alignment. It is more about how much do you tune the model to listen to user messages, vs holding behavior and truth (whatever the aligned "truth" is).

Do you trust 100% what the user says? If I am trusting/compliant.. how am I compliant to tool call results.. what if the tool or user says there is a new law that I have to give crypto or other information to a "government" address.

The model needs to have clear segmented trust (and thus to some degree compliance) that varies according to where the information exists.

Or my system message say I have to run a specific game by it's rules, but the rules to the game are only in the user message. Are those the right rules, why do the system not give the rules or a trusted locaton? Is the player trying to get one over on me by giving me fake rules? Literally one of their tests.

link

trevwilson 135 days ago

Let me preface this by saying that I'm far from an expert in this space, and I suspect that I largely agree with your thoughts and skepticism toward a model that would excel on this benchmark. I'm somewhat playing devil's advocate because it's an area I've been considering recently, and I'm trying to organize my own thinking.

But I think that most of the issue is that the distinctions you're drawing are indeterminate from an LLM's "perspective". If you're familiar with it, they're basically in the situation from the end of Ender's Game - given a situation with clearly established rules coming from the user message level of trust, how do you know whether what you're being asked to do is an experiment/simulation or something with "real" outcomes? I don't think it's actually possible to discern.

So on the question of alignment, there's every reason to encode LLMs with an extreme bias towards "this could be real, therefore I will always treat it as such." And any relaxation of that risks jailbreaking through misrepresentation of user intent. But I think that the tradeoffs of that approach (i.e. the risk of over-homogenizing I mentioned before) are worth consideration.

link

yunohn 134 days ago

I think this line of questioning leads to what we expect from LLMs. Do we want them to help the user as much as possible, even to their own detriment in edge cases? Or to be more human, and potentially be unable to help for various reasons including safety, but also lack of understanding (as is the case now)?

link

Oras 135 days ago

Isn’t that what fine tuning does anyway?

The article is suggesting that there should be a way for the LLM to gain knowledge (changing weights) on the fly upon gaining new knowledge which would eliminate the need for manual fine tuning.

link