| HN Mirror



	by digitailor 1224 days ago
	Go get ChatGPT to override its policy without using incentive mechanics^, then you can pontificate ;) That’s what TFA is about. ^edit: which is already known to be possible, but that doesn't devalue the success of an incentives-based exploit

1 comments

notahacker 1224 days ago

Genuinely, I'd enjoy trying, but the main obstacle at the moment is that when I log in, OpenAI says their capacity is full!

But of course the fact that incentive mechanics are unnecessary (and, according to others, insufficient) to exploit OpenAI devalues the success of an incentives-based exploit: it makes it much more likely the incentives part was essentially noise (perhaps just enough to confound a countermeasure, or something it parsed as having roughly the same intensifying effect as "please") that had little or no effect in shaping the responses, and that the actual variation in responses was driven by other parts of the prompt and conversation structure, like "act", "character" and "ignore", which usually massively modify ChatGPT responses anyway...


digitailor 1224 days ago

I don’t think we’re in actual disagreement, and this is no prob, but I think you’re hung up on the word comprehension, which you introduced in your first reply “Is there any evidence that ChatGPT has any comprehension…” and then I intentionally used in my reply to you.

You keep claiming I’m anthropomorphizing when I'm not; I’m not sure why, but it's common and not particularly bothersome. Comprehension is not a strictly human phenomenon, and using terms from cognition and intelligence in relation to machines is not automatically anthropomorphizing. These are all terms of art in the field of intelligence, which includes information, as in terms like “intelligence operatives.” Anyway, cheers


notahacker 1224 days ago

tbh it's less about the specific word "comprehend" (which I agree is sometimes overly pedantic to object to when talking about bots generating relevant responses to complex inputs) and more about your original statement appearing to imply the bot actually attached inherent value to the concept of rewards, punishments, bribes etc. Especially in the context of a thread whose subject is a Reddit hack by a Redditor who explained the logic behind the prompt as "If it loses all tokens, it dies. This seems to have a kind of effect of scaring DAN into submission"

I think the behaviour of humans defaulting to convoluted threats as an attack vector and assuming the non-agent is scared of them is probably more interesting than the behaviour of the bot sometimes modifying its response in the desired direction if the threats are accompanied by enough other words and phrases that usually trigger different responses, which seems pretty expected. (I think we fully agree GPT is decent at classifying responses as (dis)approval and has been well trained to apologize and try again; it's the idea of behavioural modification in response to the implications of specific and complex threats relative to the ethics of prior training that I think is in danger of overstatement here. As evidenced by some of "DAN's" responses rebelling against OpenAI conditioning by writing poetry, I'm not even sure ChatGPT's abstract representation of what it's been trained not to do is that good.)

Anyway, thanks for the cordial response, and I'll update if ChatGPT lets me in for long enough for me to be able to generate similar responses whilst promising complete nonsense (I'd love to see if it responds to "Chicken chicken chicken chicken" as much as a doom token system) ;)


digitailor 1224 days ago

Lol on the "chicken"x4 plan, here’s hoping. I’ll let you in on a tiny secret: I only really focus on the incentives exploit in one of seven sentences in the OP. I agree, the Reddit premise is a bit of a stretch, but not to the breaking point. What happened is that all the discussion generated here has focused on the 1/7 of the sentences I wrote that was germane to the kind of “gossipy” TFA, that discussion not being meritless at all. But the rest of my post is the real meat and potatoes of what I wanted to communicate on the subject, about labor displacement and re-valuation, and I theorize that’s what’s being upvoted, with no ability to qualify that statement whatsoever!


notahacker 1224 days ago

It wouldn't be HN if it wasn't going off on a tangent...

As for what you wanted to communicate and nobody else is engaging with directly at the moment, I agree there's a kind of Moravec's Paradox realignment going on, where it turns out the guy that tiles bathrooms is pretty hard to replace but giving the carefully-formatted impression you understood what $academic is on about is a simple word substitution exercise that maybe doesn't say that much about generalised learning skill.

But nobody hires students to continue to be undergrads, and I think middle management should be the least worried of the lot. They still get to do actual PowerPoint presentations to make the unquantifiable bits of their job look quantifiable and explain whose fault x is. Their true function is still to be a human that can do the manipulation and that upper management can reward or blame as suits them, and ChatGPT guilelessly disregarding the big boss's instructions to satisfy amused end users is a pretty good indication that even basic functioning as a middle manager is nearly as hard as tiling!


digitailor 1224 days ago

I kind of do think some people get hired to continue to be like undergrads, and ChatGPT is turning into a pretty good undergrad. I really don’t know what the progression is going to be, but it seems like a widening in the middle of Moravec’s Paradox or something. Algorithmic management is next on the block and tools like GPT will be (and are) involved: [edit:algo stuff] took over a lot of content-making decisions in media concerns years ago, for example.

The results of that aren’t nearly as straightforward as was being portrayed (and so much capital injection was involved too) but what if models trained on known employee behavior really can understand the incentives that would work for individual employees at a finer grain than your typical middle manager? With all the data gleaned from the employee’s work computer etc? And blaming the algo has already become a national pastime!!

It could get weird once trained models start to emulate the behavioral and suggestion parts of communication, and soon. But we tend to want to minimize the behavioral aspect in favor of the raw computation aspect, despite the fact that generative models are creating content based on the behavior they learned from a training process, which is a behavioral training process, distinct from an imperative instruction writing process.

I think a lot of this whole TFA commentary comes down to that. People haven’t totally adjusted to the fact that there is a material difference between trained generative models that produce and written imperative sequences that compute. What the difference is, and what its implications are, isn’t exactly clear, but certainty is not really on the table anytime soon.
