| HN Mirror


	by Athari 946 days ago
	I don't consider Anthropic's approach to safety fantastic. They train the model to lie, play cat and mouse with jailbreakers, run moderation on generations with a delay, etc. This makes the model appear safer, as it's harder to jailbreak, but this approach solves nothing fundamentally. If Ilya is concerned about safety and alignment, he probably has a better chance to get there with OpenAI, now that he has more control over it.

2 comments

dalore 945 days ago

Anthropic's safety is overboard. I tried the classic question "How many holes does a straw have?" and it refused to talk about the topic. I'm assuming that's because it thought "holes" was sexual.

JBiserkov 945 days ago

Given what AIs "know" about humanity, I think it's safe to assume that they "think" every word is sexual. For example straw could be short for strawman, which is a man, which is sexual. Or it can be innuendo for... you know.

As for your actual question, it seems to me that a straw is topologically equivalent to a torus, so it has 1 hole, right?

TeMPOraL 945 days ago

> it seems to me that a straw is topologically equivalent to a torus, so it has 1 hole, right?

For a mathematician, yes. For everyone else, it obviously has two, because only when you plug one end does it have one.

visarga 945 days ago

When did you last try that? I checked right now and it says

> A straw has one hole that runs through its entire length.

dalore 944 days ago

Now follow up with: how many holes do trousers have?

PH95VuimJjqBqy 945 days ago

that sentence makes no sense to me, what is a straw here?

didntcheck 945 days ago

I haven't paid a lot of attention to Anthropic. Are you able to summarize, or link anything about, those events for those who missed it? Particularly the "training to lie" bit

Athari 945 days ago

David Shapiro complained about Anthropic's approach to alignment. In his video https://www.youtube.com/watch?v=PgwpqjiKkoY he discusses ableism, moralism, and lying.

As to cat-and-mouse with jailbreakers, I don't remember any thorough articles or videos. It's mostly based on discussions on LLM forums. Claude is widely regarded as one of the best models for NSFW roleplay, which completely invalidates Anthropic's claims about safety and alignment being "solved."
