by buu700 292 days ago
Because the false positive rate is unacceptably high — we're talking about a standard, widely used character — and because if the heuristic becomes widespread enough to matter, then it will be trivially circumvented by bad actors anyway. Who is it helping if we collectively bully ourselves into excising a perfectly good punctuation mark from human language?

If anything, I'd rather that renderers like Markdown just all agree to change " - " to an en dash and " -- " to an em dash. Then we could put the matter to bed once and for all.
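For what it's worth, here's a rough sketch of what that renderer rule might look like. The function name `smart_dashes` is made up for illustration and isn't taken from any real Markdown library. It only rewrites a spaced hyphen when there is non-space text on both sides, so list markers at the start of a line are left alone. It does not try to skip code spans or fenced blocks, which a real renderer would need to do.

```python
import re

EN_DASH = "\u2013"
EM_DASH = "\u2014"


def smart_dashes(text: str) -> str:
    """Hypothetical helper: turn ' -- ' into an em dash and ' - ' into an en dash.

    The lookarounds require a non-space character on each side, so a
    leading '- item' or ' - item' list marker is not touched.
    """
    text = re.sub(r"(?<=\S) -- (?=\S)", f" {EM_DASH} ", text)
    text = re.sub(r"(?<=\S) - (?=\S)", f" {EN_DASH} ", text)
    return text


if __name__ == "__main__":
    print(smart_dashes("I love them -- the form I use, anyway."))
    print(smart_dashes("See pages 3 - 5."))
```

Hyphenated words like "well-known" pass through unchanged because the pattern requires a space on both sides of the hyphen.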

2 comments

Oh no, I'm not advocating ditching em-dashes. I love them -- the form I use, anyway.

I was just curious why you've decided paying attention to them is a bad heuristic. Sure, it can change once people instruct their LLMs not to use them, but still, for now, they sure seem to overuse them!

That and "let's unpack this". I swear, I'll forbid ChatGPT from using "unpack" ever again, in any context!

That's fair. It's not like I don't pay attention to it myself. It's more that I would never use the presence of em dashes in the absence of any other heuristics to predict whether or not something is LLM-generated, and it's a practically useless signal either way because I also wouldn't assume that content that used hyphens in place of dashes wasn't LLM-generated.

So the only real purpose of the heuristic is to add a tiny extra vote of confidence when I see a comment that otherwise appears to be lazy ChatGPT copypasta, but in such cases I'll predict that it was probably LLM output either way, and I'll judge that it appears to be poor writing that isn't worth my time regardless of whether or not an LLM was involved.

Fundamentally, the issue I'm seeing here is that we're all talking over each other because we need a better standardized term than "LLM output". I suppose "slop" could work if we universally agreed that it referred only to a subset of LLM output, rather than being synonymous with LLM output in general, but I'm not sure that we do universally agree on that.

If someone types the equivalent of a Google search into ChatGPT, or a spammer has an automated process generically reply to social media posts/comments, that's what qualifies to me as "slop". Most of us here have seen it in the wild by now, and there's obviously a distinctive common style (at least for now), and I think we can all agree that it sucks. That's very different from someone investing time and/or expertise to produce content that just happens to involve an LLM as one of the tools in their arsenal; the attitude that it isn't different is just the modern equivalent of considering cellular phone calls or typed letters to be "impersonal".

I'm not suggesting that LLM output doesn't tend to have a higher density of em dashes than human output. I'm just pushing back on the idea that presence of em dashes is sufficient evidence to dismiss something as probably-LLM-generated, which is no better than superstition. I mean, I've used em dashes in a number of comments in this thread, and no one has accused me of using an LLM, so it can't be a pattern that anyone puts too much stock in.

> the false positive rate is unacceptably high — we're talking about a standard, widely used character

Citation needed.

> Who is it helping if we collectively bully ourselves into excising a perfectly good punctuation mark from human language?

Humans can adapt faster than LLM companies, at least for the moment. We need to be willing to play to our strengths.

Who is it helping if we bully ourselves into ignoring a simple, easy "tell"?

> Citation needed.

https://en.wikipedia.org/wiki/Dash

> Humans can adapt faster than LLM companies

No one said anything about LLM companies. If I were a spammer today, I'd just have my code replace dashes in LLM output with hyphens before posting it. As a human, I'm not going to suddenly stop using dashes because a handful of people are treating a silly meme as if it were a genuinely useful heuristic.
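To show how little the heuristic holds up, here's a toy sketch. Both function names are hypothetical, and the "detector" is just the naive rule under discussion, not anything a real tool is known to use. A one-line substitution defeats it, while an ordinary human sentence with a dash still gets flagged.

```python
EN_DASH = "\u2013"
EM_DASH = "\u2014"


def naive_dash_heuristic(text: str) -> bool:
    """Hypothetical detector: flag text as LLM output if it contains an em dash."""
    return EM_DASH in text


def dashes_to_hyphens(text: str) -> str:
    """Swap en and em dashes for plain hyphens; everything else is untouched."""
    return text.replace(EM_DASH, "-").replace(EN_DASH, "-")


if __name__ == "__main__":
    generated = f"Great question {EM_DASH} let's unpack this."
    human = f"I love them {EM_DASH} the form I use, anyway."

    print(naive_dash_heuristic(generated))                     # flagged
    print(naive_dash_heuristic(dashes_to_hyphens(generated)))  # no longer flagged
    print(naive_dash_heuristic(human))                         # flagged anyway
```

The point of the sketch is only that the signal depends on a single character that is trivial to swap out, so it can't carry much weight on its own.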

> https://en.wikipedia.org/wiki/Dash

That maybe backs up the claim that it's standard, but not that it's widely used or that the false positive rate would be unacceptably high.

> If I were a spammer today, I'd just have my code replace dashes in LLM output with hyphens before posting it.

No you wouldn't, for the same reason spammers don't put more plausible stories in their emails: they want to filter for the most gullible segment before investing any human effort.

It's a standard punctuation mark available on Android/iOS/macOS keyboards, and automatically inserted into text by widely used software such as Microsoft Word. You guys are acting like it's an obscure Unicode character that GPT just spontaneously started using out of the blue, and ignoring the obvious answer that it's common in LLM output because it's common in training data. The burden of proof is on anyone claiming that it isn't common.

I was referring to social media spam. It would be a simple way to defuse people citing the use of dashes as "proof" that your spam was spam and having the hivemind bury it. You can't ensnare gullible readers if they never see your comment to begin with — not that following an absurd blanket rule of categorizing em dash usage as AI output has anything to do with whether or not the reader is gullible.