Hacker News
by mbanerjeepalmer 153 days ago
> Universities are increasingly turning to AI to spot AI-written work (even as students use services like Dumb it Down to make their AI-fuelled work sound more believable). It can be detected. Chris Caren, the boss of Turnitin, a popular plagiarism detector, describes plagiarised prose as “beige”: “well-written, but not very dynamic”. It has verbal tics: it is keen on dreary words like “holistic” and notably keen on “notably”.

I don't think you can say that AI-written work can be reliably detected. Turnitin is only ~90% effective: https://teaching.temple.edu/sites/teaching/files/media/docum...

8 comments

I tried a lot of these tools, including Turnitin, and I think they are all wrong. Not because they are a bad implementation, but just because the problem is naturally impossible in a lot of cases.

There are people whose style is closer to AI; that doesn't mean they used AI. And sometimes AI outputs text that looks like something a human would write.

There is also the mix: if I write two pages and I used two sentences by AI (because I was tired and I couldn't find the right sentence), I may be flagged for using AI. Even worse, if I ask AI for advice and then I rewrite it myself, what would be the output? I can make an argument that both labels (AI-written and not AI-written) would be wrong.

> There is also the mix: if I write two pages and I used two sentences by AI (because I was tired and I couldn't find the right sentence), I may be flagged for using AI.

None of these tools are binary. They give a percentage score, a confidence score, or both.

If you include one AI sentence in a 100-sentence essay, your essay will be flagged as 1% AI and nobody will bat an eye.

They are not binary, but in my experience the score isn't linear either. It isn't that they assign a score to each sentence and then aggregate.

It's not, but the fact that one sentence deserves a high score doesn't automatically mean that the entire thing will be flagged as a false positive. Unless it's like, two sentences in total.

Yeah, and to be blunt, beige and not dynamic is how I would describe most student writing done entirely by the human. I just don't see how a model, trained on a vast corpus of such writing, could ever be successfully and reliably distinguished from human writing. You can distinguish good writing from so-so writing, that's about it.

In an educational context, the only purpose of the writing has traditionally been learning, and the purpose of turning it in has been to prove that the learning took place. Both of those are out the window now. Classroom discussion and oral presentations might be the only place you can still prove learning took place. Until everybody gets hidden AI-powered earpieces of course.

I take suspicious student papers and feed them to Turnitin, as well as the popular LLMs. "Hey ChatGPT, give me a report on the likelihood that this paper was generated by an LLM." Do that with Gemini, Claude, etc.

Then if there's a high probability, I look through the references in the paper. Do they say what the student attributes to them?

Finally, if I still think it's AI-generated, I have the student in and ask questions about the paper. "You said this here in this paragraph -- what do you mean by that?"

AI detectors are a first-pass, but I think a human really needs to be in the loop to evaluate whether it's cheating, or just using something to clean up grammar and spelling.

> [can’t] be reliably detected… only ~90% effective

I’m surprised to see these comments in conjunction, 90% is pretty good, and much higher than i expected. I wonder what’s the breakdown of false positives/false negatives

Edit: from the linked paper

> Of the 90 samples in which AI was used, it correctly identified 77 of them as having >1% AI generated text, an 86% success rate. The fact that the tool is more accurate in identifying human-generated text than AI-generated text is by design. The company realized that users would be unwilling to use a tool that produced significant numbers of false positives, so they “tuned” the tool to give human writers the benefit of the doubt.

This all seems exceptionally reasonable. Of the samples with AI, they correctly identify 86%. Of the samples without AI, they correctly identify a higher proportion, because of the nature of their service. This implies that if they _wanted_ to make a more balanced AI detection tool, they could get that 86% somewhat higher.

> I’m surprised to see these comments in conjunction, 90% is pretty good, and much higher than i expected.

What standard of proof is appropriate to expel someone from college? After they've taken on, say, $40,000 of debt to attend?

Assuming you had a class of 100 students, "90% effective" would mean expelling 10 students wrongly - personally I'd expect a higher standard of proof.

Anyone expelling a student over a single “ai” label from Turnitin alone is a complete idiot. Perhaps that happens occasionally, but that’s clearly the result of horrible decision making that isn’t really Turnitin's fault.

Anyone who gives 10 seconds of thought to how this could help realizes at 90% it’s a helpful first pass. Motivated students who really want to hide can probably squeak past more often than you’d like. And you know there will be false positives so you do something like:

- review those more carefully, or send it to a TA if you have one to do so
- keep track of patterns of positives from each student over time
- explain to the student it got flagged, say it’s likely a false positive, and have them talk over the paper in person

I’m sure decent educators can figure out how to use a tool like that. The bad ones are going to cause stochastic headaches for their students regardless.

That's not what 90% effective means. Tests don't work that way.

Tests can be wrong in two different ways, false positive, and false negative.

The 90% figure (which people keep rounding up from 86% for some reason, so I'll use that number from now on) is the sensitivity, or the ability to not have false negatives. If there are 100 cheaters, the test will catch 86 of them, and 14 will get away with it.

The test's false positive rate, how often it says "AI" when there isn't any AI, is 0%, or equivalently, the test's "specificity" is 100%.

> Turnitin correctly identified 28 of 30 samples in this category, or 93%. One sample was rated incorrectly as 11% AI-generated[8], and another sample was not able to be rated.

The worst that would have happened according to this test is that one student out of 30 would be suspected of AI generating a single sentence of their paper. None of the human authored essays were flagged as likely AI generated.
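
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the two rates using the counts quoted from the paper above. The function and variable names are made up for illustration, and treating the one unrated human sample as excluded (and the 11% sample as not flagged) is my assumption, not something the paper states as its method.

```python
def sensitivity(true_positives, false_negatives):
    """Share of AI-assisted samples that the tool flagged."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Share of human-written samples that the tool did not flag."""
    return true_negatives / (true_negatives + false_positives)


# Counts quoted in the thread: 77 of 90 AI samples were identified.
ai_samples = 90
ai_flagged = 77

# 30 human samples, one of which could not be rated (assumption: exclude it).
human_rated = 29
# Assumption: the sample scored 11% AI is not counted as "flagged as AI".
human_flagged = 0

sens = sensitivity(ai_flagged, ai_samples - ai_flagged)
spec = specificity(human_rated - human_flagged, human_flagged)

print(f"sensitivity: {sens:.0%}")  # prints "sensitivity: 86%"
print(f"specificity: {spec:.0%}")  # prints "specificity: 100%"
```

If you decide the 11% sample should count as a false positive, set `human_flagged = 1` and the specificity drops to 28 of 29.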

Expulsions don’t happen. International students have been cheating rampantly for decades. Universities are happy enough to collect their tuition.

My son, who just finished his first semester at college, said the thing that surprised him the most was the blatant cheating all around him. He said it is rampant and obvious, and the professors don't seem all that eager to punish it. It pisses him off, because it puts him at a disadvantage because he doesn't want to cheat.

It's from a culture of people who cheat to get ahead, because they come from a society SO competitive, SO cutthroat, and SO obsessed with education & testing that cheating is encouraged and rewarded...because it's rewarded in the workplace, in the broader economy (up to a point), and in the political body.

Of course, there's also the Chinese, who cheat because they are international students paying several multiples of the tuition and the university doesn't want to upset that gravy train in the wake of several federal funding cuts. Also because your rank-and-file Chinese students at most American colleges suck at speaking English so they, except in pure STEM, need to cheat in order to pass in the first place.

Problem is when the professors are being assessed on how the students do, instead of how honestly they assess their performance, there's a lot of disincentive to root out cheating. Universities have generally been marking their own homework on this front for a long time, and their morphing into a business which sells degrees has turned this conflict of interest into a real problem.

> 90% is pretty good, and much higher than i expected.

Problem with that at scale is those that might skirt by within that 10% might one day be your doctor, your lawyer, or your accountant and you'd never know until it bit you in the ass.

You can read the linked article, they break down their analysis in detail. Seems like low false positives at least.

Edit: thanks for doing so

> Turnitin is only ~90% effective:

No it isn't. Stop.

The cynical part of me says that the people who share this link with that summary are the cheaters trying to avoid getting caught, on the basis of the fact that they are patently abusing the numbers presumably because they didn't pay attention in math class.

The tests are 90% SENSITIVE. That means that of 100 AI cheaters, 10 won't be caught.

The paper you linked says the tests are 100% SPECIFIC. That means they will *never* flag a human-written paper as mostly AI.
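
To make the difference concrete, here is a rough sketch of what those two numbers imply for a hypothetical group of 100 cheaters and 100 honest students. The group sizes and the function name are assumptions for illustration; the 90% and 100% figures are the ones quoted in this thread, and the second call shows the "90% accurate in both directions" misreading for comparison.

```python
def expected_outcomes(n_cheaters, n_honest, sensitivity, specificity):
    """Expected counts for a group, given the detector's two error rates."""
    caught = n_cheaters * sensitivity
    missed = n_cheaters - caught
    falsely_flagged = n_honest * (1 - specificity)
    return {"caught": caught, "missed": missed, "falsely_flagged": falsely_flagged}


# Reading from the thread: 90% sensitive, 100% specific.
as_measured = expected_outcomes(100, 100, sensitivity=0.90, specificity=1.0)

# The misreading: 90% "accuracy" applied symmetrically to both groups.
misreading = expected_outcomes(100, 100, sensitivity=0.90, specificity=0.90)

for label, result in [("as measured", as_measured), ("misreading", misreading)]:
    print(
        f"{label}: caught {result['caught']:.0f}, "
        f"missed {result['missed']:.0f}, "
        f"falsely flagged {result['falsely_flagged']:.0f}"
    )
```

Under the first reading nobody honest gets flagged; only under the second do 10 honest students get accused.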

Honestly, reading that article made me less worried about AI detection. My main concern is false positives (incorrectly identifying a human-written text as AI-written), but it seems Turnitin got that close to 0.

Of course the sample size is fairly small, I would want a larger scale study to see if the false positive rate is actually 5%, or 1%, 0.1%, 0.000001%, etc.
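
As a back-of-the-envelope check on how little 29 clean samples can tell us, here is a small sketch of the standard zero-event bound: if you see 0 false positives in n independent samples, the one-sided 95% upper bound on the true rate is 1 - 0.05^(1/n) (the "rule of three" approximates it as 3/n). The sample count of 29 comes from the thread; the independence assumption and the function names are mine.

```python
import math


def zero_event_upper_bound(n, alpha=0.05):
    """Upper bound on the true rate after 0 events in n independent trials."""
    return 1 - alpha ** (1 / n)


def samples_needed(target_rate, alpha=0.05):
    """Smallest n of event-free trials that bounds the rate below target_rate."""
    return math.ceil(math.log(alpha) / math.log(1 - target_rate))


n_human_samples = 29  # rated human-written samples mentioned in the thread

bound = zero_event_upper_bound(n_human_samples)
print(f"95% upper bound on false positive rate: {bound:.1%}")
print(f"rule-of-three approximation: {3 / n_human_samples:.1%}")
print(f"clean samples needed to bound it below 1%: {samples_needed(0.01)}")
```

With n = 29 the bound comes out at roughly 10%, so the study is consistent with a false positive rate anywhere from 0 up to about 1 in 10. Bounding it below 1% would take about 299 clean samples under the same assumptions.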

+1, i feel they’ve done a pretty good job, and have balanced the trade offs well

Turnitin is in a weird spot, and probably an impossible one. Academic writing is trained to be academic writing, with meta text and stock phrases. And students and writers tend to follow conventions they see in other academic texts. As does AI.

On some level the human output in an academic setting is expected to be, well, formulaic in the way AI-generated text is.

Which often could lead to false positives.

What would be high enough? I agree 90% isn't perfect, but neither are LLMs.

What can you do with 90%? Accuse people of plagiarism and ignore the fact you will hurt 10% of innocent people, while still allowing 10% of cheaters? Of course there's ambiguity in the "accuracy" term, but I assumed you can be inaccurate in both directions.

Actually, you're allowing a much higher percentage of cheaters if you read the paper. They optimized to avoid false accusations. It's only ~45-75% accurate at detecting AI writing. It's closer to 90% accurate at detecting human writing. Half the cheaters get through, and you still fail 10 percent of the people who didn't cheat.

> It's closer to 90% accurate at detecting human writing.

I know that's what they wrote, but I heavily disagree. It got 28/30 (93%) correct, but out of the two it got "wrong":

- one was just straight up not rated because the file format was odd or something

- the other got rated as 11% AI-written, which imo is very low. I think teachers would consider this as "human-written", as when I was being evaluated with Turnitin that percentage of "plagiarism" detected would have simply been ignored.

At this point the most basic users of AI could be easily picked off, and that style and list will grow yearly.

> Of course there's ambiguity in the "accuracy" term, but I assumed you can be inaccurate in both directions.

The linked article breaks it down. The measured false positive rate is essentially 0 in this small study.

Are you going to fail 10% of students who did their own work because they supposedly cheated? What exactly can you do with this 90% accurate judgment from a black box? Perhaps not let them out on bail?

No, read the paper. They're going to pass 10% of students who cheated. The 90% figure is the sensitivity; the false negative rate, how many AI essays it says are human, is the remaining 10%.

The false positive rate is 0. The tool *never* says human writing is AI.

> The false positive rate is 0. The tool never says human writing is AI.

That cannot be true as it would be easy for a human to write in the style of AI, if they choose to. Whoever is making that claim is lying, because money...

Read the paper dude. It's not an advertisement, it's an investigation. They performed an experiment including 29 human written papers. One of them got a score of 11% likely to be AI, the rest got a score of 0% likely to be AI. The tool never labeled any human writing as AI with high confidence.

> That cannot be true as it would be easy for a human to write in the style of AI, if they choose to.

Is that the nightmare scenario that everybody in this thread is freaking out about?

Students who go to great effort to deliberately try to make it look like they are cheating, they're the ones you're afraid of being falsely accused of cheating?

We're on our way to dystopia because people who go out of their way to look suspicious on purpose, arouse suspicion?

The reliability of all AI tools with potentially severe consequences for people needs to be tested using adversarial patterns. This is nothing new, yet the mentioned article fails to do that. They test the happy paths and find the results to be satisfactory for themselves.

It is very common in academic investigations to achieve results with more than 95% accuracy, let alone 90%, when in the real world the same AI tools fail miserably.

So, yes, this is the nightmare scenario that I am afraid of where a simplistic "investigation" will be used to justify the use of unproven AI tools with real life consequences to people.

> Are you going to fail 10% of students who did their own work because they supposedly cheated?

The linked article analyzes their data in more detail. In particular, the measured false positive rate is essentially 0 in this small study.

90% accurate doesn't mean 10% false positives, I'd want the 90% accurate to be 100% accurate all of the time.

This isn't zoolander math. or is it.

If I get AI to generate an essay and rewrite every word with my own whilst keeping the same general meaning of the original text, surely there’s no reasonable way to detect that, right?

I mean, the solution is just in-class-only essays, right? Or to stop with the weird obsession with testing and just focus on actually teaching.

There will be, because over time lazy, passive copying will itself get lazier and bring in more of the AI patterns.

The better way to use AI is to get it to teach you to write the essay better and faster each time, so it remains your voice: it starts with how you already write and develops from there.

Everyone generally speaks and writes differently enough that it's like a gait. AI erases that both directly and indirectly.

People who think they’re clever with ai but won’t spend time developing any actual skills will always get exposed eventually.

Colleges are introducing rules that if they can detect AI after you graduate they will cancel your degree. Fun watches on YouTube showing it.

Have fun!

Just don't grade essays? Make it clear that essays are optional and not required to get a grade, but that they're a good way to learn. That will cut down the amount of work to be done too.

If they fail exams because they didn't do the work, that's on them.