Hacker News
by wand3r 1213 days ago
I saw this[1] interview with Sam Altman touching on interim AI impact. I really agree with his point that detecting output from LLMs is basically going to be futile and only really relevant in the near term. Accuracy is obviously going to improve in models; detection isn't that difficult now but will be in the future, especially if the output is modified or an attempt is made to obfuscate its origin.

[1]https://youtu.be/ebjkD1Om4uw

6 comments

> detection isn't that difficult now

I would have thought this, but every attempt I've seen at detecting ChatGPT-generated text has failed miserably.

It fails on false positives, but you don't often get false negatives, which might be enough at the moment for quite a few use-cases.

Also, false positives are typically "this text is likely to contain parts that were AI generated" rather than "This text is highly likely to be AI generated" (which is what GPT-generated content generally produces).

When I've tried to prompt-engineer GPT to produce text that GPTZero will flag as negative it has been pretty tough!

Of course false positives matter; the naive heuristic “everything is AI generated” has zero false negatives, and mostly false positives. In the OP half of the positives are false. That’s not a useful signal IMO. You couldn’t use that to police homework for example.
I said they might not matter for quite a few use cases, not that they don't matter for all use-cases.

E.g., if you were Sam Altman at OpenAI and your use-case is mostly looking for training data and wanting to tell whether it is AI-generated or not (so you can exclude it from the training data), you probably care much more about false negatives than false positives (false positives just reduce your training data set size slightly, while false negatives pollute it).

Of course they matter if you are marking homework (where conversely false negatives aren't actually that important!), but it's pretty trivial to think of use-cases where the opposite is true.
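
A rough way to put numbers on the trade-off described above. This is only a sketch: the corpus sizes and error rates are made up for illustration, and `filter_outcomes` is a hypothetical helper, not something from any detector's API.

```python
def filter_outcomes(n_human, n_ai, false_positive_rate, false_negative_rate):
    """Summarise what a detector-based filter does to a mixed corpus.

    false_positive_rate: fraction of human texts wrongly flagged as AI.
    false_negative_rate: fraction of AI texts wrongly passed as human.
    """
    human_kept = n_human * (1 - false_positive_rate)
    ai_kept = n_ai * false_negative_rate
    flagged_human = n_human * false_positive_rate
    flagged_ai = n_ai * (1 - false_negative_rate)
    flagged = flagged_human + flagged_ai
    kept = human_kept + ai_kept
    return {
        # Share of genuine human text thrown away by the filter.
        "human_data_lost": flagged_human / n_human,
        # Share of the surviving set that is actually AI text.
        "pollution_of_kept_set": ai_kept / kept if kept else 0.0,
        # Share of flags that are correct (what a homework marker cares about).
        "precision_of_flags": flagged_ai / flagged if flagged else 0.0,
    }


# Hypothetical corpus: 9,000 human texts and 1,000 AI texts.
detector = filter_outcomes(9000, 1000, false_positive_rate=0.10, false_negative_rate=0.01)
# The naive "everything is AI generated" heuristic.
flag_everything = filter_outcomes(9000, 1000, false_positive_rate=1.0, false_negative_rate=0.0)

print(detector)
print(flag_everything)
```

With these assumed rates, roughly half of the flags are wrong (bad for marking homework), yet the kept set is almost free of AI text at the cost of a tenth of the human data (possibly fine for filtering training data). The flag-everything heuristic has no false negatives but leaves nothing to train on.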

Even for that the false positives could end up mattering. E.g. if the data being incorrectly excluded turns out to be the most important part of the training set.
I've read that you can literally explain to ChatGPT what perplexity and burstiness are, and tell it to generate its answer with low perplexity and high burstiness (what detectors check).

I haven't tried it though.
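
For anyone unsure what those two terms measure, here is a toy sketch. It makes big assumptions: it uses a smoothed unigram word model instead of a neural language model, and it defines burstiness as the spread of per-sentence perplexity, which is one common reading and not necessarily what any particular detector computes. `UnigramModel` and `burstiness` are illustrative names.

```python
import math
import re
from collections import Counter
from statistics import pstdev


def tokenize(text):
    """Lowercase word tokens; deliberately crude."""
    return re.findall(r"[a-z']+", text.lower())


class UnigramModel:
    """Add-one smoothed unigram model fitted on a reference text (toy stand-in for an LLM)."""

    def __init__(self, reference_text):
        tokens = tokenize(reference_text)
        self.counts = Counter(tokens)
        self.total = len(tokens)
        self.vocab = len(self.counts) + 1  # one extra slot for unseen words

    def prob(self, token):
        return (self.counts[token] + 1) / (self.total + self.vocab)

    def perplexity(self, text):
        """exp of the average negative log-probability per token."""
        tokens = tokenize(text)
        if not tokens:
            raise ValueError("text has no tokens")
        avg_log_prob = sum(math.log(self.prob(t)) for t in tokens) / len(tokens)
        return math.exp(-avg_log_prob)


def burstiness(model, text):
    """Population standard deviation of per-sentence perplexity (assumed definition)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if tokenize(s)]
    return pstdev([model.perplexity(s) for s in sentences])


model = UnigramModel("the cat sat on the mat")
print(model.perplexity("the cat sat"))          # words the model has seen
print(model.perplexity("quantum dog orbits"))   # words it has not seen
print(burstiness(model, "the cat. quantum dog orbits."))
```

Text the model finds predictable gets a lower perplexity, and text whose sentences vary a lot in predictability gets a higher burstiness.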

You don't need to prompt-engineer; this is solvable by changing parameters. You need a higher temperature.
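
A minimal sketch of what the temperature parameter does at sampling time, assuming the usual softmax-over-logits setup. The tokens and logits are invented, and `softmax_with_temperature` and `sample_token` are illustrative helpers rather than any vendor's API.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; higher temperature flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature, rng=random):
    """Draw one token according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


tokens = ["the", "a", "zebra"]   # hypothetical next-token candidates
logits = [2.0, 1.0, 0.0]         # hypothetical model scores

for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
print(sample_token(tokens, logits, temperature=2.0))
```

At low temperature almost all of the probability sits on the top-scoring token, which is why the output reads as generic; raising it gives unlikely tokens a real chance of being picked.
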
It's probably a short-term social phenomenon. We don't bother detecting mathematical output from calculators or spreadsheets; we just like that folks give us the right answer, even if they had easy tooling to produce it. However, watching someone do things the old way would seem bemusing. If you watched a manager notating all over a physical spreadsheet with a pencil (as was commonly done at one time) it would seem quaint or backwards depending on context. Likewise, waiting for someone to write a letter and taking more than 90 seconds because they didn't co-author it with AI might seem slow.
"Write an email that says I did X and they should do Y, but if Z then W, and we should schedule a meeting with P and Q."

I feel like for most emails I write, information density is close to a maximum. This means there's no actual gain to be had from a language model. The email I would write myself is going to be about the same length as the prompt I'd have to write anyway.

There is a huge gain if English is your second language, and you use ChatGPT to rewrite, or translate.

Even if English is your mother tongue, if your written English is crappy or you need to write in a style you are unfamiliar with (e.g. formal), then ChatGPT can help.

If you're writing in a style you're unfamiliar with, how do you know the model is doing it correctly?

I also think writing yourself might be far better practice. This tool can easily become a crutch. This is unlikely to be free anytime soon. In fact it's likely to be quite expensive.

I think we all tend to be better at picking out a correct answer than generating a correct answer from scratch.

I can ask ChatGPT to rewrite an email in the style of an American news reporter from 1950, and I can judge whether some of the cliches it generates feel correct. I cannot write in that style at all.

ChatGPT is free now, although there is a paid tier, and MS and Google are building similar capabilities right into their search interfaces.
ChatGPT won't be free forever.

Whatever LLM search stuff comes along will only be free as long as it brings in ad revenue. Which involves making the models fundamentally worse most likely. Or they'll use it to collect personal data. Probably both.

Computationally, GPT is wildly expensive. This idea people have that it's gonna be used all over the place for all sorts of tiny tasks, as if it's just another REST API, is nuts. Unless something fundamentally new comes along that makes these models much cheaper, adoption is likely going to end up much more limited than people expect. Or siloed off into expensive business-facing products.

I hope gpt detectors will evolve into general bs detectors.
I wonder how good the best language models are at detecting bs.

So many experiments to run...

It sounds feasible to detect cases of low temperature output, as the text output this way sounds very generic (like an average of all content seen near the prompt's latent space).

However once you prompt the LLM with a higher temperature, or tell it to roleplay as someone with elaborate personas, or suggest to use certain linguistic styles, or train it on example text... then it becomes much harder.

I imagine pathological cases of formulaic word use, sentence/paragraph structure will only be detectable in longer form text. After all text is already pretty low-resolution, not much for adversarial models to work with.

Perhaps instead, the solution is to hold people to a higher standard. AI as a tool to help people identify logical fallacies (special pleading, anecdotal, the fallacy fallacy), written or otherwise, could be very valuable. Maybe in the same way that ChatGPT can identify and explain the problems a piece of code has. It could even be integrated with moderation tools. This could in effect make individuals less vulnerable to convincing nonsense said by professional bullshitters or parroting individuals.
> Accuracy is obviously going to improve in models

Well, to be clear, they can put rule-based filters and other things on top of the neural net, but the core GPT will never get more accurate since it has no mechanism to understand what words mean.

GPT3 is far more accurate than GPT2. Seems reasonable that larger models trained on more data will continue to improve accuracy. I'd also expect larger models to be better at summarizing text, i.e. potentially fixing the Bing issues where it hallucinates numbers.

Our model sizes are a product of our scaling and hardware limitations. There's no reason to believe we are anywhere near optimal.

> Seems reasonable that larger models trained on more data will continue to improve accuracy.

It also seems reasonable to assume that they will eventually encounter diminishing returns, and that the current issues, such as hallucinations, are inherent to the approach and may never be resolved.

To be clear I don't have a clue which statement is true (though I don't see why scaling would solve the hallucination problem).

Scaling of models is a heavily researched area, and so far the experiments show no diminishing returns from scaling: that was checked in the GPT-2 "era" with model sizes from very small up to GPT-2, and reconfirmed with GPT-3 and then with newer models. While it's certainly possible that we eventually encounter diminishing returns, it is not reasonable to presume that we will any time soon (we have literally zero evidence for that and at least some evidence to the contrary), and even if we do, there's currently no reason to assume that the eventual breaking point is at "GPT-5" rather than "GPT-15" or "GPT-55".
If significantly bigger models than today's got better results, we would have seen papers about that a long time ago so that the team/company could get more funding; lots of rich actors have worked on that for years.

If it doesn't produce better results however then they want their competitors to waste lots of money to make the same mistakes, there is really no benefit from publishing that and lots of drawbacks.

Otherwise it seems too much of a coincidence that Google and OpenAI ended up with models of basically the same size. Google could have trained a model 5x-10x larger easily, it isn't that expensive to them, but for some reason we didn't see that, and GPT-4 just never seems to launch.

It’s not just the cost of training the model, it’s the cost of doing inference at scale. ChatGPT is borderline too expensive to operate already. It’s hard to imagine a larger model that is both economical and used by millions of people with our current hardware.
Might turn out that for rules based systems such as prescriptive grammars (for grammatically correct language rather than natural spoken) there is still use for a system that explicitly represents those rules.

Then again, we are a big old bulb of wetware and we can generally learn to apply grammar rules correctly most of the time (when explicitly thinking about them, anyway).

Maybe what we need is some kind of meta cognition: being able to apply and evaluate rules that the current LLMs can already correctly reproduce.

The biggest problem is that scaling is non-linear. The returns might well be non-diminishing wrt model size, but if we have to throw N^2 hardware at it to make it (best-case) 2N better, we'll still hit the limit pretty quickly.
> GPT3 is far more accurate than GPT2

Please say more about what you mean here because I disagree.

It’s certainly more eloquent, but it still can’t multiply two 4-digit numbers…

But it did learn some basic arithmetic. If GPT4 can multiply two 4-digit numbers will you change your mind?
GPT saw a bunch of arithmetic and can repeat it. That’s the joke behind the Reddit usernames and /r/counting.

The four-digit number thing is just the current lower bound of where it gets confused because of a lack of training data.

Once you teach a patient 8-year-old the rules of multiplication, they can multiply any two numbers (ones they’ve never seen before) with an arbitrary number of digits. An LLM cannot and will not ever be able to do that because it is a specific tool and it is not designed to do that (doing that would be a bad outcome for an LLM since we have different tools that can do multiplication much more efficiently).

So yes, if an LLM learns rule-based math (which it is not intended to do) I’ll eat not only my, but every hat in existence.
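
For reference, the grade-school procedure mentioned above is small enough to write out. This is a sketch of ordinary long multiplication on decimal digit strings; `long_multiply` is an illustrative name, and it only handles non-negative integers. The point is that the same few rules work for any number of digits, with nothing memorised beyond single-digit products.

```python
def long_multiply(a: str, b: str) -> str:
    """Multiply two non-negative integers given as decimal strings, digit by digit."""
    for s in (a, b):
        if not (s.isascii() and s.isdigit()):
            raise ValueError("inputs must be non-negative integers written in decimal digits")

    # Least-significant digit first, as when working right to left on paper.
    x = [int(d) for d in reversed(a)]
    y = [int(d) for d in reversed(b)]
    result = [0] * (len(x) + len(y))

    for i, dx in enumerate(x):
        carry = 0
        for j, dy in enumerate(y):
            total = result[i + j] + dx * dy + carry
            result[i + j] = total % 10   # digit that stays in this column
            carry = total // 10          # amount carried to the next column
        result[i + len(y)] += carry

    digits = "".join(str(d) for d in reversed(result)).lstrip("0")
    return digits or "0"


print(long_multiply("1234", "5678"))
print(long_multiply("98765432109876543210", "12345678901234567890"))
```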

They probably mean that if you test it on common NLP benchmarks it performs better.