by koochi10 1111 days ago
This doesn't work at scale. Stack Overflow as a platform has been handling user-generated input via moderators, voting, and testing. This is fine when there are only 26.8 million coders on the planet, most of whom aren't posting on Stack Overflow regularly. With LLMs, all of a sudden there is a huge influx of mediocre content on the platform that people can't handle. Inevitably this will erode trust in the platform. When someone posts an answer I assume they actually ran the code and can verify the result. LLMs can spit out seemingly correct code that just doesn't work.
6 comments

> This doesn't work at scale.

See also: the Clarkesworld saga of them being bombarded with mediocre AI-generated short stories. Filtering out bad submissions has always come with the territory, but they're suddenly drowning in them with the advent of LLMs, which make it trivial to churn out vaguely story-shaped text on an industrial scale. The generated content isn't good by any measure, but it's "good enough" to pass the smell test and waste a curator's time before they realise it has zero actual merit, and there's so much of it that sorting through it becomes a Sisyphean task.

Likewise with image generation, it's now incredibly easy to churn out images that look like something a person might make to express themselves, which are actually just a loosely guided slice through a statistical model of pre-existing images, passing the smell test for "good art" despite having zero actual intent or substance. It's spam, but for culture.

This seems like a problem you can only solve with an invite or credential system. If you are an invited writer (or have some sort of literary degree) you can submit content, otherwise you gotta let people invite you. AI content is still allowed, and if you post garbage you lose your posting privileges.
Has that observably worked for Citizendium or lobste.rs, which have tried exactly that for years, though? Have they been widely recognized as superior to Wikipedia and Hacker News? Have they in fact been widely recognized?

If your answer is that Wikipedia and Hacker News still get the recognition and haven't collapsed, then I suggest that we already have examples showing that the same idea won't work for Stack Exchange.

>>If answers are good, keep them. If they're bad, downvote them.

>This doesn't work at scale... with LLMs all of a sudden there is a huge influx of mediocre content

The GP's answer may not work at scale - however LLM detection doesn't work at all. So the only semi-workable solution is aggressive filtering and banning users who post trash (LLM or not).

Also, there's a need to think about score and trust mechanisms: the same mechanisms that can be used for filtering also provide an incentive for LLM use. Is there a way to avoid that?

>When someone posts an answer I assume they actually ran the code, and can verify the result

I wish we lived in a world where this assumption wasn't naive.

Yep, and if you aggressively ban bots/LLM content, then you'll see everyone accuse and report each other for said content even if it's good content.

For example, here on HN we have a rule: if you see bot content, you don't mention it in the thread. You report it and let the admins decide. Anything else just turns into flamewars.

And there's the problem on SO. Previously, we could do exactly that - Flag the content for a Mod to review. Now Mods are pretty much prevented from taking any action when we (the community members) and they believe it is a bot.

I saw one user yesterday post 10 lengthy, detailed answers in an hour, in 3 different programming languages. But the Mods aren't allowed by SE to consider that (or pretty much anything) to be an indicator that it's AI-generated.

Again, you can handle this by rate-limiting and standard anti-abuse measures. To elaborate: don't allow new-ish accounts to post more than one question/answer per day, don't allow accounts to post more than one question/answer per week/month if their previous content hasn't reached a certain quality threshold of votes, and so forth.
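To make the rules above concrete, here's a rough sketch of how such a check could look. This is not how Stack Overflow actually does it. Every name and threshold here (`Account`, `may_post`, 30 days for "new-ish", an average score below zero for "low quality", one week for the longer wait) is a made-up assumption for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# All thresholds are hypothetical; the comment above only gives the rough shape.
NEW_ACCOUNT_AGE = timedelta(days=30)       # assumed meaning of "new-ish"
NEW_ACCOUNT_INTERVAL = timedelta(days=1)   # "one question/answer per day"
LOW_QUALITY_INTERVAL = timedelta(weeks=1)  # "one ... per week/month"
MIN_AVERAGE_SCORE = 0.0                    # assumed "quality threshold of votes"


@dataclass
class Account:
    created_at: datetime
    post_times: list = field(default_factory=list)   # datetimes of past posts
    post_scores: list = field(default_factory=list)  # net votes of past posts


def required_interval(account, now):
    """Return the minimum wait between two posts for this account."""
    interval = timedelta(0)
    if now - account.created_at < NEW_ACCOUNT_AGE:
        interval = max(interval, NEW_ACCOUNT_INTERVAL)
    if account.post_scores:
        average = sum(account.post_scores) / len(account.post_scores)
        if average < MIN_AVERAGE_SCORE:
            interval = max(interval, LOW_QUALITY_INTERVAL)
    return interval


def may_post(account, now):
    """True if enough time has passed since the account's latest post."""
    if not account.post_times:
        return True
    return now - max(account.post_times) >= required_interval(account, now)
```

An established account with decent scores gets no extra wait here, a new account waits a day between posts, and an account whose past posts average below the threshold waits a week, whichever limit is stricter.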

It's entirely possible to set up the system to prevent it from being flooded by content that moderation can't handle. In fact, StackOverflow has already been largely set up that way, and this will require just a little more tweaking of the types of existing policies that have already been in place for a long time. People attempting to flood internet forums with low-quality content or outright spam isn't anything new.

This works in theory, if people abide by it.

However, in practice this sort of approach would likely mean people who don't have anything invested (especially new users) would create multiple accounts to be able to post multiple times.

Rate limiting only works well if there's a stickiness that makes changing accounts more difficult than waiting out the rate limit.

---

While Stack Overflow was set up to handle moderation, the culture evolved into one that disdained any appearance of gatekeeping, favored preserving any attempt at an answer, and treated moderation and curation actions on a post as personal attacks on the individual who wrote it.

As tooling was taken away from community moderation and curation it became harder and harder to maintain quality. Additionally, the rule of 90-9-1 (aka The Rule of Participation Inequality - https://www.grazitti.com/blog/the-90-9-1-rule-is-over-its-ti... ) applied to people who are doing moderation and curating means that once it scales above a certain point it becomes impractical if not impossible to curate all of the incoming material.

A little more tweaking may have been possible a decade ago. However, both the culture of people asking questions and the corporate "engagement first" approach have left anyone trying to curate the material fighting against the tide.

There are 3.3k questions that have had a close vote cast that need more people to review them. There have been only 313 reviews today as I write this ( https://stackoverflow.com/review/close/stats ). And that's ignoring the countless thousands of reviews that have timed out.

    year  close tasks  source
    2016      581,204  https://meta.stackoverflow.com/q/340815
    2017      .......
    2018      440,336  https://meta.stackoverflow.com/q/378415
    2019      318,431  https://meta.stackoverflow.com/q/392550
    2020      225,745  https://meta.stackoverflow.com/q/404558
    2021      213,104  https://meta.stackoverflow.com/q/415250
    2022       96,495  https://meta.stackoverflow.com/q/422885
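
For anyone who wants the size of the drop as a number, here's a quick sketch that computes the change between consecutive rows of the table above. The counts are copied from the table; 2017 is missing there, so the first step spans two years. The helper name `percent_change` is just something I made up.

```python
# Close-review task counts per year, copied from the table above.
# 2017 is not given in the table, so it is left out here.
close_tasks = {
    2016: 581_204,
    2018: 440_336,
    2019: 318_431,
    2020: 225_745,
    2021: 213_104,
    2022: 96_495,
}


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


years = sorted(close_tasks)
for prev, cur in zip(years, years[1:]):
    change = percent_change(close_tasks[prev], close_tasks[cur])
    print(f"{prev} -> {cur}: {change:+.1f}%")

overall = percent_change(close_tasks[2016], close_tasks[2022])
print(f"2016 -> 2022: {overall:+.1f}%")
```

Overall that's a drop of roughly 83% from 2016 to 2022.
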
A trend with community moderation is clearly visible, and it has likely gone too far to be corrected with tweaking.

Exactly, it’s not that useful if the answer I’m looking for exists on the platform but I can’t find it because of the signal vs. noise ratio. To me, usually, the context of the answer is more important than the answer text itself.

How are they going to check for LLM usage?

I think it's way more likely that poor answers won't mention the usage of LLMs to generate the answer, while good answers aided by LLMs will more often mention it.

Punishing honesty just seems incredibly counterproductive.

Automatic detection is downright dystopian... being censored by an algorithm because it mistook my effort and work for an LLM.

Agree with the middle part - At the moment, the policy implemented by corporate is "Don't ask; don't tell". If someone says they used GPT or other AI for their answer, it's disallowed. If they try to hide the fact, there's not much the community can do to get it removed.

And while I'm not a moderator, as just a user I've flagged over 1,200 answers on Stack Overflow (and several of the smaller communities like Ask Ubuntu) that were subsequently removed. Automatic detection was never the sole criterion used to determine whether it was AI. It's entirely possible to spot GPT content using multiple methods. I don't publicly talk about most of these, since we do have a group of users (sometimes spammers) who attempt to hide their use and make it more difficult to detect. See some of my additional notes on the topic at https://meta.stackexchange.com/a/389674/902710

> This doesn't work at scale.

Sounds like a job for AI