Hacker News
by nrmitchi 289 days ago
This seems like a (potential) solution looking for a nail-shaped problem.

Yes, there is a huge problem with AI content flooding the field, and being able to identify/exclude it would be nice (for a variety of purposes).

However, the issue isn't that content was "AI generated"; as long as the content is correct, and is what the user was looking for, they don't really care.

The issue is content that was generated en masse, is largely not correct/trustworthy, and serves only to game SEO/clicks/screentime/etc.

A system where the content you are actually trying to avoid has to opt in is doomed to fail. Is the purpose/expectation here that search/CDN companies attempt to classify and identify "AI content"?

2 comments

It's the evil bit, but unironically.
For today's lucky 10k:

https://www.ietf.org/rfc/rfc3514.txt

Note date published

>Attack applications may use a suitable API to request that [the evil bit] be set. Systems that do not have other mechanisms MUST provide such an API; attack programs MUST use it.

Potential flaw: I'm concerned that attackers may be slow to update their malware to achieve compliance with this RFC. I suggest a transitional API: Intrusion detection systems respond to suspected-evil packets that have the evil bit set to 0 with a depreciation notice.

deprecation notice
It says in the first paragraph it’s for crawlers and bots. How many humans are inspecting the headers of every page they casually browse? An immediate problem that could potentially be addressed by this is the “AI training on AI content” loop.
How many of the makers of these trash SEO sites are going to voluntarily identify their content as AI generated?
Moreover, I find it ironic that website owners would graciously give AI companies the power to identify what is "good" data and what is not. I mean, why would I do the work for them and label my data as AI so that they can ignore it? "Yes please, take all my work, this is quality content, train on it, it's free!" That's what it sounds like.
It would still require the content producer (i.e., the content-spam farm) to label their content as such.

The current approach is that the content served is the same for humans and agents (i.e., a site serves consistent content regardless of the client), so who a specific header is "meant for" is a moot point here.

I believe this is why Google built SynthID: https://deepmind.google/science/synthid/