Hacker News
by furyofantares 149 days ago
The article feels very confused to me.

Example 1 is bad: StackOverflow had clearly plateaued and was well into its downward freefall by the time ChatGPT was released.

Example 2 is apparently "open source" but it's actually just Tailwind which unfortunately had a very susceptible business model.

And I don't really think the framing here that it's eating its own tail makes sense.

It's also confusing to me why they're trying to solve the problem of it eating its own tail - there's a LOT of money being poured into the AI companies. They can try to solve that problem.

What I mean is - a snake eating its own tail is bad for the snake. It will kill it. But in this case the tail is something we humans valued and don't want eaten, regardless of the health of the snake. And the snake will probably find a way to become independent of the tail after it ate it, rather than die, which sucks for us if we valued the stuff the tail was made of, and of course makes the analogy totally nonsensical.

The actual solutions suggested here are not related to it eating its own tail anyway. They're related to the sentiment that the greed of AI companies needs to be reeled in, they need to give back, and we need solutions to the fact that we're getting spammed with slop.

I guess the last part is the part that ties into it "eating its own tail", but really, why frame it that way? Framing it that way means it's a problem for AI companies. Let's be honest and say it's a problem for us and we want it solved for our own reasons.


The proposed solution is also pretty confused:

  > For each response, the GenAI tool lists the sources from which it extracted that content, perhaps formatted as a list of links back to the content creators, sorted by relevance, similar to a search engine

This literally isn’t possible given the architecture of transformer models and there’s no indication it will ever be.

Technically correct, but the workarounds AI search engines use for grounding results could be a close enough approximation. Might not be accurate, but could be better than nothing.

Also Anthropic is doing interesting work in interpretability, who knows what could come out of that.

And could be snake oil, but this startup claims to be able to attribute AI outputs to ingested content: https://prorata.ai/

Not every LLM implementation can use RAG against a Google-sized knowledge base. This proposal essentially says LLMs have to be paired with Google to be legit.

Could you ELI5 why this isn't possible? Google's search result AI summary shows the links for example.

Those citations come from it searching the web and summarizing, not from its built-in training data. Processes outside of the inference are tracking it.

If it were to give you a model-only response it could not determine where the information in it was sourced from.

Any LLM output is a combination of its weights from its training, and its context. Every token is some combination of those two things. The part that is coming from the weights is the part that has no technical means to trace back to its sources.

But even the part that is coming from the context is only being produced by the weights. As I said, every token is some mathematical combination of the weights and the context.

So it can produce text that does not correctly summarize the content in its context, or incorrectly reproduce the link, or incorrectly map the link to the part of its context that came from that link, or more generally just make shit up.
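
A rough sketch of what "processes outside of the inference are tracking it" can look like, in Python. Everything here is hypothetical: `fake_model` stands in for an LLM call, the URLs are placeholders, and real grounded-search products may well work differently. The only point is that the list of sources lives in ordinary code, while the model's answer is just text that can cite a link that was never in its context.

```python
import re

# Hypothetical retrieved documents: the pipeline, not the model, knows these URLs.
RETRIEVED = {
    "https://example.com/a": "Chickens cross roads to reach the other side.",
    "https://example.com/b": "Roads are paved surfaces for vehicles.",
}

def fake_model(prompt: str) -> str:
    """Stand-in for an LLM call. It returns plain text and may cite anything."""
    return (
        "Chickens cross to reach the other side (https://example.com/a). "
        "This was first reported in 1847 (https://example.com/zzz)."
    )

def build_prompt(question: str, docs: dict[str, str]) -> str:
    """Paste the retrieved text and its URLs into the model's context."""
    sources = "\n".join(f"[{url}] {text}" for url, text in docs.items())
    return f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer with links:"

def check_citations(answer: str, docs: dict[str, str]) -> tuple[list[str], list[str]]:
    """Split the links in the answer into ones we retrieved and ones we did not."""
    cited = re.findall(r"https?://[^\s)]+", answer)
    known = [url for url in cited if url in docs]
    unknown = [url for url in cited if url not in docs]
    return known, unknown

answer = fake_model(build_prompt("Why did the chicken cross the road?", RETRIEVED))
known, unknown = check_citations(answer, RETRIEVED)
print("links the pipeline retrieved:", known)
print("links the model made up or mangled:", unknown)
```

Note that even a link in `known` only shows the URL was in the context. It says nothing about whether the sentence next to it summarizes that page correctly.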

OK, I'll try to err towards the "5" with this one.

1. We built a machine that takes a bunch of words on a piece of paper, and suggests what words fit next.

2. A lot of people are using it to make stories, where you fill in "User says 'X'", and then the machine adds something like "Bot says 'Y'". You aren't shown the whole thing, a program finds the Y part and sends it to your computer screen.

3. Suppose the story ends, unfinished, with "User says 'Why did the chicken cross the road?'". We can use the machine to fix up the end, and it suggests "Bot says: 'To get to the other side!'"

4. Funny! But if the User character then asks where the answer came from, the machine doesn't have a brain to think "Oh, wait, that means ME!". Instead, it keeps making things longer in the same way as before, so that you'll see "words that fit" instead of words that are true. The true answer is something unsatisfying, like "it fit the math best".

5. This means there's no difference between "Bot says 'From the April Newsletter of Jokes Monthly'" versus "Bot says 'I don't feel like answering.'" Both are made-up the same way.
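
If it helps, the steps above can be sketched as a toy in Python. This is nowhere near a real LLM: it is a hypothetical word counter that looks at the previous three words (`ORDER`, `train` and `continue_story` are names made up for this sketch) and the "training text" is invented. It only illustrates that the joke answer and the "source" answer come out of the same loop.

```python
from collections import Counter, defaultdict

ORDER = 3  # how many previous words the toy model looks at

# Made-up "training text". Note that it carries no record of who wrote what.
TRAINING_TEXT = (
    "user asks why did the chicken cross the road bot says to get to the other side . "
    "user asks where did that come from bot says from the april newsletter ."
)

def train(text, order=ORDER):
    """Count which word follows each run of `order` words. The counts act as the weights."""
    words = text.split()
    follows = defaultdict(Counter)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        follows[context][words[i + order]] += 1
    return follows

def continue_story(follows, prompt, order=ORDER, max_words=12):
    """Keep appending the best-fitting next word until a full stop or a dead end."""
    words = prompt.split()
    for _ in range(max_words):
        options = follows.get(tuple(words[-order:]))
        if not options:
            break
        words.append(options.most_common(1)[0][0])
        if words[-1] == ".":
            break
    return " ".join(words)

weights = train(TRAINING_TEXT)
joke = continue_story(weights, "user asks why did the chicken cross the road bot says")
source = continue_story(weights, "user asks where did that come from bot says")
print(joke)
print(source)
```

Both printed lines are produced the same way, by picking the word that fit best after the last three. Nothing in `weights` records whether the newsletter claim is true or where any of the words came from.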

> Google's search result AI summary shows the links for example.

That's not the LLM/mad-libs program answering what data flowed into it during training, that's the LLM generating document text like "Bot runs do_web_search(XYZ) and displays the results." A regular normal program is looking for "Bot runs", snips out that text, does a regular web search right away, and then substitutes the results back inside.
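
Here is a minimal sketch of that "regular normal program", in Python. It is an assumption-heavy toy: `fake_model` and `fake_web_search` are stand-ins, the `do_web_search(...)` text format is just the one from the comment above, and the URL is a placeholder. Real products use their own tool-call formats.

```python
import re

def fake_model(document: str) -> str:
    """Stand-in for the LLM: returns the text it would append to the document."""
    if "Search results:" in document:
        return "Bot says 'Per the results above, it wanted to reach the other side.'"
    return "Bot runs do_web_search(why did the chicken cross the road)"

def fake_web_search(query: str) -> list[str]:
    """Stand-in for a regular search engine call; the URL is a placeholder."""
    return [f"https://example.com/search?q={query.replace(' ', '+')}"]

# The wrapper looks for this pattern in whatever text the model generated.
TOOL_CALL = re.compile(r"Bot runs do_web_search\((.*?)\)")

def run_turn(document: str) -> tuple[str, list[str]]:
    """Let the model extend the document, and run a real search if it asks for one."""
    links: list[str] = []
    addition = fake_model(document)
    match = TOOL_CALL.search(addition)
    if match:
        # Ordinary code, not the model, performs the search and keeps the links.
        links = fake_web_search(match.group(1))
        document += "\nSearch results: " + " ".join(links)
        addition = fake_model(document)
    return addition, links

reply, links = run_turn("User says 'Why did the chicken cross the road?'")
print(reply)
print("links shown next to the answer:", links)
```

The links shown to the user come from `run_turn`, which knows exactly what it searched for. The model never reports what flowed into it during training.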

> They can try to solve that problem

Well, they could always try actually paying content creators. Unlike - for instance - StackOverflow.

StackOverflow was built back in the days of Web 2.0, when the idea of user-generated content took shape on the (relatively) altruistic web.

There isn't any clean way to do "contributor gets paid" without adding in an entire mess of "ok, where is the money coming from? Paywalls? Advertising? Subscriptions?" and then also get into the mess of international money transfers (how do you pay someone in Iran from the US?)

And then add in the "ok, now the company is holding payment information of everyone(?) ..." and data breaches and account hacking are now so much more of an issue.

Once you add money to it, the financial incentives and gamification collide to make it simply awful.

Stack Overflow is making money by selling its database to AI companies. It chose not to reimburse the people who built that database.

https://archive.org/search?query=creator%3A%22Stack+Exchange...

You can download the database for free.

Trying to say "give us your payment and tax information so that we can pay you $0.13 for your contributions" would be even more insulting than not paying anyone.

Doing remuneration for people in some countries could get legally challenging too.

Doesn't make a lot of sense, does it? But they adopted it as their new business model nonetheless. Just one more stupid decision on the pile.

“Well, Reddit is growing, which contradicts my point, but I really feel like it’s not”

Reddit is growing because they introduced automatic machine translation and Indians have been joining at an increasing rate. That content is mixed into the English language content, but is of very low quality and irrelevant to many native English speakers. Similarly they mix the English content in with the Indian content.

Essentially, Reddit is also eating its own tail to survive, as the flood of low-quality, irrelevant content is making the platform worse for speakers of all languages, but nobody cares because "line go up."