by preciz 897 days ago
> A model that possesses the entire collective knowledge of our civilization is useless if it can't directly quote its sources.

That's a strong and baseless statement.

5 comments

Useless is probably not the right word, but it's a good way of summing up a lot of the current problems. If the model can clearly identify when something is an exact quote and also know the source, then its output could be trusted for the most part and much more easily verified. It would certainly elevate the output of the model from "random blog post or forum chat" to "academic paper or official report" levels of trustworthiness. Citing sources is hugely important for validation: cited text allows an immediate lookup and a simple equality check for verification, after which you can use it as context to validate the rest of the claims. Like I said, it's a standard we apply to humans, who have an equal propensity for hallucination, mistakes, and deception, because it's a tried and true method for the reader to check the claims being made.

I, for one, agree with the original statement. I think the hallmark of enlightenment (for example, in the scientific method) is that we are able to externalize expert knowledge; that is, experts are usually required to provide the reasoning behind their claims, and not just judgements. This is because we learned that experts cannot be 100% trusted: only if we can verify what they say can we get somewhat closer to the truth (although expertise still provides a convenient shortcut).

So not demanding this (and more) from an AI (an artificial expert) is a regression. AI should be capable of wholly explaining its reasoning if we are to take its statements seriously. It is understandable that humans have only limited capability to do that, since we didn't construct the human brain. But we have control over what AI brains do, so we should be able to provide such an explanation.

It is somewhat ironic that you yourself do not provide any argument in favor of your disagreement.

This isn’t meant to totally disagree with your point (there’s some stuff I agree with in here) but I’m having trouble seeing the point about regressions.

To use another example, a new NoSQL DB not having joins is a regression. Does that mean no one is justified in releasing a new NoSQL DB?

As long as it is clear that "NoSQL" means "I can't do joins", it is OK. I think LLMs (and similar models like Stable Diffusion) are really cool for things like fiction, but relying on them to tell you the truth is dangerous. So I am not really sure why the models have to be trained on NYT articles in the first place.

Providing reasoning and providing citations are not the same thing. Reasons can be provided without citations; citations can be provided without reasons.

LLMs have astounding utility, citations notwithstanding.

They are different, but perhaps you misunderstood my argument.

The issue of plagiarism aside, we reason from facts, and it's the facts (or some other analysis, which is itself a fact) that should be sourced. That's why I agree with the original statement, and I argue not from a (moral) POV of preventing misattribution or plagiarism, but from a (practical) POV of veracity.

We don't only reason from facts. We also reason from values.

Further, reasoning that rests on facts but does not cite them still has massive utility. (See people, all day long.)

Citations are useful, but not required.

I think you just identified another problem with LLMs - we don't know their values, either.

Of course, when you listen to human experts you're using the shortcut (and you do it based on trust), as I already argued. You have the option (in most cases, in free societies) to dig beyond the experts' judgement: you can study their reasoning and understand the sources of both their values and their facts.

Anyway, I disagree with citations not being required. If Wikipedia had no citations it would be less useful (and more prone to contain misinformation). Same goes for Google. So the next best things we have to "artificial brain that contains all the human knowledge" have citations, and for a good reason.

What are the citations actually required for though?

Another way to ask this is: what value remains without them?

I'll add this as well: humans produced valuable knowledge for thousands of years without the use or standard of citations.

To be clear, I think citations are highly valuable and desirable and I very much want LLMs to cite when appropriate. However, I think the necessity of this is overstated.

Edit: what you said of experts can be said of LLMs as well.

And also patently false. Knowledge is knowledge; it's useful without source citations.

Is the knowledge of how to do CPR somehow ineffective because I can't cite whether I studied the knowledge from website A or book B? Is reality a video game where skills only activate if you speak the magic words beforehand?

And this is a rather weak rebuttal.

Well sure, it's easy to make a statement look bad if you only include half of it.

The statement is equally hyperbolic both as quoted and in the original context. LLMs often can't quote sources, and those models are nevertheless useful to lots of people. Makes it hard for me to take the rest of the comment seriously.

An LLM that could quote sources would be even more useful, and in a world where both were available there’d be no reason to use the plagiarizing one.

That was the whole statement. It doesn't have qualifiers left out.

The comment I replied to was updated to include the second half; it was originally just quoting:

> A model that possesses the entire collective knowledge of our civilization is useless

The additional context doesn't do any work.