Scientists should use AI as a tool, not an oracle (aisnakeoil.com)
124 points by randomwalker 743 days ago
22 comments

> Unfortunately, most scientific fields have succumbed to AI hype, leading to a suspension of common sense. For example, a line of research in political science claimed to predict the onset of civil war with an accuracy2 of well over 90%, a number that should sound facially impossible. (It turned out to be leakage, which is what got us interested in this whole line of research.)

This coupled with people acting on its predictions is a kind of self fulfilling prophecy.

which is to ask, are AI safety folks building models of this pattern? :)

This is true for a lot of things, not just AI. But in AI, a guy who didn't get a High School degree and wrote Harry Potter Fan Fiction is one of the leading voices in doomerism.

The problem is you can't just "use logic and reason" because simple models are not good enough. The nuance dominates, but that's why we have experts.

What's funny to me is that people will confidently argue with experts and others value their opinion over the expert's knowledge. But on the other hand, people tend to just take machines at face value. Maybe these aren't overlapping groups, but it does appear that way. There's a great irony in trusting a machine but not the person/s that built said machine.

I don’t trust the machines or the people who make them, and I didn’t have to read the Harry Potter fanfic to know ad hominems are poor arguments. What group does that make me?
> I don’t trust the machines or the people who make them

This makes you consistent. I have no problem with this.

> I didn’t have to read the Harry Potter fanfic to know ad hominems are poor arguments

I mostly agree. But my point is that Eliezer Yudkowsky doesn't actually have the qualifications. I want to be clear that academic degrees aren't necessary to qualify someone, just like they aren't necessary to qualify someone as a good programmer. But it is generally harder and the foundation is shakier. In this case, most of his arguments are founded on incorrect assumptions. They are often logical, but it doesn't matter if something is logical if the premise is incorrect.

Not sure if you intended this, but it feels like the first sentence of your argument is more broadly a critique of the credentials of AI Safety proponents. Maybe you are distinguishing between doomers vs broader AI Safety proponents, but if not, I feel like the counterargument is that most people on the CAIS letter (https://www.safe.ai/work/statement-on-ai-risk) interface quite frequently with these AI models and are also (purportedly) seriously concerned about AI safety
> it feels like the first sentence of your argument is more broadly a critique of the credentials of AI Safety proponents

It's a not so thinly veiled critique of Eliezer Yudkowsky.

> Maybe you are distinguishing between doomers vs broader AI Safety proponents,

I do. These are different classes of people. But many doomers masquerade as AI Safety proponents. Just as many conmen masquerade as ML/AI researchers. I suspect distinguishing the groups is quite difficult for those without domain expertise.

> most people on the CAIS letter (https://www.safe.ai/work/statement-on-ai-risk)

I don't care about the opinion of most of these people (there are some I VERY much do), nor do I think this is a meaningful letter.

Interfacing with a model does not endow one with any level of expertise. If this were true, the whole thread would be ill-founded, because people using GPT are interfacing with it. Instead, one needs to actually deeply study these models. There are things we know about them, and quite a lot. The term "blackbox" gets thrown around a lot, but that doesn't make everyone's expertise on the matter equally valid. In fact, the more complex something is to understand, the fewer people are qualified to have a reasonable opinion on the matter. My complaint is we often act as if the opposite is true.[0]

My second big problem with the CAIS letter is it means nothing. All it says is "I don't want to kill all humans." This is a fairly universally agreed upon statement and is in fact the default statement. It does not say anything about the potential risk. That's a completely different matter.

Worse, many of the people who have signed this are literally at the helm of the ships steering us into a dystopian future (which is not covered by this toothless letter). So I'm not sure what meaning this is supposed to have other than pageantry. Do not forget that these are the same exact people pushing and promoting abuse of these tools. I do not blame Average Joe for thinking that GPT is equivalent to Google (which itself cannot be trusted at face value, but this does not make it a useless tool) when that is often the way that it is promoted/advertised. So if you are concerned, I wouldn't use this as evidence.

[0] There's an added problem that you can become above average in any given subject relatively quickly. This is a double edged sword because knowledge is valuable but it often results in one being over confident. And the learning difficulty grows exponentially, which is why there are so few experts in any given subject matter. Because expertise is understanding nuance and complexity. The great irony of the doomers is that they fall back on "unknown unknowns" while not putting effort towards putting a bound on that.

> are AI safety folks building models of this pattern?

First you need to ask if AI "safety folks" actually understand the technology, and if they are thinking about it objectively. If they believe that we're a few years away from accidentally creating Skynet, they need to put down the crack pipe and go work in another field.

We have already created Skynet. Its name is Capitalism. Or the Internet. One of those things.
I know you’re downvoted, but I don’t think you would be if you had just said corporations.

I think oil companies are the greatest existential threat to humanity via lobbying and eventually climate destruction. Second is social media companies. It’s so easy to spread misinformation against a whole populace and it’s going south fast.

There is just too much incentive for those vested in these corporations to just stop.

How would that work, do you think?
If you knew everyone would ask gpt before doing anything, you would make gpt say what would generally be considered the better option. Not going to war, not committing suicide, etc. In this way even if war was the optimal decision according to some other utility function, the behavior of people is directed in a positive way. (Presumably)
Sure, if you also assume people follow whatever advice so given. They won't, even before the covert influence effort becomes popular knowledge, as it inevitably will. This destroys consumer trust in your product after you have successfully made that product indispensable, thus opening up a previously impossible vacuum in epistemology and thus access to power.
For the record I believe it to be immoral to manipulate humanity in this way. And I also believe it might be bad for business.

I was just trying to explain to the guy above what I think the guy above that meant.

'The guy above' is also me :) And yeah, I get it. I guess the thing I'm trying to get at by extension and example is just how hard a problem this is, and maybe also that the formulation given assumes LLMs have a level of control over human behavior that not only doesn't exist but in the general case sort of can't since the LLM's user is always free not to take its advice. (At the very least, if humans have no option but to follow LLM instructions, something has gone much more badly wrong than the risk of there being poor instructions...)

In general, I think it's a good example of the kind of social problem tech can make a lot worse but no better: when a society has lost its grasp of epistemology, multiplying the amount of information available, at a net decrease in quality and reliability, merely multiplies the scope of the problem.

"accuracy2" sigh - the 2 is a superscript to a footnote and not a domain specific term.

"facially impossible" ... does that really riff on "on the face of it", or is it farcically misspelt?

Garbage in, garbage out 8)

"Facially" in the sense of "on the face of it", roughly as a synonym for "obviously", seems like a pretty standard usage to me—this is certainly not the first place I've seen the word used in this sense.
Human recall failure. Probably wanted "seemingly", "apparently", or even "ostensibly", but who's got time for all that when the publish button's right there.
Also from the article:

> Also, ML code tends to vastly more complex and less standardized than traditional statistical modeling.

I mean, hey, it's proof that the text isn't AI generated, since ChatGPT is better at English than that, but it makes it hard to read and I'm not going to buy their book if it's going to be full of errors like that.

If this is already such a problem even in the professional discipline and vocation whose sine qua non is the accurate analysis of physical reality, I'm really nervous about the next few years. And I was nervous already...
In my professional work, I treat ChatGPT as a search engine that I feel I can ask questions of in a natural manner. I often find small flaws in technical solutions it offers, but it can still provide useful starting points to investigate. I rarely trust code it generates (at least for the language I mainly work in) as I’ve seen it make some serious mistakes (e.g., using keywords in the language that don’t exist).
> I rarely trust code it generates (at least for the language I mainly work in) as I’ve seen it make some serious mistakes (e.g., using keywords in the language that don’t exist)

It's only a mistake from your perspective. The model just generates text based on the probabilities it learned during training. In that respect, there is no such thing as "incorrect" output because the model doesn't operate at that level of abstraction.

Wait, no, it's "incorrect" in the sense that you asked it to do something, and the thing it gives you doesn't accomplish the task.

I asked it "what is the PS3 game where the full version of To Kill a Mockingbird is in there?" and it responded with "The Saboteur", when the correct answer would have been "The Darkness". That is incorrect by most definitions of the word; whether or not it's a consequence of how the model was trained doesn't really change that.

I suppose we could get into details about epistemology and ontology about the nature of what an answer "is", but I think it's fair to say that "incorrect" is when it gives you something that doesn't accomplish the task you asked it to do, or rather when it tries to accomplish the task but what it gives you doesn't work.

> Wait, no, it's "incorrect" in the sense that you asked it to do something, and the thing it gives you doesn't accomplish the task.

You believe you "asked it to do something," but that's just you anthropomorphizing the model and your interaction with it. Of course the AI companies encourage that perspective, but it's a factually dubious one at best.

Judging whether a model's output is "correct" involves you imposing an external context on both the prompt and response that the model typically doesn't have access to. It also typically has no ability to test its responses.

This is part of why good prompt engineering can be so important - because what you get out is a function of what you put in, and pretending that the model is a question-answering oracle only takes you so far.

Of course what the AI companies are trying to do is train and prompt the models in such a way that their output is considered "correct" from a user's perspective more often than not. In an interaction with an AI company's salespeople, you might argue about "correctness". But that's not going to help understand what's actually going on.

It's actually not "anthropomorphizing the model".

I passed it input in the serialized form known as "English text". I expected a response also in serialized English that I can then decode in my brain to something that comports with reality. If I requested from a web server some JSON giving me my bank balance, and the balance it gave me is not accurately reflecting reality, it's not anthropomorphizing anything to say that it's incorrect, any more than pinging Nginx is.

And to be clear, we can wax philosophical all you want about "correctness", but that's really sidestepping the point: I don't care why it's giving me wrong information.

In my bank example, does it really matter, for the end user, whether it's because of some integer overflow error, a null pointer, or just a special `if` statement saying that antonvs's account should always print out a different number for your balance?

I think nearly everyone would say that that's incorrect, and it actually wouldn't be clever or insightful for someone to say "no that's just a result of how the computer was programmed! You're imposing a human understanding of correctness on your bank balance!"

> I expected a response also in ...

Exactly, you expected it, but that doesn't change what's actually happening. The model doesn't know what you expect. It can't read your mind. The best it can do is infer some things, such as that English input should produce English output - and the models are indeed pretty good at that!

> to something that comports with reality.

This is a rather unrealistic expectation in general, when you examine it. You raised a good example with which to do that, though:

> it actually wouldn't be clever or insightful for someone to say "no that's just a result of how the computer was programmed! You're imposing a human understanding of correctness on your bank balance!"

You're right, it wouldn't, because that's a very different situation which helps illustrate the point. The code for the bank app has been written to match your notion of correctness. That's only possible because it has a narrowly defined, specific purpose. It has all the necessary information needed to produce a correct response. The acceptance criteria are clear, including validation and integrity checks on the response. As a result, your expectations should be satisfied, and if they aren't, it makes sense to say that the bank app is not correct.

None of that applies to the AI models we're discussing. An LLM or image model doesn't have a narrowly defined, specific purpose. It can't possibly have access to all the information it needs to "answer" any possible "question" "correctly". It can't possibly have access to acceptance criteria specific to a question unless they're provided explicitly and in detail as part of a prompt - again, underscoring the importance of prompt engineering. And its ability to validate responses - check whether they "comport with reality" - is very limited, at least currently.

An example that's closer to the situation with an AI model would be a tool like a hammer. If you hold a hammer by its head and try to hammer in a nail with its handle, is the hammer "incorrect" when it fails at the task you have "asked" it to do?

> I don't care why it's giving me wrong information.

Just as with the hammer, if you want to be able to use these tools effectively, you should care why.

You're anthropomorphising too much. The machine did the correct thing; it's just that it is a prediction machine, not a magic question answering machine, so its 'correct' may not be what you wanted.

> I suppose we could get into details about epistemology and ontology about the nature of what an answer "is"

The machine has no concept of an 'answer'; when people call these things autocomplete on steroids, they're not really being that inaccurate.

It really isn’t anthropomorphizing at all. If I treat it like a black box it really is quite simple. I gave it an input in the form of English text, there is an objectively correct and incorrect form of English text that responds to the input, it gave me the incorrect one.

Literally no one here disputes that it’s a glorified autocomplete, but that is completely irrelevant to whether it correctly answered a question.

I find this kind of pedantry extremely annoying because it sounds insightful without actually saying anything. Like, no shit, it’s just doing what its algorithm dictates, no one, and I mean no one disputed that. The question of correctness and incorrectness falls into “how accurately did it respond to my query?”

This is like saying the arguments put forth by a schizophrenic lawyer are rational and correct.

If the context is that it's a tool, correct is defined as reality within the context of the use of that tool. If it's to find facts, it can be incorrect, since the context of a fact is reality. If it's writing a story, then "correct" would be based on continuity, etc.

If you're using it as a tool to generate words related to previous ones, then sure, it's always correct, but that's probably not a useful tool for most people. But being a next word predictor doesn't mean it can't also be a useful tool in real world contexts. There are, literally, billions of dollars being spent on pushing them to be more "correct" in more contexts, so it's a useful concept being considered, even though they're "just" next word predictors.

While yes, this is the technical reason — it’s important to not overlook how non-technical people see LLMs. And not only that, how they are being marketed.

I’m struggling to think of any comparable technology where the median user’s understanding is fundamentally wrong and the user is also being purposefully misinformed.

That's like saying "there is no such thing as a bug, it's just code working the way it was written"—true in some sense, but not useful.
“Correctness” is a property of a proposition determined by an observer. Sometimes what is output by an LLM is correct, sometimes not. That an LLM is aware of the output or not means literally nothing.
This habit of latching onto one word specifically to ignore what everyone knows is obnoxious, pedantic, and most of the time not even technically correct. It's just stupid quibbling over how words in English can be used to mean different things. And just so you know, the model doesn't "learn" anything, you're just adjusting weights until you get a desired result.
“What everyone knows” - https://www.lesswrong.com/posts/BNfL58ijGawgpkh9b/everybody-...

More broadly, the meaning and usage of specific words are important for these products because they shape how people perceive their utility.

If a thing isn’t “correct” because it has no sense of understanding, and therefore is only “correct” due to projection by the user, then that’s a super important distinction.

People treating tools like they're infallible has been a problem since computers were invented, but IMHO the biggest difference with AI is how confident and convincing it can be in its output. Much like others here, I already have had to convince, very carefully, many otherwise-decently-intelligent people who believed ChatGPT was correct.

Thus I think the biggest success of AI will be the arts, where imprecision is not fatal, and hallucinations turn into entertainment instead of "truths".

I think this misses something important. If it makes economic sense, corporations will figure out ways to integrate AI into their processes, even if it's imperfect. After all, companies are already built out of humans who are also often confidently wrong - but successful companies have ways to detect and mitigate that. In fact, that's one of the primary requirements for a company to survive, that it's able to build a functioning system out of imperfect components, particularly humans.

You can see an example of this in the use of LLMs to generate code. In that case, there's a whole SDLC pipeline designed to detect errors: type systems, language compilers and runtimes, tests of various kinds, QA, user feedback, etc. We don't just trust confident software developers to produce correct code.

Even a life-critical function like medical imaging - where imprecision can be fatal - can potentially benefit from this, where AI is used in conjunction with human review. It mainly requires development of some standards of practice - unlike with an average user blindly trusting the output of a model, radiologists would need training on how to use the models in question.
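
To make the code-generation point above concrete, here is a minimal sketch of the "don't just trust it" idea: parse the generated snippet, load it, then run it against known cases. Everything here (`check_generated`, the sample snippets, the test cases) is hypothetical and made up for illustration; a real pipeline would add type checking, sandboxing, review, and so on.

```python
import ast

def check_generated(source, func_name, cases):
    """Return a list of problems found in a model-generated snippet."""
    try:
        ast.parse(source)  # step 1: does it even parse?
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    namespace = {}
    try:
        exec(source, namespace)  # step 2: does it load? (sandbox this in real use)
    except Exception as exc:
        return [f"load error: {exc!r}"]
    func = namespace.get(func_name)
    if not callable(func):
        return [f"{func_name} is not defined"]
    problems = []
    for args, expected in cases:  # step 3: does it behave as required?
        try:
            got = func(*args)
        except Exception as exc:
            problems.append(f"{func_name}{args} raised {exc!r}")
            continue
        if got != expected:
            problems.append(f"{func_name}{args} returned {got!r}, expected {expected!r}")
    return problems

# Hypothetical model outputs for "write add(a, b)".
good = "def add(a, b):\n    return a + b\n"
bad_syntax = "def add(a, b):\n    retrun a + b\n"  # 'retrun' is not a keyword
bad_logic = "def add(a, b):\n    return a - b\n"
cases = [((1, 2), 3), ((0, 0), 0)]

for name, src in [("good", good), ("bad_syntax", bad_syntax), ("bad_logic", bad_logic)]:
    print(name, check_generated(src, "add", cases))
```

The nonexistent-keyword kind of mistake gets caught at the parse step; the confidently wrong one only gets caught because someone wrote down what "correct" means as test cases.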

AI is a tool … a fool with a tool is still a fool … For natural sciences, there is no need to worry since nature would provide the ultimate check … for social “sciences”, it is entirely a different story.
The worst is having random people questioning your expertise because of what ChatGPT told them.
To be fair, people did this before ChatGPT. It's just the thing they point to as evidence now, and they'll always find something. The underlying problem is much bigger:

1) people confidently arguing with domain experts about topics that they have little to no experience in.

2) people valuing the opinions of arguers from 1 over experts.

To be extra fair, "domain experts" in some areas have had a bad few years; there are a couple of fields I can think of off the top of my head where the "experts" wheeled out to advise/scare the public are clearly more influenced by politics (or saving their own skin) than science. Replacing trust in experts with trust in LLMs is obviously dumb, but who is Joe Sixpack supposed to turn to?
> there are a couple of fields I can think of off the top of my head where the "experts" wheeled out to advise/scare the public are clearly more influenced by politics (or saving their own skin) than science

This feels like a thinly veiled jab at COVID era public health recommendations. Can you be more clear about which fields you’re referring to?

"domain experts" are often totally wrong and there is nothing new about this.

When our state of knowledge of the world changes, "domain experts" have the most to lose, and our state of knowledge of the world is constantly changing.

Most domains also don't have the exactness of a programming language so are exposed to the same human processes as displayed in a middle school popularity contest.

The whole concept of the "domain expert" is really a modern superstition. An especially powerful superstition because it is the superstition of those who believe themselves beyond superstition.

I'm not sure which domains you're referring to.

I can think of domains where sensationalist opinions are lifted, but not ones where the general consensus is blatantly false. I can think of plenty of instances where large news organizations have grossly misrepresented conclusions of research.

> but who is Joe Sixpack supposed to turn to?

This, I agree with. It is why I actively voice dissent, as an expert and in areas where I have domain expertise, against so-called science communicators (not all are "so-called") and when the news gets it wrong.

Hell, I'll do this when actual science communicators get it wrong. Like when Neil deGrasse Tyson is being dumb[0]. He also thinks hydrogen bombs don't have fallout...[1]. They do...

That said, I still don't think this is a reason to distrust scientists. But I think it is important for scientists to speak out when communicators get it wrong. I think this is a common problem and allows the conmen to gain power. But that's not the only force at play. Truth is complex. Approximate truth is bounded in complexity. But lies can be infinitely simple. So we get it wrong when we "reason our way through" something, because typically the base assumptions are wrong. This makes many conmen truly believe the lies that they are selling.

Joe Sixpack can reason through that. But Joe Sixpack can also reason through the concept that if he was easily able to reason through something and that experts disagree, it's pretty likely there's a reason why other than them being dumb and <Joe Sixpack> knowing better. Can, but doesn't. And we as the public let that happen. This may seem like an insurmountable problem, but instead it is a problem which just needs sufficient effort. Momentum builds, so the more people that push against this, the more common it'll become. And to be clear, it is perfectly fine to question experts. It is not perfectly fine to confidently disagree while not actually understanding the topic. If you don't know the difference, read a few papers/works in the topic and see if you can understand 90+% of it (if it is CS or Engineering, see if you can replicate).

[0] https://www.youtube.com/shorts/a-PHXGmexxM

[1] https://www.youtube.com/watch?v=QGa4ItIOCRg

Doctors had this moment when Google first came out
To be fair, I came across doctors who are no better than a static webpage from the CDC. I fire those doctors pretty quickly.
> People should use AI as a tool, not an oracle

There, fixed the title.

People must not use AI as an oracle, but rather as a tool.

I think this is even better

Wow I came into this article angry, idk if their book title accurately conveys the sober, expert analysis it contains! In case anyone else is curious why they’re talking about “leakage” in the first place instead of the existing term “model bias”, here’s the paper they cite in the “compelling evidence” paper that started these two’s saga with the snake oil salesmen: https://www.cs.umb.edu/~ding/history/470_670_fall_2011/paper...

Crux passage:

> Our focus here is on leakage, which is a specific form of illegitimacy that is an intrinsic property of the observational inputs of a model. This form of illegitimacy remains partly abstract, but could be further defined as follows: Let u be some random variable. We say a second random variable v is u-legitimate if v is observable to the client for the purpose of inferring u. In this case we write v ∈ legit{u}.

> A fully concrete meaning of legitimacy is built-in to any specific inference problem. The trivial legitimacy rule, going back to the first example of leakage given in Section 1, is that the target itself must never be used for inference:

> (1) y ∉ legit{y}

So ultimately this is all about bad experimental discipline re: training and test data, in an abstract way? I’ve been staring at this paper for way too long trying to figure out what exactly each “target” is and how it leaks, but I hope that engineering-translation is close.
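
For what it's worth, the trivial rule quoted above (the target must not be used to infer itself) is easy to show on toy data. This is a rough sketch on synthetic data, not anything from the paper: the feature names, the noise levels, and the one-feature threshold "model" are all assumptions chosen to keep it short.

```python
import random

def make_rows(n, rng):
    """Return (legit_feature, leaky_feature, label) rows of synthetic data."""
    rows = []
    for _ in range(n):
        label = rng.random() < 0.5
        # Legitimate feature: observable beforehand, only weakly related to the label.
        legit = rng.gauss(0.3 if label else 0.0, 1.0)
        # Leaky feature: effectively the label itself plus tiny noise,
        # e.g. something recorded only after the outcome was known.
        leaky = (1.0 if label else 0.0) + rng.gauss(0.0, 0.01)
        rows.append((legit, leaky, label))
    return rows

def fit_threshold(rows, col):
    """'Train' by taking the midpoint between the two class means of one column."""
    pos = [r[col] for r in rows if r[2]]
    neg = [r[col] for r in rows if not r[2]]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(rows, col, threshold):
    """Fraction of rows where 'feature > threshold' matches the label."""
    hits = sum((r[col] > threshold) == r[2] for r in rows)
    return hits / len(rows)

rng = random.Random(0)
train, test = make_rows(2000, rng), make_rows(2000, rng)

legit_acc = accuracy(test, 0, fit_threshold(train, 0))
leaky_acc = accuracy(test, 1, fit_threshold(train, 1))
print(f"legit feature: {legit_acc:.2f}, leaky feature: {leaky_acc:.2f}")
```

Note that the train/test split is clean here and the leaky feature still gives a near-perfect score, so holding out data alone doesn't catch this kind of leak; you have to ask whether each input would really be observable at prediction time.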

Scientists have been obsessed with over-optimizing for FOMO for the past decade - what papers should I read that I don't have time for, what grants should I apply for that I don't know about, what projects should I work on that will give me the best ROI, who in my field is poised to disrupt or make a big leap, etc.

Some even think that the end goal is actually an autonomous research agent that can make decisions about what questions to ask and why, and that's one of the true marks of AGI. That to me is insane and misses the entire point of science altogether, even once we reach that technical feasibility. We ask questions about the universe to expand our human relationship with the universe, not to just amass more research capital for the sake of it. And the fact that the AI snake oil has infected big chunks of science reveals which parts of it are just gold rush speculation and which aren't.

There's a more fundamental challenge of training scientists to understand why we ask the questions we ask. You can't just offload that to some background task and trust that it makes sense.

I understand the point that you're making about overoptimizing for FOMO in science. I wanted to give you another perspective from a scientist working within the US government that doesn't care about playing that game.

Our governmental research agency, and NIH as a whole, has TONS of research data that we don't have the manpower to screen and process. There are also gaps in data that AI/ML could help us simulate. AI research assistants could potentially help us process and evaluate "what questions to ask" by, for example, looking for trends in QSAR (quantitative structure-activity relationship) models for novel chemicals and help us direct our attention to compounds of toxicological interest.

We've also been trying to use the AI research assistants to speed up the process of evaluating the scientific literature for toxicologists who have to make regulatory decisions. Our agency has a backlog of chemicals that we would love to evaluate, but lacks the manpower to do so.

No profit motive or much "clout" interest, at least that I've seen. Just a lot of public servant scientists who need some extra help protecting the public.

To know when to be skeptical of LLMs, you have to know how they are trained and how inference works, and you have to use them often to see how they can screw up.
It's marketed and sold as an oracle. The AGI crowd feels like a cult.
I would have thought scientists weren’t going to use these tools to do research considering they as a group are far more exposed to things like peer reviews and critical thinking than general society.

What worries me the most about these AI solutions, however, is their usage in the public sector. They can certainly be useful helpers, like, they can scan images for cancer and if added to existing processes involving humans, often lead to enhanced results. They can’t replace any existing methods, however, as we learned here in Denmark a few years ago. Unfortunately that lesson hasn’t been learned across the public sector. I think medicine and healthcare learned it, but right now, we’re replacing actual human controls, audits and sometimes decision making with AI or an unwarranted trust in AI results. Which is going to lead to some really terrible results considering how bad things like LLMs often are at being lucky in even “common knowledge” situations. It’s further enhanced by how some of the work it’s tasked to do isn’t as black-and-white as writing code is. We use AI tools in our daily work, and they are ok, but as anyone who’s used them for programming probably knows by now, they aren’t exactly great at being lucky. Sometimes they’ll hallucinate solutions that simply do not exist.

This is how they work, and as I said earlier, AIs can be great enhancers. They aren’t replacements though, and if we start treating them like they are, which is very tempting from a change-management and benefit-realisation perspective, we’re just going to get in trouble. This is unfortunately exactly what we’re doing, and why wouldn’t we? Most western public sectors have functioned on at least some form of new public management for two decades by now, sometimes longer. As a result the entire systemic culture is geared toward efficiency and cost reduction, even when it doesn’t really result in either efficiency or cost reduction from a broader perspective.

Now, if scientists are on board, then what hope does a public bureaucracy have?

LLMs are basically Dissociated Press, but with deeper layers of statistics for a better function approximation than a simple Markov chain. It's really doing the same thing though: pick the next sequence of characters that best follows the foregoing characters.

Not something I'd trust as a "source of truth". Maybe a neat idea generator. And some of the deep learning algorithms can identify patterns that humans might miss -- patterns that could reveal useful insight. But they're not doing the knowledge work.
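
For anyone who hasn't played with the simple-Markov-chain baseline mentioned above, a character-level version fits in a few lines. This is a sketch in the spirit of Dissociated Press, not the Emacs implementation; the toy corpus, the order of 3, and the function names are all made up for illustration.

```python
import random
from collections import defaultdict

def build_model(text, order=3):
    """Map each `order`-character context to the characters seen right after it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, seed, length, rng):
    """Extend `seed` one character at a time by sampling an observed follower."""
    order = len(seed)
    out = seed
    while len(out) < length:
        followers = model.get(out[-order:])
        if not followers:  # context never seen in the training text: stop
            break
        out += rng.choice(followers)
    return out

# Toy training text; any longer corpus works the same way.
corpus = "the model predicts the next character given the previous characters. " * 3
model = build_model(corpus, order=3)
sample = generate(model, "the", 60, random.Random(0))
print(sample)
```

Every four-character window in the output already occurs somewhere in the corpus; the chain only ever recombines what it has seen, which is the "same thing, shallower statistics" point.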

I feel like 90% of AI discussions online these days can be shut down with “a probabilistic syllable generator is not intelligence”
Even people who _know_ that often seem to have difficulty intuitively believing it, is the trouble; it's very good at _appearing_ to be intelligent, good enough that even people who should know better sometimes think that the correctness problems are just a case of "need more GPUs", rather than insoluble.
How do you define intelligence?
That hasn't worked for me.
Humans are not fact machines, we are often wrong. Do humans not have intelligence?

What do you even mean by "intelligence" when you say a probabilistic syllable generator "is not intelligence"?

Like clockwork, out come the "but humans" deflections. An LLM is not a human-like intelligence. This is patently obvious, such comparisons are nonsensical and just further the problem of people anthropomorphizing a tool and treating it like an oracle.
You didn't answer the question.
I did, I said they aren't human-like intelligences, so countering with "humans make mistakes, are humans not intelligent?" is drawing a false equivalence between humans and LLMs.

Since we do not possess a definition of intelligence that isn't human-like, it would be meaningless to argue if LLMs are intelligent in general. All that can be said is that they are not intelligent in the way that humans are.

The question you answered was rhetorical; obviously humans are intelligent. There is another question that actually has an interesting answer after it. In fact, it's not possible to have a meaningful discussion without answering it. I thought that was obvious :)
...but why wouldn't they use AI as an oracle? From an outsider's perspective, it seems that there's already plenty of incentive to test the margins of acceptable academic practice in order to produce more papers or publish more quickly. Sadly I feel like it'll become the norm to have a chatbot interpret your results and write your paper rather than using those expensive grad students.

I don't have answers; just the lingering question "why are we building this?"

We're building this because the ability to make narrow, specific predictions can be narrowly and specifically useful. This works if you have a good understanding of both the tools and the domain you're looking to make predictions in.

Unfortunately, from an outsider perspective, this looks like being widely and generically useful. If you don't understand your tools, you're going to misuse them, and this hype cycle is the result.

> Scientists should use AI as a tool, not an oracle

T in AI stands for tool.

Maybe they should also call it "curve fitting" instead of "AI" so they don't need to call a "poor fit" a "hallucination"
It's all very simple, eh?
Look, it's a bit late here so I don't really have time to fully refute your sophisticated argument. But let me ask you this: if AI is not curve fitting, what is it then?
I'm objecting to the implication that if we start referring to it as curve fitting rather than AI, then thinking about it becomes easier and it becomes less likely that we will make a huge collective mistake in thinking about it.

I'm not saying there aren't a few possible mistakes that do become less likely if we switch to "curve fitting", but I suspect that it does not matter much either way on the most serious mistakes.

I think it would alter the entire safety discussion that was started.

Let's say a company creates an automated system based on a curve fitting algorithm. Then things go wrong. Now it is quite easy to say the company is responsible for any damage and must pay for the rectification.

When we say an AI is deployed and things go wrong, we have a sci-fi movie and responsibility is somehow magically moved away from the company that deployed the algorithm.

To me it feels that "AI" as a term is a clever marketing term that companies will use to deflect responsibility. And I think it is one of the reasons why OpenAI, Musk and others pushed this AI safety nonsense.

The aim of calling it "curve fitting" or something similar would be to take the magic out of it so the broader public doesn't get confused. I think that's worthwhile.

But surely if it's artificial intelligence then it'd know its limits and would respond appropriately? Oracle use no problem?

Is it because it's actually shit but it's the best thing we've seen yet and everyone is just in denial?

People constantly misevaluate their own limits though. Why should AI not be allowed to do that?
Professionals don't constantly misevaluate their limits, if the AI is to replace a professional it has to know its limits.
Current AI is for productivity boost, not to replace. And automation of certain use cases, but not all. It is already really good at those things.
It depends on who you mean.

Most normal people look at AI like ChatGPT as an amazing tool and have used it effectively as a replacement for Google, Grammarly etc. And for them it's fine because any mistakes are localised to them.

The problem is those building products on LLMs (e.g. legal, customer service) who are knowingly misrepresenting what it can do to companies who don't know any better. And I would argue this is fraudulent and where we will see most of the problems.

Is "leakage" just another term for overfitting?
I think a popular example of leakage would be that of a tank recognition AI that perfectly handles training/testing data but fails in real use, because all the tanks of one country happen to have a tree in the background, while those of the other do not, effectively leaking the image label and making the model look for a tree instead of the tank. Even if you trained less or used fewer parameters, it'd still go for the easiest route of trying to detect features of a tree. You'd have to change the training data.
No, usually it means the data that you intend to test the model on was accidentally used to train the model. There are more complex scenarios where you get leakage without actually showing the model the test examples, such as when you have features that contain future information you won't have at actual inference time.

So it usually ends up in overfitting, but it is more about the model having information at training time that it shouldn't have.
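A minimal sketch of the simplest case described here, test rows that were also used for training. Everything in it is hypothetical: the data is synthetic, and the "model" is a lookup table that memorizes what it has seen, standing in for any model with enough capacity to do the same.

```python
def fit_memorizer(train):
    """'Train' by memorizing every (example_id, label) pair seen."""
    return dict(train)

def accuracy(model, test, default=0):
    """Fraction of test labels predicted correctly; unseen ids get `default`."""
    hits = sum(model.get(x, default) == y for x, y in test)
    return hits / len(test)

# Hypothetical data: 100 examples, label is the parity of the id.
data = [(i, i % 2) for i in range(100)]
train, held_out = data[:80], data[80:]

model = fit_memorizer(train)
leaked_score = accuracy(model, train[:20])  # test rows were also in training
clean_score = accuracy(model, held_out)     # test rows never seen in training
print(leaked_score, clean_score)            # 1.0 0.5
```

The leaked evaluation reports a perfect score for a model that has learned nothing general, while the clean split shows chance-level performance. The tank example is the subtler variant: the rows are properly separated, but a feature (the tree) carries the label.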

These are two different definitions. Can someone please disambiguate?
No shit sherlock
Not just scientists, but everyone!

My partner recently went a bit nuts writing an article with the help of GPT4. She was very proud of how productive she'd been until I asked if she'd actually searched for the papers GPT4 had referred to.

Of course, many of the referred to papers didn't exist...

That is not writing with the help of GPT 4, that is letting it write for you! I can’t imagine doing anything creative and letting a computer source material for me without having reviewed the material first hand, even if it was accurate. Clearly, this is not where everyone’s head is at, and I suspect your wife’s workflow is more the common case.

I’ve said from the outset that in academic settings you should be able to cite an AI as a writing assistant, it would clear up a lot of the confusion about its use. If you used it poorly it’s still on you, but at least there’s some transparency by which to judge the work.

I've sort of worked out a workflow. Like say I had to write an essay and take a side for/against something. Then I would ask GPT to write the strongest argument for, and the strongest argument against, telling it to make up whatever sources it wants. Then after reading those, I would have some idea of my own opinions. I would write from scratch but with the GPT for/against pulled up alongside as reference for how to structure the arguments. Then I would put it through GPT again for proofreading and grammar (or just spelling, if there is AI detection software).

It is a bit tricky though, there are definitely points that come up with GPT that people would not think of normally. So in that sense it is still distinguishable from writing solely by oneself, but I would argue the GPT-assisted essays are just better writing and more well-rounded.

There is a subtle aspect of LLM AIs that is lost to most people: they are trained on the entirety of the Internet. That means whatever topic you ask these LLM AIs, there are multiple instances of that same information with different levels of seriousness and accuracy in their treatment of the subject.

For example: if one asks a question using street slang, the answer will be generated from training data about your subject, but from online sources that used street slang in their conversation about that issue. Likewise, if you use ordinary language for your question, the generated response will be from ordinary language conversations of your topic. However, if your question concerns any type of formalized knowledge, by asking your question using the formal language of experts in that topic, then the generated AI answer will come from training data that used these same formal expert terms, and is most likely to be correct, because it comes from discussions among that subject’s experts.

Plus, don't use LLMs for fact retrieval, use them as strategy guides. They really excel as strategy advisors.

There's actually even more subtlety here: in all of your examples the "knowledge" should theoretically be embedded near each other in the same vector space, so regardless of the style of language used, semantically they should all pull from similar weights, and thus give similar answers. This is one of the reasons why LLMs are so powerful: they seemingly understand the semantic relationships of words, so regardless of whether the prompt is posed casually or formally, it should give similar answers in terms of factuality. I agree with you that LLMs today should be primarily used for more creative output.
That assumes that street slang discussions, using entirely different conceptualizations of ideas, would indeed be embedded nearby one another. Plus, both the street slang and ordinary language will tend to treat the information in a less precise, less concept-discriminating manner (meaning the subtle distinctions between issues may be lost in their discussions). In my tests, I find one indeed needs to use subject matter expert language for precise treatment of formal knowledge and generated answers that are more accurate.
You sound like the people who used to know how to fix a car, or sew, or write cursive, or do multiplication times tables in their head, or know how to derive a formula, or check a mathematical proof.

Ask anyone below 30 if they can write cursive today, or know their times tables hehe. Ask them if they can derive a formula instead of using Mathematica.

Or ask a developer if they know how their pixel shaders work, or what’s going on under the hood of their favorite runtime, how hash tables work, or really anything. Previous generations did. When the complexity gets too high people just trust the machines I guess.

And no one actually knows what the LLM internals are anyway.

If you're driving you don't need to know how to fix a car, but relying on GPT to write for you to the extent of accepting its generated citations without checking them, is the equivalent of running around looking for blinker fluid as you attempt to fix your car.
personal anecdata: I had written a few paragraphs of factual information, and decided to see what GPT 4 would do with it. So I asked it to rewrite the information several times, using a different voice (e.g. write it with an optimistic view, pessimistic view etc)

EVERY single "fact" was perverted by either mixing with another fact, or misrepresenting by replacing a word like "good" with "superb" or "fantastic" (I guess optimistic means lie-through-your-teeth?)

YMMV, but basically I achieved nothing except a waste of about 30mins and an honest, personal evaluation of the limits of GPT.

It's kinda scary to think that researchers would be using ChatGPT other than a rubber duck to bounce ideas off of.

There are a couple of issues I can see: people may be unaware of how much the AIs hallucinate, and there's also a real probability that people will pick and choose what they like based on what sounds correct vs what is correct.

AI is a great tool, but it's also convincingly deceiving at times, so much so that many people are totally oblivious to it.

Sadly it's not even just references, LLMs still hallucinate or at least misrepresent even the most basic of facts. That and the stereotypical GPT-verbiage makes it impossible to use for writing anything significant.
> LLMs still hallucinate

Keep in mind that there's no difference between what happens inside a model when it "hallucinates" vs. when it generates "correct" output. It's the exact same process.
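A rough sketch of what "the exact same process" means at the sampling step. The probabilities below are hypothetical, not taken from any real model, and `sample_token` is an illustrative helper that takes the random draw `u` as an argument so the behaviour is reproducible.

```python
def sample_token(probs, u):
    """Inverse-CDF sampling: return the token whose cumulative range contains u."""
    cumulative = 0.0
    for token, p in probs.items():
        cumulative += p
        if u < cumulative:
            return token
    return token  # guard against floating-point round-off when u is near 1

# Hypothetical next-token distribution for "The capital of France is ..."
next_token = {"Paris": 0.90, "Lyon": 0.07, "Berlin": 0.03}

print(sample_token(next_token, 0.10))  # Paris
print(sample_token(next_token, 0.95))  # Lyon
```

Both calls run identical code over the identical distribution. Only the random draw differs, and nothing in the function marks one result as a "hallucination".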

That’s true, but it’s also true of anything else that makes mistakes, including buggy software. When a buggy sorting algorithm produces a bad ordering it’s doing so “with the exact same process” the good ordering is coming from. Ditto for humans and their slips (although tbh I get a little tired of the analogizing of humans and llms…not that the analogies are wrong, but just that we always analogize human minds with the latest technology: wax writing pads through computers)
Uh... yes? I'm not sure why it's some significant insight.

Surely when google gives bad results, it's "the same process" as when it gives good results. And when a book gives wrong information, it's the exact same kind of ink as correct information.

I think the point is that it's not some kind of bug to find and fix, it's a fundamental risk with the entire approach.
We were already swimming in a world of bullshit prior to the wide availability of these. I'm not sure what the future holds, but I think intelligent people are going to become very skeptical of virtually all information sources.

I would imagine there's also a raft of people who will use it as a reason to give up on any search for truth.

I still do hold a lot of hope for their eventual capabilities, but I'm also pretty pessimistic on what the direct and Nth order social effects will be.

GPT-whatever can’t do sources.

I was trying to use it as a research tool and it hallucinated 95% of the references I asked for (not a made up percentage, I counted)

Ironically the one real source turned out to be quite useful.

Hmm. In the future the AI in nefarious hands can retroactively make the papers first, and get them past the censors. Just make up a lot of bullshit and then it’s turtles all the way down lmao
I'm imagining how much easier it would have made work for the Ministry of Truth in 1984.
We use search that way, don’t see why AI trained on similar content wouldn’t be just as variable in terms of reliability.
This is incredibly simplistic. Search engine results give a lot of context clues about the reliability of their asserted facts and provide a potential spectrum of answers. LLM-generated answers strip all that away, and give a single authoritatively phrased answer. Even if you’re inclined to disbelieve it, the LLM answer gives you no ability to dig in, refine, or compare. It just is. If you ask a chatbot if it’s sure, it might double down, or apologize and then repeat itself, or say it was right and give a contradictory followup.

Traditional pre-spam-overload Google results could often give a high quality answer, or if not, you’d at least get the sense of the low quality. Not so with LLMs.

I think you overestimate people’s ability to sniff out bad data on the internet.

Also are you suggesting people fact check an AI by asking it if it is correct? That seems absurd.

Pre-LLM madness, most decent scientists were capable of judging the reliability of a source, at least to an extent. E.g. if the source is a paper in a decent journal, it probably has at least some substance to it and the basic facts are probably not wrong; if the paper is a zero-citation paper on viXra where none of the authors have any reasonable history, you'll probably have to check everything.
But you could trust certain websites being more accurate than others based on their brand, the author, the other content the site had published, who they are linking to and people linking to them etc.

LLMs remove that ability to be discerning about what to trust.