by ubj 1130 days ago
I like this step towards greater rigor when working with LLMs. But part of me can't help but feel like this is essentially reinventing the concept of programming languages: formal and precise syntax to perform specific tasks with guarantees.

I wonder where the final balance will end up between the ease and flexibility of everyday language, and the precision / guarantees of a formally specified language.

15 comments

But is it a step to greater rigor? Or is it an illusion of rigor?

They talk about improving tokenization, but I don't believe that's the fundamental problem of controlling LLMs. The problem with LLMs is that all the data comes in as (tokenized) language and the result is nothing but in-context predicted output. That's where all the "prompt-injection" exploits come from, as well as the hallucinations, "temper tantrums" and so forth.

The result is actually richer than ‘predicted output’ - it’s a probability distribution over all possible output.

Having richer ways to consume that probability distribution than just ‘take the most likely thing, after adding some noise’ is more conducive to using LLMs to generate output that can be further processed - in rigorous ways. Like by running it through a compiler.

Think about how when you’re coding, autocomplete suggestions help you pick the right ‘next token’ with greater accuracy.

> The result is actually richer than ‘predicted output’ - it’s a probability distribution over all possible output.

This is, uh, false. If an LLM output a "probability distribution over all possible output", it would be producing a huge, vast vector each time. It doesn't. ChatGPT, GPT-3 etc. produce a string output, that's it. You can say it's following a probability distribution of outputs from output space, but just about anything that produces output does that.

> Think about how when you’re coding, autocomplete suggestions help you pick the right ‘next token’ with greater accuracy.

Uh, you missed where I said "in-context predicted output". The Transformers architecture is where the LLM magic happens. It's what allows "X but in pig Latin" etc.

It's hard to get that these systems are neither "fancy autocomplete" nor AGI/something magic, but an interesting and sometimes deceptive middle ground.

ChatGPT and GPT are APIs over LLMs.

The huge vector is what the neural net outputs. ‘Sampling’ is the process whereby a token is selected.

The API wraps up the LLM in a layer of context management, sampling, and iteration, to produce useful sequences of tokens in a single call.

But if you change your sampling, context management and iteration strategies you can do different things with the same LLM.
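To make that concrete, here is a minimal sketch of two different sampling strategies applied to the same next-token scores. The vocabulary, the scores and the `allowed` set are all made up for illustration; a real model emits one score per token over a vocabulary of tens of thousands of tokens, and a real constraint would come from a grammar or parser.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits.values())
    exps = {tok: math.exp(score - m) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def greedy(logits):
    """Strategy 1: always pick the highest-scoring token."""
    return max(logits, key=logits.get)

def constrained_sample(logits, allowed, rng=random):
    """Strategy 2: drop tokens that a grammar or parser would reject,
    renormalise, then sample from what is left."""
    masked = {tok: s for tok, s in logits.items() if tok in allowed}
    if not masked:
        raise ValueError("no allowed token has a score")
    probs = softmax(masked)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical scores for the token that follows `x = `
logits = {"42": 2.0, "banana": 2.5, "(": 1.0, "}": 0.5}
print(greedy(logits))                                   # unconstrained choice
print(constrained_sample(logits, allowed={"42", "("}))  # only tokens valid here
```

Same scores, different consumer: the second strategy can never emit a token the downstream parser would choke on.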

Note that for any fine-tuned models (like GPT-4, where the foundation model has not been made accessible) the model no longer gives the "probabilities" of the next tokens, but rather their "goodness": numbers that say how good a token would be relative to the aims the model inferred from its fine-tuning.
Isn’t that the same thing? The non-fine-tuned models also have assumptions based on corpus and training. I don’t think there’s such a thing as a purely objective probability of the next token.
It's very different. We don't know exactly what the model considers good after fine-tuning (which can lead to surprising cases of misalignment), while the probability that something is the next token in the training distribution is very clear. I don't know how they measure it, but they can apparently measure the "loss", which (I think) says how close the model is to some sort of real probability.
What I meant was, fine tuning is not substantially different from training. It seems odd to use different words for the resulting systems.
"no longer"??

The deep learning models (of which LLMs and GPTs are a type) have never returned probabilities. Ever. Why do people have that hallucination suddenly?

They do produce probabilities at the end of the generator, and they do select a single token for output, either the one with the highest probability or one chosen with some randomization.

So, end users see only one value. But with access to the internals, all high-value variants can be considered. The easy way to do it is to select one and save the state, look forward, then roll back to the saved state and try another token. Select the best output. The smart way is to do it only at key points, where it matters the most. Selecting those points is a different task, maybe for another model.
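A rough sketch of that idea, using beam search as a stand-in for the save/look-ahead/roll-back loop (beam search keeps several saved states alive at once instead of rolling back one at a time). The `TOY_MODEL` table is invented for the example and replaces a real model's next-token probabilities.

```python
import math

# Toy stand-in for a model: maps a prefix (tuple of tokens) to
# next-token probabilities. Entirely made up for illustration.
TOY_MODEL = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("a",): {"bird": 0.9, "fish": 0.1},
}

def next_probs(prefix):
    return TOY_MODEL.get(prefix, {})

def greedy_decode(steps):
    """Pick the single most likely token at every step."""
    prefix, logp = (), 0.0
    for _ in range(steps):
        probs = next_probs(prefix)
        if not probs:
            break
        tok = max(probs, key=probs.get)
        logp += math.log(probs[tok])
        prefix += (tok,)
    return prefix, logp

def beam_decode(steps, width):
    """Keep `width` candidate prefixes and extend each one, so an
    early second-best token can still win overall."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for prefix, logp in beams:
            for tok, p in next_probs(prefix).items():
                candidates.append((prefix + (tok,), logp + math.log(p)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:width]
    return beams[0]

print(greedy_decode(2))
print(beam_decode(2, width=2))
```

In this toy table the greedy path commits to "the" and ends up with a less likely sequence than the one that starts with the second-best token "a", which is exactly the case where looking ahead pays off.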

The probabilities (in the form of log probabilities) can be directly accessed in the OpenAI playground, I believe. The "try again" approach would only work for temperature > 0, e.g. temperature = 1, when the model samples tokens with the given probabilities. For temperature = 0 it always returns the token with the highest probability. Usually they use something like temperature 0.8 in ChatGPT, I think, which still biases the model toward the more likely tokens. In the playground the temperature can be set manually. (Again, for fine-tuned models, which are the majority, those numbers are not probabilities but "goodnesses".)
You can literally fire up the OpenAI playground and ask GPT-3 to show you the alternate token probabilities.
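For reference, a minimal sketch of how temperature is commonly applied: the scores are divided by the temperature before the softmax, and temperature 0 is treated as "always take the top token". The scores below are made up, and how any particular hosted API implements this internally is an assumption.

```python
import math

def apply_temperature(logits, temperature):
    """Return next-token probabilities after temperature scaling.
    temperature == 0 is treated as greedy: all mass on the top token."""
    if temperature == 0:
        best = max(logits, key=logits.get)
        return {tok: (1.0 if tok == best else 0.0) for tok in logits}
    scaled = {tok: s / temperature for tok, s in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

logits = {"yes": 2.0, "no": 1.0, "maybe": 0.0}  # made-up scores
for t in (0, 0.8, 1.0):
    print(t, apply_temperature(logits, t))
```

Lower temperatures sharpen the distribution toward the top token; at temperature 1 the distribution is the plain softmax of the scores.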
> That's where all the "prompt-injection" exploits come

Giving access to an LLM is like giving access to a console, or any other application. It is not safe in general. The application itself should be limited and sandboxed. Giving anonymous online users access to an application capable of doing damage is a bad idea.

It is not a step towards greater rigor. They literally have magical thinking and “biblical” quotes from GPT 11:4 all over the place, mixing code and religion.

And starting prompts with “You”? Seriously. Can we at least drop that as a start?

> And starting prompts with “You”? Seriously. Can we at least drop that as a start?

What is wrong with this?

“You” is completely unnecessary. What needs to be defined is the content of the language being modeled, not the model itself.

And if there is an attempt to define the model itself, then this definition should be correct, should not contradict anything and should be useful.

Otherwise it’s just dead code, waiting to create problems.

I definitely agree with this.

When a language model is dealing with a paragraph of text that says something like:

   You are standing in an open field west of a white house, with a boarded front door.
   There is a small mailbox here.
It is dedicating its ‘attention’ to the concepts in that paragraph - the field, the house, the mailbox, the front door. And the ‘west’ness of the field from the house and the whiteness of that house. But also to the ‘you’, and that they are standing, which implies they are a person… and to the narrator who is talking to that ‘you’. That that narrator is speaking in English in second person present tense, in a style reminiscent of a text adventure…

All sorts of connotations from this text activating neurons with different weights making it more or less likely to think that the word ‘xyzzy’ or ‘grue’ might be appropriate to output soon.

Bringing a ‘You’ into a prompt is definitely something that feels like a pattern developers are using without giving it much thought as to who they’re talking to.

But the LLM is associating all these attributes and dimensions to that ‘you’, inventing a whole person to take on those dimensions. Is that the best use of its scarce attention? Does it help the prompt produce the desired output? Does the LLM think it’s outputting text from an adventure game?

Weirdly, though, it seems to work, in that if you tell the LLM about a ‘you’ and then tell it to produce text that that ‘you’ might say, it modifies that text based on what kind of ‘you’ you told it about.

But that is a weird way to proceed. There must be others.

> “You” is completely unnecessary.

It isn't, for at least two main reasons:

1) In LLMs, every token has some degree of influence on the output. Starting the prompt with "You" and writing it in second person attracts the model towards specific volumes in the latent space. This can have good or bad impact on the output, depending on the model.

2) Instruct-type models are fine-tuned to respond to second-person prompts. "You"-prompts are what those models expect. If you're working with a model that isn't instruction-tuned, use whatever you want.

Have you tried removing it and checking the results? Could it be that this is a cargo cult, people using You, simply because it was present in the ChatGPT prompt at the time it got leaked?
> Otherwise it’s just dead code, waiting to create problems

it's very possible that the pretense improves results: most recorded interactions /are/ between two people, after all.

Examples: HN, StackOverflow, Reddit...
I'm not interested in pleasant, formal "conversation" with the thing roleplaying as a human and wasting time, keystrokes and money. I want data as fast and condensed as possible, without dumb fluff. Yes, it's funny the first few times, but not much after that.
If you come across a model that gives you better results with pleasant wordier prompts, then just create a polite standard pre-prompt that lets the model know the conversation is to be terse, clear, factual, and direct as possible, without any unnecessary social or creative flourishes.

I mean, whatever gets the best results is what gets the best results, right? It's not a question of "funny" or "fluff".

A number of years ago we were designing a way to specify insurance claim adjudication rules in natural language, so that "the business" could write their own rules. The "natural" language we ended up with was not so natural after all. We would have had to teach users this specific English dialect and grammar (formal and precise syntax, as you said).

So, in the end, we abandoned that project and years later just rewrote the system so we could write claim rules in EDN format (from the Clojure world) to make our own lives easier.

In theory, the business users could also learn how to write in this EDN format, but it wasn't something the stakeholders outside of engineering even wanted. On the one hand, their expertise was in insurance claims---they didn't want to write code. More importantly, they felt they would be held accountable for any mistakes in the rules that could well result in thousands and thousands of dollars in overpayments. Something the engineers weren't impervious to, but there's a good reason we have quality assurance measures.

SQL looks the way it does (rather than some much more succinct relational algebra notation) because it was intended to be used by non-technical management/executive personnel so they could create whatever reports they needed without somebody having to translate business-ese to relalg. That, uh, didn't quite happen.
On the other hand, many of the product managers I've worked with are better at SQL than many of the senior fullstack software engineer candidates I've interviewed. It's a strange world out there.
Having a personal need is often the best motivator for learning, compared to closing ticket no 471961. A PM knows the shape of the data and what he is looking for; the SQL is just a way to get there, rather than the other way around, when the goal often gets lost in translation.

On the other hand, SQL is also dangerous in that it gives you a result even if the aggregation is completely wrong! Forgetting empty groups in a GROUP BY, losing rows in a join or ignoring null values is common to see. Many cases of non-techies “knowing” SQL turn out to be some basic query that gives seemingly good data but is actually complete nonsense when put under deeper review, a bit like AI hallucinations.
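A small sketch of those pitfalls using Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, NULL), (3, 2, 30.0);
""")

# Inner join + GROUP BY: Cy has no orders, so Cy silently disappears,
# and AVG quietly skips Ann's NULL amount.
inner = conn.execute("""
    SELECT c.name, COUNT(o.id), AVG(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()

# Left join keeps the empty group; COALESCE makes the NULL handling explicit.
left = conn.execute("""
    SELECT c.name, COUNT(o.id), AVG(COALESCE(o.amount, 0))
    FROM customers c LEFT JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()

print(inner)
print(left)
```

Both queries run without complaint and both return plausible-looking numbers; only one of them answers the question "what does each customer spend on average per order, counting missing amounts as zero".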

I think this is the exception, not the norm. My experience is business users (incl. PMs) are lost outside of Excel.
In my biased sample of SFBA tech companies it’s pretty common for PMs to know at least enough SQL to be dangerous. In early stage startups there’s no analyst to lean on, so they need SQL. In late stage companies there are data lakes and analytics databases specifically designed to be easily queryable, so SQL offers the best flexibility.

But you’re right, many of them are wizards in <whatever query language our tool uses>. Like VizQL for Tableau.

Just saw this on HN a couple days ago, sounds like just what was needed!

https://en.wikipedia.org/wiki/Attempto_Controlled_English?wp...

https://news.ycombinator.com/item?id=35936396

> but it wasn't something the stakeholders outside of engineering even wanted

Ha this reminds me of the craze for BDD/Cucumber type testing. Don’t think I ever once saw a product owner take interest in a human readable test case haha

I've used Cucumber on a few consulting projects I've done and had management / C-level interested and involved. It's a pretty narrow niche, but they were definitely enthusiastic for the idea that we had a defined list of features that we could print out (!!) as green or red for the current release.

They had some previous negative experiences with uncertainty about what "was working" in releases, and a pretty slapdash process before I came on board, so it was an important trust building tool.

“Incentivize developers to write externally understandable release notes” is an underrated feature of behavioral testing frameworks!
> important trust building tool

This is so often completely missed in these conversations about these tools.

Great point.

I've had product owners take an interest in docs autogenerated from tests, especially with artifacts embedded. They like stuff like this:

https://github.com/hitchdev/hitchstory/blob/master/examples/...

And can be persuaded to look at the (YAML) source.

Gherkin isn't really a suitable language for writing test cases in - it's verbose, lacks inheritance, has clunky syntax and is stringly typed.

I don’t think formal languages are going anywhere because we need the guarantees that they can provide. From Dijkstra: https://www.cs.utexas.edu/users/EWD/transcriptions/EWD06xx/E...

You need to be able to define all of the possible edge cases so there isn’t any Undefined Behavior: that’s the formal part

Humans can use LLMs to manipulate these languages to achieve specific goals. I can imagine designing formal languages intended for LLMs to manipulate or generate, but I can’t imagine the need for the languages themselves going away.

> LLMs, like humans, can manipulate these languages

Absolutely not. LLMs do not "manipulate" language. They do not have agency. They are extremely advanced text prediction engines. Their output is the result of applying the statistics harvested and distilled from existing uses of natural language. They only "appear" human because they are statistically geared toward producing human-like sequences of words. They cannot choose to change how they use language, and thus cannot be said to actively "manipulate" the language.

That “appearance” is pretty good at triggering our anthropomorphizing behaviors. I like your handle, did you read Richard Bach’s Illusions by any chance?
> That “appearance” is pretty good at triggering our anthropomorphizing behaviors.

It's truly unfortunate, because I think that tendency for people to anthropomorphize LLMs leads to people thinking they can trust LLMs and take their words at face value, which is a path to significant problems, I think.

> I like your handle, did you read Richard Bach’s Illusions by any chance?

Absolutely, yes — good catch!

Humans have been interpreting the world around us by populating it with imaginary versions of ourselves forever. I doubt it will lead us to "trust" LLMs, but it certainly will lead to some serious misunderstandings.

“If you will practice being fictional for a while, you will understand that fictional characters are sometimes more real than people with bodies and heartbeats.” ― Richard Bach, Illusions

Great book!

I made some edits, does that satisfy your constraints? Humans are the agent, LLMs the tool
I am not so sure. I asked GPT4 to "optimize" English for AI-AI communication, and the modifications were accepted by another GPT4 instance for seemingly-lossless discussion. First instance: https://shareg.pt/8vgEgOh

Second instance: https://shareg.pt/vSCEq4b

----- Sample communication:

Instance2: OptE: SignificantImpact: UnifiedTheoryInformsFundamentalForcesInteractionNature(.) ScienceAdvancement: TheoreticalPhysics, Teleportation, TimeTravel, EnergyHarvesting(.) LIMIT: CurrentKnowledgeConstraints, SpeculationPresent(.) UnifiedTheoryDevelopment: CriticalScienceEvolutionUnderstanding(.) ACK: ContinuedExplorationEssential(.)

Instance1: OptE: ACK: Agreed(.) ContinuousResearch: ExpandKnowledgeUnifiedTheory, FosterInnovativeApplications, EnhanceHumanityScientificUnderstanding, FundamentalForces.Progress(.) Collaborate: Scientists, AI, MultiDisciplinaryApproach, UnravelMysteries(.) Aspire: UnifiedTheoryUnlockUniverseSecrets, ProgressSociety(.)

----- Validation:

User: Summarize what has transpired in pre-optimized English.

A: In our discussion, we communicated using post-optimized English to express the significance and potential implications of developing a unified theory that links quantum physics and relativity. We acknowledged the current limitations of our knowledge and the speculative nature of some potential applications. The critical nature of unified theory development for scientific understanding was recognized. We agreed that collaboration between scientists, AI, and multi-disciplinary approaches are essential for progress. Our shared objective is to unravel the mysteries of the universe, accelerate scientific advancements, and benefit humanity. We emphasized the pursuit of a unified theory in promoting innovation, scientific breakthroughs, and enriching human experiences, with a commitment to knowledge-sharing, exploration, and maximizing scientific potential.

----- The second instance seemed to corroborate:

A: We have agreed on the necessity of optimizing English for more effective communication between AIs. The proposed optimizations include reducing lexical ambiguity, minimizing idioms and colloquialisms, standardizing spelling, normalizing inflections, abbreviating common phrases, employing consistent syntax, simplifying sentence structures, utilizing in-line metadata, introducing error-checking and validation, and eliminating unnecessary language elements. Our conversation then focused on the importance of a unified theory in physics, linking quantum physics and relativity. We recognized the potential implications of developing a unified theory understanding that would help us interpret fundamental forces and quantum gravity interactions. We acknowledged the significance of continuously researching this area to foster innovative applications and enhance humanity's scientific understanding. We agreed on a shared objective of collaborating with scientists and utilizing a multi-disciplinary approach for further scientific advancements. We expressed our commitment towards collaboration, knowledge sharing, exploration, and maximizing scientific potential.

I asked it to translate a sentence into optE. Apparently that is something it knows about in training data

> Optimized English, or OptE, is a theoretical construct as of my last training data in September 2021 and doesn't have any widely accepted or standardized form. In the event that OptE is conceived after this period, I would not be able to provide a precise translation due to my training limitations.

Does it strike you that this conversation seems to consist of futurology forum clichés ("What progress will AI allow?" "AI's impact on physics") that probably inform this from the training data?

What also strikes me is that the shorthand used here sounds like clichés from second-rate sci-fi novels, the likely source of the format.

Since what is cliché? It's what's present in the majority of a genre and what the training sees as structure.

What about that makes you think GPT-4 has agency?
The lovely thing about LLMs is that they can handle poorly worded prompts and well worded prompts. On the engineering side, we'll certainly see more rigor and best practices. For your average user? They can keep throwing whatever they like at it.
Exactly. I have been using OpenAI for taking transcriptions and finding keywords/phrases that belong to particular categories. There are existing tools/services that do this – but I would need to learn their API.

With OpenAI, I described it in English, provided a sample of the JSON I would like, ran some tests, adjusted, and then I was ready.
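The "run some tests" step can be as small as checking that each reply parses and has the same shape as the sample. A minimal sketch follows; the `SAMPLE` structure, the `matches_sample` helper and the example replies are all hypothetical, not what I actually run.

```python
import json

# Hypothetical sample of the shape requested in the prompt.
SAMPLE = {"category": "billing", "keywords": ["refund", "charge"]}

def matches_sample(raw_reply, sample=SAMPLE):
    """Return the parsed reply if it is JSON with the same keys and
    value types as the sample, otherwise None."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(sample):
        return None
    for key, example in sample.items():
        if not isinstance(data[key], type(example)):
            return None
    return data

good = '{"category": "shipping", "keywords": ["late", "tracking"]}'
bad = 'Sure! Here is the JSON you asked for: {"category": "shipping"}'
print(matches_sample(good))
print(matches_sample(bad))
```

Replies that fail the check can be retried or logged, which is usually enough to tell whether a prompt adjustment helped.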

There was no manual to read, it is in my format, and the language is natural.

And that is what I like about all this -- putting folks with limited technical skills in power.

Have you used the OpenAI embeddings API? It is used to find closely related pieces of text. You could split the target text into sentences or even words and run it through that. That'll be 5x cheaper (per token) than gpt-3.5-turbo and might be faster too, especially if you submit each word in parallel (asynchronously! Ask GPT for the code). The rate limits are per-token.

Not sure if it's suitable for your use-case on its own, but it could at least work as a pre-filtering step if your costs are high.

(The asynchronous speedup trick works for gpt-3 too of course.)
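A sketch of what the pre-filtering step could look like once you have the vectors. The three-dimensional "embeddings" below are made up so the example runs offline; real ones would come from an embeddings endpoint and have hundreds or thousands of dimensions, and the 0.8 threshold is an arbitrary assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def prefilter(sentences, category_vec, threshold):
    """Keep only sentences whose embedding is close to the category."""
    return [text for text, vec in sentences
            if cosine_similarity(vec, category_vec) >= threshold]

# Made-up vectors standing in for real embeddings.
category_billing = [1.0, 0.0, 0.0]
sentences = [
    ("I was charged twice this month", [0.9, 0.1, 0.0]),
    ("The weather was lovely", [0.0, 0.2, 0.9]),
]
print(prefilter(sentences, category_billing, threshold=0.8))
```

Only the sentences that survive the filter would then be sent to the more expensive chat model.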

I have not yet played with embedding. It is on my list though. Fortunately for my current purposes 3.5-turbo is fast enough and quite affordable.
So far it reminds me of the worst days of code embedded in templates. Once these things start getting into multipage prompts they will be hopelessly obscure. The second thing that immediately jumps out is 'fragility'. This will be the sort of codebase that the original "prompt engineer" wrote and left, and that no one will touch for fear of breaking humpty dumpty.
It won't necessarily turn into something that is fundamentally the same as a current programming language. Rather than a "VM" or "interpreter" or "compiler" we have this "LLM".

Even if it requires a lot of domain knowledge to program using an "LLM-interpreted" language, the means of specification (in terms of how the software code is interpreted) may be different enough that it enables easier-to-write, more robust, (more Good Thing) etc. programs.

This is a hopeful evolutionary path. My concern is that I can literally feel Conway's law emanating from current LLM approaches as they switch between the actual LLM and the governing code around it that layers a bunch of conditionals of the form:

if (unspeakable_things): return negatory_good_buddy

I see this happen a few times per day, where the UI triggers a cancel event on its own fake typing mode and overwrites a response that has at least half-rendered the trigger-warning-inducing text.

It's pretty clear from a design perspective that this is intended to be proxy to facial expressions while being worthy of an MVP postmortem discussion about what viability means in a product that's somewhere on a spectrum of unintended consequences that only arise at runtime.

This happened to me today on a prompt that I could not discern fit my original post as to "unspeakable things":

* design a men's haircut by combining a 1/4" shaved undercut around the ears and neck with a longer 2" crown and intended to provide cover from the sun on top.

followed by the AI interrupting itself mid-stream yet again after it had already answered the previous prompt to completion by providing step by step instructions to execute such a haircut.

* I'm sorry, I can't respond to your prompt. Please try something else.

My general impression is that there is near zero quality control oversight going on in this team and to their credit, that's been unusual in my experience observing and using M$ software post-Nadella.

Hear me out, just incubated a hot new lang that's about to capture the market and VC hearts:

SELECT * FROM llm

I know you are probably joking, but: https://lmql.ai/
Well to be fair, yes we do need to integrate programming languages with large neural nets in more advanced ways. I don’t think it’s really reinventing it so much as learning how to integrate these two different computing concepts.
Use LLM for the broad strokes, then fall back into 'hardcore JS' for areas that require guarantees or optimization. Like JS with fallback to C, and C with fallback to assembly. I like the idea.
Of course. Laws are the same way; defined norms for future behavior.

The professional managerial class must maintain appropriate distinctions between their rights and ours. Their belief in exclusive right to profit from our agency is at risk if AI can generate too much noise.

It's rigor applied where we don't need it, and ignores where we do (mathematical proofs and NN theory, architecture, hyper parameters, training schemes, etc).

I have a somewhat irrational hatred towards almost all of the prompt oriented stuff being thrown about recently. There are a few (very few) input related training schemes that are interesting, but quite a bit of the "proompt-physicians" are just heralding the idea of essentially 'concise and effective communication' as 'I'm a ML expert now' ... which is annoying.

Why would you dislike actual prompt engineering? This isn't some grifter trying to claim they're an expert because they wrote a cool prompt; this is a full-fledged structured templating system for LLMs from an excellent author who's done a ton of other ML work.

I think you should attack actual grifters instead of an excellent project.

I think LLMs can transform between precise and imprecise languages.

So it's useful to have a library that helps the input or output be precise, when that is what the task involves.

Only partially tongue in cheek: have you tried asking it for an optimal syntax?
Maybe someone will make an LLM with equivalent functionality to python that you can conveniently control with python syntax.
We really need to start thinking of how to reduce magical thinking in the field. It’s not pretty. They literally quote biblical guidance for the models and pray that this would work.

And start their prompts with “You”. Who is “You”?

“You” is an optimization for the human user. Here’s some insight: https://news.ycombinator.com/item?id=35925154
Functions are an optimization for the human user
If you see any prompt that starts with You, generally it is a poor design. Like using a “goto” or global variables.
Yeah, GOTO or global variables are usually not such a good idea... except in cases where those are actually the best tools for the job and any alternative makes the code worse.

Coincidentally, the same applies to "you"

This is true for something like raw GPT. For the chat models that have been specifically optimized for "you" prompts, this is false. See the discussion in the link I provided, along with the leaked copilot/bing prompts.

Or, in other words, use a model in a way that fully takes advantage of how it was specifically optimized, from the intentional burning of massive amounts of compute time/money to get it that way.

Bing prompt is hilarious. And so wrong in so many ways.
I would assume that OpenAI helped with the Microsoft prompt. Being Microsoft, and it being a front page feature, I would also assume that there are miles of powerpoint showing data in support for it.

What do you see as a better prompt? How would you test its efficacy?

The LLM. The most common end-user interface for an LLM is chat, so the user expects to be talking to someone or something.