Hacker News
by heisenburgzero 900 days ago
Not completely related, but does anyone know where I can find articles / papers that discuss why transformers, while acting merely as "next token predictors", can handle prompts involving:
1. Unknown words (or subwords/tokens) that were not seen in the training dataset. Example: Create a table with "sdsfs_ff", "fsdf_value" as columns in pandas.
2. Made-up examples (unseen in the training dataset), followed by an instruction telling the LLM to provide similar output.

I have a feeling it should be a common question, but I just can't find the keyword to search.

PS. If anyone has links to a thorough discussion of positional embeddings, that would be great. I never got a satisfying answer about the use of sine / cosine, or about multiplication vs addition.

3 comments

If I had to guess, single characters are able to be encoded as tokens, but there's more "bandwidth" in the model being dedicated to handling them and there's less semantic meaning encoded in them "natively" compared to tokens for concrete words. If it decides to, it can recreate unknown sequences by copying over the tokens for the single letters or create them if it makes sense.
I think some earlier NLP applications had something called an "unknown token", with which they would replace any unseen word. But I don't think it is used anymore in recent implementations.
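
To make the "fall back to single characters" idea concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary and the `tokenize` function are made up for illustration and are not taken from any real model. Real tokenizers (BPE and similar) are more involved, but the point is the same: if the vocabulary covers every single character (or byte), no input ever needs an "unknown" token.

```python
import string

# Hypothetical toy vocabulary: a few multi-character pieces plus every
# lowercase letter and digit, so any such string can always be encoded.
VOCAB = {"value", "table", "create", "_"} | set(string.ascii_lowercase) | set(string.digits)

def tokenize(text, vocab=VOCAB):
    """Greedy longest-match segmentation with single-character fallback."""
    tokens = []
    i = 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # Only reached for characters outside this toy vocabulary.
            raise ValueError(f"cannot encode {text[i]!r}")
    return tokens

print(tokenize("fsdf_value"))  # known piece "value" stays whole, the rest is spelled out
print(tokenize("sdsfs_ff"))    # fully spelled out character by character
```

The unseen column names end up as sequences of familiar tokens, which the model can then copy from the prompt into its answer.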

It still baffles me why such a stochastic parrot / next token predictor would recognize these "unseen combinations of tokens" and reuse them in its response.

This helped me understand but not well enough to explain it yet: https://transformer-circuits.pub/2022/in-context-learning-an...
Thanks. "In-context learning" was the phrase I was looking for.

Also, I found these 2 links pretty good too. 1. http://ai.stanford.edu/blog/understanding-incontext/ 2. http://ai.stanford.edu/blog/in-context-learning/

I'm still not completely convinced. Probably need to dwell on the topic longer.

Everything falls into place once you understand that LLMs are indeed learning hierarchical concepts inherent in the structured data they have been trained on. These concepts exist in a high dimensional latent space. Within this space is the concept of nonsense/gibberish/placeholder, which your sequence of unseen tokens maps to. The model then combines this with the concept of SQL tables, hopefully resulting in the intended answer.
P(X_1=x_1, X_2=x_2, X_3=x_3)
= P(X_3=x_3 | X_1=x_1, X_2=x_2) • P(X_1=x_1, X_2=x_2)
= P(X_3=x_3 | X_1=x_1, X_2=x_2) • P(X_2=x_2 | X_1=x_1) • P(X_1=x_1)

That is to say: having a correct conditional probability distribution over the next token, conditional on the previous tokens, produces a correct probability distribution over sequences of tokens.
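
A small sketch of that claim in code, assuming a made-up next-token model over a two-symbol alphabet (`next_token_probs` and its numbers are hypothetical, chosen only for illustration): multiplying the next-token conditionals gives the probability of a whole sequence, and those sequence probabilities sum to 1.

```python
import math
from itertools import product

def next_token_probs(prefix):
    """Hypothetical P(next token | prefix); the numbers are made up."""
    if not prefix:
        return {"a": 0.5, "b": 0.5}
    if prefix[-1] == "a":
        return {"a": 0.1, "b": 0.9}
    return {"a": 0.7, "b": 0.3}

def sequence_prob(tokens):
    """P(x_1, ..., x_n) as a product of next-token conditionals (chain rule)."""
    prob = 1.0
    for i, tok in enumerate(tokens):
        prob *= next_token_probs(tokens[:i])[tok]
    return prob

# P(a, b, a) = P(a) * P(b | a) * P(a | a, b) = 0.5 * 0.9 * 0.7
p_aba = sequence_prob(("a", "b", "a"))

# Summing over every length-3 sequence checks that the conditionals
# define a proper distribution over whole sequences.
total = sum(sequence_prob(seq) for seq in product("ab", repeat=3))
print(p_aba, total)
```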

And a “correct probability distribution over sequences of tokens” (or a “correct conditional probability distribution over sequences of tokens, conditional on whatever”) can be... well, you can describe pretty much any kind of input/output behavior in those terms.

So, “it works by predicting the next token” is, at least in principle, not much of a constraint on what kinds of input/output behavior it can have?

So, whatever impressive thing it does is not really in conflict with its output being produced from the probability distribution P(X_{n+1}=x_{n+1} | X_1=x_1, ..., X_n=x_n) (“predicting the next token”).

It’s not reproducing exact strings from the training data, but patterns and patterns of patterns.

Next token prediction is more intelligent than it sounds

Reminds me of that person who asked ChatGPT to make its own language, with vocab and grammar rules, and translate back and forth. It blew my mind.