Hacker News
by fredwu 805 days ago
Calling this an LLM and claiming:

> With enough training data and a good chat interface, this can be used instead of well-known decoder-only models like GPT, Mistral, etc.

Seems to be very misleading...


> enough training data

> good chat interface

I'm going to try to do that here: https://github.com/bennyschmidt/llimo

I totally understand your skepticism, and am also surprised it works so well as a sentence-finisher, even with hardly any data (a handful of books). Think about it like this:

If you had a file with billions of sentences:

"Paris is in France" "Apples grow on trees" "Gold is worth more than silver" "I know where to go!" etc...

Then you can complete virtually any sentence. Say the user enters "Paris" - you can easily find and return "is in France" by simply searching for "Paris" in the text then slicing out the rest of the sentence, "is in France" -- or just give the next word "is" for more of an auto-suggest feel.
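
A minimal sketch of that search-and-slice idea, assuming a tiny in-memory array stands in for the file of sentences. The names `sentences`, `completeSentence` and `nextWord` are made up for illustration and are not taken from the linked library; matching is also simplified to sentence beginnings only.

```javascript
// Hypothetical mini-corpus; a real one would hold far more sentences.
const sentences = [
  "Paris is in France",
  "Apples grow on trees",
  "Gold is worth more than silver",
  "I know where to go!",
];

// Return the rest of the first sentence that starts with `input`,
// or null when nothing matches.
function completeSentence(input, corpus) {
  const match = corpus.find(
    (sentence) => sentence === input || sentence.startsWith(input + " ")
  );
  return match ? match.slice(input.length).trim() : null;
}

// Return only the next word, for more of an auto-suggest feel.
function nextWord(input, corpus) {
  const rest = completeSentence(input, corpus);
  return rest ? rest.split(" ")[0] : null;
}

console.log(completeSentence("Paris", sentences)); // "is in France"
console.log(nextWord("Paris", sentences)); // "is"
```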

But if there were 4 sentences starting with "Paris", it gets slightly more complex. Now I have to rank them to know which one is the best suggestion:

"Paris is in France." "Paris is nice this time of year." "Paris Hilton liked my post!" "Paris was my favorite city overall."

In this case, "is" is still the best because it scores higher than "Hilton" and "was" - because it follows "Paris" more frequently. So in addition to the text file of billions of sentences, I need to make a ranking system to give a point score to every possible word in the file at every possible position it might be in.
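
Roughly, the ranking could look like the sketch below. It assumes the score is a plain count of how often a word directly follows another; `rankFollowers` and `parisSentences` are hypothetical names, and the real scoring may well be more involved.

```javascript
const parisSentences = [
  "Paris is in France.",
  "Paris is nice this time of year.",
  "Paris Hilton liked my post!",
  "Paris was my favorite city overall.",
];

// Count how often each word directly follows `word`,
// then sort the candidates from most to least frequent.
function rankFollowers(word, corpus) {
  const counts = new Map();
  for (const sentence of corpus) {
    const tokens = sentence.split(/\s+/);
    for (let i = 0; i < tokens.length - 1; i++) {
      if (tokens[i] === word) {
        const next = tokens[i + 1];
        counts.set(next, (counts.get(next) || 0) + 1);
      }
    }
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// "is" scores 2; "Hilton" and "was" score 1 each.
console.log(rankFollowers("Paris", parisSentences));
```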

To make it all faster (because a super massive text file is not feasible), the billions of sentences are not actually represented in a single text file, but as a deeply nested JavaScript object that our computers are better optimized to traverse and look up, along with the point values for each word.
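
Here is one way such a nested object could be built and queried. This is only a sketch under assumptions: the `$count` key, `newNode`, `addSentence` and `suggest` are invented for the example, and the point values are simple occurrence counts.

```javascript
// Made-up key for storing a node's score; it would collide with a literal "$count" token.
const COUNT = "$count";

function newNode() {
  const node = Object.create(null); // no inherited keys such as "constructor"
  node[COUNT] = 0;
  return node;
}

// Insert one sentence, bumping the count of every word along its path.
function addSentence(model, sentence) {
  let node = model;
  for (const token of sentence.split(/\s+/)) {
    if (!(token in node)) node[token] = newNode();
    node = node[token];
    node[COUNT] += 1;
  }
}

// Walk the object along `prefix` and return the highest-scoring next word.
function suggest(model, prefix) {
  let node = model;
  for (const token of prefix.split(/\s+/)) {
    if (!(token in node)) return null;
    node = node[token];
  }
  let best = null;
  for (const word of Object.keys(node)) {
    if (word === COUNT) continue;
    if (best === null || node[word][COUNT] > node[best][COUNT]) best = word;
  }
  return best;
}

const model = newNode();
for (const sentence of [
  "Paris is in France",
  "Paris is nice this time of year",
  "Paris Hilton liked my post",
  "Paris was my favorite city overall",
]) {
  addSentence(model, sentence);
}

console.log(suggest(model, "Paris")); // "is"
```

Each lookup only follows one key per word of the prefix, instead of scanning every sentence.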

At this point, _with enough sentences on file_ with ranked suggestions and fast lookups, you can complete almost anything a user could input. This is what I shared today.

> good chat interface

To put the sentence-finisher to use as a chat bot, all you need to do is convert the user's question or "prompt" into the beginning of a sentence. For example:

"Where is Paris?" -> "Paris is" and let the completer give you "in France". So the answer is: "Paris is in France."

Does this make sense?
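
As a rough sketch of the question-to-prompt step, assuming only the single "Where is X?" shape is handled: `questionToPrompt`, `complete` and `answer` are hypothetical names, and `complete` is a hard-coded stand-in for the sentence completer, not the library's API.

```javascript
// Rewrite "Where is X?" as the sentence opening "X is".
// Only this one question shape is handled; anything else is passed through.
function questionToPrompt(question) {
  const match = question.trim().match(/^where\s+is\s+(.+?)\?*$/i);
  return match ? `${match[1]} is` : question.trim();
}

// Stand-in for the sentence completer; hard-coded for the example.
function complete(prompt) {
  const known = { "Paris is": "in France" };
  return known[prompt] ?? null;
}

// Glue the prompt and its completion into a full answer sentence.
function answer(question) {
  const prompt = questionToPrompt(question);
  const rest = complete(prompt);
  return rest === null ? null : `${prompt} ${rest}.`;
}

console.log(answer("Where is Paris?")); // "Paris is in France."
```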

This thing you are doing is called a "Markov chain", search it up and read about it.

It's a next token prediction library. "Markov chain" basically means finite state machine and is a looser concept. If you want to call the token prediction methodology Markovian you can - sounds cool! By the word-ranking implementation another person linked here, any LLM would also qualify as using Markovian dynamics, but what is the point of calling it something so abstract?

More accurately, it's literally a language model:

  {
    I: { 
      want: { 
        to: { 
          be: { ... }, 
          know: { ... }
        }
      },
      will: { ... }
    },
    ...
  }
Every word of every sentence is modeled and ranked, and there are methods to perform operations on it. If you added a lot more words and phrases to the model, it would be a "large" language model. It also supports non-words though, so it's more accurately a "next token prediction library" that can be used to create language models.

A Markov chain would be more sophisticated/advanced.

This is an unoptimised, naïve implementation of an ngram language model, an idea from seven decades ago [0].

[0] C. E. Shannon, "A Mathematical Theory of Communication", 1948

How many comments are you going to leave? What is it, Day 3 for you?

A Markov chain is equivalent to a "state machine" and I can't believe the number of braindead people on this page who don't know this basic fact.

> "The Markov Property states that the probability of future states depends only on the present state."

> "A Markov chain is a type of Markov process that has either a discrete state space or a discrete index set (often representing time), but the precise definition of a Markov chain varies."

https://en.wikipedia.org/wiki/Markov_chain

^ You could have spent the last 3 days learning this basic fact instead of trolling Hacker News. Notice it has nothing to do with token prediction specifically. It's just a loose philosophical concept that means "finite state machine" (AKA deterministic/predictable sequence of states). The React library "XState" is said to implement a Markov chain. Think about what the value would be in saying "this library is trash! All it is is a Markov chain!" totally missing the point of what it does. GPT uses next-token prediction too - from sEvEn dEcAdEs aGo~ (probably more tbh, that's all you found?)

For the sake of your hilarious argument - the data structure I use to model language is not "either ngram or Markov chain" - Markov chains use ngrams in the form of unigrams, bigrams, and trigrams (or if an unknown number: "ngrams"). They're not concepts at odds lol. I hope you learned something here, but I doubt it.

Finally, the data structure the next-token-prediction lib uses is really none of those concepts, it's more accurately a "language model", it's not a state machine at all. One guy said "Markov" and people parroted them in Reddit fashion, and now I get to deal with the bottom-of-the-barrel (you). You really should educate yourself, it would do wonders.

You're conflating state machines with Markov chains. Markov chains are stochastic; the XState library is not meant for Markov chains - I doubt it has any support for state transitions from probability distributions.
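
To make "state transitions from probability distributions" concrete, here is a small sketch. The `transitions` table and its probabilities are invented for illustration, and `sampleNext` is a hypothetical helper; it says nothing about what XState does or does not support.

```javascript
// Hypothetical transition table: each state maps next states to probabilities.
const transitions = {
  Paris: { is: 0.5, Hilton: 0.25, was: 0.25 },
};

// Pick the next state at random according to the probabilities for `state`.
// `random` is injectable so the example can be made repeatable.
function sampleNext(state, table, random = Math.random) {
  const options = Object.entries(table[state] || {});
  let r = random();
  for (const [next, p] of options) {
    if (r < p) return next;
    r -= p;
  }
  // Guard against rounding error: fall back to the last option, if any.
  return options.length ? options[options.length - 1][0] : null;
}

console.log(sampleNext("Paris", transitions)); // "is", "Hilton" or "was"
```

A deterministic state machine would return the same next state every time; here the same input can lead to different next states on different calls.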

Your library is an ngram-based model.

Day 4 :D

Re: "stochastic": flawless copy-pasta from Google, but a Markov chain is still an example of a (finite) state machine and is not itself an implementation of anything.

ngram language models are an implementation though, and are not a competing concept:

With a language model, you could talk about "a three word Markov chain" or you can simply say "a trigram". You can say "A Markov chain of variable length" or you can say "an ngram". That is all that is meant regarding those 2.

If the Markov assumption is that you can predict the next word from a limited window of previous words rather than all of them, then the bigram assumption is that you can predict the next word based on the previous 1 word. The trigram assumption is that you can predict a word with the 2 previous words, because all 3 are part of the same trigram.
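
The difference in context size can be sketched like this, assuming whitespace tokenization and plain counts. `buildNgramCounts` is a made-up helper and the sample text is invented.

```javascript
// Count which words follow each context of `n - 1` words.
function buildNgramCounts(tokens, n) {
  const counts = new Map();
  for (let i = 0; i + n <= tokens.length; i++) {
    const context = tokens.slice(i, i + n - 1).join(" ");
    const next = tokens[i + n - 1];
    if (!counts.has(context)) counts.set(context, new Map());
    const followers = counts.get(context);
    followers.set(next, (followers.get(next) || 0) + 1);
  }
  return counts;
}

const tokens = "I want to be here and I want to know more".split(" ");
const bigrams = buildNgramCounts(tokens, 2); // context: 1 previous word
const trigrams = buildNgramCounts(tokens, 3); // context: 2 previous words

console.log([...bigrams.get("to").keys()]); // [ 'be', 'know' ]
console.log([...trigrams.get("want to").keys()]); // [ 'be', 'know' ]
```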

More from Stanford on language models (LM):

> "Models that assign probabilities to sequences of words are called language models or LMs. In this chapter we introduce the simplest model that assigns probabilities to sentences and sequences of words, the n-gram."

> "Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past. We can generalize the bigram (which looks one word into the past) to the trigram (which looks two words into the past) and thus to the n-gram."

https://web.stanford.edu/~jurafsky/slp3/old_jan23/3.pdf

Looking forward to your next comment!

LLMs can handle questions and answers they have never seen word for word before though.

I think this is the key. A system based on just counting the words in its database will give 0% for sentence continuations not in the database, but LLMs don't look up, they extrapolate, and can provide good sentence completions for sentences that are entirely new.
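
The 0% case is easy to show with a pure counting model. This sketch assumes probabilities are relative frequencies of word pairs with no smoothing; `continuationProbability` is a hypothetical name and the two-sentence corpus is invented.

```javascript
// Relative frequency of `next` directly after `word`;
// 0 when the pair (or the word itself) was never seen.
function continuationProbability(word, next, corpus) {
  let seen = 0;
  let matches = 0;
  for (const sentence of corpus) {
    const tokens = sentence.split(/\s+/);
    for (let i = 0; i < tokens.length - 1; i++) {
      if (tokens[i] !== word) continue;
      seen += 1;
      if (tokens[i + 1] === next) matches += 1;
    }
  }
  return seen === 0 ? 0 : matches / seen;
}

const corpus = ["Paris is in France", "Paris is nice this time of year"];
console.log(continuationProbability("Paris", "is", corpus)); // 1
console.log(continuationProbability("Paris", "sparkles", corpus)); // 0
console.log(continuationProbability("Berlin", "is", corpus)); // 0
```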

Two things: 1) This library will also keep predicting words for every word that exists in the model; it just makes less sense as you keep going. That's also true of any LLM. 2) An LLM is not synonymous with "chat bot".

Stick with it and eventually you'll invent your own subset of Prolog.