by throwaway4aday 840 days ago
Nice work! I built something similar years ago, and I did compile the probabilities from a corpus of text (public domain books) in an attempt to produce writing in the style of various authors. The results were actually quite similar to the output of nanoGPT[0]. It was very unoptimized and kept everything in memory. I also knew nothing about embeddings at the time, and only a little about the NLP techniques that would certainly have helped. A graph database would probably have been better than the data structure I came up with. You should look into stuff like Datalog, Tries[1], and N-Triples[2] for more inspiration.
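
For illustration, compiling probabilities from a corpus could look roughly like this in Python. It's a minimal sketch that assumes a word-level bigram model and whitespace tokenization; the comment doesn't say what model the original project used, and `build_transition_probs` is a hypothetical name.

```python
from collections import Counter, defaultdict


def build_transition_probs(text):
    """Map each word to a dict of {next_word: probability}.

    Assumption: a word-level bigram model over whitespace-split tokens.
    """
    words = text.split()
    counts = defaultdict(Counter)
    # Count how often each word is immediately followed by each other word.
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    probs = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        # Normalize the raw counts into probabilities that sum to 1.
        probs[current] = {w: n / total for w, n in followers.items()}
    return probs


corpus = "the cat sat on the mat and the cat ran"
probs = build_transition_probs(corpus)
```

Here `probs["the"]` gives "cat" twice the weight of "mat", because "the cat" appears twice in the toy corpus and "the mat" once. Everything stays in memory, which is fine for a toy corpus but not for a shelf of books.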

Your idea of splitting the probabilities based on whether you're starting the sentence or finishing it is interesting, but you might benefit from an approach that creates a "window" of text to use for lookup; an LCS[3] algorithm could do that. There's probably a lot of optimization you could do based on the probabilities of different sequences. I think this was the fundamental thing I was exploring in my project.
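
One possible reading of the window idea, sketched in Python: take the last few words as the window, then pick the stored context that shares the longest common subsequence with it. This is only a guess at how LCS might be applied here. `lcs_length`, `best_context`, and the example contexts are all hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two word sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best subsequence that ends before both words.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of skipping x or skipping y.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def best_context(window, known_contexts):
    """Return the stored context with the longest common subsequence with the window.

    Assumption: contexts are word tuples; ties go to the earliest context.
    """
    return max(known_contexts, key=lambda ctx: lcs_length(window, ctx))


window = "the quick brown fox".split()
contexts = [
    ("a", "lazy", "dog"),
    ("the", "brown", "dog"),
    ("the", "quick", "red", "fox"),
]
match = best_context(window, contexts)
```

The matched context would then be the key for looking up next-word probabilities. Scanning every stored context like this is linear in the number of contexts, so a real version would need an index of some kind.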

Seeing this has inspired me further to consider working on that project again at some point.

[0] https://github.com/karpathy/nanoGPT

[1] https://en.wikipedia.org/wiki/Trie

[2] https://en.wikipedia.org/wiki/N-Triples

[3] https://en.wikipedia.org/wiki/Longest_common_subsequence