| HN Mirror


by jltsiren 1126 days ago

Instead of debating the moral and legal aspects, it could be more productive to focus on the technical aspects that can help us make better-informed decisions about the moral and legal problems.

On some levels of abstraction, LLMs seem to be unknowable black boxes. On other levels, they are simply approximate solutions to established problems, such as estimating the probability distribution for a token following a context.

Genome assembly is something like a complementary problem to text generation. You can't read the genome directly, but you can duplicate it, break it into fragments, read the fragments, and try to assemble them. The methods vary, but you generally try to find overlaps between the fragments and build a graph based on the overlaps. If you start from a context that occurs only once in the genome, there is often one overwhelmingly likely path in the graph that corresponds to a substantial part of the genome. On the other hand, if the context is too short or it occurs in a repetitive region of the genome, any path you traverse is likely to be chimeric and not correspond to any part of the underlying sequence.
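To make the graph idea concrete, here is a minimal sketch, assuming a toy k-mer graph rather than any real assembler. The function names (`build_kmer_graph`, `extend_unambiguously`) and the example sequences are made up for illustration. It extends a seed context only while exactly one next base is possible, which is the "one overwhelmingly likely path" case in its simplest form.

```python
from collections import defaultdict

def build_kmer_graph(reads, k):
    """Map each (k-1)-mer to the set of bases that follow it in any read."""
    successors = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            successors[kmer[:-1]].add(kmer[-1])
    return successors

def extend_unambiguously(successors, seed, max_steps=10_000):
    """Walk forward from a (k-1)-mer seed while the next base is unique.

    max_steps guards against walking around a cycle forever.
    """
    path = seed
    context_len = len(seed)
    for _ in range(max_steps):
        options = successors.get(path[-context_len:], set())
        if len(options) != 1:
            break  # dead end, or a branch where the path becomes ambiguous
        path += next(iter(options))
    return path

# Toy "genome" in which every 4-mer occurs once, read as overlapping fragments.
genome = "ACGTTGCAAGGCTTACCGA"
reads = [genome[i:i + 8] for i in range(len(genome) - 7)]
graph = build_kmer_graph(reads, k=5)
unique_walk = extend_unambiguously(graph, "ACGT")

# Repetitive toy sequence: the context "AT" is followed by both A and G.
repeat_graph = build_kmer_graph(["ATATATGC"], k=3)
repeat_walk = extend_unambiguously(repeat_graph, "AT")
```

In the first case the walk recovers the whole toy sequence from the seed. In the repetitive case it stops immediately, because any choice of next base would be a guess.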

Using similar heuristics, an LLM could estimate whether it's following a long overwhelmingly likely path, replicating substantial parts of the training data, or making choices between substantially different paths, generalizing from the data. And because the training data is usually not that big, the model could query the data directly whenever it suspects it is replicating it.
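A rough sketch of that heuristic, assuming a plain n-gram count model as a stand-in for an LLM (a real model would supply its own next-token probabilities). All names here (`train_ngram`, `longest_forced_run`, `occurs_in`) and the 0.95 threshold are hypothetical. It finds the longest stretch of output where every token was close to forced by its context, then checks that stretch against the training data.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n):
    """Count which token follows each (n-1)-token context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def next_token_prob(counts, context, token):
    """Estimated probability of token given context; 0.0 for unseen contexts."""
    followers = counts.get(tuple(context))
    if not followers:
        return 0.0
    return followers[token] / sum(followers.values())

def longest_forced_run(counts, tokens, n, threshold=0.95):
    """Return (start, end) of the longest span whose tokens were each
    predicted with probability >= threshold, including the leading context."""
    best = (0, 0)
    run_start = None
    for i in range(n - 1, len(tokens)):
        p = next_token_prob(counts, tokens[i - n + 1:i], tokens[i])
        if p >= threshold:
            if run_start is None:
                run_start = i - n + 1
            if i + 1 - run_start > best[1] - best[0]:
                best = (run_start, i + 1)
        else:
            run_start = None  # a real choice was made here; the run ends
    return best

def occurs_in(training, span):
    """Query the training data: does span appear verbatim?"""
    m = len(span)
    return any(training[i:i + m] == span for i in range(len(training) - m + 1))

training = "the quick brown fox jumps over the lazy dog".split()
output = "a quick brown fox jumps over a fence".split()
counts = train_ngram(training, n=3)
start, end = longest_forced_run(counts, output, n=3)
replicated = occurs_in(training, output[start:end])
```

On this toy input the forced run is "quick brown fox jumps over", and the lookup confirms it is copied from the training text, while the output as a whole is not.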