| HN Mirror



	by rhdunn 1125 days ago
	The problem with GPT and other LLMs is that they don't tokenize at the word or morpheme level. They use subword blocks (roughly 4 characters each on average for English text), so you get tokens like `!"` instead of two separate tokens. That makes it harder to write custom tools on top of them, unlike e.g. the output/model of things like the Universal Dependencies project.

2 comments

DougBTX 1125 days ago

Do you strictly need that level of tokenisation precision to meet your high-level goals?


morkalork 1125 days ago

This is my first reaction as well. Talking about tokenization and POS tagging is getting lost in the weeds when one has goals like this:

>I also want to be able to assess how much of the text is about a given topic, so that if I'm interested in reading a detective story from e.g. the Project Gutenberg collection, I don't want it to pick up a story where a detective is only mentioned in one paragraph.

This is more like an NLU than an NLP problem, isn't it? It's like tracking how much of a Harry Potter book contains Voldemort content without knowing ahead of time that he may be referred to as He Who Must Not Be Named, You-Know-Who, The Dark Lord and so on. One would have to first identify the thing you're interested in, then learn when characters/the author invent new ways to refer to it, and carry all those forward to find new instances. Fun!
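A minimal sketch of the easy half of that idea, assuming the aliases are already known: measure what fraction of paragraphs mention the topic under any of its names. The function names `compile_aliases` and `topic_share` and the sample paragraphs are made up for illustration; discovering new aliases as the text goes on, which is the hard part, is not attempted here.

```python
import re


def compile_aliases(aliases):
    """Build one case-insensitive pattern matching any alias as a whole phrase."""
    # Longest aliases first, so multi-word names win over shorter overlaps.
    parts = sorted((re.escape(a) for a in aliases), key=len, reverse=True)
    return re.compile(r"(?<!\w)(?:" + "|".join(parts) + r")(?!\w)", re.IGNORECASE)


def topic_share(paragraphs, aliases):
    """Return the fraction of paragraphs that mention at least one alias."""
    if not paragraphs:
        return 0.0
    pattern = compile_aliases(aliases)
    hits = sum(1 for p in paragraphs if pattern.search(p))
    return hits / len(paragraphs)


# Invented sample text, not quoted from any book.
paragraphs = [
    "Harry looked out of the window.",
    "They say You-Know-Who has returned.",
    "The Dark Lord was angry.",
    "Ron ate breakfast.",
]
aliases = {"Voldemort", "You-Know-Who", "The Dark Lord", "He Who Must Not Be Named"}

share = topic_share(paragraphs, aliases)
print(f"{share:.0%} of paragraphs mention the topic")
```

A story where the detective shows up in a single paragraph would score close to zero on this measure, which is the filter the original question asks for.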


rhdunn 1125 days ago

I also want to tag and highlight those parts of the document. For that, I need to know where the label starts and ends, which you can't really do when you don't have control over the tokens.

It's also hard to write custom inference/tagging rules, like in the case you mentioned w.r.t. Voldemort, if you don't know what the tokens look like.
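To make the boundary problem concrete, here is a rough sketch that assumes the tokenizer exposes its token strings and that they concatenate back to the original text. The token list below is hypothetical (it is not real GPT output), and `token_offsets` and `snap_to_tokens` are illustrative names only.

```python
def token_offsets(tokens):
    """Return (start, end) character offsets for each token string."""
    offsets = []
    pos = 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok)
    return offsets


def snap_to_tokens(span, offsets):
    """Widen a character span to the boundaries of the tokens it overlaps."""
    start, end = span
    covering = [(s, e) for s, e in offsets if s < end and e > start]
    if not covering:
        raise ValueError("span does not overlap any token")
    return (covering[0][0], covering[-1][1])


text = 'He said "stop!" and left.'
# Hypothetical subword split that fuses `!` and `"` into a single token.
tokens = ["He", " said", ' "', "stop", '!"', " and", " left", "."]

offsets = token_offsets(tokens)
wanted = (9, 14)                       # characters covering: stop!
snapped = snap_to_tokens(wanted, offsets)
print(repr(text[wanted[0]:wanted[1]]), "->", repr(text[snapped[0]:snapped[1]]))
```

The label was meant to cover `stop!`, but on token boundaries the closest span also swallows the closing quote, because `!"` cannot be split.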


chaxor 1125 days ago

spaCy is a decent suggestion here. It has pretty good ways of writing tagging rules.

All of this does seem excessive just for choosing a book genre, though. I would imagine the number of books left after a simple clustering pass would be small enough to flip through, so I really don't understand the use case at all.

If you have very few books (a few thousand) then you can apply more fine-grained analyses in reasonable amounts of computation, such as contextualized embedding methods. But if the point is to select a book, there is no real benefit, since simple 2-second term frequency methods would narrow the choices down to only a few books.
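For reference, a simple term frequency pass could look something like the sketch below. The helper names `term_frequency` and `rank_books` and the three tiny "books" are invented for illustration; a real run would read full texts from disk.

```python
import re
from collections import Counter


def term_frequency(text, terms):
    """Share of words in `text` that are one of the query terms."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    return sum(counts[t] for t in terms) / len(words)


def rank_books(books, terms, top_n=3):
    """Return up to `top_n` titles with a non-zero score, highest first."""
    scored = [(term_frequency(text, terms), title) for title, text in books.items()]
    scored.sort(reverse=True)
    return [title for score, title in scored[:top_n] if score > 0]


# Invented stand-ins for full book texts.
books = {
    "A": "The detective questioned the butler. The detective found a clue.",
    "B": "A long sea voyage, where a detective is mentioned once among "
         "many other words about ships and storms.",
    "C": "A cookbook of soups.",
}

print(rank_books(books, {"detective"}))
```

Dividing by the total word count is what keeps a single passing mention in a long book from ranking alongside a book that is about the topic throughout.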

If you have billions of books, contextualized embeddings become quite expensive to produce and use (several weeks or months of processing, petabytes of storage, etc.), so it's not really feasible for an individual. But the extra querying capability does help narrow the large set down.


viksit 1125 days ago

Perhaps a spaCy pipeline using GPT and Hugging Face?
