HN Mirror



	by samus 357 days ago
	Shouldn't production models already do this? They already tend to use tokenizers with complex rules to handle input that would otherwise be tokenized suboptimally. I recall a bug in an inference engine (maybe llama.cpp?) caused by a difference between its regex engine and the one used by the model trainer, which implies the tokenizer used regex-based rules to chop up the input.
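For anyone wondering what "regex-based rules to chop up the input" can look like, here is a minimal sketch. The pattern and the `pretokenize` name are made up for illustration and are not taken from any specific model or from llama.cpp; real tokenizers use more elaborate patterns and then run a subword merge step on each chunk.

```python
import re

# Hypothetical pre-tokenization pattern (an assumption, not a real model's):
# runs of letters, digit groups of at most 3, runs of whitespace,
# and runs of anything else (punctuation, symbols).
PRETOKENIZE = re.compile(r"[A-Za-z]+|\d{1,3}|\s+|[^\sA-Za-z\d]+")

def pretokenize(text: str) -> list[str]:
    """Split text into chunks before any subword merging would happen."""
    return PRETOKENIZE.findall(text)

print(pretokenize("price 12345 ok!"))
# → ['price', ' ', '123', '45', ' ', 'ok', '!']
```

Note how "12345" ends up as two chunks. If two regex engines disagree on even one of these rules, the trainer and the inference engine produce different chunks for the same input, which is the kind of mismatch described above.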

1 comment

search_facility 356 days ago

Turns out, no. Intuitively they should do this for sure, but they don't.

UPD: Found the paper:
- https://huggingface.co/papers/2502.09741
- https://fouriernumber.github.io/

In the paper, a “number” is a single sort-of “token” with a numeric value, so the network deals with numbers as real numbers, separately from their character representation. All the math happens directly on the “number value”. In the majority of current models, numbers are handled as sequences of characters.
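A rough sketch of the contrast, in case it helps. This is not the paper's embedding method. It only shows the input side: characters versus one "token" that carries a value. The function names and the `<num>` marker are made up for illustration.

```python
import re

# Hypothetical matcher for unsigned integers and decimals (an assumption).
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def as_char_tokens(text: str) -> list[str]:
    """Character-level view: a number is just a sequence of characters."""
    return list(text)

def as_number_tokens(text: str) -> list:
    """Single-token view: each number becomes one ("<num>", value) item."""
    tokens: list = []
    pos = 0
    for match in NUMBER.finditer(text):
        tokens.extend(text[pos:match.start()])  # non-number characters, one by one
        tokens.append(("<num>", float(match.group())))
        pos = match.end()
    tokens.extend(text[pos:])  # trailing characters after the last number
    return tokens

print(as_char_tokens("3.5+12"))    # ['3', '.', '5', '+', '1', '2']
print(as_number_tokens("3.5+12"))  # [('<num>', 3.5), '+', ('<num>', 12.0)]
```

In the second form a model could, in principle, operate on `3.5` and `12.0` as values instead of reassembling them from digits.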
