by littlestymaar 841 days ago
> There should be somewhere in the corpus, "the is spelled t h e" that this system can use to pull this out.

Such an approach would require an enormous table containing every written word, including first and last names, and it would still fail for made-up words.

A more tractable approach would be to give it a mapping from each individual token to its component letters, but then you have the problem that this mapping depends on the specific encoding used by the model (it varies between models). You could give it to the model during fine-tuning though.
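
To make the token-to-letters idea concrete, here is a rough sketch. The vocabulary is a made-up toy (`TOY_VOCAB` and both function names are my own, not any real model's tokenizer), so treat it as an illustration of the table's shape rather than how a given model does it:

```python
# Toy vocabulary: token id -> token text.
# Hypothetical values for illustration; a real model's vocab differs.
TOY_VOCAB = {0: "the", 1: " str", 2: "aw", 3: "berry", 4: " "}


def build_letter_table(vocab):
    """Map each token id to the list of characters in its text."""
    return {tok_id: list(text) for tok_id, text in vocab.items()}


def spell(token_ids, table):
    """Concatenate the letters of each token, skipping whitespace."""
    letters = []
    for tok_id in token_ids:
        letters.extend(ch for ch in table[tok_id] if not ch.isspace())
    return letters


LETTER_TABLE = build_letter_table(TOY_VOCAB)
print(spell([1, 2, 3], LETTER_TABLE))
```

The table has one row per token rather than one per word, so it is bounded by the vocabulary size, but it has to be rebuilt for every tokenizer.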

1 comment

The best approach would be to instruct it to call a function under the hood for such requests and hide the fact that it did so.
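
Something like this, as a minimal sketch. The JSON shape (`"tool"` / `"arguments"`), the `spell_word` helper, and `handle_model_output` are all assumptions of mine for illustration, not any vendor's actual function-calling API:

```python
import json


def spell_word(word: str) -> str:
    """Return the word with its letters separated by spaces."""
    return " ".join(word)


# Hypothetical registry of functions the model is allowed to call.
TOOLS = {"spell_word": spell_word}


def handle_model_output(raw: str) -> str:
    """Run a JSON tool call if the model emitted one, else pass text through.

    Only the function's result is returned, so the user never sees the call.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return raw  # plain text answer, nothing to dispatch
    if not isinstance(call, dict):
        return raw
    name = call.get("tool")
    if not isinstance(name, str) or name not in TOOLS:
        return raw
    return TOOLS[name](**call.get("arguments", {}))


print(handle_model_output('{"tool": "spell_word", "arguments": {"word": "the"}}'))
```

The spelling is then computed on the actual characters, so it works for names and made-up words too, regardless of how the model tokenizes them.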