


	by dmurray 364 days ago
	For the 800 names that were missing declension data in the database, it seems like the most straightforward thing to do would be to assign their declensions by hand. It shouldn't take a native speaker more than a couple of hours (if some name they haven't seen before is ambiguous, then whatever they guess at least won't sound obviously wrong to other native speakers). Alternatively, it would be very cheap to ask an LLM to do it. Encoding them into a trie like this would still be a good way to distribute the result, but you wouldn't have to rely on the trie also being a good way to guess the declensions.
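To make the "trie as distribution format" idea concrete, here is a minimal sketch in Python. It is not the article's implementation: the `SuffixTrie` class, the `decline` helper, and the pattern encoding are all hypothetical, and the two entries are toy data rather than DIM data. The "ur" rule in particular is deliberately naive, which is exactly the kind of guess that a hand-assigned entry would override.

```python
from typing import Dict, Optional, Tuple

# Hypothetical encoding: (nominative ending to strip,
#                         endings for nominative, accusative, dative, genitive)
Pattern = Tuple[str, Tuple[str, str, str, str]]


class SuffixTrie:
    """Trie keyed on name characters read from the end of the name."""

    def __init__(self) -> None:
        self.children: Dict[str, "SuffixTrie"] = {}
        self.pattern: Optional[Pattern] = None

    def insert(self, suffix: str, pattern: Pattern) -> None:
        node = self
        for ch in reversed(suffix.lower()):
            node = node.children.setdefault(ch, SuffixTrie())
        node.pattern = pattern

    def lookup(self, name: str) -> Optional[Pattern]:
        # Walk from the last letter backwards; the deepest pattern wins.
        node, best = self, self.pattern
        for ch in reversed(name.lower()):
            nxt = node.children.get(ch)
            if nxt is None:
                break
            node = nxt
            if node.pattern is not None:
                best = node.pattern
        return best


def decline(trie: SuffixTrie, name: str) -> Optional[Tuple[str, ...]]:
    pattern = trie.lookup(name)
    if pattern is None:
        return None  # no data and no guess available
    strip, endings = pattern
    stem = name[: len(name) - len(strip)]
    return tuple(stem + ending for ending in endings)


trie = SuffixTrie()
# Toy guessed rule: treat every name ending in "ur" the same way.
trie.insert("ur", ("ur", ("ur", "", "i", "ar")))
# Toy hand-assigned entry: a whole name stored as its own (long) suffix.
trie.insert("Jón", ("", ("", "", "i", "s")))

print(decline(trie, "Guðmundur"))
print(decline(trie, "Jón"))
print(decline(trie, "Anna"))  # nothing matches in this toy trie
```

Because a full name is just a long suffix, hand-assigned entries and guessed rules can live in the same structure, and the more specific entry takes precedence at lookup time.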

4 comments

alexharri 363 days ago

It would be good to cover more names for sure -- that's an ongoing process at DIM. Names are frequently added to the approved list of Icelandic names, so there's always going to be some lag.

I would not be confident enough to add the data myself since I'd probably be wrong a lot of the time. When reviewing the results for the top 100 unknown names I frequently got results that I thought _might_ be wrong, but I wasn't sure. For those, I looked up similar names in DIM to verify, and often thought "huh, I would not have declined those names like this". For that reason, I rely on the DIM data as the source of truth since it's maintained by experts on the language.


perching_aix 364 days ago

Yeah, that'd be a good idea. That said, it still wouldn't resolve the issue for names that are in use despite not being approved (or foreign names).

I also live in a country with a centrally governed personal name list, but you can request exceptions, and there are people who were born before the list existed, so their names won't necessarily be on the list either. Immigrants can also retain their names during naturalization, I believe, and there can be lots of other complications still. So the ability to sorta-kinda predict the proper declension is still useful.


thaumasiotes 364 days ago


wizzwizz4 364 days ago

I see no reason that an LLM should be better at guessing than a trie (unless the actual example was in its training data, in which case a web search would be more appropriate).


dmurray 364 days ago

I agree. I just like having the guessing done at compile time on principle. It allows you to change a guess, if you find that it's wrong, and convince yourself that you haven't broken any of the other cases where you were previously accidentally right.
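One way to picture that workflow is to materialize every guess into a table at build time, then diff the table whenever a guessing rule changes. The sketch below assumes two toy guessers (`old_guess`, `new_guess`) and hypothetical helpers (`build_table`, `changed_entries`); none of these names come from the article or from DIM.

```python
from typing import Callable, Dict, Iterable, Tuple

Forms = Tuple[str, str, str, str]  # nominative, accusative, dative, genitive
Guesser = Callable[[str], Forms]


def build_table(guess: Guesser, names: Iterable[str]) -> Dict[str, Forms]:
    """Run the guesser once, ahead of time, and freeze the results."""
    return {name: guess(name) for name in names}


def changed_entries(
    old: Dict[str, Forms], new: Dict[str, Forms]
) -> Dict[str, Tuple[Forms, Forms]]:
    """Names whose frozen forms differ between two builds."""
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }


def old_guess(name: str) -> Forms:
    # Toy rule: keep the name and append fixed endings.
    return (name, name, name + "i", name + "s")


def new_guess(name: str) -> Forms:
    # Toy revision: special-case names ending in "ur".
    if name.endswith("ur"):
        stem = name[:-2]
        return (name, stem, stem + "i", stem + "ar")
    return old_guess(name)


names = ["Jón", "Guðmundur"]
before = build_table(old_guess, names)
after = build_table(new_guess, names)

# Review this diff by hand before shipping the new table.
for name, (was, now) in changed_entries(before, after).items():
    print(name, was, "->", now)
```

If the diff contains only the names you meant to fix, the cases that were previously right (even accidentally) are untouched.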


wizzwizz4 363 days ago

My main objection is the temptation to mix real and fabricated data. Your entire dataset becomes much less useful if it's got nonsense mixed in with it, and if historical examples are anything to go by, it can be hundreds of years before someone identifies and untangles the nonsense from the fact. Any minor benefit is not worth this risk imo.


esafak 364 days ago

I wonder if existing LLMs already know these patterns?


jer0me 363 days ago

The Icelandic government has been proactive about helping OpenAI train its models on the language to stave off extinction: https://openai.com/index/government-of-iceland/


xigoi 363 days ago

If only they’d support open-source models instead, so the future of the language is not in the hands of a single foreign corporation…


thaumasiotes 363 days ago

Yes, this is an example of a problem that an LLM is ideally suited to solve.
