by embedding-shape 26 days ago
Yeah, was about to comment that too. Instead of training a new model and new weights exclusively for Norwegian (and expecting/wanting every other small/medium-sized country to do the same), which seems infinitely harder, they could have made high-quality transcriptions and translations into English of the stories currently described only in Norwegian, and made it all public. I guess there would still be a worry that it'd be counted as "less important" compared to history, news and culture about other countries.
3 comments

Oddly enough, my wife was recently involved in a project to translate historical crime novels from Norwegian; since all the available late 20th century Scandinavian crime novels have already been translated and turned into popular TV series, the plan was to go further back. Into the 1930s. The first cut was done with LLMs, but encountered the problem that (a) Norwegian itself has changed noticeably since then, in both major dialects, and (b) the machine translation deteriorated on large sections, resulting in entirely missing paragraphs and pages in a few places. Not to mention the usual translation issues (what police role does lensman map to?) and localisation (to what extent should the casual antisemitism be left in or removed?)

Translation is never a bijective process. It's never quite the same experience in translation as it is in the original, due to the cultural differences between reader and writer. Larger in this case because 1930s Norway is very different even from 2020s Norway.

Ultimately this was not a success due to marketing difficulties; it is very difficult to get a book noticed.

( https://www.amazon.co.uk/Iron-Chariot-Nordic-Crime-Library/d... )

Sorry if I was unclear, I didn't want to give the impression that I think translations, or even transcriptions in some cases, are easy, or without problems, or not painstakingly time-consuming. They very much are.

I just think building an LLM from scratch is even harder, with more potential problems that are harder to solve, more time-consuming and even more resource-intensive.

It would require an investment, but that will pay dividends later, as it becomes easier to train LLMs on/for Norwegian. If we need to translate everything to English we might as well just drop using Norwegian altogether. Practically everyone speaks English fluently already...

> as it becomes easier to train LLMs on/for Norwegian

Why would it be easier in the future? The advances we see with LLMs today require a huge amount of data, and it's getting hard to gather that much data in any language. I'm having a hard time seeing how it'd get easier for Norwegians to build their own LLM, unless they seriously start to ramp up how much Norwegian content they're putting out.

> If we need to translate everything to English we might as well just drop using Norwegian altogether. Practically everyone speaks English fluently already...

Yeah I mean with that black and white perspective you can pretty much do anything and it won't matter for anything :) I think for the rest of us, what we speak daily and what we rely on professionally, can differ, and that's OK. But maybe this is just my broken Swedish mind being so used to using English professionally but then conversing in Spanish outside of work daily, YMMV.

These models will never compete with frontier models and do not need to - it is about hitting good enough, not being the best. Behind the frontier, getting to a certain performance level is getting easier over time - both sample and compute efficiency are going up.

Furthermore one can reuse investments in data (both agreements, infrastructure and datasets), compute (GPUs, servers) and know-how (training scripts, experienced engineers).

But are you seriously under the belief that all of that, plus all the other things you're forgetting about, is easier, cheaper and faster than transcriptions and translations?

I understand and agree building the LLMs yourself comes with more benefits, long-term ones especially, but still it's harder, more expensive and really time consuming work.

> in both major dialects

Nynorsk and bokmål are not dialects but variants of written Norwegian.

> high quality transcriptions and translations of the stories currently described only in Norwegian into English

You make it sound like an easier task than training an LLM. I'd argue it's not obvious, and would assume the contrary.

Yes, why wouldn't it be easier to transcribe and translate, skills humanity has had for centuries, compared to LLMs that we've only learnt to build these last few years, and that even require a frikken computer to do? Of course one of these is harder than the other...

Look at it through this lens: translating and transcribing these stories hasn't happened in the centuries they have existed, while, as you point out, the skills were always there. In contrast, LLMs have been here for a few years at most and everyone and their dog is trying to get in on the "race".

With absolutely no insight into why, which one has better odds of happening first is obvious to me.

Sure, it isn't as "hot" to translate stuff as it used to be some hundreds of years ago, and building LLMs surely is "hot" today, I don't doubt more people are attempting to build LLMs today than translating huge datasets, especially if we narrow the two to exclusively "In Norwegian".

Having insight into translation, transcription and attempting to build LLMs myself, I'm fairly sure which effort would be successful first, regardless of how many attempt it.

Copyrights and statutes don't allow them to do that. The mandate of the National Library maybe permits them to make an LLM, though (and I won't be at all surprised if someone sues them anyway).