Hacker News
European Languages Datasets (european-language-grid.eu)
65 points by pragmaticalien8 2051 days ago
3 comments

There's also "The Big Bad NLP Database", a much larger resource.

https://datasets.quantumstat.com/

I get eye cancer from the date format they use...
I absolutely do not understand the purpose of the following:

https://live.european-language-grid.eu/catalogue/#/resource/...

https://live.european-language-grid.eu/catalogue/#/resource/...

Since when are strict EBNF grammars useful for natural language processing?

There's a research direction on 'controlled natural language', which is essentially a limited subset of a natural language that still allows you to express everything you need for a particular problem domain.

They have their uses in natural language generation, where you may want to output some data in a way that's more readable to humans, and in various specialized query languages. For example, in some tasks you may prefer a voice command system that has more flexibility than mere keywords, where the instructions are phrases matching the users' language (you might want the same product to support many languages) but the system needs to understand only a very limited subset of language that can be expressed with a strict grammar, mainly because its ability to do stuff is also limited to what that subset can express. And this provides reliability: you can verify that the limited set of expressions the system can understand get understood properly, and those that aren't clear get rejected. This is bad for some use cases and good for others; picking a 'best effort' most likely interpretation (which many state-of-the-art methods do now) might be desirable or completely unacceptable depending on your use case.

The benefit of a strict grammar over (for example) NN transformer architectures for NLG and NLU is that it's relatively straightforward to map the structures of that grammar to the structured data that your non-NLP code is using for the business logic; you can have a clear and debuggable 1-to-1 mapping for the semantics of these phrases.
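
To make that concrete, here is a minimal sketch of the idea, not anything taken from the ELG resources. The smart-home command grammar, the `Command` type, and `parse_command` are all made up for illustration. The toy grammar is small enough to be regular, so a regex stands in for a real EBNF parser; a larger grammar would need a proper parser generator.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical controlled language for a voice command system.
# Grammar (EBNF), invented for this example:
#   command = action, " the ", device, [ " in the ", room ] ;
#   action  = "turn on" | "turn off" ;
#   device  = "light" | "heater" | "fan" ;
#   room    = "kitchen" | "bedroom" | "office" ;

ACTIONS = {"turn on": True, "turn off": False}
DEVICES = ("light", "heater", "fan")
ROOMS = ("kitchen", "bedroom", "office")


@dataclass(frozen=True)
class Command:
    """Structured data the non-NLP business logic would consume."""
    device: str
    power_on: bool
    room: Optional[str]


# Each grammar rule maps to exactly one named group, which in turn
# maps to exactly one field of Command.
COMMAND_RE = re.compile(
    r"(?P<action>{})\s+the\s+(?P<device>{})(?:\s+in\s+the\s+(?P<room>{}))?".format(
        "|".join(ACTIONS), "|".join(DEVICES), "|".join(ROOMS)
    )
)


def parse_command(text: str) -> Optional[Command]:
    """Return a Command if the text is in the grammar, else None (rejected)."""
    normalized = " ".join(text.lower().split())
    match = COMMAND_RE.fullmatch(normalized)
    if match is None:
        # No 'best effort' guess: anything outside the grammar is rejected.
        return None
    return Command(
        device=match["device"],
        power_on=ACTIONS[match["action"]],
        room=match["room"],
    )


if __name__ == "__main__":
    print(parse_command("Turn on the light in the kitchen"))
    print(parse_command("could you maybe dim the lights a bit"))
```

Because the set of accepted phrases is finite and explicit, you can enumerate and test all of it, which is the reliability argument above. Supporting another language would mean adding a second grammar that maps onto the same `Command` fields.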

"Finance English" may not be a natural language
Every language has something like that. In Germany it's called "Behördendeutsch" (administrative German). Even as a native speaker with good language skills, you have to read all forms and letters at least twice to make sense of them.
Though this is an .eu domain, the languages covered are more than just the official EU ones. A quick glance showed entries for Basque (somewhat politically controversial) and Turkish (not part of the EU, though partially in continental Europe).
It's entirely uncontroversial that Basque is being spoken as a native language by EU citizens. What's potentially controversial is merely how much official recognition this fact should receive from the state in which they reside.

On the other hand, the corpus also contains "Chinese" (I assume Mandarin; I haven't checked), which I don't think even the most enthusiastic pan-Europeans are trying to claim yet.

How is Basque "politically controversial"? It is an official language in the Basque Autonomous Community of Spain. Turkish is also a minority language in Greece. It doesn't have official status, but they have their own government-mandated Turkish-language schools.
I was not aware that Turkish-language schools were in place in Greece. That's great!

Admittedly, I don't know the current state of the politics of Basque, but when I lived in France its use was not encouraged in the Basque region.

For these purposes, I'm not sure that its political status is relevant. It's a language; there are a significant number of people in Europe who speak it. So it belongs on a list of European languages.
The language outside politics isn’t, but if you start using it as a weapon in politics (“there is a Basque language, so there must be a Basque Country”), it can easily become controversial.

That certainly is the case when you start killing people for that cause (https://en.wikipedia.org/wiki/Basque_conflict)

You do realize that militant Basque nationalist paramilitary groups have been disarmed and disbanded for more than 10 years, yes?

It's like saying Irish is controversial because the Provisional IRA existed as an organization in the past...

Which brings up an interesting point. What makes a language European? English, French, etc. are official languages in many African and Asian colonies due to conquest. Does it make French and English Asian or African languages? Is Russian an Asian language or a European language?
> What makes a language European?

That one's simple: it's a native language in a European country.

> Does it make French and English Asian or African languages?

No, since the official language is not the native language of and in these countries. Afrikaans, on the other hand, is an African language, though it originated from (and is still very close to) Dutch.

> Is Russian an Asian language or a European language?

Russian is still a European language, despite most of Russia being located in Asia. The Asian parts of Russia have their own regional native languages (35 or so in total), with more than 20 official ones.

https://www.european-language-grid.eu/about/:

“The European Language Grid fosters Language Technologies FOR Europe built IN Europe, tailored to our languages and cultures and to our societal and economical demands, benefitting the European citizen, society, innovation and industry.”

⇒ it is European (language grid), not (European language) grid.