by mkasu 2134 days ago
NLP is not my main field, but it is still relevant to my work because I often use models and resources from NLP as tools. I'm also personally interested in linguistics and languages, so I follow related news, sometimes attend NLP conferences, and follow people in those fields on social media.

It is very concerning how little thought is usually put into linguistic or language characteristics when dealing with these topics. I also rarely see cultural considerations etc. Basically everything is treated as "machine learning will hopefully get this right given enough data", which is unfortunate (ML is a great tool, but the conferences are about language processing).

Another big issue I noticed is that a majority of research only targets or evaluates English texts. In many cases the language is not even specified (although it is clear from figures or examples that they use English). I have even heard people complain that work on non-English data is treated as too minor by many reviewers, so it often just gets rejected.

I think this is a really weird development for a field which centers around natural languages.

2 comments

While I sort-of recognize the emotion you describe in myself, it cannot be ignored that these ignoramuses are simply blowing "traditional" research out of the water in terms of results. That's true across the board, from NLP to image data to computational biology.

It's also a bit simplistic to consider it a bifurcation between "traditional" linguists and AI experts entirely ignorant of the discipline. Long before the current wave of AI started, Google liked to hire linguists and computational scientists. These teams probably do have plenty of subject matter experts, but for now they are reaping the low-hanging fruit of the suddenly-improved generic methods. As marginal improvements inevitably diminish, subject matter expertise will become more salient again.

I'm a computational biologist by training, and have great appreciation for the often beautiful algorithms, many created in the 70s or 80s and allowing then-spectacular feats of tackling large datasets. Unfortunately, it isn't always obvious how to transfer that knowledge to the new way of doing things.

Yes, the apparent performance of (especially) neural models compared to traditional models is probably the main factor. That said, some voices [1] argue that traditional or much simpler approaches often still do a comparable job to heavily over-engineered models, especially when going even slightly beyond an existing target dataset or task.

I'd argue that improving the ML models is really the job of ML researchers and should mainly target ML conferences like AAAI (Adv. of AI). At other conferences (directly targeting NLP, CV, Comp. Biology, etc.) the main job should be to combine those models with domain-specific characteristics (e.g., language information for NLP) or "traditional" methods to make it an interesting discussion.

I recently reviewed for a multimedia conference, and quite a lot of the papers I reviewed were basically pure ML papers. A colleague had the same experience.

1: https://arxiv.org/abs/1907.06902

The ML papers wouldn't bother me if they included specialists of the targeted domain to address the obvious pitfalls. I've analyzed the figures in the blog post and skimmed the paper, and both one novelty claim ((2) A single massively multilingual model spanning 109 languages and showing cross-lingual transfer even to zeroshot cases.) and one "explanation" (Such positive language transfer across languages is only possible due to the massively multilingual nature of LaBSE) can be debunked just by looking carefully at the figures, as I did over the past hour. The languages they test on are also poorly selected (6 constructed languages, one duplicate and one macro-language), which shows a clear lack of attention to detail and a poor understanding of some basic linguistic notions. But hey, it's an ML paper, it's from Google and it has BERT in the title, so it gets attention and will be cited even if it's half-crap.

Yes, this is exactly my point. NLP is about processing language (which has a century-old field dedicated to it), yet the new trend is to totally discard that as a minor detail. It's not. It's also fine if people mostly focus on English, but then they should be clear about it and not claim to address language in general when they are in fact doing English processing in particular.