by bonoboTP 2402 days ago
I often wonder how much of a head start the isolating nature of English gave for computing. It allowed ignoring a lot of inflectional and agglutinative complexity.

Concretely I mean it's very easy to generate text using sentence templates. Just plug in words and it works out. "The $process_name has completed running." "Like $username's comment" "Ban $username".
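To make that concrete, here is a minimal Python sketch of the three templates above, filled by plain substitution. The helper name `render_all` and the sample values are made up for illustration; the point is only that no grammar logic is involved.

```python
from string import Template

# The three templates quoted above; the values plugged in are made up.
TEMPLATES = [
    Template("The $process_name has completed running."),
    Template("Like $username's comment"),
    Template("Ban $username"),
]

def render_all(process_name, username):
    """Fill every template by plain substitution, with no grammar logic."""
    return [
        t.substitute(process_name=process_name, username=username)
        for t in TEMPLATES
    ]

for line in render_all("backup", "alice"):
    print(line)
```

Whatever word goes in, the English sentence stays grammatical, which is exactly what does not carry over to languages where the plugged-in word has to be inflected.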

Relatedly, I think focusing NLP efforts on English masks a lot of interesting phenomena, because English text already comes in a reasonably tokenized, chunked up and pre-digested, easy to handle form. For example speech recognition systems started out with closed vocabularies, with larger and larger numbers of words, and even in their toy forms you could recognize some proper English sentences. To do that in Hungarian for example, the "upfront costs" to a "somewhat usable" system are much higher, because closed vocabulary doesn't get you anywhere. (Similarly, learning basic English is very easy, you can build 100% correct sentences on day 1, you learn "I", "you", "see" and "hear" and can say "I see" and "You see" and "I see you" and "I hear Peter" which are all 100% correct. In Hungarian these are "nézek", "nézel", "nézlek", "hallom Pétert" requiring learning several suffixes and vowel harmony and definite/indefinite conjugation. The learning curve till your first 100% correct 3-5 word sentences is just steeper.)

I don't mean it's impossible to handle agglutinative languages in NLP, I just mean the "minimum viable model" is much simpler and attainable for English, which on the one hand was able to kickstart and propel the early research phases and on the other hand perhaps fueled a bit too much optimism.

English can seem very well structured, and that can tempt one to think of language in a very symbolic, within-the-box, rule-based way: in terms of syntax trees, sets of valid sentences, etc., instead of the "fuzzy probabilistic mess" that it really is. Sure, the syntax tree, generative grammar approach (Chomsky and others) gave us a lot of computer science, but this kind of "clean", purely symbolic parsing doesn't seem to drive today's NLP progress.

In summary, I wonder how linguistics and especially computational linguistics and NLP would have evolved in a non-Anglo culture, e.g. Slavic or Hungarian.

5 comments

>I often wonder how much of a head start the isolating nature of English gave for computing.

That's like saying "I wonder when you stopped beating your wife"; you assume there was a head start, when, in fact, the world's first commercial computer was German[1].

And until recently, natural languages had a near-zero effect on computing. Worst case, users ended up seeing messages which weren't grammatically perfect, and it wasn't a big deal.

>I wonder how linguistics and especially computational linguistics and NLP would have evolved in a non-Anglo culture, e.g. Slavic

Would have? NLP has only started to matter recently, at a time when it has to work in all languages from the get-go. The current evolution includes contributions of people from many languages and cultures.

And for that matter, English makes a lot of things harder.

[1]https://en.wikipedia.org/wiki/Z4_(computer)

> you assume there was a head start, when, in fact, the world's first commercial computer was German

Did the Z4 do a lot of German language text generation, or German language input parsing? Anyway, German is not agglutinative either, though it does have complexities like gendered declension of articles and adjectives.

> And until recently, natural languages had a near-zero effect on computing.

Seems like we're talking past each other and I packed multiple things in the comment. I meant user-facing messages there. I did some software internationalization (translation) work some years ago and in many cases the format was just templates. You were often expected to translate templates with pluggable strings. Whereas what you would actually need is to write a function that looks at the word that you want to plug in, extracts the vowels, categorizes them with some branching logic, looks at the last consonant, decides if you need a linking vowel, decides on the vowel harmony based on the vowels, looks up whether it's an exception, and then applies the suffix.
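A rough Python sketch of such a function, for just one suffix: the Hungarian instrumental -val/-vel ("with"). Everything here is a simplification and the names (`with_suffix`, `EXCEPTIONS`) are made up. Vowel harmony is reduced to "any back vowel picks -val", words that already end in a doubled consonant are not handled, and this particular suffix needs no linking vowel, so that step is missing.

```python
BACK_VOWELS = set("aáoóuú")
FRONT_VOWELS = set("eéiíöőüű")
# Multi-letter consonants, longest first so "dzs" is tried before "zs".
CLUSTERS = ("dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs")
# Hypothetical exception table; a real one would be far longer.
# "híd" has only a front vowel but still takes the back-vowel suffix.
EXCEPTIONS = {"híd": "híddal"}

def uses_back_suffix(word):
    """Simplified vowel harmony: any back vowel selects -val."""
    return any(ch in BACK_VOWELS for ch in word.lower())

def with_suffix(word):
    """Attach -val/-vel to a word (simplified, see assumptions above)."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    vowel = "a" if uses_back_suffix(word) else "e"
    lower = word.lower()
    last = lower[-1]
    if last in BACK_VOWELS or last in FRONT_VOWELS:
        # A final short a/e lengthens before the suffix: Anna -> Annával.
        stem = word[:-1] + {"a": "á", "e": "é"}.get(last, word[-1])
        return stem + "v" + vowel + "l"
    for cluster in CLUSTERS:
        if lower.endswith(cluster):
            # The v assimilates to the final consonant; a multi-letter
            # consonant doubles only its first letter: Balázs -> Balázzsal.
            return word[: -len(cluster)] + cluster[0] + cluster + vowel + "l"
    # A single final consonant is doubled: Péter -> Péterrel.
    return word + last + vowel + "l"

for name in ("Péter", "Anna", "Balázs", "híd"):
    print(name, "->", with_suffix(name))
```

Even this toy version needs three branches and a lookup table to replace the single English word "with".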

In English you can generate the message "Added %s to the %s." These are usually translated to Hungarian as if it was "%s has been added to the following: %s". Or instead of "with %s" they must write "with the following: %s", because applying "with" to a word or personal name requires non-trivial logic. Whenever the translators resort to "... the following: %s", you can know they weren't able to fit it into the sentence with proper grammar due to the use of too primitive string interpolation-based internationalization.

Until recently, Facebook was not able to apply declension to people's names, as it is quite complicated. Normally "$person_name likes this post." would require putting $person_name into dative case, requiring determination of vowel harmony. To avoid it, they picked a rarer verb form which doesn't need the dative case but doesn't sound as natural. They've only transitioned to the dative case in the last year or so.

A lot of this stuff is just not even on the mind of English speaking devs, because template-based string interpolation is a good enough solution in English for the vast majority of cases. The only exception that would need a little bit of branching logic is applying "a" or "an" before a word or pluralization, but these don't come up too often.
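For comparison, the English branching logic fits in a few lines. This is a sketch with made-up helper names, not a real library: the article choice looks only at spelling (so it gets "an hour" and "a user" wrong), and irregular plurals have to be passed in by hand.

```python
def indefinite_article(word):
    """Pick "a" or "an" from the first letter (spelling heuristic only)."""
    return "an" if word[:1].lower() in "aeiou" else "a"

def pluralize(count, singular, plural=""):
    """Return e.g. "1 file" or "3 files"; pass irregular plurals explicitly."""
    if count == 1:
        return f"{count} {singular}"
    return f"{count} {plural or singular + 's'}"

print(f"Found {indefinite_article('error')} error in {pluralize(3, 'file')}.")
```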

Again, my point was that dynamically generating user-facing messages and UI elements is so easy in English, while doing it properly in other languages is much harder.

> Would have? NLP has only started to matter recently, at a time when it has to work in all languages from the get-go. The current evolution includes contributions of people from many languages and cultures.

Most of the research outside of explicit machine translation research is still based on English. How many papers are out there, e.g., on visual question answering (VQA) systems in Polish or Finnish? In many cases I feel less impressed by such systems because I feel like English is too easy. The order is very predictable, the words are easily separable, the whole thing is much more machine processable. Maybe it isn't so; it would be interesting to see empirical results.

Ah. On that note, I guess my point was that language was never an impediment to UI.

Sure, some things will be easier in English. In other languages, the programmers would just roll with whatever is easier to code; the users would gobble it up as long as it's usable.

Back in the 90's, I saw pirated software "internationalized" by running the UI keywords through machine translation into Russian. Knowing English was an advantage: if you translated the UI back into English, you could figure out what some of those things did. Still, it existed.

The complexity of language wasn't an impediment, it just lowered expectations for the quality of user interfaces.

Agglutinative languages would probably work as well as isolating languages, since they tend to work by just shoving things on the end of words rather than inflecting them. It does potentially raise a segmentation problem, but I'm not really sufficiently familiar with any agglutinative language to know how hard a problem it is in practice.

The difficult languages are inflectional languages, where you make things completely different instead of just tacking something on the end.

It's worth pointing out that all whitespace is completely optional in Fortran, the first programming language--doi=0,10 is exactly the same as DO I = 0, 10. So it's not like early computing relied heavily on gratuitous whitespace.

Possibly less than you think. (I'm not addressing the NLP part)

For example, every language is already used to math formatting. Programming languages draw more inspiration from math formatting than English.

That leaves naming. Here agglutinative languages should have an advantage. You can have more natural ways to describe roles like how in English we may have caller and callee, rather than more clumsily camel-casing something like sumOfLists.

> linguistics and especially computational linguistics and NLP would have evolved in a non-Anglo culture, e.g. Slavic or Hungarian.

Probably not much different, except that more elements of morphology are treated together with syntax.

> Concretely I mean it's very easy to generate text using sentence templates. Just plug in words and it works out. "The $process_name has completed running." "Like $username's comment" "Ban $username".

If computing were primarily championed by speakers of a fusional language (agglutinative languages usually have somewhat "clean" morphology), I imagine that libraries for inflection would be more prominently used, much as more professional English-language apps use a pluralizer library. One natural design for an inflection API is a fluent API.
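A fluent inflection API could look something like the Python sketch below. The `Word` class and its methods are purely hypothetical, not any existing library, and the English rules are deliberately naive; the point is only the chaining style, where each call returns a new inflected form.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Word:
    """Hypothetical fluent wrapper: each method returns a new Word."""
    text: str

    def plural(self):
        # Naive rule: append "s"; a real library would know irregular nouns.
        return Word(self.text + "s")

    def possessive(self):
        # Words already ending in "s" take a bare apostrophe in this sketch.
        return Word(self.text + ("'" if self.text.endswith("s") else "'s"))

    def __str__(self):
        return self.text

print(f"Reset all {Word('user').plural().possessive()} passwords")
```

For a language with real case morphology the chain would carry more calls (case, number, definiteness), but the shape of the API would stay the same.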

Certainly English's morphosyntactic simplicity helped out NLP; your phrase "minimum viable model" hits the nail on the head. But over the last 5-10 years, I think there has been a lot of progress on techniques for handling morphological complexity. Some of the unsupervised tokenization methods that first saw use for English (e.g. Goldsmith's work) now see play for agglutinative languages: see here for example[0]. So it's not clear to me if NLP in a non-Anglo culture would just use the same techniques (arriving at practical achievements a decade later) or if there would be fundamentally different techniques that are totally unobvious to me now.

Re your point on language being a "[f]uzzy probabilistic mess" -- language is absolutely NOT a fuzzy probabilistic mess and it's a damn shame that NLP based its success on black-box models, because it means no one bothers realizing that language isn't a mess at all. See Jelinek's law of speech recognizer accuracy [1]. Simply because we get results using messy black-box models doesn't mean that's how things work under the hood.

[0] https://www.researchgate.net/publication/221013038_Unsupervi...

[1] https://en.wikipedia.org/wiki/Frederick_Jelinek

The first 40ish years of computing were dominated by machines with a paucity of online storage. It would be more than just a 10 year delay.
Being able to encode English text reasonably in 5 bits and comfortably in 6 (adding case and a few last nice symbols) was helpful too.
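The arithmetic behind the 5 and 6 bits, as a small Python sketch. The symbol counts are my own assumptions (26 letters for one case, 52 for both), not a description of any particular historical character code.

```python
def bits_needed(symbol_count):
    """Smallest number of bits whose code space holds symbol_count symbols."""
    return max(1, (symbol_count - 1).bit_length())

def spare_codes(symbol_count):
    """Codes left over for digits, punctuation, etc. at that bit width."""
    return 2 ** bits_needed(symbol_count) - symbol_count

print(bits_needed(26), spare_codes(26))  # one case of the English alphabet
print(bits_needed(52), spare_codes(52))  # upper and lower case together
```

So a single-case alphabet fits in 5 bits with 6 codes to spare, and both cases fit in 6 bits with 12 codes left for a few symbols.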