| HN Mirror


	by LKAndrew 2644 days ago
	How would it be possible that 40,000 words translate to 400,000 bits?! Am I missing something here?

6 comments

simias 2644 days ago

Compression I suppose. Like storing the word "compression" is easier if you already know the familiar "com-" prefix as well as the noun "pression". Which is itself easier to remember if you know the verb "to press".

I've been actively learning Portuguese and Russian lately, and it's impressive how much faster I can pick up Portuguese vocabulary vs. Russian. And that's even for words that don't have an obvious cognate in languages I already know. The structure of the words, the various building blocks are just so much more familiar in Portuguese. A word like "atrever" (to dare) doesn't have any obvious cognate in languages that I know but it just "looks right" in a way that, say, "atverer" or "aterver" wouldn't. Those last two words sound distinctly un-Portuguese (I might say, un-Romance) to me. That makes it a lot easier to remember the spelling.

Eventually, as I grow my Russian vocabulary, I start making similar connections. Волноваться is pretty tricky to memorize on its own but it becomes easier when you know that ~ся is the reflexive, ~ть is the common verb ending, волна means wave and ~овать is a very common building block for Russian verbs.

mrec 2644 days ago

"Compression" applies very much to grammatical variants of words. It's why the only bizarre irregular verbs in a language tend to be the ones that get used all the time - be, do, go etc - because for anything more obscure the brain just forgets the special case and applies the general rule.

Steven Pinker's book Words and Rules is a great layman-oriented read if this sort of thing interests you.

coldtea 2644 days ago

It also makes sense why the most common verbs are irregular in most languages: to have us pick the direct word quickly, instead of the slower way of deriving it from a rule.

So, they're like "constants" vs calling a function to calculate a value.

mrec 2644 days ago

Not sure I follow. If (big if) there were really a measurable advantage to having us "pick the direct word quickly, instead of the slower way of deriving it from a rule", it doesn't follow that irregularity makes that easier. I could memorize "goed" just as easily as I can memorize "went".

coldtea 2643 days ago

>I could memorize "goed" just as easily as I can memorize "went".

It wouldn't have anything distinctive for people to latch on to, so they would be constantly trying to derive it from the general rules for regular verbs.

E.g.

(a) all verbs regular -> instinctively go to (slower) rule derivation instead of memorization of all verbs, even for the most common ones.

(b) most frequent verbs being irregular -> instinctively retrieve them from the (faster) "lookup table" of memorizations, and bypass the rule based derivation for them.

I.e., having the clear distinction of irregularity makes it faster to go directly to that kind of "constant" memory.
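
The "constants vs. calling a function" picture can be sketched in a few lines of Python. This is only an illustration of the analogy, not a model of how the brain works: the table `IRREGULAR_PAST`, the helper `regular_past`, and the crude "-ed" rule are all assumptions made up for the example.

```python
# Hypothetical illustration: a handful of frequent irregular verbs sit in a
# lookup table; every other verb falls through to a crude "-ed" rule.
IRREGULAR_PAST = {"go": "went", "be": "was", "do": "did", "have": "had"}


def regular_past(verb: str) -> str:
    """Very rough English rule: add -d after a final 'e', otherwise -ed."""
    return verb + "d" if verb.endswith("e") else verb + "ed"


def past_tense(verb: str) -> str:
    """Return a memorized form if there is one, else derive it by rule."""
    memorized = IRREGULAR_PAST.get(verb)  # fast path: direct retrieval
    if memorized is not None:
        return memorized
    return regular_past(verb)  # slow path: rule-based derivation


for verb in ("go", "walk", "dare"):
    print(verb, "->", past_tense(verb))
```

In this picture, case (a) corresponds to an empty table, so every verb takes the rule path, while case (b) routes the most frequent verbs through the table.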

That said, this is not my theory, read it years ago in a cognitive/linguistic pop science article. This seems to say more or less the same thing:

https://en.wikipedia.org/wiki/Regular_and_irregular_verbs#Li...

>In studies of first language acquisition (where the aim is to establish how the human brain processes its native language), one debate among 20th-century linguists revolved around whether small children learn all verb forms as separate pieces of vocabulary or whether they deduce forms by the application of rules. Since a child can hear a regular verb for the first time and immediately reuse it correctly in a different conjugated form which he or she has never heard, it is clear that the brain does work with rules; but irregular verbs must be processed differently.

simias 2644 days ago

I took it the other way around: it's easy to memorize "went" because you use it all the time. If on the other hand a much less common verb like "to satiate" had a very irregular conjugation then it would regularize pretty quickly because nobody but ultra-pedants would bother to remember the exception.

I think a decent real example of that is fiancé/fiancée: those are French borrowings and have, at least originally, kept the French grammatical gender inflection. However, nowadays I often see people using either spelling in a gender-neutral way since most people don't bother to learn French grammar for this one word.

coldtea 2643 days ago

>I took it the other way around: it's easy to memorize "went" because you use it all the time.

That still wouldn't explain the why of having it like "went" vs "goed".

Sure, it's easy to memorize because we use it all the time, but why have to memorize it in the first place, versus something like "goed"?

So, this theory (which I tried to convey above) said that its being irregular ensures we don't slow down trying to derive it from regular rules, but instead have fast access to a memorized form.

Couldn't we just memorize "goed"? If it's just "frequency of use" that mattered, "went" and "goed" would work just as well.

But the extra idea is that "goed", being regular, would be too easy for us to confuse with thousands of other regular verbs, so we would not use our "fast recall" mechanism, regardless of that verb being needed all the time.

Not sure if correct - read it years ago. This seems to be related to that:

https://en.wikipedia.org/wiki/Regular_and_irregular_verbs#Li...

nbabitskiy 2644 days ago

English cognate to "atreve" is "attribute".

~ов is an iterative suffix - like ~le or ~er in gamble and chatter. It's useful to know, because you can rationalise why it is always dropped in the present tense - you can't be iterative at the moment, unless you're an Englishman)

simias 2644 days ago

>English cognate to "atreve" is "attribute".

I didn't know that, but you'll grant me that it's not a very useful cognate (either in spelling or in meaning).

>~ов is an iterative suffix - like ~le or ~er in gamble and chatter. It's useful to know, because you can rationalise why it is always dropped in the present tense - you can't be iterative at the moment, unless you're an Englishman)

Very interesting, thanks.

nbabitskiy 2639 days ago

If you're serious about Russian, feel free to ping me.

jerf 2644 days ago

The number of "bits" something takes is not an absolute value. It is relative to the encoding scheme that is being used for the bits.

Converting a large word list to a small number of bits has been a computer science hobby for a long time. Here's a pretty good search result to start working through for more details: https://duckduckgo.com/?q=building+a+small+spell+checker+suf... It was especially important to write small and fast spell checkers in the 1980s and early 1990s, when you couldn't expect to have enough RAM sitting around to simply load up a naively-encoded list of words, and the act of spell checking a few thousand words could take noticeable time.

So in an encoding scheme chosen to represent English compactly, I'm not too surprised that you can get things down quite small.
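
As a rough sketch of how a word list shrinks under a purpose-built encoding, here is front coding, one common trick for sorted word lists: each word is stored as the number of leading characters it shares with the previous word, plus the remaining suffix. This is an assumption chosen for illustration, not necessarily what the linked results or the paper use, and the names `front_encode` and `front_decode` are made up for the example.

```python
def front_encode(sorted_words):
    """Encode each word as (shared_prefix_length, remaining_suffix)."""
    encoded = []
    previous = ""
    for word in sorted_words:
        shared = 0
        limit = min(len(word), len(previous))
        # Count how many leading characters match the previous word.
        while shared < limit and word[shared] == previous[shared]:
            shared += 1
        encoded.append((shared, word[shared:]))
        previous = word
    return encoded


def front_decode(encoded):
    """Rebuild the original word list from (shared, suffix) pairs."""
    words = []
    previous = ""
    for shared, suffix in encoded:
        word = previous[:shared] + suffix
        words.append(word)
        previous = word
    return words


words = sorted(["press", "pressed", "pressing", "pression", "pressure"])
encoded = front_encode(words)
print(encoded)

raw_chars = sum(len(w) for w in words)
suffix_chars = sum(len(suffix) for _, suffix in encoded)
print(raw_chars, "characters raw vs", suffix_chars, "suffix characters")
```

Each pair still needs a small number for the shared length, so the real saving is a little less than the character counts suggest, but the effect grows with the size of the list because neighbours in a sorted dictionary share long prefixes.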

However, the question is, what relevance does that encoding scheme have to the human brain? Having just scanned through the paper, the answer is "probably not that much", which the researchers are well aware of. They explicitly present this as a lower-bound, which is a reasonable thing to do. It is obvious that the brain does not simply store 1.5MB of data in the way a computer does, in many ways.

To be honest, this amounts to an exercise in recreational mathematics more than anything else. There's nothing wrong with that, and that's not a criticism. My point is that I'm not sure it's worth trying to read the paper as anything else.

boomlinde 2643 days ago

A quick test using a lossless compressor with no understanding of phonemes, human language, or grammatical context, run on a dictionary of 370,000 English words, resulted in 24 bits per word here. It wouldn't surprise me if our ability to roughly contextualize language in terms of the language we already understand gives us a serious advantage here.
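
A test along these lines can be sketched as follows. The comment does not say which compressor or word list was used, so the choice of `lzma`, the newline-separated encoding, and the helper name `bits_per_word` are all assumptions; the result will vary with the compressor and the dictionary.

```python
import lzma


def bits_per_word(words):
    """Compressed size in bits divided by the number of words."""
    raw = "\n".join(words).encode("utf-8")
    compressed = lzma.compress(raw, preset=9)
    return 8 * len(compressed) / len(words)


# Tiny in-memory demo. A real measurement needs a full dictionary, for
# example one word per line read from a file (the path is hypothetical):
#     with open("words.txt", encoding="utf-8") as f:
#         words = f.read().split()
demo_words = ["press", "pressed", "pressing", "pression", "pressure"] * 1000
print(f"{bits_per_word(demo_words):.2f} bits per word")
```

The demo list is deliberately repetitive, so its figure says nothing about English; it only shows the measurement itself.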

Now a few questions: Can you hear a word you've never written (in a language that you're familiar with) and intuitively spell it right the first time? Can you read a word that you don't immediately understand and figure out its meaning from the context in which it is used? Can you accurately complete half of a sentence?

That a lot of people can do these things suggests to me that we all sit on something superficially similar to an efficient lossy compressor in our brain.

GuB-42 2644 days ago

I've seen somewhere that there is about 1 bit of entropy per letter of English text. The best packer (Hutter prize) compresses 100MB of Wikipedia down to 15MB, including the packer itself, a ratio of 1:6.5, which isn't that far off.

Considering that, 40k words to 400kbits is not too surprising.
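
The arithmetic behind those figures, as a small sketch. The only assumption added here is that the uncompressed text uses 8 bits per character; the other numbers are the ones quoted in the thread.

```python
# Figures quoted in the thread.
word_count = 40_000
total_bits = 400_000
bits_per_word = total_bits / word_count

# A 1:6.5 compression ratio on text stored at 8 bits per character
# (assumption) leaves this many bits per character after compression.
compression_ratio = 6.5
bits_per_char = 8 / compression_ratio

print(f"{bits_per_word:.1f} bits per word")
print(f"{bits_per_char:.2f} bits per character at 1:{compression_ratio}")
```

So the budget is 10 bits per word, and at a bit or so per letter that is roughly the cost of spelling out a word of eight to ten letters.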

whitten 2644 days ago

My only guess is that they are implicitly ignoring information we use all the time when talking about information in a computer. The number 5 stored in one place in a computer is the same as the number 5 stored elsewhere. In the human brain two words are already different because they are in different places. The researchers aren’t counting information needed to tell them apart. Admittedly, this is not the way I think either.

shanth 2644 days ago

Superposition / quantum mechanical