by solarmist 1404 days ago
I think it's conflating doing anything that might help with doing the most valuable things. As a concrete example, if you looked up 25k words once each in a dictionary at 10 seconds per lookup (which is speedy for a digital or paper dictionary), it would cost you roughly 70 hours of looking things up. You'd be hard-pressed to convince me that getting very good at finding stuff in a reference is directly improving my language skills.
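
For anyone who wants to check the arithmetic, here is a quick sketch. The two inputs are the figures from the comment; the variable names are just for illustration.

```python
# Back-of-the-envelope check of the dictionary lookup cost.
WORDS_TO_LOOK_UP = 25_000   # figure quoted in the comment
SECONDS_PER_LOOKUP = 10     # figure quoted in the comment

total_seconds = WORDS_TO_LOOK_UP * SECONDS_PER_LOOKUP
total_hours = total_seconds / 3600  # 3600 seconds in an hour

print(f"{total_hours:.1f} hours")  # prints "69.4 hours"
```

So the total comes out at about 69.4 hours, i.e. on the order of 70 hours spent purely on lookups.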

The intermediate plateau is because of Zipf's law. In a 300-page book, there are ~5500 unique words and ~3000 of them occur once or twice. This isn't a big deal for native speakers because a 300-page book is about 100k words (1 day's worth of content), but for a language learner, that might take weeks or months to cover. To go further, that native speaker will probably encounter those words again in ~40 days, but it might be years before that learner re-encounters all of them (having long since forgotten them).
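
If you want to see this kind of distribution on a text of your own, a minimal sketch follows. It assumes a naive regex tokenizer with no lemmatization, so inflected forms count as separate words; `vocabulary_profile` and `rare_max` are made-up names for illustration, and the numbers you get will depend heavily on the book and the tokenizer.

```python
import re
from collections import Counter

# Naive tokenizer: runs of Unicode letters, no lemmatization (an assumption).
WORD_RE = re.compile(r"[^\W\d_]+")


def vocabulary_profile(text, rare_max=2):
    """Return (total tokens, unique words, set of words seen <= rare_max times)."""
    words = WORD_RE.findall(text.lower())
    counts = Counter(words)
    rare = {word for word, count in counts.items() if count <= rare_max}
    return len(words), len(counts), rare


if __name__ == "__main__":
    sample = "the cat sat on the mat the end the"
    total, unique, rare = vocabulary_profile(sample)
    print(total, unique, len(rare))
```

Run on a full book, the third value is the "occurs once or twice" bucket discussed above.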

Your time is best spent focusing on the sentences (30% of the book) that contain those 3000 words because they use almost all of the rest of the words.
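
A rough sketch of that selection step, so the coverage claim can be checked on any text. This is not the actual tool: the sentence splitter and tokenizer are deliberately naive, and `select_study_sentences` and `rare_max` are hypothetical names.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[^\W\d_]+")  # naive tokenizer, no lemmatization


def split_sentences(text):
    # Naive split: ., ! or ? followed by whitespace ends a sentence.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def select_study_sentences(text, rare_max=2):
    """Pick sentences containing a rare word; report how much vocabulary they cover."""
    sentences = split_sentences(text)
    tokenized = [WORD_RE.findall(s.lower()) for s in sentences]
    counts = Counter(word for words in tokenized for word in words)
    rare = {word for word, count in counts.items() if count <= rare_max}

    selected = []
    covered = set()
    for sentence, words in zip(sentences, tokenized):
        if rare.intersection(words):
            selected.append(sentence)
            covered.update(words)

    # Fraction of the whole vocabulary that appears in the selected sentences.
    coverage = len(covered) / len(counts) if counts else 0.0
    return selected, coverage
```

Calling `select_study_sentences(book_text)` returns the sentences to study plus the share of the book's vocabulary they contain, which is the number to compare against the "almost all of the rest of the words" claim.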

> Your time is best spent focusing on the sentences (30% of the book) that contain those 3000 words because they use almost all of the rest of the words.

This seems to assume: (i) that readers of a 300-page book in a foreign language (not typically a beginner task!) are choosing to do so primarily as a means to the end of learning/remembering unfamiliar words, and not because they want to understand the content of the book itself, develop their appreciation of literary phrasing, challenge themselves etc., and (ii) that focusing on a [probably disjointed] subset of the sentences in the book won't deprive the reader of the context needed to grok sentences even when the words are familiar. I'm not sure either is generally true.

Ultimately the alternative to using machine-selected sentences isolated from long form text for learning new words or fill-the-blanks exercises is using definitions and exercises specifically constructed to be accessible and relevant to language learners. The only obvious case where I can see the ML process generating more useful examples is if the language learners' needs are skewed heavily towards absorbing the sort of specialist technical/professional vocabulary conventional learning courses don't cover.

I also think that picking up common and uncommon idiomatic phrases would be at least as important as individual words (though this is definitely something an ML tool can aid).

This scales up and down freely. I chose 300 pages because that is ~100k words, or the amount a native speaker processes daily.

My process is just the opposite of (i). I want to read and understand a book, so I want something to show me where my deficiencies are. Then when I read the page, chapter, etc. it will go much more smoothly. This is also an iterative process where I'm constantly going back and forth between studying new words in a section and trying to read the section.

For (ii), every exercise or review sentence has a link directly back to the source material. I am playing with the idea of extending the context to +/- N sentences when showing an exercise as well.
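
The +/- N idea could look something like this. It is only a sketch that assumes the source has already been split into a list of sentences; `with_context` and `n` are illustrative names, not anything from the actual tool.

```python
def with_context(sentences, index, n=1):
    """Return the sentence at `index` plus up to `n` sentences on either side."""
    if not 0 <= index < len(sentences):
        raise IndexError("sentence index out of range")
    start = max(0, index - n)                  # clamp at the start of the text
    end = min(len(sentences), index + n + 1)   # clamp at the end of the text
    return sentences[start:end]
```

For example, `with_context(sentences, 10, n=2)` would give sentences 8 through 12, and the window simply shrinks near the beginning or end of the text.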

> using definitions and exercises specifically constructed to be accessible and relevant to language learners

This is the prescriptivist view of language learning and how all classes and textbooks are created. It can be useful, especially at the earliest stages of learning a language. Still, I mostly reject it because when using a language, I have very little control over the content I have to consume. I don't get to choose how an article is written or how someone speaks to me. So the sooner I address that as a language learner, the faster I will become comfortable with arbitrary content.

Phrases are great, I agree. I don't have a vision of how that could work technically so it's just in the pile of ideas I'd love to do eventually.

> For (ii), every exercise or review sentence has a link directly back to the source material. I am playing with the idea of extending the context to +/- N sentences when showing an exercise as well.

This would be a good idea, but my point is more that content in general writing (as opposed to writing specifically constructed to be self-explanatory) is inferred from structure and from callbacks to the words or tone of much earlier sections of the writing. Language learners do have to handle passages of text which aren't written with ease of comprehension in mind, but they don't have to try to fill in the blanks for "As seen in the previous chapter, x is an example of ______" without reading references to x in the previous chapter first. That's often an impossible task even for native speakers. Similarly, people are much more likely to correctly guess at the meaning of a word describing a character's emotional state (or internalise the meaning after looking it up) if they followed the narrative of the section six pages earlier which provided the context for that emotional state. Not stripping that context, or algorithmically isolating the sentences in a piece which don't require context to fully understand, is a tough challenge.

Ah, okay. I understand now.

Yes, that can happen, but in my experience it is rare. Also, because this is focused on learning the language, I can give hints/affordances like the word, or its definition, in the user's native language, so all the user needs to do is produce the word in the target language with the right conjugation.
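
As a toy illustration of an exercise with a hint (again a sketch, with `make_cloze` and its parameters as hypothetical names rather than the real implementation):

```python
import re


def make_cloze(sentence, target, hint=None):
    """Blank out the first whole-word occurrence of `target`, optionally adding a hint."""
    pattern = re.compile(rf"\b{re.escape(target)}\b", re.IGNORECASE)
    blanked, replaced = pattern.subn("_____", sentence, count=1)
    if replaced == 0:
        raise ValueError(f"{target!r} does not occur in the sentence")
    if hint:
        return f"{blanked} (hint: {hint})"
    return blanked


print(make_cloze("Der Hund bellt laut.", "bellt", hint="barks"))
# prints "Der Hund _____ laut. (hint: barks)"
```

The hint carries the meaning in the learner's native language, so the learner only has to produce the target-language form.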