Hacker News
by GolDDranks 912 days ago
I'm attempting to create a frequency list of words for language learners. (In Japanese.)

Commonly, these lists are based on just which words appear in the text at the "surface" level. However, words commonly have multiple "senses" or nuances of meaning in which they are used. Dictionaries list these senses, but it has traditionally been hard to disambiguate which sense a word is used in, given a usage in text.

LLMs make this feasible, so I'm attempting to create a word sense/usage frequency list.

4 comments

Consider using fastText's word vectors. They have a bunch of languages that come pre-sorted by frequency and are sufficient for basic word sense. Perhaps use an LLM to automate some of the disambiguation.

https://fasttext.cc/docs/en/crawl-vectors.html

https://news.ycombinator.com/item?id=13771292 (6 years ago)

Aligning the fastText vectors of 78 languages

https://github.com/babylonhealth/fastText_multilingual/blob/...
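
As a rough illustration of the "pre-sorted by frequency" point, here is a minimal sketch that reads the first k words from a fastText text-format `.vec` file. It assumes the layout described on the fastText site: a header line with the vocabulary size and vector dimension, then one word per line followed by its vector, most frequent first. The function name `top_words` is made up for this example and is not part of fastText.

```python
from itertools import islice


def top_words(vec_path, k=1000):
    """Return the first k words of a fastText text-format .vec file.

    Assumes a "<vocab_size> <dim>" header line, and entries ordered by
    descending frequency (an assumption taken from the fastText docs).
    """
    with open(vec_path, encoding="utf-8", errors="replace") as f:
        next(f, None)  # skip the header line (no error on an empty file)
        # The word is everything before the first space on each line.
        return [line.rstrip("\n").split(" ", 1)[0] for line in islice(f, k)]
```

Called on a decompressed vectors file, `top_words(path, k=5000)` would give a surface-level frequency list to start from; it says nothing yet about which sense each word is used in.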

Thanks, I'll look into these.
That’s a great idea. I hope it can be done for other languages, too.

I used to help prepare study materials for Japanese learners of English. The other editors and I would try to adjust the vocabulary to keep it at an appropriate level for the target learners. Word-frequency lists provided some guidance, but they showed only how often words appeared in the surveyed texts, not the meanings in which they were used. The word “medium,” for example, might have a fairly high frequency, but could we expect the learners to know the meanings “a substance through which a force travels” or “someone who claims to have the power to receive messages from dead people”?

A similar problem was with multiword idioms. The verb “make” is one of the most common words in English, but how common are “make it,” “make do,” “make up,” “make away with,” or “make out”? Ten years ago, I was unable to find any reliable answers. We had to rely on our gut feelings.

Good luck with your project. LLMs should be a big help.

Thank you! Yep, multi-word idioms are tough. How do you quantify whether a phrase is just a "sum" of its words, or whether there is some additional meaning, some "idiomness", to it? I haven't thought a lot about that yet, but it's a problem that I need to solve for this.
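
One common starting point for that question (not something proposed in this thread) is pointwise mutual information, which scores how much more often two words appear together than their individual frequencies would predict. It measures association strength, not idiomaticity as such: a fixed but fully transparent phrase can score just as high as a true idiom. A minimal sketch, assuming the text is already tokenized; `bigram_pmi` is a name invented for this example.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent pair in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi          # observed probability of the pair
        p_w1 = unigrams[w1] / n_uni    # probability of each word alone
        p_w2 = unigrams[w2] / n_uni
        # Positive PMI: the pair co-occurs more often than chance predicts.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores
```

On a real corpus, low-count pairs get inflated scores, so a minimum frequency cutoff is usually applied before ranking.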
If you’d like to discuss these issues, feel free to get in touch. My website URL is on my profile page. I’m not a programmer or expert on natural language processing, but I have worked on over a dozen Japanese-English and English-Japanese dictionaries and enjoy thinking about such problems.
Can you talk a little more about the process? I’m guessing you’re not just prompting GPT to list the most common words.

Are you asking the LLM to annotate text and then counting the number of annotations?

How do you make sure that each disambiguation has a stable label throughout?

Basically, I have a big corpus of text (novels, as I'm interested in getting the learners to read) and a dictionary. I annotate the words using the dictionary, then give the text context, the target word, and the possible dictionary definitions as input to an LLM, and let it select or score which definitions could be considered to "apply" given the context. Finally, I tally the counts.

The disambiguated senses are provided by the dictionary. Does that answer your question?
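
The steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the poster's actual code: `tally_senses` and `overlap_chooser` are hypothetical names, and the LLM call is replaced by a trivial word-overlap stand-in so the example runs on its own. The label for each sense is simply (word, index of the definition in the dictionary), which is what keeps the labels stable across the corpus.

```python
from collections import Counter


def tally_senses(occurrences, dictionary, choose_senses):
    """Count how often each dictionary sense is selected.

    occurrences: iterable of (context, word) pairs taken from the corpus.
    dictionary: maps a word to its list of sense definitions.
    choose_senses: callable (context, word, definitions) -> indices of the
        definitions that apply. In practice this would wrap an LLM call.
    """
    counts = Counter()
    for context, word in occurrences:
        definitions = dictionary.get(word)
        if not definitions:
            continue  # word is not in the dictionary, nothing to count
        for i in choose_senses(context, word, definitions):
            counts[(word, i)] += 1
    return counts


def overlap_chooser(context, word, definitions):
    """Stand-in for the LLM: pick definitions sharing a word with the context."""
    context_words = set(context.lower().split())
    return [i for i, d in enumerate(definitions)
            if context_words & set(d.lower().split())]
```

Swapping `overlap_chooser` for a function that prompts a model and parses its answer into definition indices leaves the counting logic unchanged.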

How about the highest frequency phrases and variations?

As a language learner, I’ve found high-frequency word lists not that useful. A word is too atomic a unit, devoid of context. Memorizing word lists doesn’t lead to speaking a language, but learning phrases often does. Even better is to learn phrases within a context, like a restaurant or a lecture.

LLMs might actually add value. Word frequencies are simply statistical counts, but finding common phrases is a more complicated problem, and the LLM’s structure (attention) might actually be the solution.

(I actually ask this of ChatGPT 4 today. I ask it to tell me the highest-value phrases I should learn if I’m in a restaurant. I also ask it to break down phrases for me, and give me a lesson on conjugations etc.)

Ah, yeah, totally! The whole point of this exercise is to ascend from the level of "words" to the level of "units of meaning". These commonly consist not of single words but of phrases.

Also, you are absolutely correct that learning "atomic units" in isolation is not good practice. What I'm thinking here is to get some tools to collect the data for the "what". The "how" of the learning needs to happen in context.