by tenaf0 894 days ago
I have been working on a similar project on and off in my spare time. The only remotely interesting feature that other similar software may not have is that it actually tries to parse and analyze sentences (with an NLP lib). It's made specifically for German, and the reason I wanted to make it is that no existing software managed to handle separable verbs properly. For example, learning "Wir fangen jetzt an." is just wrong if you learn 'fangen' and 'an' separately; dictionary-wise, you actually care about 'anfangen'.

It unfortunately does have false positives (I believe a complete solution would require LLMs rather than the much less complicated NLP algorithms; I just don't want to send whole books to ChatGPT, as that would quickly become expensive), but I found it usable, so I made it public now: https://github.com/tenaf0/lwt

I don't want to "advertise" it any further, as the NLP lib is run by academia as a free service and I don't want to overburden it (I have been planning on hosting it myself, but haven't gotten there yet).

3 comments

You have my full support for your project, as I think natural language processing is a very exciting and underutilised technology for language learning. But if you want a low-tech solution, I've found Wiktionary to be ideal. Wiktionary has all the declensions and prefixes for German verbs; to use your example:

https://en.wiktionary.org/wiki/f%C3%A4ngt_an

tells you what the word is, and gives a link back to:

https://en.wiktionary.org/wiki/anfangen#German

I chose to add Wiktionary to Kiwix Android (8 GB download) for offline use. In addition, I can search by right-clicking or tap+holding on a word. All that information is available because of the (mostly manual) work done by Wiktionary contributors, and it reaches a very high standard. The usage notes in Wiktionary usually contain more digression and explanation than, say, the Collins German-English dictionary, which is a rather good thing for language learners.

FWIW, English Wiktionary appears to have fewer words than German Wiktionary. I've run into this trying to extract words from eBooks (then converting to the "base" form, to essentially de-duplicate). I think it's mostly compound or more niche words, but I imagine you'd still run into them at least occasionally with most written works.

There's a nice project for converting and extracting the data from English Wiktionary into JSON but it doesn't support any other languages, AFAIK, which is a bit of a shame but also not very surprising - Wiktionary is a lot more complex, technically, than I expected!

Interesting to hear that - I'm still at the level of German where I wouldn't know what I'm missing. For clarification: are you saying that:

- the English Wiktionary has fewer English words than the German Wiktionary has German words, or

- the English Wiktionary has fewer German words than the German Wiktionary does?

The latter. I'm very definitely not at that level either, but looking at German words from books that couldn't be found on English Wiktionary, I was able to find them on German Wiktionary. One example would be "Weihnachtsfest" - not sure it's "officially" a compound word, though if you know "Weihnacht" and "Fest", then the meaning should be clear. In any case, it shows up as a single word and trying to "split" words made up of other words is an exercise in insanity.

Another example is "krächzender", which might also serve to give some idea of the particular pains in processing German text. It's not in English Wiktionary, but krächzen is, and is a verb. So "krächzender" is the adjectival form of the verb, and if you know "krächzen" and the general rules around adjective formation it would probably be obvious. But would you rely on a computer to parse those rules, or would you want a table with all the declensions laid out? And if you're building a vocab list for a book, is it a separate entry in the list, or does it fall under the verb?
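To make the "rules vs. table" question concrete, here is a minimal sketch of the rule-based route for this one pattern. It assumes you already have a set of known infinitives (the `known_verbs` argument, which is hypothetical, e.g. extracted from a dictionary dump), and it only handles present participles used as adjectives (infinitive + "d" + adjective ending). It is not how any particular tool does it, and it will miss or mis-handle plenty of other forms.

```python
# Adjective endings to try, longest first; "" covers the bare participle.
ADJECTIVE_ENDINGS = ("em", "en", "er", "es", "e", "")


def participle_to_verb(word, known_verbs):
    """Map a form like 'krächzender' back to 'krächzen', or return None.

    known_verbs is an assumed set of lowercase infinitives.
    """
    lowered = word.lower()
    for ending in ADJECTIVE_ENDINGS:
        if not lowered.endswith(ending):
            continue
        # Remove the adjective ending (nothing to remove for "").
        stem = lowered[: len(lowered) - len(ending)]
        # A present participle is the infinitive plus "d".
        if stem.endswith("d") and stem[:-1] in known_verbs:
            return stem[:-1]
    return None


verbs = {"krächzen"}
print(participle_to_verb("krächzender", verbs))
print(participle_to_verb("Weihnachtsfest", verbs))
```

Checking each candidate against the known-verb set is what keeps the rule from firing on arbitrary words that happen to end in "-der" or "-de"; whether the result gets its own vocab entry or is filed under the verb is still a choice you have to make.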

Obviously, German Wiktionary only has definitions & explanations in German so it's not great for beginners, but any tool that's trying to automatically do stuff with German text would likely benefit from using German Wiktionary.

I have no idea if it's true for other languages, but I wouldn't be surprised if it's also true for other major languages spoken by Wikipedia users (e.g., French, Spanish, but maybe not Chinese).

Interesting! I have a partially built, related tool to extract "words" from e-books, so I could build flashcard lists and make sure I knew the majority of words that were used. Most of them would be common words, but every book has a decently sized selection of specialised vocabulary. I did think about trying to get something fancy done with an LLM or an NLP library for figuring out the separable verbs, but in the end I took a very... brute-force approach: basically grabbing the final word in the "phrase", then prepending that to every other word in the phrase one by one and asking "is this a known separable verb?" I'm not sure how well it worked, but that's a different story.
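For anyone curious, the brute-force idea described above might look roughly like this. It is a sketch, not the actual tool: the function name, the naive tokenisation, and the `known_forms` lookup (a hypothetical dict from joined forms to infinitives) are all assumptions for illustration.

```python
def find_separable_verb(phrase, known_forms):
    """Return the infinitive of a separable verb found in phrase, or None.

    known_forms is an assumed dict mapping joined forms (prefix + verb
    form, lowercase) to their infinitive, e.g. {"anfängt": "anfangen"}.
    """
    # Naive tokenisation: split on whitespace, strip punctuation, lowercase.
    words = [w.strip(".,;:!?\"'()").lower() for w in phrase.split()]
    words = [w for w in words if w]
    if len(words) < 2:
        return None

    # Treat the final word of the phrase as the candidate prefix.
    prefix = words[-1]
    for word in words[:-1]:
        candidate = prefix + word
        if candidate in known_forms:
            return known_forms[candidate]
    return None


forms = {"anfangen": "anfangen", "anfängt": "anfangen"}
print(find_separable_verb("Wir fangen jetzt an.", forms))   # anfangen
print(find_separable_verb("Wir gehen nach Hause.", forms))  # None
```

Note the lookup has to contain conjugated forms too: "Wir fangen ... an" happens to join into the infinitive, but "Er fängt ... an" joins into "anfängt", so a plain list of infinitives would miss it.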
You could potentially use an NLP library like spaCy, or even bundle it with a free fine-tuned LLM like Mistral 7B.

Some fine-tuned Mistral models are reported to outperform GPT-4 on the specific tasks they were tuned for.