
by jlavine 1401 days ago

I've studied a few languages and have long wished for / thought of making software roughly along the lines of what you're suggesting here. While tackling a text, looking up words one-at-a-time can be painfully inefficient and break the flow. Digital dictionaries are quicker than paper ones, but don't resolve the fundamental problem. Nothing as advanced as natural language processing crossed my mind, but I did imagine software that would take a target text and generate a vocabulary list that the user could then study efficiently, allowing the user to then read the text fluidly without need for looking up words. There is a long tradition of human-produced "readers" that do this, but software would allow generating this for any text the user wanted and with much greater flexibility and learning options (e.g. auto-generated Anki flashcard decks).
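As a rough illustration of the vocabulary-list idea (not code from anyone in this thread; `vocabulary_list` and `known_words` are hypothetical names), a minimal sketch could count the words in a text that the reader has not yet marked as known. It assumes whitespace-delimited text and does no lemmatization, so it would not work as-is for languages like Japanese or Chinese.

```python
import re
from collections import Counter


def vocabulary_list(text, known_words, top_n=None):
    """Count the words in `text` the reader does not know yet, most frequent first."""
    # Naive tokenization: runs of Unicode letters, lowercased.
    # Assumption: no lemmatization, so "see" and "saw" count as different words.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(t for t in tokens if t not in known_words)
    return counts.most_common(top_n)


text = "El gato ve el perro. El perro ve el gato negro."
study_list = vocabulary_list(text, known_words={"el"})
```

Each `(word, count)` pair in `study_list` would still need a dictionary lookup before it could become a flashcard.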

I looked into LingQ (mentioned in another comment here), which promises just this idea, but found that it failed pretty badly in execution, to the point that a paper dictionary and notebook just worked better.

I know you're seeking a cofounder, but here are some suggestions I hope you find useful:

- I think it's essential to use a top-quality dictionary for something like this. I've found Oxford to produce excellent bilingual dictionaries in the languages I've studied. Of course these are copyright-protected and so would require paid purchases or licensing of some kind to use. I think there's a tendency to use generic free-license dictionaries for software like this, and these really aren't good enough.

- There's also a tendency to bake into the software the false assumption that "a word is a word", when of course words can have many different meanings, words in different languages do not have one-to-one correspondence, and for most languages, learning a word requires also learning extra information such as gender, conjugation, declension, pronunciation/stress, etc. This goes hand-in-hand with using a top-quality dictionary, which lists multiple word meanings, phrases, and extra linguistic information that poorer-quality dictionaries omit or get wrong.

- I would suggest incorporating human-produced translations of the target-language texts into the user's native language, at least as an option that the user could upload. Even the best machine translation software can miss quite a lot.


solarmist 1401 days ago

Thank you for the thoughtful response. I completely agree on all points.

Yeah, my first idea was automatic Anki cards, but once you try to use sentences and track words, it's intractable for Anki. It's a complex graph problem even before you get to the level of word vs. form disambiguation.

For dictionaries, my initial source is Oxford bilingual dictionaries and a few others.

Re: words. Yup. This was one of the top technical challenges I needed to solve for this project to be possible. I separate each of the possible meanings and map them to a lemma or character that uniquely represents each.

For example, 見ました [Japanese: "saw", the polite past tense of "to see"] maps to several things simultaneously:

みる - the lemma/dictionary form of the word

見る - the kanji representation of the lemma

見 - the kanji in the word

みました - the non-kanji conjugation of the word
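To make the one-token-to-many-items mapping concrete, here is a minimal sketch. It is not Parsnip's actual data model: `TokenAnalysis`, `study_items`, and `kanji_in` are hypothetical names, the analysis is entered by hand rather than produced by a morphological analyzer, and kanji detection only checks the basic CJK Unified Ideographs block.

```python
from dataclasses import dataclass


def kanji_in(word: str) -> tuple:
    """Return the kanji in `word` (basic CJK Unified Ideographs block only)."""
    return tuple(ch for ch in word if "\u4e00" <= ch <= "\u9fff")


@dataclass(frozen=True)
class TokenAnalysis:
    surface: str          # form as it appears in the text, e.g. 見ました
    lemma_reading: str    # dictionary form in kana, e.g. みる
    lemma_kanji: str      # dictionary form with kanji, e.g. 見る
    surface_reading: str  # conjugated form in kana, e.g. みました

    def study_items(self) -> dict:
        """Everything this single token counts as evidence for."""
        return {
            "lemma_reading": {self.lemma_reading},
            "lemma_kanji": {self.lemma_kanji},
            "kanji": set(kanji_in(self.surface)),
            "conjugation": {self.surface_reading},
        }


# Hand-entered analysis; a real pipeline would get this from an analyzer.
mimashita = TokenAnalysis("見ました", "みる", "見る", "みました")
items = mimashita.study_items()
```

Tracking progress per item rather than per token is what turns this into a graph: one sentence touches many items, and each item is shared by many sentences.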

Parallel texts are a feature that I hope to implement eventually, but aligning them isn't trivial: even that (much simpler than machine translation) ends up being a many-to-many mapping.


jlavine 1401 days ago

I can see the difficulty of integrating with Anki if you're trying to implement a sophisticated system for tracking a user's progress. However, I would definitely want some way of studying vocabulary with spaced-repetition flashcards that I could have on my phone, Anki or otherwise. I've personally found traditional flashcards and (even more so) writing words with pen on paper to be the most effective ways of absorbing vocabulary. I've found that I don't ultimately absorb vocab as well using other self-quiz methods like multiple-choice / fill-in-the-blank / matching / etc. in Duolingo and similar apps (though I've had the false impression of learning from these methods). Just my personal experience.

It seems we're on the same page regarding specific word-meanings, comprehensive linguistic information provided along with a word, and using good dictionaries.

I would suggest including at least the option of viewing the full conjugation / declension / etc. associated with a word, and not just that used in the context of the text. I don't know any Japanese (though I recognize the example character you provided from the little Chinese I've studied), but something like this full conjugation of that verb is roughly what I'd want to see for any conjugated/declined language: http://www.japaneseverbconjugator.com/VerbDetails.asp?txtVer....

I understand parsing and matching up parallel texts is difficult, but I thought NLP would help with that, and if the software fails at matching up individual words, you could default to matching up clauses or sentences, which the user could study (with flashcards or something similar) alongside the individual words taken from the dictionary.
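The fallback idea can be sketched in a few lines. This is only an illustration under strong assumptions, not a real aligner: it applies the same "drop to a coarser unit when matching fails" logic one level up, pairing sentences by position and falling back to a single passage-level pair when the sentence counts differ. Word-level alignment would need a trained model, and `split_sentences` and `align` are hypothetical names.

```python
import re


def split_sentences(text):
    """Split on whitespace that follows ., ! or ? (naive; ignores abbreviations)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def align(source, target):
    """Pair sentences by position, or fall back to one passage-level pair."""
    src, tgt = split_sentences(source), split_sentences(target)
    if len(src) == len(tgt):
        return list(zip(src, tgt))
    # Sentence counts differ, so positional pairing would be wrong;
    # keep the passage whole instead.
    return [(source.strip(), target.strip())]


pairs = align("Je vois le chat. Il dort.", "I see the cat. It is sleeping.")
```

Positional pairing breaks as soon as a translator merges or splits sentences, which is exactly the many-to-many problem mentioned earlier in the thread.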

I took a look at the app, but I see the currently-available demo is only English-Japanese, and I'm also consistently getting "API Error: Request failed with status code 412" in multiple browsers.

Exactly what parts of the app are you looking for someone else to work on?


solarmist 1401 days ago

The current demo version of the app uses heavily cached 1 GB data files at the limits of the instance's memory, so, unfortunately, it crashes a lot.

I've been waiting to update until I had the UI for exercises implemented. They will be something like flashcards, including SRS for scheduling.
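For readers unfamiliar with SRS scheduling, a minimal sketch follows. It is neither Anki's algorithm nor the scheduler planned for this app; it assumes the simplest possible rule (double the interval after a correct answer, reset to one day after a miss), and `Card` and `review` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Card:
    item: str               # what is being studied, e.g. a lemma
    due: date               # next day the card should be shown
    interval_days: int = 1  # current gap between reviews


def review(card: Card, correct: bool, today: date) -> Card:
    """Reschedule `card` after a review on `today`."""
    if correct:
        card.interval_days *= 2  # remembered: wait twice as long next time
    else:
        card.interval_days = 1   # forgotten: start over tomorrow
    card.due = today + timedelta(days=card.interval_days)
    return card


card = Card("見る", due=date(2024, 1, 1))
review(card, correct=True, today=date(2024, 1, 1))
```

Production schedulers also track a per-card ease factor and handle late reviews, which this sketch leaves out.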

Well, two things mentioned in your reply still need to be implemented:

* Consuming and parsing the Oxford Dictionary's API and transforming the results into Parsnip's data models.

* Building statistics pages showing a user's progress/knowledge. Instead of an information firehose, I'm leaning towards only showing the conjugations that appear in the user's library. That way, the user can immediately see usage examples for those forms.
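The "only conjugations in the user's library" idea could look roughly like this. It is a sketch under assumptions, not the app's implementation: `attested_forms` is a hypothetical name, the paradigm is entered by hand, and forms are found by plain substring search, which would over-match in real text.

```python
def attested_forms(paradigm, library_sentences):
    """Map each form that occurs in the library to the sentences containing it."""
    examples = {}
    for form in paradigm:
        # Naive substring match; a real system would match analyzed tokens.
        hits = [s for s in library_sentences if form in s]
        if hits:
            examples[form] = hits
    return examples


paradigm = ["見る", "見ました", "見ない", "見て"]  # hand-entered forms of 見る
library = ["昨日映画を見ました。", "テレビを見ない。"]
shown = attested_forms(paradigm, library)
```

Forms with no example in the library, such as 見て here, are simply left out of the page.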
