|
I've studied a few languages and have long wished for / thought of making software roughly along the lines of what you're suggesting here. While tackling a text, looking up words one-at-a-time can be painfully inefficient and break the flow. Digital dictionaries are quicker than paper ones, but don't resolve the fundamental problem.

Nothing as advanced as natural language processing crossed my mind, but I did imagine software that would take a target text and generate a vocabulary list that the user could then study efficiently, allowing the user to then read the text fluidly without need for looking up words. There is a long tradition of human-produced "readers" that do this, but software would allow generating this for any text the user wanted and with much greater flexibility and learning options (e.g. auto-generated Anki flashcard decks). I looked into LingQ (mentioned in another comment here), which promises just this idea, but found that it failed pretty badly in execution, to the point that a paper dictionary and notebook just worked better.

I know you're seeking a cofounder, but here are some suggestions I hope you find useful:

- I think it's essential to use a top-quality dictionary for something like this. I've found Oxford to produce excellent bilingual dictionaries in the languages I've studied. Of course these are copyright-protected and so would require paid purchases or licensing of some kind to use. I think there's a tendency to use generic free-license dictionaries for software like this, and these really aren't good enough.
- There's also a tendency to bake into the software the false assumption that "a word is a word", when of course words can have many different meanings, words in different languages do not have one-to-one correspondence, and for most languages, learning a word requires also learning extra information such as gender, conjugation, declension, pronunciation/stress, etc. This goes hand-in-hand with using a top-quality dictionary, which lists multiple word meanings, phrases, and extra linguistic information that poorer-quality dictionaries omit or get wrong.
- I would suggest incorporating human-produced translations in the user's native language of the texts in the user's target language, at least as an option that the user could upload. Even the best machine translation software can miss quite a lot. |
Yeah, my first idea was automatic Anki cards, but once you try to use sentences and track words, it's intractable for Anki. It's a complex graph problem even before you get to the level of word vs. form disambiguation.
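To make the graph point concrete, here is a minimal sketch of my own reading of it, not the actual implementation. The sentences, the whitespace-style word lists, and the names `build_index` and `affected_by_review` are all made up for illustration. The idea it shows: sentences and words form a two-sided graph, so reviewing one sentence touches words that also live in other sentences, which independent flashcards can't express.

```python
from collections import defaultdict

# Toy corpus (hypothetical): sentence id -> the words it contains.
sentences = {
    "s1": ["the", "cat", "sleeps"],
    "s2": ["the", "dog", "sleeps"],
    "s3": ["a", "cat", "runs"],
}


def build_index(sentences):
    """Invert the corpus: word -> set of sentence ids containing it."""
    word_to_sentences = defaultdict(set)
    for sid, words in sentences.items():
        for word in words:
            word_to_sentences[word].add(sid)
    return word_to_sentences


index = build_index(sentences)


def affected_by_review(sid):
    """Other sentences that share at least one word with `sid`."""
    related = set()
    for word in sentences[sid]:
        related |= index[word]
    related.discard(sid)
    return related


print(sorted(affected_by_review("s1")))
print(sorted(affected_by_review("s2")))
```

Even in this three-sentence toy, reviewing `s1` changes what you know about both other sentences, and that's before splitting words into lemmas, forms, and characters.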
For dictionaries, my initial sources are Oxford bilingual dictionaries and a few others.
Re: words. Yup. This was one of the top technical challenges I needed to solve for this project to be possible. I separate each of the possible meanings and map them to a lemma or character that uniquely represents each.
For example, 見ました [Japanese: polite past tense of "to see"] maps to several things simultaneously:
みる - the lemma/dictionary form of the word
見る - the kanji representation of the lemma
見 - the kanji in the word
みました - the kana (non-kanji) spelling of the conjugated form
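As a rough sketch of that one-to-many mapping (an illustration only, not the real data model): the `Analysis` class, its field names, and the `is_kanji` helper are hypothetical, and the analysis is written by hand here where a real system would get it from a morphological analyzer.

```python
from __future__ import annotations

from dataclasses import dataclass


def is_kanji(ch: str) -> bool:
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


@dataclass(frozen=True)
class Analysis:
    surface: str       # the form as it appears in the text
    lemma_kana: str    # dictionary form written in kana
    lemma_kanji: str   # dictionary form written with kanji
    surface_kana: str  # kana spelling of the conjugated form

    def keys(self) -> list[tuple[str, str]]:
        """Every tracked item that one occurrence of this form counts toward."""
        items = [
            ("lemma", self.lemma_kana),
            ("lemma_kanji", self.lemma_kanji),
        ]
        # One entry per kanji character found in the surface form.
        items += [("kanji", ch) for ch in self.surface if is_kanji(ch)]
        items.append(("reading", self.surface_kana))
        return items


# Hand-written analysis of the example above (assumed, not analyzer output).
saw = Analysis(
    surface="見ました",
    lemma_kana="みる",
    lemma_kanji="見る",
    surface_kana="みました",
)

for kind, value in saw.keys():
    print(kind, value)
```

Each `(kind, value)` pair is a separate thing to track, so seeing 見ました once updates four records instead of one "word".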
Parallel texts are a feature that I hope to implement eventually, but aligning them isn't trivial: even alignment (much simpler than machine translation) ends up being a many-to-many mapping.
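One way to picture the many-to-many part, as a sketch under my own assumptions (the `AlignedGroup` class, the `targets_for` helper, and the sample alignment are invented for illustration, and this only stores an alignment, it doesn't compute one): a translator may merge two sentences into one or split one into two, so each alignment unit has to pair a group of source sentences with a group of target sentences.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedGroup:
    """A group of source sentences paired with a group of target sentences."""
    source: tuple[int, ...]  # sentence indices in the original text
    target: tuple[int, ...]  # sentence indices in the translation


# Hypothetical alignment: the translator merged source sentences 0 and 1
# into one target sentence, then split source sentence 2 into two.
alignment = [
    AlignedGroup(source=(0, 1), target=(0,)),
    AlignedGroup(source=(2,), target=(1, 2)),
]


def targets_for(groups: list[AlignedGroup], source_index: int) -> list[int]:
    """Target sentence indices to show alongside one source sentence."""
    for group in groups:
        if source_index in group.source:
            return list(group.target)
    return []  # sentence has no counterpart in the translation


print(targets_for(alignment, 1))
print(targets_for(alignment, 2))
```

A plain list of sentence pairs can't represent either of those two groups, which is why a simple line-by-line side-by-side view breaks down quickly.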