by bilinualcom 2142 days ago
OP here. I found this website while looking for a way to get the updated (corrected) version of a PG (Project Gutenberg) book, along with all changes/diffs made since I scraped the book from the PG website, for my language-learning side project: https://www.bilinual.com

The bilinual project also rebuilds ebooks (modern HTML, PDF, and open ePUB formats) with better quality and readability (although that is not its primary goal). Take a look at one example here:

https://www.bilinual.com/book/18043sven/sv/en#line=68&lpp=23
https://www.bilinual.com/download/18043sven-sv-en.pdf
https://www.bilinual.com/download/18043sven-sv-en.epub
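
For what it's worth, the "all changes since I scraped it" part can be roughed out with Python's standard `difflib`, assuming you keep the text you originally scraped and can re-fetch the current text. This is only a sketch: the function name and the two sample strings are made up for illustration and are not taken from the project.

```python
import difflib


def text_diff(scraped: str, current: str) -> list[str]:
    """Return unified-diff lines between the scraped copy and the current copy."""
    return list(
        difflib.unified_diff(
            scraped.splitlines(),
            current.splitlines(),
            fromfile="scraped",
            tofile="current",
            lineterm="",  # inputs have no trailing newlines, so emit none
        )
    )


# Hypothetical sample texts standing in for two versions of a book.
old = "Chapter I\nIt was a dark and stormy nigth.\nThe end."
new = "Chapter I\nIt was a dark and stormy night.\nThe end."

diff = text_diff(old, new)
print("\n".join(diff))
```

Lines starting with `-` are from the scraped copy and lines starting with `+` are from the current one, so a correction shows up as a `-`/`+` pair.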

2 comments

I tried to pull up a couple of books on the front page to check out what you had, and

* One of them 404ed: https://www.bilinual.com/download/30117fren-fr-en.pdf

* The other was full of problems: https://www.bilinual.com/download/16210fren-fr-en.pdf

For example, many words don't have translations at all, and those that do are often incorrect. This feels like a very rough machine translation? For example:

> et c'est surtout dans les paroisses riveraines du Saint-Laurent

You translate this as

> and Ce east primarily in the · · some saint Laurence

While Google Translate gives

> and it is especially in the parishes bordering the St.Lawrence

If you're using machine translation, why not use a Google API that might give usable results at least? If that's not plausible, maybe you should try to get together a team of volunteers to manually translate these ebooks for language learners?

(I hope these suggestions are helpful, I'm not trying to be dismissive of your project.)

Hi, thanks for checking out the website.

1- 404 issue: I implemented the PDF generation recently, and I noticed that WeasyPrint has issues with HTML files that have too many tags (our books have around 2*number_of_words tags in them). This is not a big issue and will be fixed in the next iteration.

2- Using Google API: Google APIs and other translation tools are great for translating sentences. However, the problem with using parallel texts for language learning is our brain's laziness. After a few pages, the brain loses the patience to solve the translation problems (critical thinking!?) and actually learn the words and the structure of sentences. The focus immediately shifts to the translated sentences in your native language rather than the original text.

Personally, I learn a word for life when I slow down and think about similar words and its root, and only at the end look it up in a dictionary. The process is valuable.

3- Team of volunteers: It is easier said than done. The functionality is present, but I prefer to improve the suggestion engine as much as possible before I involve volunteers. Are you interested in joining?

>If you're using machine translation, why not use a Google API that might give usable results at least?

I prefer https://www.deepl.com/translator to Google.

Wow, they look awesome, thank you! (Learning Spanish here)
Looking at Unamuno's Abel Sanchez..

'hermanos' is translated 'brethren' - super-archaic.

p8, 'dedicado' in 'te has dedicado a pintar?' is translated 'hardcore'.

'sí' in 'Que sí, hombre' is translated 'do', as in do-re-mi, I guess.

p5, 11-12 have 'quieres'/'quiero' repeatedly translated as "with friends like those who needs enemies", which is just inexplicable. I can't imagine how that would happen.

Corrupted dictionary?

..and most of the trickiest words on a page aren't translated, maybe because they're not in your dictionary or because they have 'lo' or 'se' appended.
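
If the appended 'lo'/'se' really is the cause (that's a guess from the outside), one common workaround is to strip enclitic pronouns and retry the lookup. A minimal sketch, assuming a plain word-to-gloss dictionary; `DICTIONARY`, `CLITICS`, and `lookup` are hypothetical names, not the project's actual code:

```python
import unicodedata

# Hypothetical toy dictionary; a real one would be far larger.
DICTIONARY = {"dedicar": "to devote", "decir": "to say", "sí": "yes"}

# Common Spanish enclitic pronouns and combinations, longest first so that
# "selo" is tried before "lo".
CLITICS = [
    "selos", "selas",
    "selo", "sela", "melo", "mela", "telo", "tela",
    "nos", "los", "las", "les",
    "me", "te", "se", "lo", "la", "le",
]


def drop_acute(word: str) -> str:
    """Remove acute accents only (keeps ñ), e.g. 'decír' -> 'decir'."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace("\u0301", ""))


def lookup(word: str):
    """Look a word up directly, then retry with enclitic pronouns stripped."""
    word = unicodedata.normalize("NFC", word.lower())
    if word in DICTIONARY:
        return DICTIONARY[word]
    for clitic in CLITICS:
        if word.endswith(clitic) and len(word) > len(clitic) + 1:
            stem = word[: -len(clitic)]
            # Attaching clitics can add a written accent, so also try without it.
            for candidate in (stem, drop_acute(stem)):
                if candidate in DICTIONARY:
                    return DICTIONARY[candidate]
    return None


print(lookup("dedicarse"), "|", lookup("decírselo"), "|", lookup("hola"))
```

This only handles forms whose stem is itself a dictionary entry; conjugated stems would still need a lemmatizer.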

Thanks, we are working to improve the quality of both our dictionaries and ML engine. It is very hard to say why the translation picked these, and I have to look into each of them individually to answer your questions. The translations are not perfect, but it is a live project and I am trying to improve it every hour that I find.

"I can't imagine how that would happen." : Just as a hint, click on "translations" here:

https://en.wiktionary.org/wiki/with_friends_like_these_who_n...

Yeah, something is going very wrong. As if they were trying to tokenize html/pdf, pulling in a lot of extraneous characters/bytes, and using some sort of homebrew ML project to translate it. I don't know how else you'd get such bizarre results.
Yes, not to mention that quieres and quiero should be near the top of the Words So Common That No Translation Is Needed list.
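
A "no translation needed" list like that is cheap to bolt on in front of whatever picks the glosses. A rough sketch, where the word set and the function name are placeholders for illustration rather than anything the site actually uses:

```python
# Hypothetical skip list; a real one would come from a frequency table.
COMMON_WORDS = {"quiero", "quieres", "que", "sí", "no", "el", "la", "de"}


def words_to_gloss(tokens: list[str]) -> list[str]:
    """Keep only the tokens that are not on the common-word skip list."""
    return [t for t in tokens if t.lower() not in COMMON_WORDS]


sample = ["Quiero", "pintar", "que", "retratos"]
print(words_to_gloss(sample))
```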