by bambax 929 days ago
Thanks for the info.

> the completely dataset here: https://huggingface.co/datasets/Pclanglais/MonadGPT

The transcription of classical French seems to be lacking. In particular, the long s ("ſ") used to be printed in a shape very similar to "f", but it is really an s, and the OCR sometimes renders it as "f".

For example this:

> ce qui augmentoit ſes craintesc'eſt que certe innocente Vierge ne parloit iamais d'autre choſe aux Domeſtiques que du lcge d'Orl'cans donnant à connoitre à la façon dont elle en difcouroit que fon inclination eſtoit toute aux armes

should be spelled like this:

> ce qui augmentoit ses craintes c'est que cette innocente Vierge ne parloit jamais d'autre chose aux Domestiques que du ?? d'Orléans donnant à connoître à la façon dont elle en discouroit que son inclination étoit (or estoit) toute aux armes

Maybe there should be some kind of dictionary step before fine-tuning?
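A minimal sketch of what such a dictionary step could look like, assuming a period word list is available. The `lexicon` set and the helper names are hypothetical and not part of the MonadGPT pipeline. The idea is to map "ſ" to "s" everywhere, then for any word still missing from the lexicon, try swapping "f" for "s" and accept the fix only when exactly one variant is a known word:

```python
import re
from itertools import product

LONG_S = "\u017f"  # the long s, "ſ"


def f_to_s_variants(word):
    """Yield every variant of word with some subset of its 'f's replaced by 's'."""
    positions = [i for i, ch in enumerate(word) if ch == "f"]
    for choice in product("fs", repeat=len(positions)):
        chars = list(word)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        yield "".join(chars)


def repair_word(word, lexicon):
    """Normalize the long s, then undo f-for-s OCR confusions via the lexicon."""
    word = word.replace(LONG_S, "s")
    if word.lower() in lexicon:
        return word  # already a known word, e.g. "façon"
    hits = {v for v in f_to_s_variants(word) if v.lower() in lexicon}
    # Only rewrite when the lexicon gives exactly one answer; otherwise leave it.
    return hits.pop() if len(hits) == 1 else word


def repair_text(text, lexicon):
    """Apply repair_word to every run of word characters in text."""
    return re.sub(r"\w+", lambda m: repair_word(m.group(0), lexicon), text)


# Hypothetical toy lexicon; a real one would come from a period dictionary.
lexicon = {"la", "façon", "dont", "elle", "en", "discouroit",
           "que", "son", "inclination", "estoit"}
sample = "la façon dont elle en difcouroit que fon inclination eſtoit"
print(repair_text(sample, lexicon))
```

This keeps historical spellings such as "estoit" and only touches the letter shape; words with no unambiguous match are left as they are.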

1 comment

Ah, it's completely deliberate on my part: I want to keep the historical spelling as much as possible. That's why I used the Google Books OCR, which does a better job of it than Gallica. That spelling is still somewhat erased in the current model (I don't think the tokenizer likes it much).
Ok -- "avoit" instead of "avait" is indeed a different spelling -- but the "f" in the original text is not a different spelling: it's a different way of writing the same letter s (a different shape, but the same letter).