by bambax 929 days ago
Great project! Do you have a list of the training/fine-tuning data that went into it?

A great use would be to let people have conversations with Pascal, Leibniz, etc.

For instance, I published online the complete text of the Mémoires de Saint-Simon (written in 1745-1755, but describing the second part of the reign of Louis XIV and the Régence, 1695-1721).

Saint-Simon was described by his contemporaries as one of the great conversationalists of his time. It would be so cool to chat with him.

I published the complete dataset here: https://huggingface.co/datasets/Pclanglais/MonadGPT

While I don't think Saint-Simon is included, a French colleague ran a few tests with it, and the results turned out better than ChatGPT's.

I'm currently working on an extended historical model for French (covering 1000-2000), and maybe Saint-Simon's memoirs will be included as well.

Thanks for the info.

> the complete dataset here: https://huggingface.co/datasets/Pclanglais/MonadGPT

The transcription of classical French seems to be lacking. In particular, "s" used to be printed as the long s ("ſ"), a shape very similar to "f", but it is really an s.

For example this:

> ce qui augmentoit ſes craintesc'eſt que certe innocente Vierge ne parloit iamais d'autre choſe aux Domeſtiques que du lcge d'Orl'cans donnant à connoitre à la façon dont elle en difcouroit que fon inclination eſtoit toute aux armes

should be spelled like this:

> ce qui augmentoit ses craintes c'est que cette innocente Vierge ne parloit jamais d'autre chose aux Domestiques que du ?? d'Orléans donnant à connoître à la façon dont elle en discouroit que son inclination étoit (or estoit) toute aux armes

Maybe there should be some kind of dictionary step before fine-tuning?
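
To make the idea concrete, here is a minimal sketch of what such a dictionary step could look like. For each word that is not in a word list, it tries swapping "f" for "s" and keeps the swap only if the result is a known word. The `LEXICON` set and the function names are hypothetical, made up for illustration; a real run would need a period-appropriate French word list.

```python
import re

# Hypothetical mini-lexicon for illustration only; a real one would be
# a large word list covering historical French spellings.
LEXICON = {"discouroit", "son", "façon", "inclination", "que"}


def fix_long_s(word, lexicon):
    """Swap 'f' for 's' in an unknown word if that yields a known word."""
    if word.lower() in lexicon or "f" not in word:
        return word  # already known, or nothing to fix
    positions = [i for i, ch in enumerate(word) if ch == "f"]
    # Try every non-empty subset of 'f' positions (fine for short words).
    for mask in range(1, 2 ** len(positions)):
        chars = list(word)
        for bit, pos in enumerate(positions):
            if (mask >> bit) & 1:
                chars[pos] = "s"
        candidate = "".join(chars)
        if candidate.lower() in lexicon:
            return candidate
    return word  # no known word found, leave the OCR output untouched


def correct_text(text, lexicon):
    """Apply fix_long_s to every word, leaving punctuation and spacing as-is."""
    return re.sub(r"\w+", lambda m: fix_long_s(m.group(0), lexicon), text)


print(correct_text("difcouroit que fon inclination", LEXICON))
```

A lookup like this cannot resolve genuinely ambiguous cases: if the OCR prints "font" where the original had "sont", the word is already valid and stays as it is.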

Ah, it's completely voluntary on my part: I want to keep the historical spelling as much as possible. That's why I used the Google Books OCR, which does a better job of it than Gallica. It still gets somewhat erased in the current model (I don't think the tokenizer likes it much).

OK, "avoit" instead of "avait" is indeed a different spelling, but the "f" in the original text is not a different spelling: it is a different way of writing the same letter s (a different shape, but the same letter).
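
A rough sketch of that distinction, assuming the OCR emits the actual long s character (U+017F) rather than a misread "f": mapping it to a plain "s" changes only the letter shape and leaves historical spellings such as "avoit" alone. The function name is hypothetical.

```python
LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S, printed as "ſ"


def normalize_long_s(text: str) -> str:
    """Replace the long s glyph with a regular 's'; nothing else is touched."""
    return text.replace(LONG_S, "s")


print(normalize_long_s("choſe aux Domeſtiques"))  # letter shape normalized
print(normalize_long_s("elle avoit"))             # historical spelling kept
```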