Hacker News
by pilooch 336 days ago
Some colleagues and I implemented exactly this six months ago for a French government agency.

It's open source and available here: https://github.com/jolibrain/colette

It's not our primary business, so it's just sitting there and we don't advertise it much, but it works, more or less, and needs some tweaks to get really efficient.

The true genius, though, is that the whole thing can be made fully differentiable, unlocking the ability to fine-tune the visual RAG on targeted datasets.
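
To make the "fully differentiable" point concrete, here is a toy sketch of why it matters: if retrieval is a softmax over query-document similarity scores, a loss on the retrieved document can be backpropagated into the query encoder. This is not Colette's code or API. The names (`retrieval_probs`, `train_step`), the 2-d embeddings, and the hand-written gradient are all assumptions made for illustration, and a real pipeline would use an autodiff framework and image embeddings.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def retrieval_probs(W, query, docs):
    # Project the query with the trainable matrix W, then score each
    # document by dot product and normalize into a soft retrieval distribution.
    proj = [sum(W[a][b] * query[b] for b in range(len(query)))
            for a in range(len(W))]
    scores = [sum(d[a] * proj[a] for a in range(len(proj))) for d in docs]
    return softmax(scores)

def train_step(W, query, docs, gold, lr=0.5):
    # Cross-entropy loss on the gold document; the gradient w.r.t. the
    # scores is (probs - one_hot), which is chained back into W by hand.
    probs = retrieval_probs(W, query, docs)
    loss = -math.log(probs[gold])
    grad_scores = [p - (1.0 if i == gold else 0.0)
                   for i, p in enumerate(probs)]
    for a in range(len(W)):
        doc_term = sum(g * d[a] for g, d in zip(grad_scores, docs))
        for b in range(len(query)):
            W[a][b] -= lr * doc_term * query[b]
    return loss

# Hypothetical toy data: three fixed document embeddings and one query
# whose relevant document (index 1) is NOT the initial top hit.
docs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.0]
gold = 1
W = [[1.0, 0.0], [0.0, 1.0]]  # identity projection to start

losses = [train_step(W, query, docs, gold) for _ in range(50)]
final_probs = retrieval_probs(W, query, docs)
print(losses[0], losses[-1], final_probs)
```

In this sketch only the query projection is trained and the document embeddings stay fixed; with an end-to-end differentiable pipeline the same loss could, in principle, also update the document-side and layout models.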

The layout model can also be customized for fine grained document understanding.

2 comments

You don't have a license at the top level of your repository. That means that nobody who takes licensing at all seriously can use your stuff, even just for reference.
Good catch, will add it tomorrow. The license is Apache 2.0.
They do have one: https://github.com/jolibrain/colette/blob/main/pyproject.tom...

I agree it's better to have the full licence at top level, but is there a legal reason why this would be inadequate?

Standard practice now is to just have an LLM read the whole repo and write a new original version in a different language. It’s code laundering.
Great, thanks for sharing your code. Could you please add a license so that others and I can tell whether we're able to use it?
Yeah, the fine-tuning is definitely the best part.

Often, the blocker becomes high-quality eval sets (which I guess is always the blocker).