
by pilooch 336 days ago

Some colleagues and I implemented exactly this six months ago for a French government agency.

It's open source and available here: https://github.com/jolibrain/colette

It's not our primary business, so it's just sitting there and we don't advertise it much, but it works, more or less, though it needs some tweaks to get really efficient.

The true genius, though, is that the whole thing can be made fully differentiable, which unlocks the ability to fine-tune the visual RAG on targeted datasets.

The layout model can also be customized for fine-grained document understanding.
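
To make "fully differentiable" a bit more concrete, here is a toy sketch. It is not Colette's actual code, and every name in it is hypothetical. It assumes retrieval is scored by a dot product between a query embedding and document embeddings, and that a softmax over those scores replaces the hard top-k pick. Under that assumption, a loss on the retrieved document has a gradient with respect to the query embedding, so the retriever can be fine-tuned with plain gradient descent.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def retrieval_loss(query, docs, target):
    """Cross-entropy of the soft retrieval distribution vs. the relevant doc index."""
    probs = softmax([dot(query, d) for d in docs])
    return -math.log(probs[target])

def loss_grad_wrt_query(query, docs, target):
    """Analytic gradient: sum_i (p_i - 1[i == target]) * doc_i."""
    probs = softmax([dot(query, d) for d in docs])
    grad = [0.0] * len(query)
    for i, d in enumerate(docs):
        coeff = probs[i] - (1.0 if i == target else 0.0)
        for k in range(len(query)):
            grad[k] += coeff * d[k]
    return grad

def finetune_query(query, docs, target, lr=0.5, steps=50):
    """Gradient descent on the query embedding (hypothetical stand-in for an encoder)."""
    q = list(query)
    for _ in range(steps):
        g = loss_grad_wrt_query(q, docs, target)
        q = [qk - lr * gk for qk, gk in zip(q, g)]
    return q

# Toy data: three document embeddings; document 1 is the relevant one.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
tuned = finetune_query(query, docs, target=1)
```

In a real system the gradient would flow into the encoder weights rather than a single query vector, and an autodiff framework would compute it; the hand-written gradient here only keeps the sketch dependency-free.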

2 comments

ted_dunning 336 days ago

You don't have a license in your repository top-level. That means that nobody who takes licensing at all seriously can use your stuff, even just for reference.

pilooch 336 days ago

Good catch, will add it tomorrow. The license is Apache 2.0.

wryun 336 days ago

They do have: https://github.com/jolibrain/colette/blob/main/pyproject.tom...

I agree it's better to have the full licence at top level, but is there a legal reason why this would be inadequate?

deadbabe 336 days ago

Standard practice now is to just have an LLM read the whole repo and write a new original version in a different language. It’s code laundering.

JSR_FDED 336 days ago

Great, thanks for sharing your code. Could you please add a license so I and others can understand if we're able to use it?

Adityav369 336 days ago

Yeah, the fine-tuning is definitely the best part.

Often, the blocker becomes high-quality eval sets (which I guess is always the blocker).
