by verditelabs 2 hours ago
I am not on the research team, rather on the production side of things, so my knowledge on that is pretty limited. I think one of the main takeaways from a lot of the research, though, on both the segmentation side and the ink detection side, is that it's a lot less about what models and techniques and such you use, but how good your training data is. Gathering ground truth is hard, and if you don't have a lot of good ground truth, it doesn't matter if your code is perfect, you'll never get results.

You brought up what I'm most curious about: where does the ground truth come from for this work, since you can't just unwrap a scroll to tell if the model got it right or, presumably, make a facsimile scroll and wrap it up?
The ground truth comes from manual work. The scrolls can be unwrapped virtually, manually, through extensive pointing and clicking by a human on the boundaries of the scroll. This, in and of itself, is not particularly hard in sections of the scroll that are preserved well, but it is extremely tedious, slow, and error-prone. We have a team of annotators who do manual annotation and refinement through custom software we've written, mostly improving on automatically generated segmentations and unwrappings.

Once you have some unwrapped papyrus, you can render it to an image and look for ink. Ink leaves a certain texture that can be identified by the naked eye and labeled. Between these two processes you get the segmentation and ink detection ground truth. Segments can be flattened virtually through existing software and algorithms.
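
To make the labeling step concrete, here is a rough sketch of how a rendered segment and a hand-drawn ink mask could be turned into labeled training patches. This is only an illustration under assumptions, not the project's actual pipeline: the function name `make_patches`, the tile size, and the `min_ink_fraction` threshold are all hypothetical, and the image is a plain 2D list rather than real scan data.

```python
from typing import List, Tuple

Grid = List[List[float]]


def make_patches(
    image: Grid,
    ink_mask: List[List[int]],
    size: int,
    min_ink_fraction: float = 0.5,  # hypothetical threshold, chosen for illustration
) -> List[Tuple[Grid, int]]:
    """Cut a rendered segment into size x size tiles and label each from the mask.

    A tile is labeled 1 (ink) when at least min_ink_fraction of its mask
    pixels are marked as ink by the annotator, otherwise 0 (no ink).
    """
    height, width = len(image), len(image[0])
    if len(ink_mask) != height or any(len(row) != width for row in ink_mask):
        raise ValueError("image and ink_mask must have the same shape")

    samples = []
    # Non-overlapping tiles; partial tiles at the right/bottom edges are skipped.
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            tile = [row[left:left + size] for row in image[top:top + size]]
            ink_pixels = sum(
                sum(row[left:left + size]) for row in ink_mask[top:top + size]
            )
            label = 1 if ink_pixels / (size * size) >= min_ink_fraction else 0
            samples.append((tile, label))
    return samples


# Toy 4x4 "rendered segment" and a hand-labeled mask (1 = annotator saw ink).
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
samples = make_patches(image, mask, size=2)
labels = [label for _, label in samples]
```

Every label here comes straight from the human-drawn mask, so any annotation mistake flows directly into the training set, whatever model is used afterwards.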

I'm sure that process is described somewhere on the project's site and, being a lazy human (and unwilling to ask LLMs to summarize it for me), I leaned on you for a human answer. I really appreciate you taking the time to answer. Thank you.

I can see why you'd be attracted to this project from a "let's solve problems computationally" perspective (never mind the historical side). It sounds like there are some cool problems in there.

The eye toward automating the process that the project seems to be targeting is particularly cool, too. This is the kind of stuff that makes me have real enthusiasm for ML.

That is a general truth of most ML; many models _can_ find the information in the data, if the data is good enough. If it is not, then likely no model can.
> it's a lot less about what models and techniques and such you use, but how good your training data is.

Ah, the good old bitter lesson strikes again