How awesome do you feel right now? This is HUUUGE! To think that a scroll was unreadable for so, so long, until we invented machines that let us read it slice by slice. It's such an unfathomable achievement - we made machines that let us read fragile 2000+ year old scrolls without ever opening them - and you helped do just that.
Do you know what kinds of features the model is picking up on to distinguish ink from papyrus? And did you have any labeled data (images where a human expert has identified ink or perhaps a scan of a burnt scroll with known content) to help train it?
Certainly my Mark 1 eyeballs would not obviously perform better than random guessing at this task. Although my eyeballs are, if nothing else, nerfed by only being able to see a 2D slice of the data.
Yes. Most of the ink we have come across is carbon based. This leaves a certain texture on the scrolls that is recoverable and viewable with fairly basic physically based rendering, though how much ink is recoverable varies greatly from one character to the next. I don't have links handy but we just published updates to our data viewer page on our website. Pherc.Paris.4 I believe has the best overlay of ink.
A lot of labeled data is available on our FTP server, which has public access.
We've been trying to automate since the beginning. A lot of it is automated, but it's mostly the easier and less damaged parts of the scrolls. Scanning takes a few days for the biggest scrolls, but the human refinement is still a multi-month process.
Outstanding work! I've participated in the challenge, but didn't get far. One of the questions I had at the time was - if I'm going to use ML to detect ink, could it invent hallucinated letters, or even parts of text, and how to prevent that?
Yes, it's quite possible for ML to hallucinate ink, though it is on a much more local scale, like predicting a slightly longer stroke, filling in more of a character than is actually in the data, etc. Perhaps enough to change a reading of a character or show ink where there isn't any. It is difficult for ink detection to hallucinate grammatical and idiomatic Greek and Latin.
What is the input to the ML algorithm? Does it know the surrounding context so that it has a chance to deduce "if this stroke is slightly longer then the end result will be idiomatic Greek and Latin"?
Just as with redacted documents (consistently blocked terms) or bad OCR jobs (wrong or missing characters), even if only a certain percentage comes out unmangled it is more readable than having no data at all.
A stable base corpus and some dynamic programming will allow you to clean up the remainder[0].
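To make the "base corpus plus dynamic programming" idea concrete, here is a minimal sketch, assuming the cleanup is done by snapping each mangled word to its nearest known word by edit distance. That is one common reading of the suggestion, not necessarily what the commenter had in mind. The function names and the tiny word list are hypothetical.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def closest_word(mangled: str, corpus: Sequence[str]) -> str:
    """Return the corpus word with the smallest edit distance to the mangled input."""
    return min(corpus, key=lambda word: edit_distance(mangled, word))


# Hypothetical toy corpus (transliterated), purely for illustration.
corpus = ["philosophia", "physis", "logos"]
guess = closest_word("phllosophia", corpus)
```

A real cleanup pass would also need to handle missing characters of unknown length and weigh candidates by context, which this sketch does not attempt.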
Imagine a worst case scenario: the Herculaneum scrolls turn out to be just the works of this one mediocre pet philosopher. What would we still expect to learn from them, and what would the next step be?
Though I have an interest in Old Norse and I spend a lot of time reading Scandinavian runestones. > 90% of them are grave markers for a dead father, mother, brother, sister, cousin, etc. If I've learned anything from that, it's that people across time and space all lead lives as real and complex as anyone else's. Their joys were as high as mine have been and their sorrows as low as mine have been.
I'm interested to know about the approaches that you tried with the ML, and then decided to not use. In practice, the options are so many. How did you come up with the final approach - and was there a systematic way to decide which options to go for?
I am not on the research team, rather on the production side of things, so my knowledge on that is pretty limited. I think one of the main takeaways from a lot of the research, though, on both the segmentation side and the ink detection side, is that it's a lot less about what models and techniques and such you use, but how good your training data is. Gathering ground truth is hard, and if you don't have a lot of good ground truth, it doesn't matter if your code is perfect, you'll never get results.
You brought up what I'm most curious about: where does the ground truth come from for this work, since you can't just unwrap a scroll to tell if the model got it right or, presumably, make a facsimile scroll and wrap it up?
The ground truth comes from manual work. The scrolls can be unwrapped virtually, manually, through extensive pointing and clicking by a human on the boundaries of the scroll. This, in and of itself, is not particularly hard in sections of the scroll that are preserved well, but is extremely tedious and slow and error prone. We have a team of annotators who do manual annotation and refinement through custom software we've written, mostly improving on automatically generated segmentations and unwrappings.
Once you have some unwrapped papyrus, you can render it to an image and look for ink. Ink leaves a certain texture that can be identified by the naked eye and labeled. Between these two processes you get the segmentation and ink detection ground truth. Segments can be flattened virtually through existing software and algorithms.
I'm sure that process is described somewhere on the project's site and, being a lazy human (and unwilling to ask LLMs to summarize it for me) I leaned on you. I really appreciate you taking the time to answer. Thank you.
I can see why you'd be attracted to this project from a "let's solve problems computationally" perspective (never mind the historical side). It sounds like there are some cool problems in there.
The eye toward automating the process that the project seems to be targeting is particularly cool, too. This is the kind of stuff that makes me have real enthusiasm for ML.
That is a general truth of most ML; many models _can_ find the information in the data, if the data is good enough. If it is not, then likely no model can.
That varies greatly on the state of preservation of the scroll. For some of the scrolls we can recover entire columns of text. But this is a best case. Plenty of scrolls, or portions of scrolls, are extremely damaged and warped to where our current methods cannot unroll them through any combination of automated and human driven unrolling. Both of these still have massive headroom for improvement, but achieving that headroom is hard as the preservation gets worse.
To give numbers, for ideal portions of scrolls, we can read 100% of the characters. In nonideal portions of scrolls, we can read 0% of the characters. It's not really possible to quantify how much we could theoretically recover of that 0% through better methods, and how much is truly destroyed.
I am not a papyrologist or a classicist, rather I'm a computer scientist, so my expertise is unfortunately not in _what_ the scrolls say, rather how we get there. That being said I think and hope that there will be a trove of things that has no known provenance at all, completely lost works that elude the public memory.
Other members that were on the team before me had already proved it out before I came along so I knew it was possible. The cool thing for me though was specifically doing some physically based rendering techniques. How well these work varies greatly, but on a few segments in one scroll they work extremely well. I whipped up some simple code to composite layers, did up a render, and without any ML at all was looking at multiple rows of text that no one had read for 2000 years. That was neat.
There's also the Telegony. Odysseus has a son through Circe who winds up killing him and marrying Penelope. Odysseus' son through Penelope, Telemachus, marries Circe. There's some wild stuff that doesn't survive.
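For readers wondering what "compositing layers" might look like in code, here is a rough sketch. It is not the code described above. It assumes a flattened segment is available as a stack of layers sampled around the papyrus surface, with shape (depth, height, width), and it collapses that stack with a simple max or mean projection. The function name, the array layout, and the choice of projection are all assumptions made for illustration.

```python
import numpy as np


def composite_layers(layers: np.ndarray, mode: str = "max") -> np.ndarray:
    """Collapse a (depth, height, width) stack into one image scaled to [0, 1]."""
    if layers.ndim != 3:
        raise ValueError("expected an array of shape (depth, height, width)")
    stack = layers.astype(np.float32)

    if mode == "max":
        flat = stack.max(axis=0)   # brightest value seen at each pixel
    elif mode == "mean":
        flat = stack.mean(axis=0)  # average over depth, smooths noise
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    # Stretch contrast so faint texture differences become visible.
    lo, hi = flat.min(), flat.max()
    if hi == lo:
        return np.zeros_like(flat)
    return (flat - lo) / (hi - lo)


# Tiny synthetic stack standing in for real scan data.
layers = np.zeros((3, 2, 2))
layers[1, 0, 0] = 4.0
layers[2, 1, 1] = 2.0
image = composite_layers(layers, mode="max")
```

Which projection works best, and how many layers to include, would depend on the segment; the thread only says that results vary greatly from one segment to the next.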
Looking through these it’s crazy to find out that The Iliad is only 1 of like 5 original texts on the Trojan war. We’re reading book 2 of a 5 book series.
That's what was thought, but maybe not -- only one of the three so far looks Epicurean, which is not what was expected. Maybe it's a fluke, but historians are buzzing a bit about whether it might be broader than expected.
The Epicureans were particularly hostile to the Jews and Christians, because Epicureans deny Providence or the active intervention of the divine in human affairs. See Horace Sermones 1.5.
in the paper it says "The recovered text is a philosophical treatise on ethics, and the evidence points to a Stoic work: it turns on human nature, impulse, and the moral progress of human beings, and its final preserved column names Aristocreon — nephew and disciple of the great Stoic Chrysippus — which, together with the language and themes of the text, places it in a Stoic context and dates it to the 2nd century BC."
BS in CS from a big state school in the USA. I have a hobby interest in history. I learned about the challenge on YouTube. Got involved contributing because I needed money. Then they put out a job posting. I applied, interviewed, and was hired.
30 scrolls, maybe? Something like that. I scanned Pherc Paris 4 and Pherc Paris 3 at Beam line 18 at ESRF back in March.
The team did "the campfire scroll" experiment a few years ago to replicate carbonization, unrolling, and ink detection. That is the only case I am aware of. It proved the method could work, but it's not a source of, say, training data; it varies too much from the real scrolls.
The main limitation is time and cost. We have to scan on what is AFAIK the most powerful x-ray beam line in the world. It is not cheap.
You had to pay? I understand the machine cost many hundreds of millions of dollars, but I would have thought for academic researchers doing open science, the beamtime is free (funded by the govt / science trusts).
The beam time is unfortunately not free. I scanned Pherc Paris 4 and Pherc Paris 3 in March and had the final shift on the beam. As I was removing the scroll from the scanning pedestal the next team of scientists were already in the lab getting their samples ready. It's a well oiled machine and they've got customers.
The way these things normally work is that the project starts with some sort of a grant. Then that grant pays for all of the costs of the project: people's salaries, materials used, time on equipment, plus money for the buildings and administration (overhead).
In this case the time on the equipment would need to be included, both a portion of the cost of building/maintaining it, and probably the energy needed to run it. Even where the government is providing the grant (likely here), it still needs to be accounted for.
How do you get to do that? As in what did you study to get the prerequisite knowledge, and how did you find this particular job? When I see interesting jobs I'm always curious what path led there.
I am a computer scientist. I studied CS in university, worked in the semiconductor industry for a while, and got started as a participant in the challenge aspect of the Vesuvius Challenge. They were hiring, I sent in an application, interviewed, and was offered the job.
That's a tough one to give a strong estimate of. Some scrolls are easier or harder to unwrap and read for a multitude of different reasons, mostly due to how damaged the scroll was in the eruption, and how easy or not the ink is to read. IIRC from what we've scanned of the Herculaneum collection, none of the ink is easily visible via spectrum alone, so we have to use a lot of ML and physically based rendering techniques to be able to find ink. That also requires unwrapping and segmentation _before_ any ink detection.
For iron gall ink with high enough iron concentration, the ink stands out in the xray volume through simply masking off low values, such as was shown in our campfire scroll experiment a few years ago. No Herculaneum scrolls show similar ink.
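As an illustration of "masking off low values", here is a minimal sketch, assuming the scan is loaded as a 3D array of intensities where dense, iron-rich ink shows up as high values. The function name, the threshold, and the sample numbers are made up for the example; a real cutoff would have to be chosen from the actual scan.

```python
import numpy as np


def mask_low_values(volume: np.ndarray, threshold: float) -> np.ndarray:
    """Zero out every voxel below the threshold, keeping only dense material."""
    return np.where(volume >= threshold, volume, 0)


# Synthetic 1x2x2 volume: two bright "ink" voxels among dim "papyrus" voxels.
volume = np.array([[[10, 200], [50, 180]]], dtype=np.uint16)
masked = mask_low_values(volume, threshold=150)
ink_voxels = int(np.count_nonzero(masked))
```

As the comment above notes, this only works when the ink is much denser than the papyrus, which is why it does not carry over to the carbon-based ink in the Herculaneum scrolls.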
Most of the evidence so far points towards carbon based ink. I am not sure if any of the scrolls we have scanned show strong evidence of iron gall based ink. I know that there are different types and preparation methods for different carbon based inks, but I do not know if it is possible to determine which kind(s) were used solely from inspecting the xrays.
I am, though, not a papyrologist, so historical ink making, preparation, and usage are not my field.
Did anyone on the team come from a non-science, non-math, non-academia background? Did anyone working on this just teach themselves and start contributing?
Yes. Sean, who was a co-winner of the 2024 prize, IIRC has no formal background in ML, computer science, AI, etc. He is one of our core researchers and the most productive team member.
I've been on the Discord for a couple of years now, and poking around with submissions as well. Sean and the entire team deserve so much praise for all of this work.
It's easy to just read about the breakthrough and see it as one neat, linear line to get there, and hard to comprehend the hours, months and years that so many spent to get there. Big congrats to you, Sean, Nat and the entire team!
I am unaware of those fragments in particular, though we have scanned a dozen or so fragments, mostly to help guide ink detection, since the ink in them is often more visible in visible and/or near IR light, but can be hard to impossible to detect in the xray spectrum.
You have the potential to rewrite the history of European Antiquity quite substantially. The Herculaneum set of scrolls is enormous and must contain a lot that is hitherto unknown.
That comes with a set of peculiar risks. Once your work starts producing something that contradicts previous work of Very Important People, they will lobby to stop you. Be prepared for that.
Science should be neutral and always value new evidence. Scientists as humans are unfortunately subject to all sorts of passions.
We have very little written material surviving from Rome, at least from the period before the codex (book) was invented, which was more durable than a scroll. Often, we only know of one source describing important events, and when it comes to political struggles and civil wars, the perspective of the defeated party often did not survive. The punishment of damnatio memoriae was practised and even among the early emperors, Caligula and Nero were subject to a form thereof. (This library in Herculaneum was buried 11 years after Nero's death.) I would be surprised if everything in the scrolls perfectly aligned with the record that survived for 2000 years and that was filtered by both random chance and political/religious censorship. Even Christians later destroyed some pagan texts.
BTW personally, I would love for some textbook of Etruscan to emerge from there. This was once again a language whose teaching was banned in Rome.
Hats off!