Hacker News
by mkhorton 1445 days ago
In the short (~decade) term, we do tape backups of calculation data in Berkeley, and offload data to an independently-funded European project (NOMAD), to ensure data is in at least two locations. Likewise, our production databases are automatically backed up in the cloud, but we also keep a local mirror on a bare metal server. In the longer 2^6-year time frame or further out still, I would just be flattered if the data is at all still useful for people. I think it's fair to say our community has a lot of challenges to face before we get to that point.

We don't seed any torrents ourselves and only support API access (mainly because we're a small team and have to focus our effort), but with the open license I hope the data can live on wherever/however it can.

If someone were to try to do a bulk download of the data (well, or whatever they thought was the most significant data) through the API for preservation purposes, might it put an undue load on your server infrastructure? Some kind of bulk data download might be useful insurance there.

There seem to be some interesting efforts to run SQLite in the browser so that server infrastructure only has to provide bulk data access, with precomputed indices to avoid full table scans; I wonder if those might be applicable here: https://blog.ouseful.info/2022/02/11/sql-databases-in-the-br... (though of course if you aren't using SQLite as your backend now it might be a headache)

Such an approach, if it were feasible, would have the advantage that bulk data downloads wouldn't look very different from normal use.
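
To make the "precomputed indices to avoid full table scans" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns, and rows are made up for illustration and say nothing about how the real database is laid out. It only shows that, once an index exists, SQLite plans a targeted search instead of reading every row, which is the property a bulk-file-plus-index setup would lean on.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")

# Hypothetical schema and toy rows, purely for illustration.
conn.execute(
    "CREATE TABLE materials (id INTEGER PRIMARY KEY, formula TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO materials (formula, payload) VALUES (?, ?)",
    [("Si", "placeholder-a"), ("GaAs", "placeholder-b"), ("NaCl", "placeholder-c")],
)

lookup = "SELECT * FROM materials WHERE formula = ?"

# Without an index on formula, SQLite has to scan the whole table.
plan_before = query_plan(conn, lookup, ("Si",))

# With the index built ahead of time, the same query becomes an index search.
conn.execute("CREATE INDEX idx_materials_formula ON materials (formula)")
plan_after = query_plan(conn, lookup, ("Si",))

print(plan_before)
print(plan_after)
```

The exact wording of the plan text differs between SQLite versions, so treat the printed strings as indicative rather than something to parse.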

This would be a much bigger conversation, but the SQLite efforts are very cool.

The short answer to your question is that the API load should be fine (I regularly download large subsets of the database myself via the API for research purposes), although there are good and bad ways of writing API queries. We have some tutorials, workshops, etc. available to help newcomers to our API write good queries.

We also have an email address set up (heavy.api.use@materialsproject.org) where people can give us a heads up if they are concerned about putting an undue load on our servers; as much as we try to have reasonable automatic limits set, sometimes we have had issues! API traffic continues to grow too, which in some ways is a nice problem to have, but does mean this is a moving target.
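
For anyone scripting a large download, one generic way to stay on the "good" side is to request records in batches and pause between requests. The sketch below is an assumption-laden illustration, not the project's client or API: `fetch_in_batches`, `fetch_batch`, and the default batch size and interval are all hypothetical names and values, and the real limits should come from the project's own documentation.

```python
import time
from typing import Callable, Iterator, List, Sequence


def fetch_in_batches(
    ids: Sequence,
    fetch_batch: Callable[[Sequence], List[dict]],
    batch_size: int = 100,
    min_interval: float = 1.0,
) -> Iterator[dict]:
    """Yield records for `ids`, calling `fetch_batch` on one slice at a time.

    `fetch_batch` is a hypothetical callable supplied by the caller that
    performs a single request for a batch of ids. At least `min_interval`
    seconds are left between the starts of consecutive requests.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    last_start = None
    for offset in range(0, len(ids), batch_size):
        if last_start is not None:
            # Sleep only for whatever part of the interval has not yet elapsed.
            remaining = min_interval - (time.monotonic() - last_start)
            if remaining > 0:
                time.sleep(remaining)
        last_start = time.monotonic()
        yield from fetch_batch(ids[offset:offset + batch_size])


# Stand-in for a real request, so the sketch runs without any network access.
requested_batches = []


def fake_fetch_batch(batch):
    requested_batches.append(list(batch))
    return [{"id": item} for item in batch]


records = list(
    fetch_in_batches(list(range(5)), fake_fetch_batch, batch_size=2, min_interval=0.0)
)
```

Asking for many ids per request, rather than one request per id, keeps the request count low, and the interval keeps a long-running script from hammering the server.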

That's good to hear!