Hacker News
by kragen 1448 days ago
The page says the data is licensed under CC-BY (presumably in countries that have sui generis database protection, rather than countries like the US where facts aren't copyrightable). This is great!

Is there a torrent? How can we ensure that this treasury of materials knowledge is preserved 64, 256, or 1024 years into the future, even if, for example, the US goes to war against Russia or China and decides to criminalize exporting materials data?


In the short (~decade) term, we do tape backups of calculation data in Berkeley, and offload data to an independently-funded European project (NOMAD), to ensure data is in at least two locations. Likewise, our production databases are automatically backed up in the cloud, but we also keep a local mirror on a bare metal server. Over the longer 2^6-year time frame, or further out still, I would just be flattered if the data were still useful to people at all. I think it's fair to say our community has a lot of challenges to face before we get to that point.

We don't seed any torrents ourselves and only support API access (mainly because we're a small team and have to focus our effort), but with the open license I hope the data can live on wherever/however it can.

If someone were to try to do a bulk download of the data (well, or whatever they thought was the most significant data) through the API for preservation purposes, might it put an undue load on your server infrastructure? Some kind of bulk data download might be useful insurance there.

There seem to be some interesting efforts to run SQLite in the browser so that server infrastructure only has to provide bulk data access, with precomputed indices to avoid full table scans; I wonder if those might be applicable here: https://blog.ouseful.info/2022/02/11/sql-databases-in-the-br... (though of course if you aren't using SQLite as your backend now it might be a headache)

Such an approach, if it were feasible, would have the advantage that bulk data downloads wouldn't look very different from normal use.
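
To make the "precomputed indices to avoid full table scans" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns, and rows are all made up for illustration (this is not the Materials Project schema); it only shows how SQLite's query plan changes from a scan to an index search once an index exists.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail strings from EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The detail text is the last column of each plan row.
    return [row[-1] for row in rows]


# Hypothetical toy table; the values are placeholders, not real data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE materials (id INTEGER PRIMARY KEY, formula TEXT, energy REAL)"
)
conn.executemany(
    "INSERT INTO materials (formula, energy) VALUES (?, ?)",
    [("A2B", 0.0), ("AB", 1.0), ("AB3", 2.0)],
)

lookup = "SELECT id, energy FROM materials WHERE formula = ?"

# Without an index, the lookup has to scan the whole table.
plan_before = query_plan(conn, lookup, ("AB",))

# With a precomputed index, the same lookup becomes an index search,
# so a reader only needs the index pages plus the matching rows.
conn.execute("CREATE INDEX idx_materials_formula ON materials (formula)")
plan_after = query_plan(conn, lookup, ("AB",))

print(plan_before)
print(plan_after)
```

The in-browser setups mentioned above rely on the same property: if a query only touches a few index and data pages, the client can fetch just those pages from a static file rather than the whole database.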

This would be a much bigger conversation, but the SQLite efforts are very cool.

The short answer to your question is that the API load should be fine (I regularly download large subsets of the database myself via the API for research purposes), although there are good and bad ways of writing API queries. We have some tutorials, workshops, etc. available to help newcomers to our API write good queries.

We also have an email address set up (heavy.api.use@materialsproject.org) where people can give us a heads up if they are concerned about putting an undue load on our servers; as much as we try to have reasonable automatic limits set, sometimes we have had issues! API traffic continues to grow too, which in some ways is a nice problem to have, but does mean this is a moving target.
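
As a rough illustration of what a "good" bulk query pattern can look like on the client side, here is a generic sketch of a paginated download that pauses between requests and backs off when the server pushes back. Everything here is an assumption for the sake of the example: `fetch_page`, `RateLimited`, and the default page size and pause are hypothetical and are not part of the Materials Project API or its client.

```python
import time
from typing import Callable, Iterator, List


class RateLimited(Exception):
    """Raised by the hypothetical fetch_page when the server says to slow down."""


def download_all(
    fetch_page: Callable[[int, int], List[dict]],
    page_size: int = 500,
    pause_s: float = 1.0,
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield every record, one page at a time, pausing between requests.

    fetch_page(offset, limit) is a caller-supplied function (an assumption
    of this sketch) that returns a list of records, empty when exhausted.
    """
    offset = 0
    while True:
        for attempt in range(max_retries):
            try:
                page = fetch_page(offset, page_size)
                break
            except RateLimited:
                # Exponential backoff: wait 2x, 4x, 8x ... the base pause.
                sleep(pause_s * 2 ** (attempt + 1))
        else:
            raise RuntimeError(
                f"gave up at offset {offset} after {max_retries} retries"
            )
        if not page:
            return
        yield from page
        offset += len(page)
        # Fixed pause between successful requests to keep the load gentle.
        sleep(pause_s)
```

A caller would wrap whatever client they actually use in a `fetch_page` function and iterate over `download_all(fetch_page)`, writing records to disk as they arrive. Requesting only the fields you need and keeping pages modest in size are the usual ways to keep each request cheap for the server.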

That's good to hear!