	by netizen-936824 1656 days ago
	Raw data can be on the order of terabytes. Not that it can't be shared, but the sheer size is a real barrier when it comes to raw data.

3 comments

zmb_ 1655 days ago

There are also legal and privacy concerns. I've worked on a few research papers where exactly one researcher had access to the data under a very strict NDA. And even they did not get full access to the raw data, only the ability to run vetted code against it and some subsets for development.

This is because the datasets were subscriber logs from mobile operators. They are both highly privacy sensitive and contain sensitive business knowledge. There is no way they will ever get published, even in some anonymized form.

Ultimately it always comes down to trust. You need to convince your peer reviewers to trust you that you have correctly done what you have claimed to have done. Of course, even when you publish datasets, you need to convince the peer reviewers to trust you that you didn't fake the data.

chongli 1656 days ago

The whole point of the field of statistics is that you can carry out statistical tests and analysis on a sample; you don’t need all of the data.

JBorrow 1656 days ago

It doesn’t really work like that. For instance, imagine you have a simulation with billions of particles in it. To construct reduced data you may need to use many fields (position, temperature, composition) of all particles over many outputs (usually at different times).

chongli 1656 days ago

In that case you shouldn’t need to ship the data at all. Just include the code for the simulation and let the researchers run it to generate the data themselves.

JBorrow 1648 days ago

Sorry I'm a bit late to this, but those simulations take tens to hundreds of millions of CPU hours (i.e. costs of millions to tens of millions of dollars), so that's not practical.

bloak 1655 days ago

I think in astronomy they generate tens of terabytes per night, and an experiment may involve automatically searching through the data for instances of something rare, like one star almost exactly behind another star, or an imminent supernova, or whatever. To test the program that does the searching you need the raw data, which until recently, at least, was stored on magnetic tape because they don't need random access to it. They read through all the archived data once per month (say) and apply all current experiments to it, so whenever you submit a new experiment you get the results back one month later.

I like the idea of publishing the data with the paper but it's not feasible in every case.

someguydave 1656 days ago

I guess we should stop trying because datasets are big

jrichardshaw 1655 days ago

The GP is making a completely legitimate point here that broad sharing of large raw datasets is pretty hard, but I don't think anyone is arguing we should give up. Here are a few thoughts, though they're more directed at the general thread than the parent.

In my case I'm currently finishing up a paper where the raw data it's derived from comes to 1.5 PB. It is not impossible to share that, but it costs time and money (which academia is rarely flush with), and even if it were easy at our end, very few groups that could reproduce it have the spare capacity to ingest that. We do plan to publicly release it, but those plans still have a lot of open questions.
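To put a rough number on the "time" part, here is a back-of-envelope sketch of how long 1.5 PB takes to move over a single network link. The link speeds and the 70% efficiency figure are illustrative assumptions, not numbers from this thread, and the function name `transfer_days` is made up for the example.

```python
def transfer_days(size_bytes, link_bits_per_s, efficiency=1.0):
    """Days needed to move size_bytes over one link.

    efficiency is the assumed fraction of the nominal link rate
    that is actually sustained (1.0 means the full nominal rate).
    """
    seconds = size_bytes * 8 / (link_bits_per_s * efficiency)
    return seconds / 86400  # seconds per day


PB = 10**15  # decimal petabyte

if __name__ == "__main__":
    raw_size = 1.5 * PB  # the raw dataset size quoted above
    # Assumed link speeds in Gbit/s, chosen only for illustration.
    for gbps in (1, 10, 100):
        ideal = transfer_days(raw_size, gbps * 10**9)
        derated = transfer_days(raw_size, gbps * 10**9, efficiency=0.7)
        print(f"{gbps:>3} Gbit/s: {ideal:6.1f} days ideal, "
              f"{derated:6.1f} days at 70% efficiency")
```

This only counts wire time. It ignores storage at the receiving end, checksumming, and retries, which is where much of the real cost tends to sit.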

Alternatively we could try to share summary statistics (as suggested by a post above), but then we need to figure out what level is appropriate. In our case we have a relevant summary statistic of our data that comes to about 1 TB, which is far easier to share (1 TB really isn't a problem these days, though you're not embedding it in a notebook). But a large amount of data processing was applied to produce that, and if I give you that summary I'm implicitly telling you to trust me that what we did at that stage was exactly what we said we'd done and was done correctly. Is that reproducibility?

You could also argue this the other way. What we've called "raw data" is just the first thing we're able to archive, but our acquisition system that generates it is a large pile of FPGAs and GPUs running 50k lines of custom C++. Without the input voltage streams you could never reproduce exactly what it did, so do you trust that? Then you're into the realm of is our test suite correct, and does it have good enough coverage?

I think we have a pretty good handle on one aspect of this: is our analysis internally reproducible? I.e. with access to the raw data, can I reproduce everything you see in the paper? That's a mixture of systems (e.g. configs and git repo hashes being automatically embedded into output files), and culture (e.g. making sure no one thinks it's a good idea to insert some derived data into our analysis pipeline that doesn't have that description embedded; data naming and versioning).
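As a minimal sketch of what "configs and git repo hashes being automatically embedded into output files" could look like, the snippet below wraps every result in a provenance record before writing it out. This is an illustration only, not the pipeline described above: the JSON layout and the names `git_commit`, `with_provenance` and `write_output` are all assumptions made for the example.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone


def git_commit(repo_dir="."):
    """Return the current commit hash, or "unknown" if git is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def with_provenance(result, config, inputs=(), repo_dir="."):
    """Wrap a result with the metadata needed to regenerate it."""
    # Serialize with sorted keys so the hash does not depend on key order.
    config_json = json.dumps(config, sort_keys=True)
    return {
        "provenance": {
            "git_commit": git_commit(repo_dir),
            "config": config,
            "config_sha256": hashlib.sha256(config_json.encode()).hexdigest(),
            "inputs": list(inputs),  # names/versions of upstream data products
            "created_utc": datetime.now(timezone.utc).isoformat(),
        },
        "result": result,
    }


def write_output(path, result, config, inputs=(), repo_dir="."):
    """Write the result to disk; there is no code path that skips provenance."""
    record = with_provenance(result, config, inputs, repo_dir)
    with open(path, "w") as fh:
        json.dump(record, fh, indent=2, sort_keys=True)
    return record
```

The design choice here is that the only way to write an output is through a function that attaches the metadata, so a derived file without a description cannot slip into the pipeline by accident.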

But the external reproducibility question is still challenging, and I think it's better to think about it as more of a spectrum, with some optimal point balancing practicality and how much an external person could reasonably reproduce. Probably with some weighting for how likely it is that someone will actually want to attempt a reproduction from that level. This seems like the question that could do with useful debate in the field.

someguydave 1654 days ago

why not purchase a sufficient number of tapes or drives to capture the data and deposit it at the university library?

certainly sharing apparatus is hard but you could release the schematics, board designs and BOMs of the electronics involved.

The problem now is that 1) very few even try to reproduce 2) very little money is available for reproduction

fixing those incentives would help a lot.
