HN Mirror



	by chongli 1656 days ago
	The whole point of the field of statistics is that you can carry out statistical tests and analysis on a sample; you don’t need all of the data.

2 comments

JBorrow 1656 days ago

It doesn’t really work like that. For instance, imagine you have a simulation with billions of particles in it. To construct reduced data you may need to use many fields (position, temperature, composition) of all particles over many outputs (usually at different times).
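A minimal sketch of the kind of reduction described above, assuming hypothetical field names (`x`, `y`, `z`, `mass`, `temperature`) and a mass-weighted radial temperature profile as the reduced quantity. Neither the names nor the choice of quantity come from the comment. The point it illustrates is that every particle's fields are read to fill the bins, and the same pass would be repeated for each output time.

```python
import math


def radial_temperature_profile(particles, n_bins, r_max):
    """Mass-weighted mean temperature in spherical shells around the origin.

    `particles` is any iterable of dicts with hypothetical keys
    x, y, z, mass, temperature (an assumption for this sketch).
    Returns one value per shell, or None for an empty shell.
    """
    mass_sum = [0.0] * n_bins
    weighted_temp = [0.0] * n_bins
    for p in particles:  # every particle has to be read
        r = math.sqrt(p["x"] ** 2 + p["y"] ** 2 + p["z"] ** 2)
        if r >= r_max:
            continue  # outside the profiled region
        b = min(int(r / r_max * n_bins), n_bins - 1)
        mass_sum[b] += p["mass"]
        weighted_temp[b] += p["mass"] * p["temperature"]
    return [wt / m if m > 0 else None for wt, m in zip(weighted_temp, mass_sum)]


def profiles_over_outputs(snapshots, n_bins, r_max):
    """Repeat the reduction for each output (snapshot) of the simulation."""
    return [radial_temperature_profile(s, n_bins, r_max) for s in snapshots]
```

The reduced result is only `n_bins` numbers per output, but producing it touches several fields of every particle, which is why the raw outputs are needed.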


chongli 1656 days ago

In that case you shouldn’t need to ship the data at all. Just include the code for the simulation and let the rescuers run it to generate the data themselves.


JBorrow 1648 days ago

Sorry I'm a bit late to this, but those simulations take tens to hundreds of millions of CPU hours (i.e. they cost millions to tens of millions of dollars), so that's not practical.


bloak 1655 days ago

I think in astronomy they generate tens of terabytes per night and an experiment may involve automatically searching through the data for instances of something rare, like one star almost exactly behind another star, or an imminent supernova, or whatever. To test the program that does the searching you need the raw data, which until recently, at least, was stored on magnetic tape because they don't need random access to it: they read through all the archived data once per month (say) and apply all current experiments to it, so whenever you submit a new experiment you get the results back one month later.
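A rough sketch of that batch pattern, assuming a hypothetical record format (plain dicts) and hypothetical experiment predicates. None of this is taken from a real observatory pipeline. It shows why sequential storage is enough: the archive is read once, in order, and each record is offered to every registered experiment.

```python
def run_batch_pass(archive, experiments):
    """Read the archive once, front to back, applying every experiment.

    `archive` is any iterable of records (it may be a one-shot generator,
    which mimics a tape that is only read sequentially).
    `experiments` maps a name to a predicate record -> bool.
    Both shapes are assumptions made for this sketch.
    """
    results = {name: [] for name in experiments}
    for record in archive:  # single sequential read, no random access
        for name, matches in experiments.items():
            if matches(record):
                results[name].append(record)
    return results


def read_tape(records):
    """Stand-in for a sequential reader: yields records in order, once."""
    for record in records:
        yield record
```

A newly submitted experiment is just another entry in `experiments`, so its results only appear after the next full pass, which matches the one-month turnaround described above.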

I like the idea of publishing the data with the paper but it's not feasible in every case.
