| HN Mirror



	by dekhn 2489 days ago
	braggy PR is misleading: the 25GB/s coming from CERN is after they filter the data down from 600TB/s because there are no commercial systems that can capture data at higher rates.

3 comments

breck 2489 days ago

This is a good point!

Just for fun, for more perspective on big data, a human body generates around 1-10M new cells per second, and a cell contains about 10-100GB of information. So a single human is generating roughly 10PB/s to 1EB/s of data just in the new cells! (Give or take a few OOM)
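
A quick check of that multiplication, as a minimal sketch. It takes the cell count and per-cell size from the comment at face value; decimal (SI) units and the function name `cell_data_rate` are assumptions made here for illustration. The product of the quoted ranges works out to roughly 10 PB/s at the low end and 1000 PB/s (1 EB/s) at the high end.

```python
# Decimal (SI) units are an assumption; the comment does not say which it uses.
GB = 10**9
PB = 10**15


def cell_data_rate(cells_per_second: int, bytes_per_cell: int) -> int:
    """Bytes per second implied by newly created cells alone."""
    return cells_per_second * bytes_per_cell


# Low end: 1M cells/s at 10 GB each. High end: 10M cells/s at 100 GB each.
low = cell_data_rate(1_000_000, 10 * GB)
high = cell_data_rate(10_000_000, 100 * GB)

print(f"low:  {low // PB} PB/s")
print(f"high: {high // PB} PB/s")
```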


throwaway_bad 2489 days ago

Are you trying to quantify the "information" by the size of the DNA? I think this is a pretty meaningless number to multiply since most of the DNA will be exact copies and DNA alone doesn't capture all the information about a cell.

OTOH the amount of "information" needed to perfectly simulate a cell is probably unbounded. Just a corollary of the fact that we currently don't know how to perfectly simulate reality. Even a single "real" number can take up infinite space.


glenvdb 2489 days ago

> OTOH the amount of "information" needed to perfectly simulate a cell is probably unbounded. Just a corollary of the fact that we currently don't know how to perfectly simulate reality.

This is a very good point. The 'information' in a cell isn't the base pairs in its DNA, but all the atoms that make up the whole cell. And then each atom encapsulates properties such as position, velocity, charge, van der Waals radius etc.

However, this treats atoms with classical mechanics. In a quantum mechanical representation it would be very different again, and you can start asking really hairy questions about whether information can be created or destroyed.


dekhn 2488 days ago

It's worth reading the prior literature: Markus Covert has gotten pretty good at predicting quantitative phenotypes using whole cell simulations (with very limited cell representations, basically just feature matrices).

https://www.cell.com/abstract/S0092-8674(12)00776-3


breck 2489 days ago

Just back of the envelope estimates if you were to do things like scRNAseq, metabolomics, genomics, etc, on every cell. Infeasible but just as a thought experiment. Most DNA is the same, but not exact, and therein lies the rub (cancer). The point on unbounded though is a good one.


gnufx 2489 days ago

I'm not sure what that means, but the Cori filesystem is rated at 700GB/s and Summit's at 2.5TB/s. See https://docs.nersc.gov/filesystems/cori-scratch/ and https://www.olcf.ornl.gov/olcf-resources/compute-systems/sum...


dekhn 2489 days ago

it's pretty simple. the physical data acquisition devices (ATLAS is an example) collect data at rates in the 100s of terabytes/sec (https://home.cern/science/computing/processing-what-record)

No storage system can store that data (and most of it is not useful) so they have a series of hardware triggers and buffers that reduce the data down to roughly what modern (general purpose) hardware is capable of handling. They tune the thresholds to match what consumer hardware is capable of.
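
For scale, a rough sketch of the reduction implied by the two figures quoted at the top of the thread (600TB/s off the detectors, 25GB/s after filtering). Decimal units are an assumption, the variable names are illustrative, and both rates are the thread's numbers rather than anything checked against CERN documentation.

```python
# Decimal (SI) units are an assumption.
TB = 10**12
GB = 10**9

raw_rate = 600 * TB     # pre-trigger rate quoted at the top of the thread
stored_rate = 25 * GB   # post-filter rate quoted at the top of the thread

# How many bytes are discarded for each byte that is kept.
reduction_factor = raw_rate // stored_rate
# Fraction of the raw stream that survives the triggers.
kept_fraction = stored_rate / raw_rate

print(f"reduction factor: {reduction_factor}x")
print(f"kept fraction:    {kept_fraction:.6%}")
```

On those numbers the triggers keep about one byte in every 24,000.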

With regard to supercomputer filesystems: nobody wants to use GPFS. CERN's EOS sustained (theoretical) 3.3TB/sec in Apr 2015, so it's not like they're uncompetitive with the largest supercomputer...


gnufx 2489 days ago

I know how data collection works, but it sounded as if 25GB/s was regarded as high compared with filesystems you can buy.

Obviously some people do want GPFS, if they can afford it, but Cori uses Lustre. I don't mean to claim that either is ideal for streaming high rate event data, of course.


adev_ 2489 days ago

> Obviously some people do want GPFS, if they can afford it, but Cori uses Lustre

The data model at CERN does not match that of a supercomputer. CERN data are not processed locally but distributed to ~100 institutes participating in the experiment.

Moreover, in my personal opinion, GPFS is crap. It's an old relic from the 90s that has so many quirks and design problems that it would deserve an entire conference of its own. Plus it's proprietary and expensive.

The only reason GPFS is still alive is that, for a long time, the only alternative was Lustre, and Lustre is even worse.


dekhn 2488 days ago

lustre is crap.

At every single supercomputer meeting I've been to (I've been part of the community for years, they often invite me to their meetings to give an industry perspective), people are just continuously complaining about the filesystems, and it's GPFS and Lustre at the top of the list.


derefr 2485 days ago

What filesystems would they like to be using?


packetslave 2489 days ago

it's adorable how you call it "braggy PR" when almost every major technology company these days (FB, Google, Amazon, Uber, Pinterest, etc., pretty much everybody except Apple) has an engineering blog where they share possibly-interesting work they've done.


mkasu 2489 days ago

Apple also has a blog where they discuss (some of) their machine learning results[1].

1: https://machinelearning.apple.com/
