by Xk 5449 days ago
Alright; I'm confused.

First they say that they "generate around 1 petabyte of data per second"

Then they say "ATLAS produces up to 320M bytes per second, followed by CMS with 220M Bps. The data from ALICE amounts to 100M Bps and LHCb produces 50M Bps." But that sums to only 690M Bps, which is definitely not 1 petabyte per second. (That is, assuming that 1M Bps means 1 million bytes per second, or just under 1 megabyte per second.)

And then, later on, they talk about a different mode in which "more data is produced by the four experiments, about 1.25G Bps in total." which is still not 1 petabyte per second.
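For scale, here is a quick sanity check of the arithmetic. It is only a sketch using the figures quoted above, and it assumes "M" means 10**6, "G" means 10**9, and a petabyte is 10**15 bytes; the variable names are mine, not the article's.

```python
# Recorded rates quoted in the article, in bytes per second.
# Assumption: "M" means 10**6 and "G" means 10**9.
recorded_bps = {
    "ATLAS": 320 * 10**6,
    "CMS": 220 * 10**6,
    "ALICE": 100 * 10**6,
    "LHCb": 50 * 10**6,
}
generated_bps = 10**15          # "around 1 petabyte of data per second"
other_mode_bps = 1250 * 10**6   # "about 1.25G Bps in total"

total_bps = sum(recorded_bps.values())
ratio_normal = generated_bps / total_bps
ratio_other = generated_bps / other_mode_bps

print(f"sum of the four experiments: {total_bps // 10**6}M Bps")
print(f"1 PB/s is about {ratio_normal:,.0f}x that sum")
print(f"1 PB/s is about {ratio_other:,.0f}x the 1.25G Bps figure")
```

Either way the quoted figures are about six orders of magnitude apart, so this is not a units or rounding slip.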

What's going on?

3 comments

I used to be the sysadmin for a high energy physics lab as we prepared for the ATLAS experiment to come online. (It was a long wait, following helium explosions and such.) The reason you see so many different numbers is that they cannot possibly record the full flow of information. CERN has a very large buffer that the collision sensor data is fed into initially, and that buffer is analyzed in real time to determine which chunks of data are likely to contain significant information. Those chunks are kept, and the rest are discarded. This bothered a lot of people, since they are probably throwing away interesting scientific data, but they are limited by current storage technology.

Further preliminary analysis is performed on the retained data, broadly categorizing the energy and other characteristics of the collision. That allows individual physics groups around the world to download only the data that is likely to pertain to their specific research, e.g. the Higgs boson, multiple dimensions, etc.

There was some talk of transferring data via BitTorrent or perhaps a custom protocol involving fountain codes. That never got off the ground. Instead, the Russians were working on a custom peer-to-peer system with a monolithic centralized set of indices, a system which is hopefully working better than it used to.

P.S. - Here's a hummingbird-speed video of building our prototype fileserver node for local physics analysis of ATLAS data [before I learned about electric screwdrivers]: http://www.youtube.com/watch?v=8y6MpPNqxmw

Millions of ADCs running constantly, doing math locally to determine if the signal is over a certain threshold, then higher level triggers, etc... it filters the data a lot.
I assume the smaller data rate is for after the data is filtered.
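To make the idea concrete, here is a toy sketch of that kind of two-stage filtering. It is not how the real ATLAS trigger works; the function names, the threshold of 200 and the "at least 3 channels" rule are all made up for illustration.

```python
def level1_pass(samples, threshold):
    """Cheap local check: does any ADC sample exceed the threshold?"""
    return max(samples) > threshold


def high_level_pass(samples, threshold, min_hits):
    """Stricter check: are at least min_hits samples over the threshold?"""
    hits = sum(1 for s in samples if s > threshold)
    return hits >= min_hits


def run_trigger(events, threshold=200, min_hits=3):
    """Return only the events that pass both stages; the rest are dropped."""
    kept = []
    for samples in events:
        if not level1_pass(samples, threshold):
            continue  # discarded before the more expensive check runs
        if high_level_pass(samples, threshold, min_hits):
            kept.append(samples)
    return kept


# Hypothetical ADC readings, one list per event.
events = [
    [10, 20, 30],         # nothing over threshold
    [250, 10, 10],        # one channel fires, not enough for stage two
    [250, 260, 270],      # three channels fire
    [300, 300, 5, 400],   # three channels fire
]
kept = run_trigger(events)
print(f"kept {len(kept)} of {len(events)} events")
```

The point is just that each stage throws most of the input away before anything is written out, which is why the recorded rate can be so much smaller than the raw rate.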