| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by amadsen 1054 days ago
	Let me paint you a picture: You have data coming off the detectors at a rate of a couple of hundreds of GB/s (after pre-filters implemented in FPGAs etc) that needs to be processed and filtered in real time with output written to disk and tape at about one GB/s. We're talking really CPU intensive processing here: Kalman filters, clustering algortihms, evaluating machine learning models. The facility is one of a kind and operating cost is in the billions per year so downtime is unthinkable, this stuff needs to work. Offline, you're running very, very, detailed (and CPU heavy) simulations. All in all, you have some hundreds of petabytes of data that are constantly being processed and reprocessed for hundreds of different purposes. These systems have many millions of lines of code between them, a lot of which needs to be shared between them. Offline analysis needs to re-run online algorithms and so on - you need a single stack for all systems. You have some hundreds of thousands of CPU cores to run all of this. Due to how academia works, beyond a couple of large core datacenters, resources are mostly spread out in hundreds of locations globally so that each participating university can have maintain a cluster on their premises for teaching/research/funding reasons. You need an efficient way to get the data that a program needs to where it is running, or preferably move the program to where the data is. This is not a tech company, there's no revenue so throwing money at the problem is not an option - it's all funded by tax payers so efficiency is paramount. What language do you reach for? Matlab? Lol. The closest analogy I can think of are some big trading systems and large scale ML inference and content serving at FAANG and the like. That's all usually java or C++. Oh, one more thing: There's very few professional developers dedicated to this. A lot of it is built and maintained by grad students and researchers in-between writing papers. They're smart people, and they can code, but they have neither time nor interest in learning a new language or framework every other year. They move around. A lot. It wouldn't work to have different tech stacks for different projects - you need to pick one solution, not just for one area but for the entire field. So people can spend less time learning and more time doing. There's no one available to migrate legacy code because some new cool language appeared or yesterday's cool library isn't maintained anymore. These projects run for decades. Whatever tech you pick you must be certain that it will still be around and supported 10, 20, 30 years later. That the code still runs and the data that you paid billions for can still be read.

1 comments

davidktr 1054 days ago

Thanks for the detailed answer, I really appreciate the insight. I work in research myself, so I'm familiar with the general constraints.

I was certainly unaware of the size of the data coming from the detectors. If speed is the argument that beats all others, I rest my case. From what I read on the root.cern website, root is a data analysis and simulation environment, so I was not aware of the aspect of prototyping for online use.

Because I spent a lot of time thinking how software development can work in an analysis heavy research environment, I still would like to comment on some of your points. To distribute binaries and source code, packages work very well for us. Especially if you want to reuse software components in unseen contexts, packages and a package registry makes the most sense.

The use case "re-run online algorithms in offline analysis" is a very familiar one. In my line of work, we do that daily: Switching between online and offline to test + deploy algorithms. Vastly smaller scale, of course. But to us, packages are the first part of the solution. All you do is change the data source. For offline, it's local data or a remote DB, for online, it's an interface such as a websocket.

The second part of the solution are unit- and integration tests. Other users will immediately see what you did (or didn't) test. Again, packages are the distribution system of choice. This has nothing to do with Matlab/Python/R/Julia. Rust has crates.io, JS has npmjs, even Java has something like Maven Central.

Regarding the funded-by-taxpayers argument: The issue I see here is that the cool ML, simulation, data analysis stuff which the CERN people do remains in the root ecosystem. If they used something like PyPI, I could use their stuff too. I have a lot of clustering problems, especially on time series. With a more or less proprietary system like root, I can't use any of CERN's implementations.

Regarding "researchers don't have time to learn new languages": If you look on github.com/root-project/root/issues and root-forum.cern.ch, there are suspiciously many questions regarding "how can I make use of root and Python libraries", and "X doesn't work in root, what do". Newbs have to learn root as well, and they seem to like using Python at least as an enhancement.