by jamii 4099 days ago
> all this "avoid crossing to the kernel" optimization is becoming a drop in the ocean

Here is an example of a small single-threaded program beating a number of distributed graph frameworks running on 128 cores, on a 128-billion-edge dataset.

http://www.frankmcsherry.org/graph/scalability/cost/2015/02/...

Performance matters because it enables simplicity. If your language forces you to pull in multiple machines to solve your problem, then it's turned a simple program into a distributed system, and life gets complicated fast. Throwing more cores at a program without understanding why it's slow will just get you into trouble.

Multithreaded programs and distributed programs should be a scary last resort after making absolutely sure you can't get away with the simple solution.

1 comment

Yes, I saw this and was a little disillusioned at first, but on a closer look this is not big data: their entire dataset fits in RAM. When your dataset can't fit in RAM, that is where the last resort comes into play. Sadly, I agree that most companies don't know when data is really big data. Most of the time it's just medium data. And I agree about the overhead costs.

> their entire dataset fits in RAM

128 billion edges. 1 TB of data just to list edges as pairs of integers. 154 GB after cleverly encoding edges as variable-length offsets in a Hilbert curve.
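For anyone who wants to sanity-check those sizes, here is a rough back-of-the-envelope sketch. It assumes each vertex id is a 32-bit (4-byte) integer, which the comment above does not actually state, and it uses decimal units (1 TB = 10^12 bytes). The function names are made up for illustration.

```python
# Back-of-the-envelope check of the edge-list sizes quoted above.
# Assumption (not stated in the thread): vertex ids are 32-bit integers.
EDGES = 128 * 10**9          # 128 billion edges
BYTES_PER_ID = 4             # assumed 32-bit vertex id
COMPRESSED_BYTES = 154 * 10**9  # 154 GB, as quoted for the Hilbert-curve encoding


def raw_edge_list_bytes(edges, bytes_per_id=BYTES_PER_ID):
    """Size of a plain edge list: two vertex ids per edge."""
    return edges * 2 * bytes_per_id


def bits_per_edge(total_bytes, edges):
    """Average number of bits spent on each edge."""
    return total_bytes * 8 / edges


raw = raw_edge_list_bytes(EDGES)
print(f"raw edge list: {raw / 10**12:.3f} TB")
print(f"compressed:    {bits_per_edge(COMPRESSED_BYTES, EDGES):.3f} bits per edge")
print(f"ratio:         {raw / COMPRESSED_BYTES:.2f}x smaller")
```

Under that assumption the plain list comes to about 1 TB, and the 154 GB encoding works out to under 10 bits per edge on average, versus 64 bits for a raw pair.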

Do you have a bigger dataset?

Oh, I was referring to the original posts. Will take a look. Thanks!