by haltingproblem 2179 days ago

We need a better decomposition of scalability. Do you mean scalability in data, scalability in compute, or both?

Definitions:

Scalability in Data (SD): doing fast (cheap per-row) computations on a very large number of rows

Scalability in Compute (SC): doing slow (expensive per-row) computations on a large number of rows

For SD, I have found that a 16-32 core machine is more than enough for tens of billions of rows, as long as your disk access is relatively fast (SSD rather than HDD). If you vectorize your compute operations, you can typically get to within 10x of the compute time of hand-written assembly. On a 32-core machine, that lets you tap into tens of effective gigaflops; these machines are rated at hundreds of gigaflops. For example, I had to compute a metric on a 100-million-row table (dataframe), which effectively required on the order of 10-20 teraflops of total compute. Single-core pandas was showing us 2 months of compute time. Using vectorization and mp.Pool, I was able to reduce that to a few hours. The big win here was vectorization, not mp.Pool.
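A minimal sketch of that vectorize-first, parallelize-second pattern. The actual metric is not given above, so the columns `x`, `y`, `w` and the functions `metric_chunk`, `split_rows` and `parallel_metric` are all hypothetical stand-ins; only the shape of the approach is the point.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def metric_chunk(chunk: pd.DataFrame) -> pd.Series:
    """Vectorized kernel: whole-column math, no Python-level row loop."""
    # Hypothetical metric (an assumption for illustration):
    # Euclidean norm of (x, y), scaled by a weight column w.
    return np.sqrt(chunk["x"] ** 2 + chunk["y"] ** 2) * chunk["w"]


def split_rows(df: pd.DataFrame, n_chunks: int) -> list[pd.DataFrame]:
    """Split a frame into contiguous row blocks of near-equal size."""
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    return [df.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]


def parallel_metric(df: pd.DataFrame, workers: int) -> pd.Series:
    """Run the vectorized kernel on each row block in its own process."""
    with mp.Pool(processes=workers) as pool:
        parts = pool.map(metric_chunk, split_rows(df, workers))
    # Each part keeps its original index, so concat restores row order.
    return pd.concat(parts)
```

Call `parallel_metric(df, workers=...)` from under an `if __name__ == "__main__":` guard, since platforms that spawn worker processes re-import the main module. Note that the pool can give at most a speedup equal to the number of workers, and each chunk is pickled on its way to and from the workers, so the kernel has to be vectorized first for the rest to matter.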

For compute scalability (SC), e.g. running multiple machine learning models that cannot practically be confined to a single machine, nothing beats Dask. Dask is extremely mature, has seen a large number of real-world use cases, and people have run it for hundreds of hours of uptime.

Vectorization is an oft-overlooked source of speedup that can easily give you 10-100x in pandas. Understanding what vectorization can and cannot do is a highly productive exercise.
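To make "vectorization" concrete, here is a small sketch contrasting a row-wise `DataFrame.apply` with the equivalent whole-column version. The pricing rule, the column names `price` and `qty`, and both function names are made up for illustration.

```python
import numpy as np
import pandas as pd


def order_total_rowwise(df: pd.DataFrame) -> pd.Series:
    """Slow path: Python calls one_row once per row."""

    def one_row(row):
        base = row["price"] * row["qty"]
        # Hypothetical rule: 10% discount on orders of 10 or more units.
        return base * 0.9 if row["qty"] >= 10 else base

    return df.apply(one_row, axis=1)


def order_total_vectorized(df: pd.DataFrame) -> pd.Series:
    """Fast path: column arithmetic; np.where replaces the per-row branch."""
    base = df["price"] * df["qty"]
    discounted = np.where(df["qty"] >= 10, base * 0.9, base)
    return pd.Series(discounted, index=df.index)
```

Both functions return the same values; the second pushes the loop down into compiled NumPy code instead of the Python interpreter. What tends not to vectorize directly is logic where each row depends on the result computed for the previous row, which is worth checking for before assuming a 10-100x win is available.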