by danielscrubs
2502 days ago
I am a data scientist and I care. The time when you could just do proofs of concept or a PowerPoint presentation is long behind us. So now we have to start taking things into production, which means we get the exact same problems as SE has always had. Iff Rust helps us take it into production we will use it. But it's a lot of ground to cover to catch up with Python's libraries, so I'm not holding my breath. That said, Python's performance is slow even when offloading to NumPy.
The bottlenecks, in order, are: inter-node comms, GPU/compute, on-disk shuffling, serialisation, pipeline starvation, and finally the runtime.
Why worry about optimising the very top of the perf pyramid, which will make the least difference? Why worry that you spent 1ms pushing data to NumPy when that data just spent 2500ms on the wire? And why are you even pushing from the Python runtime to NumPy instead of using Arrow?
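
To put rough numbers on that, here is a back-of-the-envelope sketch using only the two figures quoted above (1ms and 2500ms). The helper name `max_speedup` is made up for illustration, and treating the job as just these two stages is a simplifying assumption, not a measurement of any real pipeline.

```python
# Figures taken from the comment above; everything else is illustrative.
WIRE_MS = 2500.0        # time the data spent on the wire
NUMPY_PUSH_MS = 1.0     # time spent pushing the data into NumPy


def max_speedup(stage_ms: float, total_ms: float) -> float:
    """Best-case end-to-end speedup if one stage became free (Amdahl-style bound)."""
    return total_ms / (total_ms - stage_ms)


total_ms = WIRE_MS + NUMPY_PUSH_MS

# Fraction of the total time spent in the NumPy push.
push_share = NUMPY_PUSH_MS / total_ms

# Upper bounds: what you gain if each stage cost nothing at all.
push_speedup = max_speedup(NUMPY_PUSH_MS, total_ms)
wire_speedup = max_speedup(WIRE_MS, total_ms)

print(f"NumPy push share of total: {push_share:.4%}")
print(f"Best case from a free NumPy push: {push_speedup:.4f}x")
print(f"Best case from a free wire: {wire_speedup:.0f}x")
```

Under these assumptions the push is about 0.04% of the total, so even making it free buys almost nothing, while anything that trims wire time dominates.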