by shoo 1658 days ago

It's pretty hard to give helpful advice without clearly understanding the existing situation and what the actual bottlenecks are.

E.g. maybe 15 of the 17 hours of running time is because the database is doing sequential scans of some tables, as some essential indices haven't been defined. Or maybe the indices are defined but the queries need to be rewritten to take advantage of them. Or maybe the queries are blazing fast but the Python scripts are taking it upon themselves to perform outer joins in very slow pure Python code rather than just getting the database engine to do it. Or maybe all the queries are happening implicitly through the SQLAlchemy ORM and the entire analysis is a fractal mess of lazy n+1 select antipattern OO nonsense, and most of the running time is actually network latency between the machine where the Python sits and the machine where the database lives. Maybe 4 of the 17 hours of running time is due to compute-heavy hot loops in pure Python code that can be sped up 1000x if someone is willing to roll up their sleeves and spend a week rewriting them in C / C++ / Cython code that lets the CPU loose to crunch numbers in arrays without allocating or hashing or reference counting or waiting for the GIL. Or maybe the entire thing is relatively well engineered, given the physics of the computations involved, and 17 hours is pretty reasonable!
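To make two of those scenarios concrete, here is a rough sketch using Python's built-in sqlite3 module. Everything in it is made up for illustration: the `customers` and `orders` tables, the index name and the helper functions are assumptions, not anything from the system being discussed, and a real database server will report its query plans differently. It asks the engine whether a lookup scans the table or uses an index, and it compares an n+1 loop of per-row queries with a single query that lets the engine do the join.

```python
import sqlite3


def make_db(n_customers=200, orders_per_customer=5):
    """Build a small in-memory database with hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(i, f"customer-{i}") for i in range(n_customers)],
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
        [
            (c, float(k))
            for c in range(n_customers)
            for k in range(orders_per_customer)
        ],
    )
    return conn


def plan(conn, sql, params=()):
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row is the human-readable plan step.
    return " | ".join(row[3] for row in rows)


def totals_n_plus_one(conn):
    """n+1 antipattern: one query for the ids, then one query per id."""
    totals = {}
    n_queries = 1
    for (cid,) in conn.execute("SELECT id FROM customers").fetchall():
        row = conn.execute(
            "SELECT SUM(amount) FROM orders WHERE customer_id = ?", (cid,)
        ).fetchone()
        totals[cid] = row[0]
        n_queries += 1
    return totals, n_queries


def totals_single_query(conn):
    """Same result, but the database engine performs the outer join."""
    rows = conn.execute(
        "SELECT c.id, SUM(o.amount) FROM customers c "
        "LEFT JOIN orders o ON o.customer_id = c.id GROUP BY c.id"
    ).fetchall()
    return dict(rows), 1


conn = make_db()
lookup = "SELECT * FROM orders WHERE customer_id = ?"

plan_before = plan(conn, lookup, (1,))  # no index on customer_id yet
conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")
plan_after = plan(conn, lookup, (1,))   # index is now available

slow_totals, slow_queries = totals_n_plus_one(conn)
fast_totals, fast_queries = totals_single_query(conn)

print("before index:", plan_before)
print("after index: ", plan_after)
print("n+1 queries:", slow_queries, "single query:", fast_queries)
```

On an in-memory database the difference in wall time is small. When every one of those per-row queries has to cross a network to reach the database, the round trips can dominate the running time.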

If no one knows yet what the bottlenecks are, maybe spend a few days profiling things and comparing the measurements to theoretical estimates of the throughput or processing speed the hardware is capable of, assuming the system were making optimal use of it, and try to figure it out. It'd be a bit unfortunate to migrate everything to pyspark without understanding the bottlenecks and end up with something that runs slower than the original version.
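As a starting point for that kind of profiling, here is a minimal sketch using the standard library's cProfile and pstats modules. The `run_pipeline`, `load_rows` and `slow_sum_of_squares` functions are hypothetical stand-ins for whatever the real scripts do, and the `profile_call` helper is just one way to wrap the profiler.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(values):
    """Hypothetical hot loop written in pure Python."""
    total = 0
    for v in values:
        total += v * v
    return total


def load_rows(n):
    """Hypothetical stand-in for fetching rows from the database."""
    return list(range(n))


def run_pipeline(n=200_000):
    """Hypothetical analysis step: load the data, then crunch it."""
    rows = load_rows(n)
    return slow_sum_of_squares(rows)


def profile_call(func, *args, top=10):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args)
    finally:
        profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


result, report = profile_call(run_pipeline, 200_000)
print(report)
```

The report lists where the time went per function. For a job that spends most of its time waiting on the database or the network, the database's own query statistics and simple wall-clock timings around each stage are probably more informative than a CPU profile.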