by NumberCruncher 1659 days ago

> Most of that time is spent executing poorly crafted Oracle SQL queries drawing millions of rows into the analytical run-time for sorting, aggregation, discarding, merging, and splitting tasks.

I always try to follow the rule of thumb "if it can be done in the analytical DB, it should be done in the analytical DB". In my experience Oracle is pretty well suited to all of the "sorting, aggregation, discarding, merging, and splitting tasks". With proper indexing/partitioning, processing 81M records shouldn't take 17 hours. Pulling all the data into Python and then fighting the lack of (out-of-the-box) multi-threaded data processing capabilities seems to be more a part of the problem than of the solution.
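To make the rule of thumb concrete, here is a minimal sketch that contrasts pulling every row into Python with letting the database do the aggregation. It uses the standard-library `sqlite3` module purely as a stand-in for Oracle, and the `sales` table, its columns, and the function names are all made up for illustration.

```python
import sqlite3


def build_demo_db(n_rows=10_000):
    """Create an in-memory table with hypothetical demo data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    regions = ["north", "south", "east", "west"]
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        ((regions[i % 4], i % 100) for i in range(n_rows)),
    )
    conn.commit()
    return conn


def totals_client_side(conn):
    """Anti-pattern: fetch every row, then aggregate in Python."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0) + amount
    return totals


def totals_in_database(conn):
    """Push the aggregation down: only one row per region comes back."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return dict(conn.execute(query).fetchall())


if __name__ == "__main__":
    conn = build_demo_db()
    # Both approaches agree; the second moves far less data to the client.
    print(totals_client_side(conn) == totals_in_database(conn))
```

Both functions return the same totals. The difference is that the first one ships every row to the client, while the second ships one row per group, which is the part that matters once the table has tens of millions of rows.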

In my current job, if I have to do some analytical heavy lifting, I just write the data to AWS S3 (Parquet) and read the query results back through AWS Athena (Presto) into Python.
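Roughly, the read-back half of that workflow could look like the sketch below. It is not my actual code: it assumes the Parquet files are already in S3 and registered as an Athena table, and that `client` is something like `boto3.client("athena")`. The function name, the polling interval, and the choice to read only the first page of results are simplifications for illustration.

```python
import time


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Run a query on Athena and return the first page of rows as dicts.

    `client` is assumed to be a boto3 Athena client, for example
    boto3.client("athena"); it is passed in so the function stays testable.
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a final state.
    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query ended in state {state}")
        time.sleep(poll_seconds)

    # Simplification: only the first page is read (no NextToken handling).
    page = client.get_query_results(QueryExecutionId=query_id)
    rows = page["ResultSet"]["Rows"]
    if not rows:
        return []

    # For SELECT queries the first row holds the column names.
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]
```

All values come back as strings, so any numeric columns still need casting on the Python side. The point is that only the aggregated result crosses the wire, not the raw records.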