Meanwhile, Google was sorting petabytes in under a minute on their clusters 6+ years ago. We've still got a long way to go in OSS land to compete with the big boys.
Do you think you could ask someone and find out the cluster sizes they used for those sorts? They mention "With the largest cluster at Google under our control", but it would be more interesting to have an idea of actual numbers, even if just an order of magnitude.
- https://cloud.google.com/blog/big-data/2016/02/history-of-ma...
When I asked why BigQuery doesn't do these sorts, the answer came straight from the post: "Nobody really wants a huge globally-sorted output. We haven’t found a single use case for the problem as stated."
These accomplishments are awesome nevertheless!
Disclaimer: I'm Felipe Hoffa, and I work for Google (http://twitter.com/felipehoffa).