| > Learning more about a tool that can filter and aggregate two billion rows on a laptop in two seconds If someone has a code example to this effect, I'd be grateful. I was once on the receiving end of a salesy pitch from a cloud advocate claiming that BigQuery (et al.) can "process a billion rows a second". I tried to create an SQLite example with a billion rows to show that this isn't impressive, but I gave up after hitting some obstacles generating the data. It would be nice to have an example like this to show developers and engineers who have become accustomed to today's extreme levels of CPU abuse that modern laptops really are supercomputers. It should be obvious that a laptop can rival a data centre at 90% of ordinary tasks. That it isn't has, in my view, a lot to do with the state of OS/browser/app/etc. design and performance.
Supercomputers, alas, dedicated to drawing pixels by way of a dozen layers of indirection. |
The speed comes from the raw speed of Arrow, but also from a 'trick'. If you apply a filter, it is pushed down to the raw Parquet files, so some of them don't need to be read at all thanks to the hive-style organisation (files sit in directories named after the partition values, e.g. `year=2020/`).
Another trick is that Parquet files store some summary statistics (such as the min and max of each column) in their metadata. This means, for example, that if you want to find the max of a column, only the metadata needs to be read, rather than the data itself.
I'm a Python user myself, but the code would be comparable on the Python side