HN Mirror

by mjburgess 1261 days ago

> Learning more about a tool that can filter and aggregate two billion rows on a laptop in two seconds

If someone has a code example to this effect, I'd be grateful.

I was once on the receiving end of a salesy pitch from a cloud advocate claiming that BigQuery (et al.) can "process a billion rows a second".

I tried to create an SQLite example with a billion rows to show that this isn't impressive, but I gave up after some obstacles to generating the data.
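One way around the data-generation obstacle, as a minimal sketch rather than a benchmark: let SQLite generate the rows itself with a recursive CTE, so no input file is needed. The table name `t`, the columns `grp` and `val`, and the helper names `build_table` and `filter_and_aggregate` are all made up for illustration, and the row count is kept small here. Scaling it to a billion rows is untested and would need an on-disk database and some patience.

```python
import sqlite3
import time


def build_table(conn: sqlite3.Connection, n_rows: int) -> None:
    """Fill a hypothetical table `t` with n_rows synthetic rows, generated inside SQLite."""
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, grp INTEGER, val INTEGER)")
    conn.execute(
        """
        WITH RECURSIVE seq(i) AS (
            SELECT 1
            UNION ALL
            SELECT i + 1 FROM seq WHERE i < ?
        )
        INSERT INTO t (id, grp, val)
        SELECT i, i % 10, (i * 7919) % 1000 FROM seq
        """,
        (n_rows,),
    )
    conn.commit()


def filter_and_aggregate(conn: sqlite3.Connection) -> list:
    """Filter on val, then count and sum per group."""
    return conn.execute(
        "SELECT grp, COUNT(*), SUM(val) FROM t "
        "WHERE val >= 500 GROUP BY grp ORDER BY grp"
    ).fetchall()


if __name__ == "__main__":
    n = 1_000_000  # assumption: small enough to run quickly in memory
    conn = sqlite3.connect(":memory:")
    start = time.perf_counter()
    build_table(conn, n)
    built = time.perf_counter()
    rows = filter_and_aggregate(conn)
    done = time.perf_counter()
    print(f"generated {n} rows in {built - start:.2f}s")
    print(f"filter + aggregate took {done - built:.2f}s, {len(rows)} groups")
```

Timings will vary by machine, so none are quoted here; the point is only that the data can be produced without leaving SQLite.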

It would be nice to have an example like this to show developers (and engineers) who have become accustomed to today's extreme levels of CPU abuse that modern laptops really are supercomputers.

It should be obvious that a laptop can rival a data centre at 90% of ordinary tasks. That it isn't has, in my view, a lot to do with the state of OS/browser/app/etc. design and performance. Supercomputers, alas, dedicated to drawing pixels by way of a dozen layers of indirection.

6 comments

RobinL 1261 days ago

An example using R code is here: https://arrow.apache.org/docs/r/articles/dataset.html

The speed comes from the raw speed of Arrow, but also from a 'trick'. If you apply a filter, it is pushed down to the raw Parquet files, so some files don't need to be read at all thanks to the hive-style organisation.

Another trick is that Parquet files store some summary statistics in their metadata. This means, for example, that if you want to find the max of a column, only the metadata needs to be read, rather than the data itself.

I'm a Python user myself, but the code would be comparable on the Python side.
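To make the two tricks concrete, here is a toy model in plain Python. It is not the Arrow or Parquet API: `Chunk`, `filtered_sum`, and `column_max` are hypothetical names, and each `Chunk` merely stands in for one Parquet file in a hive-style directory such as `year=2019/`. The `reads` counter shows which chunks had their data scanned.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    """Toy stand-in for one Parquet file under a hive-style partition directory."""
    partition: Dict[str, int]   # e.g. {"year": 2019}, known from the path alone
    values: List[int]           # the column data; expensive to read in real life
    reads: int = 0              # how many times the data was actually scanned
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self) -> None:
        # Summary statistics, computed once when the "file" is written.
        self.min_value = min(self.values)
        self.max_value = max(self.values)

    def scan(self) -> List[int]:
        self.reads += 1
        return self.values


def filtered_sum(chunks: List[Chunk], year: int) -> int:
    """Sum values for one year, skipping chunks whose partition rules them out."""
    total = 0
    for chunk in chunks:
        if chunk.partition["year"] != year:
            continue  # pruned: the path already tells us nothing here matches
        total += sum(chunk.scan())
    return total


def column_max(chunks: List[Chunk]) -> int:
    """Answer max() from per-chunk statistics without scanning any data."""
    return max(chunk.max_value for chunk in chunks)


if __name__ == "__main__":
    chunks = [
        Chunk({"year": 2018}, [3, 9, 4]),
        Chunk({"year": 2019}, [7, 1, 8]),
        Chunk({"year": 2020}, [2, 6, 5]),
    ]
    print(filtered_sum(chunks, 2019))   # scans only the 2019 chunk
    print(column_max(chunks))           # scans nothing
    print([c.reads for c in chunks])
```

Whether a real engine answers an aggregate purely from statistics depends on the engine and the query, so treat this as an illustration of the idea only.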

thinkharderdev 1261 days ago

You can see some of the benchmarks in DataFusion (part of the Arrow project and built with Arrow as the underlying in-memory format) https://github.com/apache/arrow-datafusion/blob/master/bench...

Disclaimer: I'm a committer on the Arrow project and contributor to DataFusion.

StreamBright 1261 days ago

You can try the examples or DataFusion with Flight. With that setup in Rust, I have been able to process in milliseconds data that usually takes tens of seconds with a distributed query engine. I think Rust combined with Arrow, Flight, and Parquet can be a game changer for analytics after a decade of Java with Hadoop & co.

cmollis 1261 days ago

Completely agree with this. Rust and Arrow will be part of the next generation of data engineering tools. Spark is great and I use it every day, but it's big and cumbersome to use. There are use cases today that are being addressed by DataFusion, DuckDB, and (to a certain extent) pandas, and those will continue to evolve. Hopefully Ballista can mature to the point where it's a real Spark alternative for distributed computations. Spark isn't standing still, of course, and we're already seeing a lot of different drop-in C++ SQL engines, but moving entirely away from the JVM would be a watershed, IMO.

tveita 1261 days ago

ClickHouse and DuckDB are the databases I would look at; they support this use case pretty much "out of the box".

E.g. https://benchmark.clickhouse.com has some query times for a 100 million row dataset.

spaniard89277 1261 days ago

DuckDB is so simple to work with. It's only worth looking elsewhere with really big data, or where you really need a client-server setup.

I hope it receives more love.

IanCal 1261 days ago

DuckDB is outrageously useful. It's great on its own, but it also slots in perfectly by reading and returning Arrow data frames, meaning you can seamlessly swap between tools when SQL is better for some parts and other tools are better for others. It's also very fast. I was able to throw away designs for multi-machine setups, as DuckDB on its own was fast enough that I didn't need to worry about anything else.

intelVISA 1261 days ago

Having used all three, I'd go with ClickHouse/DuckDB over Arrow every time.

sanderjd 1261 days ago

Oh interesting - why?

intelVISA 1261 days ago

They're easier to use and faster is the tl;dr.

nlittlepoole 1261 days ago

100% agree.