

by snidane 1516 days ago

Isn't the typical big data sql task IO bound?

Vectorization only works when you have a table stored in an optimized columnar format and run a function over a column or combine multiple columns.

The moment you throw in group bys or windows, the data turns into rows that you read from a hash table or after a sort, at which point you lose all opportunities for vectorization.

Since group bys break vectorization, the other use case is map or reduce (sums, counts) operations over the entire table. In the absence of filters you can precompute these for each column.

Plain map or sum-like operations in the presence of a filter are the only real use case for vectorization in OLAP, if I'm not missing anything.

In that case you need to implement the vectorized operation to work together with a mask, so that you don't include the filtered-out values, and to work over compressed data; otherwise you're wasting time bringing the data from disk closer to the CPU.
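To make the mask idea concrete, here is a minimal sketch in NumPy. The function name `filtered_sum` and the `price`/`qty` columns are made up for illustration, and it works on plain uncompressed arrays, so it shows only the masking part and not the compressed-data part:

```python
import numpy as np

def filtered_sum(price: np.ndarray, qty: np.ndarray, min_qty: int) -> int:
    """Roughly SELECT SUM(price) WHERE qty >= min_qty over two columns.

    Hypothetical helper: column names and signature are assumptions.
    """
    # Vectorized comparison over the whole column yields a boolean mask.
    mask = qty >= min_qty
    # Zero out the filtered rows instead of branching per row, then reduce.
    return int(np.where(mask, price, 0).sum())

price = np.array([10, 20, 30, 40])
qty = np.array([1, 5, 2, 7])
total = filtered_sum(price, qty, 3)  # rows with qty 5 and 7 pass the filter
```

An alternative is `price[mask].sum()`, which compacts the surviving rows first; which one is faster likely depends on the selectivity of the filter.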

Most general big data sql tasks will not gain significant improvement from vectorization, unless they specialize in map-after-filter operations with no group bys, such as perhaps log processing.

Vectorization and other kinds of hardware acceleration are highly useful for small array data that fits into memory, such as geo data, APL, numpy, tensor processing on TPUs and similar stuff.

2 comments

jandrewrogers 1516 days ago

In a well-designed system, you will typically be limited by effective bandwidth, often memory bandwidth or efficient use thereof, which is an area where vectorization can help. Modern servers have tremendous storage bandwidth if you have an I/O scheduler capable of using it. Some newer database engines explicitly reject the assumption that storage throughput is precious as a design constraint, since it has become much less true over time due to advances in hardware.

Use of page layouts highly optimized for vectorized evaluation is common now even if the implementation isn't vectorized. You lose nothing on modern hardware (they are good layouts regardless) and it allows you to easily do vector optimizations later. As a semantic distinction, columnar and vector layouts are organized differently and optimize for somewhat different things even though they have a superficially similar appearance. Classic DSM-style columnar is largely obsolete.

Vectorization, first and foremost, is about optimizing selection operations in a database, but it can provide assists in other areas like joins, sorts, and aggregates. Most queries are composed from these primitives, so many parts of the query plan may benefit. As a heuristic, operations that GPU databases excel at are the same kinds of operations that benefit from vectorization.

Obviously you can't just throw vectorization at an arbitrary database and expect major benefits; the database needs to be intentionally designed for it.

alterneesh 1515 days ago

I can't seem to understand why vectorization wouldn't help, say if you read after a sort. Irrespective of whether it fits in memory, or you perform some sort of an external sort, any operation that you want to perform on top of that sorted vector, be it an aggregation to reduce it, or an arithmetic operation with another column, you could still leverage vectorization and would end up using fewer CPU cycles, no?
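As a rough illustration of that point, here is a sketch of a grouped sum over sorted data in NumPy. The function name `sorted_group_sum` is hypothetical, and this is only one way to do it, not how any particular engine implements group by:

```python
import numpy as np

def sorted_group_sum(keys: np.ndarray, values: np.ndarray):
    """Roughly SELECT key, SUM(value) GROUP BY key, done via a sort.

    Hypothetical helper; assumes keys and values have the same length.
    """
    if keys.size == 0:
        # reduceat needs at least one valid start index, so handle empty input.
        return keys.copy(), values.copy()
    # Sort both columns by key so each group becomes one contiguous run.
    order = np.argsort(keys, kind="stable")
    k, v = keys[order], values[order]
    # A run starts at index 0 and wherever the key differs from its predecessor.
    starts = np.flatnonzero(np.r_[True, k[1:] != k[:-1]])
    # Sum each contiguous run in bulk rather than row by row.
    return k[starts], np.add.reduceat(v, starts)

group_keys, group_sums = sorted_group_sum(
    np.array([2, 1, 2, 3, 1]), np.array([10, 1, 20, 5, 2])
)
```

After the sort, the boundary detection and the per-run reduction are both whole-array operations, which is the kind of work that can use SIMD under the hood.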