| Isn't the typical big data SQL task IO bound? Vectorization only works when you have a table stored in an optimized columnar format and run a function over a column or combine multiple columns. The moment you throw in group bys or windows the data turns into rows that you read from a hash table or after a sort - at which point you lose all opportunities for vectorization. Since group bys break vectorization, the other use case is map or reduce (sums, counts) operations over the entire table. In the absence of filters you can precompute these for each column. Plain map or sum-like operations in the presence of a filter are the only real use case for vectorization in OLAP, if I'm not missing anything. In that case you need to implement the vectorized operation to work together with a mask, so that you don't include the filtered-out values, and over compressed data, otherwise you're wasting time on bringing the data from disk closer to the CPU. Most general big data SQL tasks will not gain significant improvement from vectorization, unless they specialize in map-after-filter operations with no group bys, such as perhaps log processing. Vectorization and other kinds of hardware acceleration are highly useful for small array data that fits into memory, such as geo data, APL, numpy, tensors on TPU processing and similar stuff. |
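To make the "sum in the presence of a filter, using a mask" case from the quote concrete, here is a minimal NumPy sketch. The column names, the threshold predicate, and the `masked_sum` helper are all made up for illustration, and it works on plain uncompressed arrays, so it says nothing about operating over compressed data.

```python
import numpy as np

def masked_sum(values: np.ndarray, keys: np.ndarray, threshold: int) -> int:
    """Sum `values` over the rows where `keys > threshold` (hypothetical predicate)."""
    # Evaluate the predicate over the whole column at once, producing a boolean mask.
    mask = keys > threshold
    # Replace filtered-out rows with 0 instead of branching per row, then reduce.
    return int(np.where(mask, values, 0).sum())

# Two toy columns of the same table (assumed data).
values = np.array([10, 20, 30, 40, 50], dtype=np.int64)
keys = np.array([1, 5, 2, 7, 9], dtype=np.int64)

total = masked_sum(values, keys, 4)  # keeps rows 1, 3 and 4
```

The point of the mask is that the reduction runs over the full column with no per-row branch, which is the shape of loop that SIMD hardware handles well.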
Page layouts that are highly optimized for vectorized evaluation are common now, even when the implementation isn't vectorized. You lose nothing on modern hardware (they are good layouts regardless) and they let you add vector optimizations easily later. As a semantic distinction, columnar and vector layouts are organized differently and optimize for somewhat different things, even though they look superficially similar. Classic DSM-style columnar is largely obsolete.
Vectorization, first and foremost, is about optimizing selection operations in a database, but it can also assist in other areas like joins, sorts, and aggregates. Most queries are composed from these primitives, so many parts of the query plan may benefit. As a heuristic, operations that GPU databases excel at are the same kinds of operations that benefit from vectorization.
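As a rough illustration of selection feeding an aggregate, here is a NumPy sketch of a filter that produces a selection vector, followed by a grouped sum over the selected rows. Everything in it is an assumption for the sake of the example: the column names, the `flags == 1` predicate, the `filtered_group_sum` helper, and especially the dense integer group ids, which let `np.bincount` stand in for the hash table a real engine would use.

```python
import numpy as np

def filtered_group_sum(group_ids, amounts, flags, n_groups):
    """Grouped sum of `amounts` over rows where `flags == 1` (hypothetical predicate)."""
    # Selection: evaluate the predicate column-at-a-time and keep the
    # positions of qualifying rows (a "selection vector").
    sel = np.flatnonzero(flags == 1)
    # Aggregate: gather the selected rows and accumulate per group in one call.
    # Assumes group ids are dense integers in [0, n_groups).
    return np.bincount(group_ids[sel], weights=amounts[sel], minlength=n_groups)

# Toy columns (assumed data).
group_ids = np.array([0, 1, 0, 2, 1, 2])
amounts = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
flags = np.array([1, 0, 1, 1, 1, 0])

sums = filtered_group_sum(group_ids, amounts, flags, 3)
```

Both steps operate on whole batches of a column rather than row by row, so the group by does not force the plan back into row-at-a-time processing.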
Obviously you can't just throw vectorization at an arbitrary database and expect major benefits; the database needs to be intentionally designed for it.