For a columnar database, that's a contiguous chunk of memory. Assuming 32-bit q defaults to 32-bit ints, 1.1 billion integers across four machines means each 64-core KNL chip (4 threads per core, so 256 threads) is handling about 275M elements of an int array, or roughly 1.1M 32-bit int operations per thread. Now think again about whether that's amazing or not.
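To spell out that arithmetic, here is a rough sketch in Python. The variable names are made up for illustration, and it assumes the rows are split evenly across machines and hardware threads, which a real scheduler won't do exactly.

```python
# Per-thread workload, assuming an even split (an idealization).
rows = 1_100_000_000        # 1.1 billion integers in the column
machines = 4                # four KNL machines
cores_per_chip = 64
threads_per_core = 4

threads_per_chip = cores_per_chip * threads_per_core   # 256 hardware threads
rows_per_machine = rows / machines                     # 275 million
rows_per_thread = rows_per_machine / threads_per_chip  # about 1.07 million

print(f"rows per machine: {rows_per_machine:,.0f}")
print(f"rows per thread:  {rows_per_thread:,.0f}")
```

So each thread only has to walk about a million 4-byte ints, i.e. roughly 4 MB of column data.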
Is it? Those operations are trivial enough to be entirely bandwidth-limited.
total_amount is 4 bytes, passenger_count is 1 byte, and both are tightly packed in a column layout.
Streaming through that in 150 ms is almost within reach of a single ordinary chip with dual-channel DDR3 RAM.
Of course, not quite, and that's discounting the (small) sync overhead, but still: no need to shell out for 4 big servers, overpriced Phi chips and fancy wide-bus memory.
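A back-of-envelope version of that bandwidth argument, as a sketch only. It assumes every one of the 1.1 billion rows is read exactly once for both columns, and it picks DDR3-1600 as the "normal" memory (an assumption on my part; faster DDR3 narrows the gap). The figure is theoretical peak, which real workloads don't sustain.

```python
# Bandwidth needed to stream both columns in 150 ms vs. dual-channel DDR3 peak.
rows = 1_100_000_000
bytes_per_row = 4 + 1            # total_amount (4 bytes) + passenger_count (1 byte)
query_seconds = 0.150

total_bytes = rows * bytes_per_row                   # 5.5 GB of column data
required_gb_s = total_bytes / query_seconds / 1e9    # about 36.7 GB/s

# Assumed memory: dual-channel DDR3-1600, 64-bit (8-byte) bus per channel.
channels = 2
bus_bytes = 8
transfers_per_second = 1600e6
peak_gb_s = channels * bus_bytes * transfers_per_second / 1e9  # 25.6 GB/s

seconds_at_peak = total_bytes / (peak_gb_s * 1e9)    # about 0.215 s

print(f"required: {required_gb_s:.1f} GB/s")
print(f"DDR3-1600 dual-channel peak: {peak_gb_s:.1f} GB/s")
print(f"scan time at peak: {seconds_at_peak * 1000:.0f} ms")
```

Under those assumptions a single desktop-class chip lands in the low hundreds of milliseconds: not 150 ms, but the same order of magnitude.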