I've seen bitmaps mentioned a number of times lately. I must admit it is not something I am all that familiar with. Can someone explain to me why bitmaps are more valuable than standard column-oriented databases?
I haven't wrapped my head around how this helps speed up queries while data is being ingested.
They are useful for categorical variables. For example, is a record in the "Likes motorcycles" category? They are fast because (well, one reason) bitwise logical operations are very fast for CPUs to do.
Adtech is an example of a sector that benefits from this...they slice and dice datasets a lot to target ad campaigns and such. Being able to do that quickly is useful.
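To make the slicing and dicing concrete, here is a minimal sketch of the idea, not how any particular database implements it. It keeps one bitmap per category, where bit i says whether record i is in that category, and answers "in both" or "in either" with a single bitwise operation. The category names and records are made up for illustration, and a Python int stands in for the bitmap.

```python
# Hypothetical toy data: each record is a set of category labels.
records = [
    {"likes_motorcycles", "lives_in_texas"},  # record 0
    {"likes_motorcycles"},                    # record 1
    {"lives_in_texas"},                       # record 2
    {"likes_motorcycles", "lives_in_texas"},  # record 3
]

def build_bitmaps(records):
    """Return {category: int}, where bit i is 1 if record i is in the category."""
    bitmaps = {}
    for i, labels in enumerate(records):
        for label in labels:
            # Ingesting a record only means setting one bit per category.
            bitmaps[label] = bitmaps.get(label, 0) | (1 << i)
    return bitmaps

def matching_ids(bitmap):
    """List the record ids whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

bitmaps = build_bitmaps(records)
both = bitmaps["likes_motorcycles"] & bitmaps["lives_in_texas"]    # AND
either = bitmaps["likes_motorcycles"] | bitmaps["lives_in_texas"]  # OR

print(matching_ids(both))    # [0, 3]
print(matching_ids(either))  # [0, 1, 2, 3]
```

Adding a new record while queries are running only sets a few bits, which is part of why this layout is friendly to ingest.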
So are you saying that the data is stored in categories, which allows those types of lookups to run faster? Do you have specifics on how the design of a bitmap-based database achieves this? How does it maintain these relationships? Just through 0s and 1s?
I guess it's easy for me to visualize both row-based and column-based storage. I'm struggling with the bitmap concept.
I'm super hyped about this, I've been working on this for the last couple-few years and I'm optimistic about the return to being primarily an open-source thing.
Instead of storing values like "dog", "cat", or "mouse", it stores (in this example) three bits, one per animal. Some example values:
000 - the record could be associated with animals, but has no associations currently
001 - the record is associated with "mouse"
111 - the record is associated with "dog", "cat" and "mouse"
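As a rough sketch of that encoding (the bit assignment is my assumption, picked so that "mouse" is the rightmost bit as in the example above; the helper names are made up):

```python
# Assumed bit positions, chosen to match the example: dog=100, cat=010, mouse=001.
ANIMAL_BITS = {"dog": 0b100, "cat": 0b010, "mouse": 0b001}

def encode(animals):
    """Combine the bits for each animal into one small integer."""
    mask = 0
    for animal in animals:
        mask |= ANIMAL_BITS[animal]
    return mask

def has_animal(mask, animal):
    """Test one animal's bit with a bitwise AND."""
    return (mask & ANIMAL_BITS[animal]) != 0

print(format(encode([]), "03b"))                       # 000
print(format(encode(["mouse"]), "03b"))                # 001
print(format(encode(["dog", "cat", "mouse"]), "03b"))  # 111
print(has_animal(encode(["mouse"]), "dog"))            # False
```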
In the past, high cardinality data sets weren't good for storing in binary form, or a binary index, but nowadays there are ways around this. So, that list of animals could be quite large.
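The primary reason it's so much faster is that a single bitwise CPU instruction works on a whole machine word at once (64 bits on most current CPUs, more with SIMD), so one instruction checks tens of records instead of one. That makes them extremely fast.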
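Here is a small sketch of that word-at-a-time idea. It only models the layout: Python ints are not hardware words, so this shows how the work is grouped, not real performance. The data and the function names are hypothetical.

```python
WORD_BITS = 64  # assumed machine word size

def pack(flags):
    """Pack 0/1 flags into 64-bit words: record i lives in word i // 64, bit i % 64."""
    words = [0] * ((len(flags) + WORD_BITS - 1) // WORD_BITS)
    for i, flag in enumerate(flags):
        if flag:
            words[i // WORD_BITS] |= 1 << (i % WORD_BITS)
    return words

def count_both(words_a, words_b):
    """Count records set in both bitmaps; each & covers up to 64 records."""
    return sum(bin(a & b).count("1") for a, b in zip(words_a, words_b))

# Hypothetical data: 200 records and two made-up categories.
likes_motorcycles = [i % 2 == 0 for i in range(200)]  # even record ids
lives_in_texas = [i % 3 == 0 for i in range(200)]     # multiples of 3

a = pack(likes_motorcycles)
b = pack(lives_in_texas)

print(len(a))            # 4 words cover all 200 records
print(count_both(a, b))  # 34 records are in both (multiples of 6)
```

So answering "how many records are in both categories" takes 4 AND operations here instead of 200 per-record comparisons.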
FeatureBase could be the "feature store" in the middle of the batch prediction section's diagram, or simply be a drop-in replacement for the model's registry.