| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by dTal 2572 days ago

>converting data such that it is represented in ones and zeros

er, what? Isn't it all?

>This dramatically reduces the size

huh? There is no symbolic encoding less efficient for length than binary.

3 comments

jaffee 2572 days ago

Good catch... that sounds pretty silly. It should probably read more like "converting relationships to be represented by single bits"

As a concrete example, we took the NYC taxi ride data set which is something like 300GB of CSV files and when it was indexed in Pilosa, the total size of all the bitmap files was closer to 40GB.

link

kazinator 2571 days ago

It's quite clear to hopefully everyone that a matrix of bits can represent a DAG. That's called an adjacency matrix.

What's not obvious is what we're associating with what in the NYC taxi ride data set.

A bitmap can also represent a set: the bit positions denote enumerated element symbols, and the value indicates whether that element is present.

So we rearrange the NYC taxi ride data into a data structure based on graphs and sets, and make large bitmaps?

link

fnordsensei 2572 days ago

Well, you could decide that one axis is monkeyIndex, and the other is amountOfBananasOwned, and have a quite compact representation of which monkey owns what number of bananas.

I.e., decide on a symbolic meaning for the axes rather than converting data wholesale.

link

dTal 2572 days ago

That doesn't sound compact at all! Every monkey's banana count uses a fixed number n of bits, where n = max(amountOfBananasOwned). That's horribly inefficient, when an ordinary binary counter uses n = log2(amountOfBananasOwned).

Which is not a criticism of Pilosa - I'm sure it's doing something very clever - I just don't understand what.

link

jaffee 2572 days ago

Bit-sliced indexing is the clever magic here. This post goes very deep on it https://www.pilosa.com/blog/range-encoded-bitmaps/

But really, you use one bitmap for each binary bit of an integer, and it turns out you can generate arbitrary range queries on your dataset by doing various combinations of boolean operations on those bitmaps.

link

kazinator 2571 days ago

This page perpetrates an apparent contradiction. Starts off with:

> Pilosa is built around the concept of representing everything as bitmaps.

What this literally means is that not a single datum is stored that is not in the bitmap.

But then the diagrammed examples look like this:

                manatee loris sea_horse
  Vertebrate    1       1     1
  Invertebrate  0       0     0
  Breathes Air  1       1     0

Pardon my ignorance, but I see here a bitmap of 9 bits plus six character strings that are obviously not inside that bitmap.

If those strings are removed, the bitmap means jack squat.

Are they understood to be in another bitmap?

link

jaffee 2571 days ago

One does have to maintain some understanding of the how integer row and column ids are linked to what they actually represent.

Sometimes this is a function which might map (for example) row 3 to the letter 'd', 4 to 'e', and so on. Sometimes it has to be a lookup table which can be kept within Pilosa, or externally. Sometimes the IDs map directly to what they represent (day-of-month, year, passenger count, etc.)

So strictly speaking, not everything is a bitmap, but the bulk of the heavy lifting in terms of serving queries is computation on bitmaps.

link

fnordsensei 2572 days ago

Right! I meant in comparison to serializing data, which is what I (maybe incorrectly) assumed that the parent was referring to.

E.g., turning something like a JSON string of the same data into bits and sticking it in Pilosa.

link

ytklx 2572 days ago

Roaring bitmaps is quite good at compressing bits. Just as an example: Say, most monkeys have the same number of bananas. Pilosa doesn't store it as MonkeyCount bits, but just a few bytes.

link

ytklx 2572 days ago

This is very similar to how Pilosa saves integers. https://www.pilosa.com/docs/latest/data-model/#bsi-range-enc...

link

kazinator 2571 days ago

Okay, so if we store a three-bit 5 as 101 (in columns 0-2) and an additional "non-null" indicating 1 in column 3, we're saying that we have a row that has an association with columns 0, 2 and 3, but not with 1. Since, you know:

> The central component of Pilosa’s data model is a boolean matrix. Each cell in the matrix is a single bit; if the bit is set, it indicates that a relationship exists between that particular row and column.

But why do I want my integers sliced apart into these associations?

link

toddgruben 2571 days ago

So that you can perform very fast range queries on deeply complex filters on those associations

link

ytklx 2572 days ago

Pilosa uses roaring bitmaps to store the data. Roaring bitmaps are compressed and performing bit operations on them is very efficient since they don't require uncompressing the whole bitmap to perform bit operations.

link

kazinator 2571 days ago

The Roaring bitmap paper is equally opaque in this regard. I easily understand everything about the bitmaps themselves: the compression and various operations. But then they say that they tested it on some Census1881 data, and I'm thinking, what, how? How do you grind census data into a bitmap? And is everything in the bitmap, including people's names and such, or does the bitmap refer to objects that are not in the bitmap?

link