Pilosa: An open source, distributed index

	Pilosa: An open source, distributed index (pilosa.com)
	110 points by josephturnip 2572 days ago

7 comments

joshuaellinger 2572 days ago

Technically, it is very interesting -- it uses Roaring Bitmaps under the hood and builds a query engine on it. So an easy way to think about it is that it maps categorical data into a giant compressible distributed bitmap.

I've been planning to see if I can (mis)use it as an OLAP replacement but I haven't had time to get to it.

jaffee 2572 days ago

You definitely can... the feature set keeps growing. We have multi-field filtered GROUP BY now. It's amazing to see how flexible Roaring Bitmaps can be!

jaffee 2572 days ago

Source code:

https://github.com/pilosa/pilosa

eismcc 2572 days ago

>Pilosa is a standalone index for big data. Its goal is to help big data storage solutions support real time, complex queries without resorting to pre-computation or approximation. Pilosa achieves this goal by implementing a distributed bitmap index which provides a compact representation not of the data itself, but of the relationships present in the data.

https://www.pilosa.com/pdf/PILOSA%20-%20Technical%20White%20...

sktrdie 2572 days ago

Not sure who this document is aimed at. It's not technical enough to appeal to programmers who are working closely with Pilosa, and it's not written in a way that makes it easy to understand for people who don't know anything about Pilosa (such as myself). I mean, a subtitle called "Time Quantum" is enough to confuse me. Would appreciate a more generic "what is this" intro if possible.

dmos62 2572 days ago

I found the use cases section the most informative. You can click on a use case to get a write-up. Here are a few excerpts from transportation:

> Pilosa is a distributed bitmap index that sits on top of a data store. The key to understanding and then using Pilosa is converting data such that it is represented in ones and zeros. This dramatically reduces the size as well as accelerates query times.

> For example, timestamps are important information, but we tend to be interested in individual components of a timestamp, especially when analyzing data with cyclic trends. Timestamp components are stored as groups of bitmaps, known as “frames”. We create one frame for the day of the week, as illustrated in the following table. Along with similar frames for year, month, and time of day, this accelerates queries that ask questions about rides belonging to any logical combination of these time groups.

> [...]

> Because each data point includes pickup/dropoff times and total distance travelled, it’s easy to determine the average speed of the trip. As an example, we use this as a first order approximation of congestion. We created a frame representing average speed, with a spacing of 1 mph.

> In order to answer questions about congestion, we needed to first determine what speeds constitute slow traffic. One of the basic queries in Pilosa is the TopN function, and we used that to get a list of all the different average speeds. By performing a count on each we built a histogram of how many rides fall into each speed bucket, and decided from there which buckets deviate enough from the norm to constitute congestion.
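
To make the excerpts above concrete, here is a minimal sketch of the idea in plain Python, not Pilosa's actual API. It assumes a frame can be modelled as a dict from row value to a bitmap (a Python int whose bit i is set when ride i has that value). The sample rides and the names `build_frame` and `top_n` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: (ride_id, day_of_week, average_speed_mph).
RIDES = [
    (0, "Mon", 12),
    (1, "Mon", 7),
    (2, "Tue", 12),
    (3, "Sat", 25),
    (4, "Mon", 12),
]

def build_frame(rides, field_index):
    """Map each distinct value (a row) to an int bitmap; bit i is set for ride i."""
    frame = defaultdict(int)
    for ride in rides:
        ride_id = ride[0]
        frame[ride[field_index]] |= 1 << ride_id
    return dict(frame)

def top_n(frame, n):
    """Return the n rows with the most set bits as (row, count) pairs."""
    counts = [(row, bin(bits).count("1")) for row, bits in frame.items()]
    # Sort by descending count, breaking ties by row value.
    return sorted(counts, key=lambda rc: (-rc[1], rc[0]))[:n]

day_frame = build_frame(RIDES, 1)
speed_frame = build_frame(RIDES, 2)

# Rides on a Monday with an average speed of 12 mph: one bitwise AND.
monday_at_12 = day_frame["Mon"] & speed_frame[12]

# Histogram of speed buckets, most common first.
speed_histogram = top_n(speed_frame, len(speed_frame))
```

Counting the set bits in each row of the speed frame gives the histogram described in the last excerpt; combining rows from different frames with AND/OR answers the "logical combination" questions.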

dTal 2572 days ago

>converting data such that it is represented in ones and zeros

er, what? Isn't it all?

>This dramatically reduces the size

huh? There is no symbolic encoding less efficient for length than binary.

jaffee 2572 days ago

Good catch... that sounds pretty silly. It should probably read more like "converting relationships to be represented by single bits"

As a concrete example, we took the NYC taxi ride data set, which is something like 300GB of CSV files. When it was indexed in Pilosa, the total size of all the bitmap files was closer to 40GB.

kazinator 2571 days ago

It's quite clear to hopefully everyone that a matrix of bits can represent a DAG. That's called an adjacency matrix.

What's not obvious is what we're associating with what in the NYC taxi ride data set.

A bitmap can also represent a set: the bit positions denote enumerated element symbols, and the value indicates whether that element is present.

So we rearrange the NYC taxi ride data into a data structure based on graphs and sets, and make large bitmaps?
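
The set reading of a bitmap described above can be sketched in a few lines of Python. This is only an illustration using a Python int as the bitmap; the helper names `to_bitmap` and `to_set` are made up for the example.

```python
def to_bitmap(elements):
    """Encode a set of small non-negative ints as a bitmap (bit e set means e is present)."""
    bits = 0
    for e in elements:
        bits |= 1 << e
    return bits

def to_set(bits):
    """Decode a bitmap back into the set of positions whose bit is 1."""
    return {i for i in range(bits.bit_length()) if (bits >> i) & 1}

a = to_bitmap({1, 3, 5})
b = to_bitmap({3, 4, 5})

intersection = to_set(a & b)   # set intersection is bitwise AND
union = to_set(a | b)          # set union is bitwise OR
```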

fnordsensei 2572 days ago

Well, you could decide that one axis is monkeyIndex, and the other is amountOfBananasOwned, and have a quite compact representation of which monkey owns what number of bananas.

I.e., decide on a symbolic meaning for the axes rather than converting data wholesale.

dTal 2572 days ago

That doesn't sound compact at all! Every monkey's banana count uses a fixed number n of bits, where n = max(amountOfBananasOwned). That's horribly inefficient, when an ordinary binary counter uses n = log2(amountOfBananasOwned).

Which is not a criticism of Pilosa - I'm sure it's doing something very clever - I just don't understand what.

jaffee 2572 days ago

Bit-sliced indexing is the clever magic here. This post goes very deep on it https://www.pilosa.com/blog/range-encoded-bitmaps/

But really, you use one bitmap for each binary bit of an integer, and it turns out you can generate arbitrary range queries on your dataset by doing various combinations of boolean operations on those bitmaps.
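
A rough sketch of that idea, assuming plain binary bit slices plus an "exists" bitmap. It follows the textbook bit-sliced comparison and is not taken from Pilosa's code. Bitmaps are Python ints, and `build_slices`, `greater_than` and `columns` are hypothetical names.

```python
def build_slices(values, bit_depth):
    """values: dict column_id -> non-negative int. Returns (slices, exists)."""
    slices = [0] * bit_depth   # slices[i] marks columns whose value has bit i set
    exists = 0                 # marks columns that have any value at all
    for col, value in values.items():
        exists |= 1 << col
        for i in range(bit_depth):
            if (value >> i) & 1:
                slices[i] |= 1 << col
    return slices, exists

def greater_than(slices, exists, c):
    """Bitmap of columns whose value is > c, using only boolean operations."""
    if c >= 1 << len(slices):
        return 0               # c exceeds every representable value
    gt, eq = 0, exists         # eq: columns still tied with c on the bits seen so far
    for i in reversed(range(len(slices))):
        if (c >> i) & 1:
            eq &= slices[i]            # must also have this bit set to stay tied
        else:
            gt |= eq & slices[i]       # tied so far and this bit is 1: strictly greater
            eq &= ~slices[i]
    return gt

def columns(bits):
    """List the column ids whose bit is set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

# Hypothetical data: monkey id -> bananas owned.
bananas = {0: 5, 1: 2, 2: 7, 3: 0, 4: 5}
slices, exists = build_slices(bananas, bit_depth=3)
more_than_four = columns(greater_than(slices, exists, 4))
```

Each value costs one bit per slice (three bitmaps here for values 0 to 7) rather than one row per possible value, and the loop touches each slice once regardless of how many columns there are.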

fnordsensei 2572 days ago

Right! I meant in comparison to serializing data, which is what I (maybe incorrectly) assumed that the parent was referring to.

E.g., turning something like a JSON string of the same data into bits and sticking it in Pilosa.

ytklx 2572 days ago

Roaring bitmaps are quite good at compressing bits. Just as an example: say most monkeys have the same number of bananas. Pilosa doesn't store that row as MonkeyCount bits, but as just a few bytes.
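
A toy illustration of why such a row can shrink to a few bytes, in the spirit of run-length encoding (Roaring has a run container type; its real layout differs from this sketch). The monkey numbers and the `to_runs` helper are invented for the example.

```python
def to_runs(sorted_positions):
    """Collapse sorted set-bit positions into (start, length) runs."""
    runs = []
    for p in sorted_positions:
        if runs and p == runs[-1][0] + runs[-1][1]:
            runs[-1][1] += 1          # extends the current run
        else:
            runs.append([p, 1])       # starts a new run
    return [tuple(r) for r in runs]

# Hypothetical: 60,000 monkeys, all but monkey 12,345 own exactly 3 bananas.
owners_of_three = [m for m in range(60_000) if m != 12_345]
runs = to_runs(owners_of_three)
```

Here the "owns 3 bananas" row has 59,999 set bits but collapses to two runs, i.e. four small integers.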

ytklx 2572 days ago

This is very similar to how Pilosa saves integers. https://www.pilosa.com/docs/latest/data-model/#bsi-range-enc...

kazinator 2571 days ago

Okay, so if we store a three-bit 5 as 101 (in columns 0-2) and an additional "non-null" indicating 1 in column 3, we're saying that we have a row that has an association with columns 0, 2 and 3, but not with 1. Since, you know:

> The central component of Pilosa’s data model is a boolean matrix. Each cell in the matrix is a single bit; if the bit is set, it indicates that a relationship exists between that particular row and column.

But why do I want my integers sliced apart into these associations?

ytklx 2572 days ago

Pilosa uses roaring bitmaps to store the data. Roaring bitmaps are compressed and performing bit operations on them is very efficient since they don't require uncompressing the whole bitmap to perform bit operations.
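
A simplified sketch of why operations can skip most of the data: Roaring partitions 32-bit values into chunks keyed by their high 16 bits, so an intersection only needs to look at chunks present in both inputs. In this sketch Python sets stand in for Roaring's real array and bitmap containers, and the function names are hypothetical.

```python
from collections import defaultdict

def to_containers(values):
    """Group values by their high 16 bits; keep the low 16 bits per chunk."""
    containers = defaultdict(set)
    for v in values:
        containers[v >> 16].add(v & 0xFFFF)
    return dict(containers)

def intersect(a, b):
    """Intersect chunk by chunk; chunks present on only one side are never touched."""
    result = {}
    for key in a.keys() & b.keys():
        common = a[key] & b[key]
        if common:
            result[key] = common
    return result

def to_values(containers):
    """Rebuild the sorted list of full values from the chunked form."""
    return sorted((key << 16) | low
                  for key, lows in containers.items()
                  for low in lows)

a = to_containers([1, 2, 70_000, 200_000])
b = to_containers([2, 70_000, 300_000])
shared = to_values(intersect(a, b))
```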

kazinator 2571 days ago

The Roaring bitmap paper is equally opaque in this regard. I easily understand everything about the bitmaps themselves: the compression and various operations. But then they say that they tested it on some Census1881 data, and I'm thinking, what, how? How do you grind census data into a bitmap? And is everything in the bitmap, including people's names and such, or does the bitmap refer to objects that are not in the bitmap?

kitd 2572 days ago

The link goes straight to the "Data Model" section.

The Introduction section is a better starting point. There are other sections (Getting Started, Architecture and API Reference) for programmers who want to know the details.

maycotte 2572 days ago

Great feedback. We could also do a better job of documenting the most frequent use cases, which include high cardinality segmentation, personalization at scale, rapid data exploration/discovery, edge analytics, fast BI, threat detection, bioinformatics and more.

continuations 2572 days ago

How do you handle race conditions? E.g. an app updated the persistent store but crashed before it could update Pilosa?

jaffee 2572 days ago

Pilosa is best used in conjunction with something like Kafka with (e.g.) separate consumers for Pilosa and a persistent data store.
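
A toy model of that arrangement, with no real Kafka client involved: a Python list plays the role of the log, and each consumer keeps its own offset, so a crashed consumer resumes where it left off without the application having to write to both systems itself. The `Consumer` class and the event names are hypothetical.

```python
class Consumer:
    """Applies events from a shared append-only log, tracking its own offset."""

    def __init__(self, name):
        self.name = name
        self.offset = 0      # next log position to apply
        self.state = []      # stand-in for the store or index contents

    def poll(self, log, limit=None):
        """Apply up to `limit` pending events (all of them if limit is None)."""
        end = len(log) if limit is None else min(len(log), self.offset + limit)
        for event in log[self.offset:end]:
            self.state.append(event)
        self.offset = end

log = ["ride-1", "ride-2", "ride-3"]
store = Consumer("persistent-store")
index = Consumer("pilosa-index")

store.poll(log)             # the store applies all three events
index.poll(log, limit=1)    # the index consumer "crashes" after one event
index.poll(log)             # on restart it resumes from its own offset
```

In a real deployment a consumer may see an event more than once after a crash; setting a bit twice leaves the bitmap unchanged, so replays into the index are harmless.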

ahazred8ta 2572 days ago

"Continuous Analysis on Really Big Data - Pilosa is an open source, distributed bitmap index" https://www.pilosa.com/

shuzchen 2571 days ago

Any chance there'll be some built-in support to perform collaborative filtering? Seems like a database of relations like this would be awesome for user-based collaborative filtering.

jaffee 2569 days ago

Great question! We have actually done some experiments with this in the past and will likely be rolling out features like this on top of Pilosa as part of Molecula https://www.molecula.com/is-your-data-ai-ready/
