by thamer 2047 days ago

Sorry, I didn't check back on this comment after posting it; I hope you'll see this.

It's manual. What you get with Arrow is an efficient way to store structured data so that the values of a column (the same dimension) sit together on disk, rather than each record being stored with all its fields together. So if you're storing, say, a dataset of users with a 64-bit user ID, an IP address, a timestamp, and a country code, you'd define a Schema object with these 4 columns and the size of each one (here 64/32/32/16 bits, for example), and then you'd write your records block by block. A block is just a set of records, and Arrow marks the start and end of each block. It's up to you to decide when to start and end a block; I use 100k entries per block but haven't played much with different values.
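
To make the layout concrete, here is a minimal sketch in plain Java. It is not Arrow's API, just primitive arrays: each block keeps one array per column, so scanning a single column touches only that column's values. `UserBlock`, `allocateBlocks`, and `maxUserId` are made-up names for illustration; the 64/32/32/16-bit widths and the 100k block size follow the example above.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of a column-oriented block; this is NOT Arrow's API. */
final class UserBlock {
    // One array per column: all values of a column sit next to each other.
    final long[] userIds;       // 64-bit user ID
    final int[] ipAddresses;    // 32-bit IPv4 address
    final int[] timestamps;     // 32-bit timestamp (e.g. epoch seconds)
    final short[] countryCodes; // 16-bit country code
    final int rowCount;

    UserBlock(int rowCount) {
        this.rowCount = rowCount;
        this.userIds = new long[rowCount];
        this.ipAddresses = new int[rowCount];
        this.timestamps = new int[rowCount];
        this.countryCodes = new short[rowCount];
    }

    /** Allocates blocks of at most blockSize rows, enough to hold totalRows records. */
    static List<UserBlock> allocateBlocks(int totalRows, int blockSize) {
        List<UserBlock> blocks = new ArrayList<>();
        for (int start = 0; start < totalRows; start += blockSize) {
            blocks.add(new UserBlock(Math.min(blockSize, totalRows - start)));
        }
        return blocks;
    }

    /** Scans only the user ID column of every block; the other columns are never read. */
    static long maxUserId(List<UserBlock> blocks) {
        long max = Long.MIN_VALUE;
        for (UserBlock block : blocks) {
            for (int i = 0; i < block.rowCount; i++) {
                max = Math.max(max, block.userIds[i]);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<UserBlock> blocks = allocateBlocks(250_000, 100_000);
        int lastRows = blocks.get(blocks.size() - 1).rowCount;
        System.out.println(blocks.size() + " blocks, last one has " + lastRows + " rows");
    }
}
```

With 250k records and 100k rows per block this gives two full blocks and one partial block, which is the same kind of decision you make yourself when writing Arrow record batches.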

In pseudo-code it'd be something like this when reading just the user IDs:

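    // imports: java.util.List, org.apache.arrow.vector.BigIntVector,
    //          org.apache.arrow.vector.VectorSchemaRoot,
    //          org.apache.arrow.vector.ipc.ArrowFileReader,
    //          org.apache.arrow.vector.ipc.message.ArrowBlock
    // arrowReader is an already opened ArrowFileReader
    VectorSchemaRoot root = arrowReader.getVectorSchemaRoot();
    BigIntVector userIdVector = (BigIntVector) root.getVector("user_id"); // gets this 64-bit dimension from the schema
    // ... more vectors defined, one for each dimension
    List<ArrowBlock> blocks = arrowReader.getRecordBlocks();
    for (ArrowBlock block : blocks) { // go over all blocks
        arrowReader.loadRecordBatch(block); // *actually* reads the block into root
        // the row count lives on the root, not on the ArrowBlock
        for (int i = 0; i < root.getRowCount(); i++) {
            long userId = userIdVector.get(i);  // offset is within the current block
            processUserId(userId);
        }
    }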

The code in this example only goes over the user IDs, and reads them very quickly. So yes, you have to implement any sort of querying capability yourself. In my case it was a simple set of queries like "get distribution of dimension X" where X can be a parameter, or "filter records where X < minX || X > maxX", also with parameters, etc. Just a handful in all.
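
As a rough sketch of what those two hand-written queries can look like, here they are in plain Java over a single column that has already been read block by block into `long[]` arrays. This is an assumption-laden illustration, not my actual code and not Arrow's API: `ColumnQueries`, `distribution`, and `outsideRange` are hypothetical names, and with a real reader the inner loops would read from the Arrow vector after each block is loaded instead of from an array.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical hand-rolled queries over one column, stored as one long[] per block. */
final class ColumnQueries {

    /** "Get distribution of dimension X": counts how often each value occurs. */
    static Map<Long, Long> distribution(List<long[]> columnBlocks) {
        Map<Long, Long> counts = new TreeMap<>();
        for (long[] block : columnBlocks) {
            for (long value : block) {
                counts.merge(value, 1L, Long::sum);
            }
        }
        return counts;
    }

    /** "Filter records where X < minX || X > maxX": returns global row indices of matches. */
    static List<Integer> outsideRange(List<long[]> columnBlocks, long minX, long maxX) {
        List<Integer> matches = new ArrayList<>();
        int blockStart = 0; // global index of the first row in the current block
        for (long[] block : columnBlocks) {
            for (int i = 0; i < block.length; i++) {
                if (block[i] < minX || block[i] > maxX) {
                    matches.add(blockStart + i);
                }
            }
            blockStart += block.length;
        }
        return matches;
    }

    public static void main(String[] args) {
        List<long[]> blocks = List.of(new long[] {1, 5, 9}, new long[] {5, 12});
        System.out.println(distribution(blocks));
        System.out.println(outsideRange(blocks, 5, 9));
    }
}
```

Both queries touch only the one column they need, which is where the column layout pays off.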

For a limited set of queries and not something like full SQL, this was perfect. I found this article very useful to get started: https://github.com/animeshtrivedi/blog/blob/master/post/2017...