Hacker News new | ask | show | jobs
by opticalfiber 3781 days ago
I'm not really sure what the point of this article is. If you're clever, you can represent any data as a bit array. And once you're there, counting bits or XORing them together is easy. But is that system easy to understand? Is the code easy to read for the rest of the developers who work on the project? Or for future hires? What happens if the data set becomes too large to fit in memory on a single machine? Storage is cheap and getting cheaper every day. The whole data set described in this problem (done the "inefficient" way) fits on a flash drive at today's capacities.

There are always tradeoffs when it comes to architecture design. Speed and storage space are only two of the many factors that require consideration.

5 comments

I don't understand this argument. Is it really likely that a spaghetti nest of Python, json, Hadoop, and Elasticsearch is easier to understand and to maintain than maybe 100-200 lines of self-contained Go code with minimal (no?) outside dependencies?

Or that new hires are going to think bit counting and XOR are ninja CS voodoo, while everyone already knows how to Elasticsearch?

How about cost? What's the monthly all-in cost for each approach?

This is a nice satire. It points out why modern software can be faddy, inefficient, fragile, hard to maintain, and expensive to run, with no obvious benefits (including "easy to understand", which seems to be code for "I already know how to use all those dependencies and APIs, so let's assume new hires will too.")

Not at all. I can appreciate an elegant, problem-specific solution as much as anyone else. The point is that there are always tradeoffs.

What's the query interface like for the bit array? Does it even have one? Seems like you'd have to sit down and add more code whenever you wanted to know something new about the data. Would you write your own query layer? Would you eventually need a dedicated team to maintain it? You may say that is an extreme conclusion, but I've seen this very story play out multiple times in large dev organizations.

As amorphic pointed out, indexing the data with ES makes it easy to access even for non-technical users. Each layer of abstraction comes with its own costs – of course – but also its own benefits. Tradeoffs.

> What's the query interface like for the bit array?

I don't see any request for a query interface in the list of requirements.

The point of the packed-bit array is that you don't need any kind of query interface other than a few accessor macros. It's a memory-mapped array of 64-bit integers; you just index into the array and use a macro to mask out the bits you're interested in.

The "old-timer" solution - which actually solves the stated problem instead of a hypothetical future problem with lots of extra requirements - can be written in a few pages of C. It shouldn't take more than an hour, perhaps two (plus some time for testing).

If and only if the requirements change0 to something significantly more complex would something like a query interface make sense (YAGNI). The task as stated just isn't very hard.

> But is that system easy to understand?

Yes, if you documented it properly.

> Is the code easy

I haven't used go, but it's easy in C. Setting up macros for all the bit manipulation is a very common technique. Usage would be trivial - just call a couple accessor macros. It's certainly easier than walking the parse tree of a JSON record. Using a simple bitmap would also skip the initial parsing step.

It's easier to use the bitmap. The "example usage" section below is very simple. Setting up ElasticSearch or fiddling with a JSON parser is more work (and a lot harder on CPU/RAM).

    /* the "2nd array" */
    uint64 *port_states;
    #define IP_PORT_STATES(ipaddr) (port_states[ip])

    #define PORT_22_INDEX 0
    #define PORT_80_INDEX 1
    // ...etc...

    /* some of the accessor macros */
    #define PORT_STATE_MASK (0x00000003)
    #define PORT_STATE(ipaddr, port_index) \
        (PORT_STATE_MASK & (IP_PORT_STATES(ipaddr) >> (3 * (port_index)))

    #define PORT_CLOSED 0
    #define PORT_OPEN   1
    // ...etc...

    /* example usage  */
    uint32 open_count = 0;
    uint32 ip = 1;
    do {
        if (PORT_STATE(ip, PORT_22_INDEX) == PORT_OPEN) {
            open_count++;
        }
    } while (ip < 0xffffffff)
> future hires

If they can't handle calling a couple macros over a uint64 array, I wouldn't recommend hiring them.

> too large to fit in memory

It's not necessarily in memory - the "old-timer" is using a memory mapped file.

edit: bugfix

This is 206, why are you using #defines instead of "const int"s.
Because the last time I wrote a bitmap like that was about a decade ago using a proprietary compiler for a Z80 clone that implemented a "C89-like"[1] language? Call it a bad habit from too many years working with embedded micros.

If I was using a modern(-ish) C, an enum might be more appropriate.

[1] When you only have 16k of RAM and a 256 byte stack, having "int x;" default to "static" storage instead of the stack is a feature.

Assuming C code (not C++), then const int is not the same as a #define. An enum is only good if the values fit within the range of a native int.
static const's may not be optimized to literals. constexpr is not fully supported.
"if the data set becomes too large to fit in memory on a single machine?"

You just use mmap. If it become bigger than the disk capacity of single machine, you can still mount a distributed filesystem and use mmap.

Slightly off topic: http://yourdatafitsinram.com/ and more seriously https://www.youtube.com/watch?v=SiSY1b0am5w which is a really good talk about the challenges of sysadmining for scientists. What do you do when 2TB of RAM isn't enough?
You add an index and smaller-block loader. The size and structure of a "smaller-block" depends on the project and how you are using it.

In more extreme situations, you to learn all about materialized views, cache invalidation, etc.

+1

Ill add that having your data in Elastic means that a (probably non-technical) user still can use it as part of their analysis without having to get dev involved.

Yes! ES is great for a lot of reasons, not the least of which is the one you point out.
You don't need to be clever to represent data as bit arrays. All you need is to have some CS basics and not be completely brain-dead.