Hacker News new | ask | show | jobs
by kristoff_it 1261 days ago
One change to make this library more idiomatic would be to make HyperLogLog avoid forcing heap allocation on the user.

So from this (simplified code):

    pub fn HyperLogLog(comptime p: u8) type {
        return struct {
            dense: []u6,

            const Self = @This();
            pub fn init(allocator: Allocator) !Self {
                return Self{
                    .dense = try allocator.alloc(u6, 0),
                };
            }
        }
    }
To this:

    pub fn HyperLogLog(comptime p: u8) type {
        return struct {
            dense: [1<<p]u6;

            const Self = @This();
            pub fn init() Self {
                var s = Self{};
                for (s.dense) |*x| x.* = 0;
                return s;
            }
        }
    }
The advantage to this API is that the user can choose where to place the HLL (stack, heap, global memory). Also note how the new init can't fail anymore.

That said, in the code that I've conveniently omitted there's one big complication: when the HLL has only few items in it, it doesn't use the dense representation and instead uses a `std.AutoHashMap` to keep counts.

Unfortunately `std.AutoHashMap` is not a type that can be reasonably disentangled from allocation.

I was about to drop my idea of commenting when I realized that a very neat solution would be to also provide the user with the choice (and admittedly extra responsibility) of when to switch from a hash map to a proper HLL.

    pub fn init(start: std.AutoHashMap) Self {
        var s = Self{};
        for (s.dense) |*s| s.* = 0;
        self.addFromHashMap(start);
        return s;
    }
There you go, allocation-free HyperLogLog :^)

I've also come up with my own funky interface when I implemented Cuckoo Filters in Zig, for those who are interested:

https://github.com/kristoff-it/zig-cuckoofilter

3 comments

Or alternatively one could make good use of the existing buffer and use that to encode the sparse representation, like Antirez did in Redis:

https://github.com/redis/redis/blob/unstable/src/hyperloglog...

Why is Zigs syntax so … foreign to me? What’s the hype behind it?
You're looking at a piece of code that's doing a type of metaprogramming mostly unique to Zig. Additionally I'm using a for loop capture syntax that is not common in other languages (closest thing I can think of are Ruby blocks).

You're just looking at a piece of code that is understandable for a newcomer to not find familiar.

Take a look at Rust and you'll see where a lot of it is coming from.
I used to hate Rusts syntax, but a YT channel in the name of ”No boilerplate” made me see the beauty in it.
Actually i found zig easier to read than rust… something that i found more appealing!
This is awesome... question though: ```pub fn HyperLogLog(comptime p: u8) type { return struct { dense: [1<<p]u6;

            const Self = @This();
            pub fn init() Self {
                var s = Self{};
                for (s.dense) |*x| x.* = 0;
                return s;
            }
        }
    }```
doesn't this allocate 1<<p upfront though. If yes then the HLL of size 16384 bytes upfront which kinda beats the purpose of having a sparse representation no?
> doesn't this allocate 1<<p upfront though

Yes, it does. The idea (see the last code snipped in that post), is that the user delays the creation of the HLL until they are ready to switch to a dense representation. Before then, they just use a std.AutoHashMap directly.

Or, alternatively, the HLL could use the same buffer for both dense and sparse representation (see the Redis code).