Hacker News new | ask | show | jobs
by ignoramous 1596 days ago
Mostly technical questions from me:

1. Re: Presto/Clickhouse: Have you looked at dsq, pola.rs, DuckDB, Apache DataFusion, Clickhouse Local? If so, what's your opinion on where the (data science) ecosystem is moving towards (for example, ibis-project.org taking over Presto/Trino).

2. Re: Bitmaps: What's the most compact way you know to store a bitmap-index in printable ASCII (b64 etc)? Puny code esque state machines are elegant (used in DNS), but is there anything else that's better?

3. Re: Unit economics/Opex: Are you a believer in the cost effectiveness (both eng and monetary) of the overall Serverless storage and compute movement?

4. Re: Russia: From the looks of it Russia (and Eastern Europe, in general) has fantastic yet untapped pool of talent, but then, how do you compete for talent with Yandex, Klarna, Spotify, UiPath, and others?

5. Re: py/go: Choose one? ;)

Thanks.

1 comments

1) DuckDB - like the performance and that it's tightly integrated into python. Clickhouse local - i saw the announcement when it came out, but i just don't see the usecase for it in analytics. DataFusion is new for me. I'd say that the ecosystem is moving towards snowflake/bigquery/redshift/...

2) Why would you want to do that? You'd use bitmap index because it's quite compact and you can process the data at the speed of memory bandwidth, using ascii defeats that, no?

3) Not really. I just can't imagine costs of running any large website or app with serverless.

4) I think it's the same in the US - everyone competes for the same talent with FAANG (MAANG?). The salary gap between big companies and startups is even lower in Russia. Also, we humans want to do meaningful things, some of us struggle to find meaning in being another bigco employee.

5) Py

Thanks.

> Why would you want to do that?

We store user preferences (200+ yes/no knobs) in a bitmap (well, a bitmap-index like the one in Hash Mapped-Array Tries). We want to capture those prefs in a single sub-domain (limited to 63 lower-case alphanumeric chars) or a URL (limited to 200 mixed-case alphanumerics). Today, we simply convert the bitmap into url-b64 (or, b32 to store it in the subdomain), but we will soon run out the 63-char limit if we introduce more knobs.

A demonstration of it is here, in case the above didn't explain it well: https://rethinkdns.com/configure (choose blocklists, and see the selection generate a path appended to the base-url shown in the search-bar).

Oh. No, i don't think it's possible - i'd suggest to just use multiple subdomains.

Technically, you can squeeze out some bits: there are 36-37 possible characters of which you are using only 32, so with arithmetic coding you would be looking at about 1 extra bit for 5 characters, but it's a nightmare to code. And after those extra bits run out, you will get the same problem anyway.

Does another party need to decode the url? What about using a dictionary for the top 10k seen starting combinations and then encode the rest?

What about run length encoding? 1-9 for positive sequences. a-i for negative sequences (max means pattern continues) and the rest for frequent patterns like alternating sequences, etc

9967b would be 24 yes, 1 no, 7 yes, 3 nos, 1 yes etc

Thanks. RLE is a hit-and-a-miss (101010101010 etc); top-k is heuristic-based. Those are viable solutions, nevertheless (esp, top-k). Puny Code, which DNS uses, I thought was pretty neat for fitting in a state-machine in printable characters. Takes a bunch of CPU resources to restore it, though.

I was wondering if there are other techniques that I may not be aware of.