Hacker News
by judofyr 1985 days ago
Interesting choice of technology, but you didn't completely convince me as to why this is better than just using SQLite or PostgreSQL with a lagging replica. (You could probably start with either one and easily migrate to the other if needed.)

In particular you've designed a very complicated system: Operationally you need an etcd cluster and a tailetc cluster. Code-wise you now have to maintain your own transaction-aware caching layer on top of etcd (https://github.com/tailscale/tailetc/blob/main/tailetc.go). That's quite a brave task considering how many databases fail at Jepsen. Have you tried running Jepsen tests on tailetc yourself? You also mentioned a secondary index system which I assume is built on top of tailetc again? How does that interact with tailetc?

Considering that high-availability was not a requirement and that the main problem with the previous solution was performance ("writes went from nearly a second (sometimes worse!) to milliseconds") it looks like a simple server with SQLite + some indexes could have gotten you quite far.

We don't really get the full overview from a short blog post like this though so maybe it turns out to be a great solution for you. The code quality itself looks great and it seems that you have thought about all of the hard problems.


> and a tailetc cluster

What do you mean by this part? tailetc is a library used by the client of etcd.

Running an etcd cluster is much easier than running an HA PostgreSQL or MySQL config. (I previously made LiveJournal and ran its massively sharded HA MySQL setup)

Neat. This is very similar to [0], which is _not_ a cache but rather a complete mirror of an Etcd keyspace. It does Key/Value decoding up front, into a user-defined & validated runtime type, and promises to never mutate an existing instance (instead decoding into a new instance upon revision change).

The typical workflow is to do all of your "reads" out of the keyspace, attempt to apply Etcd transactions, and (if needed) block until your keyspace has caught up such that you read your write -- or someone else's conflicting write.

[0] https://pkg.go.dev/go.gazette.dev/core/keyspace
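
To make the read-your-write step concrete, here is a minimal sketch of the idea using only the standard library. It is not the gazette keyspace API or tailetc: `Mirror`, `Store`, `Apply` and `WaitForRevision` are hypothetical names, and `Store` merely simulates a remote store that assigns a revision to each write and delivers changes to the mirror in order, the way a watch stream would.

```go
package main

import (
	"fmt"
	"sync"
)

// Mirror is a hypothetical local copy of a key/value store, tagged with the
// store revision it has caught up to.
type Mirror struct {
	mu   sync.Mutex
	cond *sync.Cond
	rev  int64
	kv   map[string]string
}

func NewMirror() *Mirror {
	m := &Mirror{kv: make(map[string]string)}
	m.cond = sync.NewCond(&m.mu)
	return m
}

// Apply records one change observed from the store and wakes any waiters.
func (m *Mirror) Apply(rev int64, key, value string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.kv[key] = value
	m.rev = rev
	m.cond.Broadcast()
}

// WaitForRevision blocks until the mirror has caught up to at least rev.
func (m *Mirror) WaitForRevision(rev int64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for m.rev < rev {
		m.cond.Wait()
	}
}

// Get reads from the local mirror only; it never contacts the store.
func (m *Mirror) Get(key string) (string, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	v, ok := m.kv[key]
	return v, ok
}

type event struct {
	rev        int64
	key, value string
}

// Store simulates the remote store: each write gets a new revision and is
// delivered to the mirror asynchronously, in revision order.
type Store struct {
	mu     sync.Mutex
	rev    int64
	events chan event
}

func NewStore(m *Mirror) *Store {
	s := &Store{events: make(chan event, 16)}
	go func() {
		for e := range s.events {
			m.Apply(e.rev, e.key, e.value)
		}
	}()
	return s
}

// Put commits a write and returns the revision it was assigned.
func (s *Store) Put(key, value string) int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.rev++
	s.events <- event{s.rev, key, value}
	return s.rev
}

func main() {
	m := NewMirror()
	s := NewStore(m)

	rev := s.Put("node/1", "online") // the write goes to the store
	m.WaitForRevision(rev)           // block until the mirror has seen it
	v, _ := m.Get("node/1")          // the read comes from the mirror
	fmt.Println(v)                   // prints "online"
}
```

Without the `WaitForRevision` call, the `Get` could run before the change has been delivered and miss the write that was just committed.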

Drat! I went looking for people doing something similar when I sat down to design our client, but did not find your package. That's a real pity, I would love to have collaborated on this.

I guess Go package discovery remains an unsolved problem.

> I guess Go package discovery remains an unsolved problem.

Or did you just not really search, like most of us excited to DIY? :-D

Godoc is pretty good, the package shows up for the searches I'd probably do in a similar situation.

https://godoc.org/?q=etcd

https://godoc.org/?q=etcd+watch

Funny, the package in question exists because _I_ thought I could do better and wanted to DIY.
You must not have seen https://godoc.org/?q=tailetc
Whoa, we hadn't seen that! At first glance it indeed appears to be perhaps exactly identical to what we did.
Slightly different trade-offs. This package is emphatically just "for" Etcd, choosing to directly expose MVCC types & concepts from the client.

It also doesn't wrap transactions -- you use the etcd client directly for that.

The Nagle delay it implements helps quite a bit with scaling, though, while keeping the benefits of a tightly packed sorted keyspace. And you can directly access / walk decoded state without copies.

I wish pkg.go.dev had a sign-in and an option to star/watch a package. I do this with GitHub repos I should revisit, and it would have been handy on pkg.go.dev :) Yes, I know, nobody wants yet another login.
> What do you mean by this part? tailetc is a library used by the client of etcd.

Oh. Since they have a full cache of the database I thought it was intended to be used as a separate set of servers layered in front of etcd to lessen the read load. But you're actually using it directly? Interesting. What's the impact on memory usage and scalability? Are you not worried that this will not scale over time since all clients need to have all the data?

Well, we have exactly 1 client (our 1 control server process).

So architecturally it's:

3 or 5 etcd (forget what we last deployed) <--> 1 control process <--> every Tailscale client in the world

The "Future" section is about bumping "1 control process" to "N control processes" where N will be like 2 or max 5 perhaps.

The memory overhead isn't bad, as the "database" isn't big. Modern computers have tons of RAM.

You're able to serve all your clients from a single control process? And this would probably work for quite a while? Then I struggle to see why you couldn't just use SQLite. On startup read the full database into memory. Serve reads straight from memory. Writes go to SQLite first and if it succeeds then you update the data in memory. What am I missing here?
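
A rough sketch of that proposal, for illustration only: `Durable` is a hypothetical interface standing in for SQLite (a real version would wrap `database/sql`), and `memDB` is a toy implementation that exists only so the example runs. Nothing here reflects Tailscale's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Durable is a hypothetical stand-in for the on-disk database.
type Durable interface {
	LoadAll() (map[string]string, error)
	Put(key, value string) error
}

// Cache serves reads from memory and writes through to the durable store.
type Cache struct {
	mu   sync.Mutex
	data map[string]string
	db   Durable
}

// Open reads the full database into memory once, at startup.
func Open(db Durable) (*Cache, error) {
	data, err := db.LoadAll()
	if err != nil {
		return nil, err
	}
	return &Cache{data: data, db: db}, nil
}

// Get is served straight from memory.
func (c *Cache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.data[key]
	return v, ok
}

// Put writes to the durable store first and updates memory only on success.
func (c *Cache) Put(key, value string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if err := c.db.Put(key, value); err != nil {
		return err
	}
	c.data[key] = value
	return nil
}

// memDB is a toy Durable used only to make the sketch runnable.
type memDB struct {
	rows map[string]string
	fail bool // when true, Put simulates a failed disk write
}

func (d *memDB) LoadAll() (map[string]string, error) {
	out := make(map[string]string, len(d.rows))
	for k, v := range d.rows {
		out[k] = v
	}
	return out, nil
}

func (d *memDB) Put(key, value string) error {
	if d.fail {
		return errors.New("disk write failed")
	}
	d.rows[key] = value
	return nil
}

func main() {
	db := &memDB{rows: map[string]string{"a": "1"}}
	c, err := Open(db)
	if err != nil {
		panic(err)
	}
	_ = c.Put("b", "2")

	db.fail = true
	err = c.Put("b", "3") // rejected by the store, so memory keeps "2"
	v, _ := c.Get("b")
	fmt.Println(v, err) // prints "2 disk write failed"
}
```

Holding the lock across the durable write keeps memory and disk in the same order at the cost of serializing writers, which seems acceptable for a single process with few writes.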
We could use SQLite. (I love SQLite and have written about it before!) The goal is N control processes not for scale, but for more flexibility with deployment, canarying, etc.
That makes sense. Thanks for answering all of my critical questions. Looks like a very nice piece of technology you’re building!
I'm curious what drove the decision to move to an external store (and multinode HA config at that) now compared to using a local Go KV store like Badger or Pebble?

Given that the goals seem to be improving performance over serializing a set of maps to disk as JSON on every change and keeping complexity down for fast and simple testing, a KV library would seem to accomplish both with less effort, without introducing dependence on an external service, and would enable the DB to grow out of memory if needed. Do you envision going to 2+ control processes that soon?

Any consideration given to running the KV store inside the control processes themselves (either by embedding something like an etcd or by integrating a raft library and a KV store to reinvent that wheel) since you are replicating the entire DB into the client anyway?

Meanwhile I'm working with application-sharded PG clusters with in-client caches with coherence maintained through Redis pubsub, so who am I to question the complexity of this setup haha.

Yes, we're going to be moving to 2+ control servers for HA + blue/green reasons pretty soon here.
> Running an etcd cluster is much easier than running an HA PostgreSQL or MySQL config.

What if you used one of the managed RDBMS services offered by the big cloud providers? BTW, if you don't mind sharing, where are you hosting the control plane?

> What if you used one of the managed RDBMS services offered by the big cloud providers?

We could (and likely would, despite the costs) but that doesn't address our testing requirements.

The control plane is on AWS.

We use 4 or 5 different cloud providers (Tailscale makes that much easier) but the most important bit is on AWS.

Why is testing Postgres/MySQL difficult? You can easily run a server locally (or on CI) and create new databases for test runs, etc.
It's not difficult. We've done it before and have code for it. See the article.
> ...but the most important bit is on AWS.

Curious: Was running DynamoDB with DAX (DynamoDB Accelerator) in front ever in contention? If not, is it due to vendor lock-in (for example, not being able to migrate out) or because tailscale doesn't feel the need to use managed offerings especially for core infrastructure?

> > Running an etcd cluster is much easier than running an HA PostgreSQL or MySQL config.

> What if you used one of the managed RDBMS services offered by the big cloud providers?

Yeah, AWS RDS "multi-AZ" does a good job of taking care of HA for you. (Google Cloud SQL's HA setup is extremely similar.) But you still get 1-2 minutes of full unavailability when hardware fails.

I haven't operated etcd in production myself, but I assume it does better because it's designed specifically for HA. You can't even run less than three nodes. (The etcd docs talk about election timeouts on the order of 1s, which is encouraging.)

For many use cases, 1-2 minutes of downtime is tolerable. But I can imagine situations where availability is paramount and you're willing to give up scale/performance/features to gain another 9.

This is about spot on. I do get the part about testability, but with a simple Key/Value use case like this, BoltDB or Pebble might have fit extremely well into the Native Golang paradigm as a backing store for the in-memory maps while not needing nearly as much custom code.

Plus maybe replacing the sync.Mutex with RWMutexes for optimum read performance in a seldom-write use case.
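
For anyone unfamiliar with the difference, a small sketch of what that swap looks like. `Registry` and its fields are made-up names, not anything from the article's code. With `sync.RWMutex`, any number of readers can hold `RLock` at the same time, while `Lock` still gives a writer exclusive access.

```go
package main

import (
	"fmt"
	"sync"
)

// Registry is a hypothetical read-mostly map guarded by an RWMutex.
type Registry struct {
	mu    sync.RWMutex
	nodes map[string]string
}

// Lookup takes the shared (read) lock, so concurrent lookups do not block
// each other.
func (r *Registry) Lookup(name string) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	addr, ok := r.nodes[name]
	return addr, ok
}

// Set takes the exclusive (write) lock, blocking readers and other writers.
func (r *Registry) Set(name, addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.nodes[name] = addr
}

func main() {
	r := &Registry{nodes: make(map[string]string)}
	r.Set("laptop", "100.64.0.1")

	// Several readers may run inside Lookup at once.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.Lookup("laptop")
		}()
	}
	wg.Wait()

	addr, _ := r.Lookup("laptop")
	fmt.Println(addr) // prints "100.64.0.1"
}
```

Whether this is actually faster depends on contention; for short critical sections a plain `sync.Mutex` can perform just as well, so it is worth benchmarking before switching.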

On the other hand again, I feel a bit weird criticizing Brad Fitzpatrick ;-), so there might be other things at play I don't have a clue about...

If you want a distributed key/value data store, you want to use what's already out there and vetted. It used to be ZooKeeper, but etcd is much simpler, it's what Kubernetes uses, and it has been a big success and proved itself out there in the field. Definitely easier than a full SQL database, which is overkill and much harder to replicate, especially if you want to have a cluster of >= 3. Again, the key word is "distributed", and that immediately rules out SQLite.
It's overkill until it's not. We chose etcd initially, but after a while we started wanting to ask questions about our data that weren't necessarily organised in the same way as the key/value hierarchy. That just moved all the processing to the client, and now I just wish we had used a SQL database from the beginning.
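
An illustration of what "moved all the processing to the client" can look like. The key layout (`user/<name>/device/<device>` with the OS as the value) and the function name are invented for this sketch. The layout suits "list Alice's devices", but "which users have a Linux device?" needs a scan over every key, where a SQL database could answer it with a single `WHERE` clause.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// usersWithOS scans every key to answer a question the hierarchy was not
// laid out for. Keys are assumed to look like "user/<name>/device/<device>".
func usersWithOS(kv map[string]string, os string) []string {
	seen := make(map[string]bool)
	for key, value := range kv {
		parts := strings.Split(key, "/")
		if len(parts) != 4 || parts[0] != "user" || parts[2] != "device" {
			continue // not a device key
		}
		if value == os {
			seen[parts[1]] = true
		}
	}
	users := make([]string, 0, len(seen))
	for u := range seen {
		users = append(users, u)
	}
	sort.Strings(users)
	return users
}

func main() {
	kv := map[string]string{
		"user/alice/device/laptop": "linux",
		"user/alice/device/phone":  "android",
		"user/bob/device/desktop":  "linux",
		"user/carol/device/phone":  "ios",
	}
	fmt.Println(usersWithOS(kv, "linux")) // prints "[alice bob]"
}
```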
Yeah, but for their use case it's just KV, plus the ability to link it directly in Go.
I was initially baffled by the choice of technology too. Part of it is that etcd is apparently much faster at handling writes, and offers more flexibility with regard to consistency, than I remember. Part of it might be that I don't understand the durability guarantees they're after, the gotchas they can avoid (e.g. transactions), or their overall architecture.