Hacker News
by realreality 2771 days ago
- transaction size and duration limitations. I can almost understand the limitation on large write transactions, but the same size limitation applies to read transactions. If you’re doing a large range read, you may not know in advance whether your range will reach the 10MB limit and thus raise an exception.

- the storage backends seem less impressive than the marketing leads you to believe. The default memory backend is obviously too limited to use in production, and the “ssd” backend turns out just to be built on top of the Btree code from SQLite. Besides that, the documentation warns against using the ssd backend on macOS. Isn’t that a bit strange, considering who owns foundationdb??

- while testing, I found that it was impossible to shrink a cluster. If you add a second storage node just to test that the distributed stuff works correctly, you can’t reduce it back to a single node without destroying the entire database and starting over. If it’s possible to run everything on one node, it should be possible to shrink a cluster back to a single node.

- the storage backends have a crazy amount of write amplification (something like 3x, according to the docs). The foundationdb folks should focus on improving the underlying storage, for instance by building on lmdb or RocksDB or something. For my toy app, I abstracted my data access to use either lmdb (for local testing) or foundationdb (for production), but I ultimately ended up just using lmdb because I didn’t want to deal with fdb’s limitations and operational unknowns.

- another weird fdb limitation: the best single threaded latency you’ll get is supposedly around 1ms for small reads. The docs suggest you can achieve much better performance by scaling the cluster and number of clients. That may be true, but some applications may want high single-threaded performance. (Something like lmdb can achieve tens of thousands of reads per second)

3 comments

On shrinking a cluster: you'll want to use fdbcli to "exclude" nodes. It should be pretty straightforward (search the docs for the word "exclude").
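For anyone who wants to script this, here is a minimal sketch that only builds the fdbcli invocations and does not run them. The node address is hypothetical, and it assumes fdbcli is on the PATH and accepts one command through `--exec`. Whether the redundancy mode also has to change before the exclude completes depends on your cluster, so check the docs before running anything like this.

```python
import shlex

# Hypothetical address of the storage node to retire; substitute your own.
NODE_TO_REMOVE = "10.0.0.2:4500"


def fdbcli_exec(command):
    """Build an argv list that runs a single fdbcli command non-interactively."""
    return ["fdbcli", "--exec", command]


# Assumed order: ask the cluster to move data off the node, then check status.
SHRINK_STEPS = [
    fdbcli_exec("exclude " + NODE_TO_REMOVE),
    fdbcli_exec("status"),
]


def as_shell_lines(steps):
    """Render each argv list as a copy-pasteable shell line."""
    return [shlex.join(argv) for argv in steps]
```

Printing `as_shell_lines(SHRINK_STEPS)` gives lines you can review and paste by hand, which is probably safer than letting a script touch a real cluster.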

On write amplification: a factor of 3x is not actually that unusual. The default RocksDB size amplification is 2x, and I've seen performant LSM trees with about 3x write amplification.

On the single threaded bottleneck: this is an inherent issue you have when you put your database over a network connection. LMDB can do 10k/100k+ reads/sec on a single thread since it's just doing syscalls. As soon as you need to distribute your database across more than 1 machine, you need to parallelize your work for high throughput.

ScyllaDB/Redis can also handle a lot of calls with a single thread/core.
FoundationDB single-core performance is fine. From my testing on the memory engine (and the docs), you can expect 70k+ reads/second/core for small keys and values. But crucially this means you must have concurrency to drive throughput.
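A back-of-the-envelope sketch of why concurrency is required, taking the thread's two figures at face value (about 1ms per small read, 70k reads/second as the target). The function names are made up for illustration, and the model is just Little's law, not anything FoundationDB-specific.

```python
def serial_reads_per_second(latency_us):
    """Reads per second when each read waits for the previous one to finish."""
    return 1_000_000 // latency_us


def in_flight_needed(target_reads_per_second, latency_us):
    """Little's law: requests in flight = throughput * latency (rounded up)."""
    return -(-target_reads_per_second * latency_us // 1_000_000)


# With ~1ms per read, one strictly serial client tops out near 1,000 reads/s.
print(serial_reads_per_second(1000))
# Reaching 70,000 reads/s at that latency needs about 70 reads in flight.
print(in_flight_needed(70_000, 1000))
```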

No database can magically make your serial access pattern faster. Amdahl's law and all that.

FoundationDB's latency for your specific workload is up to how good you are at designing your algorithm for concurrency. If you do every step serially, you'll be spending most of your time waiting for the network.
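To make the serial-versus-concurrent point concrete, here is a small simulation. It does not talk to FoundationDB at all: `fake_read` is a hypothetical stand-in that just sleeps to imitate a network round trip, and the 5ms latency is an arbitrary value chosen so the difference is easy to see.

```python
import asyncio
import time

SIMULATED_LATENCY_S = 0.005  # arbitrary stand-in for one network round trip


async def fake_read(key):
    """Pretend to fetch a key over the network."""
    await asyncio.sleep(SIMULATED_LATENCY_S)
    return key.upper()


async def read_serially(keys):
    # Each read waits for the previous one, so latencies add up.
    return [await fake_read(k) for k in keys]


async def read_concurrently(keys):
    # All reads are issued up front, so their waits overlap.
    return list(await asyncio.gather(*(fake_read(k) for k in keys)))


def timed(coro_fn, keys):
    """Run one of the readers and return (results, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(keys))
    return result, time.perf_counter() - start
```

Both readers return the same values; only the elapsed time differs, and the gap grows with the number of keys.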

Regarding your first comment, the reason I’m listed as a contributor for this release is that I made a change to the documentation about large range reads. Basically, value sizes are not included in the 10MB limit for reads.
Ah yes, I just noticed that in the docs. That’s a good thing to note, though you could still run into the problem with very large ranges (maybe reading 1 million keys is a rare use case?).
Reading a million individual keys would be quite rare I would guess, but that isn’t really the issue for a large range. The keys at the start and the end of the range are what’s counted in that case. So if you read the range A-Z, the size is only those two keys A and Z, not the size of keys in between.
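A toy model of the accounting as described in this comment, not FoundationDB's actual implementation. The function names are hypothetical, and the 10MB limit is taken here as 10,000,000 bytes.

```python
SIZE_LIMIT_BYTES = 10_000_000  # the 10MB limit discussed above (assumed exact value)


def point_reads_size(keys):
    """Bytes counted if every key is read individually: each key counts."""
    return sum(len(k) for k in keys)


def range_read_size(begin, end):
    """Bytes counted for one range read: only the two boundary keys count."""
    return len(begin) + len(end)


# A million 17-byte keys read one by one would blow past the limit...
million_keys = (b"user/%012d" % i for i in range(1_000_000))
print(point_reads_size(million_keys) > SIZE_LIMIT_BYTES)
# ...while a single range over the same keys counts only its endpoints.
print(range_read_size(b"user/", b"user0"))
```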

More relevant for the current storage engines is the five second transaction duration limit (although this is changing in a future storage engine, judging from digging through the code and the abstract for an upcoming talk). That’s just because the multi-version data structure only includes the last 5s of versions.
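Here is a rough sketch of why a bounded version history implies a duration limit. It is a toy in-memory store, not FoundationDB's data structure: the class, the exception name, and the use of wall-clock seconds as versions are all assumptions made for illustration.

```python
class TransactionTooOld(Exception):
    """Hypothetical stand-in for the error a too-old read would hit."""


class VersionWindowStore:
    def __init__(self, window_s=5.0):
        self.window_s = window_s
        self._versions = {}  # key -> list of (commit_time, value), oldest first

    def write(self, key, value, now):
        self._versions.setdefault(key, []).append((now, value))
        self._prune(now)

    def _prune(self, now):
        horizon = now - self.window_s
        for key, history in list(self._versions.items()):
            # Keep the newest version at or before the horizon, plus all later ones.
            keep_from = 0
            for i, (commit_time, _) in enumerate(history):
                if commit_time <= horizon:
                    keep_from = i
            self._versions[key] = history[keep_from:]

    def read(self, key, read_time, now):
        # A snapshot older than the window may need versions already discarded.
        if read_time < now - self.window_s:
            raise TransactionTooOld("snapshot is older than the version window")
        value = None
        for commit_time, candidate in self._versions.get(key, []):
            if commit_time <= read_time:
                value = candidate
        return value
```

A transaction that started at time 1.0 can still read its snapshot at time 3.0, but by time 10.0 the versions it needs may be gone, so the store has to refuse.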

Oh, that’s even less-expected behavior! In that case, one would never run into the size limitation for range reads. I think the docs should clarify that only the first and last keys count toward the transaction size.

Yes, the 5 second limit could be a problem.

There are two reasons for this I can see:

1) All mutations and key ranges are stored locally and submitted when the transaction commits. Lots of data to transfer.

2) The optimistic conflict checker can process one range (even if it covers a lot of underlying keys) much more easily than each individual key in that range.
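The second reason can be sketched with a toy conflict check. This is not how FoundationDB's resolver is written; it is a simplified model with hypothetical function names, where committed writes are a sorted list of keys and a read conflicts if any written key falls inside what was read.

```python
import bisect


def range_conflicts(read_begin, read_end, sorted_written_keys):
    """One lookup: is any written key inside [read_begin, read_end)?"""
    i = bisect.bisect_left(sorted_written_keys, read_begin)
    return i < len(sorted_written_keys) and sorted_written_keys[i] < read_end


def point_reads_conflict(read_keys, sorted_written_keys):
    """One lookup per key read: work grows with the number of keys."""
    for key in read_keys:
        i = bisect.bisect_left(sorted_written_keys, key)
        if i < len(sorted_written_keys) and sorted_written_keys[i] == key:
            return True
    return False
```

Checking a range costs a single binary search however many keys it covers, while tracking each key separately costs one search per key and one recorded key per read.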

I haven't tested that it works as described, but the docs say that the cluster stops working entirely unless the cluster size corresponds to the configured replication. Seen here: https://apple.github.io/foundationdb/configuration.html#conf...

Look under "double" mode or "triple" mode. Is this why it maybe didn't work for you?

I’m not sure if you can switch from single to double, and then back to single. I don’t remember if this configuration was available when I was testing fdb, or if you just increased the number of processes in order to scale the cluster.

I do remember getting into a state where the status said it was migrating data, but there was no available node to migrate it to (because I wanted to shrink the cluster). Effectively, the cluster was deadlocked.

Supposedly it works, as described a section below: https://apple.github.io/foundationdb/configuration.html#chan...

But yeah, the entire point of my comments in this thread is that the database should be telling me exactly what my options are in any situation, if there are any. Judging by your comments, it seems that you have also encountered a silent "deadlock" and the database gave no indication of what the hell was going on. That's the key here: the database silently stopped working, right? For something as critical as a database with possibly very important data, this just isn't acceptable to me. I want to be told, as if I'm a complete noobie user, what I have to do and why, and what is going on with my data. The database is not a place where I feel the need to put on my smartypants hat, it's where I want to be taken care of completely.