Hacker News new | ask | show | jobs
by hyc_symas 1085 days ago
This is a pretty old argument and IMO it's far out of date/obsolete.

Taking full control of your I/O and buffer management is great if (a) your developers are all smart and experienced enough to be kernel programmers and (b) your DBMS is the only process running on a machine. In practice, (a) is never true, and (b) is no longer true because everyone is running apps inside containers inside shared VMs. In the modern application/server environment, no user level process has accurate information about the total state of the machine, only the kernel (or hypervisor) does and it's an exercise in futility to try to manage paging etc at the user level.

As Dr. Michael Stonebraker put it: The Traditional RDBMS Wisdom is (Almost Certainly) All Wrong. https://slideshot.epfl.ch/play/suri_stonebraker (See the slide at 21:25 into the video). Modern DBMSs spend 96% of their time managing buffers and locks, and only 4% doing actual useful work for the caller.

Granted, even using mmap you still need to know wtf you're doing. MongoDB's original mmap backing store was a poster child for Doing It Wrong, getting all of the reliability problems and none of the performance benefits. LMDB is an example of doing it right: perfect crash-proof reliability, and perfect linear read scalability across arbitrarily many CPUs with zero-copy reads and no wasted effort, and a hot code path that fits into a CPU's 32KB L1 instruction cache.

7 comments

Out of curiosity, how many databases have you written?

This is co-authored by Pavlo, Viktor Leiss, with feedback from Neumann. I'm sorry, but if someone on the internet claims to know better than those 3, you're going to need some monumental evidence of your credibility.

Additionally, what you link here:

  > ... (See the slide at 21:25 into the video). Modern DBMSs spend 96% of their time managing buffers and locks, and only 4% doing actual useful work for the caller.
Is discussing "Main Memory" databases. These databases do no I/O outside of potential initial reads, because all of the data fits in-memory!

These databases represent a small portion of contemporary DBMS usage when compared to traditional RDBMS.

All you have to do is look at the bandwidth and reads/sec from the paper when using O_DIRECT "pread()"s versus mmap'ed IO.

This is a classic appeal to authority. Let's play the argument, not the man.

(My understanding is that the GP wrote LMDB, works on openLDAP, and was a maintainer for BerkelyDB for a number of years. But even if he'd only written 'hello, world!' I'm much more interested in the specific arguments).

Correct, and thank you. I wrote LMDB, wrote a lot of OpenLDAP, and worked on BerkeleyDB for many years. And actually Andy Pavlo invited me to CMU to give a lecture on LMDB a few years back. https://www.youtube.com/watch?v=tEa5sAh-kVk

Andy and I have had this debate going for a long time already.

Well, I eat my shorts.

Isn't LMBD closer to an embedded key-value store than an RDBMS, though? Also there's a section in the paper that mentions it's single-writer.

Yes, LMDB is an embedded key/value store but it can be used as the backing store of any other DB model you care for. E.g. as a backend to MySQL, or SQLite, or OpenLDAP, or whatever.
I think the real argument is more nuanced. Where you see mmap() fail badly on Linux, even for read-only workloads, is under a few specific conditions: very large storage volumes, highly concurrent access, non-trivial access patterns (e.g. high-dimensionality access methods). Most people do not operate data models under these conditions, but if you do then you can achieve large integer factor gains in throughput by not using mmap().

Interestingly, most of the reason for these problems has to do with theoretical limitations of cache replacement algorithms as drivers of I/O scheduling. There are alternative approaches to scheduling I/O that work much better in these cases but mmap() can’t express them, so in those cases bypassing mmap() offers large gains.

GP wrote a key-value store called LMDB that is constrained to a single writer, and often used for small databases that fit entirely in memory but need to persist to disk. There's a whole different world for more scalable databases.
"fit entirely in memory" is not a requirement. LMDB is not a main-memory database, it is an on-disk database that uses memory mapping.
Can you explain "high-dimensionality access methods" to me? (Or if it's too big for an HN comment, maybe recommend a paper).
This guy talks a lot of crap. See his website for examples, and don't waste your time with him

<<<There is one significant drawback that should not be understated. Algorithm design using topology manipulation can be enormously challenging to reason about. You are often taking a conceptually simple algorithm, like a nested loop or hash join, and replacing it with a much more efficient algorithm involving the non-trivial manipulation of complex high-dimensionality constraint spaces that effect the same result. Routinely reasoning about complex object relationships in greater than three dimensions, and constructing correct parallel algorithms that exploit them, becomes easier but never easy.>>>

http://www.jandrewrogers.com/2015/10/08/spacecurve/

I'd imagine same kind of worst case access would also be a problem doing IO the "classical" way
The argument is that:

- Queries can trigger blocking page faults when accessing (transparently) evicted pages, causing unexpected I/O stalls

- mmap() complicates transactionality and error-handling

- Page table contention, single-threaded page eviction, and TLB shootdowns become bottlenecks

1 - for reading any uncached data, the I/O stalls are unavoidable. Whatever client requested that data is going to have to wait regardless.

2 - complexity? this is simply false. LMDB's ACID txns using MVCC are much simpler than any "traditional" approach.

3 - contention is a red herring since this approach is already single-writer, as is common for most embedded k/v stores these days. You lose more perf by trying to make the write path multi-threaded, in lock contention and cache thrashing.

> for reading any uncached data, the I/O stalls are unavoidable.

Excuse me for a silly question, but whilst an I/O stall may be unavoidable, wouldn't a thread stall be avoidable if you're not using mmap?

Assuming that you're not swapping, you'll generally know if you've loaded something into memory or not, whilst mmap doesn't help you know if the relevant page is cached. If the data isn't in memory, you can send the I/O request to a thread to retrieve it, and the initiating thread can then move onto the next connection. I suspect this isn't doable under mmap based access?

It's kind of disingenuous to talk about how great your concurrency system is when you only allow a single writer. RCU (which I imagine your system is isomorphic to) is pretty simple compared to what many DB engines use to do ACID transactions that involve both reads and writes.
You don't need more than single-writer concurrency if your write txns are fast enough.

Our experience with OpenLDAP was that multi-writer concurrency cost too much overhead. Even though you may be writing primary records to independent regions of the DB, if you're indexing any of that data (which all real DBs do, for query perf) you wind up getting a lot of contention in the indices. That leads to row locking conflicts, txn rollbacks, and retries. With a single writer txn model, you never get conflicts, never need rollbacks.

take a look at http://nms.csail.mit.edu/~stavros/pubs/OLTP_sigmod08.pdf - the overhead of coordinating multiple writers often makes multi-writer databases slower than single-writer databases. remember, everything has to be serialized when it goes to the write ahead log, so as long as you can do the database updates as fast as you can write to the log then concurrent writers are of no benefit.
Yeah for workloads with any long running write transactions a single writer design is a pretty big limitation. Say some long running data load (or a big bulk deletion) running along with some faster high throughput key value writes - the big data load would block all the faster key-value writes when it runs.

No "mainstream" database I'm aware of has a global single writer design.

"Taking full control of your I/O and buffer management is great if (a) your developers are all smart and experienced enough to be kernel programmers" is already an appeal to authority in itself.

We shouldn't apply a higher bar to the counterargument than we applied to the argument in the first place.

Out of curiosity, do you have anything actually useful to add or are just throwing appeals to authority because you don't ?
Even thought the data resides mostly in-memory they still have to write transactions to disk to preserve them, don't they?
> your DBMS is the only process running on a machine. In practice, (a) is never true, and (b) is no longer true because everyone is running apps inside containers inside shared VMs.

There's nothing special about kernel programmers. In fact, if I had to compare, I'd go with storage people being the more experienced / knowledgeable ones. They have a highly competitive environment, which requires a lot more understanding and inventiveness to succeed, whereas kernel programmers proper don't compete -- Linux won many years ago. Kernel programmers who deal with stuff like drivers or various "extensions" are, largely, in the same group as storage (often time literally the same people).

As for "single process" argument... well, if you run a database inside an OS, then, obviously, that will never happen as OS has its own processes to run. But, if you ignore that -- no DBA worth their salt would put database in the environment where it has to share resources with applications. People who do that are, probably, Web developers who don't have high expectations from their database anyways and would have no idea how to configure / tune it for high performance, so, it doesn't matter how they run it, they aren't the target audience -- they are light years behind on what's possible to achieve with their resources.

This has nothing to do with mmap though. mmap shouldn't be used for storage applications for other reasons. mmap doesn't allow their users to precisely control the persistence aspect... which is kind of the central point of databases. So, it's a mostly worthless tool in that context. Maybe fine for some throw-away work, but definitely not for storing users' data or database's own data.

> There's nothing special about kernel programmers.

Yes, that was a shorthand generalization for "people who've studied computer architecture" - which most application developers never have.

> no DBA worth their salt would put database in the environment where it has to share resources with applications.

Most applications today are running on smartphones/mobile devices. That means they're running with local embedded databases - it's all about "edge computing". There's far more DBs in use in the world than there are DBAs managing them.

> mmap shouldn't be used for storage applications for other reasons. mmap doesn't allow their users to precisely control the persistence aspect... which is kind of the central point of databases. So, it's a mostly worthless tool in that context. Maybe fine for some throw-away work, but definitely not for storing users' data or database's own data.

Well, you're half right. That's why by default LMDB uses a read-only mmap and uses regular (p)write syscalls for writes. But the central point of databases is to be able to persist data such that it can be retrieved again in the future, efficiently. And that's where the read characteristics of using mmap are superior.

> "people who've studied computer architecture" - which most application developers never have

If you are developing an DBMS and haven't studied computer architecture, the best idea is probably to ask more experienced people to help out with your ideas.

From my limited knowledge, I don't think the article is old enough to be obsolete, just that there's a lot more to it.

Not to be gatekeeping or anything, but it is a pretty well studied field with lots of very knowledgeable people around, who are probably more than keep to help. There aren't too many qualified jobs around and you probably have a budget if you are developing a database commercially.

mmap doesn't allow their users to precisely control the persistence aspect

It's been a while since I've dealt with mmap(), but isn't this what msync() does? You can synchronously or asynchronously force dirty pages to be flushed to disk without waiting until munmap().

msync lets you force a flush so you can control the latest possible moment for a writeout. But the OS can flush before that, and you have no way to detect or control that. So you can only control the late side of the timing, not the early side. And in databases, you usually need writes to be persisted in a specific order; early writes are just as harmful as late writes.
I'd even take a memory ordering guarantee, something like, within each page, data is read out sequentially as atomic aligned 64-bit reads with acquire ordering. (Though this probably is what you get on AMD64.) As-is, there's not even a guarantee against an atomic aligned write being torn when written out.
That is absolutely not what you actually get from the hardware.

For fun, there is no guarantee in terms of writing a page in what order it is written. SQLite documents that they assume (but cannot verify) that _sector_ writes are linear, but not atomic. https://www.sqlite.org/atomiccommit.html

> If a power failure occurs in the middle of a sector write it might be that part of the sector was modified and another part was left unchanged. The key assumption by SQLite is that if any part of the sector gets changed, then either the first or the last bytes will be changed. So the hardware will never start writing a sector in the middle and work towards the ends. We do not know if this assumption is always true but it seems reasonable.

You are talking several levels higher than that, at the page level (composed of multiple sectors).

Assume that they reside in _different_ physical locations, and are written at different times. That's fun.

> Most applications today are running on smartphones/mobile devices.

That's patently false. There are about 8 bn. people. Even if everyone has a smartphone or two, it's nothing compared to the total of all devices that can be called "computer". I think that "smart TV" alone will beat the number of smartphones. But even that is a drop in a bucket when it comes to the total of running programs on Earth / its orbit.

But, that's beside the point. Smartphones aren't designed to run database servers. Even if they indeed were the majority, they'd still be irrelevant for this conversation because they are a wrong platform for deploying databases. In other words, it doesn't matter how people deploy databases to smartphones -- they have no hopes of achieving good performance, and whether they use mmap or not is of no consequences -- they've lost the race before they even qualified for it.

> LMDB

Are we talking about this? https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Databa... If so, this is irrelevant for databases in general.

> LMDB databases may have only one writer at a time

(Taken from the page above) -- this isn't a serious contender for database server space. It's a toy database. You shouldn't give general advice based on whatever this system does or doesn't.

>irrelevant for databases in general

It's one of the databases compared in the paper

OP is one of the authors of LMDB

Smart TVs are also all running SQLite
Can you comment on what the paper gets wrong? It says that scalability with mmap is poor due to page table contention and others. How does LMDB manage to scale well with mmap? Is page table contention just not an issue in practice?
Maybe someone should pull LMDB's mmap/paging system into a usable library. I'd love to use the k/v store part of course, but I keep hitting the default key size limitation and would prefer not to link statically.
It wouldn't be much use without the B+tree as well; it's the B+tree's cache friendliness that allows applications to run so efficiently without the OS knowing any specifics of the app's usage patterns.
Do you have benchmarks of lmdb when the working set is much larger than memory? I couldn't find any.

In my experience -- and in line with the article -- mmap works fine with small working sets. It seems that most benchmarks of lmdb have relatively small data sets.

> Do you have benchmarks of lmdb when the working set is much larger than memory? I couldn't find any.

Where did you look? This is a sample using DB 5x and 50x larger than RAM http://www.lmdb.tech/bench/hyperdex/

There are plenty of other larger-than-RAM benchmarks there.

Hm. That seems to be comparing against a 2013 era leveldb, which at the time also used mmap. (It's since switched the default for performance reasons)

It's also strange to me that there's no transition in performance when the data set size grows beyond cache.

> Taking full control of your I/O and buffer management is great if (a) your developers are all smart and experienced enough to be kernel programmers and (b) your DBMS is the only process running on a machine. In practice, (a) is never true, and (b) is no longer true because everyone is running apps inside containers inside shared VMs.

The article is about DBMS developers. For DBMS developers, "in practice" (a) and (b) are usually true I think.

Who is deploying databases in containers?
A disturbingly large number of deployments I’ve seen using Kubernetes or docker compose have databases deployed as such.
Given the ability to deploy pods to dedicated nodes based on label selectors, what is the actual performance impact of running a database in a container on a bare metal host with mounted volume versus running that same process with say systemd on that same node? Basically, shouldn’t the overhead of running a container be minimal?
The problem is kubelet likes to spike in memory / CPU / network usage. It's not a well-behaved program to put alongside a database. It's not written with an eye for resource utilization.

Also, it brings nothing of value to the table, but requires a lot of dance around it to keep it going. I.e. if you are a decent DBA, you don't have a problem setting up a node to run your database of choice, you would be probably opposed to using pre-packaged Docker images anyways.

Also, Kubernetes sucks at managing storage... basically, it doesn't offer anything that'd be useful to a DBA. Things that might be useful come as CSI... and, obviously, it's better / easier to not use a CSI, but to interface directly with the storage you want instead.

That's not to say that storage products don't offer these CSI... so, a legitimate question would be why would anyone do that? -- and the answer is -- not because it's useful, but because a lot of people think they need / want it. Instead of fighting stupidity, why not make an extra buck?

I run DB’s on K8s, not because I don’t know what I’m doing, but because most of the trade offs are worth it.

If I run a db workload in K8s, it’s a tiny fraction of the operational overhead, and not a massively noticeable performance loss.

I would absolutely love a way to deploy and manage db’s as easily as K8s with fewer of the quite significant issues that have mentioned, so if you know of something that is better behaved around singular workloads, but keeps the simple deploys, the resiliency, the ease of networking and config deployments, the ease of monitoring, etc, I am all ears.

If you think that deploying anything with Kubernetes is simple... well, I have bad news for you.

It's simple, until you hit a problem. And then it becomes a lot worse than if you had never touched it. You are now in the stage of a person who'd never made backups and never had a failure that required them to restore from backups, and you are wondering why would anyone do it. Adverse events are rare, and you may go like this for years, or, perhaps the rest of your life... unfortunately, your experience will not translate into a general advice.

But, again, you just might be in the camp where performance doesn't matter. Nor does uptime matter, nor does your data have very high value... and in that case it's OK to use tools that don't offer any of that, and save you some time. But, you cannot advise others based on that perspective. Or, at least, not w/o mentioning the downsides.

If you run in the cloud, any of the major cloud providers can take that undifferentiated heavy lifting off your hands (Amazon RDS etc.).
If you care about perf you would pin the kubelet and all other overhead workload to one core, and mask that off for your workload.
> If you care about perf you would pin the kubelet

Wrong. I wouldn't use kubelet at all. Kubernetes and good performance are not compatible. The goal of Kubernetes is to make it easier to deploy Web sites. Web is a very popular technology, so Kubernetes was adopted in many places where it's irrelevant / harmful because Web developers are plentiful and will help to power through the nonsense of this program. It's there because it makes trivial things even easier for less qualified personnel. It's not meant as a way to make things go faster, or to use less memory, or to use less persistent storage, or less network etc... it's the wheelchair of ops, not a highly-optimized professional-grade equipment.

IMO if you’re concerned about performance and yet are deploying databases this way — mmap should not even be on the radar.
How would containers even hurt performance? How does the database no longer having the ability to see other processes on the machine somehow make it slower?
There are many "holes" in these containers.

1. fsync. You cannot "divide" it between containers. Whoever does it, stalls I/O for everyone else.

2. Context switches. Unless you do a lot of configurations outside of container runtime, you cannot ensure exclusive access to the number of CPU cores you need.

3. Networking has the same problem. You would either have to dedicate a whole NIC or SRI-OV-style virtual NIC to your database server. Otherwise just the amount of chatter that goes on through the control plane of something like Kubernetes will be a noticeable disadvantage. Again, containers don't help here, they only get in the way as to get that kind of exclusive network access you need more configuration on the host, and, possible an CNI to deal with it.

4. kubelet is not optimized to get out of your way. It needs a lot of resources and may spike, hindering or outright stalling database process.

5. Kubernetes sucks at managing memory-intensive processes. It doesn't work (well or at all) with swap (which, again, cannot be properly divided between containers). It doesn't integrate well with OOM killer (it cannot replace it, so any configurations you make inside Kubernetes are kind of irrelevant, because system's OOM killer will do how it pleases, ignoring Kubernetes).

---

Bottom line... Kubernetes is lame from infrastructure perspective. It's written for Web developers. To make things appear simpler for them, while sacrificing a lot of resources and hiding a lot of actual complexity... which is impossible to hide, and which, in an even of failure will come to bite you. You don't want that kind of program near your database.

My background is more borg then k8s, but…

Alway allocate whole cores, just mask them off

Dedicate physical IO devices for sensitive workloads

You can have per cgroup swap if you want, but imo swap is not useful

I think all of this is possible in k8s

As these are obviously very real issues, and Kubernetes also isn’t going away imminently, how many of these can be fixed/improved with different design on the application front?

Would using direct-Io API’s fix most of the fsync issues? If workloads pin their stuff to specific cores can we incite some of the overhead here? (Assuming we’re only running a single dedicated workload + kubelet on the node).

> You would either have to dedicate a whole NIC or SRI-OV-style virtual NIC to your database server

Tbh I’ve no idea we could do this with commodity cloud servers, nor do I know how, but I’m terribly interested in knowing how, do you know if there’s like a “dummy’s guide to better networking”? Haha

> kubelet is not optimized to get out of your way...Kubernetes sucks at managing memory-intensive processes

Definitely agree on both these issues, I’ve blown up the kubelet by overallocating memory before, which basically borked the node until some watchdog process kicked in. Sounds like the better solution here is a kubelet rebuilt to operate more efficiently and more predictably? Is the solution a db-optimised kubelet/K8s?

This is extremely misinformed. No matter how you choose to manage workloads, ultimately you are responsible for tuning and optimization.

If you're not in control of the system, and thus kubelet, obviously your hands are tied. I'm not sure anyone is suggesting that for a serious workload.

Now to dispell your myths:

1. You can assign dedicated storage devices to your database. Outside of mount operations you're not going to see much alien fsync activity. This is paranoid.

2. You can pin kubelet CPU cores. You can ensure exclusive access to the remaining ones. There are a number of advanced techniques that are not at all necessary if you want to be a control freak, such as creating your own cgroups. This isn't "outside" of the runtime. Kubernetes is designed to conform to your managed cgroups. That's the whole point. RTFM.

3. The general theme of your complaint has nothing to do with kubernetes. There's no beating a dedicated NIC and even network fabric. Some cloud providers even allow you to multi-NIC out of the box so this is pretty solvable. Also, like, the dumbest QoS rules can drastically minimize this problem generally. Who cares.

4. Nah. RTFM. This is total FUD.

5.a. I don't understand. Are you sharing resources on the node or not? If you're not, then swap works fine. If you are, then this smells like cognitive dissonance and maybe listen to your own advice, but also swap is still very doable. It's just disk. swapon to your heart's content. But also swap is almost entirely dumb these days. Are you suggesting swapping to your primary IO device? Come on. More FUD.

5.b. OOM killer does what it wants. What's a better alternative that integrates "well" with the OOM killer? Do you even understand how resource limits work? The OOM killer is only ever a problem if you either do not configure your workload properly (true regardless of execution environment) or you run out of actual memory.

Bottom line: come down off your high horse and acknowledge that dedicated resources and kernel tuning is the secret to extreme high performance. I don't care how you're orchestrating your workloads, the best practices are essentially universal.

And to be clear, I'm not recommending using Kubernetes to run a high performance database but it's not really any worse (today) than alternatives.

> It's written for Web developers. To make things appear simpler for them, while sacrificing a lot of resources and hiding a lot of actual complexity... which is impossible to hide, and which, in an even of failure will come to bite you.

What planet are you currently on? This makes no sense. It's a set of abstractions and patterns, the intent isn't to hide the complexity but to make it manageable at scale. I'd argue it succeeds at that.

Seriously, what is the alternative runtime you'd prefer here? systemd? hand rolled bash scripts? puppet and ansible? All of the above??

I’ll assume the worst case:

- lots of containers running on a single host

- containers are each isolated in a VM (aka virtualized)

- workloads are not homogenous and change often (your neighbor today may not be your neighbor tomorrow)

I believe these are fair assumptions if you’re running on generic infrastructure with kubernetes.

In this setup, my concerns are pretty much noisy neighbors + throttling. You may get latency spikes out of nowhere and the cause could be any of:

- your neighbor is hogging IO (disk or network)

- your database spawned too many threads and got throttled by CFS

- CFS scheduled your DBs threads on a different CPU and you lost your cache lines

In short, the DB does not have stable, predictable performance, which are exactly the characteristics you want it to have. If you ran the DB on a dedicated host you avoid this whole suite of issues.

You can alleviate most of this if you make sure the DB’s container gets the entire host’s resources and doesn’t have neighbors.

> - containers are each isolated in a VM (aka virtualized)

Why are you assuming containers are virtualized? Is there some container runtime that does that as an added security measure? I thought they all use namespaces on Linux.

None of those are the fault of containers. You can do all of what you said without containers.
Nobody who matters.

Those who do that don't know what they are doing (even if they outnumber the other side hundred to one, they "don't count" because they aren't aiming for good performance anyways).

Well, maybe not quite... of course it's possible that someone would want to deploy a database in a container because of the convenience of assembling all dependencies in a single "package", however, they would never run database on the same node as applications -- that's insanity.

But, even the idea of deploying a database alongside something like kubelet service is cringe... This service is very "fat" and can spike in memory / CPU usage. I would be very strongly opposed to an idea of running a database on the same VM that runs Kubernetes or any container runtime that requires a service to run it.

Obviously, it says nothing about the number of processes that will run on the database node. At the minimum, you'd want to run some stuff for monitoring, that's beside all the system services... but I don't think GP meant "one process" literally. Neither that is realistic nor is it necessary.

>but I don't think GP meant "one process" literally. Neither that is realistic nor is it necessary.

The point was simply about other processes that could be competing for resources - CPU, memory, or I/O. It is expensive for a user-level process to perform accounting for all of these resources, and without such accounting you can't optimally allocate them.

If there are other apps that can suddenly spike memory usage then any careful buffer tuning you've done goes out the window. Likewise for any I/O scheduling you've done, etc.

I'm running prod databases in containers so the server infra team doesn't have to know anything about how that specific database works or how to upgrade it, they just need to know how to issue generic container start/stop commands if they want to do some maintenance.

(But just in containers, not in Kubernetes. I'm not crazy.)

My group and a bunch of my peer groups.

And we are running them at the scale that most people can’t even imagine.

Embedded DB