Hacker News
by ayende 1612 days ago
I'm the author (well, one of the authors) of RavenDB.

You are correct to an extent, but there are a few things to note.

* you can design your system so the access pattern that the OS is optimized for matches your needs

* you can use madvise() to give some useful hints

* the amount of complexity you don't have to deal with is staggering
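To make the madvise() point concrete, here is a minimal sketch in Python (not RavenDB's actual code). It maps a file read-only, hints the expected access pattern, and then scans the mapping. The function name `checksum_with_hint` and the page-by-page byte sum are hypothetical and for illustration only. It assumes Python 3.8+ on a platform that exposes `mmap.madvise()`. Where the call or the constant is missing, the hint is simply skipped, since it is only advice to the kernel.

```python
import mmap
import os
import tempfile


def checksum_with_hint(path: str, advice_name: str = "MADV_SEQUENTIAL") -> int:
    """Map a non-empty file read-only, pass an access-pattern hint, sum its bytes."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # madvise() and the MADV_* constants are platform-dependent;
            # the hint is optional, so skip it when it is unavailable.
            advice = getattr(mmap, advice_name, None)
            if advice is not None and hasattr(m, "madvise"):
                m.madvise(advice)

            total = 0
            # Walk the mapping one page at a time, front to back, which is
            # the pattern MADV_SEQUENTIAL tells the kernel to expect.
            for offset in range(0, len(m), mmap.PAGESIZE):
                total += sum(m[offset:offset + mmap.PAGESIZE])
            return total


if __name__ == "__main__":
    # Hypothetical usage: write a small scratch file and scan it.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes([1, 2, 3]) * 5000)
    try:
        print(checksum_with_hint(tmp.name))
    finally:
        os.unlink(tmp.name)
```

Passing "MADV_RANDOM" instead would be the hint for a lookup-heavy pattern. Either way the result is the same and only the kernel's readahead behavior may differ.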


OTOH, if you care about that last 5 percent or so of performance, there is the complexity that what the OS has optimized for might differ between OSes (e.g., macOS, Linux, FreeBSD, etc.) and indeed might change between different versions of Linux, or even, in the case of buffered writeback, between different filesystems on the same version of Linux. This is probably historically one of the most important reasons why enterprise databases like Oracle DB, DB2, etc., have used direct I/O, and not buffered I/O or mmap.

Speaking as an OS developer, we're not going to try to optimize buffered I/O for a particular database. We'll be using benchmarks like compilebench and postmark to optimize our I/O, and if your write patterns, or readahead patterns, or caching requirements, don't match those workloads, well.... sucks to be you.

I'll also point out that those big companies that actually pay the salaries of us file system developers (e.g., Oracle, Google, etc.) for the most part use Direct I/O for their performance-critical workloads. If database companies that want to use mmap want to hire file system developers and contribute benchmarks and performance patches for ext4, xfs, etc., speaking as the ext4 maintainer, I'll welcome that, and we do have a weekly video conference where I'd love to have your engineers join to discuss your contributions. :-)

The key from my perspective is that I CAN design my access patterns to match what you've optimized for.

Another aspect to remember is that mmap being even possible for databases as the primary mechanism is quite new.

Go back 15 years and you are in 32-bit land. That rules out mmap as your approach.

At this point, I might as well skip the OS and go direct IO.

As for differing OS behavior, I generally find that they all roughly optimize for the same thing.

I need the best perf on Linux and Windows. On other systems I can get away with just being pretty good.

The mongodb developers once thought as you did. They were wrong, although it took a fair while for them to realise this. Yes it's complex. Extremely complex, and as another poster noted, the learning curve is horrible and documentation is extremely limited. Unfortunately there's no real substitute.

The mmap/madvise approach works well for things like varnish cache, where you have a flat collection of similar and largely unrelated objects. It does not work well for databases where you have many different types of data, some of which are interrelated, and all want to be handled differently. If you can meet the performance needs for your product by doing what you're doing then great - that's a fantastic complexity saving for your business. But the claim that "you can design your system so the access pattern that the OS is optimized for matches your needs" is unfortunately not true. It might be good enough for what you need, but it's not optimal. That's why there's so many lines of code in other DB engines doing this the hard way.
