
by anarazel 406 days ago

FWIW, I played with that - unfortunately it seems that the overhead of doing twice the page cache lookups is a cure worse than the disease.

Note that we do not offload IO to workers when doing I/O that the caller will synchronously wait for, just when the caller actually can do IO asynchronously. That reduces the need to avoid the offload cost.

It turns out, as some of the results in Lukas' post show, that the offload to the worker is often actually beneficial, particularly when the data is in the kernel page cache - it parallelizes the memory copy from kernel to userspace and postgres' checksum computation. Particularly on Intel server CPUs, which have had pretty mediocre per-core memory bandwidth for roughly the last decade, memory bandwidth turns out to be a bottleneck for page cache access and checksum computations.
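
A rough illustration of that effect, not postgres' actual worker code: the sketch below fans chunked reads out to a thread pool so that both the kernel-to-userspace copy and the checksum run on several cores at once. `checksum_file`, `read_and_checksum` and the chunk size are made up for the example, and `zlib.crc32` is only a stand-in for postgres' page checksum.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 8192  # postgres' default page size, used here only as a chunk unit


def read_and_checksum(fd, offset, length):
    """Copy one chunk out of the page cache and checksum it in a worker."""
    data = os.pread(fd, length, offset)  # positional read, safe to share fd
    # zlib.crc32 is a stand-in, NOT postgres' page checksum algorithm
    return offset, zlib.crc32(data)


def checksum_file(path, workers=4, chunk=16 * BLOCK_SIZE):
    """Fan chunk reads out to a thread pool; returns {offset: crc32}."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [
                pool.submit(read_and_checksum, fd, off, min(chunk, size - off))
                for off in range(0, size, chunk)
            ]
            return dict(f.result() for f in futures)
    finally:
        os.close(fd)
```

With one worker, copy and checksum are serialized on a single core; with several, each core pays for its own share of the memory traffic, which is the situation described above.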

Edit: Fix negation


the8472 406 days ago

Ah yeah, getting good kernel<>userspace oneshot memcpy performance for large files is surprisingly hard. mmap has setup/teardown overhead that's significant for oneshot transfers, regular read/write calls suffer from page cache/per page overhead. Hopefully all the large folio work in the kernel will help with that.
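
For anyone who wants to compare the two paths on their own files, here is a minimal sketch of both oneshot strategies. The function names are hypothetical, and in Python the interpreter overhead will blur the numbers, so treat it as a shape of the experiment rather than a benchmark.

```python
import mmap
import os


def copy_with_read(path, chunk=1 << 20):
    """Plain read() loop: one kernel-to-userspace copy per syscall."""
    parts = []
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            buf = os.read(fd, chunk)
            if not buf:
                break
            parts.append(buf)
    finally:
        os.close(fd)
    return b"".join(parts)


def copy_with_mmap(path):
    """Map, copy once, unmap: setup/teardown is paid on every call."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        if size == 0:
            return b""  # mmap rejects zero-length files
        with mmap.mmap(fd, size, access=mmap.ACCESS_READ) as m:
            return m[:]  # the copy out of the mapping
    finally:
        os.close(fd)
```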

anarazel 406 days ago

From what I've seen a surprisingly large part of the overhead is due to SMAP when doing larger reads from the page cache - i.e. if I boot with clearcpuid=smap (not for prod use!), larger reads go significantly faster. On both Intel and AMD CPUs interestingly.

On Intel it's also not hard to simply reach the per-core memory bandwidth with modern storage HW. This matters most prominently for writes by the checkpointing process, which needs to compute data checksums given the current postgres implementation (if enabled). But even for reads it can be a bottleneck, e.g. when prewarming the buffer pool after a restart.

derefr 406 days ago

> if I boot with clearcpuid=smap (not for prod use!), larger reads go significantly faster. On both Intel and AMD CPUs interestingly.

Is there a page anywhere that collects these sorts of "turn the whole hardware security layer off" switches that can be flipped to get better throughput out of modern x86 CPUs, when your system has no real attack surface to speak of (e.g. air-gapped single-tenant HPC)?

the8472 405 days ago

On the kernel side there's a boot parameter for all of them: mitigations=off. Software that was compiled with additional fences may have to be recompiled to remove them.

https://www.kernel.org/doc/html/latest/admin-guide/kernel-pa...

starspangled 405 days ago

mitigations=off disables workarounds for bugs or "mis-features" in the CPU that could be exploited to bypass OS security measures.

smap is an OS security measure, and so does not get disabled by mitigations=off. smap can be pretty draining for certain IO performance though. IMO it should be more well-known or covered by a more obvious option.

Linux kernel developers are really bad at defining and naming options like this.
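
Since the two switches are independent, a quick way to see which ones a box was booted with is to look at the kernel command line. A small sketch, assuming simple space-separated parameters without quoting; `parse_cmdline` and `security_switches` are names invented for the example.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value-or-None}."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def security_switches(cmdline):
    """Report the two switches discussed in this thread."""
    params = parse_cmdline(cmdline)
    # clearcpuid takes a comma-separated list of features
    cleared = (params.get("clearcpuid") or "").split(",")
    return {
        "mitigations_off": params.get("mitigations") == "off",
        "smap_cleared": "smap" in cleared,
    }
```

On Linux this could be fed with `security_switches(open("/proc/cmdline").read())`.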

amluto 406 days ago

SMAP overhead should be roughly constant, and I’d be quite surprised if it’s noticeable for large reads. Small reads are a different story.

anarazel 406 days ago

It turns out to be the other way round, curiously. The bigger the reads (i.e. how much to read in one syscall) and the bigger the target area of the reads (how long before a target memory location is reused), the bigger the overhead of SMAP gets.

If interesting I can dig up the reproducer I had at some point.
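
To be clear, the following is not that reproducer, just a hypothetical sketch of the two knobs described above: `read_size` is how much each syscall reads, and `target_size` is how large the destination area is before a memory location gets reused. Python adds its own overhead, so a serious measurement would want C; this only shows the structure of the loop.

```python
import os
import time


def time_reads(path, read_size, target_size):
    """Read the whole file, `read_size` bytes per syscall, cycling through
    a `target_size`-byte destination buffer. Returns (bytes_read, seconds)."""
    if read_size > target_size:
        raise ValueError("read_size must not exceed target_size")
    view = memoryview(bytearray(target_size))
    fd = os.open(path, os.O_RDONLY)
    total = 0  # file offset of the next read
    pos = 0    # position inside the target buffer
    start = time.perf_counter()
    try:
        while True:
            if pos + read_size > target_size:
                pos = 0  # wrap around: the target memory is reused
            n = os.preadv(fd, [view[pos:pos + read_size]], total)
            if n == 0:
                break  # end of file
            total += n
            pos += read_size
    finally:
        os.close(fd)
    return total, time.perf_counter() - start
```

Running it once to warm the page cache and then sweeping both parameters, with and without clearcpuid=smap, would be the comparison to make.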

amluto 405 days ago

That is definitely interesting.

gregjm 405 days ago

TCMalloc never munmaps, instead it mmap(MAP_FIXED) within unpopulated PROT_NONE regions, and then madvise(MADV_FREE) at page granularity to reduce RSS. Perhaps a similar approach for file I/O could help to dodge the cost of munmap TLB shootdowns after a file has been read, but using MADV_DONTNEED instead of MADV_FREE. There will probably be a shootdown associated with the MADV_DONTNEED, but maybe it will be lower cost than munmap?

You might also just keep around the file mapping until memory/address space pressure requires, and at that point MAP_FIXED over it.
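
A minimal sketch of the "keep the mapping around" half of this idea. Python's mmap module cannot choose the mapping address, so the MAP_FIXED part is not shown; `MappingCache` is a hypothetical helper, and whether MADV_DONTNEED is actually cheaper than munmap is exactly the open question above, not something this code demonstrates.

```python
import mmap
import os


class MappingCache:
    """Keeps file mappings alive across reads instead of unmapping them."""

    def __init__(self):
        self._maps = {}

    def read(self, path):
        m = self._maps.get(path)
        if m is None:
            fd = os.open(path, os.O_RDONLY)
            try:
                # length 0 maps the whole file; raises ValueError if empty
                m = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
            finally:
                os.close(fd)  # the mapping keeps its own duplicate of fd
            self._maps[path] = m
        data = m[:]
        if hasattr(mmap, "MADV_DONTNEED"):
            # drop the pages from this process, but keep the mapping itself
            m.madvise(mmap.MADV_DONTNEED)
        return data

    def close(self):
        """Unmap everything, e.g. under address space pressure."""
        for m in self._maps.values():
            m.close()
        self._maps.clear()
```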

yxhuvud 406 days ago

Well, nowadays there is https://www.phoronix.com/news/Linux-RWF_UNCACHED-2024

the8472 406 days ago

That doesn't speed up userspace<>kernel memcopy, it just reduces cache churn. Despite its name it still goes through the page cache, it just triggers writeback and drops the pages once that's done. For example when copying to a tmpfs it makes zero difference since that lives entirely in memory.

senderista 406 days ago

So you're less dependent on the page replacement algorithm being scan-resistant, since you can use this flag for scan/loop workloads, right?

gmokki 406 days ago

I would initially add it for WAL writes and reads. There should never be another read in normal operation.

gavinray 406 days ago

Do you think there's a possibility of Direct IO being adopted at some point in the future now that AIO is available?

anarazel 406 days ago

> Do you think there's a possibility of Direct IO being adopted at some point in the future now that AIO is available?

Explicitly a goal.

You can turn it on today, with a bunch of caveats (via debug_io_direct=data). If you have the right workload - e.g. read only and lots of seqscans, bitmap index scans etc. - you can see rather substantial perf gains. But it'll suck in many cases in 18.

We need at least:

- AIO writes in checkpointer, bgwriter and backend buffer replacement (think bulk loading data with COPY)

- readahead support in a few more places, most crucially index range scan (works out ok today if the heap is correlated with the index, sucks badly otherwise)

EDIT: Formatting