by pgaddict 104 days ago
There probably is some additional inefficiency when reading pages randomly (compared to sequential reads), but most of the difference is at the storage level. That is, SSDs can handle a lot of random I/O, but it's nowhere close to sequential reads.

For example, I have a RAID0 with 4 SSDs (Samsung 990 PRO, so consumer, but quite good for reads). And this is what fio says:

# random reads, 8K, direct IO, depth=1

fio --filename=device name --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly

-> read: IOPS=19.1k, BW=149MiB/s (156MB/s)(4473MiB/30001msec)

# sequential reads, 8K, direct IO, depth=1

fio --filename=/dev/md127 --direct=1 --rw=read --bs=8k --ioengine=io_uring --iodepth=1 --runtime=30 --numjobs=1 --time_based --group_reporting --name=random-1 --eta-newline=1 --readonly

-> read: IOPS=85.5k, BW=668MiB/s (700MB/s)(19.6GiB/30001msec)
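As a quick sanity check on the two results above, bandwidth should equal IOPS times block size. A minimal sketch in Python, assuming 8 KiB blocks as stated in the comment above each run (the helper name is made up for illustration):

```python
KIB = 1024
MIB = 1024 * KIB

def bandwidth_mib_s(iops: float, block_bytes: int) -> float:
    """Bandwidth in MiB/s implied by an IOPS figure and a block size."""
    return iops * block_bytes / MIB

random_bw = bandwidth_mib_s(19_100, 8 * KIB)      # fio reported 149 MiB/s
sequential_bw = bandwidth_mib_s(85_500, 8 * KIB)  # fio reported 668 MiB/s

print(f"random:     {random_bw:.0f} MiB/s")
print(f"sequential: {sequential_bw:.0f} MiB/s")
print(f"ratio:      {85_500 / 19_100:.1f}x")
```

Both figures agree with what fio printed, so the gap of roughly 4.5x comes down to how many 8 KiB requests complete per second. Note that 149 MiB/s only matches 19.1k IOPS with 8 KiB blocks, not 4k.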

With buffered I/O, random reads stay at ~19k IOPS, while sequential reads get to ~1M IOPS (thanks to read-ahead, either at the OS level, or in the SSD).

So part of this is sequential reads benefiting from implicit "prefetching", which reduces the observed cost of a page. But for random I/O there's no such thing, and so it seems more expensive.

It's more complex (e.g. sequential reads allow issuing larger reads), of course.

2 comments

This is why I love this site. Thank you for sharing your data, not just opinion.

> fio --filename=device name --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly

> -> read: IOPS=19.1k, BW=149MiB/s (156MB/s)(4473MiB/30001msec)

Isn't this too low? On my non-RAID configuration I get almost 2GB/s with the exact same command. Samsung 980 PRO 1TB.

Damn, I copied the wrong command. I wanted to copy this one:

fio --filename=/dev/md127 --direct=1 --rw=randread --bs=8k --ioengine=io_uring --iodepth=1 --runtime=120 --numjobs=1 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly

i.e. with iodepth=1 and numjobs=1. Because this is what "mimics" index scan (without prefetch) on cold data. More or less.

The command I posted earlier does ~10GB/s on my RAID, which matches your data.
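To put the two commands side by side: at iodepth=1 with one job, each read has to finish before the next one is issued, so IOPS is just the inverse of per-request latency. A rough sketch of that arithmetic using the figures quoted in this thread. It assumes Little's law applies with the queue kept full, and it reads "~10GB/s" as 10e9 bytes/s with 4 KiB blocks, so treat the deep-queue numbers as ballpark only:

```python
# Queue depth 1: one request in flight, so latency = 1 / IOPS.
qd1_iops = 19_100
latency_us = 1e6 / qd1_iops
print(f"latency per 8K random read at depth 1: {latency_us:.0f} us")

# Deep-queue run: ~10 GB/s with 4 KiB blocks (assumed decimal GB).
deep_iops = 10e9 / 4096
print(f"implied IOPS at iodepth=256, numjobs=4: {deep_iops / 1e6:.1f}M")

# Little's law: requests in flight = throughput * latency.
# Up to 256 * 4 = 1024 requests may be outstanding (assumes a full queue),
# so the average latency this implies is:
in_flight = 256 * 4
deep_latency_us = in_flight / deep_iops * 1e6
print(f"implied average latency with {in_flight} in flight: {deep_latency_us:.0f} us")
```

Under those assumptions each request in the deep-queue run takes longer (hundreds of microseconds versus about 52), yet total throughput is over 100x higher because so many requests overlap. That overlap is what a depth-1 reader gives up.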

> Because this is what "mimics" index scan (without prefetch) on cold data. More or less.

This is an interesting observation, but does it really mimic an index scan? This would essentially be a worst-case scenario. Submitting I/O requests one by one would be a very inefficient way to handle scans, no?

True. Unfortunately, it's what index scans in Postgres do right now: it's the last "major" scan type not supporting some sort of prefetch (posix_fadvise or AIO). We're working on it, and hopefully it'll get into PG19.
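For anyone wondering what posix_fadvise-style prefetch looks like in general, here is a minimal sketch in Python. It is not the Postgres implementation; the function name, the block size, and the prefetch distance are all made up for illustration, and it assumes Linux with buffered (non-direct) I/O. The idea is to hint the kernel about the next few blocks before blocking on the current one, so more than one read can be in flight:

```python
import os

BLOCK = 8192  # assumed 8 KiB pages, as in the fio runs

def read_blocks_with_prefetch(fd, block_numbers, distance=4):
    """Read blocks in the given order, hinting upcoming ones to the kernel.

    `distance` is a hypothetical prefetch depth, not a Postgres setting.
    """
    # Prime the window: advise the first `distance` blocks up front.
    for blk in block_numbers[:distance]:
        os.posix_fadvise(fd, blk * BLOCK, BLOCK, os.POSIX_FADV_WILLNEED)

    pages = []
    for i, blk in enumerate(block_numbers):
        # Keep the window full: advise the block `distance` positions ahead.
        if i + distance < len(block_numbers):
            ahead = block_numbers[i + distance]
            os.posix_fadvise(fd, ahead * BLOCK, BLOCK, os.POSIX_FADV_WILLNEED)
        # This read still blocks, but the data may already be in page cache.
        pages.append(os.pread(fd, BLOCK, blk * BLOCK))
    return pages
```

The hint is advisory, so the kernel is free to ignore it, and it only helps when the caller knows the upcoming block numbers ahead of time.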