by pgaddict 151 days ago

There probably is some additional inefficiency when reading pages randomly (compared to sequential reads), but most of the difference is at the storage level. That is, SSDs can handle a lot of random I/O, but the throughput is nowhere close to what they deliver for sequential reads.

For example, I have a RAID0 with 4 SSDs (Samsung 990 PRO, so consumer, but quite good for reads). And this is what fio says:

# random reads, 8K, direct IO, depth=1

fio --filename=device name --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly

-> read: IOPS=19.1k, BW=149MiB/s (156MB/s)(4473MiB/30001msec)

# sequential reads, 8K, direct IO, depth=1

fio --filename=/dev/md127 --direct=1 --rw=read --bs=8k --ioengine=io_uring --iodepth=1 --runtime=30 --numjobs=1 --time_based --group_reporting --name=random-1 --eta-newline=1 --readonly

-> read: IOPS=85.5k, BW=668MiB/s (700MB/s)(19.6GiB/30001msec)

With buffered I/O, random reads stay at ~19k IOPS, while sequential reads get to ~1M IOPS (thanks to read-ahead, either at the OS level, or in the SSD).

So part of this is sequential reads benefiting from implicit "prefetching", which reduces the observed cost of a page. But for random I/O there's no such thing, and so it seems more expensive.

It's more complex (e.g. sequential reads allow issuing larger reads), of course.
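
As a cross-check on the fio numbers above, the reported bandwidth should equal IOPS times block size, and at queue depth 1 with a single job the average time per request is roughly the inverse of IOPS. A minimal sketch of that arithmetic follows; the helper names are made up for illustration, and the latency figure is only an approximation that assumes exactly one request in flight at a time.

```python
KIB = 1024
MIB = 1024 * 1024


def bandwidth_mib_s(iops, block_bytes):
    """Bandwidth in MiB/s implied by an IOPS figure and a block size."""
    return iops * block_bytes / MIB


def latency_us(iops, queue_depth=1):
    """Approximate average time per request in microseconds.

    Assumes `queue_depth` requests are always in flight (Little's law),
    which for iodepth=1, numjobs=1 reduces to 1 / IOPS.
    """
    return queue_depth / iops * 1_000_000


block = 8 * KIB           # --bs=8k
random_iops = 19_100      # fio result for --rw=randread
sequential_iops = 85_500  # fio result for --rw=read

print(f"random:     {bandwidth_mib_s(random_iops, block):.0f} MiB/s, "
      f"{latency_us(random_iops):.1f} us per read")
print(f"sequential: {bandwidth_mib_s(sequential_iops, block):.0f} MiB/s, "
      f"{latency_us(sequential_iops):.1f} us per read")
print(f"sequential/random ratio: {sequential_iops / random_iops:.1f}x")
```

The computed bandwidths line up with the 149MiB/s and 668MiB/s that fio printed, which also confirms that both results come from 8K requests.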

2 comments

i_think_so 151 days ago

This is why I love this site. Thank you for sharing your data, not just opinion.


menaerus 151 days ago

> fio --filename=device name --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly

> -> read: IOPS=19.1k, BW=149MiB/s (156MB/s)(4473MiB/30001msec)

Isn't this too low? On my non-RAID configuration I get almost 2GB/s with the exact same command. Samsung 980 PRO 1TB.


pgaddict 150 days ago

Damn, I copied the wrong command. I wanted to copy this one:

fio --filename=/dev/md127 --direct=1 --rw=randread --bs=8k --ioengine=io_uring --iodepth=1 --runtime=120 --numjobs=1 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly

i.e. with iodepth=1 and numjobs=1. Because this is what "mimics" index scan (without prefetch) on cold data. More or less.

The command I posted earlier does ~10GB/s on my RAID, which matches your data.


menaerus 150 days ago

> Because this is what "mimics" index scan (without prefetch) on cold data. More or less.

This is an interesting observation but does it really mimic the index scan? This would be essentially a worst case scenario. Submitting IO requests one by one would be a very inefficient way to handle scans, no?


pgaddict 149 days ago

True. Unfortunately it's what index scans in Postgres do right now - it's the last "major" scan type not supporting some sort of prefetch (posix_fadvise or AIO). We're working on it, hopefully it'll get into PG19.
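
To illustrate the general idea of posix_fadvise-based prefetching mentioned here, the following is a rough sketch and not how Postgres implements it. The function name `read_blocks` and the `prefetch_distance` parameter are hypothetical. With `prefetch_distance=0` it reads one block at a time, similar to the iodepth=1 fio run; with a positive distance it hints upcoming blocks to the kernel before they are needed. Note that this hint applies to buffered reads through the page cache, so it is not directly comparable to the direct I/O fio runs in this thread.

```python
import os

BLOCK = 8192  # 8K pages, matching --bs=8k in the fio commands


def read_blocks(fd, block_numbers, prefetch_distance=0):
    """Read blocks in the given order, optionally hinting upcoming ones.

    Illustrative only: names and structure are assumptions, not Postgres code.
    """
    can_hint = prefetch_distance > 0 and hasattr(os, "posix_fadvise")

    def hint(index):
        # Ask the kernel to start loading this block into the page cache.
        if can_hint and index < len(block_numbers):
            offset = block_numbers[index] * BLOCK
            os.posix_fadvise(fd, offset, BLOCK, os.POSIX_FADV_WILLNEED)

    # Ramp up: hint the first few blocks before reading anything.
    for i in range(prefetch_distance):
        hint(i)

    pages = []
    for i, blkno in enumerate(block_numbers):
        # Keep the hint window `prefetch_distance` blocks ahead of the read.
        hint(i + prefetch_distance)
        pages.append(os.pread(fd, BLOCK, blkno * BLOCK))
    return pages
```

Both modes return the same data; the only difference is whether the kernel gets a chance to overlap the I/O for later blocks with the processing of earlier ones.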
