by hiAndrewQuinn
363 days ago
It's true that with small files, my primary interest is simply to avoid unnecessary wear on my disk. However, I also often work on large files, usually for local data processing. "This optimization [of putting files directly into RAM instead of trusting the buffers] is unnecessary" was an interesting claim, so I decided to put it to the test with `time`.

$ # Drop any disk caches first.
$ sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
$
$ # Read a 3.5 GB JSON Lines file from disk.
$ time wc -l /home/andrew/Downloads/kaikki.org-dictionary-Finnish.jsonl
255111 /home/andrew/Downloads/kaikki.org-dictionary-Finnish.jsonl
real 0m2.249s
user 0m0.048s
sys 0m0.809s
$ # Now with caching.
$ time wc -l /dev/shm/kaikki.org-dictionary-Finnish.jsonl
255111 /dev/shm/kaikki.org-dictionary-Finnish.jsonl
real 0m0.528s
user 0m0.028s
sys 0m0.500s
$
$ # Drop caches again, just to be certain.
$ sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
$
$ # Read that same 3.5 GB JSON Lines file from /dev/shm.
$ time wc -l /dev/shm/kaikki.org-dictionary-Finnish.jsonl
255111 /dev/shm/kaikki.org-dictionary-Finnish.jsonl
real 0m0.453s
user 0m0.049s
sys 0m0.404s
Compared to the first read there is indeed a large speedup, from 2.2s down to under 0.5s. After the file had been loaded into cache from disk by the first `wc --lines`, however, the difference dropped to /dev/shm being about 20% faster. Still significant, but not game-changingly so.

I'll probably come back to this and run more tests with some of the more complex `jq` queries I have, to see if we stay at that 20% mark or if it gets faster or slower.
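Single `time` runs like the ones above can be noisy, so a follow-up test would probably want several runs per path. A minimal sketch of such a harness follows; the function name `bench` is hypothetical, and it assumes GNU `date` (whose `%N` prints nanoseconds) and `awk` are available.

```shell
#!/bin/sh
# bench N CMD [ARGS...]: run CMD N times, print the wall-clock seconds of each run.
# Assumption: GNU date, where %N expands to nanoseconds.
bench() {
    runs=$1; shift
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s.%N)
        "$@" > /dev/null          # discard the command's normal output
        end=$(date +%s.%N)
        # The shell has no floating-point arithmetic, so subtract in awk.
        awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", e - s }'
        i=$((i + 1))
    done
}
```

For example, `bench 5 wc -l /dev/shm/kaikki.org-dictionary-Finnish.jsonl` would print five timings. The harness does not drop caches itself, so only the first run after a manual `drop_caches` is a cold read.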
|
1 - Programs such as wc (or jq) do sequential reads, which benefit from file systems optimistically prefetching contents in order to reduce read delays.
2 - Check to see if file access time tracking is enabled for the disk-based file system (see mount(8)). This may explain some of the 20% difference.
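One way to act on the second point is to look at the mount options recorded in /proc/mounts. The sketch below is Linux-only, and the helper name `atime_mode` is made up for illustration. It reports whether each mounted file system lists `noatime`, `relatime`, or `strictatime`.

```shell
#!/bin/sh
# Classify the access-time policy named in a comma-separated mount options string.
atime_mode() {
    case ",$1," in
        *,noatime,*)     echo noatime ;;
        *,relatime,*)    echo relatime ;;
        *,strictatime,*) echo strictatime ;;
        *)               echo unspecified ;;
    esac
}

# /proc/mounts fields: device, mount point, fs type, options, dump, pass.
if [ -r /proc/mounts ]; then
    while read -r _dev mnt fstype opts _rest; do
        printf '%s\t%s\t%s\n' "$mnt" "$fstype" "$(atime_mode "$opts")"
    done < /proc/mounts
fi
```

If the disk-based file system shows `relatime` (the usual Linux default), repeated reads of the same file should rarely trigger access-time writes, so this factor would likely account for little of the gap.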