Hacker News
by hiAndrewQuinn 363 days ago
It's true that with small files, my primary interest is simply to avoid wearing out my disk unnecessarily. However, I also often work on large files, usually for local data processing.

"This optimization [of putting files directly into RAM instead of trusting the buffers] is unnecessary" was an interesting claim, so I decided to put it to the test with `time`.

    $ # Drop any disk caches first.
    $ sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
    $ 
    $ # Read a 3.5 GB JSON Lines file from disk.
    $ time wc -l /home/andrew/Downloads/kaikki.org-dictionary-Finnish.jsonl 
    255111 /home/andrew/Downloads/kaikki.org-dictionary-Finnish.jsonl

    real 0m2.249s
    user 0m0.048s
    sys 0m0.809s

    $ # Now with caching.
    $ time wc -l /dev/shm/kaikki.org-dictionary-Finnish.jsonl 
    255111 /dev/shm/kaikki.org-dictionary-Finnish.jsonl
    
    real 0m0.528s
    user 0m0.028s
    sys 0m0.500s

    $ 
    $ # Drop caches again, just to be certain.
    $ sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
    $ 
    $ # Read that same 3.5 GB JSON Lines file from /dev/shm.
    $ time wc -l /dev/shm/kaikki.org-dictionary-Finnish.jsonl 
    255111 /dev/shm/kaikki.org-dictionary-Finnish.jsonl

    real 0m0.453s
    user 0m0.049s
    sys 0m0.404s
Compared to the first read, there is indeed a large speedup, from 2.2s down to under 0.5s. Once the file had been loaded into cache from disk by the first `wc -l`, however, the difference dropped to /dev/shm being about 20% faster. Still significant, but not a game changer.

I'll probably come back to this and run more tests with some of the more complex `jq` query stuff I have to see if we stay at that 20% mark, or if it gets faster or slower.
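
For anyone who wants to repeat the runs above more than once per file, a rough sketch of a timing helper follows. It is not what I actually ran: the name `time_read` and the default run count are made up for illustration, it assumes GNU `date` (for `%N` nanoseconds), and it does not drop caches itself, so the `sudo` step shown earlier is still needed before a cold-cache run.

```shell
# Hypothetical helper: time `wc -l` on a file several times and print
# one elapsed time in milliseconds per run.
# Assumes GNU date, whose %N format gives nanoseconds.
time_read() {
    file=$1
    runs=${2:-3}   # number of repetitions, defaults to 3
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s%N)
        wc -l < "$file" > /dev/null
        end=$(date +%s%N)
        # Convert the nanosecond difference to whole milliseconds.
        echo "run $i: $(( (end - start) / 1000000 )) ms"
        i=$((i + 1))
    done
}
```

Something like `time_read /dev/shm/kaikki.org-dictionary-Finnish.jsonl 5` would then print five lines, one per run. On the disk copy, the first run after dropping caches is the cold read and the later ones are the cached reads.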

1 comment

A couple of things to consider when benchmarking RAM file I/O versus disk-based file system I/O.

1 - Programs such as wc (or jq) do sequential reads, which benefit from file systems optimistically prefetching contents in order to reduce read delays.

2 - Check to see if file access time tracking is enabled for the disk-based file system (see mount(8)). This may explain some of the 20% difference.
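
One way to check the second point is to look at the mount options of the file system that holds the file. The sketch below assumes a Linux system with `findmnt` from util-linux. The helper name `atime_mode` and its `none-listed` label are made up for illustration. It only classifies an options string and does not measure anything.

```shell
# Hypothetical helper: classify a comma-separated mount options string
# by the access-time option it lists, if any.
atime_mode() {
    case ",$1," in
        *,noatime,*)  echo "noatime" ;;      # access times are not updated
        *,relatime,*) echo "relatime" ;;     # access times updated only in some cases
        *)            echo "none-listed" ;;  # neither option appears in the string
    esac
}

# Assumes util-linux findmnt; -T selects the mount containing the path.
# Example, using the directory from the benchmark above:
#   atime_mode "$(findmnt -n -o OPTIONS -T /home/andrew/Downloads)"
```

If this prints `noatime` for the disk-based file system, access time tracking is probably not the explanation and the difference lies elsewhere.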