Though, read cache can be enabled manually by creating separate device via gcache(8). This is usually not required, because caching is done at the filesystem layer.
It's important to specify block size for uncached devices, of course. dd(1) with bs= option will surely work, and with cp(1) your mileage may vary, depending on whether underlying disk driver supports I/O with partial sector size or not.
Usually just specifying a reasonable blocksize works for me. bs=1m or so.
Without that it does literally take hours.
I suspect the default blocksize is really small (1?) and combined with uncached/unbuffered writes to slower devices, it just kills all performance outright.
Per the sibling comments, you just need to specify a sane block size. dd's default is really low and if you experiment a bit with 2M or around that you'll get near-theoretical throughput.
NB: Remember the units! Without the units you specify it as bytes or something insanely small like that. I've made that mistake more than once!
Though, read cache can be enabled manually by creating separate device via gcache(8). This is usually not required, because caching is done at the filesystem layer.
It's important to specify block size for uncached devices, of course. dd(1) with bs= option will surely work, and with cp(1) your mileage may vary, depending on whether underlying disk driver supports I/O with partial sector size or not.