--single-thread probably doesn't do what you think, it forces zstd to serialize I/O with the compression job and it doesn't mean compression itself is multi-threaded with -T1 [EDIT]. I can't try that anyway because my zstd version is slightly lower (1.3.3), but I can try pigz:
3,664,854,169 0:42 pigz -1 -p4
3,664,854,169 0:28 pigz -1 -p8 (default, also probably the max possible with my box)
Now I'm very curious where you got your copy of zstdmt. (I'm using stock Ubuntu packages for all of them.)
[EDIT] Was "it doesn't make compression itself multi-threaded", which can falsely imply that --single-thread seemingly enables multi-threaded compression but it doesn't ;)
Very interesting. It might be the case that zstd optimizes more for recent machines; zstd famously uses four different compression streams to maximize instruction-level parallelism and that might not work well in older machines. I haven't seen any machine where zstd is significantly slower than it should, but those machines I could test came from 2013 or later. Or either the RHEL package might have been optimized for recent machines. It would be interesting to test a binary optimized for the current machine (-march=native -mtune=native).
[EDIT] Was "it doesn't make compression itself multi-threaded", which can falsely imply that --single-thread seemingly enables multi-threaded compression but it doesn't ;)