|
Decent speedup on my 3970X: decompression wall time drops from 4.65 s to 3.58 s, about 23% less. Including this patch reduced the number of instructions in lzma_decoder.s (using gcc 8.3.1) by about 8% (4417 lines of asm vs 4824). With perf stat, an astonishing reduction in branch misses, from about 409M to about 105M.
Using the firefox example from the gist:
$ tar -cJf lib.tar.xz /usr/lib64/firefox
The xz shipped with the system:
$ perf stat xz -c -d lib.tar.xz > /dev/null
 Performance counter stats for 'xz -c -d lib.tar.xz':

          4,650.32 msec task-clock:u               #    1.000 CPUs utilized
                 0      context-switches:u         #    0.000 K/sec
                 0      cpu-migrations:u           #    0.000 K/sec
               591      page-faults:u              #    0.127 K/sec
    19,849,912,300      cycles:u                   #    4.269 GHz                      (83.33%)
       425,290,878      stalled-cycles-frontend:u  #    2.14% frontend cycles idle     (83.33%)
     1,831,640,390      stalled-cycles-backend:u   #    9.23% backend cycles idle      (83.34%)
    23,973,036,103      instructions:u             #    1.21  insn per cycle
                                                   #    0.08  stalled cycles per insn  (83.33%)
     2,939,144,233      branches:u                 #  632.031 M/sec                    (83.34%)
       409,371,860      branch-misses:u            #   13.93% of all branches          (83.33%)

       4.650679926 seconds time elapsed

       4.611657000 seconds user
       0.011931000 seconds sys
The patched xz:
$ git clone http://git.tukaani.org/xz.git
$ cd xz/src
$ patch -l -p1 < ../faster_lxma_decoder_x86.patch
$ cd .. ; ./autogen.sh && ./configure && make
$ cd src
$ LD_PRELOAD=./liblzma/.libs/liblzma.so perf stat ./xz/.libs/xz -c -d ../../lib.tar.xz > /dev/null
 Performance counter stats for './xz/.libs/xz -c -d ../../lib.tar.xz':

          3,578.54 msec task-clock:u               #    1.000 CPUs utilized
                 0      context-switches:u         #    0.000 K/sec
                 0      cpu-migrations:u           #    0.000 K/sec
               593      page-faults:u              #    0.166 K/sec
    15,186,685,715      cycles:u                   #    4.244 GHz                      (83.32%)
       108,663,507      stalled-cycles-frontend:u  #    0.72% frontend cycles idle     (83.32%)
     8,753,057,119      stalled-cycles-backend:u   #   57.64% backend cycles idle      (83.34%)
    27,322,182,837      instructions:u             #    1.80  insn per cycle
                                                   #    0.32  stalled cycles per insn  (83.35%)
     1,979,944,734      branches:u                 #  553.282 M/sec                    (83.34%)
       104,752,154      branch-misses:u            #    5.29% of all branches          (83.34%)

       3.578973194 seconds time elapsed

       3.549329000 seconds user
       0.011942000 seconds sys
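
For anyone who wants to check the ratios rather than eyeball them, here is a rough sketch in Python that recomputes the derived figures from the two runs. The counter values are copied by hand from the perf stat output above; the dictionary keys and the `derived` helper are my own naming for illustration, not anything perf provides. These are single runs, so treat the results as indicative only.

```python
# Counter values copied by hand from the two perf stat runs above.
# Key names are illustrative only; perf does not emit this structure.
system_xz = {
    "seconds": 4.650679926,
    "cycles": 19_849_912_300,
    "instructions": 23_973_036_103,
    "branches": 2_939_144_233,
    "branch_misses": 409_371_860,
}
patched_xz = {
    "seconds": 3.578973194,
    "cycles": 15_186_685_715,
    "instructions": 27_322_182_837,
    "branches": 1_979_944_734,
    "branch_misses": 104_752_154,
}


def derived(run):
    """Return instructions-per-cycle and branch miss rate for one run."""
    return {
        "ipc": run["instructions"] / run["cycles"],
        "miss_rate": run["branch_misses"] / run["branches"],
    }


# Wall-clock speedup of the patched build over the system build.
speedup = system_xz["seconds"] / patched_xz["seconds"]
# Fraction of branch misses eliminated by the patch.
miss_reduction = 1 - patched_xz["branch_misses"] / system_xz["branch_misses"]

for name, run in (("system", system_xz), ("patched", patched_xz)):
    d = derived(run)
    print(f"{name:8s} IPC {d['ipc']:.2f}  branch miss rate {d['miss_rate']:.2%}")
print(f"speedup {speedup:.2f}x, branch misses down {miss_reduction:.1%}")
```

The IPC and miss-rate values it computes should match the figures perf prints in the comment column of each run, which is a quick sanity check that the numbers were copied correctly.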
|