by rep_lodsb
189 days ago
Only the Z80 refetched the entire instruction on every iteration; x86 never did it this way. Each bus transfer (read or write) takes multiple clocks:

  CPU                      Clocks per transfer   Transfer size       Min. clocks per byte (block move)
  Z80 instruction fetch    4                     byte                -
  Z80 data read/write      3                     byte                6
  80(1)88, V20             4                     byte                8
  80(1)86, V30             4                     byte/word           4
  80286, 80386 SX          2                     byte/word           2
  80386 DX                 2                     byte/word/dword     1
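The last column follows from the other two: a block move has to read each unit once and write it once, so the minimum is two transfers' worth of clocks divided by the transfer width. Here is a small Python sketch of that arithmetic. It assumes zero wait states, and the names `min_clocks_per_byte`, `BUSES` and `LDIR_LENGTH` are made up for illustration. It also uses the Z80 instruction-fetch row to price LDIR's per-iteration opcode refetch.

```python
# Minimum clocks per byte for a memory-to-memory block move, assuming zero wait
# states and that every unit is read once and written once (values from the table).

def min_clocks_per_byte(clocks_per_transfer: int, bytes_per_transfer: int) -> float:
    """One read transfer plus one write transfer, spread over the transfer width."""
    return 2 * clocks_per_transfer / bytes_per_transfer


# (CPU, clocks per bus transfer, widest data transfer in bytes)
BUSES = [
    ("Z80", 3, 1),
    ("8088 / V20", 4, 1),
    ("8086 / V30", 4, 2),
    ("80286 / 80386 SX", 2, 2),
    ("80386 DX", 2, 4),
]

for cpu, clocks, width in BUSES:
    print(f"{cpu:<18} {min_clocks_per_byte(clocks, width):g}")

# LDIR refetches its 2 opcode bytes every iteration at 4 clocks per fetch,
# so each byte moved costs at least 2 * 4 + 6 = 14 clocks on the Z80.
Z80_FETCH_CLOCKS = 4
LDIR_LENGTH = 2
ldir_min = LDIR_LENGTH * Z80_FETCH_CLOCKS + min_clocks_per_byte(3, 1)
print(f"{'Z80 LDIR':<18} {ldir_min:g}")
```

Zilog documents a repeating LDIR iteration at 21 clocks. That means roughly 7 clocks per byte go to updating the address and count registers and repeating the instruction, on top of the 14-clock bus minimum.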
LDIR (and the other Z80 block instructions) are 2 bytes long, so refetching them costs 8 extra clocks per iteration. Updating the address and count registers adds further overhead.

The microcode loop used by the 8086/8088 also had overhead, and this was reduced in the following generations. After that, the microcoded block move (REP MOVS) became somewhat neglected, because compilers and runtime libraries preferred to use sequences of vector instructions instead. On modern processors, cache lines and paging add many complications. So there is always some unavoidable overhead at the start to align everything properly, even if the transfer rate afterwards is close to optimal.
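That start-up alignment overhead usually takes the form of a head/body/tail split. The routine copies byte by byte until the destination reaches a cache-line boundary, then copies whole lines, then finishes the leftover bytes. What follows is only a Python illustration of that idea, under stated assumptions. The 64-byte line size, the function names `split_for_alignment` and `copy_with_alignment`, and the `dst_addr` parameter (a pretend address for the destination) are all hypothetical. The slice assignment merely stands in for wide vector stores, and real memcpy implementations differ in detail.

```python
# Sketch of the head/body/tail split done before a fast copy loop.
# LINE is an assumed cache-line size; dst_addr is a pretend address for dst[0].

LINE = 64  # assumed cache-line size in bytes


def split_for_alignment(dst_addr: int, length: int, line: int = LINE) -> tuple[int, int, int]:
    """Return (head, body, tail) byte counts so that the body starts on a
    line boundary of the destination and is a whole number of lines long."""
    head = min((-dst_addr) % line, length)  # bytes until the next boundary
    body = (length - head) // line * line   # whole lines only
    tail = length - head - body             # leftover bytes
    return head, body, tail


def copy_with_alignment(dst: bytearray, dst_addr: int, src: bytes) -> None:
    """Copy src into dst (len(dst) >= len(src)), pretending dst[0] is at dst_addr."""
    head, body, tail = split_for_alignment(dst_addr, len(src))
    # Head: byte at a time until the destination is line-aligned.
    for i in range(head):
        dst[i] = src[i]
    # Body: one whole line per step (stand-in for wide/vector stores).
    for off in range(head, head + body, LINE):
        dst[off:off + LINE] = src[off:off + LINE]
    # Tail: remaining bytes after the last full line.
    for i in range(head + body, len(src)):
        dst[i] = src[i]
```

For a 200-byte copy to address 0x1005, `split_for_alignment` returns `(59, 128, 13)`. That is 59 bytes before the first boundary, two full lines, and 13 trailing bytes. Only the middle part runs at full speed.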
Moreover, the cache memories used with 286/386SX/386DX systems were normally write-through, so they shortened only read cycles, not write cycles. Such caches were very effective at reducing the performance cost of instruction fetching, but they brought little or no improvement to block transfers. The caches were also very small, so any sizable block transfer would flush the entire cache, and all transfers would then be done at DRAM speed.
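A toy model shows both effects: writes always go to DRAM, and a copy larger than the cache evicts everything that was cached before. This Python sketch is not a model of any specific chipset. The direct-mapped organization, no write-allocate policy, 4-byte lines, 2 KiB size, and the names `WriteThroughCache` and `block_copy` are all assumptions chosen to keep it small. Each access is treated as one 4-byte bus transfer.

```python
# Toy model of a small write-through cache during a block copy.
# Assumptions (hypothetical, not a specific chipset): direct-mapped,
# no write-allocate, 4-byte lines, one 4-byte bus transfer per access.

LINE_SIZE = 4    # bytes per cache line (assumed)
NUM_LINES = 512  # 512 * 4 = 2 KiB cache (assumed)


class WriteThroughCache:
    def __init__(self, num_lines: int, line_size: int) -> None:
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # memory line held by each slot
        self.dram_reads = 0
        self.dram_writes = 0

    def _slot_and_tag(self, addr: int) -> tuple[int, int]:
        line = addr // self.line_size
        return line % self.num_lines, line

    def read(self, addr: int) -> None:
        slot, tag = self._slot_and_tag(addr)
        if self.tags[slot] != tag:  # miss: go to DRAM and fill the slot
            self.dram_reads += 1
            self.tags[slot] = tag
        # hit: served from the cache, no DRAM access

    def write(self, addr: int) -> None:
        # Write-through: every write reaches DRAM, hit or miss.
        # No write-allocate: a miss does not fill the slot. The model tracks
        # only tags, so a hit needs no update here.
        self.dram_writes += 1

    def holds(self, addr: int) -> bool:
        slot, tag = self._slot_and_tag(addr)
        return self.tags[slot] == tag


def block_copy(cache: WriteThroughCache, src: int, dst: int, nbytes: int) -> None:
    for off in range(0, nbytes, cache.line_size):
        cache.read(src + off)
        cache.write(dst + off)


cache = WriteThroughCache(NUM_LINES, LINE_SIZE)
HOT = range(0, NUM_LINES * LINE_SIZE, LINE_SIZE)  # e.g. a program's hot code/data
for addr in HOT:
    cache.read(addr)

cache.dram_reads = cache.dram_writes = 0
block_copy(cache, 0x10000, 0x20000, 8 * 1024)  # 8 KiB copy, 4x the cache size
print("DRAM reads:", cache.dram_reads, "DRAM writes:", cache.dram_writes)
print("hot lines still cached:", sum(cache.holds(a) for a in HOT))
```

With these assumed sizes, the 2048 transfers cause 2048 DRAM reads and 2048 DRAM writes. None of the 512 previously cached lines survive the copy.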