Hacker News new | ask | show | jobs
by Someone 651 days ago
> a bygone era where computers executed 1 instruction per cycle without pipelining or prediction

1 instruction per cycle? What luxury bygone era did you grow up in?

Wikipedia tells me the algorithm is from 1962 on an IBM 1401 (https://en.wikipedia.org/wiki/Bresenham's_line_algorithm#His...

That definitely didn’t have many single cycle instructions. Skimming https://ibm-1401.info/A24-6447-0_1401_1460_Instruction_and_T..., I couldn’t find any.

Certainly, in the era of 8-bit CPUs like Z80 and 6502 programmers would have been lyric about “1 instruction per cycle”

Actually, did any CPU ever “execute 1 instruction per cycle without pipelining or prediction” (or, slightly looser “had fixed time instructions without pipelining or prediction”)?

RISC introduced fixed time instructions, but also pipelining.

5 comments

Adsp-218x family of harvard architecture DSPs. Each 25 ns clock cycle = 1 MAC, 1 data fetch, and one instruction fetch. All instructions 1 cycle long. And many other gizmos like reversible address bit ranges for DFT in place, separate register bank for IRQ handlers. And all laid out by hand.
no, you're right, you almost always need pipelining to get one instruction per clock cycle

but there are a lot of cpus out there—maybe the majority now—that are pipelined microarchitectures that get one instruction per cycle without much or any prediction. avrs, most of the cortex-m* (all?), most modern implementations of old slow isas like the z80 and 8051, etc. big processors like your laptop cpu and cellphone cpu are of course superscalar, but they are a tiny minority of all processors. even inside the cellphone case, they're outnumbered by in-order scalar microcontrollers

without prediction, of course, you have a pipeline bubble every time you have a branch, so you never quite hit 1 ipc with scalar execution. but it's usually pretty close, and even with prediction, sometimes you miss. and usually if you have branch prediction, you also have a cache, because ain't nobody got time to share one triflin memory bus between instructions and data

so pipelining gets you from, say, 3 clocks per instruction down to 1.2 or so. then prediction gets you from 1.2 down to, say, 1.02

I remember the bragging point on the RTX2000 in the very late eighties was "a MIP per megahertz".
my limited experience with stack machines is that a stack mip is about half a register-machine mip :-(

consider this simple assembly-language subroutine i wrote in october (http://canonical.org/~kragen/sw/dev3/tetris.S, screencast at https://asciinema.org/a/622461):

            @@ Set sleep for iwait to r0 milliseconds.
            @@ (r0 must be under 1000)
            .thumb_func
    waitis: ldr r2, =wait           @ struct timeval
            movs r3, #0             @ 0 sec
            str r3, [r2]            @ .tv_sec = 0
            ldr r1, =1000           @ multiplier for ms
            mul r0, r1
            str r0, [r2, #4]        @ set .tv_usec
            bx lr

            .bss
    wait:   .fill 8                 @ the struct timeval
            .text
these are all perfectly normal register-machine instructions; you could translate them one-to-one to almost any register machine. on a few of them you could drop one, writing something like str #0, [wait]. the whole function is straight-line execution, a single basic block, and seven instructions long. this is almost a best case for a stack machine; it looks like this:

    lit(0)      ; load immediate constant
    lea(wait)   ; load address of wait onto stack
    !           ; store 0 in wait
    lit(1000)   ; another constant
    *           ; multiply argument by constant
    lea(wait)   ; load address again
    lit(4)      ; offset of .tv_usec
    +           ; calculate address of wait.tv_usec
    !           ; store product in tv_usec
    ret         ; return from subroutine 
that's 10 instructions, about 50% longer than the 6 or 7 of the two-operand register machine. but the typical case is worse. and basically the reason is that the average number of stack manipulations or constant pushes that you need to do to get the right operands on top of the stack for your memory accesses and computational operations is roughly 1. sometimes you'll have a * or ! (store) or + that's not preceded by a stack manipulation or a load-immediate or a load-address operation, and that time the stack machine wins, but other times it'll be preceded by two or three of them, and that time the stack machine loses

so it averages out to about two stack instructions per register-machine instruction. call and return is faster on the stack machine, but passing arguments and return values in registers on the register machine can take away most of that advantage too

the rtx-2000 did some tricks to sometimes do more than a single stack operation per cycle, but it didn't really escape from this

this doesn't necessarily mean that stack machines like the rtx-2000 are a bad design approach! the design rationale is that you get very short path lengths, so you can clock the design higher than you could clock a design that had register-file muxes in the critical path, and you also avoid the branching penalty that pipelines impose on you, and you use less silicon area. mainstream computation took a different path, but plausibly the two-stack designs like the rtx-2000, the novix nc4000, the shboom, the mup21, and the greenarrays f18a could have been competitive with the mainstream risc etc. approach. but you do need a higher clock rate because each instruction does less

i don't remember if dr. ting wrote a book about the rtx-2000, but he did write a very nice book about its predecessor the nc4000, going into some of these tricks: https://www.forth.org/OffeteStore/4001-footstepsFinal.pdf

"oh these new-fangled kids what with their superscalar processors. 100 instructions in a cycle, phooey! Back in my day, it was dang gummed 100 cycles for each instruction, and gosh darnit we liked it! Now that was just for the add instruction, a division, well some say they never figured out how many cycles that was because it took too long. I had an onion in my belt which was the style at the time"
Probably meant 'cycle' in the sense of instruction cycle, rather than clock cycle.
Even then, few CPUs execute all instructions in the same amount of time. Division, for example, still takes longer than a register move on most architectures. Multiplication, historically, did too.

As a sibling comment points out, DSPs can be an exception. I’m far from an expert on them, but CPUs that try to run all instructions in the same fixed amount of time used to accomplish that by avoiding complex addressing modes and by omitting the really slow instructions (division and instructions that push/pop multiple registers onto/from the stack being prime examples).

IIRC, some of them also accomplished it by running some of the simpler instructions slower than necessary (raw speed isn’t as important as being hard real time isn many applications)

> few CPUs execute all instructions in the same amount of time. Division, for example, still takes longer than a register move on most architectures

easy solution, as you allude to, don't have a division instruction! arm doesn't, 8048 doesn't, sparcv7 doesn't, even the cdc 6600 and cray-1 didn't, and even risc-v finally got zmmul: https://wiki.riscv.org/display/HOME/Zmmul. it's not just dsps

the big issue with complex addressing modes is i think fault handling. if your nice orthogonal add instruction updates two registers as it reads operands and a third register with the sum of the operands, what do you do if you get a page fault on the second operand? if the os can service the page fault, how does it restart the instruction?

as you point out, real-time latency is also an issue. on older arm chips you have to be careful not to ldm or stm too many registers in a single instruction so as not to damage interrupt latency. newer arm chips can restart the ldm/stm

i'm no expert on the area either

> don't have a division instruction! arm doesn't, 8048 doesn't

Heh, that's an understatement. The 8048 doesn't even have a subtract instruction, much less divide.

cpl add cpl was good enough for my daddy and it's good enough for me