Thumb-2 immediate encoding is even more gleeful--in addition to allowing rotation, it also allows for spaced repetition of any 8-bit pattern (common in low level hack patterns, like from [1]) to be encoded in single instructions.
For those interested, check out page 122 of the ARMv7-M architecture reference manual[2]:
// ThumbExpandImm_C()
// ==================
(bits(32), bit) ThumbExpandImm_C(bits(12) imm12, bit carry_in)
if imm12<11:10> == ’00’ then
case imm12<9:8> of
when ’00’
imm32 = ZeroExtend(imm12<7:0>, 32);
when ’01’
if imm12<7:0> == ’00000000’ then UNPREDICTABLE;
imm32 = ’00000000’ : imm12<7:0> : ’00000000’ : imm12<7:0>;
when ’10’
if imm12<7:0> == ’00000000’ then UNPREDICTABLE;
imm32 = imm12<7:0> : ’00000000’ : imm12<7:0> : ’00000000’;
when ’11’
if imm12<7:0> == ’00000000’ then UNPREDICTABLE;
imm32 = imm12<7:0> : imm12<7:0> : imm12<7:0> : imm12<7:0>;
carry_out = carry_in;
else
unrotated_value = ZeroExtend(’1’:imm12<6:0>, 32);
(imm32, carry_out) = ROR_C(unrotated_value, UInt(imm12<11:7>));
return (imm32, carry_out)
The set of representable ARM immediates is really nice. It's wonderfully useful for writing soft-float and math library routines, where you have very common values with just some high-order bits set:
0x3f800000 // encoding of 1.0f
The set of immediate encodings, together with "shifts for free on most operations" (which are closely related features, as the OP points out), went a long way toward preserving my sanity when writing assembly.
Worth noting: thumb-2 immediates have a different (and even more interesting) encoding scheme. arm64 immediates are pretty interesting too (there the set of representable immediates is different depending on the instruction domain).
Honestly, with so few bits, I was expecting it to be a lookup table. (Yep, I've never wrote ARM assembly.)
But then, this way you have a nice set of imediates (as you said), and can set any value at all with at most 3 instructions at the rare case you need something different.
There's one ISA I've worked with before that does have -2, -1, 0, 1, 2, and some other "commonly used constants" like powers of 2 encoded specially in the immediate. I think it was an 8-bit, but I can't remember exactly which one. Anyone know what I'm referring to?
That's the scheme. Although probably MSP430 took its inspiration from the one I had in mind since I was thinking of an 8-bit MCU from the early 80s - might've been Motorola.
If the reply link is missing on a comment (it does that after the comments get to a certain depth of nesting) then click on the comment's "link" and you'll get a text box you can reply into.
ARM's reference manual (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....). You need to create a (free?) account with ARM to download it. Of course, the document has been out for long enough that I'm sure it's available elsewhere for those truly paranoid about creating accounts.
The Tensilica guys took this thing an extra step. ie- profile real code to find out what constants are most typically used, enumerate the top n constants, encode the constant with 0..n-1 in the immediate instruction - the immediate value is a hardware based lookup. You can still do arbitrary immediates with longer instructions but you can apparently get some nice code size reductions using this technique.
The flipside of that is that the scheme described here will take less silicon, fewer transistors and thus . . . use less power, if you happen to optimise the code appropriately.
Code size reductions are good, but for power purposes it is a case of balancing them against decoding complexity.
The x86 encoding actually makes sense once you begin to write its instructions using octal instead of hexadecimal numbers: http://www.dabo.de/ccc99/www.camp.ccc.de/radio/help.txt
(this isn't mentioned in the Intel or AMD docs).
This encoding is nice given that they've already paid the price of having the 32b barrel shifter, but it was a non-obvious choice to have the barrel shifter to begin with. Most instructions don't benefit from the optional rotate, but they pay the price in the encoding and in the data path.
Interestingly, the website uses svg for illustrations, IE 8 and under be damned.
In the ARM instruction encoding, every arithmetic and logical instruction is "conditional". The destination register is either updated or not depending on the four bit condition field and the state of the condition flags in the processor.
As a simple contrived example, consider the following C code:
int a[100], b[100], count;
...
for (int i=0; i<100; i++) {
if (a[i] > b[i]) count++;
}
without conditional execution, one might compile this to code that uses a branch to either increment count or not; on ARM it would be more idiomatic to use conditional execution. Here's a very literal translation as an example (not tested, apologies for any inadvertent errors):
// setup: a in R0, b in R1, count in R2, i in R3.
loop: LDR R4, [R0, R3, LSL #2] // load a[i]
LDR R5, [R1, R3, LSL #2] // load b[i]
CMP R4, R5 // if a[i] > b[i]
ADDGT R2, R2, #1 // count++
ADD R3, R3, #1 // i++
CMP R3, #100 // if (i < 100)
BLT loop // continue loop
The fourth instruction, ADDGT, is conditional. Count is only updated with the result of the addition if the "greater than" condition is satisfied (the flags were set by the preceding instruction). To be more precise, all of the instructions here are conditional, it's just that for most of them the condition field is 1110, meaning "always".
Many instructions also have an "S" bit, which toggles whether or not they update the flags on which conditional execution depends. Taken together, these two features allow a clever assembly programmer to do some really clever things (but historically not too much effort has been directed at getting compilers to make really clever use of these features).
For low-power parts, this is a cute trick, as it allows a programmer to avoid stressing a limited branch predictor with lots of small branches. It does add some complication to the implementation however, especially when you get into designs that retire multiple instructions per cycle or support out-of-order execution, as conditional execution basically adds additional dependencies to every instruction.
I remember there was a "never" condition, which was present just for completeness; it turns out ARM eventually found that having 2^28 different NOPs would not be a good use of opcode space, so it's now a special extension for newer instructions...
I seem to recall from my ARM Assembler coding days that there was also a noop instruction, which of course could be conditional itself, so if you didn't actually want to do the NOOP, you could do NOOP-NE, which wouldn't do anything twice over.
From my days coding ARM assembly on the Acorn Archimedes, NOP was typically an alias for MOV R0,R0 (which effectively did nothing) rather than being its own instruction.
That's a beautiful explanation. It's one of my favorite things about the ARM instruction set.
That said, it also means debugging becomes a bit more painful. Let's say you want your (cheap) JTAG debugger to halt on the count++ instruction. You can hard break on that particular address in code, but you will always hit that address whether the condition was met or not.
This was part of the beauty of ARM when I learned it as a teenager back in the early 90s. Very simple and elegant, and writing ARM code by hand was enjoyable. Coming back to ARM now, though, in this form of Cortex-M microcontrollers, I see that things have become muddied with things like if-then-else instructions and mixed 16-bit/32-bit Thumb-2 code.
Simple, elegant, and it eats an astonishing 12.5% of instruction bandwidth (4 bits out of 32). A branch will require less space as soon as you want to conditionally execute over 8 instructions. On top of that, for that is executed unconditionally (in practice, maybe not most of the instructions if you look at the binary, but almost certainly most of the instructions if you count ones executed multiple times)
arm64 ... sort of ditches conditional execution. It’s not on every instruction any more, but it’s still available on more instructions than on most other arches.
To the usual complement of typical conditional instructions (branch, add/sub with carry, select and set), arm64 adds select with increment, negate, or inversion, the ability to conditionally set to -1 as well as +1, and the ability to conditionally compare and merge the flags in a fairly flexible manner (it’s really a conditional select of condition flags between the result of a comparison and an immediate). This actually preserves most of the power of conditional execution (except for really exotic hand-coded usages), while taking up much less encoding space.
Your last sentense made me realise why ARM64 no longer has conditional instructions. Obviously it's designed for higher power situations than most ARM cores, and OOO is an important part of doing that efficiently. reduced instruction deps == better OOO.
"Almost all ARM instructions can include an optional condition code. An instruction with a condition code is only executed if the condition code flags in the CPSR meet the specified condition. The condition codes that you can use are shown in Table 4.2."
f.e. execute this instruction only if the previous instruction resulted in a negative number
In ARM state, all instructions are conditionally executed according to the state of the
CPSR condition codes and the instruction’s condition field. This field (bits 31:28)
determines the circumstances under which an instruction is to be executed. If the state
of the C, N, Z and V flags fulfils the conditions encoded by the field, the instruction is
executed, otherwise it is ignored.
When condition is set the mnemonic of instruction is extended with one of suffixes like EQ, NE, CS, CC etc.
There are a few options for producing other values:
1. load them from a constant pool in memory (often this is put nearly inline in the instruction stream so that it can be addressed via a small constant offset from PC). If your code doesn't saturate the LSU locally, this is usually the best option.
2. assemble them via bitwise or arithmetic combinations of literals that you can represent (or already in-register values that you know something about). If you can do it with just a few operations, and the LSU is otherwise occupied, you should do this instead. This is also preferred on some limited cores that can dual-issue logical and arithmetic ops but not LSU ops.
3. in recent revisions of the instruction set, there are a pair of instructions MOVW and MOVT; MOVW conjures a 16b immediate, and MOVT sets the high 16b, so with these two instructions you can conjure any 32b value.
There is a bit of subtlety to choosing the correct approach. Some assemblers provide a "conjure this value" pseudo-operation where the assembler will choose what it thinks is the best option to materialize a constant; that way the programmer doesn't need to worry about these details. This looks something like the following:
LDR R0, =0xff0000ff
the assembler might actually stash the constant somewhere and generate a load instruction, or it might instead do:
MOV R0, 0xff000000
ORR R0, R0, 0x000000ff
and assemble the value via a couple instructions with immediates.
Here's what you can do to add a "complicated" constant stored elsewhere, in hand-crafted assembler:
add_something: ; function starts here (argument r0 == some number)
ldr r1, __tmp ; get complicated constant, store in r1
add r0, r0, r1 ; do the addition r0 = r0 + r1
bx lr ; == return result (in r0)
__tmp:
.word 0x12345678 ; store complicated constant here
You can play with your compiler, if you call gcc as "gcc -Os -S -o- file.c" if will spit out generated assembler code (-S) on stdout "-o-" for the c-code in file.c.
(but then, gcc prefers to have 4 "compact" adds, instead of loading a constant...)
$ cat dummy.c
int
add_random_number(int a)
{
return a + 0x12345678; /* guaranteed to be random */
}
$ arm-none-eabi-gcc -S -o- -Os dummy.c
(...)
add_random_number:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
add r0, r0, #301989888
add r0, r0, #3424256
add r0, r0, #5696
add r0, r0, #56
bx lr
In at least the ADT assembler and gas a few years back, the assembler provided syntactic sugar:
ldr r0,=0x123456578
assembles to
ldr r0,pc+xxx
... and then somewhere later ...
.word 0x12345678
The assembler had some default places it would put constant pools (end of a module?), or you could explicitly tell it to generate a constant pool if the default place would be outside the limit of the pc-relative addressing mode.
As others have said, that's not the case. All constants can be constructed by ORing together a small collection of expressible constants. With a little ingenuity "most" require very few instructions to build. FWIW, I have written real-time radar processing embedded software, and rarely had to resort to stored constants.
You're being a little disingenuous. It describes the ARM architecture as 'elegant', and 1/4 of the way through the article describes the common use of fixed length instructions in various RISC architectures as 'a good design', and explains why this is mostly a win. And then it explains how the ARM manages to encode a wide range of useful immediate values using a very elegant and simple scheme.
In all my years coding in ARM assembly language, the range of immediate values it supported was rarely if ever an issue.
I agree that this clever and useful, but I get the feeling that it could have been more so.
I haven't done much assembly in a while, but I was heavily into it once upon a time, and I recall that values with lots of 1s were useful. There is not quick way to generate those here. This means that we can write a single instruction to set any single bit using an inclusive OR and the proper immediate value, but we cannot write a single instruction to clear any single bit.
The reason I think a bit more cleverness might have helped is that there are so many values with multiple encodings. Anything where the 8-bit value ends with 0 has a different encoding as well. For example, a rotation of 0000 and an 8-bit value of 00000100 gives the same result as a rotation of 1111 and an 8-bit value of 00000001 (right?). Perhaps some of the redundant instructions could have been used to represent things ending in lots of 1s?
Regardless, an interesting and informative post. :-)
Clearing an arbitrary bit actually is supported (as mentioned in the article: "you can set, clear, or toggle any bit with one instruction").
The specific details require you to dig a bit past explaining just the immediate encoding, but in the clearing case there's a dedicated instruction for clearing the bit specified by the immediate:
BIC - Bit Clear (immediate) performs a bitwise AND
of a register value and the complement of an immediate
value, and writes the result to the destination register.
As I mentioned in my other post, zero-rotation encodings are gamed out as well (to allow byte repetition).
I love that the author described the ARM as "elegant, pragmatic, and quirky." It reminds me of Gordon Bell's PCP-6/PDP-10 architecture, but applied to the RISC rather than CISC philosophy.
(well, the PDP-10 was pretty RISC for its day and gave us things like BLT, hence bitblit).
AND with 0xefffffff would become BIC ("bit clear", aka "and not") with 0x10000000. But yes, the basic idea of your comment is correct. Fortunately, compilers are quite good at this sort of thing.
CMN is the real gem. It's an endless source of bugs in the time between when compiler writers discover it and when they figure out how it actually sets flags. ARM would have done well to provide a "here's how you actually use this instruction" guide in the architecture reference manual.
Aside from this kind of trick, I have not worked with a ton of ARM assembly but what I find is very frequently compilers will just put values in the text section close to where they are used and refer to them with some PC-relative thing, rather than using immediate values as often as they might on x86. I have seen this on other RISC platforms as well.
In order to have a single-instruction PR-relative load the offset needs to be small (though it uses a different immediate scheme for the offset). So in practice you usually tuck small constant islands between blocks of executable code when you adopt this approach.
That's what I was thinking. I also started wondering how a large structure might be packed, when you can relatively address every byte in the nearest 256 bytes but only every 4 bytes in the next KB and so on. It could sort of upend the way you might normally think about packing a struct.
The problem as I see it with this sort of cleverness is that it's difficult to optimize to this. It leads to quite variable best-case and worst-case scenarios and general unpredictability. As, say, a C programmer and not a compiler designer, you might unintentionally pick lots of values that won't fit into the "immediate" scheme (worst-case). Or you might force your design to use numbers that DO fit into this scheme (best-case, but a bleed of lower-level design decisions affecting higher-level design decisions).
These days if what you're writing is in C then you should really know the behaviour of the architecture underneath.
ARM really isn't that hard if you're already thinking like a low level programmer. Things like MIPS were (better) designed for being targeted from higher level languages, but the consequence is a much messier machine language. It's always struck me as amusing that the conventional view is MIPS is minimal, when ARM is really much more so, but it's from outside the Berkeley/Stanford RISC bubble so didn't really get on to their radars for some time.
There is lots of C code that is written to be portable, and trusts the compiler to generate optimized assembly. But those are probably ok with spending a couple of extra cycles for some immediate values.
It generates fewer than 4096 unique values, as some values map to the same thing. The most obvious being that 0 shifted any number of places produces 0, but also 0x10 shifted left 0 is the same as 0x04 shifted left 2 is the same as 0x01 shifted left 4.
Anything of the form 0x0000nn00 is representable, as it has at most eight contiguous non-zero bits starting at an even bit position (i.e. these values are 0x000000nn rotated right by 24). Maybe you're wondering how a rotation of 24 can be encoded in four bits? To get the rotation amount, you double the value of the four-bit field. Only even rotation counts are representable.
Sure they can, by setting the rotate bits to 0xC (1100). Have a play with the interactive widget near the bottom of the article.
Interestingly, there is some redundancy with this encoding meaning you can't represent a full 2^12 unique values. Eg 0x1 could be represented by 0x1 and 0x0 in the rotate field, 0x4 with 0x1 in the rotate field, 0x10 with 0x2, or 0x40 with 0x3. The same applies for every other combination - this means that there are only (2^12)/4 = 1024 unique values that are representable in total.
You didn't do that math right. Out of every 256 values at a particular offset, only a quarter of them are under 64 and can be encoded with rotate+1.
Doing a quick brute force test, it appears that there are 3073 unique values. I suppose that makes sense. Each new rotate introduces 192 new values that have at least one of the new bits set, and 192 * 16 = 3072. Then you have the number 0.
For those interested, check out page 122 of the ARMv7-M architecture reference manual[2]:
[1] http://graphics.stanford.edu/~seander/bithacks.html (worth a read on its own if you're into this kind of thing)[2] http://web.eecs.umich.edu/~prabal/teaching/eecs373-f10/readi... (no-registration link)