| HN Mirror


by xgk 86 days ago

> CPUs actually implements 5 distinct data types

Yes, that's true, but the registers themselves are untyped. What modern CPUs really implement is multiple instruction semantics over the same bit patterns. In short: same bits, five algebras! The algebras are given by different instructions operating on the same bit patterns.

Here is an example, the bit pattern 1011:

• as a non-negative integer: 11. ISA operations: Arm UDIV, RISC-V DIVU, x86 DIV

• as an integer residue mod 16: the class [11] in Z/16Z. ISA operations: Arm ADD, RISC-V ADD/ADDI, x86 ADD

• as a bit string: bits 3, 1, and 0 are set. ISA operations: Arm EOR, RISC-V ANDI/ORI/XORI, x86 AND

• as a binary polynomial: x^3 + x + 1. ISA operations: Arm PMULL, RISC-V clmul/clmulh/clmulr, x86 PCLMULQDQ

• as a binary polynomial residue modulo, say, x^4 + x + 1: the residue class of x^3 + x + 1 in GF(2)[x] / (x^4 + x + 1). ISA operations: Arm CRC32* / CRC32C*, x86 CRC32, RISC-V clmulr

And actually ... the floating-point numbers also have the same bit patterns and could, in principle, reside in the same registers. On modern ISAs, floats are usually implemented in a distinct register file.

You can use different functions in C on the bit patterns we call unsigned.
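To make the list above concrete, here is a minimal C sketch with one function per reading of the same bits. The function names are made up for illustration, and the loops are plain software stand-ins for what instructions like clmul do in hardware, not how any ISA implements them.

```c
#include <stdint.h>

/* Non-negative integer: ordinary truncating division (b must be nonzero). */
uint32_t div_nonneg(uint32_t a, uint32_t b) { return a / b; }

/* Integer residue mod 16: addition wraps around at 16. */
uint32_t add_mod16(uint32_t a, uint32_t b) { return (a + b) & 0xFu; }

/* Bit string: bitwise XOR, positions do not interact. */
uint32_t xor_bits(uint32_t a, uint32_t b) { return a ^ b; }

/* Binary polynomial: carry-less multiplication over GF(2).
   Assumes both inputs are below 2^16 so the product fits in 32 bits. */
uint32_t clmul16(uint32_t a, uint32_t b) {
    uint32_t r = 0;
    for (int i = 0; i < 16; i++)
        if ((b >> i) & 1u)
            r ^= a << i;          /* add (XOR) a shifted copy, no carries */
    return r;
}

/* Binary polynomial residue: reduce p modulo x^4 + x + 1 (bits 0x13). */
uint32_t reduce_mod_x4_x_1(uint32_t p) {
    for (int i = 31; i >= 4; i--)
        if ((p >> i) & 1u)
            p ^= 0x13u << (i - 4); /* cancel the term x^i */
    return p;
}
```

With the pattern 1011 (0xB) as both operands, the integer product is 121 (0x79), while clmul16(0xB, 0xB) gives 0x45, i.e. x^6 + x^2 + 1, and reducing that modulo x^4 + x + 1 gives 0x9, i.e. x^3 + 1.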


adrian_b 86 days ago

Yes, registers are untyped, just as memory is untyped. There is no difference, and this is a good thing.

If you had a data type with type tags, that still would not mean that the storage location for it is typed, it would only mean that you have implemented a union type.

Typed memory would mean partitioning the memory into separate areas for integers, floating-point numbers, strings, etc., which makes no sense because you cannot predict the size of the storage area required for each data type.

In modern CPUs, the registers are typically partitioned by data type into only 3 or 4 sets: first, the so-called general-purpose registers, which are used for any kind of scalar data type except floating-point numbers; second, a set of scalar floating-point registers; third, a set of vector registers used for any kind of vector data type; and in very recent CPUs there may be a fourth set of matrix registers, also used for many data types.

In most current CPUs, e.g. Intel/AMD x86-64 and Arm AArch64, the scalar floating-point registers are aliased over the vector registers, so these 2 do not form separate register sets.

A finer form of typing for CPU registers is not useful, because it cannot be predicted how many registers of each type will be needed.

Therefore, as you say, the data type of an operation is encoded in the instruction and it is independent of the registers used for operands or results.

Moreover, there are several cases when the same instruction code can be used for multiple data types and the context determines which was the intended data type.

For instance, the same instruction for register addition can be used to add signed integers, non-negative integers and integer residues. The intended data type is revealed by the instructions that follow the addition. If the overflow flag is tested, it was an addition of signed integers. If the carry flag is tested, it was an addition of non-negative integers. If the flags are ignored, it was an addition of integer residues.
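A rough C model of this, assuming 32-bit two's-complement registers: one addition produces the result bits, and the carry and overflow conditions are derived from them afterwards. The struct and function names are hypothetical, and real hardware sets the flags as a side effect rather than returning them.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t sum;   /* the 32 result bits, identical under all three readings */
    bool carry;     /* the non-negative-integer sum did not fit in 32 bits */
    bool overflow;  /* the two's-complement signed sum did not fit in 32 bits */
} add_result;

add_result add32(uint32_t a, uint32_t b) {
    add_result r;
    r.sum = a + b;       /* wraps modulo 2^32: the integer-residue reading */
    r.carry = r.sum < a; /* a wrapped sum is smaller than either operand */
    /* Signed overflow: both operands have the same sign bit and the
       result's sign bit differs from it. */
    r.overflow = (((a ^ r.sum) & (b ^ r.sum)) >> 31) != 0;
    return r;
}
```

For example, add32(0xFFFFFFFF, 1) yields sum 0 with carry set and overflow clear (as signed integers this is -1 + 1), while add32(0x7FFFFFFF, 1) yields 0x80000000 with overflow set and carry clear.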

Another example is the bitwise addition modulo 2 (a.k.a. XOR), which, depending on the context, can be interpreted as addition of bit strings or as addition of binary polynomials.

Yet another example is a left rotation instruction, which can be interpreted either as a rotation of a bit string or as a multiplication by a power of 2 of an integer residue modulo 2^N - 1 (this is less well known than the fact that a left shift is equivalent to a multiplication modulo 2^N).
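The rotation claim can be checked by brute force for N = 8, where the modulus is 2^8 - 1 = 255. This is a sketch with made-up function names. One caveat: the all-ones pattern 255 is a second representative of the residue 0, so the two functions return identical bits only for inputs 0 to 254.

```c
#include <stdint.h>

/* Bit-string reading: rotate an 8-bit value left by k positions (k mod 8). */
uint8_t rotl8(uint8_t x, unsigned k) {
    k &= 7u;
    return (uint8_t)((x << k) | (x >> ((8u - k) & 7u)));
}

/* Residue reading: multiply x by 2^k modulo 2^8 - 1 = 255.
   Since 2^8 is congruent to 1 modulo 255, only k mod 8 matters. */
uint8_t mul_pow2_mod255(uint8_t x, unsigned k) {
    return (uint8_t)(((unsigned)x << (k & 7u)) % 255u);
}
```

For instance, rotl8(0x81, 1) is 0x03, and 129 * 2 = 258, which is 3 modulo 255.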

While registers and even instruction encodings can be reused for multiple data types, which leads to significant hardware savings, any program, including programs written in assembly language, should clearly and accurately define the exact types of all variables, both to ensure that the program will be easily understood by maintainers and to enable the detection of bugs by program analysis.

The most frequent use of "unsigned" in C programs is for non-negative integers, despite the fact that the current standard specifies that the operations with "unsigned" must be implemented as operations with integer residues. This obviously bad feature of the standard has the purpose of allowing lazy programmers to avoid the handling of exceptions, because operations with integer residues cannot generate exceptions. This laziness can frequently lead to bugs that go undetected or are detected only after they have had serious consequences.

I believe that if one reserves "unsigned" to mean "non-negative integer", then one should use typedefs for different data types whenever "unsigned" is used for another data type, and that includes bit strings, which is probably the next most frequently used data type for which "unsigned" is used.

IBM PL/I, from which the C language has taken many keywords and symbols, including "&" and "|", had distinct types for integers and for bit strings, but C did not adopt this feature.


xgk 86 days ago

There are even more algebras on the same bits, when you take signed integers into account, such as saturating arithmetic.

One interesting programming language construct that might be useful in this context is opaque type synonyms, a refined form of C's typedef, which modern languages like Rust, Haskell, Go or Scala offer. This allows the programmer to take the same underlying type (e.g. int), give it different names, and define a different algebra for each alias. The type system prevents the different aliases from accidentally flowing into each other. Of course, that alone does not help to manage the profusion of algebras over the same bits. I think a better approach for a high-level programming language is to follow assembly and really use different names for different operations, e.g. not have + built in. Instead, use explicit names like add_uint32, add_polynomials_gf_2, add_satur_arith, etc. The user can then explicitly define (scoped) aliases for them, including +, as long as the type system can disambiguate the uses. The Sail DSL for ISA specification (https://github.com/rems-project/sail) does this, and it is nice.
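As a rough illustration of the explicit-names idea, here is how those three operations might look in plain C. This is only a sketch: the function bodies are guesses at the intended semantics, and the saturating variant is assumed to be the unsigned one, clamping at UINT32_MAX.

```c
#include <stdint.h>

/* Integer residues modulo 2^32: the sum wraps around. */
uint32_t add_uint32(uint32_t a, uint32_t b) { return a + b; }

/* Binary polynomials over GF(2): coefficient-wise addition, i.e. XOR. */
uint32_t add_polynomials_gf_2(uint32_t a, uint32_t b) { return a ^ b; }

/* Saturating arithmetic (unsigned variant, an assumption of this sketch):
   the sum clamps at UINT32_MAX instead of wrapping. */
uint32_t add_satur_arith(uint32_t a, uint32_t b) {
    uint32_t s = a + b;
    return s < a ? UINT32_MAX : s; /* s < a means the sum wrapped */
}
```

On the operands 0xFFFFFFFF and 2, the three functions return 1, 0xFFFFFFFD and 0xFFFFFFFF respectively, so the choice of name decides the algebra even though the argument types are identical.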


uecker 85 days ago

A user in C can just wrap the type in a structure and define explicit operations on it. You do not need another language for this.
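A minimal sketch of that wrapping, with hypothetical type, field and function names: two single-field structs over the same uint32_t, each with its own addition.

```c
#include <stdint.h>

typedef struct { uint32_t val; } gf2poly;   /* binary polynomial over GF(2) */
typedef struct { uint32_t val; } residue32; /* integer residue modulo 2^32 */

/* Polynomial addition: coefficient-wise XOR. */
gf2poly gf2poly_add(gf2poly a, gf2poly b) {
    return (gf2poly){ a.val ^ b.val };
}

/* Residue addition: wraps modulo 2^32. */
residue32 residue32_add(residue32 a, residue32 b) {
    return (residue32){ a.val + b.val };
}
```

Because the two structs are distinct types, passing a residue32 to gf2poly_add is rejected at compile time, whereas two plain typedefs of uint32_t would mix silently.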


xgk 85 days ago

Indeed, that is the standard approach. It is also how some of the aforementioned languages desugar opaque type synonyms during compilation. It has the slight disadvantage that we can no longer use variables like

    x

in some situations, but need to use

    x._polynomials_gf_2

or whatever the structure's field name is. It is nice to avoid this boilerplate, which can become annoying quickly. Let the type-checker, not the human, do this work ...

> You do not need another language for this.

By the Church-Turing thesis you never need another language, but empirical practice has shown that the software engineering properties we see with real-world code and real-world programmers differ significantly between languages.


uecker 81 days ago

You could call it x.val, no need to use a long field name. But you would rarely access it directly anyway. I do not see any type checking advantage here for other languages.
