| Just got my hands on the spec and read through it for a bit. I noticed a couple of interesting changes from GDDR6.

Obviously, the big news is PAM3 signaling and on-die ECC. These aren't all that new, as NVIDIA's GDDR6X used PAM4 signaling (at lower frequencies than traditional GDDR6) and an unnamed DRAM vendor had GDDR6 DRAMs with on-die ECC, though at the cost of annoyingly high read and write latencies. Thankfully, only the DQ (data) pins use PAM3 signaling. I'm going to do my best to explain how this works in GDDR7, since I need to understand this for my work:

If you don't know what PAM3 is, it stands for "Pulse Amplitude Modulation, 3 levels". Traditional communication could be thought of as PAM2, since there's one level for 0 and one for 1, but we usually call it NRZ for "Non-Return to Zero". There's a bit of nuance, since not all binary data communication is NRZ, but that's a different discussion.

Now you might ask, why PAM3 and not PAM4? With 4 levels you can transfer 2 bits per symbol, and that seems much easier to work with. Well, it's because we hate ourselves, and we hate you. That's why.

For context, a GDDR6 channel uses 16 DQ (data) pins + 2 EDC (Error Detection/Correction) pins and 16 transfers per transaction, for 256 data bits and 32 CRC bits. GDDR6X (from my understanding) does the same thing, but since it's PAM4, it sends 2 bits per transfer with (I think) half the number of transfers.

GDDR7 on the other hand has 11 DQ pins of PAM3 signaling. Since PAM3 has 3 states {-1, 0, 1}, each transferred value is called a "symbol" or "trit" (I really hate the word "trit"). There are 3 separate encoding methodologies (yikes) used in this protocol: 11b7S, 3b2S, 2b1S. These essentially determine how many bits of data you can encode in a given number of symbols. 11b7S means 11 bits of data encoded in 7 symbols. 11 bits of data have 2048 unique combinations (2^11), and 7 symbols have 2187 unique combinations (3^7).
3b2S is 3 bits (8 combinations) encoded in 2 symbols (9 combinations). 2b1S is a misnomer, since a single symbol only has 3 states, but its use is specific to the Poison and Severity flag bits, and the Severity flag takes precedence, so the unrepresented combination is invalid.

Like GDDR6, there are still 16 transfers, but the DQ bus width is reduced to 11 DQ pins, plus a parity error pin. With this, we get a total of 176 symbols per transaction. When decoded, this allows us to send 256 bits of data, 2 bits for Poison/Severity flags (this is the 2b1S symbol), and 18 bits of CRC.

The encoded form is a bit of a doozy, though: 163 symbols for the data, 1 symbol (2b1S) for the Poison/Severity flags, and 12 symbols (3b2S) for the CRC. If that's not confusing enough, the 163 data symbols don't even all use the same encoding. The first 161 symbols encode 253 bits as 23 sets of 11b7S (7 symbols carrying 11 bits each). The remaining 3 bits are encoded in the remaining 2 symbols as a 3b2S set. How this is mapped to each pin is outlined in the spec. I'll take a look at that part later, as I have had enough unfriendly math for the day.

On a different note, there were some other notable changes that I noticed:

They also shrank the CA (Command Address) bus width from 10 pins down to 5 and run it twice as fast relative to the CK and WCK clocks. Instead of 2 cycles of 10 bits, it's 4 cycles of 5 bits. The 5 bits are split into "Row" [0:2] and "Column" [3:4] bits. It was a bit weird at first glance, but it's actually kind of nice. The commands make a lot more sense now.

Interestingly enough, it looks like the CABI (Command Address Bus Inversion) and CAPAR (Command Address Parity) are part of the 20 bit CA command now. Though I'm not sure how useful CABI is if it's only applied once every 4 cycles. Not sure how much power you're really saving.

There are 64 mode registers instead of 16. Still 12 bits wide though. Wait WTF?
This (kind of) made sense in GDDR6, since the address and the data fit nicely into 16 bits (4 address, 12 data) and you could use the other 4 bits as the command ID for MRS. Now we have 12 bit wide registers, 6 bit wide addresses, and the MRS command is a double length command because it doesn't use the column bits. This is really weird. Registers 0-31 are defined by the spec, 32-47 are reserved for future use, and 48-63 are Vendor Specific.

Found this funny: there's a feature for addressing clock drift between the CA and WCK clocks due to varying Voltage and Temperature. It's called the "Command Address Oscillator". I wonder why they picked that name... "There is a CAOSC associated with each channel and operates fully independent of any channel’s operating frequency or state" Someone had fun with that one!

Other quick blurbs:

- No mention of Quad Data Rate (QDR).
- CTLE looks like it's supported from the DRAM side.
- They added an RCK (read clock) pair, which is generated on the DRAM side.
- They added a static data scrambler! You don't know how happy I am to see this! |
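As a quick sanity check on the GDDR7 burst accounting in the comment above, here is a minimal Python sketch. It only re-derives the numbers quoted in the comment (11 pins, 16 transfers, and the 11b7S / 3b2S / 2b1S group counts); the names `fits` and `layout` are made up for illustration, and nothing here is taken from the spec itself.

```python
# Re-derive the burst arithmetic quoted in the comment (not from the spec).
PINS = 11          # PAM3 data pins per channel, as stated in the comment
TRANSFERS = 16     # transfers per transaction
PAM3_LEVELS = 3    # {-1, 0, 1}


def fits(bits: int, symbols: int, levels: int = PAM3_LEVELS) -> bool:
    """True if every `bits`-bit value has its own `symbols`-symbol pattern."""
    return 2 ** bits <= levels ** symbols


# (field, bits per group, symbols per group, number of groups)
layout = [
    ("data, 11b7S", 11, 7, 23),
    ("data, 3b2S", 3, 2, 1),
    ("poison/severity, 2b1S", 2, 1, 1),
    ("CRC, 3b2S", 3, 2, 6),
]

total_symbols = sum(symbols * groups for _, _, symbols, groups in layout)
total_bits = sum(bits * groups for _, bits, _, groups in layout)
data_bits = sum(bits * groups
                for name, bits, _, groups in layout
                if name.startswith("data"))

for name, bits, symbols, groups in layout:
    print(f"{name}: {groups} x {bits} bits in {symbols} symbols, "
          f"lossless fit: {fits(bits, symbols)}")

print("symbols on the wire:", PINS * TRANSFERS)
print("symbols accounted for:", total_symbols)
print("data bits:", data_bits, "| all decoded bits:", total_bits)
```

The only group that fails the `fits` check is 2b1S, which matches the remark that one of the four flag combinations has to be treated as invalid.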
> For context, a GDDR6 channel uses 16 DQ (data) pins + 2 EDC (Error Detection/Correction) pins and 16 transfers for 256 data bits and 32 CRC bits transaction. GDDR6X (from my understanding) does the same thing, but since it's PAM4, it sends 2 bits per transfer with (I think) half the number of transfers.
GDDR6X did not transfer 2 bits at a time despite being PAM4, because they disallowed transitions between the lowest voltage and the highest voltage.
Because of that, each group of 4 symbols has 139 easily-stackable sequences, so they went with 7b4S (7 bits in 4 symbols, since 2^7 = 128 fits within 139).
https://research.nvidia.com/publication/2022-04_saving-pam4-...
I couldn't tell you how each transaction is laid out.
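The 139 figure can be reproduced by brute force. The sketch below is a guess at the counting rule, not something taken from the linked paper: it assumes "no lowest-to-highest transition" means no adjacent pair of levels 0 and 3 in either direction, and that "easily-stackable" means the group does not start at the top level. Both assumptions, and the names `has_max_transition`, `valid`, and `stackable`, are mine.

```python
from itertools import product

LEVELS = range(4)  # PAM4 levels 0..3


def has_max_transition(seq) -> bool:
    """True if any adjacent pair jumps between level 0 and level 3."""
    return any(abs(a - b) == 3 for a, b in zip(seq, seq[1:]))


# All 4-symbol groups with no 0<->3 jump inside the group.
valid = [s for s in product(LEVELS, repeat=4) if not has_max_transition(s)]

# Assumed stacking rule: the group never starts at the top level (3).
stackable = [s for s in valid if s[0] != 3]

print("groups without a max transition:", len(valid))
print("groups under the assumed stacking rule:", len(stackable))
print("codewords needed for 7 bits:", 2 ** 7)
```

Under these assumptions the count comes out to 139, which leaves room for the 128 codewords that 7 bits need.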