by castratikron 1259 days ago
I'm hoping someday there will be an embedded Linux processor with this much cache. 128MB on-die SRAM means the PCB would no longer need separate DRAM. The complexity of the board routing would also go down. That much RAM ought to be enough for a lot of embedded applications.
7 comments

The economics don't work out. Why would you avoid something as trivial as board routing, as cheap as $2 per gigabyte DRAM, and as performance-enhancing as having gigabytes of main memory, just to use a 128 MB on-die (or on-package) SRAM at a price of ~$500/GB?
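As a rough sketch of that comparison, using only the two per-GB prices quoted above (ballpark figures from this comment, not vendor quotes) and a hypothetical 2 GB DRAM configuration for contrast:

```python
# Back-of-the-envelope memory cost, using the per-GB prices quoted above.
# Both prices are the comment's ballpark figures, not vendor quotes.
DRAM_USD_PER_GB = 2.0     # "as cheap as $2 per gigabyte DRAM"
SRAM_USD_PER_GB = 500.0   # "~$500/GB" for on-die/on-package SRAM


def cost_usd(size_mb: float, usd_per_gb: float) -> float:
    """Cost of size_mb megabytes at usd_per_gb, taking 1 GB = 1024 MB."""
    return size_mb / 1024 * usd_per_gb


sram_128mb = cost_usd(128, SRAM_USD_PER_GB)   # 128 MB of SRAM
dram_2gb = cost_usd(2048, DRAM_USD_PER_GB)    # 2 GB of DRAM (hypothetical config)

print(f"128 MB SRAM: ${sram_128mb:.2f}")
print(f"2 GB DRAM:   ${dram_2gb:.2f}")
```

At those prices the SRAM-only part pays roughly fifteen times more for one sixteenth of the capacity.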

The main distinction between application processors that can run Linux and microcontrollers that use onboard RAM (and often Flash) is that the former have an MMU. It's attractive to imagine that your SBC might only need something as simple as a DIP-packaged ATmega for an Arduino, and I can imagine a system-on-module - actually, saying that, I think several exist, e.g. this i.MX6 device with a 148-pin quad-flat "SOM" with 512 MB of DDR3L and 512 MB of Flash:

https://www.seeedstudio.com/NPi-i-MX6ULL-Dev-Board-Industria...

I don't know whether you'd count that Seeed-branded metallic QFP (which obviously contains discrete DRAM, Flash, and an i.MX6) as a single package, or whether a comparably sized piece of FR4 carrying a BGA package each for the application processor, DRAM, and Flash, on mezzanine or Compute Module-style SODIMM edge connectors, would satisfy your desire for an embedded Linux processor with less routing complexity. SOMs get built all the time for people who don't want to pay for 8 layers and BGA fanout.

I don't think there are enough applications for embedded systems that need 128M of onboard SRAM that won't support the power budget, size, complexity, and cost of a few GB of DRAM.

> Why would you avoid something as trivial

L3 cache is orders of magnitude faster than using RAM.

You're talking a maximum of 50GB/s for DDR5, versus 1500GB/s for L3 cache.

https://en.wikipedia.org/wiki/List_of_interface_bit_rates#Dy...

https://meterpreter.org/amd-ryzen-9-7900x-benchmark-zen-4-im...

It's a paradigm-shifting increase in processing speed when you don't need to hit RAM.
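To put the two quoted figures side by side, here is a rough sketch of how long one pass over a 128 MB working set would take at each bandwidth. It assumes purely bandwidth-bound streaming and ignores latency entirely; the 50 and 1500 GB/s numbers are the ones quoted above, not measurements.

```python
# Time to stream a 128 MB working set once at the two quoted bandwidths.
# Assumes a purely bandwidth-bound access pattern; latency is ignored.
DDR5_GB_PER_S = 50.0     # "maximum of 50GB/s for DDR5"
L3_GB_PER_S = 1500.0     # "1500GB/s for L3 cache"
WORKING_SET_MB = 128


def stream_time_us(size_mb: float, gb_per_s: float) -> float:
    """Microseconds to read size_mb once at gb_per_s (1 GB = 1000 MB here)."""
    return size_mb / (gb_per_s * 1000) * 1e6


ratio = L3_GB_PER_S / DDR5_GB_PER_S
ddr5_us = stream_time_us(WORKING_SET_MB, DDR5_GB_PER_S)
l3_us = stream_time_us(WORKING_SET_MB, L3_GB_PER_S)

print(f"L3 vs DDR5 bandwidth: {ratio:.0f}x")
print(f"DDR5: {ddr5_us:.0f} us per pass, L3: {l3_us:.0f} us per pass")
```

On these figures the bandwidth gap is about 30x, so a pass that takes roughly 2.5 ms from DRAM takes under 0.1 ms from cache.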

+ totally agree with that.

There is a use case when you can improve performance by keeping compressed (LZ4) data in RAM and decompressing by small blocks that fit in cache. This is demonstrated by ClickHouse[1][2] - the whole data processing after decompression fits in cache, and compression saves the RAM bandwidth.

[1] https://presentations.clickhouse.com/meetup53/optimizations/ [2] https://github.com/ClickHouse/ClickHouse
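A minimal sketch of the block-wise idea, not ClickHouse's actual implementation. It uses zlib from the Python standard library as a stand-in for LZ4, and the 64 KiB block size and function names are hypothetical, picked only for illustration. Python gives no control over cache placement, so this shows the structure of the technique rather than its speedup.

```python
import zlib

# Hypothetical block size, chosen so one decompressed block fits in cache.
BLOCK_SIZE = 64 * 1024


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Split data into fixed-size blocks and compress each one independently."""
    return [
        zlib.compress(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def sum_bytes_blockwise(blocks: list[bytes]) -> int:
    """Aggregate over the data while holding only one decompressed block at a time."""
    total = 0
    for block in blocks:
        raw = zlib.decompress(block)  # small, short-lived buffer
        total += sum(raw)             # process it before moving to the next block
    return total


data = bytes(range(256)) * 4096       # 1 MiB of sample data
blocks = compress_blocks(data)
compressed_size = sum(len(b) for b in blocks)

print(len(blocks), "blocks,", compressed_size, "compressed bytes of", len(data))
print(sum_bytes_blockwise(blocks) == sum(data))
```

Only the compressed blocks stay resident in RAM; each decompressed block is consumed and discarded before the next one is touched.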

You're correct but that is still a niche segment because markets that need 128MB of super-fast memory are almost always happy to pay a little bit more to get 4GB+ of "L4" (aka DRAM).
The economic point stands that you aren't going to get a processor with only cache and no RAM because virtually no workloads want such an unbalanced system.
As SSDs get faster and L3 caches get larger, will conventional RAM get squeezed out? I know Optane failed a few years back, but that kind of convergence seems inevitable in the long term.
Isn’t it inevitable that conventional RAM will continue to get larger and faster as well?
IDK, I got the impression that while RAM was getting larger and higher-throughput, that was coming at the cost of higher latency.
> The economics don't work out. Why would you avoid something as trivial as board routing, as cheap as $2 per gigabyte DRAM, and as performance-enhancing as having gigabytes of main memory, just to use a 128 MB on-die (or on-package) SRAM (at a price of ~$500/GB?)?

Size? But then the vendor could just ship the CPU+RAM stacked on top of each other.

There are SoCs available with the DRAM on top already, e.g. https://www.microchip.com/en-us/products/microcontrollers-an...
Basically every smartphone ARM SoC for over 10 years now.

Some Raspberry Pi SoCs also had the RAM soldered on top.

That would be incredibly inefficient. The price difference between such a 128MB L3 + 0GB DRAM part and, say, a 128MB L3 + 2GB DRAM one would be quite small, and in practice the performance of the latter would be much higher: realistically, in your 128+0 setup you'd easily waste half of the SRAM on OS or library data that isn't actually needed at the moment, whereas with DRAM you can use the whole 128MB of L3 for things that need to be fast.

It's also extremely niche to have a workload that requires such high CPU performance yet fits, Linux OS included, in 128MB. Usually something like that is FPGA or DSP territory.

I think what you want is a cheap ARM CPU with DRAM stacked on top of it on the same package (which exists).

If you just want to reduce board complexity (what a hobbyist/maker/homebuilder dream that would be), there are lots of package-on-package and system-in-package offerings already!

AllWinner V3s, S3. There's a SAMA5D2 SiP. Bouffalo BL808 (featured on the Pine Ox64). There are a lot more. I think there are a couple with even more memory too.

Intel's Lakefield, with Foveros stacking, was an amazing chip with 1+4 cores and on-chip RAM. High speed too, 4266MHz, back in 2020 when that was pretty fast. This is more for MID/ultrabooks, but wow what a chip, just epic: add power and away you go. OK, not really, but not dealing with routing (and procuring!) high-speed RAM is very nice.

Intel's been doing such a good job pushing interesting, nice things in embedded, but the adoption has not been great. The Quark chips, powering the awesome Edison module, had nice oomph & Edison was so well integrated, such an easy-to-use & featureful small Linux system... wifi & BT well well well before RPi.

It would be fun to see DRAM-less computers, but I more imagined them being big systems with a couple GB of SRAM. There's definitely potential for the low end too though!

The Pine64 Ox64 has a Bouffalo Lab BL808 with 64MB of pSRAM and an MMU for $6-8. It's already got a sorta-working Linux build for it.
I want to build my own DIY USB keyboard, including writing the keyboard firmware in 64-bit RISC-V assembly.

I have been eyeing the Ox64 for a while, but I need a few more green lights:

- Is the boot ROM enough to fully init the SoC? In other words, is there no extra init code I would have to include in my keyboard firmware on the SD card?

- The hardware programming manual is missing the USB2 controller and its DMA programming. Even with some SDK examples, you would need the hardware programming manual to properly understand how all that works.

- I want to run my keyboard firmware directly from the SD card slot, and directly on the 64-bit RISC-V core (not the 32-bit RISC-V core). Is that possible?

The SD card is not listed as a supported boot target, though you could almost certainly build a small bootloader that's stored in the qSPI flash and then load the rest of your code into RAM from there.

I'm not the most familiar with it, but I believe all hardware init (setting the clock source, initializing USB, GPIO, etc.) is handled by the flashable firmware, for which there are open-source SDKs.

Then it means I would need to flash my keyboard firmware, or an SD card loader firmware.

I guess this is a "standard" flashing protocol over USB, enabled by pressing the right button at power-on (when plugging in the USB cable). Would I need to include the flashing support code in my keyboard/SD card loader firmware, or is it handled separately by a different piece of hardware?

Any specs on the format of the firmware image, to know which core will run the real boot code?

Erk... soooo many questions in the wrong thread :(

I'm not sure you could flash over USB either without significant work. There's no UART <-> USB device on the Ox64, so you need to use an external one connected to some GPIO pins. You could maybe build a DFU mode yourself, but I'm somewhat skeptical it would work (though it might be possible, there are 3 cores in the thing). Despite there being not one, but TWO USB ports on the Ox64, neither is used for flashing. The micro USB type B connector is only used for power delivery, and the USB-C is primarily intended for acting as a host, i.e. for plugging in a camera module.

Edit: to clarify, there's a bug in the bootrom that prevents the initialization of the USB device. Newer revisions of the Ox64 may fix this.

Then:

- How do you run anything on that board if you cannot flash anything? I don't understand.

- I cannot use it as a USB keyboard controller because of a bootrom bug? (power/data via USB-C)

Now I am confused.

Note that PSRAM is just a DRAM with an easier interface. It's not at all comparable to true SRAM.
I don't think that's entirely true, but the 'p' in pSRAM does stand for 'pseudo', and it does have refresh circuitry and is slower than true SRAM. By how much, I have no idea.
I mean, having refresh circuitry basically means it's DRAM. That's one of the unique points of DRAM: it has to be periodically refreshed.
Now that's what I'm talking about. Very cool, thank you.
Might be worth taking a look at the announced Intel Xeon MAX chips then. I watched a video on it last night and these new server CPUs have a boatload of memory on the chip and can actually run without needing external DRAM.
On a "3nm" process, how much die real estate would 8GB/16GB of SRAM take, presuming we are in a fantasy world with massive dies?
No change. SRAM got almost no process improvements compared to 5nm. And 5nm had minimal compared to 7nm. So 3nm has “3nm”-class small transistors and 7nm class SRAM.
Ok, there is no change.

But how much die real estate for 8GB/16GB of SRAM in such a fantasy world?

Going based on AMD's first generation V-Cache (TSMC 7nm), you could get 1GB of SRAM onto a die slightly larger than a top of the line NVIDIA GPU. 2GB would be too large to fab as a single die. Or you could spend several million to get a Cerebras Wafer Scale Engine 2 with 40GB of SRAM in aggregate and a ton of AI compute power all on one wafer.
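A rough sketch of that scaling. The per-die figures are assumptions that do not come from this thread: about 64 MB of SRAM on a roughly 41 mm² first-generation V-Cache die, and a single-exposure reticle limit of about 858 mm². It also assumes area scales linearly with capacity, which ignores any extra interconnect or control logic.

```python
# Scale the first-generation V-Cache die up to larger SRAM capacities.
# All three constants are assumptions for illustration, not thread figures.
VCACHE_MB = 64              # assumed SRAM capacity per V-Cache die
VCACHE_DIE_MM2 = 41.0       # assumed area of that die (TSMC 7nm)
RETICLE_LIMIT_MM2 = 858.0   # assumed upper bound for a single die


def sram_area_mm2(size_gb: float) -> float:
    """Die area for size_gb of SRAM, assuming area scales linearly with capacity."""
    return size_gb * 1024 / VCACHE_MB * VCACHE_DIE_MM2


for gb in (1, 2, 8, 16):
    area = sram_area_mm2(gb)
    verdict = "fits in one reticle" if area <= RETICLE_LIMIT_MM2 else "exceeds one reticle"
    print(f"{gb:>2} GB: {area:>7.0f} mm^2 ({verdict})")
```

Under these assumptions 1 GB lands in large-GPU-die territory and 2 GB is already past what a single die can hold, which matches the estimate above.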
OK.

Then 8GB of SRAM with a modern CPU, Zen 4 for instance, is a die the size of ~9 top-of-the-line GPU dies.

And now, with 3D? ... mmmmmh...

What is the size of the Apple M2 die, again?