HPE unveils The Machine, a single-memory computer with 160 terabytes of memory (venturebeat.com)
114 points by aomega08 3322 days ago
13 comments

I am not sure if I should be impressed until they reveal more details.

160 TB over 40 nodes is 4 TB per node, which, assuming 512 GB DIMMs, requires only 8 DIMMs (4 per socket), or 16 DIMMs (8 per socket) with 256 GB DIMMs.
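
A quick sanity check of that arithmetic, as a rough sketch. The function name is made up for illustration, and it assumes binary units (1 TB = 1024 GB) and memory split evenly across the nodes:

```python
def dimms_per_node(total_tb: int, nodes: int, dimm_gb: int) -> int:
    """Number of DIMMs each node needs to hold its even share of total memory."""
    per_node_gb = total_tb * 1024 // nodes  # assumption: 1 TB = 1024 GB
    return -(-per_node_gb // dimm_gb)       # ceiling division


for size_gb in (512, 256, 128):
    print(f"{size_gb} GB DIMMs: {dimms_per_node(160, 40, size_gb)} per node")
```

With 128 GB DIMMs the same node would need 32 of them, which is why the DIMM size matters so much for judging how exotic the hardware is.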

How fast is the interconnect? 100Gb/s?

So far it sounds like a common high-memory HPC cluster unit with really unknown technical parameters.

Well...

Consider https://www.supermicro.nl/products/system/7U/7088/SYS-7088B-...

This 7U box can have 24TB on a single 192-core/384-thread node using 128GB DIMMs. At 7U, you can fit 5 of them in 40U of rack space (assuming you can deliver power and get rid of the heat) and deal with far less inter-node communication and CPUs that can run rings around the best ARM-based ones. You can also pack 8 GPUs or your favorite Xeon Phi card (but each will behave like a cluster node) and 8 NVMe cards and 32 2.5" devices. This is slightly less dense than the HP offering, but you can order 5 of them right now. If you do, tell them I told you to.

If you move it to 256GB DIMMs which, it was pointed out, is probably what HPE is using, you have 240TB per rack.
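
For anyone who wants to check the rack numbers, here is a small sketch. The 192-slot figure is not from a spec sheet; it is inferred from the 24TB-with-128GB-DIMMs figure above, and the function name is hypothetical:

```python
def rack_capacity_tb(dimm_gb: int, dimm_slots: int = 192,
                     box_height_u: int = 7, rack_u: int = 40) -> float:
    """Total memory (TB) in a rack filled with identical boxes.

    dimm_slots=192 is an assumption inferred from 24 TB / 128 GB per DIMM.
    """
    boxes = rack_u // box_height_u           # 40U / 7U -> 5 boxes
    return boxes * dimm_slots * dimm_gb / 1024


print(rack_capacity_tb(128))  # 128GB DIMMs
print(rack_capacity_tb(256))  # 256GB DIMMs
```

This gives 120 TB per rack with 128GB DIMMs and 240 TB with 256GB DIMMs, matching the figures quoted here.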

The interesting part seems to be their NGMI (next-generation memory interface), which, as far as I understand, acts as a kind of memory bus, and each processor will use it more like a disk device (with special functionality such as support for atomic operations).

Unfortunately I did not find any performance information about it.

There was a presentation about it done last year by Keith Packard at Linux.conf.au.

Hardware and Software Architecture of The Machine

https://www.youtube.com/watch?v=S--Kgseuy0Q

"So far it sounds like an common high memory HPC cluster unit with really unknown technical parameters."

And really anemic CPUs to boot...

Only real interesting part would be the technical details of their NGMI interconnect.

The only thing I have understood so far is that it is based on an FPGA, which suggests that it is experimental, i.e. no ASIC yet.

Isn't the idea that a single process can address all 160TB of memory as if it were local?

Of course you can cobble together that much memory when you're programming a distributed system of communicating processes. The interesting part would be programming it as if for one computer.

As far as I understand, this memory would not be local to any of the processors: all of them access it through a shared bus (a fabric, as they call it), and it appears to be handled as a weird (their words) disk device rather than as memory.

I did not find information about the bandwidth or latency of such architecture.

Distributed shared memory on commodity hardware/networks is a long solved problem. I was using TreadMarks back in 1994. Making it perform well, on the other hand, is a real challenge.
Where can you get 512 GB DIMMs?
That is the assuming part, because the article tries to make it sound extremely impressive. Well, according to the link provided by pjmlp, they most likely use 256 GB DIMMs: https://youtu.be/S--Kgseuy0Q?t=450
I think even 128GB RDIMMs using TSV and 8Gbit DDR4 are expensive at this point. Interestingly, I have not even seen any 16Gbit DDR4 announcements.
Wasn't the Machine supposed to be a memristor computer? Did they ditch the project?
AFAIK there were licensing issues with memristors: in the usual forward-thinking HP way, they exclusively licensed the tech to someone else, who did no work on it, so it was not ready for The Machine.

The IP is also a joint asset of HPE and HP Inc, so that sounds like something I would avoid baking into my product if I was HPE

I recall having read that it was not yet economical to mass produce memristors, which had to compete against modern transistor chips.

So they went ahead with the unified memory architecture, just without the memristor component.

One could argue that this architecture is future-ready, once memristors become economical to produce

Yes, but I think the idea was that The Machine was a constellation of technology projects for memory-focused computers. Although the memristor portion did not live up to their claimed expectations (unfortunate but unsurprising), the overall project of a memory-focused computer continues under the brand of The Machine.

(My outside-in interpretation. Maybe someone from HP can add more concrete detail and insight.)

memristors are unfortunately still vaporware
Neat press release. Bombastic, eye-catching, has people asking what it is.

But the product is just a high memory HPC cluster unit, 4TB per node. It's not "revolutionary", it won't "change everything", it's not a "technological breakthrough".

TL;DR: It's press-bait.

> The new prototype has 160 TB of shared memory spread across 40 physical nodes, interconnected using a high-performance fabric protocol.

So basically something similar to RDMA scaled to 40 nodes, 4 TB RAM per node.

the article is pretty weak, people have built rdma machines that size before, and there have been architectures that allow for direct memory addressability of that size before.

so i have to assume the latter. that seems to be borne out by the little information i can find...addressable persistent memory is clearly a theme. but i haven't found any discussion of what kind of latency hiding mechanisms might be at play and what kind of consistency model is being used. will keep looking for anything detailed and authoritative.

(edit - this seems to be pretty relevant https://www.labs.hpe.com/publications...but i don't know how much of it was speculative and how much was built...after all the security papers, there are some about concurrency control)

The Next Platform has had many articles with bits of detail on HP's Machine. Unfortunately, they don't really have any tagging mechanism, but you can search the phrase "the machine" or look at related articles there.

https://www.nextplatform.com/2017/01/09/hpe-powers-machine-a...

Great resource. Based on that, it looks like the fabric has a bandwidth of 600Gb/s (or 1.2Tb/s full duplex).

The interconnect seems to be the real innovation here.

I think the interesting part is the programming model and not the amount of RAM. This computer has no separate storage area. If you want to manipulate data, you can do it in-place instead of reading it into a RAM-like working memory and then writing the result to a disk-like storage.
> I think the interesting part is the programming model and not the amount of RAM. This computer has no separate storage area.

You can already mount a filesystem backed by RAM, or mmap files that exist on disk. What existing distinction between RAM and storage is holding anything back?

I see a reason some techniques will get more popular, but I don't really see the difference. Saving a load operation? Not having to worry about flushing to long-term storage? Am I missing something?
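
To make the mmap point concrete, here is a minimal sketch of updating on-disk data in place through a memory mapping, using Python's standard `mmap` module. The function name and the 8-byte counter layout are invented for the example; nothing here is specific to The Machine:

```python
import mmap
import os
import tempfile


def increment_counter_in_place(path: str) -> int:
    """Bump an 8-byte little-endian counter stored at the start of a file."""
    with open(path, "r+b") as f:
        # Map the first 8 bytes; writes to the mapping go to the file itself.
        with mmap.mmap(f.fileno(), 8) as view:
            value = int.from_bytes(view[:8], "little") + 1
            view[:8] = value.to_bytes(8, "little")
            view.flush()  # ask the OS to write the dirty page back
    return value


# Usage: create a file holding a zeroed counter, then bump it twice.
fd, path = tempfile.mkstemp()
os.write(fd, (0).to_bytes(8, "little"))
os.close(fd)
increment_counter_in_place(path)
print(increment_counter_in_place(path))
os.remove(path)
```

There is no explicit read-into-buffer or write-back-to-file step in the program, but the explicit `flush()` is exactly the kind of durability bookkeeping the question above is asking about.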

Based on HPE's website, it looks like they're going to be making their fabric part of this:

http://genzconsortium.org/faq/gen-z-technology/

Underwhelming - you could build the same exact thing with a bunch of white box servers and a pair of Infiniband switches.
The Machine should be fine, but if Samaritan comes online, we are doomed!
How does the Machine recover from software errors? If it has only persistent RAM, then when some important program goes bonkers (it always goes), there is no luxury of pressing the reset button to get back to a pristine state.
Taking into account the possibility of memory corruption, I think ultimately they won't get away without some sort of checkpointing to external storage.

To actually take advantage of persistent RAM and have fast restarts with no rollback, I bet early solutions will just partition the RAM into "bonkers-prone software" and "data" parts, with reboots involving a complete wipe of the former. Then somebody will submit Linux patches to keep some application or kernel data in the data part and hell knows where it goes from there.

And BTW, the announcement doesn't say they already have persistent RAM.

My guess is that there would be a microkernel service that is the core OS, like the BIOS, and it could be instructed to actively wipe the memory and reload the boot files when the machine freezes
There can be a bug in a microkernel service (and as it is distributed, it is actually guaranteed that it will crash unexpectedly at some point).
Well I guess it's a disposable computer then. Use until it crashes.
I suppose they have an 80GB SSD with the operating system on it.
This falls into a scale-up (vertical) type of architecture. In what scenarios would you need tin like this, versus a more scale-out/distributed (horizontal) architecture?
It's a 40 node mesh. Isn't that scale-out?
Let me rephrase: a scale-out architecture with commodity machines (using Hadoop, kube, Mesos, etc.) versus an engineered system in hardware. A client purchased an HP Superdome (another highly engineered system), and the cost per CPU core, GB of RAM, and TB of storage was horrendous: more than 10x comparable commodity hardware solutions. Eight months later they hit performance problems and space issues, and adding capacity is a serious chunk of change.

What I am trying to understand is in what use cases this would make sense.

This is a scale-out solution; it's just higher performance / lower effort than building your own solution from commodity hardware. Enterprises often want to buy canned, supported solutions rather than expend the R&D to build their own.

What workload does this configuration make sense for? I don't know, especially with those ARM cores. My guess would be the cluster performs well on workloads similar to Apache Spark. But I don't know why you would choose one over the other.

I believe the use case is needing high-bandwidth interconnects, e.g. problems with a lot of synchronization across the dataset.
Interesting to see ARM starting to become datacenter worthy now. It's cheap and there is a version geared towards any application you can think of. Now let's see HPE use some FPGAs with embedded ARM CPUs. That would really be the killer for all the smaller shops that are building similar exabyte systems.
A lot of it is the big customers can go to a chip company and say "I want this SOC, with X, Y, and Z" - and if they want enough of them, will actually be able to get them at a reasonable price.

I read this is why Qualcomm did their data center chip - some of the big Chinese companies wanted it.

Nice for Redis.
This is funny, because in 100 years, computers with 100 yottabytes of memory will be standard, and they'd be laughing at a headline like this one.
Or they will let out a nostalgic sigh as they think about those simpler days where you could get useful work done on 4 GB of RAM... ;-)
"and also allows accelerators to get direct access to a massive memory-storage footprint" -- ehh?
I'm sure those SGIers now at HPE will be delighted to see so much being made of the UV.

Oh.

Not enough.

Because, you know, there happens to be something else that is driven by memory. I'll give you a hint as to what it is: it's comprised of approximately a hundred billion units. (And no, it's not a galaxy.)

Divide 160 terabytes by the number I just stated, and you'll find that you only get 1.6 kilobytes per unit. Not really enough (or at least, too close for comfort).

Step up your game, HP! :) But, this is a VERY good start.

-

Edit: I got downvoted to -4 but I have a right to state my requirements. If your requirements aren't that high, use whatever equipment you want. (the guesses below are correct, if you want "the answer to the puzzle.")

Maybe just say what you mean, instead of riddles, as part of contributing to the discussion?
I assume poster is referring to the human brain, but yes, too clever by half and doesn't contribute much to the conversation.

Personally, I'm quite disappointed that this isn't memristor based - not much of an improvement in holding a conscious statefile that disappears when the power goes out.

~100 billion neurons in a human brain, I guess. Not sure what 1.6KB per neuron is too close to.
I'd guess that it would be close to the size of the adjacency matrix for the human brain, if it is given that the brain has ~1000 dendrites per neuron.
This comment currently has a reply ( http://i.imgur.com/N7BZJfu.png ) which shows why I wanted to leave my original post as a "riddle" rather than coming out and saying it.

The following "crap" is not relevant to anything, only read it if you want the answer to why 1.6 KB is insufficient for a certain science fiction fantasy which will never, not in ten, a hundred, or a thousand years, become relevant any more than magicians will start flying around on brooms. It has nothing to do with anything. Don't get me wrong and think that I think it has relevance to something.

Original version of this comment:

--

Well, it's clear that you could not come even close to perfectly describing the state of a cell and all its connections in 1.6 KB. As just one example, every cell has a different genome[1] which if you encoded it in full would be 700 MB right off the bat. I'm not saying the fact that every neuron has a different genome is relevant to its computational functions, but 1.6 KB is cutting it very close. Let's explore why this is so.

Suppose we simplified and said that each neuron may be connected to n other neurons (we will calculate what n fits in our memory). Addressing 100 billion takes 36 bits. (You need approx. 36 bits to name a number between 1 and 100 billion, i.e. if all you're doing is naming which other neuron it's connected to.). So let's see how many "synapses" the 1.6 KB might be enough to address. -- we are looking at how many connections could be encoded into 1.6 KB.

So 1.6 KB if you did nothing else but as a simplification say that you were referring to another neuron, then 1.6 KB = 1.6 * 1024 * 8 = 13107 bits, divide that by 36 and you get 364 addresses. Suppose you were to then encode each of those addresses with a single byte (an 8-bit brain) representing the strength of that connection, and you are down to only encoding 13107/44 = 297 neural connections with an 8-bit connection value. [EDIT!! As pointed out in a reply, original version of this paragraph contained a math error - I originally divided by 8 after dividing by 36 -- as though each bit of the address needed a connection strength, rather than adding 8 to 36 to get the number of bits needed for each connection and only afterward dividing that into the 1.6 KB]

Okay, so let's see how "297 neural connections" stacks up.

In fact "Each neuron may be connected to up to 10,000 other neurons", but "Each of the neurons has on average 7,000 synaptic connections to other neurons." This means we do not have 8 bits - we do not have even a single bit. (Because 7,000 is more than 297 or even 364).
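
The budget arithmetic above can be reproduced with a short sketch. The helper names are made up for illustration; the 36-bit address and 1.6 * 1024 * 8 bit budget are taken from this comment as given. Strictly speaking, log2 of 100 billion is about 36.5, so a full address would need 37 bits, which shifts the results slightly without changing the conclusion:

```python
import math

NEURONS = 100_000_000_000
BUDGET_BITS = int(1.6 * 1024 * 8)  # 13107 bits per neuron, as in the comment


def connections_that_fit(budget_bits: int, address_bits: int,
                         weight_bits: int = 0) -> int:
    """How many (address + weight) records fit in the per-neuron budget."""
    return budget_bits // (address_bits + weight_bits)


def bytes_needed(connections: int, address_bits: int, weight_bits: int) -> int:
    """Storage needed for a given number of connections, rounded up to bytes."""
    return math.ceil(connections * (address_bits + weight_bits) / 8)


exact_address_bits = math.ceil(math.log2(NEURONS))  # bits for a full address

print(connections_that_fit(BUDGET_BITS, 36))      # addresses only
print(connections_that_fit(BUDGET_BITS, 36, 8))   # address + 8-bit strength
print(connections_that_fit(BUDGET_BITS, exact_address_bits, 8))
print(bytes_needed(7000, 36, 8))                  # 7,000 synapses per neuron
```

At 44 bits per connection, 7,000 connections come to 38,500 bytes per neuron, roughly 24 times the 1.6 KB available.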

So you cannot even store the full address of each synaptic connection (the full address of the other neuron it's connected to) let alone a value of the strength of it, or anything else computationally interesting that may be going on chemically. Of course, obviously one neuron will not be directly connected with another far away, for example on the other hemisphere of the brain (meaning you can almost certainly shave nearly 1 bit of addressing off of the 36 bits from the get-go, and likely you need a far smaller address space, if you really look at it.)

But as you can see, however you slice it, you're just cutting it extremely close. If synaptic connections had even 10 bits of value of some kind, then you're way over your budget here.

But as you can see we're close - the numbers just barely don't work.

HP are on the right track - they simply need to step their game up.

[1] recently in the news - https://www.scientificamerican.com/article/scientists-surpri...

EDIT* - thanks for the correction, yes, there was a math error.

You don't need a single memory machine to simulate the human brain though. Because neurons have a clear spatial structure, you can break the problem up into a mesh, give each piece of the mesh to a different machine, and give them some interconnect so they can broadcast relevant portions of their state to neighboring machines. You don't need low latency direct memory access for every node to all of the memory. Look at something like the https://en.wikipedia.org/wiki/Tianhe-2 which has about a petabyte of RAM

EDIT: I think you're also not considering the ample opportunities to compress this data set. For instance, you can represent the connections between neurons with a sparse adjacency matrix, and there is a whole body of literature devoted to coming up with small representations of that kind of data, while still being able to perform operations efficiently. For the genome variance, keep in mind the variance is going to be absolutely tiny in comparison to the whole genome, so you can just store the genome of one of the neurons and then the diffs between it and all the other neurons or something.
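
As one illustration of the sparse-representation idea, here is a sketch of compressed sparse row (CSR) storage, a common layout for sparse adjacency data. It is just one of many possible encodings, not anything HPE or brain-simulation projects are known to use, and the function names are hypothetical:

```python
def to_csr(num_nodes, edges):
    """Build CSR arrays (row_ptr, col_idx) from a list of (src, dst) pairs."""
    counts = [0] * num_nodes
    for src, _ in edges:
        counts[src] += 1

    # row_ptr[i] is where node i's neighbor list starts inside col_idx.
    row_ptr = [0]
    for count in counts:
        row_ptr.append(row_ptr[-1] + count)

    col_idx = [0] * len(edges)
    next_slot = row_ptr[:-1].copy()
    for src, dst in sorted(edges):  # sorting keeps each neighbor list ordered
        col_idx[next_slot[src]] = dst
        next_slot[src] += 1
    return row_ptr, col_idx


def neighbors(row_ptr, col_idx, node):
    """Return the neighbor list of a node as a slice of col_idx."""
    return col_idx[row_ptr[node]:row_ptr[node + 1]]


row_ptr, col_idx = to_csr(3, [(0, 2), (0, 1), (2, 0), (1, 2)])
print(neighbors(row_ptr, col_idx, 0))
```

Only the connections that exist are stored, plus one offset per neuron. Because each neighbor list is sorted, the targets could additionally be stored as small differences between consecutive IDs, which is where much of the compression in that literature comes from.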

> why I wanted to leave my original post as a "riddle" rather than coming out and saying it.

Because the comment was fundamentally irrelevant? You could post "step up your game" on every single tech submission. It's white noise.

You made a math error:

You can encode about 300 addresses, at 36+8 = 44 bits per item, not 45 addresses, at 36*8 = 288 bits per address.

The bits are additive, because we're storing an address and data, not a byte of data per bit of address.

thanks, corrected.