Hacker News new | ask | show | jobs
by teawrecks 1064 days ago
Ok, I see where the lay person would get confused on this. In the context of this article, every core is what Wikipedia calls a "node". There is no difference between a single 32C CPU and 4x 8C CPUs except for their ability to share memory faster. Both are similarly defined as horizontal scaling in the context of this article. You're not going to finish a single workload any faster, but you're going to increase the throughput of finishing multiple workloads in parallel.

The fact that AMD chooses to package the "nodes" together on one die vs multiple doesn't change that.

2 comments

The ability to “share memory faster” is a bigger distinction than you make it out to be. Distributed applications look quite different from merely multithreaded or multiprocess shared-memory applications due to the unreliability of the network, the increased latency, and other things which some refer to as the fallacies of distributed computing. To me, this is usually what people mean when they talk about “horizontal” vs. “vertical” scaling. With modern language-level support for concurrency, it hurts much more to go from a shared memory architecture to a distributed one than to go from a single-thread architecture to a multithreaded one.
The wikipedia article qualifies what it means with vertical scaling

> typically involving the addition of CPUs, memory or storage to a single computer.

This is one of those times when I feel like you just didn't read anything I typed. So... I'm just gonna let you be confidently incorrect.
I'm reading what you're typing, but I just don't agree with it. It's also contradicted by both the article we're discussing and the wikipedia article; further it's an interpretation of vertical scaling that effectively doesn't

Distinction between horizontal and vertical scaling becomes nonsense if we accept your definitions, because literally nobody does that sort of vertical scaling.

Wrong. If you do any of these you're scaling vertically, even by that definition:

* Replace the CPU with a faster one, but with the same number of cores. Or simply run the same one at a higher clock rate.

* Add memory, or use faster memory.

* Add storage, or use faster storage.

These are all forms of vertical scaling because they reduce the time it takes to process a single transaction, either by reducing waits or by increasing computation speed.

> It's also contradicted by both the article we're discussing and the wikipedia article

The article agrees with this definition. Transaction latency decreases iff vertical scale increases. Transaction throughput increases with either form of scaling. Without this interpretation, the analogy to conveyor belts makes no sense.

Think of it this way, instead. Building a multi-belt system is a pain in the ass that complicates the design of your factory. Conveyor belt highways, multiplexers, tunnels, and a bunch of stuff related to the physical routing of your belts suddenly becomes relevant. But you can still increase throughput keeping a single belt, if your bottleneck is not belt speed but processing speed (in the industrial sense). I can have several factories sharing the same belt, which increases throughput but not latency.

Also, it's worth pointing out that increasing the number of processing units often _does_ decrease latency. In Factorio you need 3 advanced circuits for the chemical science pack. If your science lab can produce 1 science pack every 24 seconds but your pipeline takes 16 seconds to produce one advanced circuit, your whole pipeline is going to have a latency of 48 seconds from start to finish due to being bottlenecked by the advanced circuit pipeline. Doubling the amount of processing units in each step of the circuit pipeline will double your throughput and bring your latency down to 24 seconds, as it should be. And if you have room for those extra processing units, you can do that without adding more belts.

The idea that serial speed is equivalent to latency breaks down when you consider what your computer's hardware is really doing under the scenes, too. Your cpu is constantly doing all manner of things in parallel: prefetching data from memory, reordering instructions and running them in parallel, speculatively executing branches, ...et cetera. None of these things decrease the fundamental latency of reading a single byte from memory with a cold cache, but it doesn't really matter because at the end of the day we're measuring some application-specific metric like transaction latency.

>Also, it's worth pointing out that increasing the number of processing units often _does_ decrease latency. [...]

This isn't latency in the same sense I was using the word. This is reciprocal throughput. Latency, as I was using the word, is the time it takes for an object to completely pass through a system; more generally, it's the delay between a cause and its effect/s. For example, you could measure how long it takes for an iron ore to be integrated into a final product at the end of the pipeline. This measure could be relevant in certain circumstances. If you needed to control throughput by restricting inputs, the latency would tell you how much lag there is between the time when you throttle the input supply and the time when the output rate starts to decrease.

>The idea that serial speed is equivalent to latency breaks down when you consider what your computer's hardware is really doing under the scenes, too. Your cpu is constantly doing all manner of things in parallel: prefetching data from memory, reordering instructions and running them in parallel, speculatively executing branches, ...et cetera. None of these things decrease the fundamental latency of reading a single byte from memory with a cold cache, but it doesn't really matter because at the end of the day we're measuring some application-specific metric like transaction latency.

Yes, a CPU core is able to break instructions down into micro-operations and parallelize and reorder those micro-operations, such that instructions are retired in a non-linear manner. Which is why you don't measure latency at the instruction level. You take a unit of work that's both atomic (it's either complete or incomplete) and serial (a thread can't do anything else until it's completed it), and take a timestamp when it's begun processing and another when it's finished. The difference between the two is the latency of the system.

Or other people can disagree with your interpretation, especially because the analogy is somewhat strained and highly oversimplified.