by downer68 2906 days ago
An interesting side effect of this is that it would enable a standard of synchronization across geographic regions, such that one could treat a set of virtual machines as one ultra-wide-bus CPU with a 1 GHz clock speed.

All of the local overhead of real system resources and network synchronization could be handled by the remainder of the real CPU clock available to the bare metal, while each machine still contributes to the computation of a segment of a virtual bit field, at speed.

So now maybe we get a commodity 4096-bit 1 GHz CPU as a service, which is maybe comparable to a 64-core processor, but without the overhead of chunking down to the width of 64 bits.

5 comments

"...across geographic regions, such that one could treat a set of virtual machines as one ultra-wide-bus CPU with a 1 GHz clock speed."

I'm not entirely sure what you're trying to say here, but I am entirely sure that it's wrong.

A precise clock isn't the same thing as the removal of latency, and the operations of a CPU are ordered. That is, I can't start working on the multiplication of A * (B + C) until the addition result is available. Furthermore, if the elements of the operation, B and C, or parts of those elements, were separated by miles (or even feet), the latency of that operation would increase by orders of magnitude.

I doubt that even a 1MHz distributed processor would be achievable as a large distributed bit field computer as you've laid out here.

If you're worried about overhead in computing, it is critical to remember that a foot is a nanosecond. I'd much rather break my data down to register size (and I often do) than ship my data over a wire or fiber (which I also often do).
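To put rough numbers on "a foot is a nanosecond", here is a minimal Python sketch of the propagation delay alone. The fiber velocity factor of 0.67 is an assumed typical value, not something from this thread, and real links add switching, queuing, and software overhead on top.

```python
# Rough latency arithmetic; the fiber velocity factor is an assumed typical value.
C_VACUUM_M_PER_S = 299_792_458      # speed of light in vacuum (exact by definition)
FIBER_VELOCITY_FACTOR = 0.67        # assumption: light in fiber travels at about 2/3 c
FOOT_M = 0.3048
CLOCK_HZ = 1_000_000_000            # the 1 GHz clock proposed in the thread


def one_way_latency_ns(distance_m, velocity_factor=1.0):
    """Propagation delay in nanoseconds, ignoring switches, queues, and software."""
    return distance_m / (C_VACUUM_M_PER_S * velocity_factor) * 1e9


def cycles_lost(distance_m, velocity_factor=1.0, clock_hz=CLOCK_HZ):
    """How many clock cycles elapse while a signal crosses the distance once."""
    return one_way_latency_ns(distance_m, velocity_factor) * clock_hz / 1e9


if __name__ == "__main__":
    for label, metres in [("1 foot", FOOT_M), ("1 mile", 1609.344), ("1000 km", 1_000_000.0)]:
        ns = one_way_latency_ns(metres, FIBER_VELOCITY_FACTOR)
        cycles = cycles_lost(metres, FIBER_VELOCITY_FACTOR)
        print(f"{label:>8}: {ns:12.1f} ns one way in fiber, about {cycles:.0f} cycles at 1 GHz")
```

Under these assumptions a single hop between regions costs millions of 1 GHz cycles before any work is done, which is the gap between a precise clock and low latency.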

Actually, if you marshall all of your addressable units up front (4096 bit sentences, instead of 64 bit words), which aligns well with raw allocation units on many file systems, as an end user of the service, the overhead (to you) is reduced to network I/O if the product is built correctly.

The only hard part requiring serialized synchronization is the carry bit, across compute nodes. Share the carry bits between nodes, and while relaying a sentence to a cluster of synchronized nodes, the pipeline can shoot the sentence into the cluster as a unit, proxy and chain together the carry bits with a coordinated execution plan, and on the other side of the pipe, you get your well-timed 4096 bit result, all at 1 GHz, because the service is designed and produced to handle input at nanosecond intervals.

What are the advantages? Predictability, and expanded throughput.

Now you can look at an entire passage of text and make a determination about it in less time. Or stack many passages and composite them to assess or intuit variation. Designing the product this way makes it easy to reason about, and thus easier to market and sell. Is it possible to make a profitable system that works like this? Gee, great question! There's no obvious answer.

But anyway, from the perspective of a subscriber, it's on them to marshall their data, and then, if they have operations for which the scale of 4096 bit chunks improves results, they can get their granular operations done at 1 GHz, which allows them to predict time spent and overall cost more easily.

(e.g. I have all these [less-than-but-up-to] 4096 bit toots marshalled in a single data store, from a shit ton of Mastodon instances (I did all the crawling and retrieving, and saved them in one place, as a standardized data set), and I think this fact might be true about some of them, here is the rule set to interpret, please give me back the members of the toot array that return true when the function of this rule set returns true)

BTW, don't get hung up on 4096 as "the best number" I just chose it because it's a nice square number.

"The only hard part requiring serialized synchronization is the carry bit, across compute nodes."

I don't think that's the only hard part. Branches, for instance, are rough.

"What are the advantages? Predictability, and expanded throughput."

I think the system you've described would definitely have some predictability, but I contend that it would be predictably slow. Furthermore, given that everything is going to have to be pipelined up to its eyeballs, you don't need nanosecond synchronization to achieve high throughput. Audio, for instance, often achieves higher throughput than clock. Look at the AES MADI spec for an example of this (basic link at Wikipedia here: https://en.m.wikipedia.org/wiki/MADI ).

I'm just not seeing how this is practicable, or, more critically for this conversation, how it is particularly uncorked by precision clocking in a particularly meaningful way. It strikes me as an approach that would have to deal with edge cases robustly, largely using the same mechanisms that would be necessary for imprecise clocking (but with assured sequencing).

"But anyway, from the perspective of a subscriber, it's on them to marshall their data, and then, if they have operations for which the scale of 4096 bit chunks improves results, they can get their granular operations done at 1 GHz, which allows them to predict time spent and overall cost more easily."

This strikes me as similar to the complexity sizing in Craig Gentry's fully homomorphic encryption system, in that all operation sets up to a configured encodable complexity require the same computational effort, which is effectively inefficient for smaller operations. For timing attacks in cryptosystems, it actually seems reasonable to retain fixed effort, even if Gentry's original system was largely impractical.

For general computation? I think that the sweet spot between job chunking and dataset chunking for the system you've described may not actually exist.

Are you saying that 64 bit CPU + 64 bit CPU = 128 bit CPU (as long as they are time synced)?

1. It doesn't work this way 2. Why would you want a 4096 bit CPU?

For financial transactions, it would certainly allow for fast high-precision floating point math. Imagine IEEE 754 4096-bit floats. Not sure anyone would actually use this, and you'd still have to standardize the rounding precision, but it might be an interesting vein of research.

Still, I agree with you -- what the OP described is not a 4096-bit processor.

Now highly-synchronized VMs -- that's an entirely different matter. Probably a boatload of use cases for those.

64 bits already gives you 16 digits, which is enough for a trillion dollars down to one one-hundredth of a cent. So maybe there is someone who needs 128 bits, which has been part of IEEE 754 since 2008, but that is then probably enough to calculate the total of all financial transactions ever done.
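As a rough check on the digit counts, here is a small Python sketch that converts significand widths to decimal digits. The widths (53 bits for binary64, 113 bits for binary128, both counting the implicit leading bit) come from IEEE 754; the helper name is made up for illustration.

```python
import math

# Significand widths in bits, including the implicit leading bit (IEEE 754).
SIGNIFICAND_BITS = {"binary64": 53, "binary128": 113}


def decimal_digits(significand_bits):
    """Approximate number of decimal digits a binary significand can hold."""
    return significand_bits * math.log10(2)


if __name__ == "__main__":
    for name, bits in SIGNIFICAND_BITS.items():
        print(f"{name}: about {decimal_digits(bits):.1f} decimal digits")
    # Above 2**53 a binary64 value can no longer tell neighbouring integers apart.
    print(float(2**53) + 1 == float(2**53))  # True
```

So "16 digits" is a fair round figure for binary64, though the very top of the 16-digit range already sits just past the point where every integer is exactly representable.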
Where's that useful? Options pricing? I have no idea.
Why would you use floating point math for finance?
The alpha calculation can (and should) use floating point math. If the market has a midpoint of $99.99 with a bid/ask of $99.98/$100.00, you could compute a bunch of signals and end up with an alpha-adjusted midpoint of $100.00383736383..., at which point you’d convert it back to fixed-point and then try to buy $100.00
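A minimal Python sketch of that last step, converting a floating-point fair value back to a fixed-point price. The one-cent tick size and the choice to round a bid down are assumptions for illustration, and `to_bid_price` is a hypothetical helper, not any exchange's API.

```python
from decimal import Decimal, ROUND_FLOOR

TICK = Decimal("0.01")   # assumption: prices move in one-cent ticks


def to_bid_price(alpha_mid: float, tick: Decimal = TICK) -> Decimal:
    """Snap a floating-point fair value down to the nearest tick at or below it."""
    # Go through str() so the Decimal reflects the printed float,
    # not the full binary expansion of the underlying double.
    return Decimal(str(alpha_mid)).quantize(tick, rounding=ROUND_FLOOR)


if __name__ == "__main__":
    print(to_bid_price(100.00383736383))  # 100.00
```

The signal math stays in floating point; only the order price that leaves the system is fixed-point.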
A floating point representation is not really the issue, the issue is not using base 10, and IEEE 754 specifies base 2 and base 10 floating point formats and operations. But I am of course not sure whether the original comment referred to base 2 or base 10 and given how common the mistake of using base 2 floating point numbers for financial calculations is, you may be correct with the intention of your comment.
I'm aware of the fact that you don't use floating point math for finance -- for exactly the reason you described -- but the academic in me wonders if you could formally specify a high-enough degree of precision -- and all the corner cases -- to allow FP math for even just a subset of transactions. This would (in theory) allow programmers to bypass the Decimal classes in your favorite OO language (or GMP if you're a C fan).

Again, purely an academic inquiry :-)

My point was more that it is wrong to say that financial calculations should not be done using floating point formats, for example Decimal in .NET and BigDecimal in Java are floating point formats and they are the types you should use for financial calculations. The important difference as compared to formats like IEEE 754 binary32 (formerly single) and binary64 (formerly double) is that the representation is based on base 10 instead of base 2. Fixed point or floating point and base 2 or base 10 are two orthogonal choices.

So when you initially mentioned high precision floating point numbers for financial calculations that was not necessarily a bad idea because you might have thought about base 10 floating point numbers. The comment I replied to however assumed you meant base 2 which of course most people do if they say floating point numbers without specifying the base and which of course is a bad idea for financial calculations more often than not. I just pointed out that assuming base 2 is usually but not technically correct.

And you can of course use base 2 floating point numbers for financial calculations - 32 bit, 64 bit, or 4096 bit - you just have to keep track of the accumulated errors and stop or correct the result before the error grows into the digits you are interested in. But why would one want to do this? The only thing I can really think of is that you need maximum performance and there is no hardware support for base 10 floating point numbers. And just using integers as base 10 fixed point numbers, which would often be an even better solution, must not be an option.
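A short Python sketch of the two orthogonal choices (base 2 vs base 10, floating vs fixed point), using the standard `decimal` module as the base 10 floating point type. The `add_repeatedly` helper is made up for illustration; it uses a plain loop so that no compensated summation hides the error.

```python
from decimal import Decimal


def add_repeatedly(step, count, zero):
    """Add `step` to `zero` `count` times with a plain loop."""
    total = zero
    for _ in range(count):
        total += step
    return total


binary_total = add_repeatedly(0.1, 10, 0.0)                       # base 2 floating point
decimal_total = add_repeatedly(Decimal("0.1"), 10, Decimal("0"))  # base 10 floating point
cents_total = add_repeatedly(10, 10, 0)                           # base 10 fixed point (integer cents)

print(binary_total == 1.0)              # False: 0.1 has no exact base 2 representation
print(decimal_total == Decimal("1.0"))  # True
print(cents_total == 100)               # True

# Decimal is itself floating point: the exponent travels with each value.
print(Decimal("1.10").as_tuple().exponent, Decimal("1E+3").as_tuple().exponent)  # -2 3
```

The error in the binary total is tiny after ten additions, but it is the kind of drift that has to be tracked and corrected before it reaches the digits that matter.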

I don't know how you found your way onto the addition operator (+) on your keyboard, because that's not at all what I was driving at.

I think you are... JUMPING! TO CONCLUSIONS! (get it?)

Anyway, at its core, much of the logic within a Turing machine winds up being addition in an accumulator. So, you widen the pipeline, and that adds place settings to the numeric values addressed at a location in RAM.

I think we both know that each place setting increases the maximum value of the addressable unit by an exponential factor of the base, which in computing, and so in this instance, is binary.

Specifically: 2^4096 instead of 2^64

Golly, did I get my math right? This sure is difficult for me to understand!

Why would anyone want a 4096 bit CPU? Oh, I dunno. I suppose 640K ought to be enough for anyone.

I'm having a hard time trying to figure out if you are serious or if you are masterfully trolling everyone.
Good job, guys! Nice downvotes! Real nice!

I answered substantively, addressing each point carefully, and I was pleasantly rewarded for the time I took to respond.

Great incentive system you guys have worked out! Glad to see it being used as intended! Works like a charm!

You were likely down-voted for the snark with which you addressed the points.
Suppose you do a simple addition on your 4096bit "CPU", you have to propagate the carry from the first 64bits to the next 64. How do you do that within your clock cycle over the internet? You'd have to pipeline them so that each subsequent 64bit add waits for the previous carry, but then wouldn't it be orders of magnitude faster to just do it on the same CPU rather than taking the time and resources to do a single 64bit add followed by a high latency network transfer? At any rate what does clock synchronization buy you here exactly? Data transfers are still high-latency and high-jitter, so at best you're isochronous but definitely not synchronous.

Either I completely misunderstand what you're proposing or it doesn't make sense at all.
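For anyone following along, here is a minimal Python sketch of the carry chain being described: a 4096-bit add split into 64-bit limbs, where each limb needs the carry-out of the one before it. The function names are made up for illustration, and in the proposed service each loop iteration would sit on a different machine.

```python
LIMB_BITS = 64
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(value, n_limbs):
    """Split a non-negative integer into n_limbs 64-bit limbs, least significant first."""
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(n_limbs)]


def from_limbs(limbs):
    """Reassemble limbs (least significant first) into one integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def add_limbs(a_limbs, b_limbs):
    """Ripple-carry addition: limb i cannot finish until limb i-1 reports its carry."""
    result, carry = [], 0
    for a, b in zip(a_limbs, b_limbs):
        total = a + b + carry          # depends on the previous limb's carry-out
        result.append(total & LIMB_MASK)
        carry = total >> LIMB_BITS     # 0 or 1, handed to the next limb (or node)
    return result, carry


if __name__ == "__main__":
    n = 4096 // LIMB_BITS              # 64 limbs
    limbs, carry = add_limbs(to_limbs((1 << 4096) - 1, n), to_limbs(1, n))
    print(from_limbs(limbs), carry)    # 0 1
```

On one CPU the carry hand-off is a register read; between machines the same hand-off is a network round of its own.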

I’m not quite sure what GP is getting at, either, but I can sort of see the lockstep synchronization described letting you build something like the original Thinking Machines Connection Machine out of more distributed parts.

The original Cray supercomputers also benefitted from a design where every wire in the pipeline was the same length for “free” synchronization courtesy of the speed of light.

I don’t think you understand how memory bus width is calculated or what it means. You are an order of magnitude off on the layer in question.
An order of magnitude. Jeepers, that sounds really bad.
How would the math work on that? Simple addition now requires coordination of results across many CPUs. Worst case is N-1 ticks where N is the CPU count. What operation would get faster by such a virtual CPU?
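The N-1 figure can be checked with a toy model in Python. The model is an assumption, not anything specified in the thread: every node adds its own 64-bit limbs in parallel, then each coordination round moves a pending carry exactly one node over.

```python
def ticks_to_settle(a_limbs, b_limbs, limb_bits=64):
    """Count the carry-forwarding rounds needed after all nodes add locally in parallel.

    Assumed model: one round moves each pending carry exactly one node along.
    Limbs are ordered least significant first.
    """
    mask = (1 << limb_bits) - 1
    sums = [a + b for a, b in zip(a_limbs, b_limbs)]   # tick 0: every node adds locally
    carries = [s >> limb_bits for s in sums]           # carry-out of each node
    sums = [s & mask for s in sums]
    ticks = 0
    while any(carries[:-1]):                           # the last node's carry leaves the adder
        incoming = [0] + carries[:-1]                  # shift carries one node along
        sums = [s + c for s, c in zip(sums, incoming)]
        carries = [s >> limb_bits for s in sums]
        sums = [s & mask for s in sums]
        ticks += 1
    return ticks


if __name__ == "__main__":
    n = 64                                             # 64 nodes of 64 bits = 4096 bits
    all_ones = [(1 << 64) - 1] * n
    one = [1] + [0] * (n - 1)
    print(ticks_to_settle(all_ones, one))              # 63, i.e. N - 1
```

Adding 1 to an all-ones value is the worst case because the single carry has to visit every node in turn.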
Economy of scale, my dude.

An organization seeking to market a product based on any spare slack or wastage of their bare metal could stitch together a niche product like this from enough resources, and price it in the space where it nets them money, and is cheaper than something an individual or small business might be capable of building on their own, with the cheapest possible parts.

That's basically the core principle of every cloud product being sold.