Hacker News
by chaboud 2909 days ago
"...across geographic regions, such that one could treat a set of virtual machines as one ultra-wide-bus CPU with a 1 GHz clock speed."

I'm not entirely sure what you're trying to say here, but I am entirely sure that it's wrong.

A precise clock isn't the same thing as the removal of latency, and the operations of a CPU are ordered. That is, I can't start working on the multiplication of A * (B + C) until the addition result is available. Furthermore, if the elements of the operation, B and C, or parts of those elements, were separated by miles (or even feet), the latency of that operation would increase by orders of magnitude.

I doubt that even a 1 MHz distributed processor would be achievable as a large distributed bit-field computer of the kind you've laid out here.

If you're worried about overhead in computing, it is critical to remember that a foot is a nanosecond: light covers roughly one foot per nanosecond, and signals in wire or fiber are slower still. I'd much rather break my data down to register size (and I often do) than ship my data over a wire or fiber (which I also often do).
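To put rough numbers on that, here is a back-of-the-envelope sketch. The function names and the 2/3 velocity factor for fiber are my own assumptions for illustration, and it ignores switching, serialization, and everything else that makes real networks slower than the physics limit.

```python
# Back-of-the-envelope propagation delay. Names and the fiber velocity
# factor are illustrative assumptions, not measurements.
C_M_PER_S = 299_792_458   # speed of light in vacuum
FOOT_M = 0.3048           # one foot in meters


def one_way_delay_ns(distance_m, velocity_factor=1.0):
    """Best-case one-way signal delay in nanoseconds."""
    return distance_m / (C_M_PER_S * velocity_factor) * 1e9


def max_dependent_rate_hz(distance_m, velocity_factor=1.0):
    """Upper bound on back-to-back dependent operations when each one
    needs a full round trip before the next can start."""
    round_trip_s = 2 * distance_m / (C_M_PER_S * velocity_factor)
    return 1.0 / round_trip_s


for label, meters in [("1 foot", FOOT_M), ("100 m", 100.0), ("1000 km", 1_000_000.0)]:
    delay = one_way_delay_ns(meters, velocity_factor=2 / 3)  # assumed fiber factor
    rate = max_dependent_rate_hz(meters, velocity_factor=2 / 3)
    print(f"{label:>8}: {delay:14.2f} ns one way, <= {rate:16.1f} dependent ops/s")
```

Even in a vacuum, a 1000 km hop caps a chain of dependent round trips at about 150 per second, which is why a 1 MHz clock across regions already looks out of reach.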

1 comments

Actually, if you marshal all of your addressable units up front (4096-bit sentences instead of 64-bit words), which aligns well with raw allocation units on many file systems, then as an end user of the service, the overhead (to you) is reduced to network I/O, provided the product is built correctly.

The only hard part requiring serialized synchronization is the carry bit, across compute nodes. Share the carry bits between nodes. While relaying a sentence to a cluster of synchronized nodes, the pipeline can shoot the sentence into the cluster as a unit, then proxy and chain together the carry bits with a coordinated execution plan. On the other side of the pipe, you get your well-timed 4096-bit result, all at 1 GHz, because the service is designed and produced to handle input at nanosecond intervals.
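For concreteness, here is a minimal sketch of the carry chain being described, with each 64-bit limb standing in for one compute node. The limb layout and function names are assumptions for illustration; nothing here models the network or the proposed service.

```python
# 4096-bit addition split into 64-bit limbs, one limb per hypothetical node.
# The carry out of limb i is an input to limb i + 1, so this plain ripple
# scheme is serialized across all 64 limbs.
WORD_BITS = 64
SENTENCE_BITS = 4096
WORDS = SENTENCE_BITS // WORD_BITS  # 64 limbs
MASK = (1 << WORD_BITS) - 1


def split(n):
    """Break an integer into WORDS little-endian 64-bit limbs."""
    return [(n >> (WORD_BITS * i)) & MASK for i in range(WORDS)]


def join(limbs):
    """Reassemble little-endian limbs into one integer."""
    return sum(limb << (WORD_BITS * i) for i, limb in enumerate(limbs))


def ripple_add(a_limbs, b_limbs):
    """Add limb by limb, handing the carry from each node to the next."""
    carry = 0
    out = []
    for a, b in zip(a_limbs, b_limbs):
        total = a + b + carry        # cannot run until the previous carry arrives
        out.append(total & MASK)
        carry = total >> WORD_BITS   # 0 or 1, passed on to the next node
    return out, carry


# Worst case: a carry generated in limb 0 ripples through every limb.
limbs, carry_out = ripple_add(split((1 << SENTENCE_BITS) - 1), split(1))
print(join(limbs), carry_out)
```

Schemes such as carry-lookahead or carry-select shorten the dependency chain, but they still need carry information to move between nodes, so the inter-node latency does not go away.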

What are the advantages? Predictability, and expanded throughput.

Now you can look at an entire passage of text and make a determination about it in less time. Or stack many passages and composite them to assess or intuit variation. Designing the product this way makes it easy to reason about, and thus easier to market and sell. Is it possible to make a profitable system that works like this? Gee, great question! There's no obvious answer.

But anyway, from the perspective of a subscriber, it's on them to marshal their data, and then, if they have operations for which the scale of 4096-bit chunks improves results, they can get their granular operations done at 1 GHz, which allows them to predict time spent and overall cost more easily.

(e.g. I have all these [less-than-but-up-to] 4096-bit toots marshalled in a single data store, from a shit ton of Mastodon instances (I did all the crawling and retrieving, and saved them in one place as a standardized data set), and I think this fact might be true about some of them. Here is the rule set to interpret; please give me back the members of the toot array for which the function of this rule set returns true.)
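As a rough sketch of what that request could look like from the subscriber's side (everything here is hypothetical: the fixed 512-byte record layout, the zero padding, and the function names are assumptions, not a description of any real service):

```python
# Hypothetical subscriber-side view: pad each toot into a fixed 4096-bit
# (512-byte) record, then ask for the records that satisfy a rule.
RECORD_BYTES = 4096 // 8  # 512 bytes per "sentence"


def marshal(text):
    """Encode a toot as UTF-8 and zero-pad it to exactly one record."""
    raw = text.encode("utf-8")
    if len(raw) > RECORD_BYTES:
        raise ValueError("toot does not fit in one 4096-bit record")
    return raw.ljust(RECORD_BYTES, b"\x00")


def unmarshal(record):
    """Strip the zero padding and decode back to text."""
    return record.rstrip(b"\x00").decode("utf-8")


def select(records, rule):
    """Return the toots whose text makes the rule return True."""
    return [unmarshal(r) for r in records if rule(unmarshal(r))]


toots = ["clock sync is hard", "look at my lunch", "atomic clock in every rack"]
store = [marshal(t) for t in toots]
print(select(store, lambda text: "clock" in text))
```

Every record costs the same to ship and scan regardless of how short the toot is, which is the predictability being argued for and also the padding overhead being argued against.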

BTW, don't get hung up on 4096 as "the best number"; I just chose it because it's a nice square number.

"The only hard part requiring serialized synchronization is the carry bit, across compute nodes."

I don't think that's the only hard part. Branches, for instance, are rough.

"What are the advantages? Predictability, and expanded throughput."

I think the system you've described would definitely have some predictability, but I contend that it would be predictably slow. Furthermore, given that everything is going to have to be pipelined up to its eyeballs, you don't need nanosecond synchronization to achieve high throughput. Audio, for instance, often achieves higher throughput than its clock rate. Look at the AES MADI spec for an example of this (basic link at Wikipedia here: https://en.m.wikipedia.org/wiki/MADI ).
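A quick sketch of both halves of that point. The pipeline model is the textbook one (first result after the full latency, then one result per issue interval), and the channel count, word length, and sample rate are illustrative figures I picked for a MADI-style multiplexed link, not numbers quoted from the spec.

```python
# Throughput of a pipeline depends on the issue interval, not the latency.
def pipelined_throughput(n_items, latency_s, interval_s):
    """Items per second when a new item is issued every interval_s and each
    one takes latency_s to come out the far end."""
    total_s = latency_s + (n_items - 1) * interval_s
    return n_items / total_s


def multiplexed_bit_rate(channels, bits_per_sample, sample_rate_hz):
    """Payload bits per second carried by a multiplexed audio link."""
    return channels * bits_per_sample * sample_rate_hz


# A million items, 10 ms of latency, one item issued per microsecond:
print(pipelined_throughput(1_000_000, latency_s=0.010, interval_s=1e-6))

# Assumed figures: 64 channels of 24-bit audio against a 48 kHz sample clock.
print(multiplexed_bit_rate(64, 24, 48_000))
```

With those assumed figures the link moves tens of megabits per second of payload while the clock everyone synchronizes to ticks at only 48 kHz, and the pipeline sustains close to a million items per second despite milliseconds of latency.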

I'm just not seeing how this is practicable, or, more critically for this conversation, how it is uncorked by precision clocking in any particularly meaningful way. It strikes me as an approach that would have to deal with edge cases robustly, largely using the same mechanisms that would be necessary for imprecise clocking (but with assured sequencing).

"But anyway, from the perspective of a subscriber, it's on them to marshal their data, and then, if they have operations for which the scale of 4096-bit chunks improves results, they can get their granular operations done at 1 GHz, which allows them to predict time spent and overall cost more easily."

This strikes me as similar to the complexity sizing in Craig Gentry's fully homomorphic encryption system, in that all operation sets up to a configured encodable complexity require the same computational effort, which is effectively inefficient for smaller operations. For resisting timing attacks in cryptosystems, it actually seems reasonable to retain fixed effort, even if Gentry's original system was largely impractical.
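To illustrate the fixed-effort idea in its simplest form (this is a plain byte comparison, not anything from Gentry's scheme, and the step counters are only there to make the difference visible):

```python
import hmac


def early_exit_equal(a, b):
    """Stops at the first mismatch, so the work done leaks where it occurred.
    Returns (equal, steps_taken)."""
    if len(a) != len(b):
        return False, 0
    steps = 0
    for x, y in zip(a, b):
        steps += 1
        if x != y:
            return False, steps
    return True, steps


def fixed_effort_equal(a, b):
    """Always walks the full length, accumulating differences with XOR/OR.
    Returns (equal, steps_taken)."""
    if len(a) != len(b):
        return False, 0
    diff = 0
    steps = 0
    for x, y in zip(a, b):
        steps += 1
        diff |= x ^ y
    return diff == 0, steps


secret = b"0123456789abcdef"
guess = b"X123456789abcdef"
print(early_exit_equal(secret, guess))    # bails out after one byte
print(fixed_effort_equal(secret, guess))  # same work as a full match
print(hmac.compare_digest(secret, guess)) # the standard-library routine
```

A pure-Python loop is not genuinely constant-time, so in real code `hmac.compare_digest` is the one to reach for. The point is only that paying the worst-case cost every time is a deliberate trade in cryptography, and a much harder sell for general computation.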

For general computation? I think that the sweet spot between job chunking and dataset chunking for the system you've described may not actually exist.