Hacker News
by xodjmk 1493 days ago
Parallel: yes. As other people mentioned, this is almost the entire point of using FPGAs. Regarding asynchronous, it depends on what you mean. Xilinx (AMD) and Altera (Intel) FPGAs are designed from the bottom up to be synchronously clocked. The fabric and tools are designed to use synchronous pipeline registers everywhere, to minimize the combinational logic between registers and increase throughput. You might want a design with multiple asynchronous clock domains, but this increases complexity and requires care whenever a signal crosses between clock domains. Trying to force asynchronous design into an FPGA seems counterproductive. What would be the advantage of asynchronous design?

There are some advantages.

For example, for the Alpha AXP they measured that 60% of the energy spent in the device is due to clock propagation. No clocks to tick - no energy spent. Why do we even need to clock the FPU? Or the bus, if we are in a loop that fits in cache?

Another example: in an async design, a ripple-carry adder exhibits O(log(N)) expected time, with the worst case being O(N). Most of the time it will be even less: O(log(L)), where L is the number of non-zero bits. Basically, adding 1 will be as fast as, well, doing AND and XOR in parallel. A clocked design needs a more complicated adder to guarantee that the worst case is O(log(N)).
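A quick way to sanity-check the expected-time claim is to measure how far carries actually travel for random operands. The sketch below is only a software model under my own assumptions (uniformly random inputs, completion time proportional to the longest carry chain); the function names are made up for illustration.

```python
import random

def longest_carry_chain(a, b, n):
    """Longest distance (in bit positions) that any carry travels in a + b."""
    longest = 0
    run = 0  # length of the carry chain currently in flight (0 = no carry)
    for i in range(n):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        if ai & bi:
            run = 1              # generate: a new carry starts at this bit
        elif (ai ^ bi) and run:
            run += 1             # propagate: incoming carry ripples one stage on
        else:
            run = 0              # kill, or no carry to propagate
        longest = max(longest, run)
    return longest

def average_chain(n, trials=2000, seed=0):
    """Average longest carry chain over random n-bit operand pairs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += longest_carry_chain(rng.getrandbits(n), rng.getrandbits(n), n)
    return total / trials

if __name__ == "__main__":
    for n in (8, 16, 32, 64, 128):
        print(n, average_chain(n))
```

The average should grow roughly with log2(N), while an input such as 0xFF + 1 still hits the full-length worst case.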

The same is true for other parts as well: a multiplier may not even need to wait for values that are multiplied by zero bits. You may end up with an O(log(N)) or even faster average-case multiplier.
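To make the "zero bits cost nothing" idea concrete, here is a minimal sketch using a sequential shift-and-add multiplier. That is my own stand-in, not a model of any real asynchronous multiplier; it only shows that the amount of work can depend on the data.

```python
def shift_add_multiply(a, b):
    """Multiply non-negative integers; return (product, additions performed)."""
    product = 0
    additions = 0
    shift = 0
    while b:
        if b & 1:
            # Only a set bit in the multiplier contributes a partial product.
            product += a << shift
            additions += 1
        # A zero bit is skipped without doing any addition.
        b >>= 1
        shift += 1
    return product, additions
```

Multiplying by 8 (a single set bit) takes one addition here, while multiplying by 255 takes eight.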

Your design does not need strict adherence to timing requirements: if you have a seldom-used slow part, your chip would still work fast most of the time (on average). I know of one case where the clock frequency of a synchronous design had to be turned down because of problems in the placement of some, you guessed it, infrequently used part of the chip: a long bus line to some I/O controller that operated at the main clock frequency. This means your asynchronous design can be more modular.
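The average-versus-worst-case argument comes down to simple arithmetic. A rough sketch, with made-up path delays and usage probabilities, and ignoring handshake overhead on the async side as well as multi-cycle paths on the sync side:

```python
def sync_time_per_op(paths):
    """Clocked design: the period is set by the slowest path, used or not."""
    return max(delay for _, delay in paths)

def async_time_per_op(paths):
    """Self-timed design: each operation takes as long as the path it uses."""
    return sum(prob * delay for prob, delay in paths)

# Hypothetical numbers: (probability the path is used, delay in ns).
PATHS = [(0.99, 1.0), (0.01, 5.0)]

if __name__ == "__main__":
    print("sync :", sync_time_per_op(PATHS))
    print("async:", async_time_per_op(PATHS))
```

With these assumed numbers the rarely used 5 ns path dictates the clock period for everything, while the self-timed average stays close to 1 ns.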

> No clocks to tick - no energy spent

If only it were that simple. Logic gates take time to settle, and each input gate switch or transient will have a ripple effect on all its downstream gates, which can be many in a complex circuit. Synchronous logic elements such as latches will block the spurious transients from propagating beyond the next clock barrier, but if you lack those, you also lose the protection against propagating logic transients. And every transient draws a little bit of power.

Imagine the ripple effects of a 64-bit 2-operand multiplier (simple ripple-carry, as it's the easiest to reason about). Since the inputs are probably not gated either, each of the 4096 adder tree inputs may arrive at a different time, and each input has an average of 96 downstream gates (64/2 adder tree height, 128/2 carry propagation length). The carry propagation is done through and-gates, which have an attenuating effect on the propagation length (each input bit flip only has a 50% chance of propagating the change), but the xor-gates for the adder propagate every transient. On average, you still get 64 transients per adder input transient, and 2048 (64 and-gates * 50% * 64 transients each) transients for every operand bit flip. That's a lot to account for in your worst-case power envelope.
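For anyone who wants to follow the arithmetic, here is one reading of those figures as a back-of-the-envelope calculation. The way 2048 is derived (64 and-gates per operand bit, half of them passing the flip on, 64 transients each) is my assumption about what the comment means, and the function name is made up.

```python
def transient_estimate(bits=64, propagate_chance=0.5):
    """Reproduce the comment's rough figures for a bits x bits multiplier."""
    adder_inputs = bits * bits                     # 64 x 64 partial products
    tree_height = bits // 2                        # the comment's "64/2"
    carry_length = (2 * bits) // 2                 # the comment's "128/2"
    downstream_gates = tree_height + carry_length
    # Comment's estimate: about `bits` transients per adder-input transient.
    per_adder_input = bits
    # Assumed reading: each operand bit feeds `bits` and-gates, and each one
    # passes the flip on with probability `propagate_chance`.
    per_operand_flip = int(bits * propagate_chance * per_adder_input)
    return {
        "adder_inputs": adder_inputs,
        "downstream_gates": downstream_gates,
        "per_adder_input": per_adder_input,
        "per_operand_flip": per_operand_flip,
    }
```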

Yes, asynchronous designs are more flexible to work with. But they are less predictable in operation, not just in propagation delay but also in power usage. And you still need some form of inter-module communication, and that communication needs to account for differences in signal path length -- which is much easier to do if you can refer to a global clock.

I'm sure there have been successful asynchronous designs for specific applications (e.g. analog feedback control loops), and I haven't kept up with the last ten years of IC development which is a lifetime, but most asynchronous logic designs weren't necessarily faster than their synchronous implementations last time I checked.

Contemporary intermodule designs are pipelined and message-oriented exactly because it is hard to predict difference in signal path length for long paths. I am talking about high speed buses from ARM, I think I read about them in 2016 or so.

The same can be done with asynchronous designs, in more relaxed way.

You said that asynchronous designs are less predictable in their use of power. Can you elaborate on that?

> The same can be done with asynchronous designs, in more relaxed way.

Sure, just ask these guys:

https://chronostech.com/technology

Chronos Link: A QDI Interconnect for Modern SoCs https://ieeexplore.ieee.org/document/9179196

It's compatible with TileLink, which is SiFive's Fabric. https://bar.eecs.berkeley.edu/projects/tilelink.html

Another advantage is higher yield due to higher tolerance of production defects.

This implies yield loss is mostly due to small delay defects and not stuck-at faults. Are you sure this is the case?

> What would be the advantage of asynchronous design?

Just the regular advantages, only with FPGA, which means one can choose how logical elements are interconnected, and what's the logic of the chip. Among regular advantages are absence of clocks (less devices, no need to synchronize...) and energy is used when and where the switching happens.

A friend of mine unsuccessfully tried to squeeze asynchronous designs into a mainstream FPGA a few years ago. The tooling wasn't cooperative, and when he used workarounds to avoid generating clocks, it simply crashed. I don't think it's useless or for lack of trying, but asynchronous circuits in FPGAs are certainly not common.

> choose how logical elements are interconnected

On the RTL level, you can already do that with FPGAs. On the physical level, you can't do that with an asynchronous design either.

> absence of clocks (less devices

The clocks are still there physically and consume space, even if you don't use them.

> no need to synchronize

Synchronization becomes very easy when the clocks are aligned and the frequencies are multiples of each other. FPGAs have delay elements in the clock blocks to help with the alignment.

> energy is used when and where the switching happens.

There are several points of energy use:
* the clock network -- you are right about this. Does anyone know how much of the total energy use goes into the clock network?
* registers and downstream logic -- behaves the same, whether synchronously or asynchronously. A register that doesn't "flip" will not consume energy for that, and the downstream logic will not flip either.
* whatever the asynchronous logic needs for coordination -- don't forget that this is not for free.

Analyze energy consumption first before jumping to conclusions or even measures. The whole energy topic reeks of premature optimization.

I think I see where you are coming from. There certainly could be some power reduction if you limit the amount of switching. It's just combinational logic, so there shouldn't be any tool issues. The real challenge would be in verification. Usually, there are timing constraints, and the tools analyze your design against them and attempt to guarantee that everything will work across worst-case temperature and process variations. Without timing verification there would be a lot of uncertainty in the actual path delays. So just because it might work on one device that was tested, this wouldn't guarantee that the design would work consistently. There would be a ton of glitches and phantom pulses to contend with, and every time you change something the routing delays will change! But maybe you have a method to deal with this.

The combinational part of an async design is built to be self-synchronous: you derive the clock signal that writes the computed value from the computed signal itself.

The combinational part is also synthesized as a monotone function without ringing: voltages there never go down after they went up during compute, and they never go down, then up, then down again when the computation is reset.

This means that timing guarantees can be local, related only to parts next to concrete registers.
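One common way to get both properties is dual-rail encoding with completion detection. The sketch below assumes that scheme (the comment does not name it) and models it in software only: every bit is a pair of rails, (0, 0) is the reset "null" state, and the gates use only AND/OR, so outputs can only rise while inputs rise.

```python
NULL = (0, 0)  # reset state: neither rail is asserted yet

def encode(bit):
    """Dual-rail encoding: (true_rail, false_rail)."""
    return (1, 0) if bit else (0, 1)

def decode(pair):
    """Decode a valid dual-rail pair back into 0 or 1."""
    assert pair in ((1, 0), (0, 1)), "value not valid yet"
    return pair[0]

def dr_not(a):
    return (a[1], a[0])               # swap rails, no inverter needed

def dr_and(a, b):
    return (a[0] & b[0], a[1] | b[1])

def dr_or(a, b):
    return (a[0] | b[0], a[1] & b[1])

def dr_xor(a, b):
    return dr_or(dr_and(a, dr_not(b)), dr_and(dr_not(a), b))

def half_adder(a, b):
    """Return [sum, carry] as dual-rail pairs."""
    return [dr_xor(a, b), dr_and(a, b)]

def is_complete(word):
    """Completion detection: every bit has exactly one rail asserted.
    This is the locally derived 'clock' that says the result may be latched."""
    return all(t ^ f for t, f in word)
```

With one input still at NULL, `is_complete(half_adder(encode(1), NULL))` is false; once both inputs are valid the outputs become valid and the completion signal rises, without reference to any global clock.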

Usually, asynchronously designed chips work in the first batch. They also often work when underpowered, with the supply voltage slightly below the nominal switching voltage, because that voltage is set for a typical transistor to switch at the speed needed. Async designs are usually much less speed-dependent and can work while "officially underpowered".

Yes, makes sense. I can see how that could be beneficial in some situations.

You can clock FPGA flip-flops from non-clock signals in Xilinx/AMD FPGAs. I'm not sure how well it scales, but it's possible.