by fpgaminer 3487 days ago
These FPGAs are absolutely _massive_ (in terms of available resources). AWS isn't messing around.

To put things into practical perspective, my company sells an FPGA-based solution that applies our video enhancement technology in real time to any video stream up to 1080p60 (our consumer product handles HDMI in and out). It's a world-class algorithm with complex calculations, generating 3D information and saliency maps on the fly. I crammed that beast into a Cyclone 4 with 40K LEs.

It's hard to translate the "System Logic Cells" metric that Xilinx uses to measure these FPGAs, but a pessimistic calculation puts it at about 1.1 million LEs. That's over 27 times the logic my real-time video enhancement algorithm uses. With just one of these FPGAs we could run our algorithm on six 4K60 4:4:4 streams at once. That's insane.
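The arithmetic behind those two figures can be sketched as follows. This is only a back-of-the-envelope check, assuming logic usage scales linearly with pixel rate and that a 4K60 stream costs four times a 1080p60 stream (3840x2160 versus 1920x1080); the function names are made up for illustration.

```python
# Figures quoted in the comment above.
CYCLONE4_LES = 40_000        # logic used by the 1080p60 design
ESTIMATED_LES = 1_100_000    # pessimistic LE estimate for one AWS FPGA


def logic_ratio(available_les: int, used_les: int) -> float:
    """How many copies of the design fit, ignoring routing and timing."""
    return available_les / used_les


def streams_supported(ratio: float, stream_pixels: int, base_pixels: int) -> int:
    """Whole streams that fit, assuming cost scales with pixels per frame."""
    cost_per_stream = stream_pixels / base_pixels
    return int(ratio // cost_per_stream)


ratio = logic_ratio(ESTIMATED_LES, CYCLONE4_LES)
uhd_pixels = 3840 * 2160
fhd_pixels = 1920 * 1080

print(ratio)                                             # 27.5
print(streams_supported(ratio, uhd_pixels, fhd_pixels))  # 6
```

In practice routing congestion, memory bandwidth, and timing closure would likely eat into that number, so treat it as an upper bound.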

For another estimation, my rough calculations show that each FPGA would be able to do about 7 GH/s mining Bitcoin. Not an impressive figure by today's standards, but back when FPGA mining was a thing the best I ever got out of an FPGA was 500 MH/s per chip (on commercially viable devices).

I'm very curious what Amazon is going to charge for these instances. FPGAs of that size are incredibly expensive (5 figures each). Xilinx no doubt gave them a special deal, in exchange for the opportunity to participate in what could be a very large market. AWS has the potential to push a lot of volume for FPGAs that traditionally had very poor volume. IntelFPGA will no doubt fight exceptionally hard to win business from Azure or Google Cloud.

* Take all these estimates with a grain of salt. Most recent "advancements" in FPGA density are the result of using tricky architectures. FPGAs today are still homogeneous logic, but don't tend to be as fine grained as they were. In other words, they're basically moving from RISC to CISC. So it's always up in the air how well all the logic cells can be utilized for a given algorithm.


Any thoughts on why AWS/Xilinx didn't go for a mid-range FPGA to help validate customer requirements?

My guess is that Amazon will have to be very careful not to price themselves out of the market for mid-range Deep Learning based cloud apps.

Wild guesstimate, but I think it'll cost more than $20/hr for each instance.

Based on my speculation, and to make a long analysis short: fewer, bigger FPGAs are better in the cloud from a user experience perspective than more, smaller FPGAs. The big applications are all going to consume as much FPGA fabric as they can (machine learning, data analysis, etc). Even "mid-range" Deep Learning will consume these FPGAs like candy. Non-deep learning will too; they can always just go more parallel and get the job done faster.

Amazon is betting on the fact that they can get better pricing than anyone else. They probably can. No one else will be buying these FPGAs in quantities Amazon will if these instances become popular (within their niche). So for the medium sized players it'll be cheaper to rent the FPGAs from Amazon, even with the AWS markup, than to buy the boards themselves. Especially for dynamic workloads where you're saving money by renting instead of owning (which is generally the advantage of cloud resources).

That's my guess anyway.
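A rough way to frame the rent-versus-buy argument is a break-even calculation. The sketch below is only illustrative: the $20/hr rate is the guess from upthread, the $50,000 board price is a hypothetical stand-in for "5 figures", and power, hosting, and engineering costs are ignored entirely.

```python
def break_even_hours(board_cost_usd: float, hourly_rate_usd: float) -> float:
    """Hours of rental that cost as much as buying the board outright."""
    return board_cost_usd / hourly_rate_usd


def cheaper_to_rent(board_cost_usd: float, hourly_rate_usd: float,
                    hours_needed: float) -> bool:
    """True when renting for hours_needed costs less than the board."""
    return hours_needed * hourly_rate_usd < board_cost_usd


HOURLY_RATE_USD = 20.0     # guess from upthread, not a real AWS price
BOARD_COST_USD = 50_000.0  # hypothetical "5 figures" price

hours = break_even_hours(BOARD_COST_USD, HOURLY_RATE_USD)
print(hours)                                                    # 2500.0
print(cheaper_to_rent(BOARD_COST_USD, HOURLY_RATE_USD, 1_000))  # True
print(cheaper_to_rent(BOARD_COST_USD, HOURLY_RATE_USD, 3_000))  # False
```

With these made-up numbers, bursty workloads that need well under 2,500 FPGA-hours come out ahead by renting, which is the dynamic-workload case described above.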

It would not be inconceivable that Amazon just buys Xilinx (before someone else does).

Thank you so much for these posts, fpgaminer. They've been extremely helpful to me in framing how these things could be used.

Once upon a time I thought seriously about going into hardware design. I took a couple of different courses in college (over 10 years ago now... sigh) dealing with VHDL and/or Verilog and absolutely loved it. If not for a chance encounter with web programming during my co-op, my career would have been entirely different. With AWS offering this in the cloud, if it is not prohibitively expensive, I'll be looking into toying with it and hopefully discovering uses for it in my work.

What can each one of those 2.5 million "logic elements" do? Last time I used an FPGA, they were mostly made up of 4-input LUTs.

How many NOT operations can this do per cycle (and per second)? I realise FPGAs aren't the most suited for this, but the raw number is useful when thinking about how much better the FPGA is compared to a GPU for simple ops.

The 2.5 million number quoted in the article is "System Logic Cells", not Logic Elements. Near as I can tell, since I haven't kept pace with Xilinx since their 7 series, a "System Logic Cell" is some strange fabricated metric which is arrived at by taking the number of LUTs in the device and multiplying by ~2. In other words, there is no such thing as a System Logic Cell, it's just a translucent number.

Anyway, the FPGAs being used here are, I believe, based on a 6-LUT (6 input, 2 output). So you'd get about 1.25 million 6-LUTs to work with, and some combination of MUXes, flip-flops, distributed RAM, block RAM, DSP blocks, etc.

Supposing Xilinx isn't doing any trickery and you really can use all those LUTs freely, then you'd be able to cram ~2.5 million binary NOTs into the thing (2 NOTs per LUT, since they're two output LUTs). So 2.5 million NOTs per cycle. I don't know what speed it'd run at for such a simple operation. Their mid-range 7 series FPGAs were able to do 32-bit additions plus a little extra logic, at ~450 MHz and consume 16 LUTs for each adder.
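To turn that into a per-second number, here is a small sketch of the same estimate. The ~2x cells-per-LUT factor and the two-outputs-per-LUT assumption come from the comment above; the 450 MHz clock is purely a placeholder borrowed from the 7 series adder figure, since the actual achievable clock for this part is unknown.

```python
SYSTEM_LOGIC_CELLS = 2_500_000  # marketing figure quoted in the article
CELLS_PER_LUT = 2               # approximate factor guessed in the comment
OUTPUTS_PER_LUT = 2             # each LUT treated as two-output


def lut_count(system_logic_cells: int, cells_per_lut: int = CELLS_PER_LUT) -> int:
    """Estimated physical LUTs behind the System Logic Cell figure."""
    return system_logic_cells // cells_per_lut


def nots_per_cycle(luts: int, outputs_per_lut: int = OUTPUTS_PER_LUT) -> int:
    """One NOT per LUT output, assuming every LUT is freely usable."""
    return luts * outputs_per_lut


def nots_per_second(per_cycle: int, clock_hz: int) -> int:
    """Throughput if every NOT result is registered once per clock."""
    return per_cycle * clock_hz


ASSUMED_CLOCK_HZ = 450_000_000  # hypothetical clock, not a measured value

luts = lut_count(SYSTEM_LOGIC_CELLS)
per_cycle = nots_per_cycle(luts)
per_second = nots_per_second(per_cycle, ASSUMED_CLOCK_HZ)

print(luts)        # 1250000
print(per_cycle)   # 2500000
print(per_second)  # 1125000000000000
```

That works out to roughly 1.1e15 NOTs per second under these assumptions, before accounting for the cost of getting data on and off the chip.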

6-input, 1-output or 5-input, 2-output. They're implemented as a 5-input, 2-output LUT with a bypassable 2:1 mux on the output.

The metrics have gotten pretty opaque since the old days when an FPGA was a "sea of LUTs" all alike; modern ones include a ton of (semi-)fixed function hardware like multiply-accumulate blocks and embedded dual-port RAM. Even the LUTs themselves can be reprogrammed into small RAM blocks or shift registers, so counting "logic elements" is mostly a marketing exercise.

While yes, the architectures have become more "CISC-like", they aren't particularly convoluted or opaque. It's pretty easy to describe the architectures and come up with numbers for them. Xilinx could literally just say, "1 million 6-to-2 LUTs" and that would be entirely transparent and helpful.

So it's not so much changes in architecture that have given rise to the translucency of these numbers. It's a measuring contest between Xilinx and IntelFPGA, who believe you need to present bigger numbers in marketing material to win engineers. I can't speak for other FPGA engineers, but personally it just frustrates me and wastes my time. I don't ever take those numbers at face value, and I wouldn't hire anyone who did. Xilinx is the worst offender here. At least IntelFPGA will often quote their parts both in transparent terms (# of ALMs) and useful comparisons (# of equivalent LEs). I've never seen them pull a completely made-up "System Logic Cell" out of thin air.
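For anyone wondering what "6-input, 1-output or 5-input, 2-output" means in practice, here is a simplified software model of the LUT described earlier in the thread: two 32-entry truth tables sharing five inputs, with a 2:1 mux driven by the sixth input. The class and method names are invented for illustration, and this is a sketch of the idea rather than the vendor's exact primitive.

```python
from typing import Callable, Tuple


def truth_table(fn: Callable[[int], int], n_inputs: int = 5) -> int:
    """Pack fn(addr) for every input combination into one integer."""
    return sum((fn(addr) & 1) << addr for addr in range(2 ** n_inputs))


class FracturableLUT:
    """Two 5-input tables with shared inputs, plus a 2:1 output mux."""

    def __init__(self, init: int):
        if not 0 <= init < 2 ** 64:
            raise ValueError("init must fit in 64 bits")
        self.lower = init & 0xFFFFFFFF  # table used when mux select is 0
        self.upper = init >> 32         # table used when mux select is 1

    @staticmethod
    def _lookup(table: int, addr: int) -> int:
        return (table >> addr) & 1

    def as_lut6(self, inputs: int) -> int:
        """6-input, 1-output mode: bit 5 of inputs drives the mux."""
        select = (inputs >> 5) & 1
        table = self.upper if select else self.lower
        return self._lookup(table, inputs & 0x1F)

    def as_dual_lut5(self, inputs: int) -> Tuple[int, int]:
        """5-input, 2-output mode: mux bypassed, both tables visible."""
        addr = inputs & 0x1F
        return self._lookup(self.lower, addr), self._lookup(self.upper, addr)


# Two NOT gates in one LUT: the lower table inverts input bit 0,
# the upper table inverts input bit 1.
not_bit0 = truth_table(lambda a: 1 - (a & 1))
not_bit1 = truth_table(lambda a: 1 - ((a >> 1) & 1))
lut = FracturableLUT((not_bit1 << 32) | not_bit0)

print(lut.as_dual_lut5(0b00001))  # (0, 1)
print(lut.as_dual_lut5(0b00010))  # (1, 0)
```

Note that the two outputs share the same five inputs, which is one reason a "2 functions per LUT" count tends to be optimistic for real designs.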