A Cloud-Scale Acceleration Architecture: FPGAs in Microsoft’s Datacenters

CalChris · on Oct 21, 2016

TL;DR

  Microsoft uses FPGAs. Google uses ASICs. MapD uses GPUs.
  FPGAs are more flexible. ASICs are faster.

I think the FPGA approach is generally more useful and will trickle out into the marketplace. Indeed they are already available. So this article is more about Microsoft's fabric.

  https://www.mrcy.com/products/ensemble-fcn8213-server-class-fpga-processing-blade-advancedtca/

It's not impossible to think this would make its way into an Azure cloud product. That won't happen with ASICs. But Google will clearly have an advantage on the problems they focus on.

Cyph0n · on Oct 22, 2016

FPGAs are already powering key parts of both Azure and Bing. I asked Doug Burger after his presentation two weeks back at my university, and he said that currently the fabric is not reconfigurable. In other words, hardware acceleration is not available to customers yet, but it looks like they're working towards that.

A nice little tidbit from the talk. They were able to translate all of Wikipedia from English to Russian using 90% of the deployed FPGAs in 0.1 seconds.

DannyBee · on Oct 22, 2016

If that's the TL;DR, it doesn't make a lot of sense.

Microsoft/Google/whoever is going to use what makes the most sense for a given set of applications/etc. they have, whether it's CPUs, GPUs, FPGAs, ASICs, or massive armies of intelligent pigeons.

I'm not sure why one would think all of them don't do all of them :)

(in fact, i'd pretty much guarantee it)

mycall · on Oct 22, 2016

> I think the FPGA approach is generally more useful

Intel agrees, judging by its purchase of Altera.

jackyinger · on Oct 22, 2016

The reuse of their network infrastructure with this approach is elegant, BUT it takes about half of the FPGA to support the network pass through and lightweight transport layer.

FPGAs are about as expensive as high-end CPUs. Can you afford to buy a high-end server and burn half the cores on your OS?

On the flip side, if Microsoft can really eat that cost and offer FPGA space as a public cloud service, as an FPGA dev, I'd rather pay to play than bust 5K on a high end dev board.

astrodust · on Oct 22, 2016

FPGAs are really expensive today, but I think we're starting to see a real shift in production volumes, especially with Intel jumping in, which might rapidly cut costs on them.

Exciting opportunities ahead for those that can take advantage of hardware like this.

jackyinger · on Oct 22, 2016

It certainly is exciting. Tech that began by replacing discrete logic ICs (TI 7400 series) is getting powerful enough to pack a computing punch.

Look up systolic array architectures for matrix multiplication if you're not already familiar. They play nicely with FPGAs (which are design-space limited in comparison to ASICs), and they can efficiently implement algorithms that are prone to combinatorial explosion by amortizing expensive off-chip memory accesses over multiple operations as the data passes through the systolic array.

If the FPGA manufacturers want to open the advantages to all comers, they'd make their design software free (as in beer, not freedom) and pay for it with silicon sales.
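For anyone who hasn't seen one, here is a rough software sketch of the idea, not anything from the paper. It simulates an output-stationary systolic array cycle by cycle: operands enter only at the left and top edges, then hop from one processing element to the next, so each value fetched from "off-chip" memory is reused across a whole row or column. The function name `systolic_matmul` and the `edge_loads` counter are made up for illustration.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing A @ B.

    Returns (C, edge_loads), where edge_loads counts how many operand
    values were fetched from "off-chip" memory at the array edges.
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # one accumulator per processing element (PE)
    a_reg = [[0] * m for _ in range(n)]  # operand each PE passes to its right neighbour
    b_reg = [[0] * m for _ in range(n)]  # operand each PE passes to the PE below
    edge_loads = 0

    for t in range(n + m + k - 2):       # enough cycles for the last operands to meet
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                # A operand: from the left neighbour, or from memory at the left edge.
                if j > 0:
                    a_in = a_reg[i][j - 1]
                elif 0 <= t - i < k:     # row i is skewed (delayed) by i cycles
                    a_in = A[i][t - i]
                    edge_loads += 1
                else:
                    a_in = 0
                # B operand: from the neighbour above, or from memory at the top edge.
                if i > 0:
                    b_in = b_reg[i - 1][j]
                elif 0 <= t - j < k:     # column j is skewed (delayed) by j cycles
                    b_in = B[t - j][j]
                    edge_loads += 1
                else:
                    b_in = 0
                acc[i][j] += a_in * b_in  # multiply-accumulate inside the PE
                new_a[i][j] = a_in
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b       # all registers update on the clock edge
    return acc, edge_loads


if __name__ == "__main__":
    C, loads = systolic_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])
    print(C, loads)
```

For this 2x3 by 3x2 case the array fetches each operand once (12 loads), whereas a naive triple loop reads two operands per multiply-accumulate (24 reads). The gap grows with the array size.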

dom0 · on Oct 22, 2016

> It certainly is exciting. Tech that began by replacing discrete logic ICs (TI 7400 series) is getting powerful enough to pack a computing punch.

<nitpick> Those were PALs, basically fuse-programmed logic-in-tables. Even a single of the later CPLDs could already replace a board or two filled with SSI logic. FPGAs took that to replacing an entire cabinet.

trhway · on Oct 21, 2016

Looks like a new class of hardware component is emerging: a NIC with an additional FPGA chip. Or maybe a NIC which has only an FPGA, one larger than the NIC's code alone would require. Or just an FPGA card with NIC ports on it.

deadgrey19 · on Oct 21, 2016

The company I work for builds exactly that: NICs which are actually FPGAs with more area than is necessary to build the NIC alone. Our customers then either just use the NIC or add custom firmware for their own applications.

http://exablaze.com/exanic-x10

http://exablaze.com/exanic-x40

We also build switches this way:

http://exablaze.com/exalink-fusion

dom0 · on Oct 22, 2016

Marketing hint: Don't use renders of hardware as hero images if you already have the real hardware (as the video below shows). Using a render always gives the impression that you haven't built it yet.

deadgrey19 · on Oct 22, 2016

Ha. Thanks yes. Too busy selling stuff to get around to fixing the website. It's on the list of things to do...

kijiki · on Oct 21, 2016

NetFPGA. Been around in various revisions since 2007.

https://en.wikipedia.org/wiki/NetFPGA

deadgrey19 · on Oct 22, 2016

NetFPGA is a great development board with a lot of features. Unfortunately, its size and research-quality software/firmware make it unsuitable for deployment at scale in production environments.

xellisx · on Oct 22, 2016

I see they haven't done 25 Gb / 100 Gb stuff yet.

pyvpx · on Oct 22, 2016

you can get a nice "SmartNIC" from Silicom that has a Tile Gx-72 and a Virtex FPGA on-board. I have no idea how much they cost; only that you can't buy just one :(

http://www.silicom-usa.com/pr/server-adapters/programmable-f...

ddorian43 · on Oct 21, 2016

So what algorithms/data structures/DBs do they use the FPGAs for?

algorithmsRcool · on Oct 21, 2016

They are using the FPGAs for specializing network hardware on the fly. This lets them decrease latency across their network as well as allow centralized management.

Edit:

It looks like they are also allowing applications to use the FPGA-FPGA communication layer to reduce latency for distributed systems.

> "All network traffic is routed through the FPGA, allowing it to accelerate high-bandwidth network flows. An independent PCIe connection to the host CPUs is also provided, allowing the FPGA to be used as a local compute accelerator..."

> "By enabling the FPGAs to generate and consume their own networking packets independent of the hosts, each and every FPGA in the datacenter can reach every other one (at a scale of hundreds of thousands) in a small number of microseconds, without any intervening software. This capability allows hosts to use remote FPGAs for acceleration with low latency, improving the economics of the accelerator deployment, as hosts running services that do not use their local FPGAs can donate them to a global pool and extract value which would otherwise be stranded."
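To make the "donate them to a global pool" idea concrete, here is a toy sketch. It is purely illustrative: the class `FpgaPool` and its methods are hypothetical names, and the paper's actual resource manager is not described here. Hosts that do not need their local FPGA hand it to the pool, and other hosts lease one remotely.

```python
class FpgaPool:
    """Toy model of a global pool of donated FPGAs (hypothetical API)."""

    def __init__(self):
        self._idle = []     # FPGAs donated by hosts that do not use them
        self._leases = {}   # fpga_id -> host currently borrowing it

    def donate(self, fpga_id):
        """A host hands its unused local FPGA to the pool."""
        self._idle.append(fpga_id)

    def acquire(self, borrower):
        """Lease an idle FPGA to `borrower`; return None if the pool is empty."""
        if not self._idle:
            return None
        fpga_id = self._idle.pop(0)
        self._leases[fpga_id] = borrower
        return fpga_id

    def release(self, fpga_id):
        """Return a leased FPGA to the idle list."""
        del self._leases[fpga_id]
        self._idle.append(fpga_id)


if __name__ == "__main__":
    pool = FpgaPool()
    pool.donate("fpga-on-host-a")           # host A runs a service with no FPGA use
    leased = pool.acquire("host-b")         # host B borrows it over the network
    print(leased)
    pool.release(leased)
```

A real scheduler would also have to deal with failures, locality, and reconfiguration time, none of which this sketch attempts.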

pliu · on Oct 22, 2016

If you read the PDF, it seems the FPGAs can be used in a number of ways:

1) As a local co-processor, they have been deployed to accelerate selected Bing search ranking operations. This is described a little bit in the white paper, but I think the most interesting parts are going to be proprietary.

2) As a network accelerator, because they are inline with the server network interface. In the white paper using the FPGA for crypto acceleration is described. This is cool. The CPU overhead of encrypting 40Gbps is significant and managing keys for this many hosts is probably non trivial. Moving encryption to another layer would simplify things significantly I think, let app developers not think about it and let datacenter peeps manage keys independently of whatever is going on in the host.

3) As a large scale distributed processing grid. Because the FPGAs have their own network interfaces, they can be used independently of the host. Regular host network traffic gets passed through unaffected while the FPGA is simultaneously running distributed compute tasks from the grid. The white paper describes training a DNN, but also describes their Hardware As A Service delivery model, meaning developers, I guess, have access to the grid and can deploy whatever they want. I would guess there are a lot of machine learning and MapReduce-type tasks going on here, but who knows. The white paper also tantalizingly contains the phrase "cross-datacenter", which implies a globally distributed network of these things. Rad.
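To put the crypto point from item 2 in rough numbers, here is a back-of-the-envelope sketch. The cycles-per-byte cost and the clock speed are assumptions I picked for illustration, not figures from the white paper; plug in your own.

```python
def cores_needed(line_rate_gbps, cycles_per_byte, clock_ghz):
    """Estimate how many CPU cores are consumed encrypting at line rate.

    All three inputs are assumptions supplied by the caller.
    """
    bytes_per_second = line_rate_gbps * 1e9 / 8          # bits -> bytes
    cycles_per_second = bytes_per_second * cycles_per_byte
    return cycles_per_second / (clock_ghz * 1e9)          # cores' worth of cycles


if __name__ == "__main__":
    # Assumed: 1 cycle/byte cipher cost and a 2.5 GHz core, one direction only.
    print(cores_needed(40, 1.0, 2.5))
```

Under those assumed numbers, 40Gbps in one direction eats about two cores' worth of cycles. Full duplex or a more expensive cipher suite scales that up linearly.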

So that's a long way to say that I think there are very many use cases for this kind of thing. Coupling the FPGAs with regular compute nodes is operationally beneficial I think - they can be co-processors or they can be part of the grid. Microsoft is doing some pretty cool shit.

Edit: This whole thing is called Project Catapult. There are some other good papers here https://www.microsoft.com/en-us/research/project/project-cat...

user5994461 · on Oct 22, 2016

> [From the article] In this paper we propose a new cloud architecture that uses reconfigurable logic to accelerate both network plane functions and applications. This Configurable Cloud architecture places a layer of reconfigurable logic (FPGAs) between the network switches and the servers, enabling network flows to be programmably transformed at line rate...

Network planes have been done with FPGAs and ASICs for a long time. Microsoft is late to the game.

LargoLasskhyfv · on Oct 22, 2016

Maybe late rolling it out in production. Experimenting with it not so much. 2007 https://www.microsoft.com/en-us/research/project/emips/ Even open sourcing it in 2011 http://blog.netbsd.org/tnf/entry/support_for_microsoft_emips...

adrianratnapala · on Oct 22, 2016

Is this entirely about how Microsoft is using FPGAs for some of its own internal datacenter needs? In that case I can understand why Google might see ASICs as a better way to do the same thing.

Where I see FPGAs shining on the cloud is if vendors started putting them online so that clients could program them for their own applications. But I don't know enough about such things to know if there is yet a market for that.

youdontknowtho · on Oct 22, 2016

I think that it's only a matter of time until you see that from Amazon and Azure. Financial companies have been offloading hot paths in code to FPGA for about five years.

gumby · on Oct 22, 2016

I'm not convinced that FPGAs are effective in general purpose computation. FPGAs are quite expensive in terms of power, density and overall cost per function. It's not like this hasn't been tried many times before (Google even bought a company that used FPGAs for a reconfigurable network stack processor, but as far as I can see the product never made it).

webaholic · on Oct 21, 2016

I wish they would work out how to run the Chisel toolchain on those FPGAs. From what I understand, you currently have to program in Verilog, which is a huge pain.

Please enable Chisel and put it in your Azure cloud. It will be a great platform for custom software.

aseipp · on Oct 21, 2016

To be fair, this is a paper about FPGA use internally for Microsoft for their datacenters for select applications. You're probably never going to be exposed to the fabric yourselves, unless you have ungodly amounts of money to pay them behind closed doors, or just buy your own custom systems. Average individuals are no closer to these hybrid, re-programmable systems than they were before.

Anyway -- doesn't Chisel just generate Verilog, like basically every other alternative HDL does? What's the problem with just plugging that resulting code into your synthesis tool and going at it?

Granted, the "plugging" part is the annoying bit, since most EDA toolchains seem to be fundamentally hostile to the outside world. But it doesn't really seem like Chisel should be the limiting factor here, at all. (As an alternative case study -- it was very easy for me to integrate a 3rd party HDL (Clash) with tools like IceStorm, but it did require some automation boilerplate)

webaholic · on Oct 22, 2016

It does generate Verilog, but there is quite a lot of glue which you need to run that Verilog on an FPGA. And all this glue is proprietary (I heard some rumblings about open tools being pretty limited); you need a license to use the compilers which compile the Verilog to a bitstream.

aseipp · on Oct 22, 2016

Sure, but that's the truth no matter what EDA toolchain you're using and has nothing to do with Microsoft, or their FPGA deployments at all - all modern, high-end FPGAs are proprietary from top to bottom, and every vendor has different logic libraries and hardware capabilities. High end customers such as Microsoft very likely have their own completely custom silicon with custom chip layout -- It would be extremely costly to start developing and supporting such things at a "consumer level".

And things like Chisel and Clash - despite my affections - are still actively evolving research projects, and they only concern themselves with one part of the overall process. They still leave many things to be desired -- for example, I'm still working out the best ways to integrate Clash with the formal verification tools of Yosys, in a way that doesn't just require me to shim some Verilog code. That might require improving the language/libraries to allow the notion of things like SystemVerilog properties. I don't know what it would look like to integrate multi-clock verification with the type safe cross-domain clocks of Clash, etc. There's a lot of things I still can't do with these tools alone, and because they're moving targets, that means I generally have to pick up the slack. They can certainly offer many advantages in terms of time-to-iterate and time-to-fabricate, but they aren't magic wands.

Furthermore -- and I really don't mean to be flippant -- if you're the kind of group who would benefit from FPGAs being deployed to accelerate your workload, but you're blocked on things like gluing Chisel code to your EDA toolchain -- well, you're going to be in for a very rude awakening when you realize that's an incredibly tiny, easy part of the overall process.

webaholic · on Oct 22, 2016

There is no reason for Microsoft not to expose the fabric to customers of Azure. I am sure there are applications out there which can benefit from being able to synthesize their own code, just like Microsoft did for its Bing search traffic, etc.

aseipp · on Oct 22, 2016

Of course there are reasons not to expose it, what are you talking about? Support, reliability, isolation, debuggability, and platform control are a few reasons (it would be a goddamn nightmare to debug these things remotely, not to mention provide support for them through traditional channels -- and let's not even begin thinking about the security nightmare it could imply to do things like securely provision FPGAs on demand, or wipe them after the fact. Bare-metal cloud providers already have a lot to do here, because the threat model involves people who can do shit like flash your BIOS or attack your firmware).

If these aren't a challenge for you, and FPGA development would actually pay off economically -- you're already in a decidedly different market than the vast majority of Azure customers.

Even then, there's also no indication it would actually be profitable for them - FPGAs are actually quite niche in their applications and require extensive development cycles in order to utilize to their full potential, especially for complex tasks. Microsoft and Azure employees can justify this because of economies of scale as of right now. It's a labor intensive process -- there's no indication a vast majority of applications would find a good use case for developing their own fabric, one that would offset the costs.

What's more likely is that people like Microsoft and Google will start offering you services that are powered by this stuff under the hood, to give their platforms a competitive advantage. A better SQL server enhanced with custom logic is far more valuable to the vast majority of their customers than being able to program some custom fabric themselves, in today's world.

Microsoft already has custom fabricated Xeons with unlocked silicon for example, giving them private functionality to accelerate things like SDN workloads, and those Xeons power Azure -- why aren't you being given access to those? For the same reasons: those features only make sense to support, develop and utilize at a certain scale for the involved parties -- and it gives them competitive advantages when used to accelerate traditional workloads (that can only be realized with a substantial amount of work on part of the vendor, but is "free" for the customers.)

The overlap in the Venn diagram of "People who have a problem which FPGAs would solve" and "Customers who desire commodity cloud computing on Azure" is extremely small.

That could change one day, but it doesn't look like it's set to change anytime in the near future, especially given the landscape of EDA tooling, and the fundamentally different cost models associated with hardware vs software development.

webaholic · on Oct 22, 2016

And how is any of that different from providing GPUs in the cloud? I am not saying that they provide them for free. They can charge for all of that support and tools.

We haven't reached a stage where FPGAs make economic sense for small customers. GPUs were at the same stage 5 years ago. I hope Microsoft becomes a trailblazer in bringing FPGAs on par with GPUs. They have one of the best toolchain teams working for them, so it is doable.

tw04 · on Oct 22, 2016

A GPU doesn't have custom (potentially proprietary) logic programmed into it. It's the same reason we have shared CPUs but not FPGAs.

monocasa · on Oct 22, 2016

GPUs had their own MMUs five years ago for running user shaders.

wyager · on Oct 21, 2016

I've tried both Chisel and Clash. They are both huge improvements over Verilog, but I was more impressed with the latter. Clash converts run-of-the-mill non-recursive Haskell code into hardware, and does so fairly efficiently. It turns out there's a really really close mapping between lazy functional semantics and hardware semantics.

DigitalJack · on Oct 21, 2016

Can you recommend some resources on chisel? I've been an ASIC/FPGA designer for 20 years. We mostly do VHDL and SystemVerilog for design, and SystemVerilog for verification.

I've only heard about chisel briefly, used in reference to a risc design I think.

I've considered making a lisp like language for design on a few occasions. But maybe chisel solves the pain points?

aseipp · on Oct 21, 2016

Chisel is the language used to describe the RISC-V Rocket core, which is the current primary RISC-V implementation and reference (with real, synthesized boards coming out of it).

> I've considered making a lisp like language for design on a few occasions. But maybe chisel solves the pain points?

To be honest, I've only been doing FPGA design for a few weeks but, like: basically anything is better than VHDL/Verilog IMO. :)

Personally, I prefer CLaSH as opposed to Chisel. (Chisel is a DSL built using Scala - Clash is actually a compiler, not a DSL, from Haskell to Hardware. I think CLaSH conceptually is very nice.)

But honestly it really does seem to me that anything is a vast improvement over the current languages, for the most part. Your life is simply vastly improved by being able to do things like use a REPL in Scala/Haskell, or having cosimulation support in your toolchains (so you can run your designs as cycle-accurate software simulations, with no code changes, before doing synthesis).

Even then -- the intuition to use, or build something like Lisp isn't necessarily misplaced! Hardware description and functional programming share some interesting commonalities, it seems.

blackguardx · on Oct 22, 2016

The key to learning HDL is to understand that it isn't software. There are no instructions to a state machine. You are describing the functionality of the state machine itself.
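One way to picture that from the software side (a loose analogy in Python, not HDL): instead of writing a sequence of instructions, you write down the transition function of the machine, and the clock just applies it once per cycle. The names `edge_detector` and `simulate` are made up for this sketch.

```python
def edge_detector(prev_bit, bit):
    """Describe the machine itself: (state, input) -> (next_state, output).

    The state is the previous input bit; the output is 1 only on a 0 -> 1 edge.
    """
    rising = int(bit == 1 and prev_bit == 0)
    return bit, rising


def simulate(step, initial_state, inputs):
    """Apply the transition function once per clock cycle."""
    state = initial_state
    outputs = []
    for value in inputs:
        state, out = step(state, value)
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    print(simulate(edge_detector, 0, [0, 1, 1, 0, 1]))  # prints [0, 1, 0, 0, 1]
```

In real hardware the transition function becomes combinational logic and the state becomes a register; nothing "runs" the function, it simply exists as wiring.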

aseipp · on Oct 22, 2016

Not to sound rude, but what are you addressing with this statement? I've been doing functional programming for over a decade; I'm quite familiar with the concept of declarative programming, so circuits were not a big leap from this point (one of the many ways in which FP nicely dovetails with hardware design.)

My point wasn't really about state machines or whatever. Languages like Verilog and VHDL are absolutely terrible at things like abstraction capabilities. You don't have modules, you don't have a REPL, you don't have any kind of abstraction facilities almost at all, on top of the languages being ridiculously verbose. They sit in some bizarre uncanny valley where anything takes a surprising amount of code to write, yet is terrible in terms of reuse, modularity, understandability, etc.

Example: A few weeks ago I ran into a case where I couldn't easily wrap a cell library primitive by wrapping it in a Verilog module, and then re-use that module a few times. The primitive was a PLL on my board. Oops, can't do that -- because the synthesis toolchain disliked a module around the PLL primitive, as the primitive had some arguments which were required to be constants (despite the fact all call sites to the wrapping module had constant arguments). So I had to use a Verilog macro which generated a Verilog module, and I had to instantiate that macro like 10 times to 10 different modules, each one corresponding to a different clock speed, with no reuse outside of a Macro. Because some parameters had to be constant. If Verilog metaprogramming wasn't just a rip off of C macros, it might not be so terrible. But shit like this is just tedious.
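To illustrate the kind of workaround this pushes you toward, here is a sketch of doing the same stamping-out from a script instead of a Verilog macro: emit one wrapper module per clock speed, each with the constant baked in. `EXAMPLE_PLL` and its `FREQ_MHZ` parameter are hypothetical stand-ins, not a real vendor primitive, and the frequencies are arbitrary.

```python
def pll_wrapper(freq_mhz):
    """Return Verilog text for one wrapper module with a constant PLL parameter.

    EXAMPLE_PLL and FREQ_MHZ are hypothetical placeholders for a vendor primitive.
    """
    return (
        f"module pll_{freq_mhz}mhz(input wire clk_in, output wire clk_out);\n"
        f"  EXAMPLE_PLL #(.FREQ_MHZ({freq_mhz})) u_pll (\n"
        f"    .clk_in(clk_in),\n"
        f"    .clk_out(clk_out)\n"
        f"  );\n"
        f"endmodule\n"
    )


def generate_wrappers(freqs_mhz):
    """Stamp out one wrapper module per requested clock speed."""
    return "\n".join(pll_wrapper(f) for f in freqs_mhz)


if __name__ == "__main__":
    print(generate_wrappers([12, 48, 100]))
```

It works, but the point stands: the reuse lives in the generator, not in the language.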

I can't even do things like use higher order functions, or strong typing capabilities, or things like leverage a real module system (with boundary and abstraction capabilities) -- even when I am completely in control of their use and completely understand how they will synthesize. Tools like REPLs vastly improve time-to-iterate with an interactive environment. Cosimulation being a primary feature of these tools means I generally get fairly accurate software simulations directly from my HDL itself, independent of any EDA toolchain etc.

The languages are just bad languages. Can you get work done in them? Yes, and everybody does. Do they accurately reflect the fact you're declaratively describing a hardware circuit? Yes. That doesn't mean Verilog/VHDL, as languages, aren't objectively terrible.

I understand why this is. The cost models associated with these tools -- and process -- are fundamentally different (for hardware, it's not about "determinism" like software programmers care about -- but things like MTTF, lead time, and optimizing around that), and Verilog is really a small portion of the overall lifecycle of developing a chip (you still need post-synth testing, possibly do gate-level debugging, post-synth modifications, on-board testing, etc). And tools like Clash/Chisel are not perfect by any means, and I still have plenty of slack to pick up by myself, nonetheless.

The languages are still bad, despite all that.

blackguardx · on Oct 22, 2016

I think we are approaching this from opposite perspectives. I've been doing FPGA development for 10 years and am an EE. C was the first programming language that I learned over 20 years ago as a teenager. I've only very recently been learning high-level languages and associated abstractions.

A lot of the trouble with FPGA development in my opinion isn't the HDL, but the toolchains themselves. I haven't met anyone that likes FPGA toolchains. Things like PLL instantiation problems are likely to do with the tool and not the language.

A lot of people (I'm not saying this is you) coming from a software background tend to approach FPGAs as if it were a software problem. They seem very similar, after all. You write some code into a window, hit a button, and bam, you have some functionality. To some extent, I think it leads them to try to shove a square peg into a round hole. Just like there are programming paradigms governing software design, there are paradigms governing FPGA design as well. Trying to force certain abstractions can lead to headaches.

Abstractions are very important in any design process, be it analog hardware, digital hardware, software, bridge design, etc. Software tends to favor abstractions because they are powerful, but also very cheap. One can afford to waste some CPU cycles if it reduces development time and the occurrence of bugs. In FPGA and ASIC design, abstractions are just as important but one can't afford to waste clock speed or chip area to get it. FPGAs are expensive and ASICs even more so. The design paradigms of HDL exist for that reason, as frustrating as they may be.

In a sense, I've just grown accustomed to Verilog's warts. I was appalled at the FPGA development process when I first started but at this point I just take it for granted. I think things will get better but it will take time. Even so, I think that it is best to stick with Verilog/VHDL for the time being. Just like if one wanted to do web development, one would have to learn JavaScript, which is also considered terrible.

ZenoArrow · on Oct 22, 2016

> "Even so, I think that it is best to stick with Verilog/VHDL for the time being."

If you do it for your job, perhaps. However, if you're a hobbyist, that doesn't make as much sense. As a hobbyist you have the luxury of being able to use the tools you like the best. In any case, FPGA design shouldn't be closely coupled to any one language, they're just a tool that allows you to express the design you have in mind. If Clash or Chisel can help you express that design more succinctly and clearly than the competition, there's no need to stick to the other tools. The need only exists for professionals, as changing to languages with lower market share carries risk.

aseipp · on Oct 21, 2016

Also, just to answer your actual question: If you're going to use a language like Chisel or Clash, I'd probably recommend you at least have working familiarity with the languages they're based in -- e.g. Scala or Haskell (or Python for MyHDL, or whatever). Start there, learn some of the fundamentals, and you'll be more equipped to start describing actual hardware. If you've been doing VHDL/Verilog for 20 years, but haven't done FP/software, then some things will probably seem obvious, while others may be harder to initially understand (e.g. designs in Scala or Haskell are going to heavily leverage abstractions present in the base language, so familiarity with them helps).

Generally, these tools are going to expect some level of familiarity with the base language, and most of the material around them is going to be rooted in that assumption. So if you've never done FP, or Scala, or Haskell, it'll likely be more difficult. Although I don't think it's insurmountable.

I've been thinking recently it might be fun to write an article introducing Clash from a hardware design perspective, for people without prior Haskell/FP/etc experience. It certainly seems there isn't much material out there along these lines, IMO.

DigitalJack · on Oct 22, 2016

I've never done anything with Scala. I've learned some Haskell, and I liked what I learned.

Lisp seems to be where I've settled in my programming. Mostly Clojure, but I've been working through some Common Lisp as the JVM is a pain point for me, in spite of all the libraries and ecosystem it enables.

I think I will perhaps continue with my idea of a lisp based hardware design language, as Scala is very definitely not appealing to me. I will take a look at clash though as Haskell was very compelling.

Thanks

kevinnk · on Oct 22, 2016

I'll take the contrarian point of view and say that it's probably not worth it to invest a lot of time into Chisel at the moment. There's a lot of pain points around things outside of the "application logic" like clocking and synchronization. I've also heard that gate level debugging is a pain, but I haven't actually gotten that far in a Chisel project yet. The language definitely seems interesting but it's probably not ready for prime time just yet.

If you're interested in a higher level hardware language, I'd highly recommend MyHDL or PyMTL. Both are much closer to Verilog (for better or worse) and give you higher level semantics.

webaholic · on Oct 22, 2016

You do know that there are CPU cores out there which were written in chisel? It's true that it is still in the growing phase, but investing time to learn it will definitely pay off.

kevinnk · on Oct 22, 2016

Yep, I worked on one :) My personal belief is that time spent learning Chisel would probably be better spent learning MyHDL, but I will concede that I could be completely wrong; I just thought it might be worth it to add a counterweight to the pro-Chisel/Clash comments already posted.

blackguardx · on Oct 22, 2016

I'm curious as to why you think Verilog is a huge pain.

I would think that the toolchains would be the biggest complaint.

If one wanted to do web development, javascript knowledge is important. It is the same with FPGAs and HDL.

monocasa · on Oct 21, 2016

What? Chisel compiles to Verilog generally?

webaholic · on Oct 22, 2016

see above...

sargun · on Oct 22, 2016

Azure and Bing are two different departments.

webaholic · on Oct 22, 2016

They are moving towards homogeneous systems in all their data centers. Each and every system will have these fpgas.

LyalinDotCom · on Oct 22, 2016

Things will never be perfect, as it is a huge company, but thanks to Satya, teams are working together a lot more.