Combine Multiple AWS Instances into a 16-GPU Monster Machine (bitfusion.io)
140 points by Noughmad on March 31, 2016 | hide | past | favorite | 45 comments


I've found amazon GPU instances to be really expensive (even the spot prices have been high recently), especially if you need it for longer runs for deep learning. The other issue is that the additional layers of virtualization create bandwidth overhead issues.

I'd like to see something in the cloud that's bare-metal / full access to GPUs (maybe a good idea to start one). For scaling to a very large number of GPUs you'd need Infiniband, but at some point there is going to be a bandwidth tradeoff.

It would be interesting if someone could run some benchmarks of these instances versus a physical server.


I did the analysis about 9 months ago for my team when the 980 ti's came out, and the AWS pricing was expensive (we built a server that paid for itself in 2 weeks compared to a g2.8xlarge). This is largely because the 980 ti is actually a ridiculously good deal for price/performance.
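The payback math is easy to sanity-check. A rough sketch, with placeholder numbers chosen to roughly reproduce the two-week figure (the $2.60/hr rate was the approximate g2.8xlarge on-demand price at the time; the build cost is an assumption, not the parent's actual number):

```python
# Back-of-envelope payback for a self-built GPU box vs. renting a g2.8xlarge.
# Both numbers are illustrative assumptions, not quotes.

GPU_INSTANCE_PER_HOUR = 2.60   # assumed g2.8xlarge on-demand rate (USD)
BUILD_COST = 900.0             # assumed cost of the self-built server (USD)

hours_to_break_even = BUILD_COST / GPU_INSTANCE_PER_HOUR
days_to_break_even = hours_to_break_even / 24
print(f"Breaks even after ~{days_to_break_even:.0f} days of continuous use")
```

At anything close to continuous utilization, the box pays for itself in about two weeks.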

The bigger problem we ran into is that AWS instances use a lot of small GPUs, which don't scale well using a lot of deep neural network tools (e.g. theano). It was never really a viable option for us.


Somebody should make a startup that allows people to sell access to their computers by the minute. Like spot instances in the cloud ... in people's basements. The true sharing economy.


Why not just move forward to realizing some sort of distributed/decentralized internet?

Something like a combination of Freenet, TOR, BOINC, blockchain etc. technologies, using the current "legacy" internet as a backbone, where anyone can voluntarily offer their computing and storage resources to the network at varying levels of participation.

Say you could offer your laptop as a simple discovery/directory node to simply help others connect and find stuff, and your desktop as either a static-content serving node or as a computation node that can host distributed applications, like SETI@home or web apps like Facebook.

Maybe even reward cryptocurrency to those who offer the most resources.


There is Gridcoin which is a cryptocurrency that rewards for BOINC projects.

I wouldn't be surprised if certain services became more centralized, offloading computing to the cloud. Consumer grade devices would become thin clients (like back in the day). NVIDIA has hinted in this direction, I can remember something about GaaS (Gaming as a Service).


Various companies have tried this. What you end up with is 99% of your workforce is compromised machines, and law enforcement at your door every other day asking where checks are being mailed.


That sounds like a nightmare. You'd have zero uptime guarantees!


Would you?

What if you limited people to writing in a domain specific language: one that ran distributed on this infrastructure? How would that make it different than folding at home, for example?


Like Amazon spot instances. People follow habits, so you can make uptime predictions.


Does Amazon have any instance uptime guarantees?


Yes, >99.95% Monthly Uptime Percentage.

"“Monthly Uptime Percentage” is calculated by subtracting from 100% the percentage of minutes during the month in which Amazon EC2 or Amazon EBS, as applicable, was in the state of “Region Unavailable.” Monthly Uptime Percentage measurements exclude downtime resulting directly or indirectly from any Amazon EC2 SLA Exclusion (defined below)."

https://aws.amazon.com/ec2/sla/
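The quoted formula is easy to apply to a concrete month. A quick sketch (the 22-minute outage is an invented example):

```python
# The SLA formula quoted above: Monthly Uptime Percentage is 100% minus the
# percentage of minutes the region was in the "Region Unavailable" state.

minutes_in_month = 30 * 24 * 60      # 43,200 minutes in a 30-day month
unavailable_minutes = 22             # hypothetical regional outage

uptime_pct = 100.0 - 100.0 * unavailable_minutes / minutes_in_month
print(round(uptime_pct, 3))
```

So roughly 22 minutes of regional unavailability in a 30-day month is enough to dip below the 99.95% threshold.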


That's a regional outage, they provide no SLA for individual instances:

Amazon EC2 SLA Exclusions... (v) that result from failures of individual instances or volumes not attributable to Region Unavailability

Presumably if you're using a cloud hosted in people's basements, if someone's basement server dies, you'd just pick one from someone else's basement, so this model could provide better availability than AWS.


IO is the bottleneck there for most applications, but there are some massively distributed grid computing projects that run on BOINC, like SETI@home.


Looks like you can install their software on baremetal GPUs too: boost.bitfusion.io. Doesn't say if they have support for Infiniband though.


That's right, you can install on your own GPU servers. Infiniband+RDMA transport is also supported which typically doubles the number of GPUs you can scale to.

We're adding support for other clouds, particularly ones with higher-end GPUs so feedback like this is good to know.


Softlayer has something like it:

http://www.softlayer.com/gpu


That's some really cool tech. It seems like it's Linux only. Is there windows support planned? That would solve the problem with wanting to run code on the GPU within a Linux VM while the host is windows.


Windows support is coming in mid-April, stay tuned!


That's great! Reading the documentation it seems there is no support for multiple clients and multiple GPUs (Many-To-Many), is there anything planned on that side?


You can absolutely do that. That's actually one of the more interesting configurations: the ability to pool GPU systems.

Just use the custom link at the bottom of the page: https://console.aws.amazon.com/cloudformation/home?region=us...

There you can select any number of clients and servers. For example: 5 clients and 1 server (many to one), or 5 clients to 5 servers (many to many).


Nice. The doc at https://bitfusionio.readme.io/docs/bitfusion-boost is a bit misleading about the possible configurations. Maybe add a configuration with multiple Boost Clients (CPU) and many Boost Servers (GPU).


At first I thought this was the same problem as automatically breaking up apps to run in multiple cpus. This problem has been heavily researched with no success.

Is it the fact that GPU code already runs in parallel streams that makes this possible?


Yes, your app would have to support multiple GPUs. What's done here is remoting CUDA/OpenCL/etc. calls so that remote GPUs can be accessed from a single instance. When performing device/platform enumeration, all GPUs appear to be directly connected to a single instance -- hence no change to the application required.
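The remoting idea can be sketched in a few lines. All the names here (`GpuServer`, `RemotingRuntime`) are invented for illustration; the real product intercepts CUDA/OpenCL calls at the driver-API level rather than in Python:

```python
# Toy sketch of API remoting: calls like device enumeration are intercepted
# and fanned out to the attached GPU servers, so every remote device appears
# to be plugged into the local machine.

class GpuServer:
    """Stands in for a remote machine hosting some number of GPUs."""
    def __init__(self, host, num_gpus):
        self.host = host
        self.num_gpus = num_gpus

    def call(self, api):
        # A real implementation would execute the call on the remote machine
        # and ship the result back over the network (or RDMA).
        if api == "cudaGetDeviceCount":
            return self.num_gpus
        raise NotImplementedError(api)

class RemotingRuntime:
    """Client-side shim: aggregates devices across all attached servers."""
    def __init__(self, servers):
        self.servers = servers

    def cudaGetDeviceCount(self):
        # The app sees one flat pool of devices -- no code changes needed.
        return sum(s.call("cudaGetDeviceCount") for s in self.servers)

runtime = RemotingRuntime([GpuServer("10.0.0.1", 4), GpuServer("10.0.0.2", 4)])
print(runtime.cudaGetDeviceCount())  # prints 8: the app believes 8 GPUs are local
```

Because enumeration already reports the pooled devices, any application that was written to use multiple local GPUs scales across the remote ones unchanged.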


Sounds like Plan9's concept of "CPU server mounts" has been reborn as "GPU server mounts." Could actually get traction this time, given that existing multi-GPU programs will Just Work.


I can't wait for a company to provide OpenCL/CUDA MFLOPS as a service instead of giving you VMs as a whole, so one could just attach a remote engine to any smallish controller VM.


What you suggest is technically possible by installing our Boost software on any GPU machine, and then accessing that machine from any clients also running our Boost software. The client does not need to have a GPU. This configuration is supported in AWS today, where for example you can connect one or more t2.large instances to a g2.8xlarge. All that would have to be added is some metering on the GPU machine to implement the service you suggest :)

We are not limiting our software to AWS, so you can build this kind of service on any kind of cluster by installing our software directly from https://boost.bitfusion.io - I say cluster because we have played with the idea of thin devices accessing remote GPU instances in the cloud, but over public networks the network performance was a limiting factor.
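The metering layer mentioned above could be as simple as wrapping each remoted GPU call and charging the calling client for wall-clock time. This is an entirely hypothetical sketch, not a Boost API:

```python
# Hypothetical per-client metering for remoted GPU calls: wrap each call and
# accumulate the wall-clock seconds it consumed, keyed by client id.

import time
from collections import defaultdict

usage_seconds = defaultdict(float)  # client id -> accumulated GPU seconds

def metered(client_id, fn, *args, **kwargs):
    start = time.perf_counter()
    try:
        return fn(*args, **kwargs)
    finally:
        usage_seconds[client_id] += time.perf_counter() - start

# Example: bill client "t2-large-1" for a (fake) 10 ms kernel launch.
metered("t2-large-1", time.sleep, 0.01)
print(round(usage_seconds["t2-large-1"], 3))
```

Billing by the minute then reduces to summing each client's accumulated seconds over the billing period.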


Also, GPUs have little bandwidth toward main memory, and most CUDA patterns involve loading, computing, and retrieving, which works about as well remotely as locally. As opposed to programs accessing memory at random times all over their address space, GPU memory is neatly organized into texture areas, and the programming paradigm already entails moving data in and out of the device as few times as possible.
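That load/compute/retrieve pattern looks like the following sketch, with NumPy arrays standing in for device buffers (a real CUDA version would use e.g. pycuda's `to_gpu`/`get`). The point is that only the two copies cross the slow link, whether that link is PCIe or a network:

```python
# Load/compute/retrieve: one transfer in, all compute on the device, one
# transfer out. NumPy copies stand in for host<->device (or host<->remote-GPU)
# transfers in this illustration.

import numpy as np

def to_device(host_array):
    # Stand-in for the single host->device copy at the start.
    return host_array.copy()

def from_device(device_array):
    # Stand-in for the single device->host copy at the end.
    return device_array.copy()

x = np.arange(1_000_000, dtype=np.float32)
d_x = to_device(x)          # 1) load once
d_y = d_x * 2.0 + 1.0       # 2) compute entirely on the "device"
y = from_device(d_y)        # 3) retrieve once
print(float(y[0]), float(y[-1]))
```

Since the intermediate results never leave the device, adding network latency to the two endpoint transfers changes the total runtime very little for compute-heavy kernels.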


Are there benchmarks/code examples for the Monster Machines?


Yes! Whenever you spin up one of our AMIs, there is a README that will guide you through a couple of simple examples. We are about to publish performance results on the monster machines in a few days, so watch out for it. Scaling depends on the compute density of the GPU workload, but in general we've seen pretty good results with 1) Deep learning (caffe) scaling to 16 GPUs (near native scaling with local GPUs, especially deep nets), 2) Raytracing of photo-realistic and complex scenes - near linear scaling with increasing GPUs, and 3) Physical modeling and simulation does very well too.


Have you done any molecular dynamics benchmarks? If so, what kind/what system? I'd be very interested to see those.

If you haven't, I could probably contribute some strong and weak scaling testcases.


We've only done cursory evaluation of NAMD scaling. We saw a 7X improvement going from a non-GPU system to remote GPUs located in a different datacenter over shared 10g. We're not sure if that was with a representative dataset (MD is not our skill set), so if you can help us with a case study we'd be excited to work with you. Please do contact me.


I sent you a message with some more info and my email via the contact form at bitfusion.io


Do you have a support for spot instances?


Not yet, but it is on our roadmap. We have had several customers inquire about it. Drop us a note on our site and I will ping you when it becomes available.


Congrats guys this is a really neat hack, really impressive.

Did you guys think about building it further out to provide a GPU load balancer for multiple frontend machines running Cuda / OpenCL?


Hmm, can you elaborate? Do you mean having multiple smaller instances talk to a single GPU instance?


I'm talking about time-sharing. It doesn't matter if it's smaller instances sharing a single GPU instance or many instances sharing many GPU instances. Essentially N:M sharing (with some scheduling).

Since the GPU client is now abstracted from the GPU devices by placing the GPUs across the network, it seems like time-sharing should be the next logical step.


Got it, this is actually already supported. At the very end of the blog post there is a link to create a custom configuration. You can create any N:M configuration, that is, any number of clients to servers, and therefore any level of performance scaling or GPU pooling.

Check it out: https://console.aws.amazon.com/cloudformation/home?region=us...


Great, thanks for clearing that up.


This is really cool; publishing an AMI seems like such a good win for you guys; configuration is done, you get paid as customers use it.

Hopefully you'll see some good uptake.


It goes nicely with the "supercomputing to the masses" mission. Especially when the alternative is buying lots of machines and installing all the required software manually.


Congrats Bitfusion Team. This is really exciting!


You can do this 10x cheaper at home


Congratulations it's awesome to see the AMIs published.


When will Bitfusion be available for Google and Microsoft?



