Use pytorch2+cu118 with ADA hardware for 50%+ speedup (gpux.ai)
141 points by vans554 on July 19, 2023 | 35 comments


I stumbled upon this by accident and did not expect such a speedup. It seems anything less than cu118 does not properly support the RTX 4090 (or H100).

Bumping to cuda12.2 with pytorch2.0.1+cu118 made my SDXL go 50% faster and ESRGAN 80% faster on the 4090.
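
If you want to confirm which flavor you're actually running, a quick sanity check (the values in the comments are just what I'd expect on this setup):

  import torch
  # the +cuXXX suffix tells you which CUDA toolkit the wheel was built against
  print(torch.__version__)              # e.g. 2.0.1+cu118
  print(torch.version.cuda)             # e.g. 11.8
  print(torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 4090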


You can also run PyTorch cu121 nightly builds.

These also allow `torch.compile` to function properly with dynamic input, which should net another 30%+ boost to SD.
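
Roughly what that looks like (a minimal sketch with a toy module, not an actual SD pipeline):

  import torch
  import torch.nn as nn

  model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).cuda()
  # dynamic=True traces with symbolic shapes, so changing the input
  # resolution between calls doesn't force a recompile
  compiled = torch.compile(model, dynamic=True)
  for size in (512, 768, 1024):
      out = compiled(torch.randn(1, 3, size, size, device="cuda"))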


Is there a trick to getting pytorch+cu121 and xformers to play nicely together? All the xformers packages I can find are torch==2.0.1+cu118.

Edit: After a bit more research, it looks like scaled dot product attention in PyTorch 2 provides much the same benefit as xformers without the need for xformers proper. Nice.
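
For anyone curious, the built-in is a one-liner (the shapes here are just illustrative):

  import torch
  import torch.nn.functional as F

  q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
  k, v = torch.randn_like(q), torch.randn_like(q)
  # dispatches to a fused FlashAttention-style kernel when the hardware
  # and dtype allow it, which is where the xformers-like speedup comes from
  out = F.scaled_dot_product_attention(q, k, v)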


xformers has to match the PyTorch build. For PyTorch nightly, you need to build from source.

xformers still has a tiny performance benefit (especially at higher resolutions IIRC), but yeah, PyTorch's SDP is good.
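
The build-from-source route is something like this (the command from the xformers README; it compiles against whatever torch you have installed, so expect it to take a while):

  pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers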


This comment brings a tear to my eye.


The underlying problem is the community's decision to make users manage this in the first place.

This is an example of a setup.py that correctly installs the accelerated PyTorch for your platform:

https://github.com/comfyanonymous/ComfyUI/blob/9aeaac4af5e19...

As you can see, it was never merged, for philosophical reasons I believe. The author wanted to merge it earlier but changed his mind.

Like why make end users deal with this at all? The ROI from a layperson choosing these details is very low.

Python has a packaging problem; this is well known. Fixing setuptools would be the highest-yield fix. Other package tooling can't install PyTorch either, for example: https://github.com/python-poetry/poetry/issues/6409#issuecom....

PyTorch itself is wonkily packaged. But I'm sure they have a good reason for this. Anyway, it goes to show that you can put a huge amount of effort into fixing this particular problem, one that everyone touching this technology has, and maintainers everywhere will still go nowhere with it. And I don't think this is a "me" problem, because there is so much demand for packaging PyTorch correctly: all the easy UIs, etc.
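
For flavor, the rough shape of what that setup.py does (a hypothetical sketch written from memory, not the actual PR code):

  # hypothetical sketch: pick the right torch wheel index for the platform
  import platform
  import subprocess
  import sys

  def install_torch():
      if platform.system() == "Darwin":
          index = "https://download.pytorch.org/whl/cpu"   # macOS has no CUDA wheels
      else:
          index = "https://download.pytorch.org/whl/cu118"  # assumes an NVIDIA GPU
      subprocess.check_call([sys.executable, "-m", "pip", "install",
                             "torch==2.0.1", "--index-url", index])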


> But I'm sure they have a good reason for this.

CUDA and ROCm make this an intractable problem. There is basically no way to sanely package everything users need; the absolutely enormous, CUDA/ROCm-versioned PyTorch packages with missing libs are already a compromise.

TBH, the whole ecosystem was never really meant for end-user inference anyway.


Sorry, no idea what you are talking about.

I am talking about dynamic shapes in torch.compile.

You seem to be talking about software packaging. You also make heavy use of the word "this" without it being clear what "this" is.


The two most popular Stable Diffusion UIs (automatic1111 and comfy) have a few longstanding, known but poorly documented bugs, like this Ada performance issue.

For instance, the torch.compile thing we are talking about is (last I checked) totally irrelevant for those UIs, because they are still using the Stability AI implementation rather than the Hugging Face diffusers package, which is checked for graph breaks. This may extend to SDXL.


Pretty interesting. Using nightly + cu121 I'm getting 8.18 it/s, another 5% improvement vs. the 7.78 it/s that cu118 gave.


This was one of the reasons I skipped the 4090.

So few people have the technology that I knew I'd be spending significant time figuring out solutions to problems.

The other reason is that I'd rather wait a few years and get some 6090 with 4x the VRAM.


I doubt future generations will push consumer-market cards past 24 GB.

They know it's a bottleneck for LM training and inference, so they'll want to extract value by reserving it for the professional-line cards.


Good find!


Surprised people don't know about this, as it has been common knowledge in the SD community [1] since October last year. Strictly speaking, you don't even need CUDA 11.8+ to get the speedup; it's sufficient to use cuDNN 8.6+, though you should use the newest versions for other reasons.

[1]: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issu...
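
An easy way to check what your install is actually using (the 8600 threshold is my reading of the cuDNN 8.6 requirement above):

  import torch
  # returns the cuDNN version as an int, e.g. 8700 for 8.7.0;
  # 8600+ (cuDNN 8.6) is what unlocks the Ada speedup
  print(torch.backends.cudnn.version())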


PyTorch has been listing this install option for months; just click the "CUDA 11.8" button:

https://pytorch.org/get-started/locally/
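
At the time of writing, that button generates something like the following (double-check against the page):

  pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118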


Yes, but 11.7 has been the "stable" release: https://github.com/pytorch/pytorch/blob/main/RELEASE.md#rele...


Or, if you are using Nix, use https://github.com/nixvital/ml-pkgs and put

  outputs = { self, nixpkgs, ... }@inputs: {
    overlays = {
      dev = nixpkgs.lib.composeManyExtensions [
        inputs.ml-pkgs.overlays.torch-family
        # Add some other overlays
      ];
    };
  };
in your flake.nix, and you can use PyTorch 2.0.1 compiled with CUDA 11.8 on your 4090s. The downside is that the first time you will have to compile it, which can take quite a while.


Always cool to see :)

If you build from source, it should be even faster than the release builds, if only because we keep landing fixes and speedups regularly.
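
The standard recipe, roughly as documented in the PyTorch README:

  git clone --recursive https://github.com/pytorch/pytorch
  cd pytorch
  pip install -r requirements.txt
  python setup.py develop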

If anyone tries this and runs into bugs or issues, feel free to respond here and I can take a look.


I can confirm that it's true on RTX 4080 on Ubuntu 22.04 LTS.


Can the same speedup be obtained on a 3090?


Oh man, I deal with CUDA version nuances all the time. ML dependency management in particular is always extra fun. Between all the different CUDA, cuDNN, and NCCL versions, plus the versions of TF frameworks, numpy dependencies, etc., it can quickly become a mess.

We've started really investing in a better solution. It's always interesting to see just how big a difference getting the right CUDA version for a given build of, e.g., torch makes.


Anytime I want to try out some ML stuff, I run into this driver and app version hell. Have you given Docker a try? I'm considering going this route for portability reasons but don't know if it will actually help.
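
The setup I'm imagining is something like this (the image tag is my guess at one of the official pytorch/pytorch tags, so check Docker Hub; it also requires the NVIDIA container toolkit on the host):

  docker run --gpus all -it --rm pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime \
    python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"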


Docker comes with its own problems (especially if you need to mutate the image under the hood), and multi-arch quickly becomes a pain. It's also much harder to make it truly portable.

We're rolling our own solution around conda + conda-pack and plan to contribute it upstream. Look out for a blog post on HN later this year :)


Docker is good, but I also haven't had any issues with conda. At least for most ML projects.


Does it apply to Windows?


I feel this pain. Package management in Python has improved slightly over the years, but it's still not ideal. Add Nvidia to the mix, and it gets even worse.


Wow, if those benchmarks are true, that is amazing to read.


It's true. I've been installing nightly builds of PyTorch for months specifically to access this fix. I have been getting 40 it/s outputting a 512x512 image on my 4090; prior to the fix I would get around 19 it/s.


Why am I only getting 3 it/s with a 3090?

Am I doing something heavily wrong? This is all through WSL2.


it/s depends on resolution and other factors like batch size. What are you getting for a 512x512 image?


Also the sampler and a bunch of other parameters.


Fair, 12.3 it/s. My numbers are with the dev branch and 1024x1024 with the XL model.


Yeah, that'll do it. 3 it/s sounds normal then.


ELI5?


Using a newer CUDA version with supported hardware and software boosts performance.



