Use pytorch2+cu118 with ADA hardware for 50%+ speedup (gpux.ai)
141 points by vans554 on July 19, 2023 | 35 comments


I stumbled upon this by accident and did not expect such a speedup. It seems anything less than cu118 does not properly support the RTX 4090 (or H100).

Bumping to cuda12.2 with pytorch2.0.1+cu118 made my SDXL go 50% faster and ESRGAN 80% faster on the 4090.
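
If you want to confirm which flavor you're actually running, a quick sanity check (the values in the comments are just what I'd expect on this setup):

  import torch
  # the +cuXXX suffix tells you which CUDA toolkit the wheel was built against
  print(torch.__version__)              # e.g. 2.0.1+cu118
  print(torch.version.cuda)             # e.g. 11.8
  print(torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 4090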


You can also run PyTorch cu121 nightly builds.

These also allow `torch.compile` to function properly with dynamic input, which should net another 30%+ boost to SD.
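
Roughly what that looks like (a minimal sketch with a toy module, not an actual SD pipeline):

  import torch
  import torch.nn as nn

  model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).cuda()
  # dynamic=True traces with symbolic shapes, so changing the input
  # resolution between calls doesn't force a recompile
  compiled = torch.compile(model, dynamic=True)
  for size in (512, 768, 1024):
      out = compiled(torch.randn(1, 3, size, size, device="cuda"))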


Is there a trick to getting pytorch+cu121 and xformers to play nicely together? All the xformers packages I can find are torch==2.0.1+cu118.

Edit: After a bit more research, it looks like scaled dot product attention in PyTorch 2 provides much the same benefit as xformers without the need for xformers proper. Nice.
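
For anyone curious, the built-in is a one-liner (the shapes here are just illustrative):

  import torch
  import torch.nn.functional as F

  q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
  k, v = torch.randn_like(q), torch.randn_like(q)
  # dispatches to a fused FlashAttention-style kernel when the hardware
  # and dtype allow it, which is where the xformers-like speedup comes from
  out = F.scaled_dot_product_attention(q, k, v)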


xformers has to match the PyTorch build. For PyTorch nightly, you need to build from source.

xformers still has a tiny performance benefit (especially at higher resolutions IIRC), but yeah, PyTorch's SDP is good.
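
The build-from-source route is something like this (the command from the xformers README; it compiles against whatever torch you have installed, so expect it to take a while):

  pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers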


This comment brings a tear to my eye.


The underlying problem is the community's decision to make users manage this in the first place.

This is an example of a setup.py that correctly installs the accelerated PyTorch for your platform:

https://github.com/comfyanonymous/ComfyUI/blob/9aeaac4af5e19...

As you can see, it was never merged, for philosophical reasons I believe. The author wanted to merge it earlier but changed his mind.

Like why make end users deal with this at all? The ROI from a layperson choosing these details is very low.

Python has a packaging problem; this is well known. Fixing setuptools would be the highest-yield fix. Other package tooling can't install PyTorch either, for example: https://github.com/python-poetry/poetry/issues/6409#issuecom....

PyTorch itself is wonkily packaged. But I'm sure they have a good reason for this. Anyway, it goes to show that you can put a huge amount of effort into fixing this particular problem, one that everyone touching this technology has, and maintainers everywhere will still go nowhere with it. And I don't think this is a "me" problem, because there is so much demand for packaging PyTorch correctly: all the easy UIs, etc.
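
For flavor, the rough shape of what that setup.py does (a hypothetical sketch written from memory, not the actual PR code):

  # hypothetical sketch: pick the right torch wheel index for the platform
  import platform
  import subprocess
  import sys

  def install_torch():
      if platform.system() == "Darwin":
          index = "https://download.pytorch.org/whl/cpu"   # macOS has no CUDA wheels
      else:
          index = "https://download.pytorch.org/whl/cu118"  # assumes an NVIDIA GPU
      subprocess.check_call([sys.executable, "-m", "pip", "install",
                             "torch==2.0.1", "--index-url", index])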


> But I'm sure they have a good reason for this.

CUDA and ROCm make this an intractable problem. There is basically no way to sanely package everything users need; the absolutely enormous, CUDA/ROCm-versioned PyTorch packages with missing libs are already a compromise.

TBH, the whole ecosystem was never really meant for end-user inference anyway.


Sorry, no idea what you are talking about.

I am talking about dynamic shapes in torch.compile.

You seem to be talking about software packaging. You also make heavy use of the word "this" without it being clear what "this" is.


The two most popular Stable Diffusion UIs (automatic1111 and comfy) have a few longstanding, known but poorly documented bugs, like this Ada performance issue.

For instance, the torch.compile thing we are talking about is (last I checked) totally irrelevant for those UIs, because they are still using the Stability AI implementation rather than the Hugging Face diffusers package, which is checked for graph breaks. This may extend to SDXL.


Pretty interesting. Using nightly + cu121 I'm getting 8.18 it/s, another 5% improvement vs. the 7.78 it/s that cu118 gave.


This was one of the reasons I skipped the 4090.

So few people have the technology that I knew I'd be spending significant time figuring out solutions to problems.

The other reason is that I'd rather wait a few years and get some 6090 with 4x the VRAM.


I doubt future generations will push consumer-market cards past 24 GB.

They know it's a bottleneck for LM training and inference, so they'll want to extract value by reserving it for the professional-line cards.


Good find!


Surprised people don't know about this, as it has been common knowledge in the SD community [1] since October last year. Strictly speaking, you don't even need CUDA 11.8+ to get the speedup; it's sufficient to use cuDNN 8.6+, though you should use the newest versions for other reasons.

[1]: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issu...
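
An easy way to check what your install is actually using (the 8600 threshold is my reading of the cuDNN 8.6 requirement above):

  import torch
  # returns the cuDNN version as an int, e.g. 8700 for 8.7.0;
  # 8600+ (cuDNN 8.6) is what unlocks the Ada speedup
  print(torch.backends.cudnn.version())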


PyTorch has been listing this install option for months; just click the "CUDA 11.8" button:

https://pytorch.org/get-started/locally/
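
At the time of writing, that button generates something like the following (double-check against the page):

  pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118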


Yes, but 11.7 has been the "stable" release: https://github.com/pytorch/pytorch/blob/main/RELEASE.md#rele...


Or, if you are using Nix, use https://github.com/nixvital/ml-pkgs and put

  outputs = { self, nixpkgs, ... }@inputs: {
    overlays = {
      dev = nixpkgs.lib.composeManyExtensions [
        inputs.ml-pkgs.overlays.torch-family
        # Add some other overlays
      ];
    };
  };
in your flake.nix, and you can use PyTorch 2.0.1 compiled with CUDA 11.8 on your 4090s. The downside is that the first time you will have to compile it, which can take quite a while.


Always cool to see :)

If you build from source, it should be even faster than the release builds, if only because we keep landing fixes and speedups regularly.
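
The standard recipe, roughly as documented in the PyTorch README:

  git clone --recursive https://github.com/pytorch/pytorch
  cd pytorch
  pip install -r requirements.txt
  python setup.py develop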

If anyone tries this and runs into bugs or issues, feel free to respond here and I can take a look.


I can confirm that it's true on RTX 4080 on Ubuntu 22.04 LTS.


Can the same speedup be obtained on a 3090?


Oh man, I deal with CUDA version nuances all the time. ML dependency management in particular is always extra fun. Between all the different CUDA, cuDNN, and NCCL versions, plus the versions of TF frameworks, numpy dependencies, etc., it can quickly become a mess.

We've started really investing in a better solution. It's always interesting to see just how big a difference getting the right CUDA version for a given build of, e.g., torch makes.


Anytime I want to try out some ML stuff, I run into this driver and app version hell. Have you given Docker a try? I'm considering going this route for portability reasons but don't know if it will actually help.
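
The setup I'm imagining is something like this (the image tag is my guess at one of the official pytorch/pytorch tags, so check Docker Hub; it also requires the NVIDIA container toolkit on the host):

  docker run --gpus all -it --rm pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime \
    python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"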


Docker comes with its own problems (especially if you need to mutate the image under the hood), and multi-arch quickly becomes a pain. It's also much harder to make it truly portable.

We're rolling our own solution around conda + conda-pack and plan to contribute it upstream. Look out for a blog post on HN later this year :)


Docker is good, but I also haven't had any issues with conda. At least for most ML projects.


Does it apply to Windows?


I feel this pain. Package management in Python has improved slightly over the years, but it's still not ideal. Add Nvidia to the mix, and it gets even worse.


Wow, if those benchmarks are true, that is amazing to read.


It's true. I've been installing nightly builds of PyTorch for months specifically to access this fix. I have been getting 40 it/s outputting a 512x512 image on my 4090; prior to the fix I would get around 19 it/s.


Why am I only getting 3 it/s with a 3090?

Am I doing something heavily wrong? This is all through WSL2.


it/s depends on resolution and other factors like batch size. What are you getting for a 512x512 image?


Also the sampler and a bunch of other parameters.


Fair, 12.3 it/s. My numbers are with the dev branch and 1024x1024 with the XL model.


Yeah, that'll do it. 3 it/s sounds normal then.


ELI5?


Using a newer CUDA version with supported hardware and software boosts performance.



