I stumbled upon this by accident and did not expect such a speedup. It seems anything less than cu118 does not properly support the RTX 4090 (or H100).
Bumping to cuda12.2 with pytorch2.0.1+cu118 made my SDXL go 50% faster and ESRGAN 80% faster on the 4090.
Is there a trick to getting pytorch+cu121 and xformers to play nicely together? All the xformers packages I can find are built against torch==2.0.1+cu118.
Edit: After a bit more research it looks like scaled dot product attention in Pytorch 2 provides much the same benefit as xformers without the need for xformers proper. Nice.
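For reference, the built-in replacement looks like this. A minimal sketch using torch.nn.functional.scaled_dot_product_attention (available since PyTorch 2.0); the tensor shapes are toy values for illustration only:

```python
import torch
import torch.nn.functional as F

# Toy shapes for illustration: (batch, heads, seq_len, head_dim).
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 8, 16, 64)
v = torch.randn(1, 8, 16, 64)

# PyTorch dispatches to the fastest backend available on your hardware
# (FlashAttention, memory-efficient attention, or a plain math fallback),
# which covers what xformers' memory_efficient_attention used to provide.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```

Libraries like diffusers call this automatically on PyTorch 2.x, so no xformers install is needed for the attention speedup.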
PyTorch itself is wonkily packaged, though I'm sure they have a good reason for this. Anyway, it goes to show that you can put a huge amount of effort into fixing this particular problem, one that everyone touching this technology has, and the maintainers everywhere will still get nowhere with it. And I don't think this is a "me" problem, because there is so much demand for packaging PyTorch correctly - all the easy UIs, etc.
CUDA and ROCm make this an intractable problem. There is basically no way to sanely package everything users need, and the absolutely enormous, CUDA/ROCm-versioned PyTorch packages with missing libs are already a compromise.
TBH the whole ecosystem is not meant to be for end user inference anyway.
The two most popular stable diffusion UIs (automatic1111 and comfy) have longstanding issues with a few known but poorly documented bugs, like the ADA performance issue.
For instance, the torch.compile speedup we are talking about is (last I checked) totally irrelevant for those UIs, because they still use the Stability AI implementation rather than Hugging Face's diffusers package, which is the one checked for graph breaks. This may extend to SDXL.
Surprised people don't know about this, as it has been common knowledge in the SD community [1] since October last year. Strictly speaking you don't even need CUDA 11.8+ to get the speedup; it's sufficient to use cuDNN 8.6+, though you should use the newest versions anyway for other reasons.
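A quick way to check which CUDA toolkit and cuDNN your installed torch wheel was actually built against (the version numbers in the comments are examples, not requirements):

```python
import torch

# What the wheel was built with, not what the system driver supports.
print(torch.__version__)               # e.g. 2.0.1+cu118
print(torch.version.cuda)              # CUDA toolkit of the wheel, e.g. 11.8
print(torch.backends.cudnn.version())  # e.g. 8700 -> cuDNN 8.7; None on CPU-only builds
```

If the last line prints something below 8600, you are on a cuDNN older than 8.6 and will not see the Ada speedup.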
  outputs = { self, nixpkgs, ... }@inputs: {
    overlays = {
      dev = nixpkgs.lib.composeManyExtensions [
        inputs.ml-pkgs.overlays.torch-family
        # Add some other overlays
      ];
    };
  };
in your flake.nix, and you can use PyTorch 2.0.1 compiled with CUDA 11.8 running on your 4090s. The downside is that the first time around you will have to compile it yourself, which can take quite a while.
Oh man, I deal with CUDA version nuances all the time. ML dependency management in particular is always extra fun. Between all the different CUDA, cuDNN, and NCCL versions, the framework versions (TF, etc.), numpy dependencies, and so on, it can quickly become a mess.
We've started really investing in a better solution. It's always interesting to see just how big a difference getting the right CUDA version for a given build of, e.g., torch makes.
Anytime I want to try out some ML stuff I run into this driver and app version hell. Have you given Docker a try?
I'm considering going this route for portability reasons but don't know if it will actually help.
Docker comes with its own problems (especially if you need to mutate the image under the hood), and multi-arch support quickly becomes a pain. It's also much harder to make it truly portable.
We're rolling our own solution around conda + conda-pack and plan to contribute it upstream. Look out for a blog post on HN later this year :)
I feel this pain. Package management on Python has slightly improved over the years, but it’s still not ideal. Add Nvidia to the mix, and it gets even worse.
It's true. I've been installing nightly builds of PyTorch for months specifically to get this fix. I've been getting 40 it/s generating a 512x512 image on my 4090; prior to the fix I would get around 19 it/s.