A couple of 5060s and a couple of 3060s. They are wired via PCIe risers to an older mobo with an AMD CPU. (I wanted to avoid long 3-fan cards.) It looks like a mining rig, but with thicker PCIe risers. Many LLM tools easily leverage multiple GPUs. Sucks 800 W at full load, idles below 50 W.
Would you please share a link to your chassis and risers? I have the PCIe lanes, but I have not yet found a reasonable way to attach more than 3 GPUs directly to a host, in terms of both physical space and power requirements. External PCIe switch cases are not reasonably available to mortals :/
Just search Amazon for "pci4 riser" and you can get them up to a foot long. Any sort of mining frame will do. Power is a bigger issue. Running multiple power supplies is something I know about but have not done personally, nor do I want to. I'm happy keeping everything on one circuit.
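For anyone wondering whether one circuit is enough: a quick sanity check, assuming a US 120 V / 15 A circuit and the usual 80% continuous-load derating (both assumptions; check your own panel):

```python
# Rough check that one rig fits on a single household circuit.
circuit_volts = 120          # assumed US residential voltage
breaker_amps = 15            # assumed breaker rating
continuous_derating = 0.8    # common 80% rule for continuous loads

budget_watts = circuit_volts * breaker_amps * continuous_derating  # 1440 W
rig_full_load_watts = 800    # figure quoted upthread

print(f"budget {budget_watts:.0f} W, rig {rig_full_load_watts} W, "
      f"headroom {budget_watts - rig_full_load_watts:.0f} W")
```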
If you have enough PCIe slots, or use risers, you can put them all in one system.
llama.cpp will let you run inference remotely across different systems, but I suspect the latency would make it not worthwhile. If you already have three systems, it would only cost you a few minutes to test.
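A minimal sketch of such a test, assuming llama.cpp was built with the RPC backend (-DGGML_RPC=ON); the worker addresses, port, and model path below are placeholders:

```python
import subprocess

# Placeholder worker endpoints -- on each remote box, run: ./rpc-server -p 50052
workers = ["192.168.1.11:50052", "192.168.1.12:50052"]
model = "models/your-model.gguf"   # hypothetical path

# Point the local llama-cli at the remote rpc-server instances and time a
# short generation to see whether the network latency is tolerable.
subprocess.run([
    "./llama-cli",
    "-m", model,
    "--rpc", ",".join(workers),    # comma-separated rpc-server endpoints
    "-ngl", "99",                  # offload all layers to the pooled GPUs
    "-n", "64",                    # generate a few tokens, just to benchmark
    "-p", "Hello",
], check=True)
```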
With multiple cards in normal PCI Express slots, the LLM's layers are split across the cards. When you run inference, the first card runs its layers, then hands the result to the next card, and so on for as many cards as you want.
Only the activations get copied between the cards, which is on the order of 10 MB/s at runtime, so PCIe width or generation is irrelevant. Even PCIe 1.0 x1 would be sufficient.
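Back-of-the-envelope, with assumed numbers (a 4096-wide hidden state in fp16, ~30 tokens/s):

```python
hidden_size = 4096          # assumed: Llama-7B-class hidden dimension
bytes_per_value = 2         # fp16 activations
tokens_per_second = 30      # assumed generation speed

# One hidden-state vector crosses each card boundary per generated token.
per_token = hidden_size * bytes_per_value              # 8192 bytes
per_second = per_token * tokens_per_second             # ~0.25 MB/s

print(f"{per_token} B/token, {per_second / 1e6:.2f} MB/s per split point")
# Prompt processing batches many tokens at once, so peaks are higher (that is
# where figures like ~10 MB/s come from), but even PCIe 1.0 x1 (~250 MB/s)
# has orders of magnitude more headroom.
```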
There are other software optimisations (row split, tensor parallelism) which want fast interconnects like NVLink, but you can get a long way without any of that.
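For reference, here is roughly how the split mode gets picked with the llama-cpp-python bindings; the model path and 50/50 split are assumptions, and the constant names vary a little between versions:

```python
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="models/your-model.gguf",             # hypothetical path
    n_gpu_layers=-1,                                 # offload every layer
    split_mode=llama_cpp.LLAMA_SPLIT_MODE_LAYER,     # layers split across cards
    tensor_split=[0.5, 0.5],                         # fraction of work per GPU
)
out = llm("Q: Why is the sky blue? A:", max_tokens=32)
print(out["choices"][0]["text"])

# Row split (split_mode=llama_cpp.LLAMA_SPLIT_MODE_ROW) spreads each weight
# matrix across the cards instead; it moves far more data per token, which
# is where NVLink-class interconnects start to matter.
```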