The biggest one, just to pick on one, is hipBLASLt: "a library that provides general matrix-matrix operations. It has a flexible API that extends functionalities beyond a traditional BLAS library, such as adding flexibility to matrix data layouts, input types, compute types, and algorithmic implementations and heuristics." https://github.com/ROCm/hipBLASLt
It's mostly GPU kernels that individually aren't so big, but they're built for every single operation x every single supported graphics architecture, e.g.:
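To get a feel for why the combinatorics add up, here's a rough sketch (the op variants, dtypes, and gfx targets below are illustrative examples, not hipBLASLt's actual build matrix):

```python
# Hypothetical illustration of the blow-up: each GEMM variant must be
# compiled separately for every compute type and every GPU architecture.
operations = ["gemm_nn", "gemm_nt", "gemm_tn", "gemm_tt"]  # transpose variants
dtypes = ["fp16", "bf16", "fp32", "fp8"]                   # example compute types
archs = ["gfx908", "gfx90a", "gfx942", "gfx1100"]          # example gfx targets

kernels = [f"{op}_{dt}_{arch}"
           for op in operations
           for dt in dtypes
           for arch in archs]
print(len(kernels))  # 4 ops x 4 dtypes x 4 archs = 64 distinct kernel binaries
```

Multiply that by tile-size and heuristic variants per problem shape and the installed footprint grows fast, even when each compiled kernel is small on its own.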