The unified memory on the Apple M1 Macs, which goes up to 64 GB, is really quite intriguing. The other day I managed to create a 32 GB model in PyTorch, and the machine handled it with native GPU acceleration. That's larger than any other GPU memory I have access to. Curious whether this actually makes such machines an interesting target for ML developers, or not.
The concept is solid, but Apple needs to work on performance to be relevant in this field. If they add dedicated matmul capabilities to their GPUs and implement native limited-precision support, their ML training performance could improve by 4-6x, which would instantly make Apple Silicon much more attractive in this domain. The software stack and programmability need some improvements as well. For example, a unified virtual address space and improvements in CPU/GPU communication would be welcome additions (contrary to intuition, the latency of CPU/GPU transfers is higher on M1 than on many dGPUs, because it can take a very long time for a GPU program to be scheduled).
What do you mean by this? They do support integers from 8 bits and up and natively support 16-bit floats. Are you referring to something else, like 8-bit floats?
I mean actually executing the operations at that precision with improved performance. Apple GPUs support both FP16 and FP32 as data types, but the ALU throughput for both is identical (my personal speculation is that the ALUs are 32-bit only and the rest is data-type conversion). From an operational standpoint, an Apple G13 SIMD can only do 32 flops per cycle, not more and not less.
But other GPUs can execute operations on limited-precision data types faster. And Nvidia has dedicated matrix multiplication units that can perform very wide limited-precision operations per cycle (Apple has similar units, but they are part of the CPU clusters).
Since the A15/M2, Apple has offered a SIMD matrix multiplication intrinsic (very similar to VK_NV_cooperative_matrix). But the performance is limited by the fact that each SIMD only offers 32 ALUs. If they added the ability to reconfigure these as 64 FP16 ALUs (or 128 FP8 ALUs), and then maybe even doubled the ALUs like Nvidia/AMD recently did with their architectures, they could achieve much higher matmul performance for ML.
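To put rough numbers on that, here's a back-of-the-envelope sketch. The configuration and clock below are commonly reported public figures for the base M1 GPU, not anything from this thread or from Apple's spec sheets, so treat them as assumptions:

```python
# Assumed M1 GPU configuration: 8 cores, 4 SIMDs of 32 FP32 ALUs
# per core (1024 ALUs total), ~1.28 GHz clock, FMA counted as 2 flops.
alus = 8 * 4 * 32
clock_hz = 1.28e9
fp32_tflops = alus * 2 * clock_hz / 1e12
print(f"FP32: ~{fp32_tflops:.1f} TFLOPS")  # ~2.6 TFLOPS

# Hypothetical: each 32-wide SIMD reconfigured to run 64 FP16 ops
# per cycle, with the ALU count doubled on top of that -> ~4x.
print(f"FP16 + 2x ALUs: ~{fp32_tflops * 4:.1f} TFLOPS")
```

That ~4x on paper is roughly where the "4-6x" figure upthread lands, before counting any dedicated matmul units.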
I have very limited knowledge of these things, but I did compare matmul in PyTorch with and without the GPU, and there is a dramatic improvement. So even if it's not fully optimised yet, it's still a huge bonus to have this available. If it could be improved another 4-6x, that would be stupendous.
(For context: I'm observing that an 8000x8000 matmul takes ~1 s on the CPU and ~5 ms on the GPU.)
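One caveat when timing this: PyTorch dispatches MPS work asynchronously, so a naive timer can measure the launch rather than the compute, which can make GPU numbers look better than they are. A minimal benchmark sketch, assuming a PyTorch build with the MPS backend (it falls back to CPU elsewhere); the dtype parameter also lets you probe the FP16-vs-FP32 throughput claim from upthread:

```python
import time
import torch

def bench_matmul(n=2048, device="cpu", dtype=torch.float32, iters=5):
    """Average seconds per n x n matmul on the given device."""
    a = torch.randn(n, n, device=device, dtype=dtype)
    b = torch.randn(n, n, device=device, dtype=dtype)
    for _ in range(2):          # warm-up (shader compilation, caches)
        _ = a @ b
    if device == "mps":
        torch.mps.synchronize()  # drain queued GPU work before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        c = a @ b
    if device == "mps":
        torch.mps.synchronize()  # wait for the GPU to actually finish
    return (time.perf_counter() - t0) / iters

print(f"cpu: {bench_matmul(device='cpu') * 1e3:.1f} ms")
if torch.backends.mps.is_available():
    print(f"mps: {bench_matmul(device='mps') * 1e3:.1f} ms")
```

Without the synchronize calls, the measured "GPU time" can shrink to almost nothing regardless of matrix size, so it's worth re-checking the ~5 ms figure this way.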