Yes, GDS will accelerate the IO to the GPU. I’d love to see the above C code compared to hyperoptimized GPU code on the right hardware, but I don’t want to accidentally nerd snipe myself :-) The unfortunate part of this particular benchmark is that once you have the data in the right place in your hardware, there is very little compute left. The GPU code would probably keep roughly constant performance even with an additional couple thousand operations on each row, whereas the CPU code would slow down.
Haha. Apologies. I hope I didn’t accidentally make anyone waste their time. If they did, I’m sure there are people interested in hiring such people anyway, so maybe it’s time well invested in the end.
I am testing a GH200, and the speed at which you can access the system memory is amazing. Assuming you have already encoded the station into a smallint, the dataset would be around 6 GB, which on such a system takes just 20 ms to transfer (I am sure about that because I'm observing a 9.5 GB transfer that took about 33 ms right now).
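A quick sanity check of that estimate, as a minimal sketch: it only reuses the two figures quoted above (9.5 GB in about 33 ms) and assumes the transfer rate stays the same for a 6 GB dataset. The function names are made up for illustration, and nothing here is measured by the script itself.

```python
# Figures quoted in the comment above; this script measures nothing itself.
observed_bytes = 9.5e9     # 9.5 GB transferred
observed_seconds = 33e-3   # in about 33 ms


def implied_bandwidth_gb_s(num_bytes, seconds):
    """Bandwidth in GB/s implied by moving num_bytes in the given time."""
    return num_bytes / seconds / 1e9


def transfer_time_ms(num_bytes, bandwidth_gb_s):
    """Time in ms to move num_bytes at a constant bandwidth (assumption)."""
    return num_bytes / (bandwidth_gb_s * 1e9) * 1e3


bandwidth = implied_bandwidth_gb_s(observed_bytes, observed_seconds)
estimate_ms = transfer_time_ms(6e9, bandwidth)

print(f"implied bandwidth: {bandwidth:.0f} GB/s")   # roughly 288 GB/s
print(f"6 GB at that rate: {estimate_ms:.1f} ms")   # roughly 20.8 ms
```

So the "about 20 ms for 6 GB" figure is consistent with the observed 9.5 GB transfer.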
If the disk can deliver the data at 1 GB/s, the CPU can process the data at 2 GB/s, and the GPU can process the data at 32 GB/s, then end to end the CPU processes the data at 1 GB/s and the GPU also processes the data at 1 GB/s: the disk is the bottleneck either way.
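The same argument as a tiny sketch, assuming the stages stream and overlap perfectly so that sustained throughput is simply the rate of the slowest stage. The helper name `end_to_end_gb_s` is hypothetical, and the rates are the illustrative ones from the comment, not measurements.

```python
def end_to_end_gb_s(*stage_rates_gb_s):
    """Sustained throughput of a streaming pipeline: the slowest stage wins.

    Assumes stages fully overlap and there is no other overhead.
    """
    return min(stage_rates_gb_s)


# Illustrative rates from the comment above, in GB/s.
disk, cpu, gpu = 1.0, 2.0, 32.0

cpu_pipeline = end_to_end_gb_s(disk, cpu)   # disk -> CPU
gpu_pipeline = end_to_end_gb_s(disk, gpu)   # disk -> GPU

print(cpu_pipeline, gpu_pipeline)  # both are limited by the disk
```

The 16x faster processor buys nothing until the storage can feed it.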
(also, personally, "large dataset" is a short way to say "doesn't fit in memory". if it fits in memory it's small, if it doesn't fit in memory it's large. but that's just my opinion. I generally avoid calling something a "large" or "small" dataset because it's an overloaded term that means different things to different people.)
The meaningful boundary between small data and large data is whether the whole dataset/processing is expected to fit in the RAM of a single common machine. From that perspective, 1 billion rows is a borderline case, which can be small or large depending on how large the rows are - and in this particular challenge the rows are tiny.
on large datasets, once loaded into GPU memory, cross-GPU shuffling with NVLink is going to be much faster than CPU to RAM.
on the H100 boxes with 8x400Gbps, IO with GDS is also pretty fast.
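For scale, a rough conversion of that figure into bytes per second. This is a sketch of raw line rate only: it assumes all 8 links are fully used in parallel and ignores protocol overhead, so real GDS throughput will be lower and depends on the storage behind it.

```python
links = 8               # network links per box, from the comment above
gbps_per_link = 400     # gigabits per second per link

aggregate_gbps = links * gbps_per_link      # total gigabits per second
aggregate_gbytes_s = aggregate_gbps / 8     # 8 bits per byte

print(f"{aggregate_gbps} Gbps is about {aggregate_gbytes_s:.0f} GB/s raw line rate")
```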
for truly IO-bound tasks I think a lot of GPUs beat almost anything :-)