this is true for small datasets. on large datasets, once loaded into GPU memory,...

pama · on April 14, 2024

Yes, GDS will accelerate the IO to the GPU. I’d love to see the above C code compared to hyperoptimized GPU code on the right hardware, but I don’t want to accidentally nerd snipe myself :-) The unfortunate part of this particular benchmark is that once you have the data in the right place in your hardware, there is very little compute left. The GPU code would probably keep constant performance with an additional couple thousand operations on each row, whereas the CPU would slow down.

https://docs.nvidia.com/gpudirect-storage/overview-guide/ind...

pama · on April 15, 2024

This would be the code to beat. Ideally with only 8 cores but any number of cores is also very interesting.

https://github.com/gunnarmorling/1brc/discussions/710

_zoltan_ · on April 15, 2024

so you'd rather nerd snipe others, gotcha ;) :D

pama · on April 17, 2024

Haha. Apologies. I hope I didn’t accidentally make anyone waste their time. If I did, I’m sure there are people interested in hiring them anyway, so maybe it’s time well invested in the end.

candido_heavyai · on April 14, 2024

I am testing a GH200 and the speed at which you can access system memory is amazing. Assuming you have already encoded the station into a smallint, the dataset would be around 6 GB, which on such a system takes just 20 ms to transfer (I am sure about that because I'm observing a 9.5 GB transfer that took about 33 ms right now).
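
A quick sanity check of those numbers, as a sketch only: it assumes decimal gigabytes and that the 6 GB transfer would run at the same rate as the observed 9.5 GB one.

```python
# Back-of-the-envelope check of the transfer figures quoted above.
GB = 1e9  # assumption: decimal gigabytes (the comment does not say GB vs GiB)

observed_bytes = 9.5 * GB    # observed transfer size
observed_seconds = 33e-3     # observed transfer time

# Bandwidth implied by the observation, in bytes per second.
bandwidth = observed_bytes / observed_seconds

# Assumption: the 6 GB dataset moves at that same rate.
dataset_bytes = 6 * GB
estimated_seconds = dataset_bytes / bandwidth

print(f"implied bandwidth: {bandwidth / GB:.0f} GB/s")               # 288 GB/s
print(f"estimated time for 6 GB: {estimated_seconds * 1e3:.1f} ms")  # 20.8 ms
```

So the observed 33 ms transfer is consistent with the roughly 20 ms estimate for 6 GB.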

nwallin · on April 14, 2024

> on large datasets, once loaded into GPU memory,

You're yada-yada-yadaing the best part.

If the disk can deliver the data at 1GB/s, the CPU can process it at 2GB/s, and the GPU can process it at 32GB/s, then end to end the CPU processes the data at 1GB/s and the GPU processes the data at 1GB/s.

(also, personally, "large dataset" is a short way to say "doesn't fit in memory". if it fits in memory it's small, if it doesn't fit in memory it's large. but that's just my opinion. I generally avoid calling something a "large" or "small" dataset because it's an overloaded term that means different things to different people.)
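
The bottleneck arithmetic in this comment (a 1GB/s disk feeding either a 2GB/s CPU or a 32GB/s GPU) can be sketched as a minimum over stage rates. The function name `effective_throughput` is made up for illustration, and the model assumes a fully streaming pipeline with no caching and no overlap tricks.

```python
def effective_throughput(*stage_rates_gb_s: float) -> float:
    """Steady-state rate of a streaming pipeline: the slowest stage wins."""
    return min(stage_rates_gb_s)

# GB/s, figures taken from the comment above.
DISK, CPU, GPU = 1.0, 2.0, 32.0

cpu_pipeline = effective_throughput(DISK, CPU)
gpu_pipeline = effective_throughput(DISK, GPU)

print(cpu_pipeline, gpu_pipeline)  # 1.0 1.0
```

Both pipelines come out disk-bound, which is the point: the faster processor only helps once the data is already in memory.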

konstantinua00 · on April 14, 2024

1 billion rows is "small dataset"?

PeterisP · on April 14, 2024

The meaningful boundary between small data and large data is whether the whole dataset/processing is expected to fit in the RAM of a single common machine. From that perspective, 1 billion rows is a borderline case, which can be small or large depending on how large the rows are - and in this particular challenge the rows are tiny.

belter · on April 14, 2024

Just the click stream data of four 15-year-olds chilling out in the living room while watching TikTok videos....

_zoltan_ · on April 14, 2024

eh, it's just two columns. 12 or 13GB IIRC.

15155 · on April 14, 2024

Try an array of FPGAs.

dan-robertson · on April 14, 2024

Why does that help with IO bandwidth?

15155 · on April 15, 2024

Because every transceiver pair can do 32Gb/s or 56Gb/s and you have dozens of them.

_zoltan_ · on April 14, 2024

I've never been a fan, plus good luck getting 800GB/s across.
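
For scale, a rough count of how many transceiver pairs it would take to reach 800GB/s at the 32Gb/s and 56Gb/s rates mentioned upthread. This is a sketch under loud assumptions: "GB" is read as gigabytes and "Gb" as gigabits, and lanes are assumed to run at raw line rate with no encoding or protocol overhead.

```python
import math

TARGET_GB_S = 800              # gigabytes per second, from the reply above
LANE_RATES_GBIT_S = (32, 56)   # gigabits per second per transceiver pair

# Convert the target to gigabits per second (8 bits per byte).
target_gbit_s = TARGET_GB_S * 8  # 6400 Gb/s

# Assumption: raw line rate, no encoding or protocol overhead.
lanes_needed = {
    rate: math.ceil(target_gbit_s / rate) for rate in LANE_RATES_GBIT_S
}

print(lanes_needed)  # {32: 200, 56: 115}
```

Under those assumptions, "dozens" of pairs falls short of 800GB/s; it would take on the order of a hundred or more.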