
this is true for small datasets.

on large datasets, once loaded into GPU memory, cross-GPU shuffling over NVLink is going to be much faster than CPU-to-RAM.

on the H100 boxes with 8x400Gbps, IO with GDS is also pretty fast.

for truly IO-bound tasks I think a lot of GPUs beats almost anything :-)



Yes, GDS will accelerate the IO to the GPU. I’d love to see the above C code compared to hyperoptimized GPU code on the right hardware, but I don’t want to accidentally nerd snipe myself :-) The unfortunate part of this particular benchmark is that once the data is in the right place in your hardware, there is very little compute left. The GPU code would probably show constant performance even with an additional couple thousand operations on each row, whereas the CPU would slow down.

https://docs.nvidia.com/gpudirect-storage/overview-guide/ind...
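To put "very little compute left" in perspective, here's a rough back-of-the-envelope calculation (the bandwidth, row size, and throughput figures are illustrative assumptions, not measurements from this benchmark):

```python
# Back-of-the-envelope: at a given ingest bandwidth, each row "earns"
# a compute budget before compute (rather than IO) becomes the bottleneck.
# All figures below are assumed for illustration.

def ops_budget_per_row(io_bandwidth_gbps, row_bytes, compute_gops):
    """Ops available per row before compute, not IO, limits throughput."""
    rows_per_s = io_bandwidth_gbps * 1e9 / row_bytes
    return compute_gops * 1e9 / rows_per_s

# ~16-byte rows streamed at 5 GB/s into a GPU sustaining ~10 Tops/s
budget = ops_budget_per_row(5, 16, 10_000)
print(round(budget))  # 32000 ops per row
```

So even "a couple thousand" extra operations per row would sit comfortably inside the budget, which is why the GPU's performance would stay flat.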


This would be the code to beat. Ideally with only 8 cores, but any number of cores is also very interesting.

https://github.com/gunnarmorling/1brc/discussions/710
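For anyone who hasn't seen the challenge: the per-row work is tiny. A minimal, deliberately unoptimized Python sketch of the aggregation itself (not the linked C code) looks like this:

```python
# 1BRC-style aggregation: rows are "station;temperature" lines,
# and we track min, max, sum, and count per station.
from collections import defaultdict

def aggregate(lines):
    stats = defaultdict(lambda: [float("inf"), float("-inf"), 0.0, 0])
    for line in lines:
        station, temp = line.rsplit(";", 1)
        t = float(temp)
        s = stats[station]
        s[0] = min(s[0], t)
        s[1] = max(s[1], t)
        s[2] += t
        s[3] += 1
    # report (min, mean, max) per station
    return {k: (lo, round(total / n, 1), hi)
            for k, (lo, hi, total, n) in stats.items()}

print(aggregate(["Oslo;-1.2", "Oslo;4.0", "Dakar;31.5"]))
```

All the interesting engineering in the fast entries is in parsing and memory layout, not in this arithmetic.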


so you'd rather nerd snipe others, gotcha ;) :D


Haha. Apologies. I hope I didn’t accidentally make anyone waste their time. If anyone did, I’m sure there are people interested in hiring such people anyway, so maybe it’s time well invested in the end.


I am testing a GH200, and the speed at which you can access system memory is amazing. Assuming you have already encoded the station name into a smallint, the dataset would be around 6GB, which on such a system takes just 20ms to transfer (I'm confident about that because I'm observing a 9.5GB transfer that took about 33ms right now).
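As a sanity check, the two figures quoted above imply a consistent transfer rate:

```python
# Implied bandwidth from the sizes and times quoted in the comment above.
def bandwidth_gb_per_s(size_gb, time_ms):
    return size_gb / (time_ms / 1000)

print(bandwidth_gb_per_s(6.0, 20))  # 300.0 GB/s
print(bandwidth_gb_per_s(9.5, 33))  # ~288 GB/s
```

Both land around 290-300 GB/s, so the numbers hang together.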


> on large datasets, once loaded into GPU memory,

You're yada-yada-yadaing the best part.

If the disk can deliver the data at 1GB/s, the CPU can process the data at 2GB/s, and the GPU can process the data at 32GB/s, then both the CPU and the GPU end up processing the data at 1GB/s: the disk is the bottleneck either way.
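The arithmetic here is just a min() over stage rates, assuming the stages stream concurrently (a pipelined read-and-process, not read-everything-then-process):

```python
# End-to-end rate of a streaming pipeline is the slowest stage's rate.
def effective_throughput(*stage_gbps):
    return min(stage_gbps)

print(effective_throughput(1, 2))   # disk 1 GB/s, CPU 2 GB/s  -> 1
print(effective_throughput(1, 32))  # disk 1 GB/s, GPU 32 GB/s -> 1
```

The 16x faster processor buys you nothing until the slowest stage speeds up.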

(Also, personally, "large dataset" is just a short way to say "doesn't fit in memory": if it fits in memory it's small, if it doesn't, it's large. But that's just my opinion. I generally avoid calling something a "large" or "small" dataset because it's an overloaded term that means different things to different people.)


1 billion rows is "small dataset"?


The meaningful boundary between small data and large data is whether the whole dataset and its processing are expected to fit in the RAM of a single common machine. From that perspective, 1 billion rows is a borderline case, which can be small or large depending on how large the rows are - and in this particular challenge the rows are tiny.


Just the click-stream data of four 15-year-olds chilling out in the living room while watching TikTok videos....


eh, it's just two columns. 12 or 13GB IIRC.


Try an array of FPGAs.


Why does that help with IO bandwidth?


Because every transceiver pair can do 32Gb/s or 56Gb/s, and you have dozens of them.
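For a rough aggregate (lane counts here are illustrative, and this ignores 64b/66b line-coding and protocol overhead):

```python
# Aggregate transceiver bandwidth: lanes x per-lane rate, Gb/s -> GB/s.
# Raw figure only; real line coding and framing eat into this.
def aggregate_gb_per_s(num_lanes, lane_gbps):
    return num_lanes * lane_gbps / 8  # 8 bits per byte

print(aggregate_gb_per_s(32, 56))  # 224.0 GB/s raw
print(aggregate_gb_per_s(64, 32))  # 256.0 GB/s raw
```

Even with optimistic lane counts, the raw aggregate lands in the low hundreds of GB/s.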


I've never been a fan, plus good luck getting 800GB/s across.



