Tbh, this paper is a bit much unless you have a lot of time on your hands and enough CPU-architecture background to really grok it.
I think there are better blog posts out there that explain the concepts. "Cache locality" "memory cost of context switching" are some terms to put in search engines. Maybe add "thread local storage" in there too.
The big takeaway from this paper, though, at least when I first read it (and the concept has risen to prominence in the years since, backed up by benchmarks), is that Big O isn't really the end of the argument when it comes to performance.
The way CPUs work, it's extremely fast to rip through contiguous datasets. Jumping around to random memory addresses is way slower. A corollary is that jumping between threads is very slow for the same reasons: loading and unloading random chunks of memory. There are a lot of details around the specifics of why, though: the different cache levels and their latencies, what happens on a cache miss, etc.
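You can see the contiguous-vs-random effect with a minimal sketch like the one below: the same million elements get summed twice, once walking indices in memory order and once in a shuffled order. In CPython the effect is muted compared to C (list elements are pointers anyway), but the shuffled walk still tends to be measurably slower because it defeats hardware prefetching. All names here are illustrative, not from the paper.

```python
import random
import timeit

N = 1_000_000
data = list(range(N))

# Sequential traversal: indices in memory order.
seq_idx = list(range(N))

# Random traversal: the exact same indices, shuffled, so the
# arithmetic work is identical but memory access jumps around.
rand_idx = seq_idx[:]
random.Random(0).shuffle(rand_idx)

def walk(indices):
    total = 0
    for i in indices:
        total += data[i]
    return total

# Both walks touch every element exactly once, so the results match.
assert walk(seq_idx) == walk(rand_idx)

t_seq = timeit.timeit(lambda: walk(seq_idx), number=3)
t_rand = timeit.timeit(lambda: walk(rand_idx), number=3)
print(f"sequential: {t_seq:.3f}s  shuffled: {t_rand:.3f}s")
```

Run it a few times; the gap widens as the dataset outgrows the caches.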
This led HFT-type programs to favor linear arrays for storing things, and pinning threads to CPU cores, amongst other things, but that's the gist of it.
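As a rough sketch of both ideas (the field names and numbers are made up for illustration): instead of a list of per-order objects scattered across the heap, keep one flat, contiguous array per field (a structure-of-arrays layout), and optionally pin the process to a core so the scheduler doesn't bounce it around.

```python
import os
from array import array

# Array-of-objects: each order is its own heap allocation,
# scattered across memory.
orders_objs = [{"price": 101.5, "qty": 10}, {"price": 102.0, "qty": 5}]

# Structure-of-arrays: one contiguous buffer per field, so a scan
# over prices rips through adjacent memory.
prices = array("d", [101.5, 102.0])  # doubles
qtys = array("q", [10, 5])           # 64-bit ints

def total_notional(prices, qtys):
    return sum(p * q for p, q in zip(prices, qtys))

print(total_notional(prices, qtys))

# Core pinning (Linux only): restrict this process to core 0.
if hasattr(os, "sched_setaffinity"):
    os.sched_setaffinity(0, {0})
```

In real HFT code this is done in C/C++ with cache-line-aligned buffers, but the layout idea is the same.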