Rust Atomics and Locks: Low-Level Concurrency in Practice (marabos.nl)
292 points by signa11 on Jan 6, 2023 | hide | past | favorite | 46 comments


The foreword by Paul E. McKenney makes a great case to read this book even if you don't care about Rust at all:

> Which brings us to another group of potential readers, the Rust skeptics. While I do believe that most Rust skeptics are doing the community a valuable service by pointing out opportunities for improvement, all but the most Rust-savvy of skeptics would benefit from reading this book. If nothing else, doing so would enable them to provide sharper and better-targeted criticisms.

> Then there are those dyed-in-the-wool non-Rust developers who would prefer to implement Rust's concurrency-related safety mechanisms in their own favorite language. This book will give them a deeper understanding of the Rust mechanisms that they would like to replicate, or, better yet, improve upon.

https://marabos.nl/atomics/foreword.html


In case anyone here doesn’t know, Paul McKenney was one of the main contributors to the RCU[1] implementation for the Linux kernel. So the guy knows a thing or two about concurrency.

[1] https://en.wikipedia.org/wiki/Read-copy-update if you just want to know what it is and http://www.rdrop.com/users/paulmck/RCU/rclock_OLS.2001.05.01... if you want the good stuff


Although, somewhat amusingly, neither C++ nor Rust (which basically just copies the C++ atomics and memory model) can correctly express RCU within the language itself. The leading proposal for C++ (at least AFAICT) is to just add RCU primitives to the standard library and make their correctness the implementation's concern:

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p25...

I hope we get a next generation of systems languages with improved memory models that can actually express these things directly.


> Although, somewhat amusingly neither C++ nor Rust [...] can be used to correctly express RCU written within the language.

What do you mean exactly? Of course the required remote memory barrier needs to be a primitive, as it is provided by the OS and ultimately the hardware, but AFAIK the asymmetric synchronization needed by RCU can be modelled within C++ with a signal and an atomic signal fence.


You can't express the equivalent of Linux's rcu_dereference primitive in a way that avoids data races without using memory_order_acquire (or memory_order_consume, which is treated like memory_order_acquire in all implementations, and is fundamentally broken as I link to in my sibling reply). With a memory consistency model weaker than TSO (e.g. ARM, RISC-V, POWER), this imposes unnecessary overhead and eliminates some of the benefits of RCU.

To work around this flaw, you can use inline assembly, implementation-specific "compiler-only" barriers, etc. but that's outside of the language. It also forces the compiler to be much more pessimistic about optimizations than is actually necessary to preserve dependency ordering.
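To make the cost concrete, here is a hedged sketch (not Linux's actual code) of what an RCU-style reader looks like under the Rust/C++ model, where Acquire stands in for the dependency ordering that rcu_dereference exploits:

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Reader side of an RCU-style publish/read pattern. Under the Rust/C++
// memory model the load must be Acquire (or the broken Consume) to order
// the subsequent dereference; Linux's rcu_dereference instead relies on
// the address dependency, which is free on ARM/POWER but inexpressible
// in these languages.
fn read_value(slot: &AtomicPtr<u64>) -> Option<u64> {
    let p = slot.load(Ordering::Acquire);
    if p.is_null() {
        None
    } else {
        Some(unsafe { *p }) // dependent load: the address flows from `p`
    }
}

fn main() {
    // Writer side: publish a value by storing a pointer to it.
    let slot = AtomicPtr::new(Box::into_raw(Box::new(42u64)));
    assert_eq!(read_value(&slot), Some(42));
    // Reclaim the published value (real RCU defers this via grace periods).
    unsafe { drop(Box::from_raw(slot.load(Ordering::Relaxed))) };
}
```

On x86 the Acquire here is free anyway; on weaker architectures it is exactly the extra barrier the parent comment is complaining about.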


Can you provide any extra information about what is unable to be expressed? I’d like to look into what sort of constructs could be useful in this context.


The keyword to search for is "memory_order_consume", which was C++11's proposed solution to the problem that turned out to not work in practice. Here are some of the C++ WG documents describing the issues:

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p00...

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p01...

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p04...

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p07...

A lot of the problems center around the reliance on data dependencies for ordering, and the difficulty of expressing these requirements in a manner that is compatible with separate compilation and compiler optimizations.


Memory_order_consume was added to the C/C++ memory model to support architectures like the Alpha, where indirecting through memory (which includes array indexing, etc.) need not create a true happens_before relationship. So you can have things like:

Assume: a = 1, p = &a, b = 0

  Thr. 1         | Thr. 2
  b = 1          |
  memory_barrier | i = *p
  p = &b         |
The result can be i = 0

Even though b=1 happens "before" p=&b within Thread 1, it is seen as happening after it by Thread 2. To overcome this, Thread 2 needs a barrier between the load of p from memory and the indirect load that assigns to i. Practically no other architectures have that quirk, and it only happens on Alpha because of its two separate cache banks.

Unfortunately, it has turned out to be infeasible to give memory_order_consume a proper weak semantics setting it apart from memory_order_acquire, given e.g. how common it is for indirections to be altered by compiler optimizations. So Ordering::Consume does not, as of yet, exist in Rust.


The problem of the CPU reordering things only exists on architectures like Alpha. However, there's still the problem of the compiler reordering things (which exists on every architecture), and memory_order_consume was intended to solve both. As you can see from the original rationale document from 2010, this RCU use-case was one of the main motivations:

https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1525.htm


He also wrote a good introduction to parallel programming that's freely available as well:

https://mirrors.edge.kernel.org/pub/linux/kernel/people/paul...


The combination of Mara as the author and Paul McKenney writing a foreword is already a good reason to read the book. :)



> Then there are those dyed-in-the-wool non-Rust developers who would prefer to implement Rust's concurrency-related safety mechanisms in their own favorite language.

The book's called "atomics and locks". What language doesn't already have those? If I'm touching those, doesn't that mean I'm already trying to reimplement my own concurrency-safety mechanisms?

How about Rust implements my favourite concurrency-related safety mechanisms, so I don't ever need to know about atomics or locks.


The "About this book" tells you what the book covers. The whole book is available for you to peek at if you don't trust it to correctly summarize itself.

As you would expect, Rust does have concurrency primitives, which is why the book shows you how they are implemented and why they are implemented the way they are. In doing so it also teaches you how to do low-level things for when you can't use the language-provided primitives, like if you're interacting with FFI boundaries or implementing your own lock-free data structure.

Your comment is like seeing a link to "Learning Rust With Entirely Too Many Linked Lists" and reacting with "what language doesn't have linked lists?"

Edit: I see that you might have meant "why would I reimplement Rust's atomic primitives in another language". Well, some documentation is needed for anyone who will have to implement those language primitives, whether in existing languages (because there are unaddressed needs) or in new ones. More documentation and explanation of tricky concepts is good.
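For a flavor of the "how they are implemented" part, here is a toy spinlock of the kind such a book builds up (illustrative sketch, not the book's actual code):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

// Toy spinlock: Acquire on lock and Release on unlock establish a
// happens-before edge between successive critical sections.
pub struct SpinLock<T> {
    locked: AtomicBool,
    value: UnsafeCell<T>,
}

// Safe to share across threads because access to `value` is serialized.
unsafe impl<T: Send> Sync for SpinLock<T> {}

impl<T> SpinLock<T> {
    pub const fn new(value: T) -> Self {
        Self { locked: AtomicBool::new(false), value: UnsafeCell::new(value) }
    }

    pub fn with<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        // Spin until we win the swap from false to true.
        while self.locked.swap(true, Ordering::Acquire) {
            std::hint::spin_loop();
        }
        // Safety: we hold the lock, so we have exclusive access.
        let r = f(unsafe { &mut *self.value.get() });
        self.locked.store(false, Ordering::Release);
        r
    }
}

fn main() {
    let counter = SpinLock::new(0);
    std::thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                for _ in 0..1000 {
                    counter.with(|v| *v += 1);
                }
            });
        }
    });
    assert_eq!(counter.with(|v| *v), 4000);
}
```

A real implementation would also worry about fairness and about parking instead of spinning, which is precisely the ground the book covers.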


What sort of things are "your favourite", and why ought Rust to specifically do things you favour, rather than what anybody else might like?

Rust got scoped threads (again), that's a rather nice safe concurrency feature, a way to say "This thread shall only live during this object's lifetime". For example you can give a new HttpConnection to a scoped thread, knowing that Rust ensures the thread ends before the HttpConnection ceases to exist. Before scoped threads Rust wouldn't let you (safely) make threads which use outside objects with a finite lifespan, as it can't be sure the thread won't touch an object which no longer exists.
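A minimal sketch of that pattern (with a plain Vec standing in for the HttpConnection):

```rust
use std::thread;

fn main() {
    let data = vec![1, 2, 3];
    let mut results = Vec::new();
    thread::scope(|s| {
        // Spawned threads may borrow `data` because the scope guarantees
        // they are joined before `data` can go out of scope.
        let handles: Vec<_> = data.iter().map(|&x| s.spawn(move || x * 2)).collect();
        for h in handles {
            results.push(h.join().unwrap());
        }
    });
    // All scoped threads have finished here; `data` is still usable.
    assert_eq!(results, vec![2, 4, 6]);
    println!("{results:?}");
}
```

Before `std::thread::scope` (stabilized in Rust 1.63), spawning a thread required `'static` data, so the same code needed Arc or unsafe.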

But this sort of book is more aimed at people who are interested in the details you apparently don't care about (or don't want to care about). Somebody has to know how it works to build the higher level mechanisms you desire.


> Howabout Rust implements my favourite concurrency-related safety mechanisms, so I don't need to ever know about atomics or locks.

How on earth are you doing concurrency without touching these two key primitives at some point? Lots of lockless?


> How on earth are you doing concurrency without touching these two key primitives at some point?

using one of the many higher level concepts/frameworks available.

- actor model

- fork join

- CSP

etc.
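Those higher-level models can be used from safe code without ever touching a lock or atomic directly (though they are, of course, underneath). A sketch using std's mpsc channels, CSP-flavoured message passing:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel();

    // Fan out work to four threads; each sends its result on the channel.
    let workers: Vec<_> = (0..4)
        .map(|i| {
            let tx = tx.clone();
            thread::spawn(move || tx.send(i * i).unwrap())
        })
        .collect();
    drop(tx); // drop the last sender so the receiver's iterator terminates

    for w in workers {
        w.join().unwrap();
    }

    // Collect results; no explicit lock or atomic in sight.
    let total: i32 = rx.iter().sum();
    assert_eq!(total, 0 + 1 + 4 + 9);
    println!("{total}");
}
```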


Someone has to implement those frameworks, and those people need to understand pretty much everything covered by this book :)


not to disagree or antagonize, but the question was

> How on earth are you doing concurrency without touching these two key primitives

and not how to write high level concurrency frameworks.

I wouldn't trust very much a concurrency framework written by someone who just learned about locks and atomics from a book.

I wouldn't even trust myself to write low level concurrent code using locks and atomics.

I have 26 years of experience as a programmer, and if there's one thing I've learned, it's that low-level concurrency is really hard to get right, even when you perfectly understand the underlying concepts; it's best left to real experts who have dedicated their careers to it.

edit: I'm sure the book is great and the authors very knowledgeable and I'm really curious to read it.


Sometimes all you need is an atomic counter though (or lock). Like, for the commonly used Prometheus metrics format, you usually just have global, atomic counters, because it's arguably the most efficient/easiest way to do it. Higher level primitives are nice, but even lower level APIs might be "the right choice" for even high-level problems.
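A sketch of that pattern, a global atomic request counter of the sort a Prometheus exporter would scrape (names are illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Global metric counter; no Mutex needed for a plain counter.
static REQUESTS: AtomicU64 = AtomicU64::new(0);

fn main() {
    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                for _ in 0..1000 {
                    // Relaxed is enough here: we only need atomicity of the
                    // increment, not ordering with other memory operations.
                    REQUESTS.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    assert_eq!(REQUESTS.load(Ordering::Relaxed), 4000);
    println!("{}", REQUESTS.load(Ordering::Relaxed));
}
```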


They predate Rust.


There are many practical instances where an atomic is the simplest solution, e.g., shared counters.


Maybe the book isn't for you then. Not every book has to be.


Maybe I was criticising the quoted text, not the book.


FWIW, this isn't mentioned in the book, and only very cursorily in the official Rust docs, but it looks like 128-bit atomics have landed (on supported architectures): https://github.com/rust-lang/rust/blob/19bc8fb05ab083a315ee4...


It's very kind of Mara to share the book online, though I still preordered a paper copy for bedtime reading. :)


It is an exceptionally good book about very hard topics.

I've been programming high-performance systems in Java for 10+ years, and Java was the first commonly used language with a formal memory model, as far as I know. The C/C++/Rust memory models are modeled after Java's.

There was no new material for me, but the material is very well organized, explained, and presented. This is very hard material; there are a lot of tutorials about the JMM, and most of them are worse than this book.

Another good property of this book is its density of information. Many IT books now contain needless repetition and artificial bloat to increase the page count. Every page of this book is useful!

This book will be exceptionally useful not only for Rust programmers, but also for programmers who use languages with the same or similar memory models, including C, C++ and Java.

5 stars out of 5, recommended for every programmer who wants to understand the complex mechanics of memory models and high-performance concurrency in modern languages.


I'm curious: how do folks in Rust implement striped locking?

Striped locking is where you chunk sections of data behind the same lock to improve scalability. So, 1 lock might protect 1,000 items in an array, for example.


First thing that'd come to mind for me would be replacing the unstriped Vec<Mutex<T>> with some wrapper like:

    struct Striped<T, const StripeSize: usize>(Vec<Mutex<[T; StripeSize]>>, Mutex<tinyvec::ArrayVec<[T; StripeSize]>>);
I think I'd probably want a wrapper either way to ensure I'm always taking locks in a particular order to avoid deadlocks, anyway.
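Another hedged sketch of the same idea, hashing each key to one of N stripes so that operations on different stripes never contend (all names illustrative):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

// Striped set: N locks, each guarding its own chunk of the data,
// so at most 1/N of the keys contend on any one lock.
struct StripedSet<const N: usize> {
    stripes: [Mutex<Vec<u64>>; N],
}

impl<const N: usize> StripedSet<N> {
    fn new() -> Self {
        Self { stripes: std::array::from_fn(|_| Mutex::new(Vec::new())) }
    }

    fn stripe_for(&self, key: u64) -> &Mutex<Vec<u64>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.stripes[(h.finish() as usize) % N]
    }

    fn insert(&self, key: u64) {
        // Only one stripe is locked; inserts to other stripes run in parallel.
        self.stripe_for(key).lock().unwrap().push(key);
    }

    fn contains(&self, key: u64) -> bool {
        self.stripe_for(key).lock().unwrap().contains(&key)
    }
}

fn main() {
    let set: StripedSet<8> = StripedSet::new();
    for k in 0..100u64 {
        set.insert(k);
    }
    assert!(set.contains(42));
    assert!(!set.contains(1000));
}
```

Since each operation only ever takes a single lock, there is no lock-ordering concern here; that only arises for operations spanning multiple stripes.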


Realistically you don't, unless you really, really need it. If you really need it, you drop to unsafe and build a "safe" interface in front of it.

Rust makes writing very high-performance multi-threaded code incredibly hard, because most of the tricks for getting that last bit of performance out are basically asking to generate races and are near impossible to express in a way that is provably safe. You have to think about your data layout from day one, and if you need that kind of performance you're just going to have to deal with a bunch of unsafe in your codebase.


I've never done this but I imagine you could use something like

  [Mutex<[T; 1000]>; 10]


The book, especially chapter 7, seems to say several things that are the complete opposite of what Herb Sutter said in his famous talk a number of years ago.

In the book:

> This means that on ARM64, sequentially consistent operations are exactly as cheap as acquire and release operations. Or, rather, that ARM64 Acquire, Release, and AcqRel operations are as expensive as SeqCst. Unlike x86-64, however, Relaxed operations are relatively cheap, as they don’t result in stronger ordering guarantees than necessary.

However in Herb Sutter's talk (specifically the second half here: https://www.youtube.com/watch?v=KeLBd2EJLOU ) he says that ARMv8 Acquire/Release/AcqRel operations are cheaper than SeqCst. This book seems to claim that x64 is somehow superior to ARMv8, which from my understanding is completely wrong.


Let's go to primary sources: https://developer.arm.com/documentation/ddi0487/latest

The author is right. ARMv8 supports relaxed memory ordering but its memory model does not support acquire-release ordering without sequential consistency. (Update: Support for weaker acquire-release ordering was added in a later revision.)

I'll quote a relevant excerpt from B2.3.11:

    Where a Load-Acquire appears in program order after a Store-Release, the memory access generated by the Store-Release instruction is Observed-by each PE to the extent that PE is required to observe the access coherently, before the memory access generated by the Load-Acquire instruction is Observed-by that PE, to the extent that the PE is required to observe the access coherently. 
Sequential consistency is needed to avoid store/load reordering in cases like this:

    // Thread 1
    flag1.store(1, SeqCst)
    if flag2.load(SeqCst) == 0 {
        // Guarded action
    }

    // Thread 2
    flag2.store(1, SeqCst)
    if flag1.load(SeqCst) == 0 {
        // Guarded action
    }
If these were instead implemented with acquire/release ordering as defined by the C++ or Rust memory model, the resulting happens-before constraints would not prevent both threads from executing their guarded actions.

The excerpt from the Architecture Reference Manual says that if you use their load-acquire (ldar) and store-release (stlr) instructions, it is not possible for the store-release to be moved after the load-acquire, as observed by PEs (processing elements, their abstraction of hardware threads).

Let's look at how C++ compilers implement acquire-release vs sequential consistency on x86 and ARMv8:

https://godbolt.org/z/3fd5jse18

The machine code on ARMv8 is identical for thread_acq_rel and thread_seq_cst. Whereas on x86 the thread_seq_cst code has to use xchg (an alternative to store + mfence) to achieve sequential consistency.

Update: shachaf pointed out that ARMv8 more recently added support for weaker acquire-release semantics in the ARMv8.3 revision. It looks like the first processor to ship with ARMv8.3 support was the A12X from Apple in 2018, which is 5 years after Herb's talk. If we take the code from before and compile for ARMv8 with all architectural features enabled, you will see different machine code for thread_acq_rel which uses the newer ldaprb instruction:

https://godbolt.org/z/dnP9sebcz

This illustrates a difficulty with talking about "ARMv8" as a fixed thing. It's much more of a rapidly moving target than x86. That said, the ARMv8.3 addendum should have been mentioned, at least parenthetically; I emailed the author suggesting an info box.


Interesting to check what the various stores price the book at. The cheapest I found was 34 EUR and the most expensive 68 EUR, pretty much double the price.


One of the things most concurrency material leaves out is the actual definition of concurrency. I read the first chapter and didn't see it defined, but I see the book touches on happens-before, so it might be covered later (?)


Thanks for putting the time and extensive effort into this book. I am deeply interested in concurrency, asynchrony and parallelism; I even journal about it every day in my GitHub journal ideas repository (see my profile).

I've been talking on the r/programminglanguages Discord about syntax for async/await. I am writing a programming language that has parallelism as a first-class citizen. I use the actor2.java lock-free algorithm below for sending messages between threads.

I want my async/await tasks to be eager and run in their own threads. So if I do this

  handle1 = async task1();
  handle2 = async task2();
  handle3 = async task3();
  // at this point in time, the token ring thread pool shall be running 3 tasks in parallel
  return1 = await handle1;
  return2 = await handle2;
  return3 = await handle3;
I believe other languages do not run CPU-bound async/await computations on their own threads, only IO.

I wrote an unrolled state machine for my async/await in Java. This models a simple async/await program and runs tasks on other threads - without locks. I use a design I call token ring parallelism, where threads take turns and are linked together in a ring structure. A semaphore communicates write status along the ring. The plan is to write a transpiler to generate the switch statements automatically. Then any language with threading can have async await support even if it is not native.

https://github.com/samsquire/multiversion-concurrency-contro...

I wrote my own lock-free algorithm, linked here, that I use to do message passing between actor threads. My goal is high throughput and low latency.

https://github.com/samsquire/multiversion-concurrency-contro...

Running the benchmark with 100 threads and with 11 threads (on a 12-core processor; I deliberately left one core for Windows), I get:

  lock-free,  100 threads: 114657500 total requests, 22079241.286347 requests per second, time taken: 5.193000
  lock-free,   11 threads: 166512180 total requests, 33249237.220447 requests per second, time taken: 5.008000

For the lock-based benchmark:

  lock-based, 100 threads: 16307090 total requests, 3246484.172805 requests per second
  lock-based,  11 threads: 178770160 total requests, 35732592.444533 requests per second, time taken: 5.003000

Each request is a synchronization event that sends 10 messages in a loop or adds 10 to a counter to represent the synchronization event.

There are probably problems with my measuring methodology; I am doing the simplest possible thing that will work. I am trying to show that my lock-free algorithm scales even under heavy contention.


Out of curiosity, how do you benchmark your algorithms?

> e.g, This is another parallel multithreaded actor model. Run Actor2.java to run it. I get around 1.1 billion requests per second with this model.


Does anyone know how to pronounce the author's name? Is the "s" more like an "sh"?


This book is absolutely wonderful and a pleasure to read.*

* I say this as a staunch critic of Rust.


I'm interested in the concept of concurrency, but is there anywhere that it would be useful outside of scientific applications and operating systems? It feels like a powerful paradigm to master but I'm just not sure it's worth the investment for the average developer.


It’s a common misconception but there is a difference between concurrency and parallelism.

Concurrency is very useful across all aspects of the stack including OS code, and even web app code.

Your comment about "usefulness" is likely about parallelism, which I agree is typically only useful inside abstractions like OS code and web frameworks.


For what it's worth, if I hadn't learned about low-level concurrency primitives in Rust, I wouldn't have had the mental framework to succeed in creating a realtime websocket service in Python.

My two cents is that learning is enriching and inspiring. If you learn concepts from one context you can apply them in another in new and surprising ways. Don't put yourself in a box like "I'm just a normal developer, I don't know about that OS stuff"; learn about the things that excite you and worry about applying them later. Specialization is for ants.

But if this advice isn't useful to you, feel free to discard it.


I've used tons of concurrency in Android apps, iOS apps, Unreal Engine games, C++ code running a Raspberry Pi drone, embedded Android telemetry (Java/C++), and C++ AR/VR high-resolution realtime media streaming systems.

If you can wield concurrency, you can use it anywhere. It's liberating to be able to drastically improve the performance of your code by leveraging the multi-core processors our software runs on these days.

I've never done any OS work nor scientific computing.


Most web servers / apps. Just think about how you would test a simple register endpoint for example.


What a great looking book! Congrats!


BTW, I am looking for a statically linked rust compiler (the one written in rust ofc) toolchain for ELF64.

This is for bootstrap.



