Things Unix can do atomically (2010)

atrust · on May 30, 2016

In a few simple words, can someone explain what does "atomically" mean? I personally used this term when talking about some Redis operations, but never knew the real gist of the word and concepts behind it. I have a very brief understanding of the term and if I'd have to explain it to a person, I'd say it's "the operation that does not have any side effects when performing its unit of work". Is my understanding even close to what atomic operation really is?

asveikau · on May 30, 2016

If something alters an object such that it goes from state A to state B, it might need to do some work along the way (call it "state C"). Atomicity means that if the operation is interrupted or observed while it's going on, the existence of a "state C" never leaks out. It's always in either state A or B; any third state that might exist along the way is never visible. (Hence what people often say: the operation is either completed or didn't happen at all.)

Renaming a file is a good example. Within the internal structure of the filesystem, you have a directory entry in an old location. That must be removed. You may have another file with the same name in the destination directory. That file must be overwritten. Internally, these things happen by a multi-step process, eg: remove entry for old name, remove pre-existing entry for new name, create new entry for new name. But the system creates the appearance of just 1 step. You don't get file not found while it's overwriting the destination file. You don't ever see the file having both old and new names at the same time.

jonaf · on May 31, 2016

edit: I just realized you said "renaming." Original comment left below, but I edited before I get downvoted for a classic reading comprehension fail.

Atomicity requires that the leakage mentioned shall not occur from any context aside from its own internal context. That makes your example somewhat of a simplification because these state transitions are visible to other processes. It is a common mistake to try to use files for locking, for example, instead of using the more robust flock(1).

geofft · on May 31, 2016

> It is a common mistake to try to use files for locking, for example, instead of using the more robust flock(1).

Why is this a mistake? It is my understanding that, if all the locking you need is a simple mutex, creating a file with a well-defined name with O_CREAT | O_EXCL is atomic -- the file will either be created or not (in which case the call will fail with EEXIST), and no two processes can possibly both succeed at creating the file. This even works on NFS; it was apparently broken in the NFS client in Linux 2.6.5 and below, but it is supposed to work in NFS, and is generally the only reliable way of getting locks in NFS.

You don't get any better way to wait on the lock than re-trying to create the file, and you don't have any mechanism for dealing with clients that die while holding the lock (i.e., it's an aggressively CP system), but for what it does, it's supposed to work correctly and atomically.

avar · on May 31, 2016

The flock() method is preferable when you don't need to use NFS because as you say it'll automatically clean the lock up if the process holding it dies.

This gets rid of all the edge cases with stale locks in one fell swoop.

But as you point out if you want to do this e.g. over NFS you should create a file, but then you need to deal with stale locks.

If you can at all avoid that using flock() is generally better.

geofft · on May 31, 2016

http://0pointer.de/blog/projects/locking.html claims that flock() is less reliable over NFS (returns true without actually locking anything on Linux < 2.6.12 and "BSD" - not sure which BSDs or whether that's still true).

And my instinct is that in a networked scenario, you're at least as worried about a machine dying as a process on the machine (i.e. a network partition). A flock()-based lock doesn't clean itself up if the client is unreachable, does it?

avar · on May 31, 2016

Yes as I pointed out you don't want this if you're doing NFS.

Personally I prefer something like a MySQL table with GET_LOCK() to process things instead of NFS if I need multiple machines. It gives you flock() like semantics in that if a machine or client goes away the GET_LOCK() is automatically freed, i.e. it survives as long as the connection to the database survives.

Not having to deal with stale locks generally sucks way less than the extra overhead of a database.

For any NFS-based scenario you usually end up creating a "task" "task.underway" and "task.done" files as locks, and re-enqueuing tasks if you have a "underway" file that's too old without a "done" file.

You'd do the same with a MySQL table that you GET_LOCK() on, except you can safely re-enqueue "underway" tasks if you acquire the lock on them, since you know their consumers have gone away.

jonaf · on June 1, 2016

Technically, you're right that it's atomic to create a file. But creating a lock using a file can be deceptive and is a common pitfall in my experience. I have seen a lot of shell scripts take this form:

  if [ ! -f $FILE ] ; then
   touch $FILE
   # do something dangerous, assuming I have a lock
   rm $FILE
  fi

The problem here is, of course, that I've checked whether the file exists, but another process (even a concurrent execution of the same script) could remove $FILE after I've checked that it doesn't exist. Now I (or any other process) can happily proceed to create $FILE, thinking that no one else is executing simultaneously. Actually, if I ran two executions of this script at about the same time, they could both pass this check and executed the (mistakenly expectedly) "synchronized" block.

Of course, you don't have to use flock(1) to make this operation atomic. It just handles a lot of the extra work that I don't want to have to think about, even if I did set `noclobber` or something like that.

asveikau · on May 31, 2016

Nonetheless the reply prompted me to look into the rename case specifically. Apparently on Linux, the replacement of the destination file is atomic (as many of us already knew and take for granted), but there's no guarantee that you won't see both old and new names in flight for a brief moment in time (like the last sentence of my comment).

Would not be surprised if all bets are off once you get an NFS mount involved.

As always it's a tradeoff between useful behaviors and the cost of synchronizing.

sp332 · on May 30, 2016

Not really. It just means that it's indivisible (the original meaning of "atom"). Either it succeeds or fails, you never have to worry about it being half-finished. This includes actions which are so small they are literally indivisible, or actions which roll back to the original state if they fail.

LukeShu · on May 30, 2016

Not just about it only completing or failing, but another observer in the system should never be able to find it in the half-way state. To everyone but the implementer of the atomic operation, it has no half-way states.

lomnakkus · on May 31, 2016

In the database terminology that's usually called "Isolation" to keep the concept separate from the rather restricted definition of atomicity.

atrust · on May 30, 2016

Got it. Now it makes more sense to me. Now I know people tend to talk about atomicity when it comes to low-level-ish things. But say I create some sort of a web service with a bunch of business logic. Does it worth to follow this principle in that case? For instance, client sends an API request (let's say "Add user to friends"), is it even possible to apply atomicity for these type of things?

Edit: Thanks everyone for taking time to explain it to me.

anon946 · on May 31, 2016

Atomicity can be important at any level. For example, assume that adding a friend involves two edges in a graph, one edge in each direction. Now assume that there is some other piece of code that does some analysis or processing of friend relationships. This piece of code might rely on there always being two edges, and crash or give erroneous results if not.

Thus, the adding of the two edges must be atomic (when observed from the rest of the system).

(The example is a bit contrived, but hopefully gets the idea across.)

toast0 · on May 30, 2016

Yes, depending on how you store the relationships, its possible that when a adds b to its list of friends, a could be added to b's list of 'friendofs'. If they're stored separately, the edit may not be atomic. If it's stored in a single place (queryable from either direction), it's generally going to be atomic, unless you're doing something way outside the norm.

stormbrew · on May 30, 2016

Yes. To take a larger example action, creating a user could fail to be atomic if, say, it stored the username in a separate table from the user object, wrote the username first, then referenced it from the user object, but didn't roll back the username insert if the user object insert failed.

Likewise, for the seemingly simpler example if establishing a friend relationship, you may be tracking that relationship in both directions, in which case one could fail and the other succeed.

trungaczne · on May 30, 2016

The business logic of your web application should resolve around database calls. Popular databases should already guarantee these atomicity properties for you through transactions.

gnaritas · on May 31, 2016

> The business logic of your web application should resolve around database calls

Might, not should. All web apps are not just front ends to a single database where transactions are useful and once you leave the realm of a single database into a more distributed type system then transactions are no longer an option.

toast0 · on May 30, 2016

Database guarantees are great until your data doesn't fit in a single database; it's good to examine what your database is providing for you, and to contemplate the costs for providing it in the database level.

amelius · on May 31, 2016

Yes. But in this case, the concept you are probably looking for is "transactional". Transactions are atomic, but the difference is that they can fail (the state is then rolled back), and they can be retried at a later point.

kalleboo · on May 31, 2016

Here are some real-world examples of what can go wrong if your API requests aren't atomic http://josipfranjkovic.blogspot.jp/2015/04/race-conditions-o...

lomnakkus · on June 2, 2016

(A bit late, but hoping that you see this:)

If possible you should probably go for an even stronger property, namely Idempotence[1]. (This can be relatively easy if you can force clients to provide some sort of unique token for every operation.)

It's usually makes this even easier to reason about for clients since they can just retry anything while knowing that it doesn't matter if they retry an already "applied" operation.

[1] https://en.wikipedia.org/wiki/Idempotence

agumonkey · on May 31, 2016

Reminds me of graphics double buffering. Back in the days games would write directly in the video buffer, while the graphic chip would scan that same buffer and push the content on screen at the same time.

If your code is too slow (a complex effect, too many character at that point), you might not be done writing a full frame when the graphic chips starts to output the pixels.

This means your TV is now showing partly old and new state. Nothing important most of the time, it's only games, it's only a few ms of absurd information, people's brain can compensate. It is ugly to see though. You have that weird 'line' somewhere below.

Since people changed the structure a bit, with two (or more) buffers, the program computes the new image in one buffer B, while the chip shows another buffer A. When you are done with a picture, the chip will now scan B, while you can write in A. This means the output never shows partial frame anymore.

pif · on May 31, 2016

Translating in English from Italian Wikipedia: From ancient Greek ἄτομος - àtomos - [indivisible], made of ἄ - a - [Privative alpha] + τέμνειν - témnein - [cut].

Personally, I struggled long time before fully understanding its use in IT, because I learned programming after sub-nuclear physics, thus I had a hard time conciliating the huge atom (a million of billions of times bigger than a nucleus) with the concept of "cannot be split" :-)

simula67 · on May 31, 2016

It is helpful to understand what problem it solves.

Lets say that we have a banking application that consists of a program which updates someones bank account by $Y every time it is called. Y is the command line parameter. The program's algorithm is like this :

1. Read the current balance amount to X

2. Add Y to X and store it in Z

3. Write Z to the database.

This program cannot be called by multiple processes at the same time. Lets say that it is payday, the account holder holds two jobs and each employer is trying to deposit $10 into someone's account, at the same time. Both these processes call the program with Y = $10. What happens ?

1. Process 1 reads the current balance ( $100 ) to X

2. Now, process 2 reads the current balance ( $100 ) to X

3. Process 1 adds 10 to X ( Z = 110 )

4. Process 2 adds 10 to X ( Z = 110 )

5. Process 1 writes the updated value to the database ( Z = 110 )

6. Process 2 writes the updated value to the database ( Z = 110 )

Now the account reflects a balance of $110, when it should have reflected $120. What we need is a guarantee from the system that some actions will not be parallelized ( i.e, they will be atomic ). From TFA it is given that "mkdir" is an atomic operation in UNIX ( i.e, only one process can create a directory at the same time ). You can write the program with the following logic

1. mkdir /tmp/lock_dir

2. If above step was unsuccessful sleep 10 seconds and go back to step 1

3. Read current account balance to X

4. Add Y to X and store it in Z

5. Write Z to database

6. Remove /tmp/lock_dir

Multiple processes can invoke this program simultaneously.

LukeShu · on May 30, 2016

It means that the operation can't be divided any smaller--that it is impossible to catch it only part-way completed; it either hasn't happened yet, or has completely happened.

_pmf_ · on May 31, 2016

> I'd say it's "the operation that does not have any side effects when performing its unit of work". Is my understanding even close to what atomic operation really is?

No. It can has as many side effects as it wants. Atomicity means: when going from state 1 to state 2, no matter how complex the transition, there are no externally observable intermediate states.

woodman · on May 30, 2016

As the others have said, indivisible operations. This is important in the context of race conditions, imagine two threads incrementing a counter with no locks. Using non-atomic ops to read, increment and then store the number will lead to a bad time. Or more on topic, checking if a file exists and then trying to open it - lots of bad things can happen.

estrabd · on May 31, 2016

Why one would like to have an atomic operation is easier to understand.

For example, one can use the atomic nature of creating a symbolic link on nix to create a lock file to prevent a race condition in a forking shell script. Say you have two or more processes wanting to do something that can (or should) only be done by one process at a time; one naive solution is to manage access of each process to said action by using a lock file. However, writing or touching a file itself is not atomic.

The answer is to throw a symbolic link into the mix. In this scenario, the lock file already exists. However, the lock is not the file itself, but a symbolic link to the file. The protocol for each process to follow is:

1. try to create a symbolic link to lock file (any file really)

2. if successful, proceed; if failed, wait (or exit)

3. when process is done, delete symbolic link to lock file

Simply checking for the existence of the symlink is not sufficient since there is a period of time between checking for the symlink (or file) and proceeding with said action where another process can think it has the lock.

The OS ensures that one and only one symlink (of the same name) can exist; attempts to create it again (even simultaneously) will result in a failure of one process to create the symlink. There is one winner; all others are losers. This is to say, the kernel ensures that the operation is atomic. As a result, the OS is now arbitrating what process can proceed to action, at the very lowest level. Another way to think about it is that it provides a way to make competing processes serialize - or get in line so that they may complete their action one at a time.

In my experience, it is important to experiment and test to make sure that the atomic primitive you're using is actually working as expected. I've run up against some inconsistent implementations of symlink creation that make this action not as straightforward to use as one is lead to believe.

GauntletWizard · on May 30, 2016

Atomic means "Does precisely what is is asked to do, or does nothing". This is the core of synchronization primitives, preventing two tasks from doing the same work or trying to access the same resource at once. open(O_CREAT| O_EXCL) for example, open a new file for you, or report an error. Regardless of who else is attempting operations at that moment, at most one of those calls will succeed (they can all fail, for a number of reasons). This allows you to, say, create a lockfile (often named filename.lock or .filename.lock) that serializes access to a shared resource, like a simple database or a printer control port or network connection. This is core to multithreaded programming (At least, imperative multithreaded programming; Functional languages tend to abstract this away by having little-or-no shared state)

tlrobinson · on May 31, 2016

It's more about multiple operations taking effect "instantaneously" from the perspective of some observer, and sometimes it also implies that either all or none of the operations take effect, but never just some of the operations.

syngrog66 · on May 31, 2016

I googled for "atomic operation" and this was top result:

https://en.wikipedia.org/wiki/Linearizability

ridiculous_fish · on May 31, 2016

Here's one that's not super well-known: writes to a pipe are atomic as long as the write size is <= PIPE_BUF, which is at least 512 bytes (on Linux it's a full 4k). So pipes can be used as a naïve message queue: ensure each message is under the limit in size, and they will not be split.

Anyone know if there's a similar guarantee for files?

bluetomcat · on May 31, 2016

The PIPE_BUF POSIX requirement is about guaranteeing that when multiple processes write less than PIPE_BUF to the same pipe, the reader will not see its input intermingled from different processes. It is often wrongly interpreted as "a single write() of less than PIPE_BUF will be retrieved via a single read() on the other side".

colmmacc · on May 31, 2016

They won't be split, bit one thing to watch out for is that they can be coalesced.

brobinson · on May 31, 2016

I read a while back that this PIPE_BUF restriction also applies to multiple writers appending a file.

adrianratnapala · on May 31, 2016

I think Linux goes out of its ways to make writes <= 512 bytes atomic. Helpful for writing log files. But I don't think this is a true guarantee in any standardised sense.

the8472 · on May 31, 2016

Opening with O_APPEND you get atomic appends if they're <= PIPE_BUF.

And the list already mentions atomic memory operations. Those also apply to memory-mapped files.

rwmj · on May 31, 2016

I always thought it would be a good idea for system calls to support transactions. Probably in a limited way because implementing general transactions would require massive changes to the kernel. But it would be nice to be able to do [error checking omitted]:

    begin ();
    fp = fopen ("file", "w");
    fputs (content, fp);
    fclose (fp);
    commit ();

It could solve the whole thing with ending up with zero-length files because you didn't use the right incantation to update a file atomically on ext4 (https://thunk.org/tytso/blog/2009/03/12/delayed-allocation-a...).

In Unix v7 mkdir was not a system call. It was a setuid program implemented using mknod + link. That was racy so the mkdir(2) system call was added. But it could have been solved more generally (and more elegantly) by adding transactions.

tobias3 · on May 31, 2016

Windows has it, but not many people seem to use it: https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...

krylon · on May 31, 2016

Beware, it appears to be deprecated:

"Microsoft strongly recommends developers utilize alternative means to achieve your application’s needs. Many scenarios that TxF was developed for can be achieved through simpler and more readily available techniques. Furthermore, TxF may not be available in future versions of Microsoft Windows."

SolarNet · on May 31, 2016

Windows has this.

kevin_nisbet · on May 31, 2016

Ah the good ol day's, this reminded me of some file transfer problems we used to get.

We had multiple systems that generated usage records, and stored them to flat file (think stuff that would end up on a bill). Because FTP the file was a thing, some other system would come in any copy the file, but every once in awhile there would be a partial file copied that would be missing records. Yep, it was the good ol process was still writing to the file when the collector decided to pick it up.

The first system I had control over, I made sure the vendor wrote to a temporary directory, and then hard linked to the transfer directory when the file operation was complete, knowing it avoided the race condition. I'm pretty sure I had one of the few platforms that handled this correctly, from what I remember we had corrupt files from almost all the platforms we bought.

Anyways, just because the system cost a million dollars doesn't mean it's any good.

zAy0LfpBZLC8mAC · on May 31, 2016

It should be noted that those filesystem operations are only atomic with respect to an observer running on the same operating system incarnation. Whether they are also atomic across power loss depends on the filesystem (though with a modern journaling filesystem, that generally should be the case).

gpderetta · on May 31, 2016

That's what people would expect. But there was some drama around ext4, renames and fsync a few years ago.

zAy0LfpBZLC8mAC · on May 31, 2016

You mean around truncation and rewrite?

gpderetta · on May 31, 2016

It is not necessarily about truncation. IIRC the problem was that rename doesn't (didn't?) act as a full barrier on ext4 and the metadata write that updates the name from the old file to the new file can be committed to disk before the updates to the new file. This means that after a crash the new name might point to a corrupted file.

The barrier behavior wasn't explicitly mandated by POSIX, but it is an intuitive release consistency-like model which was implicitly expected by most programmers.

edit: spurious words and parens.

throwaway76543 · on May 31, 2016

msync() with MS_INVALIDATE doesn't belong on this list. It has nothing to do with atomic memory access. msync() is used when flushing a mapped file to durable storage. I very often see this mistake of conflating flushing caches with atomic access of memory. What's committed to durable storage has nothing to do with what multiple processes will see when mapping a file.

All that's needed is the initial mmap to share a memory segment, then to use atomic operations like CMPXCHG -- the x86 building block the later mentioned gcc atomic macros leverage.

gpderetta · on May 31, 2016

Right. Msync is an ordering barrier. Barriers go hand in hand with atomic operations, but they are really about visibility, not atomicity.

prodigal_erik · on May 31, 2016

I haven't tested but I would expect MS_INVALIDATE on a large buffer to be much faster than filling it a word at a time with __sync_val_compare_and_swap (each causing its own bus transaction).

gpderetta · on May 31, 2016

MS_INVALIDATE is likely a no-op on any modernish Unix, including Linux. It is there to accommodate old systems with non-coherent mapped files and page caches or even multiple mappings of the same file.

throwaway76543 · on May 31, 2016

msync() doesn't fill buffers at all. It has no function in the operation you've described.

Even if it did, it's not atomic ...

dmm · on May 30, 2016

rename(2) was not atomic on OS X for years. It was finally fixed in Lion: http://www.weirdnet.nl/apple/rename.html

LukeShu · on May 30, 2016

It should also be said that `mv`(1) is only atomic if the source and destination are on the same filesystem.

mancerayder · on May 31, 2016

So that implies that realistically there's a scenario where moving files or directories from one filesystem to another and some interruption occurs can lead to lost data?

Dylan16807 · on May 31, 2016

It's easy to design an algorithm that either succeeds with a move or fails with a copy, never losing data, but someone would have to look deeper into what was actually guaranteed and/or implemented.

Sanddancer · on May 31, 2016

Very implausible. Mv across filesystems is usually implemented as a copy then delete. Worst case, you'll have the data exist in both locations.

nshepperd · on May 31, 2016

It's possible, in case of physical disconnection or power interruption. There is no guarantee that the copied data will be flushed out of buffers into nonvolatile storage before the delete is (unless the program asks for such a flush and the system honors it).

netdog · on May 31, 2016

The GCC Atomic Builtins mentioned in the article are not specific to Unix. They are compiler constructs, and depend on specific architecture hardware support. All x86 CPUs have such support for some years now. So these atomic operations can also be used in non-Unix software running on x86 CPUs.

The GCC documentation lists other non-intel architectures which also have the features required to support the atomic built-ins.

comex · on May 31, 2016

Also, if you can depend on recent compilers you should probably be using the standard C <stdatomic.h> or C++ <atomic> instead.

DonaldFisk · on May 31, 2016

All system calls in ITS behaved as if they were atomic, using a mechanism called PCLSRing: http://fare.tunes.org/tmp/emergent/pclsr.htm

Other systems which took a similar approach were WAITS, the Fluke Kernel, and EROS and its successors.

estrabd · on May 31, 2016

Always a nice reminder.