
> It is a common mistake to try to use files for locking, for example, instead of using the more robust flock(1).

Why is this a mistake? It is my understanding that, if all the locking you need is a simple mutex, creating a file with a well-defined name with O_CREAT | O_EXCL is atomic -- the file will either be created or not (in which case the call will fail with EEXIST), and no two processes can possibly both succeed at creating the file. This even works on NFS; it was apparently broken in the NFS client in Linux 2.6.5 and below, but it is supposed to work in NFS, and is generally the only reliable way of getting locks in NFS.

You don't get any better way to wait on the lock than re-trying to create the file, and you don't have any mechanism for dealing with clients that die while holding the lock (i.e., it's an aggressively CP system), but for what it does, it's supposed to work correctly and atomically.
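For shell scripts, one way to get the same O_CREAT | O_EXCL behaviour is the `noclobber` option (`set -C`), which makes a `>` redirection fail if the target already exists. A minimal sketch, assuming a POSIX-style shell such as bash or dash; the function names and the retry count are made up for illustration, and stale locks are not handled at all:

```shell
#!/bin/sh
# Hypothetical helpers; names and defaults are assumptions for illustration.

# Try once to take the lock. Under noclobber (set -C) the redirection
# fails if the file already exists, so only one caller can create it.
try_lock() {
    ( set -C; echo "$$" > "$1" ) 2>/dev/null
}

# Retry until the lock is acquired or the attempts run out.
# There is no better way to wait than trying again.
acquire_lock() {
    lockfile=$1
    attempts=${2:-10}
    while :; do
        if try_lock "$lockfile"; then
            return 0
        fi
        attempts=$((attempts - 1))
        if [ "$attempts" -le 0 ]; then
            return 1
        fi
        sleep 1
    done
}

# Release the lock. If the holder dies before this runs, the file stays behind.
release_lock() {
    rm -f "$1"
}
```

Usage would look like `acquire_lock /some/dir/my.lock && { do_work; release_lock /some/dir/my.lock; }`, where `do_work` stands in for whatever needs the mutex.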



The flock() method is preferable when you don't need to use NFS because, as you say, it'll automatically clean the lock up if the process holding it dies.

This gets rid of all the edge cases with stale locks in one fell swoop.

But as you point out, if you want to do this over NFS, for example, you should create a file, and then you need to deal with stale locks.

If you can avoid that at all, using flock() is generally better.


http://0pointer.de/blog/projects/locking.html claims that flock() is less reliable over NFS (returns true without actually locking anything on Linux < 2.6.12 and "BSD" - not sure which BSDs or whether that's still true).

And my instinct is that in a networked scenario, you're at least as worried about a machine dying as a process on the machine (i.e. a network partition). A flock()-based lock doesn't clean itself up if the client is unreachable, does it?


Yes, as I pointed out, you don't want this if you're doing NFS.

Personally, if I need multiple machines, I prefer processing things with something like a MySQL table and GET_LOCK() instead of NFS. It gives you flock()-like semantics: if a machine or client goes away, the GET_LOCK() is automatically freed, i.e. the lock survives only as long as the connection to the database survives.

The extra overhead of a database generally sucks way less than having to deal with stale locks.

For any NFS-based scenario you usually end up creating "task", "task.underway" and "task.done" files as locks, and re-enqueuing tasks if you have an "underway" file that's too old without a "done" file.
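A rough sketch of that marker-file scheme, assuming a shell with `noclobber` and a `find` that supports `-mmin` (GNU and BSD find do; it is not in POSIX). The helper names and the age threshold are invented for illustration:

```shell
#!/bin/sh
# Hypothetical helpers for the "task" / "task.underway" / "task.done" scheme.

# Claim a task by exclusively creating its "underway" marker.
# Fails if the task is already done or another worker claimed it first.
claim_task() {
    task=$1
    if [ -e "$task.done" ]; then
        return 1
    fi
    ( set -C; echo "$$" > "$task.underway" ) 2>/dev/null
}

# Mark a task as finished.
finish_task() {
    : > "$1.done"
}

# Print tasks whose "underway" marker is older than max_age_min minutes
# and which have no "done" file; these are candidates for re-enqueuing.
stale_tasks() {
    dir=$1
    max_age_min=$2
    find "$dir" -name '*.underway' -mmin "+$max_age_min" |
    while IFS= read -r marker; do
        task=${marker%.underway}
        [ -e "$task.done" ] || printf '%s\n' "$task"
    done
}
```

Re-enqueuing would then mean removing the stale "underway" marker so another worker can claim the task. The age threshold is a guess that the original worker is gone, which is exactly the stale-lock problem being described.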

You'd do the same with a MySQL table that you GET_LOCK() on, except you can safely re-enqueue "underway" tasks if you acquire the lock on them, since you know their consumers have gone away.


Technically, you're right that it's atomic to create a file. But creating a lock using a file can be deceptive and is a common pitfall in my experience. I have seen a lot of shell scripts take this form:

  if [ ! -f $FILE ] ; then
   touch $FILE
   # do something dangerous, assuming I have a lock
   rm $FILE
  fi

The problem here is, of course, that the check and the creation are two separate steps. I've checked that the file doesn't exist, but another process (even a concurrent execution of the same script) could create $FILE after my check and before my touch. Since touch succeeds whether or not the file already exists, both processes happily proceed, each thinking that no one else is executing simultaneously. In other words, if I ran two executions of this script at about the same time, they could both pass the check and both execute the supposedly "synchronized" block.

Of course, you don't have to use flock(1) to make this operation atomic. It just handles a lot of the extra work that I don't want to have to think about, even if I did set `noclobber` or something like that.
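For comparison, a minimal sketch of the flock(1) version, assuming the util-linux `flock` utility is installed. The `with_lock` helper, the choice of file descriptor 9 and the exit code 75 are arbitrary choices for illustration:

```shell
#!/bin/sh
# Hypothetical helper: run a command while holding an exclusive lock.
with_lock() {
    lockfile=$1
    shift
    (
        # -n: fail immediately instead of waiting if the lock is held.
        flock -n 9 || exit 75
        "$@"
    ) 9>"$lockfile"
    # The lock goes away when fd 9 is closed, including when the
    # process holding it dies, so there is no stale lock to clean up.
}
```

Something like `with_lock /some/dir/my.lock do_something_dangerous` then replaces the whole check-touch-rm dance; dropping `-n` makes it wait for the lock instead of giving up.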



