Kazlib contains a few single-file (actually one .c file and one .h file) open source C libraries. For example, there's a dictionary data structure based on red-black trees, dict.c/dict.h, which I've found to be useful and reliable. There's also a linked list library, a hash table library, etc., but I haven't used these.
> The amalgamation is a single C code file, named "sqlite3.c", that contains all C code for the core SQLite library and the FTS3, FTS5, RTREE, DBSTAT, JSON1, and RBU extensions. This file contains about 184K lines of code (113K if you omit blank lines and comments) and is over 6.4 megabytes in size. Though the the various extensions are included in the "sqlite3.c" amalgamation file, they are disabled using #ifdef statements.
In other words, it's stretching the definition of "single file" a bit... but then again, I don't know any other SQL databases anywhere near as small as SQLite.
Smallest commercially viable SQL database is probably the latest k5/kdb (http://kparc.com/), which is implemented in something like 9KB of code (source: http://kparc.com/o.htm).
Yeah I think that's rather silly, both the lack of an explanation and the implied reason if we guessed it correctly (thanks for making us guess and have this hypothetical discussion).
Being single file is about simplified project management. How is the file size relevant?
Lock-free Single Producer, Single Consumer (SPSC) queue. The dependencies include C11 and that's it. No POSIX required, should work on any arch you can find a C11 compiler for.
I was just reading your code, which interests me since I've been interested in lock-free programming for a while but only recently started learning about the C11 and C++11 memory model and atomic operations.
I had some thoughts, which I'm offering in hopes of clarifying my understanding, and possibly helping to improve your software in the process.
I don't see any correctness problems in your code, but I do see what look like several opportunities for optimization. Though please take these with a slight grain of salt, as I am still learning the C11 atomics.
Your aring_give() ends with a release barrier and your aring_take() begins with an acquire barrier. That makes sense to me, as a way of ensuring the sequencing of the reads/writes to aring->rb and to item. However I don't see why aring_give() needs to begin with an acquire and aring_take() needs to end with a release. I think both of these could be changed to memory_order_relaxed with no change in correctness.
But I think we can go a step further actually. Your atomic_fetch_add_explicit() and atomic_fetch_sub_explicit() operations operate on a shared aring->items member. Both the reader and writer write to this variable, which requires expensive locked operations and which will degrade under contention. I don't think this is actually necessary.
Instead you could eliminate aring->items completely and simply compute it based on aring->head and aring->tail. Only the writer writes to head and only the reader writes to tail, so you could use just atomic_load_explicit()/atomic_store_explicit() with memory_order_relaxed on these variables to read and write them. Then calculate the number of items present by comparing them. This could make the overall queue significantly more efficient.
As you might have guessed, after a few conversations today (not very good ones, for the most part, I was tempted to give up the Internet.) I've been thinking about this code quite a bit in the last 12 hours.
I think you're right that there might be an opportunities for improvement. I admit that once the performance hit about 140 cycles per item, I mostly shrugged and called it a day. I felt a lot more concerned about its safety across various architectures than bringing that cycle count down to the absolute minimum. But I also did my benchmarking on x86_64, which has surprisingly strong memory guarantees by default, and so the difference here may be larger on other architectures.
Thanks for looking it over! I definitely am interested and hope to grab some time to try benchmark some tweaks (and test for correctness) on a bunch of different architectures.
Unfortunately I don't have all the state I'd need sitting at the top of my brain to revise this code at the moment, or even know whether your proposed changes are safe, since I find I can't write atomics code safely unless I devote at least an afternoon to the process of analyzing the problem's safety.
I remember there was a reason I chose using a shared item count instead of having the pointers be shared, I think mostly just to minimize the shared state to one thing, but it may not be the fastest choice.
If you want to give trying some changes a shot, I'd love to hear what you find! Email is in my profile. :)
Not the author, but anyway; I'm not sure you can release on one variable and acquire on an other while having ordering guarantee between the two.
Anyway, I agree that having both acq and rel on both reader on writer side seems weird. I guess you can come with a solution with only one rel for the writer and only one acq for the reader.
UPDATE: I'm stupid, actually not possible if you don't want to overflow.
Interesting, it was my impression that the release/acquire are not specific to a memory location. They are just barriers, aren't they? But as I said I'm just learning.
They are specific. See 5 in 5.1.2.4 Multi-threaded executions and data races in n1570.
In term of real hardware it translates well in MESI/MESI-like protocol on cache lines on ooo cores, without much more constraints (of course some arch are sufficiently weak to still require special instructions, but on the other hand x86 don't need anything for acq/rel). If the different cores never touch the same cache line, they don't need to do interact at all even when both execute unrelated acq/rel atomics.
Interesting. I bought C++ Concurrency in Action today and am learning all these details of the memory model.
You mention of caches makes me realize that a single-reader single-writer queue can probably also be optimized by putting the head pointer and the tail pointer on different cache lines. The reader and writer can cache the other's value on their own page, and only reload it when the queue otherwise looks full or empty, respectively. This should allow the reader and writer to act without needing to synchronize cache lines for many operations.
I think that works on some architectures but would not be guaranteed behavior. In theory on a system with no memory ordering constraints, the pointers could update before the cache lines for the underlying ring buffer. Which means that without an acquire barrier at the beginning of aring_take, which without an item count, would require writes to the shared head/tail pointers to sync anyway, you can't ensure the the data written to the ring buffer is visible to the thread, even though your pointers would indicate there's data there and so you may load either torn or completely different data into the consumer thread.
Which means that if your memory barriers are operating correctly, they're just causing two cache lines to sync for the metadata instead of one, if the pointers are split across two lines.
Whereas with an item count, only one line ever has to sync, even if the pointers are split across two, since the pointers aren't shared between threads.
In practice, the compiler might not be smart enough to realize this and might be enforcing order for all side effects before the barriers though, even if the other thread doesn't read them. So maybe this isn't a good approach.
If all you're doing is passing pointers from one thread to another, lock free is not a very good choice. Try a trivial spinlock based implementation. It will outperform it easily.
I strongly doubt you're going to build a faster thread-safe workqueue by adding spinlocks to it, but if you think what I wrote is too slow, feel free to build something faster and show the speed difference in a benchmark. You can use the code in test/ if you'd like.
The code I wrote does leave an optimization or two on the table. Apple's lockfree queue in GCD is faster than mine because of batching. It also has a ton more features, including being (IIRC) MPSC.
Locking is cheap, switching a pointer around is cheap. The only expensive thing is waking up a thread that is waiting on a lock. Spinlocks avoid that problem, which is why they outperform other implementations in terms of throughput (of course inversely proportional to the amount of work you do during a lock).
Atomic operations have a fixed cost. Spinlocks only really have a cost when contention is high. When you're just moving a pointer to another place, you're not likely to have that much contention to make atomic operations worth it.
I don't think you've got this quite right. Spinlocks require memory barriers. C11 atomics are, among other things, a standardized way to implement cross platform memory barriers without dropping into platform specific assembly.
The only way to test it out is with code. In my own tests spinlocks turned out to provide the highest throughput. I do not care about the problem enough anymore to go try out your implementation. If you implemented a lock free queue, you should be the one testing it out if it actually is better than a simple spinlock.
My code is on github with test harnesses for doing benchmarking on it. Yours is... some files on a hard drive that likely show something completely different than you think it does.
I'm sorry, but I don't have time to comment on this thread anymore.
If you know what you are doing, the C11/C++11 memory model (or another one with roughly the same features, like the linux kernel memory model) and how a computer actually works, I don't think you will find a single architecture on which an atomic implementation of Lock-free Single Producer, Single Consumer (SPSC) queue will not outperform a spinlock.
If you use overkill sequential consistency to do that, you might actually end up slower in some cases.
Given the amount of expertise you need to use atomics properly (far more difficult than sticking to critical sections), it might as well be preferable to stick with those at application level unless you really need extreme performance and know the applicable memory model and the corresponding target stack (maybe processor architecture, maybe C11/C++11 memory model if you use it, maybe additional compiler guarantee on that and related stuffs, maybe all at the same time).
Not that in the end an acquire / release pattern is so hard to understand, but in my experience programmers already can't even stick to properly using critical sections as in textbooks (not even talking about what happens when the textbook is a piece of shit, like when it presents double-checked locking without any regard to the model needed for it to work properly => people end up writing that in C++ on plain variables...).
Really, stick to the (good) textbooks and the simplest patterns if you don't know enough about memory models and how processors and compilers work in 2016... How fast you execute buggy code is not a very interesting property, sequential consistency is expensive in lots of cases, and using weaker atomic is incredibly difficult (each time I think I know enough about weak atomics I found something that explain how various model are subtly weaker than I previously thought, not even talking about the fact that the C11/C++11 model is subtly formally unsound and compilers actually not formally respect it... so much if you try to prove some algorithms)
It's a little thing for simple printf debugging. I got tired as hell of using formatting strings. It prints line, file, and function, expression and value. Since it stands out /greppable it's easier to delete then random printfs.
It has support for different colors(to make things stand out) and uses c99 _Generic to support different types with a single "function" and c initializers for named arguments. It supports basic c types and you can sort of extend it by modyfing the header(unfortunately I couldn't find anything cleaner using c).
C99 only (works on gcc>4.9 (due to use of designated initializers) and llvm/clang >3.3 (maybe earlier but I tested it with 3.3)
I was hoping to support msvc but it looks like Microsoft hasn't implemented _Generic yet.
Looks I'm wrong again, it's been a while since I wrote it it also uses typeof gcc/llvm extension, there might be a way to fix that but since msvc lacks _Generic I didn't bother.
I was also hoping to support c++ but c++ doesn't do designated initializers.
Uh, this stb library [1] seems to follow a horrible pattern: not only combining your code into a single .c file (which is fine, like the sqlite amalgamation), but putting the code into a single .h file. [2]
I've never seen a C or C++ codebase that does this. Maybe you can rely on link-time deduplication, but it will still cause duplicate compilation, and thus increased compile times. I haven't thought it about it lately, but I thought the common wisdom was not to depend on the linker to dedupe or strip unused symbols. Actually it should cause link-time errors, not just duplicated cause.
The justification is also somewhat ridiculous: Why single-file headers?
Windows doesn't have standard directories where libraries live. That makes deploying libraries in Windows a lot more painful than open source developers on Unix-derivates generally realize. (It also makes library dependencies a lot worse in Windows.)
Really? Why not just lay out your source normally, and then munge it with a simple script into something Windows can handle (similar to sqlite)? Because writing trivial scripts on Windows is also a pain?
On the contrary, this is the best way to make code reusable, because it is the only way that allows a C or C++ program to add more code solely via the compiler, without also requiring modification of external build scripts / tools / etc.
When you are shipping and maintaining code on 5 or 10 different platforms, this really matters, because the friction of adding new files becomes huge ... you have to go add that file to 5 or 10 different fuckity fuck build systems that are all uniquely terrible, and hey maybe Apple updated XCode to whatever the new lousy version is instead of the old lousy version, so you have to go through the rigamarole of installing that, which of course won't completely work, oh and the internet is slow today, and on some console platform that shall not be named our dev software didn't auto-renew its license and that is mysteriously timing out so now we get to deal with that for hours, blah blah blah.
This is not an exaggeration. You're lucky if you actually get to do any programming on the day you decide to add a cpp file.
Experienced programmers who ship on a lot of platforms really want the simplest and most straightforward way of using code, and this is what that is for C and C++.
More modern languages could be designed to be better at this, but they usually aren't. (A 'package manager' is not really the answer, it is a solution to a kind-of orthogonal problem and usually brings in way too many of its own complexities.)
In general, I find the stb libraries practical and technically excellent. I can't see how they could be improved by arbitratily splitting it into a header and c-file.
Best practices are best practises within a specific usage domain. Sometimes the best solution to particular problem constraints is not the one that follows an orthodoxy.
"it will still cause duplicate compilation"
No, there is nothing in general that would cause this. The implementation is written behind a specific preprocessor condition which is not supposed to be a project wide definition.
For example user can write one stb-*-impl.c file that includes the header after the library specific preprocessor definiton.
Note that the implementation code is conditionally compiled. By default you only compile the headers, so it's up to you instantiate the implementation somewhere by setting a preprocessor symbol before you include the library. This is described in the comment at the top of the file.
OK, I missed that... it makes it a slightly less crazy pattern, but still weird and offputting. If you really have problems adding one .h and one .c file to your project, I think the development environment is broken.
(Admittedly, I last used Visual Studio C++ in 2007... I installed it from a DVD, only to find it was broken, and I had to copy a .DLL from the Microsoft website to get it to work.)
C programmers certainly hate dependencies. So much that the most extreme class cannot even tolerate multi-file external dependencies. That is really extreme, but given the history of dealing (or coping) with dependencies such paranoia is rather justified.
> Why not just lay out your source normally, and then munge it with a simple script into something Windows can handle (similar to sqlite)? Because writing trivial scripts on Windows is also a pain?
Instead of making a big library with multiple files, they took an alternative to make a small well-defined library with a single file and small number of entry points.
And, separately, a small-single-file-but-correct XML parser: https://dev.yorhel.nl/yxml