Unfortunately I just spotted that I'm totally wrong: the huge pages are apparently only used by jemalloc. I misread the outputs because this conclusion seemed so obvious. So, on the contrary, it appears that the high latency is due to the huge pages, for some reason not yet clear, and it is libc malloc that, while NOT using huge pages, is going much faster. I've no idea what is happening here, so please disregard the blog post's conclusions (all the rest is hopefully correct).
EDIT: Oh wait... since the problem is huge pages, this is MUCH better, because we can disable them. And I just verified that it works:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
This means a free latency upgrade during fork for all the Redis users out there (more or less 20x in my tests), just with the line above.
Yes, never reported by Redis users so far; maybe because in Redis it does not have the same impact it has in other databases, showing up instead as a more subtle latency issue.
EDIT: Actually, in the CouchBase case they talk about "page allocation delays", which looks potentially related. I started the investigation mainly because the Stripe graph looked suspicious for a modern EC2 instance type with good fork times, so there was indeed something more (unless the test was performed with tens of GB of data).
Replying to myself with an update. The hypothesis of a few users here is correct, even if it was hard to believe. The latency spike is due to the fact that, even with the 50 clients of the benchmark, it is possible to touch all the huge pages composing the process in the space of a single event loop iteration. This is why I was observing the initial spike and nothing more. It seemed unrealistic to me with just 50 clients, but then I remembered that one of the Redis optimizations is to serve multiple queued requests from the same client in the same event loop cycle if there are many in queue. So this is what happens: 50 clients x N queued requests = enough requests to touch all the memory pages, or at least a significant enough percentage to block for a long time once and never again.
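A back-of-the-envelope sketch of why so few clients suffice (the 4 GB dataset size and 16 queued requests per client are assumptions for illustration, not measured values):

```shell
# Hypothetical numbers: 4 GB resident set, 16 queued requests per client.
DATASET_BYTES=$((4 * 1024 * 1024 * 1024))
HUGE_PAGE=$((2 * 1024 * 1024))
CLIENTS=50
QUEUED_PER_CLIENT=16

PAGES=$((DATASET_BYTES / HUGE_PAGE))           # huge pages backing the heap
REQS_PER_LOOP=$((CLIENTS * QUEUED_PER_CLIENT)) # requests served per event loop cycle
# Event loop iterations needed if each request faults a distinct huge page:
LOOPS=$(( (PAGES + REQS_PER_LOOP - 1) / REQS_PER_LOOP ))

echo "huge pages: $PAGES, requests per loop: $REQS_PER_LOOP, loops: $LOOPS"
```

With these assumed numbers, a handful of event loop iterations is enough to fault essentially the whole heap, which matches a single big spike followed by silence.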
Have you tried disabling only defrag? The performance issues I'm familiar with are generally related to defrag running at allocation time. I would be curious to see your results with THP enabled but with:
echo never > /sys/kernel/mm/transparent_hugepage/defrag
Awesome hint, I'll try it. However, if huge pages mean a 2MB copy-on-write for each fault, you still want them turned off for Redis. It also does not help with jemalloc memory reclaiming.
That makes sense. Since jemalloc is using huge pages, even if only one byte inside a huge page changes, the kernel needs to copy an entire 2MB per modified entry (worst-case locality), whereas with a non-hugepage allocation it will only copy 4KB. That's why you saw MOVs in the stack trace: the kernel was busy copying over the entire huge page. With small pages, it can be more granular about its copying.
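The worst-case copy amplification is easy to put in numbers; a purely illustrative calculation:

```shell
# Worst case: one modified byte forces the kernel to copy the whole page.
SMALL_PAGE=4096
HUGE_PAGE=$((2 * 1024 * 1024))

# How many times more data a 1-byte write copies under THP:
AMPLIFICATION=$((HUGE_PAGE / SMALL_PAGE))
echo "a 1-byte write copies ${SMALL_PAGE} bytes (4k page) vs ${HUGE_PAGE} bytes (2M page)"
echo "amplification factor: ${AMPLIFICATION}x"
```

So in the worst case every stray write during the BGSAVE pays a 512x copy penalty under THP.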
I don't think that's the reason, actually: there is a big spike and then all the other COWs are fine, and copying 2MB is nowhere near the 300 ms I was observing. Not sure if for some reason most pages were copied, or if it was some other issue entirely. Could it maybe be the code that fragments the huge page into multiple pages? MOV -> fault -> split of page -> COW of a single 4k page.
You're probably right, but a quick way to test this theory is to check the amount of private (non-COW) memory being consumed by the child fork with and without THP. This should tell you how much data is actually being copied vs shared. Not sure how you get this number from Linux though.
Exactly! Doing this. Fortunately there are ways to do it (Redis does this already, in order to report the amount of copy-on-write performed during save; you may want to check the code if you are curious, it is pretty trivial, and it is possible thanks to the /proc/<pid>/smaps info).
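A minimal sketch of that measurement, assuming the standard Linux smaps layout: summing the Private_Dirty fields gives the memory that copy-on-write has actually materialized in a process.

```shell
# Sum the Private_Dirty counters (in kB) from an smaps stream; this is
# roughly the memory that COW has actually copied for the process.
cow_kb() {
    awk '/^Private_Dirty:/ { sum += $2 } END { print sum + 0 }'
}

# Usage against a live process (e.g. the forked Redis child):
#   cow_kb < /proc/<pid>/smaps
```

Comparing this number for the child with THP on and off should show the amplification directly.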
Disabling transparent huge pages has been common advice in the Postgres world for a while, and for the same reason. I suspect anything that forks a lot should have them off.
Could you confirm that the MOV is copying PTEs? I suspect it may instead be copying the heap, because of copy-on-write: after fork(), the parent writes to the heap randomly, so it will copy more data with jemalloc than with libc (because of huge pages).
I just tested: while running the redis-benchmark command you listed after a BGSAVE, smaps lists all anonymous memory as AnonHugePages. So it ends up as THP, although I suppose it's possible that it is broken into 4k pages and then adjacent 4k pages are merged back into huge pages. I will see if I can figure out how the lazy PTE copying works.
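For reference, this is the kind of check involved (a sketch assuming the standard smaps format; the redis-server PID lookup in the usage comment is an assumption about the setup):

```shell
# Sum the AnonHugePages counters (in kB) from an smaps stream; a
# non-zero total means anonymous memory is actually backed by THP.
thp_kb() {
    awk '/^AnonHugePages:/ { sum += $2 } END { print sum + 0 }'
}

# Usage (assumes a running redis-server on Linux):
#   thp_kb < /proc/"$(pidof redis-server)"/smaps
```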
I do see it now... I wasn't testing correctly before. As you say, it goes away when disabling THP. KSM has no effect, contrary to what I thought earlier.
But the immutable data structure is not that much different from the copy-on-write pages the kernel gives you; it likely involves the same amount of copying. And disk I/O is blocking, so you need at least a thread anyway, which makes it a natural strategy.