
Unfortunately I just spotted that I'm totally wrong: the huge pages are apparently only used by jemalloc. I misread the outputs because the conclusion seemed so obvious. So, on the contrary, it appears that the high latency is due to the huge pages for some unknown reason, and it is malloc that, while NOT using huge pages, goes much faster. I have no idea what is happening here yet, so please disregard the blog post's conclusions (all the rest is hopefully correct).

EDIT: Oh wait... since the problem is huge pages, this is MUCH better, since we can disable it. And I just verified that it works:

echo never > /sys/kernel/mm/transparent_hugepage/enabled

This means a free latency upgrade during fork for all the Redis users out there (more or less 20x in my tests), just with the line above.
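Before and after running that echo, it is worth checking which THP mode the kernel reports as active. A minimal sketch (the helper name `current_thp_mode` is mine, not from the thread); the kernel marks the active mode with brackets in the sysfs file:

```python
def current_thp_mode(enabled_line: str) -> str:
    # Parse a line like "always madvise [never]" from
    # /sys/kernel/mm/transparent_hugepage/enabled; the bracketed
    # entry is the mode currently in effect.
    return enabled_line[enabled_line.index("[") + 1 : enabled_line.index("]")]


# On a live Linux box you would feed it the sysfs file:
#   with open("/sys/kernel/mm/transparent_hugepage/enabled") as f:
#       print(current_thp_mode(f.read()))
print(current_thp_mode("always madvise [never]"))  # never
```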



This is a recommended setting for Oracle [1] [2] and Couchbase [3] as well.

The doc recommends adding "transparent_hugepage=never" to the kernel boot line in the "/etc/grub.conf" file.

[1] http://oracle-base.com/articles/linux/configuring-huge-pages...

[2] https://support.oracle.com/epmos/faces/DocContentDisplay?id=...

[3] http://blog.couchbase.com/often-overlooked-linux-os-tweaks


Yes, it was never reported by Redis users so far, maybe because in Redis it does not have the same impact it has in other databases, but shows up as a more subtle latency issue instead.

EDIT: Actually, in the Couchbase case they talk about "page allocation delays", which looks potentially related. I started the investigation mainly because the Stripe graph looked suspicious for a modern EC2 instance type with good fork times, so there had to be something more (unless the test was performed with tens of GB of data).


Replying to myself with an update. The hypothesis a few users proposed here is correct, even if it was hard to believe: the latency spike is due to the fact that, even with the benchmark's 50 clients, it is possible to touch all the huge pages composing the process within a single event loop cycle. This is why I observed an initial spike and nothing more. It seemed unrealistic to me with 50 clients, but then I remembered that one of the Redis optimizations is to serve multiple queued requests from the same client in the same event loop cycle. So this is what happens: 50 clients x N queued requests = enough requests to touch all the memory pages, or at least a large enough percentage to block for a long time once, and never again.
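A back-of-the-envelope sketch of that multiplication (the pipeline depth is an assumed number, not measured in the thread):

```python
# How much memory can one event loop cycle dirty, worst case, if every
# write lands in a distinct 2 MB transparent huge page?
clients = 50            # redis-benchmark clients
queued_per_client = 64  # ASSUMED requests served per client per cycle
huge_page_kb = 2048     # THP size on x86-64

writes_per_cycle = clients * queued_per_client
worst_case_cow_mb = writes_per_cycle * huge_page_kb // 1024

print(writes_per_cycle)    # 3200 random-key writes in one cycle
print(worst_case_cow_mb)   # up to 6400 MB of COW copying, worst case
```

With numbers like these it is easy to see how a single cycle can fault in most of the dataset at once.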


Have you tried disabling only defrag? The performance issues I'm familiar with are generally related to defrag running at allocation time. I would be curious to see your results while enabling THP but:

echo never > /sys/kernel/mm/transparent_hugepage/defrag


Awesome hint, I'll try it. However, if huge pages mean a 2MB copy-on-write for each fault, you still want them turned off for Redis. It also does not help with jemalloc memory reclaiming.


That makes sense. Since jemalloc is using huge pages, even if only one byte inside a huge page changes, the kernel needs to copy an entire 2MB per modified entry (worst-case locality), whereas with a non-hugepage allocation it only copies 4KB. That's why you saw MOVs in the stack trace: the kernel was busy copying entire huge pages. With small pages, it can be more granular about its copying.
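The amplification factor here is easy to work out; a one-line sketch of the worst case being described:

```python
# COW amplification: one dirtied byte forces the kernel to copy the
# whole page backing it, so a THP fault copies 512x more data than a
# base-page fault.
HUGE_PAGE = 2 * 1024 * 1024  # 2 MB transparent huge page
BASE_PAGE = 4 * 1024         # 4 KB base page
print(HUGE_PAGE // BASE_PAGE)  # 512
```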


I don't think that's the reason, actually: there is one big spike and then all the other COWs are fine, and copying 2MB is nowhere near the 300 ms I was observing. Not sure if for some reason most pages were copied, or if it was some other issue entirely. Could it be the code that fragments the huge page into multiple pages, maybe? MOV -> fault -> split of page -> COW of a single 4k page.


You're probably right, but a quick way to test this theory is to check the amount of private (non-COW) memory consumed by the forked child with and without THP. This should tell you how much data is actually being copied vs shared. Not sure how you get this number from Linux, though.


Exactly! Doing this. Fortunately there are ways to do it (Redis does this already, in order to report the amount of copy-on-write performed during a save; you may want to check the code if you are curious, it is pretty trivial, and it is possible thanks to the /proc/<pid>/smaps info).
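For anyone who wants to reproduce the measurement, a small sketch of the smaps-based approach mentioned above (the helper name is mine; the per-mapping `Private_Dirty:` fields are what you sum):

```python
def private_dirty_kb(smaps_text: str) -> int:
    # Sum Private_Dirty over all mappings in /proc/<pid>/smaps: roughly
    # the amount of memory the forked child has actually copied (COW'd),
    # as opposed to pages still shared with the parent.
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("Private_Dirty:"):
            total += int(line.split()[1])  # value is in kB
    return total


# On a live box, for a child with PID pid:
#   with open(f"/proc/{pid}/smaps") as f:
#       print(private_dirty_kb(f.read()), "kB copied")
```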

You were right! Please check: https://news.ycombinator.com/item?id=8553666


Disabling transparent hugepages has been common advice in the Postgres world for a while, and for the same reason. I suspect anything that forks a lot should have them off.


MongoDB, too, which also requests disabling NUMA since MongoDB has no idea what NUMA is.


Apparently, similar issues (+ fixes) also apply for Hadoop workloads: http://structureddata.org/2012/06/18/linux-6-transparent-hug...


Could you confirm that the MOV is copying PTEs? I suspect it may be copying the heap, because of copy-on-write: after fork(), the parent writes to the heap randomly, so it will copy more data with jemalloc than with libc (because of huge pages).


After the latency spike does the smaps output still list the same amount of AnonHugePages?


Can't test it right now, but that seems like a smart way to check whether they got split into smaller pages.


I just tested: while running the redis-benchmark command you listed after a BGSAVE, smaps lists all anonymous memory as AnonHugePages. So it ends up as THP, although I suppose it's possible that it gets broken into 4k pages and then adjacent 4k pages are merged back into huge pages. Will see if I can figure out how the lazy PTE copying works.
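The check described here can be scripted the same way as the Private_Dirty one; a sketch (helper name is mine) that totals the `AnonHugePages:` fields, so you can compare the figure before and after the latency spike:

```python
def anon_huge_kb(smaps_text: str) -> int:
    # Total anonymous memory still backed by transparent huge pages in a
    # /proc/<pid>/smaps dump; if the kernel split the THPs into 4k pages
    # after the fault, this number should drop.
    return sum(
        int(line.split()[1])  # value is in kB
        for line in smaps_text.splitlines()
        if line.startswith("AnonHugePages:")
    )
```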


As a follow-up: I don't see large latency with huge pages, but I have kernel samepage merging disabled.

I would be interested if this also decreases latency for you:

echo 2 > /sys/kernel/mm/ksm/run

Also check stats here: /sys/kernel/mm/ksm/*

And some documentation here: https://www.kernel.org/doc/Documentation/vm/ksm.txt


Strange... you should definitely see it if you are using jemalloc, now that it's clear it's about mass COW.


I do see it now... I wasn't testing right before. It goes away, as you say, when disabling THP. KSM has no effect, contrary to what I thought earlier.


Thanks a lot for independent confirmation of the issue.


Why do you fork?


See the "How it works" section under "Snapshotting" at http://redis.io/topics/persistence
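In short, the parent forks and the child serializes its copy-on-write view of the dataset while the parent keeps serving clients. A toy sketch of that idea (JSON stands in for the RDB format; `bgsave` is a made-up name, not the Redis code):

```python
import json
import os


def bgsave(dataset: dict, path: str) -> int:
    # Toy sketch of fork()-based snapshotting: the child gets a COW view
    # of the parent's memory frozen at fork time and writes it out, while
    # the parent can keep mutating its own copy.
    pid = os.fork()
    if pid == 0:  # child: dump the snapshot and exit
        with open(path, "w") as f:
            json.dump(dataset, f)
        os._exit(0)
    return pid  # parent: continue serving; reap the child later
```

The parent would later call `os.waitpid(pid, 0)` (or handle SIGCHLD) to reap the child.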


Ok. A couple of points - you can do this kind of persistence with an immutable data structure without forking.

Secondly, huge pages help to reduce memory latency for large data sets quite significantly.


you can do this kind of persistence with an immutable data structure without forking

How?

huge pages help to reduce memory latency

Yup, but this is about _transparent_ huge pages, which break databases all over the place: https://blogs.oracle.com/linux/entry/performance_issues_with... and http://dev.nuodb.com/techblog/linux-transparent-huge-pages-j... and http://scn.sap.com/people/markmumy/blog/2014/05/22/sap-iq-an... and http://www.percona.com/blog/2014/07/23/why-tokudb-hates-tran... and (a dozen more including Varnish, Mongo, Hadoop, ...)


If you have a pointer to an immutable data structure, you know it won't change, so you can write it out to disk at your leisure.
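A tiny sketch of what "at your leisure" means here (a dict copy stands in for a real persistent tree, which would share structure between versions instead of copying):

```python
# With immutable values, taking a snapshot is just retaining a reference:
# writers never modify the old version, they publish a new one.
store = {"k1": "v1"}
snapshot = store               # snapshotter pins this version
store = {**store, "k2": "v2"}  # writer publishes a new version

assert snapshot == {"k1": "v1"}  # old version untouched, safe to persist
```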


But the immutable data structure is not that different from the copy-on-write pages the kernel gives you, and is likely to involve the same amount of copying. And since disk I/O is blocking, you need at least a thread anyway, so forking is a natural strategy.



