Unfortunately I just spotted that I'm totally wrong: the huge pages are apparently only used by jemalloc. I misread the outputs because this conclusion seemed so obvious. So, on the contrary, it appears that the high latency is due to the huge pages, for some reason not yet clear, and it is libc malloc that, while NOT using huge pages, is going much faster. I've no idea what is happening here, so please disregard the blog post's conclusions (all the rest is hopefully correct).
EDIT: Oh wait... since the problem is huge pages, this is MUCH better, because we can disable them. And I just verified that it works:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
This means a free latency upgrade during fork for all the Redis users out there (more or less 20x in my tests), just with the line above.
Yes, never reported by Redis users so far; maybe because in Redis it does not have the same impact it has in other databases, showing up instead as a more subtle latency issue.
EDIT: Actually, in the CouchBase case they talk about "page allocation delays", which looks potentially related. I started the investigation mainly because the Stripe graph looked suspicious for a modern EC2 instance type with good fork times, so there was indeed something more (unless the test was performed with tens of GB of data).
Replying to myself with an update. The hypothesis of a few users here is correct, even if it was hard to believe. The latency spike is due to the fact that, even with the 50 clients of the benchmark, it is possible to touch all the huge pages composing the process in the space of a single event loop iteration. This is why I was observing the initial spike and nothing more. It seemed unrealistic to me with just 50 clients, but then I remembered that one of the Redis optimizations is to serve multiple queued requests from the same client in the same event loop cycle if there are many in queue. So this is what happens: 50 clients x N queued requests = enough requests to touch all the memory pages, or at least a significant enough percentage to block for a long time once and never again.
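A back-of-the-envelope sketch of why so few clients suffice (the 4 GB dataset size and 16 queued requests per client are assumptions for illustration, not measured values):

```shell
# Hypothetical numbers: 4 GB resident set, 16 queued requests per client.
DATASET_BYTES=$((4 * 1024 * 1024 * 1024))
HUGE_PAGE=$((2 * 1024 * 1024))
CLIENTS=50
QUEUED_PER_CLIENT=16

PAGES=$((DATASET_BYTES / HUGE_PAGE))           # huge pages backing the heap
REQS_PER_LOOP=$((CLIENTS * QUEUED_PER_CLIENT)) # requests served per event loop cycle
# Event loop iterations needed if each request faults a distinct huge page:
LOOPS=$(( (PAGES + REQS_PER_LOOP - 1) / REQS_PER_LOOP ))

echo "huge pages: $PAGES, requests per loop: $REQS_PER_LOOP, loops: $LOOPS"
```

With these assumed numbers, a handful of event loop iterations is enough to fault essentially the whole heap, which matches a single big spike followed by silence.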
Have you tried disabling only defrag? The performance issues I'm familiar with are generally related to defrag running at allocation time. I would be curious to see your results with THP enabled but with:
echo never > /sys/kernel/mm/transparent_hugepage/defrag
Awesome hint, I'll try it. However, if huge pages mean a 2MB copy-on-write for each fault, you still want them turned off for Redis. It also does not help with jemalloc memory reclaiming.
That makes sense. Since jemalloc is using huge pages, even if only one byte inside a huge page changes, the kernel needs to copy an entire 2MB per modified entry (worst-case locality), whereas with a non-hugepage allocation it will only copy 4KB. That's why you saw MOVs in the stack trace: the kernel was busy copying over the entire huge page. With small pages, it can be more granular about its copying.
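The worst-case copy amplification is easy to put in numbers; a purely illustrative calculation:

```shell
# Worst case: one modified byte forces the kernel to copy the whole page.
SMALL_PAGE=4096
HUGE_PAGE=$((2 * 1024 * 1024))

# How many times more data a 1-byte write copies under THP:
AMPLIFICATION=$((HUGE_PAGE / SMALL_PAGE))
echo "a 1-byte write copies ${SMALL_PAGE} bytes (4k page) vs ${HUGE_PAGE} bytes (2M page)"
echo "amplification factor: ${AMPLIFICATION}x"
```

So in the worst case every stray write during the BGSAVE pays a 512x copy penalty under THP.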
I don't think that's the reason, actually: there is a big spike and then all the other COWs are fine, and copying 2MB is nowhere near the 300 ms I was observing. Not sure if for some reason most pages were copied, or if it was some other issue entirely. Could it maybe be the code that fragments the huge page into multiple pages? MOV -> fault -> split of page -> COW of a single 4k page.
You're probably right, but a quick way to test this theory is to check the amount of private (non-COW) memory being consumed by the child fork with and without THP. This should tell you how much data is actually being copied vs shared. Not sure how you get this number from Linux though.
Exactly! Doing this. Fortunately there are ways to do it (Redis does this already, in order to report the amount of copy-on-write performed during save; you may want to check the code if you are curious, it is pretty trivial, and it is possible thanks to the /proc/<pid>/smaps info).
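A minimal sketch of that measurement, assuming the standard Linux smaps layout: summing the Private_Dirty fields gives the memory that copy-on-write has actually materialized in a process.

```shell
# Sum the Private_Dirty counters (in kB) from an smaps stream; this is
# roughly the memory that COW has actually copied for the process.
cow_kb() {
    awk '/^Private_Dirty:/ { sum += $2 } END { print sum + 0 }'
}

# Usage against a live process (e.g. the forked Redis child):
#   cow_kb < /proc/<pid>/smaps
```

Comparing this number for the child with THP on and off should show the amplification directly.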
Disabling transparent huge pages has been common advice in the Postgres world for a while, and for the same reason. I suspect anything that forks a lot should have them off.
Could you confirm that the MOV is copying PTEs? I suspect it may instead be copying the heap, because of copy-on-write: after fork(), the parent writes to the heap randomly, so it will copy more data with jemalloc than with libc (because of huge pages).
I just tested: while running the redis-benchmark command you listed after a BGSAVE, smaps lists all anonymous memory as AnonHugePages. So it ends up as THP, although I suppose it's possible that it is broken into 4k pages and then adjacent 4k pages are merged back into huge pages. I will see if I can figure out how the lazy PTE copying works.
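For reference, this is the kind of check involved (a sketch assuming the standard smaps format; the redis-server PID lookup in the usage comment is an assumption about the setup):

```shell
# Sum the AnonHugePages counters (in kB) from an smaps stream; a
# non-zero total means anonymous memory is actually backed by THP.
thp_kb() {
    awk '/^AnonHugePages:/ { sum += $2 } END { print sum + 0 }'
}

# Usage (assumes a running redis-server on Linux):
#   thp_kb < /proc/"$(pidof redis-server)"/smaps
```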
I do see it now... I wasn't testing correctly before. As you say, it goes away when disabling THP. KSM has no effect, contrary to what I thought earlier.
But the immutable data structure is not that much different from the copy-on-write pages the kernel gives you; it likely involves the same amount of copying. And disk I/O is blocking, so you need at least a thread anyway, which makes it a natural strategy.