
The OOM score isn’t strictly ordered by memory usage. The OOM score adjustment is usually the cause.
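For what it's worth, both numbers are visible in procfs, so you can see the actual ordering yourself. A rough Linux-only sketch (process names with spaces will misalign, but it's good enough for eyeballing):

```shell
# List each process's effective oom_score next to its oom_score_adj,
# with the highest scores (the likeliest OOM-kill victims) first.
for pid in /proc/[0-9]*/; do
  if [ -r "${pid}oom_score" ] && [ -r "${pid}comm" ]; then
    printf '%-16s %8s %8s\n' "$(cat "${pid}comm")" \
      "$(cat "${pid}oom_score")" "$(cat "${pid}oom_score_adj")"
  fi
done | sort -k2 -rn | head
```

Processes sitting at adjustment -1000 are exempt from the killer entirely, which is exactly the "never shoot me" behavior in that bug.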

Amusingly, this finally forced me to find bugs like this one:

https://bugzilla.redhat.com/show_bug.cgi?id=1071290

(All processes started under a remote shell get adjustment -1000, which effectively means "never shoot me".)

There are a few related to setting up the sshd adjustment itself as well.

So, looks like a config problem! Thanks for pointing this out.



You're right that it's not _strictly_ memory consumption and that other criteria and overrides exist, but memory consumption is highly weighted.

Regarding SSH: if you enable sshd debug logging you can see that sshd sets its own score to the minimum possible [0], which is why your comment about sshd being targeted still doesn't make sense to me. I actually didn't know it was sshd doing this on its own until I ran this:

  server ~ # grep oom_score_adj /usr/sbin/sshd
  grep: /usr/sbin/sshd: binary file matches
...which is fascinating and clever. That's when I checked the source code linked at "[0]". I now finally have an answer as to why I've seen dmesg memory stat dumps display different oom_score_adj values for sshd. I always thought _something_ was smart enough to know that we don't want to risk killing sshd, but I didn't know what that _something_ was. It turns out it was the daemon itself.

  Oct 13 23:20:38 server kernel: Mem-Info:
  ...
  Oct 13 23:20:38 server kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
  Oct 13 23:20:38 server kernel: [ 2455]    60  2455  1778653   273114     710      10        0             0 mysqld
  ...
  Oct 13 23:20:38 server kernel: [12085]   207 12085    22574      673      32       3        0             0 tlsmgr
  Oct 13 23:20:38 server kernel: [ 4238]     0  4238     9234      518      20       3        0         -1000 systemd-udevd
  Oct 13 23:20:38 server kernel: [12278]     0 12278    88107     5597     136       4        0             0 apache2
  Oct 13 23:20:38 server kernel: [17222]     0 17222  1258983   142035     505       8        0             0 qemu-system-x86
  ...
  Oct 13 23:20:38 server kernel: [21069]     0 21069     5033      487      14       4        0             0 bash
  Oct 13 23:20:38 server kernel: [15935]     0 15935     7081      487      16       3        0         -1000 sshd
  ...
In retrospect it makes a lot of sense, especially considering sshd runs as root -- it has every ability to do that. And nothing else would know the importance of sshd better than sshd itself.
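You can see the one-sided nature of that knob from any shell: a process may always volunteer itself as a better victim, but dropping down to -1000 the way sshd does requires root (CAP_SYS_RESOURCE). A quick sketch:

```shell
# Raising your own badness is always allowed, even unprivileged:
echo 500 > /proc/self/oom_score_adj
cat /proc/self/oom_score_adj   # children inherit the value, so this prints 500

# Lowering it below its starting value needs CAP_SYS_RESOURCE; as a
# non-root user this write fails with "Permission denied":
#   echo -1000 > /proc/self/oom_score_adj
```

Since children inherit the value across fork/exec, a wrapper script can deprioritize a whole subtree this way before launching a memory-hungry job.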

However, I still don't understand your comment about the Linux OOM killer wanting to kill sshd "first" (or _ever_, based on these renewed findings!). Can you elaborate?

[0] https://github.com/openssh/openssh-portable/blob/e51dc7fab61...


If you have a large amount of memory, are doing decent amounts of I/O, and Linux OOMs, the system becomes unresponsive for many minutes before killing any process. At that point, SSH sessions time out endlessly. A serial console stands a chance.

There are also cases where you're debugging AMI builds and need to fix GRUB or the init system without waiting 20 minutes for a new AMI build each time.

Also, the existing console log feature in AWS is insultingly not real time. It typically doesn't update at all unless you're within minutes of boot or trigger a reboot, and it only buffers something like 4 KB, so a reboot can easily replace the logs entirely. That really hurts when you're trying to capture debug console output, so this feature finally solves it.


Why would Linux trigger the OOM killer if you have a large amount of memory available? Or, what did you mean by "large amount of memory"?

Also, why would an SSH session, which is entirely in memory, time out because of I/O thrashing? You can disconnect the hard drive that sshd and/or the OS is running from and your SSH connections to that machine won't break. If you run commands that aren't cached in memory you'll naturally get critical I/O errors, but that won't cause a disconnect at the SSH layer.


By large amount of memory, I meant systems that have large amounts of memory that is nearly full. For example, a server with 256GB of memory and 255.9GB in use.

SSH is purely in memory; however, in order to allocate memory for it, Linux will pull "free" memory out of whatever heavily fragmented corners it can find it in. It may even need to perform disk I/O to free memory that was tied up in various disk caches.

People refer to this as a "livelock", where Linux is going crazy doing lots of stuff but from userspace the system is completely frozen.
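If you can still get a shell, one coarse way to see that thrashing from userspace is the direct-reclaim counters in /proc/vmstat, which climb when allocations are forced to reclaim memory themselves (counter names vary a bit across kernel versions, hence the prefix match):

```shell
# Direct-reclaim activity: pages scanned/stolen by allocating tasks
# themselves, plus allocation-stall counts. Rapidly growing deltas while
# the box feels frozen are the livelock signature.
grep -E '^(pgscan_direct|pgsteal_direct|allocstall)' /proc/vmstat
```

Sample it a few seconds apart and diff the numbers; absolute values are cumulative since boot and mostly meaningless on their own.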

Facebook developed oomd, a userspace OOM killer, to deal with this issue; their release blog post references the 30-minute livelocks they faced: https://engineering.fb.com/2018/07/19/production-engineering...

They have actually gone so far as to submit kernel patches for the newer PSI (pressure stall information) interfaces, which oomd uses to better detect stalls due to this thrashing.
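The PSI interface is just plain text under /proc/pressure/ on 4.20+ kernels (when built with CONFIG_PSI and not booted with psi=0); each line reports the share of wall time tasks spent stalled waiting on memory:

```shell
# "some" = at least one task was stalled on memory during the window;
# "full" = all non-idle tasks were stalled (total productivity loss).
if [ -r /proc/pressure/memory ]; then
  cat /proc/pressure/memory
else
  echo "PSI not available on this kernel"
fi
```

The avg10/avg60/avg300 fields are stall percentages over 10/60/300-second windows; sustained "full" pressure is exactly the frozen-but-busy state described above, and it's what oomd keys off.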


> However I still don't understand your comment about the Linux OOM killer wanting to kill sshd "first" (or _ever_ based on these renewed findings!) Can you elaborate?

I suspect running low on memory can trigger symptoms that look like sshd failing.

sshd gets paged out (or something else you need for a successful login does). Paging it back in becomes incredibly slow, because there's lots of I/O going on from all the paging. Anything garbage-collected starts running GC constantly, using 100% CPU.

Then your attempt to SSH times out - and with no access to list running processes, one naturally concludes sshd has failed.


Yeah, maybe in addition to my survivorship bias from RHEL6-era images, I am mentally conflating samples from “we OOM killed sshd” and “we are swapping violently; we won’t get in via sshd”.

In fact, I would guess (especially given all this investigation!) that it’s much more likely that an inaccessible box is just under too much memory pressure for sshd to respond.

Amusingly, the answer is still the same: serial port! :).

Thanks again for all the pointers (to everyone in this thread).


That code is 12 years old: https://github.com/openssh/openssh-portable/commit/c8802aac2...

So maybe the parent post was remembering a time before then.



