I guess it depends on the complexity of your distributed system (assuming you’re operating one).
We prefer to have a job get OOM-killed and retried elsewhere (which could be a completely different machine), and we have plenty of infrastructure that makes this trivial. This infrastructure also deals with other types of failure, such as complete machine failure.
As mentioned above, paging introduces strange perf behaviour. That may not always be important, but if you’re working under tight latency requirements then paging can push you over that boundary.
That sounds strange when we’re happy to see jobs die entirely (that screws latency). But the issue with paging is you have no idea when it’s gonna hit you, and it may impact a job that’s behaving perfectly fine, simply because a badly behaving job caused its pages to be swapped out.
Ultimately, disabling paging is a really good tool for limiting the blast radius of bad behaviour and making cause and effect highly correlated (oh look, thing X just OOMed, which probably means thing X consumed too much memory, rather than thing Y having strange tail latency because thing X keeps consuming too much memory). It’s failing fast, but for memory rather than exceptions.
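To make the "OOM kill and retry" idea concrete, here's a rough sketch of what it looks like at its simplest. It assumes Linux, where the kernel OOM killer ends a process with SIGKILL and Python's `subprocess` reports that as a negative return code. The names `run_job` and `run_with_oom_retry` are made up for illustration. A real scheduler would retry on a different machine rather than in a local loop, and SIGKILL can come from things other than the OOM killer, so treat the check as a heuristic.

```python
import signal
import subprocess
from typing import Callable, Sequence


def run_job(cmd: Sequence[str]) -> int:
    """Run the command; a negative return code means it was killed by a signal."""
    return subprocess.run(cmd).returncode


def run_with_oom_retry(
    cmd: Sequence[str],
    max_attempts: int = 3,
    run: Callable[[Sequence[str]], int] = run_job,
) -> int:
    """Re-run cmd while it keeps dying from SIGKILL (the signal the OOM killer sends).

    Hypothetical helper: returns the last return code seen. Any other outcome,
    success or ordinary failure, is returned immediately without a retry.
    """
    rc = 0
    for _ in range(max_attempts):
        rc = run(cmd)
        if rc != -signal.SIGKILL:
            return rc
    return rc
```

The `run` parameter is only there so the retry logic can be exercised without actually exhausting memory.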
I get your point, and I think we agree here. I mainly wanted to argue against the original statement, which was quite broad: “high performance machines running in a production environments will absolutely have swap disabled”. Would you agree to rephrase both our arguments as:
- paging incurs seemingly random performance degradation of processes and should be avoided
- if you have some form of task queue/job distribution system which handles automatic re-runs, and can afford to restart a process from scratch at no business cost, then disabling swap allows fail-fast behaviour
- otherwise swapping can be used as a safety net for programs that would be better off slightly late than restarted from scratch
- both scenarios require sane monitoring of process behaviour, to catch symptomatic failures/restarts in the fail-fast case and recurring swap usage in the safety-net case
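For the monitoring point, a minimal sketch of catching recurring swap usage could look like the following. It assumes Linux and reads the `SwapTotal` and `SwapFree` fields from `/proc/meminfo`; the function names are hypothetical, and a real setup would more likely export this through whatever metrics agent you already run.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_kib(meminfo: Dict[str, int]) -> int:
    """Swap currently in use, in the kB units /proc/meminfo reports."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


def read_swap_used_kib(path: str = "/proc/meminfo") -> int:
    """Read the live value from the given path (Linux only)."""
    with open(path) as f:
        return swap_used_kib(parse_meminfo(f.read()))
```

Polling `read_swap_used_kib()` and alerting when it stays above zero across several samples is one way to tell "swap saved us once" apart from "this box lives in swap".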
> if you're working under tight latency requirements then paging can push you over that boundary
Even if you're not under tight requirements, swap can do strange things. I've actually seen situations where hitting swap, even trivially, can cause massive increases in latency.
I'm talking about jobs which took tens of milliseconds to complete now taking multiple tens of seconds.
I've even seen some absurdly bad memory management where Linux makes very poor choices about what to page out.
> Ultimately disabling paging is a really good tool for limiting the blast radius of bad behaviour