The Jan/Feb updates to Fedora 27 really broke VMs

My specific environment: the VMs I am having issues with are all CentOS 7 or Fedora 27 guests running on Fedora 27 host machines, and all are managed by virsh.

I now have a novel and very irritating problem on the servers I use as VM hosts. It is probably specific to Red Hat/CentOS/Fedora, which have no available setting to limit the amount of memory assigned to the IO cache, so the cache will use all available memory if it can.

The problem, which has appeared since I last upgraded the kernel, is that normal background disk IO being placed into the memory cache now seems to take priority over active running processes that want to use memory.

The visible symptoms are listed below, with a few quick checks sketched after the list:

  • KVM virtual machines are being swapped out to swap space on disk, to the point where they stop responding completely, even to ping
  • Processes running on the host machine are being killed by the operating system due to lack of memory
  • 50% of the real memory on the machine is allocated to disk IO cache
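
For anyone wanting to confirm the same symptoms, the checks are simple enough; this is a rough sketch of the commands rather than a capture from the affected host:

  # Overall memory use: the “buff/cache” and swap columns are the interesting ones
  free -h

  # Host processes killed for lack of memory show up as OOM-killer messages
  dmesg | grep -i "out of memory"
  journalctl -k | grep -i oom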

Excessive cache usage for IO is nothing new; I investigated it long ago and found there is no way of controlling the percentage of memory used for cache on Red Hat based systems, to the point that I had to include a daily “/usr/sbin/sysctl vm.drop_caches=3” command in cron to flush some of that cache.
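
For completeness, the value written to vm.drop_caches controls what gets flushed; this is standard kernel behaviour, not a Red Hat extra:

  # Write out dirty pages first, then drop the cache
  # (1 = page cache only, 2 = reclaimable slab such as dentries and inodes, 3 = both)
  sync
  /usr/sbin/sysctl vm.drop_caches=3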

However, until a few months ago I could run five VMs on my main system and they would run indefinitely. Now I can only run three VMs, and after two to three days one or more of them has moved far enough into swap space that it simply freezes; not even a “virsh reboot” command can recover it. On top of that there is the additional complication that the OS on the VM host starts killing off some running processes, which is also new.

I have altered my webserver instance to use 1GB of memory instead of the 2GB it was originally using, to see if the host can keep it responding for longer.
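
For reference, the change itself is just a virsh allocation tweak; a minimal sketch, assuming the guest is defined as a domain named “webserver” (the name here is only an example):

  # Lower the maximum and current allocation to 1GB (virsh sizes default to KiB)
  # --config makes it persistent and takes effect from the next guest boot
  virsh setmaxmem webserver 1048576 --config
  virsh setmem webserver 1048576 --config
  # or edit the <memory> and <currentMemory> elements directly
  virsh edit webserver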

Take the host with 8GB of memory: the entire host was rebooted yesterday and I have already had a VM freeze. It is now running three VMs assigned a total of 3.5GB of that memory, and they are actually using 3.5GB of real memory; two of them combined are also now using 100MB of swap (the third I have just had to restart a few minutes ago, so it hasn't swapped yet), while 4GB of memory is used by “buff/cache”. So running processes are being moved into swap space in preference to IO being flushed from cache. In an ideal world the kernel would flush cache to avoid swapping out running processes; Red Hat based systems do not do that however, or at least do not do it on demand when memory is needed. I assume they do a reclaim periodically, but that is no use at the time it is needed.

My test machine, which is a host with 32GB of memory and 23GB assigned to two VMs, seems better behaved in that “buff/cache” is using 3GB (I have often seen it over 10GB), but it is needing to use 4GB of swap. Yes, the qemu-system-x86 processes are using 3GB of swap combined between them as well as 25GB of real memory, and they have been running for 11 days. That should not happen.
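
The per-process swap figures come from the VmSwap line in /proc; a minimal sketch, assuming the guests show up as qemu-system-x86 processes as they do here:

  # Report the swap used by each qemu process (the kernel reports VmSwap in kB)
  for pid in $(pgrep -f qemu-system-x86); do
      awk '/^VmSwap/ {print FILENAME ": " $2 " " $3}' /proc/$pid/status
  done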

Anyway, the reason for this post is that VMs completely freezing is new behaviour which has only appeared during the last three months; and it annoys me.

I now have “/usr/sbin/sysctl vm.drop_caches=3” running every two hours from cron in the hope it will keep the VMs running longer, even though it does cause a major system slowdown while the cache is being flushed.
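
For anyone wanting the same workaround, a minimal sketch of the cron entry, assuming it is dropped into /etc/cron.d (a root crontab line without the user field would do the same job):

  # /etc/cron.d/drop-caches -- flush the IO cache every two hours
  0 */2 * * * root /usr/bin/sync && /usr/sbin/sysctl vm.drop_caches=3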
