Yet again a simple “yum update” totally destroyed my system. That was over a week ago. Since then, after trying a newer kernel and spending a little over 120 hours fsck’ing disks, I have everything back, including the website I am sure you all missed.
The issue was that the “yum update” replaced the stable kernel kernel-3.14.9-200.fc20.x86_64 with the newer kernel-3.15.5-200.fc20.x86_64; with the newer kernel there were disk I/O timeouts (on both internal SATA and external USB disks) plus timeouts reported against qemu-kvm guests. The machine would consistently crash within a few hours of booting.
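As an aside, if you want to compare your own machine, the running kernel versus the installed kernel packages is easy to check (standard commands, nothing Fedora-specific):

  uname -r        # the kernel you are currently booted on
  rpm -q kernel   # every kernel package rpm knows about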
After a day or so of trying to figure out what was going on I did another “yum update”, which pulled in kernel-3.15.6-200.fc20.x86_64. Yippee, I thought, they have fixed it (I assumed that was why a new kernel had come out so quickly).
I was wrong. The machine would still only stay up for a couple of hours, although the reason seemed to have changed slightly; maybe they just put in more debugging.
Anyway the crashes had totally corrupted the internal SATA boot disk and one of the USB-attached external disks, plus one of the virtual disk files used by a VM guest; a real mess.
It would have been nice if I could have rebooted with an automatic fsck of the boot disk, but… bugger me, the /.autofsck flag file and “shutdown -F” (-F to force an fsck) have been removed from Fedora 20. On searching the forums, this is apparently because journalling is supposed to be able to recover disks now; well, it doesn’t.
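For what it’s worth, the systemd replacement for /.autofsck appears to be a kernel command-line option rather than a flag file: at the grub2 menu press “e”, find the line starting with “linux”, and append

  fsck.mode=force fsck.repair=yes

fsck.mode=force makes systemd-fsck check the filesystems even when they are marked clean, and fsck.repair=yes lets it actually fix what it finds. I haven’t tried this on Fedora 20 itself (fsck.repair= may need a newer systemd than F20 shipped), so treat it as a pointer, not gospel.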
So I carried in one of the spare CRT screens littering my living room and plugged it into the previously headless server so I could see what I was doing, then booted off the Fedora 20 live DVD image (which I had handy only because I have needed to boot off it in the past to recover from updates). From the live session I could start sshd, remote in, and fsck the boot disk that way, which let me boot from that disk without errors again.
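If you ever need to do the same, the live-session steps are roughly the following (liveuser is the default account on the Fedora live image; the device name is only an example, use your real boot partition):

  sudo passwd liveuser        # give the live account a password so ssh logins work
  sudo systemctl start sshd   # start the ssh daemon
  ip addr                     # find the address to ssh to from another machine

  # then, from the remote machine:
  ssh liveuser@<server-address>
  sudo fsck -f /dev/sda2      # -f forces a full check of the (unmounted) partition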
Fortunately I had a backup of the VM guests’ virtual disks that was less than three hours old, which I could revert to when ready. But I wasn’t willing to bring the website (or any other guests) back online without fsck’ing all the other physical disks at the non-destructive read and re-write level, which took over five days.
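For the curious, that non-destructive read and re-write check is e2fsck’s badblocks pass; per disk it is something like this (ext4 filesystems assumed, /dev/sdb1 standing in for each real partition):

  # -f forces a check even if the filesystem looks clean; -cc (the option doubled)
  # runs badblocks in non-destructive read-write mode. The filesystem must be
  # unmounted, and on big disks this takes a very long time, as I found out.
  e2fsck -fcc /dev/sdb1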
I did the fsck’s on the latest kernel, kernel-3.15.6-200.fc20.x86_64, which kept the machine up the entire time… although it did crash again as soon as the last fsck finished, so maybe I was just lucky.
Or maybe it was just because I had shut down all the apps and the only I/O was the fsck running?
After rebooting on that latest kernel… if any qemu-kvm guest instance was started the machine would crash within two hours; likewise if I started a bacula backup (the databases are on the VM host) the machine would soon crash.
The fix
- booted off the last stable kernel, kernel-3.14.9-200.fc20.x86_64, which was fortunately still in the grub2 boot menu
- in /etc/yum.conf changed installonly_limit from 3 to 10, so if I accidentally install any more kernels I will have ten versions instead of three available and the stable kernel stays in the boot menu list (if I leave the CRT plugged in so I can select it, of course); the yum.conf and rpm steps are sketched as commands after this list
- used ‘rpm -e’ to remove the two buggy kernels (and their kernel-devel packages) so I don’t have to worry about them being accidentally used in future
- used “yum downgrade kernel-headers” to get back to a prior version of those; the version selected was kernel-headers.x86_64 0:3.11.10-301.fc20, which I’m not too comfortable about, but that’s apparently the latest previous version
- added exclude=kernel to the /etc/yum.conf file to stop any further kernel updates
- and I have downloaded the CentOS 7 install DVD; still tossing a coin as to whether that is the final fix
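Rolled into commands, the yum.conf and rpm steps above look roughly like this (the version strings are from my machine, and of course you cannot rpm -e the kernel you are booted on, hence booting off the old one first):

  # keep ten installed kernels instead of three, so a known-good one stays in grub2
  sed -i 's/^installonly_limit=3/installonly_limit=10/' /etc/yum.conf

  # stop yum from ever updating the kernel again (must end up in the [main] section)
  echo 'exclude=kernel' >> /etc/yum.conf

  # remove the two buggy kernels and their -devel packages
  rpm -e kernel-3.15.5-200.fc20.x86_64 kernel-devel-3.15.5-200.fc20.x86_64
  rpm -e kernel-3.15.6-200.fc20.x86_64 kernel-devel-3.15.6-200.fc20.x86_64

  # drop kernel-headers back to the previous version yum can find
  yum downgrade kernel-headers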
Summary
Since booting off the last stable kernel I had installed I have had no timeouts anywhere, all VM guests (I only have three on this machine currently) can be started and run without causing crashes, and I have done a “full” bacula backup of all three VM guests plus another three physical machines without the slightest hiccup.
If anybody tries to tell me it was a hardware issue rather than a kernel that should never have been released (nor the one after it), I will happily call them a liar. It was definitely a buggy kernel, as the older one I have reverted to is perfectly stable, and using either of the two later ones discussed in this post will consistently cause both sd and scsi timeouts (and data damage).
More details?
Anything after the last kernel I identified as being stable for my environment gives me lots of entries in the messages file like
Jul 29 19:29:58 vmhost1 kernel: sd 1:0:0:0: [sda] abort completed
Jul 29 19:29:58 vmhost1 kernel: scsi host1: uas_eh_task_mgmt: ABORT TASK failed (wrong tag 0/256)
Jul 29 19:29:58 vmhost1 kernel: sd 1:0:0:0: [sda] uas_eh_abort_handler ffff880197635c80 tag 10, inflight: CMD
Jul 29 19:29:58 vmhost1 kernel: [ 2336.691263] sd 1:0:0:0: [sda] abort completed
Jul 29 19:29:58 vmhost1 kernel: [ 2336.691309] scsi host1: uas_eh_task_mgmt: ABORT TASK failed (wrong tag 0/256)
Jul 29 19:29:58 vmhost1 kernel: [ 2336.691329] sd 1:0:0:0: [sda] uas_eh_abort_handler ffff880197635980 tag 11, inflight: CMD
Jul 29 19:29:58 vmhost1 kernel: sd 1:0:0:0: [sda] abort completed
Jul 29 19:29:58 vmhost1 kernel: scsi host1: uas_eh_task_mgmt: ABORT TASK failed (wrong tag 0/256)
Jul 29 19:29:58 vmhost1 kernel: sd 1:0:0:0: [sda] uas_eh_abort_handler ffff880197635980 tag 11, inflight: CMD
Jul 29 19:29:58 vmhost1 kernel: [ 2336.707448] sd 1:0:0:0: [sda] abort completed
Jul 29 19:29:58 vmhost1 kernel: [ 2336.707521] scsi host1: uas_eh_task_mgmt: ABORT TASK failed (wrong tag 0/256)
Jul 29 19:29:58 vmhost1 kernel: [ 2336.707536] sd 1:0:0:0: [sda] uas_eh_abort_handler ffff880197635b00 tag 12, inflight: CMD
Jul 29 19:29:58 vmhost1 kernel: sd 1:0:0:0: [sda] abort completed
Jul 29 19:29:58 vmhost1 kernel: scsi host1: uas_eh_task_mgmt: ABORT TASK failed (wrong tag 0/256)
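A quick way to check whether your machine is hitting the same thing is to grep the logs for the uas error handler:

  grep uas_eh /var/log/messages     # classic syslog file
  journalctl -k | grep uas_eh       # or via the systemd journal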
After rolling back to the old stable kernel, not a single error message has been logged. It was definitely a software issue in the kernel update and not a hardware issue.