On Random Yet Consistently Timed Crashes

For the past few weeks, I’ve been dealing with hard crashes on my Hyper-V server, and they all happened at around the same time. Essentially, the VM would stop responding to everything except pings. If I tried to use the Hyper-V console to bring it up, it would just crash. If I tried to reboot or stop the VM, it would crash the host.

So I went through the event logs on the host and came across a bunch of errors from my Highpoint 2720 controller about ports not responding and the driver not responding. I had a scheduled drive pair verify that was running around the time of the crashes, so I assumed there might be an issue with the drives in the pair.

I ran a full drive scan on both drive pairs. Both scans finished without errors, and there were no crashes. After that, I ran a drive pair verify again, but at a different time of day. That one succeeded as well.

Thoroughly confused, I went through the event logs again and found an error in the host log that coincided with the same timestamps. This one came from my PCIe network card, so at that point I started trying to figure out what could cause two PCIe cards to stop responding at the same time.

I went through the logs on the VM again and noticed some errors with event ID 129, but no details were shown because of a missing component. After some Googling, I found an ancient MS forum post about this error. In that thread, it was traced to an issue with VSS (Volume Shadow Copy Service), and other posters reported similar problems with crashing VMs.

I then remembered that I had Windows Server Backup running on the VM at around the same time the crashes were happening. I disabled it, and suddenly the crashes stopped.

Whoops.
