Recently I ran into a frustrating and subtle issue on a Proxmox node in my cluster.
A VM running on one Proxmox node silently vanished from the network. No traffic. No pings. No errors. The Proxmox host itself was still reachable, and VMs on the other nodes looked fine, but every VM on this particular node was just... gone.
Aaaaaaand one of those VMs was... Zabbix :) Ironically, it couldn't report its own death.
But this wasn't a Zabbix failure or a Proxmox failure (Deutsche Qualität!!).
It was a hardware-level networking failure caused by a buggy onboard NIC and a dangerous default virtual NIC model.
e1000e
The host motherboard used its built-in Intel NIC (Z590 chipset), driven by the e1000e driver, a well-known troublemaker in virtualized environments.
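To confirm which driver actually backs a given interface, ethtool -i eth0 reports it directly; the same information can also be read straight from sysfs. A minimal sketch (the helper name is my own, not from the original post):

```shell
# Print the kernel driver bound to a network interface by reading sysfs.
# Illustrative sketch: on the affected host, eth0 would report e1000e.
iface_driver() {
  link="/sys/class/net/$1/device/driver"
  if [ -L "$link" ]; then
    basename "$(readlink "$link")"
  else
    echo "none (virtual or unknown interface)"
  fi
}

# Example: iface_driver eth0
```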
The host kernel logs showed this repeating pattern:
kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
TDH <39>
TDT <cc>
next_to_use <cc>
next_to_clean <38>
buffer_info[next_to_clean]:
time_stamp <12efdce1b>
next_to_watch <39>
jiffies <138b711c0>
next_to_watch.status <0>
MAC Status <40080083>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <10>
This means the physical NIC had entered a broken state: it stopped processing packets but never failed outright. The host still saw a link up, so nothing obvious alerted and the cluster didn't even fall apart. Meanwhile, VM traffic was simply gone.
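Because the hang leaves this exact signature in the kernel log, it can at least be detected automatically. A hedged sketch of a cron-able check (not part of the original setup; the interface name eth0 and the idea of bouncing the link are my assumptions):

```shell
#!/bin/sh
# check_hang: succeed if the e1000e hang signature appears in the given text.
check_hang() {
  printf '%s\n' "$1" | grep -q "Detected Hardware Unit Hang"
}

# On the affected host you might run (as root, e.g. from cron):
#   if check_hang "$(dmesg | tail -n 200)"; then
#     ip link set eth0 down && ip link set eth0 up   # bounce the stuck NIC
#   fi
```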
Temp fix
Disable offload features on e1000e NICs
# /etc/network/interfaces
auto eth0
iface eth0 inet manual
    # disable checksum and segmentation offloads (TSO/GSO/GRO) on the e1000e port
    post-up /usr/sbin/ethtool -K $IFACE tx off rx off tso off gso off gro off
    post-down /usr/sbin/ethtool -K $IFACE tx off rx off tso off gso off gro off
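The same flags can be applied immediately, without waiting for the interface to be reconfigured, by running ethtool by hand. A small hypothetical helper that builds the exact command used above, just to keep the flag set in one place:

```shell
# Build the ethtool command that disables all the problematic offloads
# for a given interface, so the post-up hook and ad-hoc runs stay in sync.
offload_off_cmd() {
  printf '/usr/sbin/ethtool -K %s tx off rx off tso off gso off gro off' "$1"
}

# On the live host (as root):
#   eval "$(offload_off_cmd eth0)"
# Verify afterwards with:
#   ethtool -k eth0 | grep -E 'segmentation|receive-offload'
```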
Permanent fix
I just ordered a dedicated Intel network card (I210-T1) to replace the flaky onboard port; the I210 uses the separate igb driver, which has a much better reputation in virtualized setups.