Workaround for ASRock random reboots |
johnwbyrd
Newbie Joined: 14 Oct 2024 Status: Offline Points: 40 |
Posted: 10 Dec 2024 at 4:41am |
Here are some breadcrumbs for anyone debugging random reboot issues on Proxmox 8.3.1 or later.
tl;dr: If you're experiencing random, unpredictable reboots on a Proxmox rig, try DISABLING (not leaving at Auto) your Core Watchdog Timer in the BIOS.
I have built a Proxmox 8.3 rig with the following specs:
- CPU: AMD Ryzen 9 7950X3D 4.2 GHz 16-Core Processor
- CPU Cooler: Noctua NH-D15 82.5 CFM CPU Cooler
- Motherboard: ASRock X670E Taichi Carrara EATX AM5 Motherboard
- Memory: 2 x G.Skill Trident Z5 Neo 64 GB (2 x 32 GB) DDR5-6000 CL30 Memory
- Storage: 4 x Samsung 990 Pro 4 TB M.2-2280 PCIe 4.0 X4 NVME Solid State Drive
- Storage: 4 x Toshiba MG10 512e 20 TB 3.5" 7200 RPM Internal Hard Drive
- Video Card: Gigabyte GAMING OC GeForce RTX 4090 24 GB Video Card
- Case: Corsair 7000D AIRFLOW Full-Tower ATX PC Case (Black)
- Power Supply: be quiet! Dark Power Pro 13 1600 W 80+ Titanium Certified Fully Modular ATX Power Supply
This particular rig, when updated to the latest Proxmox with GPU passthrough as documented at https://pve.proxmox.com/wiki/PCI_Passthrough , would randomly reboot under load, with no indication as to why. Nothing in the Proxmox system log indicated that a hard reboot was about to occur; it simply happened, and the system would come back up immediately and attempt to recover the filesystem.
At first I suspected the PCI passthrough of the video card, which seems to be the source of a lot of crashes for a lot of users. But the crashes were reproducible even without using the video card.
After an embarrassing amount of bisection and testing, it turned out that this particular motherboard (ASRock X670E Taichi Carrara) has a BIOS setting at Advanced\AMD CBS\CPU Common Options\Core Watchdog\Core Watchdog Timer Enable, whose default value (Auto) seems to ENABLE the Core Watchdog Timer. With the timer enabled, sudden reboots occur at unpredictable intervals on Debian, and therefore on Proxmox as well.
The workaround is to set the Core Watchdog Timer Enable setting to Disable. In my case, that made the system stable under load.
Because of these types of misbehaviors, I now only use ZFS as a root file system for Proxmox. ZFS played like a champ through all these random reboots, and never corrupted filesystem data once.
In closing, I'd like to send shame to ASRock for sticking this particular footgun into the default BIOS settings for its X670E motherboards. Additionally, I'd like to warn all motherboard manufacturers against enabling core watchdog timers by default in their respective BIOSes.
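For anyone who wants to confirm that a reboot really was a hard reset rather than an orderly shutdown, here is a minimal Python sketch. It checks whether the tail of the previous boot's journal contains any sign of a clean shutdown. Several things here are assumptions and not part of the setup described above: the function names are made up for illustration, the marker strings are guesses at what systemd typically logs while shutting down (the wording varies between versions), and `journalctl -b -1` only returns anything if the journal is persistent across reboots.

```python
import subprocess

# ASSUMPTION: substrings that tend to appear near the end of the journal
# during an orderly shutdown. Exact wording differs between systemd versions,
# so adjust these to match what a known-clean shutdown logs on your system.
CLEAN_SHUTDOWN_MARKERS = (
    "journal stopped",
    "system reboot",
    "system power off",
    "shutting down",
)


def ended_cleanly(log_lines, markers=CLEAN_SHUTDOWN_MARKERS, tail=50):
    """Return True if any of the last `tail` lines contains a shutdown marker."""
    recent = log_lines[-tail:]
    return any(
        marker.lower() in line.lower()
        for line in recent
        for marker in markers
    )


def previous_boot_tail(tail=50):
    """Return the last `tail` journal lines of the previous boot.

    Requires systemd and a persistent journal; raises CalledProcessError
    if journalctl cannot find a previous boot.
    """
    result = subprocess.run(
        ["journalctl", "-b", "-1", "-n", str(tail), "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

On the affected machine you would call `ended_cleanly(previous_boot_tail())` after a reboot. A result of `False` only means none of the assumed markers were found, which is consistent with (but does not prove) the kind of sudden reset described above.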
|
RealPjotr
Newbie Joined: 2 hours 35 minutes ago Status: Offline Points: 15 |
Thanks, I have the exact problem you describe, but on a SIENAD8-2L2T motherboard and Epyc CPU. I also run the latest Proxmox and have had random reboots. At first they were more frequent and often reported M.2 hardware errors. That turned out to be WD SN770 SSDs not playing well with ZFS. I have replaced them with Samsung 990 Pros, and they've been stable for 2-3 weeks, which is much better.
But today I had a completely unexpected reboot with nothing in any of the logs: system log, journalctl, or IPMI. Googling, I found your post, so I've changed this setting and hope it helps. Time will tell!
|