I'm currently on 2.27-4800. Is it possible we have a regression here? I use a laptop at work, and I'm seeing that my computer hangs with a WHEA 18 error when I'm using the laptop on a KVM (so the PC is idle). I've seen an uptick in these in the event log since about 8/17/22, and it's happened twice today. This is the exact same system as when we were diagnosing this issue last year.
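For anyone who wants to check whether WHEA 18 events really are getting more frequent, here is a rough Python sketch that tallies them per day from the Windows System log. It assumes the events are logged under the Microsoft-Windows-WHEA-Logger provider with Event ID 18 and pulls them with `wevtutil`. The function names `fetch_whea18_xml` and `count_per_day` are just names I made up, and this hasn't been run on the machines discussed in this thread, so treat it as a starting point only.

```python
import re
import subprocess
from collections import Counter

# XPath filter for WHEA-Logger Event ID 18 (assumed to live in the System log).
WHEA18_QUERY = (
    "*[System[Provider[@Name='Microsoft-Windows-WHEA-Logger'] "
    "and (EventID=18)]]"
)

# Captures the date part of <TimeCreated SystemTime='2022-08-17T14:03:22.1Z'/>.
# SystemTime is in UTC, so days are UTC days, not local days.
TIME_RE = re.compile(r"""<TimeCreated\s+SystemTime=['"](\d{4}-\d{2}-\d{2})T""")


def fetch_whea18_xml():
    """Windows only: dump the matching events as XML using wevtutil."""
    result = subprocess.run(
        ["wevtutil", "qe", "System", "/q:" + WHEA18_QUERY, "/f:xml"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def count_per_day(xml_text):
    """Return a Counter mapping 'YYYY-MM-DD' (UTC) to the number of events."""
    return Counter(TIME_RE.findall(xml_text))


# Typical use on the affected machine (not executed here):
#   for day, n in sorted(count_per_day(fetch_whea18_xml()).items()):
#       print(day, n)
```

Comparing the per-day counts before and after a HWInfo update date would at least show whether the uptick lines up with the version change.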
I recently started hitting this issue of a WHEA Machine Check Exception (WHEA Event ID 18), citing a cache hierarchy error (which I'll assume is AMD's notation for a cache coherency problem). Each time it occurs on my system, it flags the APIC IDs for two separate cores (different cores each time), one on each CCD. What led me here is that all four times this has happened in the last week, HWInfo was running (I'm on the latest release). In all four cases, the problem occurred with the system under full load, running F@H on both CPU and GPU, loading all cores. The stop occurred as quickly as 3 minutes after boot, with the longest run before a stop being 3 hours and 16 minutes. Two of the stops occurred with the machine in its default stock config; the other two occurred with PBO enabled. I'll list the specs below, but my machine has overkill levels of cooling and power delivery - Corsair HX1200 PSU, Corsair 360 AIO cooler, 7 Corsair maglev fans - and even with PBO enabled, CPU package and Tdie temps never exceed 74°C (I have a platform override in place to prevent the system from exceeding 79°C, but it never hits that temp anyway).
Now here is where things start to point to HWInfo as some form of trigger: so far, I absolutely cannot recreate this WHEA 18 stop if HWInfo isn't running. I ran the machine for 24 hours, running the same F@H workload, without so much as a hiccup from Friday evening through Saturday (US-ET). I didn't launch HWInfo for the rest of the weekend and didn't note any issues even after the 24-hour stress test. This afternoon I hit the problem twice in quick succession, both times while running HWInfo. After a shutdown, full power-off, and cool-down (to recreate the same conditions as my initial startup), I brought the machine back up, refrained from launching HWInfo, and began another stress test. As I'm typing this, we're now at just over 7 hours of continuous load, again without a hiccup.
The frustrating part of this is that when hitting the Machine Check, versus a standard kernel-mode bugcheck, there's no kernel dump generated, so I can't pull up WinDbg (I'm a software engineer and former driver developer on NT) and run !analyze or look at a stack trace to get a hint as to what was happening at the time of the stop. I have no idea what aspect of reading sensor data would result in this situation, but whatever is happening feels like something is tripping up the cIOD interconnect fabric (regardless, at some level this is an AMD problem, since this type of crash shouldn't happen under any circumstances). I'm going to attempt to create a more consistent set of repros tomorrow, but right now this doesn't feel like a run-of-the-mill bad Ryzen CPU issue (honestly, I wish I could get a reproduction without HWInfo, because then I could just chalk it up to a bunk Ryzen processor or motherboard, order a new board / CPU, and call it a day).
Specs:
Ryzen 5950x
Asus Tuf x570 plus
128GB Micro DDR4-3200 (CAS 16)
Asus Tuf RTX 3080
Samsung 2TB 980pro NVMe (PCIe 4 @ 4x via IO die)
Seagate 2TB Firecuda NVMe (PCIe 4 @ 4x via PCH)
Corsair HX1200 PSU
Corsair 360 AIO
Windows 11 Enterprise 22H2 Build 22621.607