Is HWiNFO causing the dreaded WHEA-Logger Event ID XX Cache Hierarchy Errors and sudden reboots on AMD Ryzen systems?

I'm currently on 2.27-4800. Is it possible we have a regression here? I use a laptop at work, and I'm seeing that my computer hangs with a WHEA 18 error when I'm using the laptop on a KVM (so the PC is idle). I'm seeing an uptick in these in the events log since about 8/17/22, and it's happened twice today. This is the exact same system as when we were diagnosing this issue last year.
I recently started hitting this issue of a WHEA Machine Check Exception (WHEA Event ID 18), citing a cache hierarchy error (which I'll assume is AMD's notation for a cache coherency problem). Each time it occurs on my system, it flags the APIC for two separate cores (but each time it's different cores), one on each CCD. What led me here is that in each of the 4 times this has happened in the last week, it's occurred when running HWInfo (I'm running the latest release). In all four cases, the problem occurred with the system under full load, running F@H on both CPU and GPU, loading all cores. The stop occurred as quickly as 3 minutes after boot, with the longest duration being three hours and 16 minutes. Two of the stops occurred with the machine in its default stock config, the other two occurred with PBO enabled. I'll list the specs below, but my machine has overkill levels of cooling and power delivery - Corsair HX1200 PSU, Corsair 360 AIO cooler, 7 Corsair maglev fans - even with PBO enabled, CPU package and Tdie temps never exceed 74c (and I have a platform override in place to prevent the system from exceeding 79c - but it never hits that temp anyway).

Now here is where things start to point to HWInfo as some form of trigger: so far, I absolutely cannot recreate this WHEA 18 stop if HWInfo isn't running. I ran the machine for 24 hours, running the same F@H workload, without so much as a hiccup from Friday evening through Saturday (US-ET). I didn't launch HWInfo for the rest of the weekend and didn't note any issues even after the 24-hour stress test. When I hit the problem, in quick succession this afternoon, both times while running HWInfo, after a shutdown and full power-off and cool-down (to recreate the same conditions as my initial startup) I brought the machine back up, refrained from launching HWInfo, and began another stress test. As I'm typing this, we're now at just over 7 hours at continuous load, again, without a hiccup.

The frustrating part of this is that when hitting the Machine Check, versus a standard kernel mode bugcheck, there's no kernel dump generated, so I can't pull up windbg (I'm a software engineer and former driver developer on NT) and run a !analyze or look at a stack trace to get a hint as to what was happening at the time of the stop. I have no idea what aspect of reading sensor data would result in this situation, but whatever is happening feels like something is tripping up the cIOD interconnect fabric (regardless, at some level this is an AMD problem since this type of crash shouldn't happen under any circumstances). I'm going to attempt to create a more consistent set of repros tomorrow, but right now this doesn't feel like a run-of-the-mill bad Ryzen CPU issue (honestly, I wish I could get a reproduction without HWInfo, because then I could just chalk it up to a bunk Ryzen processor or motherboard, order a new board / CPU and call it a day).
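With no kernel dump to work from, the WHEA-Logger entries in the System event log are about the only record of each stop. Below is a rough sketch of pulling the most recent Event ID 18 entries with the stock wevtutil tool from Python. The function names and the default of 20 events are just illustrative choices, and it only works on Windows, possibly from an elevated prompt.

```python
import subprocess

# Provider name under which Windows logs WHEA hardware error events.
WHEA_PROVIDER = "Microsoft-Windows-WHEA-Logger"


def build_whea_query(event_id=18, max_events=20):
    """Build a wevtutil command line for recent WHEA events (newest first)."""
    xpath = (
        f"*[System[Provider[@Name='{WHEA_PROVIDER}'] "
        f"and (EventID={event_id})]]"
    )
    return [
        "wevtutil", "qe", "System",
        f"/q:{xpath}",        # XPath filter on provider and event ID
        f"/c:{max_events}",   # maximum number of events to return
        "/rd:true",           # reverse direction: newest events first
        "/f:text",            # plain-text output
    ]


def fetch_whea_events(event_id=18, max_events=20):
    """Run the query and return the raw text output (Windows only)."""
    result = subprocess.run(
        build_whea_query(event_id, max_events),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(" ".join(build_whea_query()))
```

Comparing the timestamps and reported APIC IDs across runs with and without HWInfo open would at least give a consistent set of data points to hand over.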

Specs:
Ryzen 5950x
Asus Tuf x570 plus
128GB Micro DDR4-3200 (CAS 16)
Asus Tuf RTX 3080
Samsung 2TB 980pro NVMe (PCIe 4 @ 4x via IO die)
Seagate 2TB Firecuda NVMe (PCIe 4 @ 4x via PCH)
Corsair HX1200 PSU
Corsair 360 AIO
Windows 11 Enterprise 22H2 Build 22621.607
 
Are you running some other system monitoring or tweaking tools along with HWiNFO? Even RyzenMaster counts. This problem could be due to a collision between them and HWiNFO.
So if you're running any other tools, try to run HWiNFO only to see if this is the case.
 
Hi Martin - thanks for responding. I'm not running Ryzen Master interactively. Lamentably, my system does run iCue and Asus' AISuite, both of which do consume, I assume, smbus information around temps and fan speeds. I need them to manage fan profiles and make all my aRGB pretty (for that 15% RGB perf gain).

I'd be interested in what you think are the mechanics of this situation. How could sensor monitoring lead to what (I believe amdppm.sys) buckets (for WHEA) as a cache hierarchy hardware fault? It feels like AMD's handling of processor state is borked, and that there is some other device in the system that's creating blocked state for the cIOD (e.g. a problem in handling a fault for something blocking one or more of the PCIe channels cascades into an infinity fabric issue that eventually just gets (inaccurately) handled with this "cache hierarchy" handler).

Is this possibly some issue that's actually an interaction problem with the GPU (and its driver stack) but, as in other instances, it's just getting "handled" incorrectly upstream?

I feel that if we can build a reasonable understanding of what the real mechanics of this problem are, I might be able to pull some strings to get to someone at AMD who can make their reporting better (e.g. possibly more granular handling, better fault parametrization and documentation, or maybe just better naming - e.g. don't call it a cache hierarchy error if it's something more generic). I'd need some pretty precise data to get AMD's attention (as I'd be leveraging channels from a former life).

As it stands, the WHEA 18 "cache hierarchy error" has become nonsensically generic; it's as if everything were landing in an NMI or unrecoverable ECC hardware fault bucket - I highly doubt all these WHEA 18's in the world (including those you've previously fixed the triggers for in your product, as well as all those that have totally unrelated vdroop issues, etc.) actually involve a cache or memory subsystem fault.

Happy to take this offline or to PM/email if that makes anything easier. Thanks for your help; I find HWInfo extremely useful and hope to get back to using it on this system.
 
The problem might be when multiple clients access shared resources at the same time without synchronized access. That can result in a collision.
Waiting for your test results without iCue and AISuite. Btw, many users try to avoid those applications as they are a huge bloat and can themselves cause issues.
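To illustrate the kind of collision meant here, below is a minimal sketch in Python. It assumes a made-up device with an index register and a data register (the `IndexedDevice` class, the register numbers and the values are all hypothetical, not how any particular tool or chip actually works). Two clients that each "select, then read" get wrong data when their steps interleave, while wrapping both steps in a shared lock keeps each read consistent.

```python
import threading


class IndexedDevice:
    """Hypothetical device: write an index, then read the selected register."""

    def __init__(self, registers):
        self.registers = dict(registers)
        self.index = 0

    def select(self, index):
        self.index = index

    def read(self):
        return self.registers[self.index]


def unsynchronized_reads(device):
    """Replay one bad interleaving of two clients without any locking."""
    device.select(0x10)       # client A selects register 0x10
    device.select(0x20)       # client B selects 0x20 before A has read
    a_value = device.read()   # client A now reads register 0x20 by mistake
    b_value = device.read()   # client B reads the register it asked for
    return a_value, b_value


def locked_read(device, lock, index):
    """Select and read as one unit, so no other client can get in between."""
    with lock:
        device.select(index)
        return device.read()


if __name__ == "__main__":
    dev = IndexedDevice({0x10: 41, 0x20: 1200})
    print("unsynchronized:", unsynchronized_reads(dev))

    lock = threading.Lock()
    results = {}
    threads = [
        threading.Thread(
            target=lambda i=i: results.__setitem__(i, locked_read(dev, lock, i))
        )
        for i in (0x10, 0x20)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("locked:", results)
```

The lock only helps if every client takes the same one, which is why a single tool that ignores it can still disturb everyone else.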
 
It's going to take me a little time to verify stability sans iCue and AISuite, which are both absolute crapware (thus my previous "lamentably" comment), only made less noticeably bloaty by the fact that the machine has 16 cores and 128GB of RAM.

I still think AMD needs to address the issue of WHEA 18's acting as some kind of generic bucket for problems that have nothing to do with processor cache and memory subsystems - their bad handling is almost worse than no handling.
 
I had a Ryzen 9 5950X which experienced spontaneous reboots (crashes) in Windows due to WHEA 18 / Cache Hierarchy Error. I had the processor replaced, but the second 5950X had the same issue. Thinking that the problem must be in some other component then, I switched every other part in the computer, but WHEA 18 remained. They occurred very irregularly, e.g., twice a day or ten days between crashes. Most crashes occurred when running Folding@home, and I did not suspect they had anything to do with HWiNFO.

After acquiring a number of new components I had two computers which I could use for testing the suspected faulty 5950X:
  1. a regular desktop running Windows 11 and
  2. a "server" running Linux (TrueNAS SCALE, based on Debian).
Each machine would work fine with another Ryzen: the first computer was OK with a Ryzen 5 5600G and the second one with a Ryzen 5 5600. Both computers, however, exhibited crashes with the Ryzen 9 5950X (my second unit). So, the 5950X was replaced once again, and computer 1 is now working fine with Ryzen 9 5950X #3: not a single WHEA 18 (knocking on wood).

This was just my case. I am not sure if it helps anyone else who has WHEA 18 issues with Ryzen. Personally, in hindsight, I would only buy a 5950X when sold as part of a whole computer, in which case the seller would be responsible for diagnosing and correcting any hardware issue.
 
Thanks for your feedback. I'd like to ask if you found any correlation with HWiNFO but I suspect that since the Linux machine was also affected, the problem must have been somewhere else.
 
I don't usually run HWiNFO in the background for extended periods of time. I don't think my computer ever crashed when I was using HWiNFO. When switching components back and forth, I reinstalled Windows a few times. For testing, I kept the computer fairly clean, installing a very limited set of software. Crashes occurred when HWiNFO was not even installed. I can say my WHEA 18 problems and probably related Linux crashes were not caused by HWiNFO.
 
I just registered to add to this thread. I've been pulling my hair out for the last month trying to troubleshoot my new 7800x3D build that would spontaneously reboot like the reset switch had just been hit. There were no attendant WHEA errors like the Zen 3 era of this issue, but the idle reboots, and attendant Event ID 41 logs were similar.

After a lot of back and forth and RMAing of the CPU and motherboard with no fault found, I eventually tried a fresh install of Windows on a spare SATA SSD, left all inbox drivers at defaults and didn't load anything else, and this stayed stable for several days. I then installed Adrenalin and all chipset drivers, and maintained stability for another 5 days.

I then ran across this thread, and then installed HWInfo (latest stable 7.42), and had an idle reboot after just over 4 minutes. I then uninstalled it, and was stable for another 24 hours.

I've now re-installed my NVME SSD and original Windows 11 install, and immediately uninstalled HWInfo, and I have been reset free overnight, which I've never achieved on this install.

So I'm hoping that anybody suffering similar issues on their newer Zen 4 builds might run across this and test to see if it also resolves their issues.
 
Test without HWiNFO for a month and watch whether the issue reappears. If it is HWiNFO, you won't have the issue for that month, and it should then happen within 24 hours as soon as you start using HWiNFO again. I doubt it's HWiNFO, though. It reminds me of issues such as running HWiNFO alongside bad RGB software, like the RGB software from G.Skill, or fan control software that conflicts with HWiNFO, or anything else that could potentially cause issues. I kind of wish HWiNFO had a feature to detect these potential conflicts and warn users about them in a one-time notification. I've seen some really odd stuff, even my mainboard failing POST because it was stuck at detecting RAM after all sensors froze, which prompted me to reboot; of course, a power cycle fixed it.
I would rather ditch RGB software than HWiNFO, honestly, although Windows 11 23H2 is going to have its own RGB software. I hope that won't cause conflicts with HWiNFO, although I've not tested it yet; I'm too busy setting up a MO-RA radiator for my Liquid Devil 7900 XTX, which is running too hot in a triple-rad setup with a 5950X.
 
I had nothing installed except chipset drivers via Adrenalin. 5 days of no reboots, then install HWInfo, and reset within 5 minutes.

I've already blown so much time on this issue, I have zero willingness to troubleshoot further and not using HWInfo fully resolves the issue for me. I'm just leaving the info here in case someone else has a similar issue and ends up here via Google. Everybody who isn't experiencing this very particular issue need not be worried.
 
I've been down a rabbit hole before, trying to pin issues on HWiNFO only to find out it wasn't that, then figuring it was Wallpaper Engine or something else. I eventually found out that troubleshooting AMD issues is impossible. I suspect something is dropping out on the CPU that shouldn't, but it doesn't generate an error and instead just freezes.
I hope you figure it out, because I even went as far as running a bunch of 3D tests for 40+ hours with no freezes at all. It only happens in games, or in WoW (the game has a deadlock issue, which is a resource conflict, and I think it's related to that), but I just quit. Although, if you have an AMD GPU, it's probably related to the Radeon drivers; they can be a pain. If you do end up still having issues without HWiNFO, I can recommend trying Linux, as it's more stable. It's just a shame HWiNFO is not on Linux yet; I would switch over to Linux right now if I had access to HWiNFO and InfoPanel on Linux.
Since you are on AM5, though, I would recommend making sure you're on the latest BIOS, especially considering CPUs have quite literally popped, in some cases with motherboards burning up as well. AMD really needs to communicate better and put stricter rules in place to disallow manipulation outside spec, like some AIBs such as Asus or Gigabyte do.

It may help to open your own thread as well. Anyway, going to bed, I'm sleepy.
 
I do very much have buyer's remorse for jumping ship from Intel to AMD, especially now having done a ridiculous amount of research on idle reboot and WHEA issues and seeing this has been an ongoing thing for a few years. I also have zero issues when the machine is under load, stress testing and gaming; it's just at idle or very light use like web browsing.

RE the BIOS, I've been on 7 different BIOS versions now. None of them have made a difference, and the one loaded now is the latest stable (1616: https://www.asus.com/motherboards-c...k_bios/?model2Name=TUF-GAMING-B650M-PLUS-WIFI)
 
Do you maybe have multiple GPUs installed?
I'd have to see the HWiNFO Debug File of the crash to determine what exactly is causing this.
 
No, the onboard is disabled in BIOS so just the RTX 3080. I've reinstalled it and turned the debug mode on. When it crashes, I'll send you the dump file.
 
Just rebooted an hour and 20 mins ago. Debug file attached.

The file was showing zero bytes when logged back into desktop, then when I closed HWInfo, it grew to 1085KB.
 
