Is HWiNFO causing the dreaded WHEA-Logger Event ID XX Cache Hierarchy Errors and sudden reboots on AMD Ryzen systems?

I posted on the AMD help forums and I have the exact same thing, except that I don't run into problems with pre-1.2.0.0 AGESA; I'm more stable with older AGESA. Even on stock/auto settings I cannot run Karhu or TM5. Every time I ran those two tests I had HWiNFO64 (6.40-4330) open in the background to try to catch WHEA errors before the reboot, but I was unable to log anything. It was straight reboots.


I have a ROG Strix B450-F Gaming + Ryzen 5900X + 4x8GB G.Skill Ripjaws 3200MHz CL14, giving some WHEA errors and reboots only in the Karhu memory benchmark and TM5. Daily usage is fine: no crashes, bluescreens or stuttering. I can even run Prime95, membench, CB23 and CB20 without errors:

https://www.mediafire.com/file/u1bkd9os3tk5rg1/applicationsevtx.evtx/file

https://www.mediafire.com/file/h0ib41896t1knqj/systemevtx.evtx/file
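
For anyone who wants to pull the WHEA-Logger entries out of a live log or a saved .evtx file like the two above without clicking through Event Viewer, here is a minimal Python sketch. It assumes Windows with the stock `wevtutil` tool on the PATH. The helper names `build_whea_query` and `read_whea_events` are made up for illustration, and the provider name `Microsoft-Windows-WHEA-Logger` is assumed to be the one your system uses for these entries.

```python
import subprocess

# Provider name as it normally appears in Event Viewer for WHEA-Logger entries
# (assumption: your system reports the events under this name).
WHEA_PROVIDER = "Microsoft-Windows-WHEA-Logger"


def build_whea_query(source="System", is_file=False, count=20):
    """Build a wevtutil command that lists the newest WHEA-Logger events.

    source  -- a live log name ("System") or a path to a saved .evtx file
    is_file -- True when `source` is an .evtx file rather than a live log
    count   -- maximum number of events to return
    """
    xpath = f"*[System[Provider[@Name='{WHEA_PROVIDER}']]]"
    cmd = [
        "wevtutil", "qe", source,
        f"/q:{xpath}",
        "/f:text",      # human-readable output
        "/rd:true",     # newest events first
        f"/c:{count}",
    ]
    if is_file:
        cmd.append("/lf:true")  # treat `source` as a log file path
    return cmd


def read_whea_events(source="System", is_file=False, count=20):
    """Run the query (Windows only) and return wevtutil's text output."""
    result = subprocess.run(
        build_whea_query(source, is_file, count),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a Windows machine, `print(read_whea_events())` would query the live System log, and `read_whea_events("systemevtx.evtx", is_file=True)` a saved copy. Since the reboots described here are instant, checking the log after the machine comes back up may be the only chance to see anything at all.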

Please upgrade HWiNFO to v6.43 Beta.
 
Hello, can anybody vouch for the symptoms and fix for Ryzen with an Nvidia GPU? I have a 5900X/3080 setup. AMD approved my RMA yesterday, but I can't ship due to a winter storm. Now I'm wondering if getting a 5900X replacement would be worth the effort (also if it'd do more harm than good). I do have HWiNFO run at startup, and the WHEA BSOD happens at idle (after an intense session of Cyberpunk).

P.S.: Ryzen Master reports my max clock as 4900 MHz. This correlates with HWiNFO recording my max clock hitting the ~4900 MHz range. Is this a known "spike"? Can anyone with a stock BIOS config mention if they get the same thing?

Thanks!
 
Such a WHEA BSOD is not caused by the HWiNFO issue discussed here, which affected only RX 6xxx series GPUs.
 
Hey @Martin, thanks for being so hands-on with this problem, by the way. I do not run HWiNFO but I do run FPSMonitor and CoreTemp, both of which also monitor temps. I have swapped GPU from a 2070S to a 3090, PSU from 650W to 850W and RAM from 3600 CL17 to 3200 CL16, and have been having the EXACT same problem as described in this thread.

My setup currently is a 5950X and 3090 (Nvidia, same as @KeiOthic) with an X570 Aorus Elite and an 850W Corsair PSU. I understand none of my problems have to do with HWiNFO as I don't use it (unless both these programs end up using HWiNFO binaries, which I don't know), but would you say that this problem could be related to the CPU instead of the GPU?

Second question: could it be that these two temp monitors have not implemented the same fix that AMD suggested to you? Was the fix strictly GPU based?
I very much need an answer to this because I am starting the RMA process with AMD, and if it turns out to be these programs I wouldn't have to go ahead with the RMA. Since I stopped running these two programs I haven't gotten the error in the last 4 days, whereas before I would get it at least once every 2 or 3 days. An answer to these questions would be much appreciated, thank you very much.
 
@Martin Also, as a warning: I just got in touch with FPS Monitor support and it seems they are also using the HWiNFO library. Probably CoreTemp as well.

They are 100% using a broken version of the HWiNFO library, because their latest update is from November 26, 2020. Nothing to do with you specifically; I'm just using this thread to let people know.
 
FPS Monitor is using the HWiNFO engine as well (CoreTemp is not); nevertheless, the problem you're experiencing should not be caused by this. But it's easy to verify: run without any other tool and see if the issue persists.
I can't speak for CoreTemp as it is a different product into which I have no insight. FPS Monitor got an updated HWiNFO engine with the same fix as HWiNFO, but I'm not sure whether the author has released an update with it. Yes, the fix is strictly GPU related and only for RX 6xxx. The feature causing this issue is not present on other GPUs or on CPUs, so there's no way it could cause a problem anywhere else.
 
Alright, thanks for the heads up. I'm just trying to diagnose whether my issue could be related to this in any way. I went from crashing at least once every 3 days at best to maybe once a day on average, and for the last 4 days, since uninstalling these tools, I have not had the problem happen again. Since you said that it's strictly GPU related, I am a bit sceptical and might just be getting lucky with it not shutting down, but I hope that for one reason or another the issue is fixed.

I will go as far as saying that FPSMonitor does NOT have this update, as their latest version seems to be from November 26, 2020 and your fix was only just pushed out. Regardless, thanks a lot for answering, you're the best.
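
On the "might just be getting lucky" worry, a rough back-of-the-envelope check is possible. The sketch below assumes crashes arrive at random at a constant average rate (a Poisson model, which is an assumption and not something measured in this thread) and uses only the figures from the post above. The function name `chance_of_quiet_streak` is made up for illustration.

```python
import math


def chance_of_quiet_streak(days_without_crash, mean_days_between_crashes):
    """Probability of seeing zero crashes in a window purely by luck,
    assuming crashes arrive at random at a constant average rate (Poisson)."""
    rate_per_day = 1.0 / mean_days_between_crashes
    return math.exp(-rate_per_day * days_without_crash)


# Figures taken from the post above: 4 quiet days, versus a crash roughly
# every 3 days (best case) or roughly once a day (recent average).
for interval in (3.0, 1.0):
    p = chance_of_quiet_streak(4, interval)
    print(f"one crash per {interval:g} day(s): {p:.1%} chance of 4 quiet days")
```

Under that assumption, four quiet days would happen by chance about a quarter of the time at one crash per three days, and only about 2% of the time at one crash per day. So four days is suggestive at best, and a longer crash-free stretch would be much stronger evidence.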
 
Would GPU-Z be affected by this as well, at least potentially? If so, do you have a contact there at techpowerup to let them know?
 
I don't think it is affected, but can't say for sure as I have no insight into its internals.
This issue was caused by a certain register access required for advanced GPU monitoring, which AFAIK no other (public) tool supports.
But I can't rule out that some other tool is using a similar register access for other purposes.
 
OK, thanks for the info. I will try to relay that information to them. I primarily use HWiNFO, but at times I'll use GPU-Z for some of the other GPU-specific functions it performs. Since I've loaded the HWiNFO beta, I have not loaded GPU-Z at all, as I wanted to eliminate that variable. If GPU-Z is at least theoretically vulnerable, then there could be a lot more WHEA-18 sufferers out there about to RMA their shiny new Zen 3s for nothing, and that doesn't help anyone.
 
Hi, sorry for chiming in late but I think I have the same issue.


AMD Ryzen 7 5800x in combination with RX6800 on MSI B550-A Pro.

Yesterday for the first time I experienced a random reboot.

A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 8

The details view of this entry contains further information.

A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Bus/Interconnect Error
Processor APIC ID: 0

The details view of this entry contains further information.
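
If you collect several of these entries, it can help to tabulate them and see whether the same error type or APIC ID keeps coming back. Below is a small Python sketch that parses text pasted in the format shown above. The function name `parse_whea_entries` is hypothetical, and the parser only assumes the "Key: Value" layout visible in this post.

```python
def parse_whea_entries(text):
    """Split pasted WHEA-Logger text into one dict per event.

    A new entry starts at each "A fatal hardware error has occurred." line;
    every following "Key: Value" line becomes a field of that entry.
    """
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("A fatal hardware error has occurred"):
            current = {}
            entries.append(current)
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return entries


sample = """\
A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 8

A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Bus/Interconnect Error
Processor APIC ID: 0
"""

for entry in parse_whea_entries(sample):
    print(entry["Error Type"], "on APIC ID", entry["Processor APIC ID"])
```

For the two entries above this prints one line per event, giving the error type next to the APIC ID it was reported on.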


Before (Dec, Jan) I did not have the issue, but recently I updated the BIOS, chipset drivers and GPU drivers to the newest versions.
I'm using HWiNFO v6.4-4330 and would like to keep using HWiNFO, so I'll update to the newest version and hope it does not happen anymore.

EDIT: where can I find the newer version which would solve the issue?
 
My bad, I missed the grey button. So my problem is most likely related to the old version of HWiNFO I had installed?
Otherwise I would need to check a lot more settings, as I can't reproduce it; it rather comes randomly, and has happened only once.
 
It's quite likely if you have an RX 6800; HWiNFO64 v6.42 was affected.
But there are tons of other things that can cause such failures, completely unrelated to HWiNFO. So you need to use v6.43 and see what will happen.
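
The triage rule in this reply can be written down as a short Python sketch. It relies only on what the thread states: the bug concerns RX 6xxx GPUs and v6.43 is the build carrying the fix. Treating every build older than 6.43 as "possibly affected" is my own cautious assumption, not something confirmed here, and the function name `may_be_affected` is made up.

```python
import re


def may_be_affected(hwinfo_version, gpu_name):
    """Rough triage: RX 6xxx GPU combined with a HWiNFO build older than 6.43.

    Assumption: every build older than 6.43 is treated as "possibly affected",
    which is more cautious than anything confirmed in the thread.
    """
    m = re.match(r"v?(\d+)\.(\d+)", hwinfo_version.strip())
    if m is None:
        raise ValueError(f"unrecognised version string: {hwinfo_version!r}")
    major, minor_text = int(m.group(1)), m.group(2)
    # HWiNFO 6.x versions read like decimals ("6.4" means 6.40), so pad
    # a one-digit minor part before comparing.
    minor = int(minor_text.ljust(2, "0"))
    has_fix = (major, minor) >= (6, 43)
    is_rx6000 = re.search(r"\bRX\s*6\d{3}", gpu_name, re.IGNORECASE) is not None
    return is_rx6000 and not has_fix


print(may_be_affected("v6.4-4330", "RX6800"))   # RX 6xxx on an older build
print(may_be_affected("6.43", "RX 6800"))       # RX 6xxx on the fixed build
print(may_be_affected("6.42", "RTX 3080"))      # not an RX 6xxx GPU
```

A `True` result only means this particular bug is a candidate; as the reply says, plenty of unrelated things can cause the same WHEA errors.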
 
I just want to throw my hat in as someone potentially affected by this. Upgraded to a 6900xt in January, and started having issues. Ended up replacing everything in the system through RMA / Upgrades and that one issue still persisted. It only ever happened overnight, or under low utilization (e.g. a zoom call).

I just switched to the beta suggested and will report back.
 
It makes me wonder that this often (if not always) happens under low utilization or at idle. Maybe some low C-states are responsible for this? For example DF C-states (DF = Data Fabric, i.e. Infinity Fabric C-states). The Cache Hierarchy errors are possibly DF/IF related.
There is an option in the BIOS for disabling DF C-states. The fact that this happens mostly on Ryzen 5000, with its Resizable BAR feature for accessing VRAM on the RX 6000 series GPUs (and working with it like its own RAM), is pushing me more in this direction.
If I had such a system I would try disabling all power-saving modes for all the memory subsystems.

1. DF C-states (IF setting) = Disabled
2. PowerDownMode (DRAM setting) = Disabled
3. SoC/Uncore OC Mode (I/O Die, SoC setting) = Enabled

...and maybe also trying the different settings of "Power Supply Idle Control"
 
No, this issue was caused by a GPU low-power state in which the entire GFX domain is switched off, and it concerns the RX 6xxx series only. When someone attempts to access certain registers while the GPU is in such a state, there's a small probability of a system crash.
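
To make the mechanism a bit more concrete, here is a toy Python model of the general "check the power state before touching the register" idea. Everything in it is hypothetical (`ToyGpu`, `poll_sensor`, the register name and value); it does not talk to hardware and is not HWiNFO's actual fix, which this thread does not describe.

```python
class ToyGpu:
    """Stand-in for a GPU; nothing here talks to real hardware."""

    def __init__(self):
        self.gfx_powered = True   # False models the low-power state
        self._registers = {"hotspot_temp": 61}

    def read_register(self, name):
        # In the toy model, touching a register while the GFX domain is
        # off is treated as the risky operation and raises instead.
        if not self.gfx_powered:
            raise RuntimeError("register access while GFX domain is off")
        return self._registers[name]


def poll_sensor(gpu, name, last_value=None):
    """Read a sensor only when the GFX domain is up; otherwise keep the
    previous reading instead of touching the register."""
    if not gpu.gfx_powered:
        return last_value
    return gpu.read_register(name)


gpu = ToyGpu()
reading = poll_sensor(gpu, "hotspot_temp")           # domain on: real read
gpu.gfx_powered = False
reading = poll_sensor(gpu, "hotspot_temp", reading)  # domain off: value reused
print(reading)
```

A real tool would also have to cope with the domain powering down between the check and the read, which this toy model ignores.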
 
Well, the Cache Hierarchy Error happened again today, after upgrading to the latest HWiNFO beta.

I doubt there's anything useful in this error log, but I've attached it.

Regarding Zach's comments, I do know that changing Global C-States didn't do anything, and neither did "Power Supply Idle Control". And I had this error on the exact same system with a 3800XT installed instead.
 

Attachment: WHEA Error.txt (2.4 KB)

@brandorf I don't know if you read the replies I posted above, but after uninstalling both monitoring apps from my computer I had NO MORE WHEAs and haven't had a crash since. That's 15 days or so.
There is probably another issue causing these WHEAs.
 