Is HWiNFO causing the dreaded WHEA-Logger Event ID XX Cache Hierarchy Errors and sudden reboots on AMD Ryzen systems?

Jackalito

Member
Hello everyone:

A couple of other users and I have been experiencing sudden reboots on our Ryzen-based systems (Ryzen 3000, but especially 5000) under various load conditions. The quickest way for us to trigger them, however, has been running RAM stability testers such as TM5 or Karhu RAM Test.

We have recently discovered that this problem only occurs if HWiNFO is loaded in the background on Windows 10. Most of us also have AMD Radeon graphics cards, but we have yet to determine whether that is a contributing factor. We don't know exactly where the conflict is, but the pattern is clear: after those sudden reboots, we see the dreaded WHEA-Logger Event ID XX Cache Hierarchy error in the Windows Event Viewer.

This has been tested by multiple users across different setups: motherboard manufacturers, AGESA/BIOS revisions, RAM brands and configurations, settings, and even after a fresh install of Windows (including different versions of the operating system). The only common denominator we have found so far is the use of HWiNFO (we've only tested the latest versions, so we don't know yet whether rolling back to a specific previous version solves it).

I'm sharing this information here in the hope that this problem can be reproduced and fixed accordingly. Perhaps it will also require collaboration with AMD.

Thank you for your time.
 

Martin

HWiNFO Author
Staff member
I'm not aware of such an issue yet, and it might be quite tough to pinpoint.
We need to know more precise criteria for when this happens and what monitoring software is running. Is there perhaps some other monitoring tool running alongside HWiNFO, e.g. AMD Ryzen Master?
 

Jackalito

Member
Hi Martin.

The only monitoring tool running in our case is HWiNFO. Let me add some specifics about my personal hardware/software setup:

Hardware (firmware):
Motherboard: ASUS Crosshair VIII Hero X570 (currently using BIOS 3003 with AGESA 1.1.8.0 and SMU 56.37.0, although the problem persists with newer AGESA/SMU firmware as well)
CPU: AMD Ryzen 7 5800X (I have reproduced this exact problem using two different samples of the same processor)
CPU Cooler: Arctic Liquid Freezer II 280 AIO
RAM: 2x8GB DDR4 G.SKILL TRIDENTZ F4-3600C15D-16GTZ Single Rank (the issue persists regardless of its configuration, including XMP profile or even at 2133MHz)
GPU: Radeon RX 6800 XT 16GB XFX Speedster Merc319 Black (VBIOS: 020.001.000.044.000000) fed by two separate PCIe power cables from the PSU; again, the issue occurs with every compatible graphics driver version AMD has released so far
System drive: 1x SSD Samsung M.2 NVMe 970 Evo Plus updated to its latest firmware (2B2QEXM7)
PSU: Corsair RM850 Modular 850W Gold Plus

More information (software):
Third-party tools used in the background:
Logitech Software (G Hub updated to its latest version, v2020.12.9532) to control peripherals: keyboard (Logitech Gaming G513 Carbon), mouse (Logitech G903) and headset (Logitech G PRO X). Note: both the keyboard and mouse are also updated to their newest firmware
Radeon Software (latest version: 2021.0118.2234.40618, Drivers 21.1.1, January 2021)
Internet Download Manager (updated to its latest version: 6.38 Build 16)

No ASUS software installed on Windows

Please, let me know if there's anything else I can share with you.

Thank you in advance for your time.
 

Martin

HWiNFO Author
Staff member
Thanks for the details.
This will probably need to be tested by changing various settings in HWiNFO to see whether any of them has an effect on this.
First, I'd recommend enabling the "Snapshot CPU Polling" mode and checking whether the problem persists there.
If it does, try disabling some sensors (an entire sensor is disabled by pressing Del over its heading) to see if you can find the one causing this. I'd perhaps start with the GPU one.
 

Jackalito

Member
Thanks for the suggestions.

I'll get back to you as soon as I can.
 

Zach

Well-Known Member
With AGESA versions later than V2 1.1.0.0 Patch D, a lot of users experience WHEA errors regarding cache hierarchy. AGESA V2 1.1.0.0D is the safest and most stable version you can use as of now.
Is this BIOS with 1.1.8.0 a beta or a final release? Also, going beyond 1800MHz on MEMCLK, UCLK and FCLK can produce such errors with the 1.1.8.0~1.2.0.0 versions that some (not all) vendors have published.
The same users can reach 1900~2000MHz on all MEM/U/F-CLKs without cache hierarchy errors on V2 1.1.0.0D. I don't think this is HWiNFO's fault, even if the software's presence triggers such errors.
IMHO, the early and buggy AGESA version is responsible. Those of us who try new BIOS versions, and especially new AGESA microcode, can consider ourselves beta testers.
 

Jackalito

Member
I agree with what you're saying. Ultimately, as you just said, it could be a case where HWiNFO is triggering an inherent problem. However, if I update my motherboard to the most recent UEFI firmware, the problem is triggered even faster. Also, there was never a BIOS with AGESA 1.1.0.0 Patch D for my motherboard, and the ones with just the base code or Patch C exhibit the same behavior. Moreover, the BIOS I'm currently using is not a beta.

I will first test @Martin's suggestions, and if nothing works, I may try the new non-beta BIOS from ASUS with AGESA 1.2.0.0, which was released just a few days ago. But trust me, this thing has been driving me crazy, because otherwise the system is completely stable and crash-free, and I've been testing a wide array of UEFI firmwares and AGESA revisions.

And yeah, you're right that ultimately we've been beta-testing for AMD. It's a shame, because otherwise the platform is versatile, powerful and power efficient.
 

Zach

Well-Known Member
I think the most stable BIOS for your board is v2702, whether it's Patch C (same as v2502) or Patch D.
Yeah, I wouldn't expect ASUS to label any of its releases as beta.

Gigabyte on the other hand is following a different path.
The latest final version is F32 (V2 1.1.0.0 D) and the latest beta is F33a (V2 1.2.0.0). They didn't bother with 1.1.8.0 at all.
The "a" in F33a indicates an early beta, and it is very rare for an "a" to appear on a published beta. Betas usually start at b/c and go up to f or even j before the final (no-letter) version is published. Gigabyte users also experience cache hierarchy errors on V2 1.2.0.0 (F33a). Whether all of them are using HWiNFO, I don't know.

The AM4 platform (on ZEN2/3) is pretty solid as of now. Let's not forget that the official speed for MEM/U/F-CLK is 1600MHz. User expectations, though, may have grown disproportionately after some ZEN2 chips proved able to clock the aforementioned subsystems to 1900MHz. Few of them are rock-solid at that speed, though, even if their users don't realize it. A lot of blame for random crashes has also been directed at Navi 1/2 GPUs (mostly their drivers).

After the ZEN2 release back in July 2019, once the public saw how boost was implemented, it became fairly clear to me that the era of OC was coming to an end. Yes, Ryzen 5000 introduced the Curve Optimizer, but again I see some users overusing it out of the same inflated expectations. I've seen some 5900X/5950X chips hitting 200~250W PPT... and I really hope AMD doesn't come to regret that. Users may regret it sooner.

So, does your board have any issues under v2702? If not, then stay there for a while. Most Gigabyte users are staying on F32 until the next stable AGESA. But of course, the industry needs beta testers too. I'm one of them, willingly.
 

Bloot

Member
Hello,

I just registered on your forums to confirm that the same happens on my system(s):

B550 Tomahawk (with any bios version)
5800X (I got a replacement through AMD RMA and the same thing happens)
Reference 6800 XT
32GB RAM, either my previous G.Skill Trident Z 3866 F4-3866C18Q-32GTZ 4x8GB kit or the new one I bought thinking it was a RAM problem (Crucial Ballistix 3600MHz CL16 BL2K16G36C16U4B 2x16GB)
Seasonic Prime Ultra Platinum 1000W

The latest HWiNFO versions make the CPU crash, and I suspect it has to do with the GPU sensors, because it does not crash without graphics drivers installed (HWiNFO shows no GPU sensors when no graphics driver is on the system). The system just freezes, then after a few seconds it reboots to a black screen, with no BSOD. After the system boots again, the Windows Event Viewer shows a WHEA 18 cache hierarchy error. It happens with every graphics driver from 20.11.2 to the most recent.

The fact is it didn't happen before. I think it started in early January, so it must be something that changed in recent HWiNFO builds.

I also tested it on the platform I had before the 5800X + B550 combo replaced it:

Crosshair VI Hero (with any bios version)
3900X
Reference 6800XT
32GB RAM G.Skill Trident Z 3866 F4-3866C18Q-32GTZ
Seasonic Prime Ultra Platinum 1000W

The same problem happens: recent HWiNFO builds make the CPU crash with a WHEA 18 cache hierarchy error.

I'll be testing the stable 6.40 release, as I suspect it doesn't behave this way. I've been using HWiNFO since I got the 6800 XT in late November and had no problems until early January. I'm not sure yet and have to test it, but as I said, I encountered no problems while using it for roughly a month.

And I can confirm that with HWiNFO not running, my system no longer crashes.

Greetings
 

Martin

HWiNFO Author
Staff member
Thanks for your feedback.
The GPU was my suspect as well, and both of you have the Navi21 (RX 6800 XT). HWiNFO version 6.40 added enhanced support for these GPUs, so maybe it has something to do with that.
It would be interesting to see whether the problem happens with GPU sensor monitoring disabled in HWiNFO.
 

kr0mka

New Member
Hey there,

I just wanted to +1 this. I've been trying to troubleshoot cache hierarchy WHEAs on my 5900X + 6800 XT combo for over a month now; I even RMA'd the CPU and ran a 3200G in the meantime. I experienced WHEAs with all of the CPUs (the old and new 5900X) when running the 6800 XT with them, except the 3200G, where the whole system just froze at idle without a WHEA.

When I ran the 3200G without the 6800 XT in the system, the crashes stopped. Then yesterday I noticed a thread about this on Reddit and realized I'd been running HWiNFO64 in the background the whole time.

Since I disabled HWiNFO yesterday I haven't had any WHEA reboots, but it hasn't been long enough to conclude it was 100% HWiNFO's fault. I'll test more.
 

Clock

New Member
Hi

It's interesting to stumble across this, and I just thought I'd post about mine too.

I had a 5900X paired with a GTX 1080 Ti for a few weeks and always used HWiNFO without any problems. However, the very day I installed an RX 6800 with the 5900X, I started to get WHEA cache hierarchy errors and black-screen reboots at idle. I initially thought the GPU was at fault, but since Event Viewer lists the WHEA errors and there are many threads about WHEA issues with Ryzen, I started to assume the timing was coincidental.

I can't be certain, but it's possible HWiNFO was open during each reboot. I hadn't thought about that until reading this thread. I will close it completely and see how it goes. The system would often reboot randomly three times a day, but ironically I would have had HWiNFO open, trying to identify a pattern or anything unusual. I'll report back if I notice anything noteworthy. Thanks for the suggestion, OP.
 

PoMpIs

New Member
Hello Martin :)

I'm another one of those affected. This is the error:

[Screenshot: eUW3i8G.png]


It appears when the system is idle. If I close HWiNFO, it doesn't happen and everything works fine.

My system:

Asus B550 E gaming

Ryzen 5950x

Powercolor RX6800XT

32GB Ram G.Skill Trident Z RGB DDR4-4400MHz CL18-19-19-39 1.40V @ 3800Mhz

EVGA Supernova 750 G2

Greetings
 

Martin

HWiNFO Author
Staff member
OK, so this seems to be sufficient proof that the Navi21 GPU (RX 6800) is the culprit. Can anyone try running with monitoring of the GPU sensor disabled?
 

Bloot

Member
The thing is, I was using HWiNFO for a while and it showed all of my 6800 XT sensors without issues. The problem started with the builds released in early January (maybe the first or second week). Maybe it's related to the memory junction sensor added for the RTX 3000 series? The 6800 XT has shown this sensor since I bought it; maybe it's interfering with the new one?

I don't know. I will test 6.40 and let you know whether this problem reproduces.
 

Martin

HWiNFO Author
Staff member
Thanks, but disabling the GPU in Device Manager is not the best solution :D
Please try disabling monitoring of the GPU sensor in HWiNFO by pressing the Del key over its heading.
 

Jackalito

Member
Martin, I enabled "Snapshot CPU Polling" last night, then left my computer running Karhu RAM Test and went to sleep.
The computer rebooted again, however, so that did not help.

[Screenshot: WHEA.png]

It must have been triggered about 30 minutes into the RAM Test run.
Any other suggestions?

Thanks!
 

Jackalito

Member

Regarding v2702: yes, it exhibits the same issue. I went back as far as 2502 and the problem still persisted. Again, even with the RAM at 2133MHz and the IF at 1:1 for that frequency, the reboots happen as long as HWiNFO is loaded. I have no idea why, which is why I opened this thread.
 

Martin

HWiNFO Author
Staff member
Please try my suggestion to test with the GPU sensor disabled.
Meanwhile, I'm working on a test build to check whether that might fix it. Stay tuned.
 