Hi,
I'm trying to figure out what is causing these WHEA errors on my dual Xeon build. HWiNFO reports the errors as:
"CPU Memory Controller Errors"
I thought I fixed this by disabling NUMA, but now they are back. The errors seem to be thrown whenever my CPUs stay above some usage level for a while. I'm not sure what that level is, but since usage is practically never above 50%, it must be below 50%. I have a wattage meter plugged into this PC and it never reads above 300 W at 100% load, while the PSU is a 750 W EVGA SuperNOVA G3, so there should be plenty of headroom.
No BSODs, no memory dumps, no program crashes...
Here is the error:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-Kernel-WHEA" Guid="{7B563579-53C8-44E7-8236-0F87B9FE6594}" />
<EventID>20</EventID>
<Version>0</Version>
<Level>4</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x4000000000000800</Keywords>
<TimeCreated SystemTime="2018-08-15T04:32:53.637014900Z" />
<EventRecordID>77</EventRecordID>
<Correlation />
<Execution ProcessID="4" ThreadID="72" />
<Channel>Microsoft-Windows-Kernel-WHEA/Errors</Channel>
<Computer>DualXeon-PC</Computer>
<Security UserID="S-1-5-18" />
</System>
<EventData>
<Data Name="Length">873</Data>
<Data Name="RawData">435045521002FFFFFFFF0300020000000200000069030000352004000F0812140000000000000000000000000000000000000000000000000000000000000000BDC407CF89B7184EB3C41F732CB57131B18BCE2DD7BD0E45B9AD9CF4EBD4F8908948C48E3E34D401000000004552000000000000000000000000000000000000580100004900000001020000010000001411BCA5646FDE4EB8633E83ED7C83B100000000000000000000000000000000020000000000000000000000000000000000000000000000A1010000C00000000102000000000000ADCC7698B447DB4BB65E16F193C4F3DB0000000000000000000000000000000002000000000000000000000000000000000000000000000061020000080100000102000000000000011D1E8AF94257459C33565E5CC3F7E8000000000000000000000000000000000200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000057010000000000000002000000000000E4060300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000100000001000000E5BB45085134D40101000000000000000000000000000000000000000D000000C00008004900008C00A4A574000000008C121000104008090000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000</Data>
</EventData>
</Event>
The CPUs are cooled by 2x H80i v2s and never go above 45-50 °C, even under high load. I run this same program on an i9-7940X and on a single Xeon E5-2680 with no problems. I've never even seen a WHEA error before this, as I'm not an overclocker.
The voltages in HWiNFO all show the proper numbers. The voltages per socket differ very slightly, but I'm assuming that is to be expected on this type of setup. The back VRMs are a bit hotter because they're not getting much airflow, but we're talking 45-50 °C max. The DIMM slots also stay reasonably cool, usually below 40 °C.
Both CPUs:
Package/Ring Thermal Throttling - No
Package/Ring Critical Temperature - No
Package/Ring Power Limit Exceeded - No
I can't seem to find any information on what causes Event ID 20 besides some vague language from Microsoft that doesn't really make sense. I know the guys at HWiNFO are really knowledgeable about this stuff so I was hoping someone could help.
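In case it helps with decoding: the RawData above starts with 43 50 45 52, which is ASCII "CPER", so it looks like a UEFI Common Platform Error Record. Below is a minimal Python sketch that reads the fixed header and the start of the first section descriptor, assuming the standard CPER layout (a 128-byte header followed by 72-byte section descriptors). The names `parse_cper_header`, `SAMPLE_HEX`, and the small lookup tables are made up for illustration. The sample is the first 160 bytes of the RawData, with zero-filled fields written as "00" * n. Reading the timestamp bytes as plain binary values is an assumption based on how this particular record looks (the day byte is 0x0F), not something taken from documentation.

```python
import struct
import uuid
from datetime import datetime

# First 160 bytes of the RawData above, split by (assumed) CPER field.
SAMPLE_HEX = (
    "43504552"            # signature, ASCII "CPER"
    "1002"                # revision
    "FFFFFFFF"            # signature end
    "0300"                # section count
    "02000000"            # error severity
    "02000000"            # validation bits
    "69030000"            # record length
    "352004000F081214"    # timestamp
    + "00" * 32           # platform ID + partition ID (all zero here)
    + "BDC407CF89B7184EB3C41F732CB57131"  # creator ID
    + "B18BCE2DD7BD0E45B9AD9CF4EBD4F890"  # notification type
    + "8948C48E3E34D401"  # record ID
    + "00000000"          # flags
    + "4552000000000000"  # persistence info
    + "00" * 12           # reserved
    # first section descriptor: offset, length, revision, valid, reserved, flags
    + "58010000" "49000000" "0102" "00" "00" "01000000"
    + "1411BCA5646FDE4EB8633E83ED7C83B1"  # section type
)

# Lookup tables, limited to the values needed for this record.
SEVERITY = {0: "recoverable", 1: "fatal", 2: "corrected", 3: "informational"}
NOTIFY_TYPES = {
    "2dce8bb1-bdd7-450e-b9ad-9cf4ebd4f890": "Corrected Machine Check (CMC)",
}
SECTION_TYPES = {
    "a5bc1114-6f64-4ede-b863-3e83ed7c83b1": "Platform Memory",
}


def parse_cper_header(raw_hex):
    """Decode the CPER header and the first section descriptor's type."""
    data = bytes.fromhex(raw_hex)
    (sig, revision, _sig_end, section_count, severity,
     _valid, record_length, ts) = struct.unpack_from("<4sHIHIII8s", data, 0)
    if sig != b"CPER":
        raise ValueError("not a CPER record")

    # Timestamp bytes: sec, min, hour, flag, day, month, year, century.
    # Treated as plain binary, not BCD (assumption, see lead-in).
    sec, minute, hour, _flag, day, month, year, century = ts
    timestamp = datetime(century * 100 + year, month, day, hour, minute, sec)

    # GUIDs are stored with the first three fields little-endian.
    notify = str(uuid.UUID(bytes_le=data[80:96]))

    # The first section descriptor follows the 128-byte header.
    sec_offset, sec_length = struct.unpack_from("<II", data, 128)
    sec_type = str(uuid.UUID(bytes_le=data[144:160]))

    return {
        "revision": hex(revision),
        "section_count": section_count,
        "severity": SEVERITY.get(severity, "unknown"),
        "record_length": record_length,
        "timestamp": timestamp,
        "notification_type": NOTIFY_TYPES.get(notify, notify),
        "first_section_offset": sec_offset,
        "first_section_length": sec_length,
        "first_section_type": SECTION_TYPES.get(sec_type, sec_type),
    }


if __name__ == "__main__":
    for key, value in parse_cper_header(SAMPLE_HEX).items():
        print(f"{key}: {value}")
```

On this record the sketch reports a corrected-severity error with a Corrected Machine Check notification and a Platform Memory first section. The record length comes out as 873, matching the Length field, and the decoded timestamp lines up with TimeCreated. That would point toward corrected (ECC) memory errors rather than anything fatal, but it doesn't say which DIMM or channel is involved. That detail would be in the section body, which this sketch doesn't parse.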
Note that there is no BCLK overclocking or anything; everything is 100% stock. Both Xeons are identical E5-2695 v2s running on an ASUS Z9PA-D8 dual-socket LGA2011 ATX board. Memory is 64 GB of Micron DDR3-1600 ECC. All memory is properly recognized in both BIOS and Windows.
- - - - - - - - - - - - - - - - -
Running Intel® Processor Diagnostic Tool and will report back with results.
100% Pass on both CPUs.
Memory Stress Test Passed (CPU1)
Memory Stress Test Passed (CPU2)
Integrated Memory Controller Stress Test Passed (CPU1)
Integrated Memory Controller Stress Test Passed (CPU2)
Last thing: in Device Manager I turned on "Show hidden devices", and it seems the MEI NULL HECI device wasn't installed (it only showed up under hidden devices). I installed it, although I doubt it matters. We'll see.
Please let me know if there's any way you could help me figure this out. Full test results attached to this post.
Happening a lot:
[img=500x500]https://i.imgur.com/EdXr1oK.jpg[/img]
Also just found this thread:
https://www.hwinfo.com/forum/Thread-WHEA-Count
I'm starting to wonder if it's the motherboard NIC, because the program running on this machine runs many threads simultaneously. I have HWiNFO debug mode on and will post once the next WHEA error shows up.