WHEA errors "Memory controller": what does it mean?

highpec

Member
Hi,

I'm trying to figure out what is causing these WHEA errors on my dual-Xeon build. HWiNFO reports the errors as:

"CPU Memory Controller Errors"

I thought I had fixed this by disabling NUMA, but now the errors are back. They seem to be thrown whenever CPU usage stays above some threshold for a while. The exact percentage is unclear, but usage is practically never above 50%, so the threshold must be below that. Power draw doesn't look like the cause either: I have a wattage meter plugged into this PC, and it never goes above 300 W at 100% load, while the PSU is a 750 W EVGA SuperNOVA G3.

No BSoDs, no memory dumps, no program crashes...

Here is the error:

<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
  <Provider Name="Microsoft-Windows-Kernel-WHEA" Guid="{7B563579-53C8-44E7-8236-0F87B9FE6594}" /> 
  <EventID>20</EventID> 
  <Version>0</Version> 
  <Level>4</Level> 
  <Task>0</Task> 
  <Opcode>0</Opcode> 
  <Keywords>0x4000000000000800</Keywords> 
  <TimeCreated SystemTime="2018-08-15T04:32:53.637014900Z" /> 
  <EventRecordID>77</EventRecordID> 
  <Correlation /> 
  <Execution ProcessID="4" ThreadID="72" /> 
  <Channel>Microsoft-Windows-Kernel-WHEA/Errors</Channel> 
  <Computer>DualXeon-PC</Computer> 
  <Security UserID="S-1-5-18" /> 
  </System>
  <EventData>
  <Data Name="Length">873</Data> 
  <Data Name="RawData">435045521002FFFFFFFF0300020000000200000069030000352004000F0812140000000000000000000000000000000000000000000000000000000000000000BDC407CF89B7184EB3C41F732CB57131B18BCE2DD7BD0E45B9AD9CF4EBD4F8908948C48E3E34D401000000004552000000000000000000000000000000000000580100004900000001020000010000001411BCA5646FDE4EB8633E83ED7C83B100000000000000000000000000000000020000000000000000000000000000000000000000000000A1010000C00000000102000000000000ADCC7698B447DB4BB65E16F193C4F3DB0000000000000000000000000000000002000000000000000000000000000000000000000000000061020000080100000102000000000000011D1E8AF94257459C33565E5CC3F7E8000000000000000000000000000000000200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000057010000000000000002000000000000E4060300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000100000001000000E5BB45085134D40101000000000000000000000000000000000000000D000000C00008004900008C00A4A574000000008C121000104008090000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000</Data> 
  </EventData>
  </Event>
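
For anyone who wants to look inside the RawData value, here is a minimal Python sketch that decodes the fixed header at the start of the blob. It assumes the blob follows the Windows WHEA error record layout (a 128-byte little-endian header beginning with the `CPER` signature) and that a severity value of 2 means "corrected". Both are my reading of the format, not something stated in this thread. The names `HEADER`, `SEVERITY`, `HEADER_HEX` and `parse_header` are made up for this example.

```python
import struct
import uuid

# Assumed layout of the 128-byte record header (little-endian, no padding).
HEADER = struct.Struct("<4sHIHIII8s16s16s16s16sQIQ12s")

# Assumed severity encoding; treat these labels as an assumption.
SEVERITY = {0: "recoverable", 1: "fatal", 2: "corrected", 3: "informational"}

# First 128 bytes of the RawData value from the event above.
HEADER_HEX = "".join([
    "435045521002FFFFFFFF030002000000",  # signature, revision, end marker, section count, severity
    "0200000069030000352004000F081214",  # valid bits, record length, timestamp
    "00" * 32,                           # platform ID and partition ID (all zero)
    "BDC407CF89B7184EB3C41F732CB57131",  # creator ID
    "B18BCE2DD7BD0E45B9AD9CF4EBD4F890",  # notification type
    "8948C48E3E34D401",                  # record ID
    "00000000",                          # flags
    "4552000000000000",                  # persistence info
    "00" * 12,                           # reserved
])


def parse_header(raw: bytes) -> dict:
    """Decode the fixed-size header at the start of a WHEA RawData blob."""
    (sig, revision, _end, sections, severity, _valid, length, ts,
     _platform, _partition, creator, notify,
     _record_id, _flags, _persist, _reserved) = HEADER.unpack_from(raw)
    if sig != b"CPER":
        raise ValueError("RawData does not start with the CPER signature")
    # Timestamp bytes, assumed to be plain binary values:
    # seconds, minutes, hours, flags, day, month, year, century.
    sec, minute, hour, _flag, day, month, year, century = ts
    return {
        "revision": hex(revision),
        "sections": sections,
        "severity": SEVERITY.get(severity, str(severity)),
        "length": length,
        "timestamp": f"{century:02d}{year:02d}-{month:02d}-{day:02d} "
                     f"{hour:02d}:{minute:02d}:{sec:02d}",
        "creator": str(uuid.UUID(bytes_le=creator)),
        "notify_type": str(uuid.UUID(bytes_le=notify)),
    }


print(parse_header(bytes.fromhex(HEADER_HEX)))
```

On the bytes above this reports 3 sections, a record length of 873 (matching the Length field of the event) and a timestamp of 2018-08-15 04:32:53, which agrees with the event's SystemTime. If the severity mapping is right, the record describes a corrected error, which would be consistent with the absence of BSoDs.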

The CPUs are cooled by 2x H80i v2s and never go above 45-50C, even under high load. I run this same program on an i9-7940X and on a single Xeon E5-2680 with no problems. I've never even seen a WHEA error before this, as I'm not an overclocker.

The voltages in HWiNFO are all showing the proper numbers. The voltages per socket differ slightly (very slightly), but I'm assuming that is to be expected on this type of setup. The back VRMs are a bit hotter because they're not getting much airflow, but I mean... we're talking like 45-50C max. Also, the DIMM slots stay reasonably cool, usually below 40C.

Both CPUs:
Package/Ring Thermal Throttling - No
Package/Ring Critical Temperature - No
Package/Ring Power Limit Exceeded - No

I can't seem to find any information on what causes Event ID 20 besides some vague language from Microsoft that doesn't really make sense. I know the guys at HWiNFO are really knowledgeable about this stuff so I was hoping someone could help.

Note that there is no BCLK overclocking or anything; everything is 100% stock. Both Xeons are identical: E5-2695 v2 on an ASUS Z9PA-D8 LGA2011 dual-socket ATX board. Memory is 64 GB of Micron DDR3-1600 ECC. All memory is properly recognized in both BIOS and Windows.

- - - - - - - - - - - - - - - - -

Running Intel® Processor Diagnostic Tool and will report back with results.

100% pass on both CPUs.
Memory Stress Test Passed (CPU1)
Memory Stress Test Passed (CPU2)
Integrated Memory Controller Stress Test Passed (CPU1)
Integrated Memory Controller Stress Test Passed (CPU2)

Last thing: in Device Manager, with hidden devices shown, it seems the MEI NULL HECI device isn't installed (it only shows up under hidden devices). I installed it, although somehow I doubt it matters. We'll see.

Please let me know if there's any way you could help me figure this out. Full test results attached to this post.

Happening a lot:

[img=500x500]https://i.imgur.com/EdXr1oK.jpg[/img]

Also just found this thread:

https://www.hwinfo.com/forum/Thread-WHEA-Count

I'm starting to wonder if it's the motherboard NIC, because the program I'm running uses many threads simultaneously. I have HWiNFO debug mode on, waiting for the next WHEA error, and will post the result.

Attachments

  • Full-Test-Results.txt
    34.6 KB · Views: 2
Further investigation:

Process ID 4 = "System"

Viewing its threads in Process Explorer:

The ThreadID in a recent WHEA event is 48.

ThreadID 48 = ntoskrnl.exe!KeGetCurrentProcessorNumberEx+0x34

Update: HWiNFO debug file captured after a WHEA error. It won't let me attach it here... It says "post empty" after it finishes uploading...

"There has been an error due to your post data being empty. This could be due to a browser page refresh or direct access to this page. We recommend you press the browser back button and begin again."
 
Here is what appears to be the relevant part of the debug file, hopefully this helps.

WHEA ERR
WHEA EventID = 20
WHEA Rec (873)
      00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F   0123456789ABCDEF
------------------------------------------------------------------------
0000: 43 50 45 52 10 02 FF FF FF FF 03 00 02 00 00 00   CPERÿÿÿÿÿÿÿÿ
0010: 02 00 00 00 69 03 00 00 3A 29 06 00 0F 08 12 14   ÿÿÿiÿÿ:)ÿ
0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0040: BD C4 07 CF 89 B7 18 4E B3 C4 1F 73 2C B5 71 31   ½Äω·N³Äs,µq1
0050: B1 8B CE 2D D7 BD 0E 45 B9 AD 9C F4 EB D4 F8 90   ±‹Î-×½E¹­œôëÔø
0060: 27 DD 6D 05 5D 34 D4 01 00 00 00 00 45 52 00 00   'Ým]4ÔÿÿÿÿERÿÿ
0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0080: 58 01 00 00 49 00 00 00 01 02 00 00 01 00 00 00   XÿÿIÿÿÿÿÿÿÿÿ
0090: 14 11 BC A5 64 6F DE 4E B8 63 3E 83 ED 7C 83 B1   ¼¥doÞN¸c>ƒí|ƒ±
00A0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
00B0: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
00C0: 00 00 00 00 00 00 00 00 A1 01 00 00 C0 00 00 00   ÿÿÿÿÿÿÿÿ¡ÿÿÀÿÿÿ
00D0: 01 02 00 00 00 00 00 00 AD CC 76 98 B4 47 DB 4B   ÿÿÿÿÿÿ­Ìv˜´GÛK
00E0: B6 5E 16 F1 93 C4 F3 DB 00 00 00 00 00 00 00 00   ¶^ñ“ÄóÛÿÿÿÿÿÿÿÿ
00F0: 00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0110: 61 02 00 00 08 01 00 00 01 02 00 00 00 00 00 00   aÿÿÿÿÿÿÿÿÿÿ
0120: 01 1D 1E 8A F9 42 57 45 9C 33 56 5E 5C C3 F7 E8   ŠùBWEœ3V^\Ã÷è
0130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0140: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0150: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0160: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0170: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0190: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
01A0: 00 57 01 00 00 00 00 00 00 00 02 00 00 00 00 00   ÿWÿÿÿÿÿÿÿÿÿÿÿÿ
01B0: 00 E4 06 03 00 00 00 00 00 00 00 00 00 00 00 00   ÿäÿÿÿÿÿÿÿÿÿÿÿÿ
01C0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
01D0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
01E0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
01F0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0210: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0220: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0230: 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0240: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0250: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0260: 00 01 00 00 00 01 00 00 00 9D 34 74 10 63 34 D4   ÿÿÿÿÿÿÿ4tc4Ô
0270: 01 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0280: 00 00 00 00 00 0D 00 00 00 C0 00 08 00 49 00 00   ÿÿÿÿÿ ÿÿÿÀÿÿIÿÿ
0290: 8C 00 A4 A5 74 00 00 00 00 8C 12 10 00 10 40 08   Œÿ¤¥tÿÿÿÿŒÿ@
02A0: 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
02B0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
02C0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
02D0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
02E0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
02F0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0310: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0320: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0330: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0340: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0350: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
0360: 00 00 00 00 00 00 00 00 00                        ÿÿÿÿÿÿÿÿÿ
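
The bytes from offset 0080 to 0157 of the record above look like three section descriptors, each pointing at one payload further down in the record. The following Python sketch lists them. It assumes each descriptor is 72 bytes long with the field order shown in the comment, and the section names attached to the GUIDs are my own understanding of them, so treat both as assumptions. `DESCRIPTOR`, `KNOWN_SECTIONS`, `DESCRIPTORS_HEX` and `parse_descriptors` are hypothetical names for this example.

```python
import struct
import uuid

# Assumed layout of one 72-byte section descriptor (little-endian, no padding):
# offset, length, revision, valid bits, reserved, flags, type GUID, FRU GUID,
# severity, FRU text.
DESCRIPTOR = struct.Struct("<IIHBBI16s16sI20s")

# Section type GUIDs as I understand them; the names are assumptions.
KNOWN_SECTIONS = {
    "a5bc1114-6f64-4ede-b863-3e83ed7c83b1": "platform memory error",
    "9876ccad-47b4-4bdb-b65e-16f193c4f3db": "processor generic error",
    "8a1e1d01-42f9-4557-9c33-565e5cc3f7e8": "x86/x64 MCA data",
}

# Every descriptor in this record ends the same way:
# zeroed FRU ID, section severity 2, zeroed FRU text.
TAIL = "00" * 16 + "02000000" + "00" * 20

# Bytes 0080..0157 of the record dump above.
DESCRIPTORS_HEX = "".join([
    "5801000049000000", "0102000001000000", "1411BCA5646FDE4EB8633E83ED7C83B1", TAIL,
    "A1010000C0000000", "0102000000000000", "ADCC7698B447DB4BB65E16F193C4F3DB", TAIL,
    "6102000008010000", "0102000000000000", "011D1E8AF94257459C33565E5CC3F7E8", TAIL,
])


def parse_descriptors(raw: bytes, count: int, start: int = 128) -> list:
    """Return (offset, length, section type) for each section descriptor."""
    sections = []
    for i in range(count):
        (offset, length, _rev, _valid, _res, _flags,
         type_guid, _fru, _sev, _text) = DESCRIPTOR.unpack_from(
            raw, start + i * DESCRIPTOR.size)
        name = str(uuid.UUID(bytes_le=type_guid))
        sections.append((offset, length, KNOWN_SECTIONS.get(name, name)))
    return sections


# start=0 because DESCRIPTORS_HEX holds only the descriptors, not the header.
for section in parse_descriptors(bytes.fromhex(DESCRIPTORS_HEX), 3, start=0):
    print(section)
```

The last descriptor's offset plus length (609 + 264) comes to 873, the full record size, which suggests the assumed layout fits this record. With a complete record in hand, the default `start=128` skips the header instead.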

IPMI Resp[0,4]:
WHEA MEM: 0 0 0
      00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F   0123456789ABCDEF
------------------------------------------------------------------------
0000: 28 C0 C0 00                                       (ÀÀÿ

WHEA CPU Gen: 0
IPMI sens[31] = 40.00
WHEA MCA: C0 8
IPMI Req [NET:04 LUN:00 CMD:2D]:
      00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F   0123456789ABCDEF
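
The "WHEA MCA: C0 8" line appears to come from a machine-check status value stored near the end of the record. Below is a minimal Python sketch that splits such a value into its architectural fields. It assumes the eight bytes at offsets 0289 to 0290 of the dump (right after the bank number 0D 00 00 00) are the status register, and it uses the bit positions and the memory controller error code format as I understand them from Intel's machine-check documentation. `MEMORY_OPS`, `STATUS_HEX` and `decode_mci_status` are hypothetical names, and the model-specific code is deliberately left undecoded.

```python
# Assumed meaning of the MMM field in a memory controller error code
# of the form 000F 0000 1MMM CCCC (F = filter bit, CCCC = channel).
MEMORY_OPS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}

# Assumption: these are the status bytes at offsets 0289..0290 of the dump.
STATUS_HEX = "C00008004900008C"


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine-check status value into its main fields."""
    code = status & 0xFFFF
    info = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "context_corrupt": bool((status >> 57) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": code,
        "model_code": (status >> 16) & 0xFFFF,  # CPU-specific, not decoded here
    }
    # Memory controller errors: every bit above bit 7 is zero except the
    # filter bit (bit 12), and bit 7 is set.
    if (code & 0xEF80) == 0x0080:
        channel = code & 0xF
        info["memory_error"] = MEMORY_OPS.get((code >> 4) & 0x7, "reserved")
        info["channel"] = "unspecified" if channel == 0xF else channel
    return info


status = int.from_bytes(bytes.fromhex(STATUS_HEX), "little")
for key, value in decode_mci_status(status).items():
    print(f"{key}: {value}")
```

For the value above this yields a valid error that is not flagged as uncorrected, with MCA code 0xC0 and model-specific code 0x8, the same pair HWiNFO prints as "WHEA MCA: C0 8". Under the assumed encoding, 0xC0 reads as a memory scrubbing error on channel 0. What the model-specific code 0x8 means depends on the CPU model, so it is not interpreted here.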
 
Such an error can have several causes, and I don't think it's thermal-related. I'd rather say it's related to the configuration of the memory controller, perhaps the timings.
In that case, the best course would be to upgrade the BIOS to the latest version or contact the mainboard manufacturer.
 
Martin said:
Such an error can have several causes, and I don't think it's thermal-related. I'd rather say it's related to the configuration of the memory controller, perhaps the timings.
In that case, the best course would be to upgrade the BIOS to the latest version or contact the mainboard manufacturer.

Can you take a look at the debug file? Let's see if I can upload a ZIP. I don't know why it won't let me attach this file, even as a ZIP.

I have the latest ASUS BIOS.
 
Sorry, but such a problem can't be diagnosed from this data.
I suggest checking the compatibility of the memory modules with your BIOS and contacting ASUS support.
 