WHEA Count

Ganesh_AT

Member
Martin,

Can you provide more info on WHEA and what it means exactly?

I have a system (ECS LIVA One: http://www.ecs.com.tw/ECSWebSite/Product/Product_LIVA_SPEC.aspx?DetailID=1645&LanID=0, running the latest Windows 10 Pro RTM build) that I am monitoring with HWiNFO (latest beta), and I see that the WHEA count keeps increasing every second.

The system is quite stable in all our benchmark tests, but I am a bit worried about this WHEA count.

[attachment=1766]

Thanks!
Ganesh
 

Attachments

  • whea.png
Well, that seems like quite a lot!
WHEA counts the number of Windows Hardware Error Architecture errors that have occurred. Such errors are serious: they indicate hardware failures in various components. The ones you get to notice are usually correctable errors (such as correctable cache ECC errors); the more serious ones are uncorrectable and cause a shutdown.
Please check what you see in the Event Viewer under:
Applications and Services Logs - Microsoft - Windows - Kernel-WHEA - Errors
and let me know.
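For anyone who prefers the command line to the Event Viewer, here is a minimal Python sketch that builds a `wevtutil` query for the same log. The channel name `Microsoft-Windows-Kernel-WHEA/Errors` is my assumption for how the Event Viewer path above maps to a channel, and the function names are made up for illustration.

```python
import subprocess

# Assumed channel name for the "Kernel-WHEA - Errors" log shown in Event Viewer.
WHEA_CHANNEL = "Microsoft-Windows-Kernel-WHEA/Errors"


def build_whea_query(count=5):
    """Build a wevtutil command line that asks for the newest `count` events as XML."""
    return [
        "wevtutil", "qe", WHEA_CHANNEL,
        f"/c:{count}",  # maximum number of events to read
        "/rd:true",     # reverse direction: newest events first
        "/f:xml",       # output format
    ]


def query_whea_events(count=5):
    """Run the query (Windows only) and return the raw XML text."""
    result = subprocess.run(
        build_whea_query(count), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `query_whea_events()` on the affected machine should return the same XML that the Event Viewer shows in its XML view, provided the channel name assumption holds.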
 
The screenshot from the Event Viewer is attached below.

[attachment=1767]

It seems to be an 'Information' level error, probably pointing to something non-fatal. I copied one of them over from the XML view. The raw data just looks like a random string of numbers to me. Is there some way to parse it and determine what is causing these kernel errors?

[attachment=1768]
 

Attachments

  • ecs-liva-one-whea.png
  • whea-report.txt
I have tried to find information on how to decode these errors and determine exactly what is causing them (so that HWiNFO could give more insight), but I couldn't.
I believe these are serious problems even though they are marked as "Information".
I implemented this feature at the request of NASA, which uses HWiNFO to monitor machines being tested in high radiation fields that cause serious hardware failures (sometimes permanent damage), e.g.: http://nepp.nasa.gov/workshops/etw2...Bel_ AMD Processor Radiation Test Results.pdf
Their logs show those serious failures as "Information" too.
 
I don't have access to the attachment (new member, just signed up to try and help). Could you DM me the log?
 
mikinho said:
I don't have access to the attachment (new member, just signed up to try and help). Could you DM me the log?

Still cannot download it? You should have access.
 
Yes, I'll do so when I'm home later. Just from looking at it, that is a warning for an AMD64 northbridge machine check, which is probably nothing. I highly recommend breaking the counts up into warnings versus errors. I can provide more info later today; I'm on my MacBook at a coffee shop at the moment.
 
Thanks for the information so far. Looking forward to more details.
It would also be helpful if I could simulate MCA/WHEA errors for testing. I have tried multiple error injection methods, but none of them worked. Maybe you know more about this too?
 
So I glanced at the results earlier and I was wrong: that is an error, not a warning. Event ID 21 is the warning, 20 is the error.
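The warnings-versus-errors split suggested earlier could look roughly like this. It is only a sketch: it assumes the event IDs have already been pulled out of the log as integers, it knows only the two IDs mentioned here (20 and 21), and the names are made up for illustration.

```python
from collections import Counter

# Per this thread: event ID 20 is the error, 21 is the warning.
# Any other ID is lumped into "other" because it is not covered here.
SEVERITY_BY_EVENT_ID = {20: "error", 21: "warning"}


def tally_whea_events(event_ids):
    """Count WHEA events per class instead of keeping one combined total."""
    counts = Counter(SEVERITY_BY_EVENT_ID.get(i, "other") for i in event_ids)
    return dict(counts)
```

For example, `tally_whea_events([20, 20, 21])` reports two errors and one warning rather than a single count of three.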

The RawData is just a binary representation of WHEA_ERROR_RECORD_HEADER (https://msdn.microsoft.com/en-us/library/windows/hardware/ff560487(v=vs.85).aspx) if I remember correctly.
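As a rough illustration of what decoding that header involves, here is a minimal Python sketch that reads only its leading fields. It rests on several assumptions: that the header is byte-packed and little-endian, that it starts with the `CPER` signature, that the RawData may have other bytes in front of the record (so the code searches for the signature instead of assuming offset 0), and that the severity values follow the `WHEA_ERROR_SEVERITY` enumeration. The function name is made up.

```python
import struct

# Leading fields of WHEA_ERROR_RECORD_HEADER, assumed byte-packed, little-endian:
# Signature, Revision, SignatureEnd, SectionCount, Severity, ValidBits, Length
_HEADER_PREFIX = struct.Struct("<4sHIHIII")

# Assumed to match the WHEA_ERROR_SEVERITY enumeration.
SEVERITY_NAMES = {0: "recoverable", 1: "fatal", 2: "corrected", 3: "informational"}


def parse_whea_header(raw_hex):
    """Decode the first header fields from a RawData hex string, or return None."""
    data = bytes.fromhex(raw_hex)
    start = data.find(b"CPER")  # locate the record signature
    if start < 0 or len(data) - start < _HEADER_PREFIX.size:
        return None
    (_sig, revision, _sig_end, sections,
     severity, _valid_bits, length) = _HEADER_PREFIX.unpack_from(data, start)
    return {
        "offset": start,
        "revision": revision,
        "section_count": sections,
        "severity": SEVERITY_NAMES.get(severity, f"unknown ({severity})"),
        "length": length,
    }
```

The error sections that follow the header carry the actual detail (for example the machine check data) and are not decoded here.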

I wrote a parser for WHEA a few years ago, I'll dig it up.

And in terms of generating those errors, yes. You can use EventCreate.exe

I'll provide more info when home.
 
mikinho said:
So it looks like it's well documented now: https://msdn.microsoft.com/en-us/li...537(v=vs.85).aspx?f=255&MSPPError=-2147217396

In Ganesh's case it's Event ID 20, which is WHEALOGR_XPF_AMD64NB_MCA_ERROR and correlates to https://msdn.microsoft.com/en-us/library/windows/hardware/ff559493(v=vs.85).aspx

I'll push my parser to GitHub if you are interested, but now that MSDN has it documented it isn't as helpful as it used to be :)

mikinho, Thanks for the pointers. Glad to have some more information on this.

Now, I am wondering what would cause this problem - a botched update?

I am not sure if it is related, but the OS never allows the processor to go into lower clocked states - sustained 3.1 GHz and ~50% CPU loading (as shown in the task manager picture below):

[attachment=1769]

Stopping the Diagnostic Policy Service brings CPU usage back to normal, but a restart brings the service up again.

I am going to try reinstalling Windows on the machine and check.
 

Attachments

  • dps-cpuload.png
I've seen a bad update cause this, or a "good" update that patched something in the kernel that a bad driver used as a shortcut.
 
The Diagnostic Policy Service high CPU usage and WHEA are definitely interrelated.

I had the HWiNFO window open with the Task Manager beside it. With the DPS service on, the WHEA count keeps increasing. Once the service is stopped, the WHEA count freezes. I now see that the DPS service restarts automatically after some time, and the WHEA count resumes increasing along with it.
 
Thanks mikinho for the details. Now it makes more sense, though parsing is not trivial. Your WHEA parser might be really useful for me.
The EventCreate tool is really simple and I don't think it can be used to simulate real WHEA errors with the full set of data.

Ganesh - I think the high load caused by the Diagnostic Policy Service is just the result of those errors, not the cause. As you have seen, there are tons of errors occurring, and just receiving, processing and logging that many errors can cause a higher load. So if you disable the DPS, I think they are just not logged, but still happen. The problem must be somewhere else. I'll try to decode your dump according to the known details and see more exactly what it is.
 
I just found out that a stock install of Windows 10 Pro x64 10586 (latest RTM) also exhibits the WHEA issue.

If I set DPS to Manual startup and restart, the system seems to run OK with low CPU load / power consumption. At this point, I am not sure whether it is a Microsoft problem or whether there is something seriously wrong with the PC itself (it is, after all, a pre-production sample).
 
I think I made some progress in decoding the error data, so I can now report a bit more detailed stats.
Is that machine still producing them, and could you run a new test build there?
 
I should also be able to decode all details from those errors and might be able to tell you more about them, but for that it would be much better if you could export at least a few of those errors in *.evtx format (that makes it much easier to simulate them here).
 
And BTW, the fact that those errors are marked as "Information" means they are of the correctable type. Uncorrectable ones should be marked as Error, but these obviously cannot be caught by an application at runtime, because they result in a BSOD (or reset).
For some pre-production samples, such errors can also be caused by an erratum in the CPU or BIOS that might be fixed later - either an erratum in CPU functionality, or in the MCA reporting logic.
 