WHEA Count

Ganesh_AT · Jan 26, 2016

Martin,

Can you provide more info on WHEA and what it means exactly?

I have a system [ ECS LIVA One : http://www.ecs.com.tw/ECSWebSite/Product/Product_LIVA_SPEC.aspx?DetailID=1645&LanID=0 : running Windows 10 Pro latest RTM build ] that I am monitoring via HWiNFO (latest beta), and I see that the WHEA count keeps increasing every second.

The system is quite stable in all our benchmark tests, but I am a bit worried about this WHEA count.

[attachment=1766]

Thanks!
Ganesh

Martin · Jan 26, 2016

Well, that seems like quite a lot!
The WHEA counter shows the number of Windows Hardware Error Architecture (WHEA) errors that have occurred. Such errors are serious: they indicate hardware failures in various components. The ones you notice while the system keeps running are usually correctable errors (like correctable cache ECC errors); the more serious ones are uncorrectable and cause a shutdown.
Please check in the Event Viewer under:
Application and Service Logs - Microsoft - Windows - Kernel-WHEA - Errors
what you see there and let me know.
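For anyone who prefers the command line to the Event Viewer, here is a minimal Python sketch that reads the same log through the built-in `wevtutil` tool. The channel name is an assumption derived from the Event Viewer path above, and the helper names `build_query_command` and `read_recent_whea_events` are made up for illustration.

```python
import subprocess

# Assumed channel name, derived from the Event Viewer path quoted above.
WHEA_CHANNEL = "Microsoft-Windows-Kernel-WHEA/Errors"


def build_query_command(max_events=5):
    """Return the wevtutil argument list for the newest WHEA events as XML."""
    return [
        "wevtutil", "qe", WHEA_CHANNEL,
        f"/c:{max_events}",  # limit the number of events returned
        "/rd:true",          # reverse direction: newest events first
        "/f:xml",            # XML output, like the Event Viewer XML view
    ]


def read_recent_whea_events(max_events=5):
    """Run the query (Windows only) and return the XML text wevtutil prints."""
    result = subprocess.run(
        build_query_command(max_events),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `read_recent_whea_events(10)` on the affected machine should print the ten newest entries, provided the channel exists under that name.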

Ganesh_AT · Jan 26, 2016

The screenshot from the Event Viewer is attached below.

[attachment=1767]

It seems to be an 'Information' level error - probably pointing to something non-fatal. I copied over one of them in the XML view. It just seems like a random string of numbers to me in the raw data. Is there some way to parse it and determine what is causing these kernel errors?

[attachment=1768]
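As a starting point for parsing, the hex string from the XML view can be turned into bytes first. This is only a sketch: it assumes the event XML uses the standard Windows event namespace and that the payload sits in a `Data` element named `RawData`. The function name `extract_raw_data` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Standard namespace of Windows event XML.
EVENT_NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}


def extract_raw_data(event_xml, field_name="RawData"):
    """Return the named EventData field as bytes, or None if it is absent.

    Assumes the field holds a plain hex string, as it appears in the XML view.
    """
    root = ET.fromstring(event_xml)
    for data in root.findall(".//e:EventData/e:Data", EVENT_NS):
        if data.get("Name") == field_name:
            # Drop any whitespace or line breaks before decoding the hex digits.
            hex_text = "".join((data.text or "").split())
            return bytes.fromhex(hex_text)
    return None
```

The resulting bytes object can then be handed to whatever decoder is available for the record format.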

Martin · Jan 26, 2016

I have tried to find information on how to decode these errors to determine what exactly is causing them (so that HWiNFO could give more insight), but I couldn't.
I believe these are serious problems even though they are marked as "Information".
I implemented this feature at the request of NASA, which uses HWiNFO to monitor machines being tested in high-radiation fields that cause serious hardware failures (sometimes permanent damage), e.g.: http://nepp.nasa.gov/workshops/etw2...Bel_ AMD Processor Radiation Test Results.pdf
And checking their logs shows those serious failures as "Information" too.

mikinho · Jan 26, 2016

I don't have access to the attachment (new member, just signed up to try and help). Could you DM me the log?

Martin · Jan 26, 2016

mikinho said:
I don't have access to the attachment (new member, just signed up to try and help). Could you DM me the log?

Still cannot download it? You should have access.

mikinho · Jan 26, 2016

I had to sign out and back in. Working now. Thanks!

Martin · Jan 26, 2016

How can you help? Are you able to decode the RawData from the log?

mikinho · Jan 26, 2016

Yes, I'll do so when I'm home later. Just from looking at it, that is a warning for an AMD64 northbridge machine check, which is probably nothing. I highly recommend breaking up the counts into warnings versus errors. I can provide more info later today; I'm on my MacBook at a coffee shop at the moment.

Martin · Jan 26, 2016

Thanks for the information so far. Looking forward to more details.
It would also be helpful if I could simulate MCA/WHEA errors for testing. I have tried multiple error injection methods, but none of them worked. Maybe you know more about this too?

mikinho · Jan 26, 2016

So I glanced at the results earlier and I was wrong, that is an error, not a warning. 21 is a warning, 20 is the error.
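A rough sketch of what splitting the counts could look like, using only the two event IDs named in this thread (20 as error, 21 as warning). Any other ID goes into an "other" bucket, because its meaning is not covered here. The names `EVENT_ID_KIND` and `split_whea_counts` are made up for illustration.

```python
from collections import Counter

# Mapping as stated in this thread; other event IDs are not classified here.
EVENT_ID_KIND = {20: "error", 21: "warning"}


def split_whea_counts(event_ids):
    """Count WHEA events per kind instead of keeping one combined total."""
    counts = Counter(EVENT_ID_KIND.get(event_id, "other") for event_id in event_ids)
    # Always report all three buckets, even when a bucket is empty.
    return {kind: counts.get(kind, 0) for kind in ("error", "warning", "other")}
```

Feeding it the event IDs read from the Kernel-WHEA log gives separate error and warning totals.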

The RawData is just a binary representation of WHEA_ERROR_RECORD_HEADER (https://msdn.microsoft.com/en-us/library/windows/hardware/ff560487(v=vs.85).aspx) if I remember correctly.
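A minimal Python sketch of reading the first fields of that header. It assumes the RawData really starts with a byte-packed, little-endian WHEA_ERROR_RECORD_HEADER (so the "CPER" signature comes first) and covers only the leading fixed fields: Signature, Revision, SignatureEnd, SectionCount, Severity, ValidBits and Length. The function name `parse_header_prefix` is made up for illustration.

```python
import struct

# Leading fields of WHEA_ERROR_RECORD_HEADER, little-endian, no padding:
# Signature (4 bytes), Revision (USHORT), SignatureEnd (ULONG),
# SectionCount (USHORT), Severity (ULONG enum), ValidBits (ULONG), Length (ULONG)
HEADER_PREFIX = struct.Struct("<4sHIHIII")

# Values of the WHEA_ERROR_SEVERITY enumeration.
SEVERITY_NAMES = {0: "recoverable", 1: "fatal", 2: "corrected", 3: "informational"}


def parse_header_prefix(raw):
    """Decode the leading fields of an error record; raise if it does not look like one."""
    if len(raw) < HEADER_PREFIX.size:
        raise ValueError("buffer too short for a WHEA error record header")
    signature, revision, signature_end, sections, severity, valid_bits, length = (
        HEADER_PREFIX.unpack_from(raw)
    )
    if signature != b"CPER":
        # Guards the assumption that RawData begins with the record header.
        raise ValueError(f"unexpected signature {signature!r}")
    return {
        "revision": (revision >> 8, revision & 0xFF),  # (major, minor)
        "signature_end": signature_end,
        "section_count": sections,
        "severity": SEVERITY_NAMES.get(severity, f"unknown ({severity})"),
        "valid_bits": valid_bits,
        "record_length": length,
    }
```

If the signature check fails on a real event, the blob probably has some other wrapper in front of the record and the offsets need adjusting.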

I wrote a parser for WHEA a few years ago, I'll dig it up.

And in terms of generating those errors, yes. You can use EventCreate.exe

I'll provide more info when home.

mikinho · Jan 27, 2016

So it looks like it's well documented now: https://msdn.microsoft.com/en-us/li...537(v=vs.85).aspx?f=255&MSPPError=-2147217396

In Ganesh's case it's Event ID 20, which is WHEALOGR_XPF_AMD64NB_MCA_ERROR and corresponds to https://msdn.microsoft.com/en-us/library/windows/hardware/ff559493(v=vs.85).aspx

I'll push my parser to GitHub if you are interested, but now that MSDN has it documented, it isn't as helpful as it used to be.

Ganesh_AT · Jan 27, 2016

mikinho said:
So it looks like it's well documented now: https://msdn.microsoft.com/en-us/li...537(v=vs.85).aspx?f=255&MSPPError=-2147217396

In Ganesh's case it's Event ID 20, which is WHEALOGR_XPF_AMD64NB_MCA_ERROR and corresponds to https://msdn.microsoft.com/en-us/library/windows/hardware/ff559493(v=vs.85).aspx

I'll push my parser to GitHub if you are interested, but now that MSDN has it documented, it isn't as helpful as it used to be.

mikinho, Thanks for the pointers. Glad to have some more information on this.

Now, I am wondering what would cause this problem - a botched update?

I am not sure if it is related, but the OS never allows the processor to drop into lower-clocked states. It stays at a sustained 3.1 GHz and ~50% CPU load (as shown in the Task Manager picture below):

[attachment=1769]

Stopping the Diagnostic Policy Service brings CPU usage back to normal, but a restart brings the service up again.

I am going to try reinstalling Windows on the machine and check.

mikinho · Jan 27, 2016

I've seen a bad update cause this, or a "good" update that patched something in the kernel that a bad driver used as a shortcut.

Ganesh_AT · Jan 27, 2016

The Diagnostic Policy Service high CPU usage and WHEA are definitely interrelated.

I had the HWiNFO window open and the Task Manager on the side. With the DPS service running, the WHEA count keeps increasing. Once the service is stopped, the WHEA count freezes. I see now that the DPS service restarts automatically after some time, and along with that the WHEA count resumes increasing.

Martin · Jan 27, 2016

Thanks mikinho for the details. Now it makes more sense, though parsing is not trivial. Your WHEA parser might be really useful for me.
The EventCreate tool is really simple and I don't think it can be used to simulate real WHEA errors with the full set of data.

Ganesh - I think the high load caused by the Diagnostic Policy Service is just the result of those errors, not the cause. As you have seen, there are tons of errors occurring, and just receiving, processing and logging so many errors can cause a higher load. So if you disable the DPS, I think they are just not logged, but still happen. The problem must be somewhere else. I'll try to decode your dump according to the known details and see more exactly what it is.

Ganesh_AT · Jan 27, 2016

I just found out that a stock install of Windows 10 Pro x64 10586 (latest RTM) also exhibits the WHEA issue.

If I set DPS to Manual startup and restart, the system seems to run OK with low CPU load / power consumption. At this point, I am not sure whether it is a Microsoft problem or there is something seriously wrong with the PC itself (it is, after all, a pre-production sample).

Martin · Jan 27, 2016

I think I made some progress in decoding the error data, so I can now report a bit more detailed stats.
Is that machine still producing them, and could you run a new test build there?

Martin · Jan 27, 2016

I should also be able to decode all details from those errors and might tell you more about it, but for that it would be much better if you could export at least a few of those errors into *.evtx format (much easier to simulate those here).

Martin · Jan 27, 2016

And BTW, the fact that those errors are marked as "Information" means they are of the correctable type. Uncorrectable ones should be marked as Error, but these obviously cannot be caught by an application during runtime, because they result in a BSOD (or reset).
For some pre-production samples, such errors can also be caused by an erratum in the CPU or BIOS that might be fixed later - either an erratum in CPU functionality, or in the MCA reporting logic.
