RX480 Problem

Darr

Member
Hello Martin, when you have the chance, please take a look at the following system report and DBG files and tell me what is happening with my MSI RX480 Armor 8GB OC card. Sometimes HWinfo64 loads correctly, sometimes it crashes the system during the ATI checks, and sometimes, like now, everything loads but the CPU Tcl value is not showing. Also, sometimes when I have my jump drive plugged in (SanDisk 32MB Cruzer), the SMART drive readings on the HDD disappear. The report file and the two DBG files were captured without the jump drive.

System Report File
Crash on ATI detection
Running no CPU Tcl reading

In the Running DBG, the header for the CPU Tcl temp shows, but the following line showing the temperature reading is blank.

I should also note that when I get the GPU VRM temp and GPU VRM power readings to show, I get two sets of the same readings.
 
I see that you have the "GPU I2C via ADL" option enabled, which is not recommended. So please try to disable this and see if it fixes the crashing. If not, you might need to disable the entire GPU I2C Support.
As for the other sensor discrepancies, I'm not sure if you use a custom order of sensor items. You might try to do a "Restore Original Order" or revert to "Fixed order" in sensor settings / layout to see if that helps.
 

It is having that option checked that gives the GPU VRM readings. The attached screenshots show the settings used to get everything and the results when it works correctly. The GPU apparently is very sensitive to polling. Once enough failed starts happen and the program learns all the sensors, it will usually start up OK until the cache is reset or a new version is put in place.


P.S. I really don't think there is anything wrong with the program, and the postings were just to make sure. I think the problem lies with the GPU, in that I think I have fried one of the VRMs: after April 19th, BOINC and its OpenCL tasks started reporting the GPU's global memory as 7536MB, even though AMD Radeon Settings was showing 8192MB. However, after a re-install of the drivers yesterday, even AMD Radeon Settings is now only reporting 7536MB of memory.
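For reference, the size of the gap between the two reported figures can be checked with a quick calculation. This is only a minimal sketch using the two numbers quoted above; the variable names are illustrative and do not come from BOINC or the AMD tools.

```python
# Figures quoted in the post, in MB.
advertised_mb = 8192  # originally shown by AMD Radeon Settings
reported_mb = 7536    # reported by BOINC since April 19th

missing_mb = advertised_mb - reported_mb
visible_fraction = reported_mb / advertised_mb

print(f"Missing: {missing_mb} MB")
print(f"Visible: {visible_fraction:.1%} of the advertised memory")
```

The arithmetic only quantifies the difference; it says nothing about whether the cause is hardware, driver, or how the OpenCL runtime reports global memory.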


[attachment=2324][attachment=2325][attachment=2326]
 

Attachments

  • HWinfo64_SMB.png (35.1 KB)
  • HWinfo64_Safety.png (39.9 KB)
  • HWinfo64_All _Readings.png (213 KB)

I don't think that frying a VRM would have such an effect. It would rather make the GPU unusable.
Are you sure you don't get the VRM data with the "GPU I2C via ADL" option disabled? From my experience that should work much better than the ADL method. Just make sure to reset the GPU I2C cache after the switch.
As for the mixed sensor values, that should be resolved by one of the methods I described above, i.e. "Restore Original Order".

BTW, which project do you run on BOINC ? My GPUs are currently fully dedicated to Einstein@Home :)
 

Restart HWinfo64 and unselect ADL and reset cache: https://1drv.ms/i/s!ArIvftV8roEagSTVQp4aT-IttBc-

Produces the following result: https://1drv.ms/i/s!ArIvftV8roEagSXArnvCEorlnJeh One set of VRMs, but missing half of the GPU readings

Re-selecting ADL gets: https://1drv.ms/i/s!ArIvftV8roEagSZiO2dsw2QI6_7v No VRMs, but full set of GPU readings

Reset cache and ensuring Afterburner is running gets us back to normal:
s!ArIvftV8roEagSdLCLF_hwok9Gnp


In between all the above were system restarts due to crashes when polling ATI.

P.S. Here is a screenshot of my Boinc Projects: https://1drv.ms/i/s!ArIvftV8roEagSiim7J2OE92brQn Currently have 7 projects with GPU apps.
 
Thanks. Did those crashes happen with "GPU I2C via ADL" enabled, disabled, or both?
Try to do a "Restore Original Order" and/or enable "Fixed order" in sensor settings/layout. I think that will also resolve the problem with reduced GPU info when "GPU I2C via ADL" is enabled.

You have a huge set of BOINC projects :) I was also running Milkyway in the past, made it to #33 world-wide, now Einstein at #41 :)
 
They happen with both. Right now, as long as I have Afterburner running before starting HWinfo64, HWinfo64 comes up OK with all sensors operating and two sets of the VRM readings. I guess the thing to bear in mind is that this is a nine-year-old motherboard, CPU, memory, hard drive, and power supply trying to drive a new GPU. What I probably should do is go to the storage unit, dig through a box, get my Windows 7 disk, and do a fresh install, but re-installing all of my programs and development environments would be a pain.

What I was hoping from supplying the DBG files was that they might give an indication of a quick fix for inclusion in the next build, but it sounds like it would be more involved, and with a system this old, who knows what may be interfering. Since getting the new GPU, I have fought with graphics driver resets, HWinfo, crashes, etc. It has been a lot of fun playing with the GPU. I wish I could show you how the screen would jump when a MooWrapper task would start and stop (I have since found the right setting for it to start smoothly). I think I may even have a screenshot somewhere of HWinfo showing a Moo task having pulled 160W and 200 amps from the GPU.
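As a rough sanity check on those last two figures, the voltage they imply follows from P = V × I. This is a minimal sketch that assumes both readings were taken at the same moment and refer to the same rail, which the post does not state.

```python
# Readings quoted in the post.
power_w = 160.0    # watts
current_a = 200.0  # amps

# P = V * I, so V = P / I.
implied_voltage_v = power_w / current_a

print(f"Implied voltage: {implied_voltage_v:.2f} V")
```

A result below 1 V is at least in the range one would expect for a GPU core rail, so under that assumption the two numbers are consistent with each other.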

 
Did a restore original order on the layout screen, then shut down BOINC, HWinfo64, and Afterburner.

Restarted HWinfo64 with reorder and no Afterburner, no ADL, no crash: [attachment=2327] Full GPU, no VRM readings
Restarted HWinfo64 with reorder and no Afterburner, with ADL, no crash: [attachment=2328] Full GPU, two sets of VRM readings

Restarted HWinfo64 with reorder, cache reset, no Afterburner, no ADL, no crash [attachment=2329] Partial GPU and VRM
Restarted HWinfo64 with reorder, cache reset, no Afterburner, with ADL, no crash [attachment=2330] Full GPU and two sets of VRM readings

Inserted the SanDisk Cruzer Glide jump drive and restarted HWinfo64. Experienced a set of crashes while polling ATI until I started Afterburner before starting HWinfo64.

Restarted HWinfo64 with reorder, cache reset, with Afterburner, with ADL, with jump disk,  no crash: [attachment=2331] No SMART readings on HDD
Restarted HWinfo64 with reorder, cache reset, with Afterburner, with ADL, no jump disk, no crash: [No picture, too many attachments] SMART readings restored
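The six no-crash runs listed above can be written down as a small table and filtered by setting, which makes the pattern easier to see. This is only a sketch: the `Run` record and the `runs_where` helper are hypothetical names used for illustration, and the result strings are transcribed from the notes above. All six runs were made after "Restore Original Order".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    cache_reset: bool
    afterburner: bool
    adl: bool
    jump_drive: bool
    result: str


# Outcomes transcribed from the six no-crash runs listed above.
RUNS = [
    Run(False, False, False, False, "Full GPU, no VRM readings"),
    Run(False, False, True, False, "Full GPU, two sets of VRM readings"),
    Run(True, False, False, False, "Partial GPU and VRM"),
    Run(True, False, True, False, "Full GPU and two sets of VRM readings"),
    Run(True, True, True, True, "No SMART readings on HDD"),
    Run(True, True, True, False, "SMART readings restored"),
]


def runs_where(**settings):
    """Return the runs whose fields match every keyword argument given."""
    return [
        run for run in RUNS
        if all(getattr(run, name) == value for name, value in settings.items())
    ]


# Show what was observed with ADL enabled and Afterburner not running.
for run in runs_where(adl=True, afterburner=False):
    print(run.result)
```

Filtering this way shows that the doubled VRM readings were only noted in runs with ADL enabled, and that the only run with missing SMART readings is the one with the jump drive inserted.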

It is very strange that, after four successful restarts, just inserting the jump drive into the system would cause HWinfo64 to crash when polling the ATI. Even stranger is that it then starts with Afterburner running first. Does Afterburner shield something from being polled?

I have the crash DBGs and some successful start DBGs in a backup folder for HWinfo if you would like to see them.
 

Attachments

  • HWinfo64_ReOrdered_No_ADL.png (67.1 KB)
  • HWinfo64_ReOrdered_With_ADL.png (82.7 KB)
  • HWinfo64_ReOrdered_CR_No_ADL.png (68.7 KB)
  • HWinfo64_ReOrdered_CR_ADL.png (83.9 KB)
  • HWinfo64_ReOrdered_CR_ADL_AB_JD.png (78.5 KB)

Yes, please give me the DBG files for analysis.
Have you also enabled the "Fixed order" option in sensor settings / layout ?
 

Uploaded the dbg+reports backup folder to the OneDrive folder HWinfo64DBGs, and here is the link to that folder:

https://1drv.ms/f/s!ArIvftV8roEagS8XYrdKjdUlaTr9

Hopefully this will give you access to all of them. The crashes mentioned above are HWinfo64 (31).DBG, (32), and (33)

For the tests above, I did not have Fixed order checked in the layouts. I will retry the jump drive test with Fixed order enabled tomorrow after some sleep. The system is running smoothly at the moment and I would like BOINC to run for a while. (Hopefully the thunderstorms rolling through the area tonight won't upset anything.)

The crashes in the folder go back to May 6th, when I started saving them. The file HWinfo64.DBG should actually be HWinfo64 (15).DBG, but I goofed in backing it up and it overwrote the original from May 6th.
 
Another question. The program OCCT used for testing over-clocks, reports individual core temps for my Phenom II, but I have never seen any indication of this in HWinfo. Is this normal? Or is it because I don't have something checked in the settings?
 
Thanks, I analyzed the DBG files and all crashes happened while accessing the VRM. Also the last dumps were all with I2C via ADL.
 

So this brings up the questions:

Is my VRM bad or just very sensitive to polling?
Can you find a way to access the VRM without ADL?
Why does Afterburner APPEAR to have a shielding effect at times?
Is this just a problem with my RX480 from MSI, or is it the RX480 platform as a whole?
Do you know of any GPU memory testing programs that can check the card's memory like Prime95 or Memtest86 do for desktop memory? I have found one called memtestCL, but it would only check the 3072MB of OpenCL memory; I need something that checks the entire 8192MB of memory on the card.
 
I don't think your VRM is bad. I have seen cases where polling chips on GPU I2C (like VRMs) can cause such problems, but so far I haven't seen such a problem on the RX 480.
HWiNFO does access the VRMs without ADL when you disable the "GPU I2C via ADL" option.
Sorry, but I have no idea why running Afterburner makes a difference. I have no insight into its internals.
HWiNFO (as the only tool) has a sensor counter called "GPU Memory Errors" which reports all ECC errors in GPU memory. Note that this counter works only if the GPU is under load.
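Because the counter is cumulative, the useful signal is when it rises rather than its absolute value. The following is a minimal sketch of that idea, assuming the counter values have been noted down at regular intervals while the GPU is under load; the `counter_increments` helper and the sample readings are hypothetical and are not part of HWiNFO.

```python
def counter_increments(samples):
    """Return (index, delta) for every sample where the counter rose."""
    events = []
    for i in range(1, len(samples)):
        delta = samples[i] - samples[i - 1]
        if delta > 0:
            events.append((i, delta))
    return events


# Hypothetical readings of a cumulative error counter, one per interval.
readings = [0, 0, 0, 1, 1, 3, 3]

for index, delta in counter_increments(readings):
    print(f"sample {index}: +{delta} error(s)")
```

Lining the indices up against what the system was doing at the time (which task was running, clock settings, and so on) is one way to look for a pattern in when the errors accumulate.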
 
Hi Martin, boy do I feel like a fool, and I apologize for wasting your time. Restoring order, and CHECKING Fixed order appears to have done it.

Restored order, Fixed order, No ADL, No Jump Drive: [attachment=2332] No crash!!! Full GPU readings, one set of VRM readings.

Restored order, Fixed order, No ADL, With Jump Drive: [attachment=2333] No crash!!! Full GPU and VRM, JD readings, HDD readings, and SMART readings.

Please, please forgive a feeble, old fool. I think what happened is that when I got the MSI and started playing with the HWinfo settings, I must have unchecked Fixed order in order to get a custom layout, without realizing what a powerful setting it is and what not having it checked would do.

Again let me ask for your forgiveness.

P.S. I still have the question about the individual core temps on the Phenom II X6 1100T Black, and I still have to figure out why Radeon Settings, GPU Caps Viewer, and GPU-Z all report the card as having 8192MB of memory (which BOINC also did until April 19th) but BOINC now says that the card only has 7536MB of memory.
 

Attachments

  • HWinfo64_ReOrdered_FO_No_ADL_No_AB_No_JD.png (83.8 KB)
  • HWinfo64_ReOrdered_FO_No_ADL_No_AB_JD.png (77.6 KB)

Yes, from reading about the memory errors on your forum, I understand that there is no way to distinguish between an actual error and a request for re-transmission of data, and like you I believe the count should be zero, especially when running BOINC. I have seen it go for hours with no errors and then start to accumulate 1 or 2 at a time. I suspect that the problem is that my 1TB HDD is nearly 75% full, leading to slow response times in getting data from the HDD through the CPU to the GPU. That, and the fact that the card is designed for a PCIe 3.0 slot but I only have a PCIe 2.0 slot to run it in.
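To put the slot difference in numbers, the theoretical link bandwidth of each PCIe generation follows from its transfer rate and line encoding. This is a minimal sketch assuming a full x16 link in both cases (the post does not give the lane count); the function name is illustrative, and the figures are per direction before protocol overhead.

```python
def pcie_bandwidth_gb_s(gt_per_s, payload_bits, total_bits, lanes=16):
    """Theoretical one-direction bandwidth in GB/s for a PCIe link.

    gt_per_s     -- raw transfer rate per lane in GT/s
    payload_bits -- data bits per encoded symbol group
    total_bits   -- total bits per encoded symbol group
    lanes        -- link width (x16 is assumed here)
    """
    return gt_per_s * payload_bits / total_bits / 8 * lanes


gen2 = pcie_bandwidth_gb_s(5.0, 8, 10)      # PCIe 2.0: 5 GT/s, 8b/10b encoding
gen3 = pcie_bandwidth_gb_s(8.0, 128, 130)   # PCIe 3.0: 8 GT/s, 128b/130b encoding

print(f"PCIe 2.0 x16: {gen2:.2f} GB/s")
print(f"PCIe 3.0 x16: {gen3:.2f} GB/s")
```

So a PCIe 2.0 x16 slot offers roughly half the bus bandwidth of a PCIe 3.0 x16 slot. That narrows transfers between host and card, but on its own it does not show that the bus is the source of the error counts.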
 
No problem, I'm glad that this is resolved now :) A custom sensor order can sometimes cause trouble with the layout, especially when additional sensor items appear during runtime.
Sorry, but I don't know what could be causing the misreporting of GPU memory. Perhaps try asking on the AMD forums.
 

I'll figure that one out eventually. Any thoughts about the individual core temps on a Phenom II?
 
AMD CPUs cannot measure per-core temps. There are some tools reporting per-core values, but these are fake: they just copy a single temperature value across all cores.
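One way to see whether a tool is mirroring a single value is to note its per-core readings over several samples and check whether the cores ever differ. This is a minimal sketch with made-up sample data; the `looks_mirrored` helper is hypothetical and not taken from any of the tools mentioned.

```python
def looks_mirrored(samples, tolerance=0.0):
    """Return True if, in every sample, all cores report the same value.

    samples   -- list of rows, one row per sample, one value per core
    tolerance -- largest spread between cores still treated as identical
    """
    return all(max(row) - min(row) <= tolerance for row in samples)


# Hypothetical readings in degrees C for a six-core CPU.
mirrored = [[41.0] * 6, [43.5] * 6, [47.0] * 6]
independent = [
    [41.0, 42.5, 40.0, 41.5, 43.0, 41.0],
    [43.5, 45.0, 42.0, 44.0, 46.5, 43.0],
]

print(looks_mirrored(mirrored))
print(looks_mirrored(independent))
```

Readings that stay identical across all cores through changing load are what a copied single sensor value would look like; a short run of identical values on an idle system would not prove much either way.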
 