GPU or PCI bad slot - monitoring and report

AL13N

My second 2080 Ti goes dead during gaming: all the LEDs on the card and the NVLink bridge go dark, the whole system freezes, and I have to reboot. This only happens in SLI games; rendering benchmark files with Blender, or running Time Spy or the Heaven benchmark, is fine with no issues. I'm talking to EVGA about an RMA for the card, but they want me to disassemble the whole system, reinstall the air coolers on both cards, and swap the cards between slots to determine whether it is a card or a slot issue. They even say it could be a bad NVLink bridge. As you can see from the pic, that would be a tedious job. I have a log file from HWiNFO, but opened in Notepad it is just millions of numbers that don't make any sense to me. So what I'm asking is: is there a way to read this file to determine the problem? Maybe I need the SDK version, which I am willing to get. Any help that keeps me from tearing the whole thing apart would be greatly appreciated.

[attachment=3511]
Thx,
Dave

Case:               ThermalTake The Tower 900
MOBO:             ASUS x399 Prime
Processor:       ThreadRipper 1950X
Cooling:           Phanteks Glacier Series PH-C399A_CR01 GLACIER C399A  - Mirror Chrome
GPU:                EVGA GeForce RTX 2080 Ti BLACK EDITION GAMING (x2)
Cooling Block:  EVGA Hydro Copper Waterblock (x2)
Cooling Tower: ThermalTake Pacific PR22-D5 (x2)
Radiator:          Pacific CL360 (x2)
Fans:               Thermaltake Pure Plus 12 RGB TT Premium Edition (x6)
LED:                 Pacific Lumi Plus 3 Pack
Drive:               SAMSUNG 970 EVO M.2 2280 1TB
Memory:           Corsair Platinum 32GB (4x8 GB)
PSU:                 EVGA SuperNOVA 1200 P2
Monitor:            Acer Predator Gaming X34
OS:                  Windows 10 Professional (64-bit)
 

Attachments

  • Alien.jpg
    162.8 KB · Views: 6
  • HWM_Log.CSV
    561.8 KB · Views: 1
You can use the GenericLogViewer to analyze the sensor log in a more convenient way.
But there's no guarantee that the problem can be determined based on sensor values. It might be software (driver) related as well.
 
Thx for the reply,

Is there anything I should look for in the graphs when I open the file, something typical of a crash or a dead card?
 
Overheating would be the most common issue, but as I said, there can be plenty of other reasons that cannot be captured by a sensor.
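
If the graphs are hard to read, the last few samples before the log stops can also be pulled out with a short script. This is only a rough sketch, not an HWiNFO feature: it assumes the log has one header row, and that both cards appear as separate columns sharing the same label. The label `GPU Temperature [C]` and the sample values below are made up for illustration; use the labels from your own file.

```python
import csv
import io


def read_log(text):
    """Parse CSV text into (header, rows), keeping rows as wide as the header."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = [r for r in reader if len(r) == len(header)]
    return header, rows


def to_float(cell):
    """Convert a cell to float, or None if it is not numeric."""
    try:
        return float(cell)
    except ValueError:
        return None


def last_values(header, rows, name, n=5):
    """Last n numeric values of every column whose label contains `name`.

    Keyed by column index, because two GPUs may share the same label.
    """
    result = {}
    for idx, col in enumerate(header):
        if name in col:
            values = [to_float(r[idx]) for r in rows]
            result[idx] = [v for v in values if v is not None][-n:]
    return result


# Invented sample data, NOT taken from the attached log.
sample = """Time,GPU Temperature [C],GPU Temperature [C]
12:00:01,41.0,39.0
12:00:02,42.0,39.5
12:00:03,43.0,40.0
"""

header, rows = read_log(sample)
print(last_values(header, rows, "GPU Temperature", n=2))
```

For the real file, the text would be read from disk first; the encoding may need adjusting depending on how the log was written. A temperature that climbs steeply in the final rows would point toward overheating; flat values right up to the freeze would suggest looking elsewhere.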
 
Hi AL13N,

First of all: WOW, what a system!

I had a quick look at your logs and here is what I see (in my screenshot):

[attachment=3529]

  • Red values:    Data 1
  • Green values: Data 2
So I assume "GPU ... Data 2" is your second 2080 Ti. The diagrams show:
  • Row 1: GPU 2 runs at a higher voltage the whole time
  • Row 2: GPU 2 draws on average 40 W more than GPU 1
  • Row 2: GPU 2 peaks at 254 W (!), GPU 1 at only 188 W
  • Row 3: yet the temperature of GPU 1 is higher than that of GPU 2 (???)
  • Row 4: Data 2 often runs into "Performance Limit - Reliability Voltage"
Conclusion:
1) Maybe the temperature sensor of GPU 2 is failing (?)
    - How can GPU 2 draw 40 W more at a lower temperature?
    - Do both GPUs have the same (symmetric) cooling?
2) GPU 2 is running at a higher voltage -> unstable?
3) Data 2 hits "Performance Limit - Reliability Voltage" a lot -> unstable?
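
The same kind of comparison (average and peak per card) can be computed directly from the CSV instead of being read off the graphs. A minimal sketch, assuming both GPUs show up as separate columns with the same label; the label `GPU Power [W]` and the numbers in the sample are invented for illustration and are not the values from the log above.

```python
import csv
import io
from statistics import mean


def column_stats(text, name):
    """Return [(column_index, mean, max)] for each column whose label contains `name`."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = list(reader)
    stats = []
    for idx, col in enumerate(header):
        if name not in col:
            continue
        values = []
        for row in rows:
            try:
                values.append(float(row[idx]))
            except (ValueError, IndexError):
                pass  # skip blank cells, short rows, and non-numeric rows
        if values:
            stats.append((idx, mean(values), max(values)))
    return stats


# Invented sample data, NOT the real log.
sample = """Time,GPU Power [W],GPU Power [W]
12:00:01,150.0,190.0
12:00:02,160.0,200.0
12:00:03,170.0,210.0
"""

for idx, avg, peak in column_stats(sample, "GPU Power"):
    print(f"column {idx}: mean {avg:.1f} W, max {peak:.1f} W")
```

Running it once per sensor label (voltage, power, temperature) gives a side-by-side view of the two cards. Results are keyed by column index because the two cards are assumed to share one label.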

Hope this helps a little bit ... Regards
Tom
 

Attachments

  • Screenshot.jpg
    486.3 KB · Views: 7