GPU or PCI bad slot - monitoring and report

AL13N

New Member
My 2nd 2080 Ti is going dead during gaming: all the LEDs on the card and the NVLink bridge go dark, the whole system freezes, and I have to reboot. This only happens in SLI games. Rendering benchmark files with Blender, or running Time Spy or the Heaven benchmark, is fine with no issues.

I'm talking to EVGA about an RMA for the card, but they want me to disassemble the whole system, reinstall the air coolers on both cards, and swap the cards between the slots to determine whether it's a card or a slot issue. They even say it could be a bad NVLink bridge. As you can see from the pic, that would be a tedious job.

I have a log file from HWiNFO, but opened in Notepad it's just millions of numbers that don't make any sense to me. So what I'm asking is: is there a way to read this file to determine the problem? Maybe I need the SDK version, which I am willing to get. Any help that keeps me from tearing the whole thing apart would be greatly appreciated.

[attachment=3511]
Thx,
Dave

Case:               ThermalTake The Tower 900
MOBO:             ASUS x399 Prime
Processor:       ThreadRipper 1950X
Cooling:           Phanteks Glacier Series PH-C399A_CR01 GLACIER C399A  - Mirror Chrome
GPU:                EVGA GeForce RTX 2080 Ti BLACK EDITION GAMING (x2)
Cooling Block:  EVGA Hydro Copper Waterblock (x2)
Cooling Tower: ThermalTake Pacific PR22-D5 (x2)
Radiator:          Pacific CL360 (x2)
Fans:               Thermaltake Pure Plus 12 RGB TT Premium Edition (x6)
LED:                 Pacific Lumi Plus 3 Pack
Drive:               SAMSUNG 970 EVO M.2 2280 1TB
Memory:           Corsair Platinum 32GB (4x8 GB)
PSU:                 EVGA SuperNOVA 1200 P2
Monitor:            Acer Predator Gaming X34
OS:                  Windows 10 Professional (64-bit)
 

Martin

HWiNFO Author
Staff member
You can use the GenericLogViewer to analyze the sensor log in a more convenient way.
But there's no guarantee that the problem can be determined based on sensor values. It might be software (driver) related as well.
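
If you would rather script it than use a viewer, here is a minimal sketch that pulls the last few samples out of a CSV sensor log, since the readings just before a freeze are usually the interesting part. It assumes the log is plain comma-separated text with one header row and one row per sample. The column names and values in SAMPLE are invented for illustration and will differ in a real log. Repeated column names (for example with two identical cards) are made distinct by appending " #2".

```python
import csv
import io

def unique_names(header):
    """Make repeated column names distinct by appending ' #2', ' #3', ..."""
    seen = {}
    result = []
    for name in header:
        seen[name] = seen.get(name, 0) + 1
        result.append(name if seen[name] == 1 else f"{name} #{seen[name]}")
    return result

def last_samples(text, n=3):
    """Return the last n samples as dicts keyed by (unique) column name."""
    reader = csv.reader(io.StringIO(text))
    raw_header = next(reader)
    header = unique_names(raw_header)
    # Keep only complete data rows; skip any repeated header line.
    rows = [r for r in reader if len(r) == len(header) and r != raw_header]
    return [dict(zip(header, r)) for r in rows[-n:]]

# Hypothetical sample data, NOT taken from the log in this thread.
SAMPLE = """Date,Time,GPU Temperature [C],GPU Temperature [C]
1.1.2019,20:00:00,45,41
1.1.2019,20:00:02,47,42
1.1.2019,20:00:04,48,42
1.1.2019,20:00:06,49,43
"""

for sample in last_samples(SAMPLE, n=2):
    print(sample)
```

For a real file, read it into a string first and pass that to `last_samples`. The file encoding may need adjusting for your log.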
 

AL13N

New Member
Thx for the reply,

Is there anything I should look for in the graphs that typically points to a crash or a dead card?
 

Martin

HWiNFO Author
Staff member
Overheating would be the most common issue, but as I said there can be plenty of other reasons that cannot be captured by a sensor.
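
A quick way to check a log for overheating is to scan every temperature column for readings above some limit. The sketch below does that under a few assumptions: the log is comma-separated with a header row, temperature columns contain the word "Temp", and there is a "Time" column. The 85 degree limit is only an illustrative value, not an official figure for any card, and the sample data is invented.

```python
import csv
import io

def hot_samples(text, keyword="Temp", limit=85.0):
    """Return (time, column, value) for each reading above limit in matching columns."""
    reader = csv.DictReader(io.StringIO(text))
    hits = []
    for row in reader:
        for name, cell in row.items():
            if name is None or keyword not in name:
                continue  # not a temperature column
            try:
                value = float(cell)
            except (TypeError, ValueError):
                continue  # skip blank or non-numeric cells
            if value > limit:
                hits.append((row.get("Time"), name, value))
    return hits

# Hypothetical sample data, NOT taken from the log in this thread.
SAMPLE = """Time,GPU1 Temp [C],GPU2 Temp [C]
20:00:00,60,55
20:00:02,88,57
20:00:04,91,58
"""

for time, column, value in hot_samples(SAMPLE):
    print(time, column, value)
```

An empty result only means no logged temperature crossed the limit. It does not rule out the other causes mentioned above.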
 

TomWoB

Well-Known Member
Hi AL13N,

First of all: WOW, what a system!

I had a quick look at your logs and this is what I see (in my screenshot):

[attachment=3529]

  • Red values:    Data 1
  • Green values: Data 2
So I assume "GPU ... Data 2" is your second 2080 Ti. The diagrams show:
  • Row 1: GPU 2 runs at a higher voltage the whole time
  • Row 2: GPU 2 draws about 40 W more than GPU 1 on average
  • Row 2: GPU 2 peaks at 254 W (!), GPU 1 only at 188 W
  • Row 3: but the temperature of GPU 1 is higher than that of GPU 2 (???)
  • Row 4: Data 2 often runs into "Performance Limit - Reliability Voltage"
Conclusion:
1) Maybe the temperature sensor of GPU 2 is failing (?)
    - How can GPU 2 draw 40 W more at a lower temperature?
    - Do both GPUs have the same (symmetric) cooling?
2) GPU 2 is running at a higher voltage -> unstable?
3) Data 2 shows a lot of "Performance Limit - Reliability Voltage" -> unstable?
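
The same kind of side-by-side comparison can be roughly reproduced in a script. This is only a sketch: it assumes a comma-separated log with a header row, and the two column names and all values in SAMPLE are invented for illustration, so substitute the power (or voltage, or temperature) columns from your own log.

```python
import csv
import io
from statistics import mean

def column_values(text, name):
    """Collect the numeric values of one column, skipping bad cells."""
    values = []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            values.append(float(row[name]))
        except (KeyError, TypeError, ValueError):
            continue
    return values

def compare(text, col_a, col_b):
    """Return mean and max of two columns plus the difference of the means (b - a)."""
    a = column_values(text, col_a)
    b = column_values(text, col_b)
    if not a or not b:
        raise ValueError("one of the columns has no numeric data")
    return {
        "mean_a": mean(a), "max_a": max(a),
        "mean_b": mean(b), "max_b": max(b),
        "mean_diff": mean(b) - mean(a),
    }

# Hypothetical sample data, NOT the values from the screenshot.
SAMPLE = """Time,GPU1 Power [W],GPU2 Power [W]
20:00:00,100,130
20:00:02,110,150
20:00:04,120,170
"""

print(compare(SAMPLE, "GPU1 Power [W]", "GPU2 Power [W]"))
```

A consistently positive `mean_diff` for power together with a lower temperature on the same card would match the pattern described above.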

Hope this helps a little bit ... Regards
Tom
 
