NVIDIA PCI Express Error Counters

In the release notes, it says:
  • Added monitoring of NVIDIA PCI Express Error Counters.
but I don't see this anywhere. Where is the option to monitor this?
 
Please attach the HWiNFO Debug File for analysis. It's quite possible that the 50 series works differently.
 
Looks like the 50 series require adjustment here. Will need to look into that...
 
Could you please explain what "Recovery Count" means, and why its value rises little by little?
 
Doing some testing with my 4090, it seems that Recovery Count is tied to power management, as it increases every time the power state goes up or down, probably because the PCIe link needs to be reset when it changes link speed. It also seems that Bad TLP Count and NAKs Sent are tied together, as the two values always match, but I haven't been able to find any logged events that correspond to them. Hopefully more documentation will be forthcoming.
 
Sorry, I posted this in the other thread, so copying here too:
The "Recovery Count" counts the number of transitions from the L0 state to Recovery. It is triggered, for example, by a change in link speed or width, or by other events that usually don't mean a PCIe error occurred.
The other counters, however, might indicate a problem on the PCIe interface.
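
To illustrate the distinction above, here is a minimal Python sketch that compares two snapshots of the counters and separates the "usually harmless" Recovery Count from everything else. The counter names follow the labels used in this thread; the function name, the snapshot format, and the sample values are hypothetical, and this is not how HWiNFO itself evaluates the counters.

```python
# Counters whose increase usually does not indicate a PCIe error (per the post above).
USUALLY_BENIGN = {"Recovery Count"}


def classify_deltas(before, after):
    """Split counters that increased between two snapshots into
    (usually_benign, worth_a_look) dictionaries of name -> increase."""
    usually_benign, worth_a_look = {}, {}
    for name, new_value in after.items():
        delta = new_value - before.get(name, 0)
        if delta <= 0:
            continue  # counter did not move, nothing to report
        target = usually_benign if name in USUALLY_BENIGN else worth_a_look
        target[name] = delta
    return usually_benign, worth_a_look


# Hypothetical snapshots, typed in by hand from the sensor window.
before = {"Recovery Count": 5200, "NAKs Sent Count": 8, "Bad TLP Count": 0}
after = {"Recovery Count": 5257, "NAKs Sent Count": 9, "Bad TLP Count": 0}

benign, suspicious = classify_deltas(before, after)
print("usually benign:", benign)
print("worth a look:", suspicious)
```

An increase in the second group is only a hint that the link might have a problem, not proof of one.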
 
Just throwing some data on my 5090 out there...this is after 8 hours of use. Recovery Count = 5,257 and NAKs Sent Count = 8. I'm assuming this is all normal because I haven't encountered any issues...
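
For comparing numbers between systems, it may help to turn the totals into rates. A small Python sketch using the figures from this post (the helper name is made up, and no threshold for "normal" is implied):

```python
def per_hour(count, hours):
    """Average number of counter increments per hour of uptime."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return count / hours


hours_of_use = 8
recovery_rate = per_hour(5257, hours_of_use)  # Recovery Count after 8 hours
nak_rate = per_hour(8, hours_of_use)          # NAKs Sent Count after 8 hours

print(f"Recovery Count: {recovery_rate:.1f} per hour")
print(f"NAKs Sent Count: {nak_rate:.1f} per hour")
```

That works out to roughly 11 recoveries per minute and one NAK per hour on average.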
 
Since we have the same GPU and you also get NAKs Sent, I'm hoping to get some insight that helps us both determine whether this is normal:
- Do you use PCIe 5?
- Do you use DirectX11 or 12?

Do these settings help you to get rid of the NAKs sent?
- Setting ASPM off in BIOS
- Setting PCIe link state power management off in Windows power plan settings
- Setting the NVIDIA power management mode to "Prefer maximum performance"

Could you reproduce the errors by setting the NVIDIA power management mode to "Normal" and playing any DirectX 12 game, watching the counters during loading screens, main menus, and startup?

It would be really helpful if you can share your experience with this, thanks!
 
It seems that NAKs Sent is also related to the memory clock changing state. When the memory clock is maxed at 1750 MHz, neither Recovery Count nor NAKs Sent increases. The moment the memory clock is not maxed, Recovery Count starts going up and NAKs Sent increases every once in a while as well. Not sure if that's anything to really be concerned about, but there are some other bugs with the 50 series related to the memory clock being too low (gray flickering on the desktop in some configurations, for example).
 
That's something I need to monitor, thanks. But it makes sense, since the clock speed also drops when the link speed is reduced. Maxing out the clock prevents the GPU from powering down and thus avoids the NAKs. It's interesting, though, that you are getting NAKs every once in a while when the clock speed is not maxed. 8 NAKs in 8 hours does seem pretty low to me. I usually get many more when I have problems: typically 20-60 NAKs during a single loading screen, plus about 40 more on the recovery counter, but only in DirectX 12. In DirectX 11, only the recovery counter increases during loading screens.
So do you get NAKs as well if the link speed is fixed at 32 GT/s?

Bad TLP Count and NAKs Sent only seem to match on the 4090. On my 5090, I usually get a much higher NAKs count and only sometimes a Bad TLP (mostly when the NAKs occurred in quick succession). The replay counter also sometimes increases if there were many recoveries in a short time.
 