Possible incorrect PROCHOT readout

Vaporizer

Member
Hi again, I'm noticing some strange PROCHOT behaviour on a Supermicro system with AMD Genoa (same one as in this topic). The sensor readings "Thermal Throttling (HTC)" and "Thermal Throttling (PROCHOT CPU)" appear to be incorrect. At around 45°C core temperature the value changes from No to Yes, but no actual throttling occurs.
To verify the value I booted into Ubuntu and tried e_smi_tool from AMD's EPYC System Management Interface Library (https://github.com/amd/esmi_ib_library), which has a "showprochotstatus" option. With this tool the PROCHOT status is always "inactive", whether the core temperature is far above or below 45°C.

The specs of the system are as follows; a report file and debug log are attached.
HWiNFO64 7.46
Motherboard: Supermicro Super H13SSW
CPU: AMD EPYC 9354P
OS: Windows Server 2022

If any more info is required or if I'm interpreting this wrong, just let me know.
 

Attachments

  • WIN-JBIXVWL1ACB.zip
    852.3 KB · Views: 2
This throttling is most likely not triggered by the CPU but by an external component such as the BMC or SIO.
 
I'm also getting this reading, with Tdie at 120°C under load and each CCD at 70°C or less. FCLK and MC are both at 1200 MHz while the RAM runs at 2400 MHz (DDR5-4800). Are my FCLK and MC throttling, or is this normal for Genoa? EPYC 9654 + Supermicro H13SSL
 
This might rather be this problem:
 
My issue is a constant temperature of 120-130°C, not spikes. I was able to determine that the CPU is not throttling: when I locked FCLK to 1800 MHz, the CPU clock dropped from 3 GHz to 2.8 GHz and performance for my workload decreased. So the CPU was lowering FCLK automatically to get more performance within the power envelope (400 W). It has been under 100% load for a week and running flawlessly. It appears to be only a reporting issue. IPMI is not reporting any overheating; only Tdie/Tctl show the high values.
 

Attachments

  • 2023-12-16 (1).png
    385.7 KB · Views: 6
It's also interesting that Linus Tech Tips encountered the same issue in his review of the 128-core Genoa.
 
Well, HWiNFO shows what the CPU reports and I have to admit that I don't know why this is so. AMD has to answer this.
 
A guy on another forum stumbled upon a possible reason for this. He upgraded his H13SSL BIOS to 1.6 and now sees Tctl/Tdie above 100°C, but only under Windows; BIOS 1.6 with Linux does not give the erroneous reading. That would explain why AMD is clueless too.
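For anyone who wants to compare the Windows readings against Linux on the same board, here is a rough Python sketch that reads the kernel's k10temp sensors from hwmon sysfs and reports how far Tctl sits above the hottest CCD. It assumes the k10temp driver is loaded and exposes labels such as "Tctl" and "Tccd1"; the function names are made up for this example, and nothing here has been checked on an H13SSL.

```python
from pathlib import Path


def read_k10temp(hwmon_root="/sys/class/hwmon"):
    """Return {"hwmonN/label": degrees_C} for every k10temp sensor found."""
    readings = {}
    root = Path(hwmon_root)
    if not root.is_dir():
        return readings  # no hwmon tree (e.g. not Linux): nothing to report
    for dev in sorted(root.iterdir()):
        name_file = dev / "name"
        # Only look at devices registered by the k10temp driver.
        if not name_file.is_file() or name_file.read_text().strip() != "k10temp":
            continue
        for label_file in sorted(dev.glob("temp*_label")):
            input_file = dev / label_file.name.replace("_label", "_input")
            if not input_file.is_file():
                continue
            label = label_file.read_text().strip()
            # hwmon reports temperatures in millidegrees Celsius.
            readings[f"{dev.name}/{label}"] = int(input_file.read_text()) / 1000.0
    return readings


def tctl_ccd_gap(readings):
    """Per hwmon device, return Tctl minus the hottest Tccd reading (in °C)."""
    gaps = {}
    devices = {key.split("/", 1)[0] for key in readings}
    for dev in sorted(devices):
        tctl = readings.get(f"{dev}/Tctl")
        ccds = [v for k, v in readings.items() if k.startswith(f"{dev}/Tccd")]
        if tctl is not None and ccds:
            gaps[dev] = tctl - max(ccds)
    return gaps
```

Calling `tctl_ccd_gap(read_k10temp())` under load gives the gap per CPU socket. A gap of around 50°C, like the 120°C vs. 70°C reported above, would suggest the same reporting issue also shows up on Linux; a small gap would fit the BIOS 1.6 + Windows-only observation.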
 
 