Missing DIMM temp data

duanet

Member
Hi:

We have two identical systems, but only one of them shows the DRAM DIMM temperature data. I'm attaching the debug and log files from the system that is not reporting it.

I read in other posts that a driver might be blocking the SMBus data. Isn't the DIMM temp data coming from the SPD chip via I2C?

Thanks,
Duane
 

Attachments

  • HWiNFO64.DBG
    1.4 MB · Views: 6
  • SEGPKG-LAYOUT-2.HTM
    44.5 KB · Views: 2
SMBus is a layer above I2C and DIMM temperature comes from a dedicated TSOD sensor on the module (if present). Do both machines contain the same memory modules?
The attached Debug File, however, doesn't contain sensor data. Please make sure to open the sensors as well before closing HWiNFO, and then attach the new Debug File.
 
Thanks for getting back to me so quickly. Yes, we have the same modules in both systems.

I'm attaching the new debug file with the sensor data.
 

Attachments

  • HWiNFO64.DBG.7z
    84.4 KB · Views: 0
Thanks. I think I know where the problem is, but need to confirm my assumption. Can you please also attach a similar Debug File for the machine where DIMM temperature is shown?
 
OK. Here's the Debug File from the system that reports the DIMM temperatures.

It's interesting that I see the opposite behavior in IPMI. There is no DIMM info in the "good" system, but there is DIMM info in the "bad" system.
 

Attachments

  • HWiNFO64.DBG.7z
    51.5 KB · Views: 0
OK, so the situation is as follows. On such systems (Skylake Server), the memory module EEPROM (SPD) and the DIMM thermal sensor (TSOD) are connected straight to the CPU via a dedicated SMBus.
This allows them to support continuous monitoring of memory module temperatures, which is called CLTT (Closed Loop Thermal Throttling).
When this mode is activated in the BIOS, the CPU/PCU takes ownership of the SMBus and periodically queries DIMM temperatures. This is a nice feature for power management, but the downside is that no other application can access the SMBus to retrieve SPD or TSOD data. This is what you see on the first machine, and you will also notice that memory module information is missing there.
So I believe the difference between the two machines is a BIOS setting called CLTT (or something similar related to memory thermal management).
Fortunately this is not a showstopper. Even with CLTT activated, you should be able to see memory module temperatures in HWiNFO shown as "Memory Controller X Channel N Rank Max" - this is the same value as read from TSOD.
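For context, when the SMBus is accessible, a TSOD reading arrives as a raw 16-bit register value that has to be decoded into degrees. Below is a minimal sketch, assuming the common JEDEC JC-42.4 style temperature register layout (three flag bits, a sign bit, and a 12-bit value in 1/16 °C steps). The function name is hypothetical and this is not HWiNFO's actual code.

```python
def decode_tsod_temperature(raw: int) -> float:
    """Convert a raw 16-bit TSOD temperature register value to degrees Celsius.

    Assumed layout (JC-42.4 style):
      bits 15..13  alarm/limit flags (ignored here)
      bit  12      sign
      bits 11..0   magnitude, 1 LSB = 0.0625 degC
    Assumes `raw` is already in register byte order (MSB first); an SMBus
    "read word" normally returns the bytes swapped.
    """
    value = raw & 0x1FFF      # drop the three flag bits
    if value & 0x1000:        # sign bit set: treat as 13-bit two's complement
        value -= 0x2000
    return value * 0.0625


print(decode_tsod_temperature(0x0190))  # 400 * 0.0625 = 25.0
```

With CLTT active, this decoding is effectively done by the CPU/PCU, and tools can only read the result it stores.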
 
That's amazing, Martin! Thank you very much!

You're right about the temps showing up in "Memory Controller X Channel N Rank Max". Why does HWiNFO64 report them differently? Is it because CLTT takes over SMBus? I would think the TSOD data would be available in either case. Also, there are only 2 channels/3 ranks reported for each CPU.
 
Yes, CLTT takes over SMBus and in that case HWiNFO cannot access the SPD or TSOD anymore.
 
It is reported differently because "Memory Controller X Channel N Rank Max" is an internal CPU register where the CPU/PCU stores the data read via CLTT.
 
Hi Martin:

What’s different between the two systems? Why can one read the SMB and the other can’t? Different CPU revisions or motherboard revisions?

Thanks
Duane
 
As I wrote earlier, I believe the difference is a BIOS setting. I'm not sure what it's called there; it might be something like DIMM Thermal Management, CLTT...
 
Sorry, I forgot to reply fully to that earlier. Both systems have the same BIOS versions. (Please see attached.) But there is no CLTT control in the "good system", and the CLTT controls are there in the "no DIMM temp" system. It's just the opposite of what I expected.
 

Attachments

  • good_sys_bios.png
    12.1 KB · Views: 10
  • good_sys.png
    74.4 KB · Views: 10
  • no_dimm_temp_bios.png
    12.4 KB · Views: 10
  • no_dimm_temp.png
    174.5 KB · Views: 10
The setting most likely has a different name, but I don't know what it's called there.
Enter the BIOS menu locally, check all settings related to memory thermal management, and look for differences.
 
I checked the IPMI, but the BIOS revisions are the same. Maybe the motherboard with no temps is a later revision?

Can you recommend any texts? This stuff is really cool.
 
I have the same problem on Supermicro H11SSL-i + AMD Epyc 7502P + Micron 18ASF1G72PDZ-2G6E1.
SPD cannot be read (while DIMM temperatures are visible through IPMI), and unfortunately there is no setting like "Closed Loop Thermal Throttling" in the BIOS :(
Any other solutions?

1.png
 
AMD boards are different: they don't support CLTT/OLTT, so this is a different issue.
This is a Supermicro feature. These boards usually contain SMBus mux chips, which require special board-specific functions to unlock access to SPD. Unfortunately I don't have the required information for this board :(
Would you be able to have a look at the mainboard and see if there's a PCA9545 or PCA9545A chip?
A HWiNFO Debug File with sensor data would also be useful to check if this can be done.
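To illustrate why the mux chip matters: a PCA9545A is a 4-channel I2C/SMBus switch, and devices behind it (such as SPD EEPROMs) only become reachable after the matching channel bit is written to its control register. A minimal sketch of computing that byte follows, with a hypothetical function name. Which channel carries the DIMMs on this board, and whether additional board-specific steps are needed, is not known here.

```python
def pca9545_channel_mask(channel: int) -> int:
    """Return the control-register byte that enables one downstream channel.

    On the PCA9545A, bits 0..3 of the control register enable channels 0..3.
    The upper bits report interrupt status and are not written here.
    """
    if not 0 <= channel <= 3:
        raise ValueError("PCA9545A has four channels, numbered 0 to 3")
    return 1 << channel


# Example: the byte to write to the mux to route the bus to channel 2.
print(hex(pca9545_channel_mask(2)))  # 0x4
```

In practice this byte would be written to the mux's own SMBus address (0x70 to 0x73, depending on its address pins) before addressing the SPD devices, which normally sit at 0x50 to 0x57.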
 
Hi Martin,

Thanks for your quick reply! The board is currently in use, and I will be able to look at the board details in 2~3 days, so for now I can only provide the report and debug files, sorry!
Here they are (the files are a bit large, so I compressed them into a zip archive to reduce the size):
 

Attachments

  • debuginfo.zip
    601.2 KB · Views: 2