Core effective clocks at 100% C0 state

Timur Born

Well-Known Member
Hello.

I wonder on what basis HWiNFO calculates "Core Effective Clocks" to be lower than core multiplier based clocks when the CPU is 100% in C0 state?

And why are "Core Effective Clocks" listed as considerably lower (by 300 MHz) for CCX0 on my 5900X when CCX1 is considered the worse CCX, with both dies processing the very same load (P95 at 100% C0 state)?
 

Martin

HWiNFO Author
Staff member
 

Timur Born

Well-Known Member
Thanks for the link and reminder that RyzenMaster can be used to verify the results. Unfortunately the linked explanation and following discussion kept emphasizing halted C-states, which I specifically disabled to better understand effective clock. Only after several read-overs did I understand that the following statement remains true even at 100% C0 state (no HALT):

This method relies on hardware's capability to sample the actual clock state (all its levels) across a certain interval...

So HWiNFO purely relies on a hardware CPU counter/register when no C-states are involved?
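For readers following along, here is a minimal sketch of how an interval-based clock can be derived from a hardware cycle counter. It assumes a counter that advances only while the core is actually being clocked (similar in spirit to the APERF register on x86). The thread does not confirm which register HWiNFO reads, and the function name and sample numbers are hypothetical.

```python
def effective_clock_mhz(cycles_start, cycles_end, t_start_s, t_end_s):
    """Average clock over an interval, in MHz.

    Assumption: the cycle counter only advances while the core is
    actually clocked, so halted or stalled time lowers the result.
    """
    elapsed = t_end_s - t_start_s
    if elapsed <= 0:
        raise ValueError("interval must have positive length")
    return (cycles_end - cycles_start) / elapsed / 1e6


# Hypothetical sample: 4.2e9 cycles counted across a 1 s interval.
print(effective_clock_mhz(0, 4_200_000_000, 0.0, 1.0))  # 4200.0
```

Because the result is an average over the whole interval, it can sit below the multiplier-based clock even when no single poll of the multiplier ever shows a lower value.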
 

Timur Born

Well-Known Member
Slap me, I am stupid. Unknowingly I still had custom per core CO values set in AGESA from an older test-run. So this explains the different CCD frequencies under P95 load.

If HWiNFO really relies on a hardware register then I have to find out what is the limiting factor. I fear it will be EDC, which is the one limiter I do not fully grasp (described as "peak" but 100% limited under constant load).
 

Timur Born

Well-Known Member
Here is another source of confusion that is easy to misunderstand:

...it has become insufficient to provide a good overview of CPU dynamics especially when parameters are fluctuating with a much higher frequency than any software is able to capture.

This seems to suggest that core multiplier measurements fail to provide correct numbers due to polling restrictions. But even at 50 ms (20x per second) and *constant* CPU limitations the minimum multiplier numbers don't catch a single drop that corresponds to anything close to "Effective Clocks". This leads me to believe that this is a restriction of the sensor not reporting any such multiplier drops due to limits instead of a polling problem?! Could you clarify this?

Furthermore I did a lot of testing measuring "Effective Clocks" with C-states being disabled. As it turns out I was not that stupid after all, because my CCDs (or CCX) indeed are limited to different effective clocks under certain load scenarios. At first I thought this to be due to CCD/CCX0 being about 1-2°C hotter than CCD/CCX1 and thus following a different internal curve. But a different load does not induce the same effective clock differences despite temps still being 2°C apart.

Even more curious I noticed that *non* AVX P95 SmallEST FFT load leads to both CCDs becoming nearly equal in limits and even found one PPT/TDC/EDC setting where CCD0 reached a higher effective clock than CCD1, despite all other tests being the opposite. Either way the differences are always per *whole* CCD, not whole CPU or single cores. This leads me to believe that PPT/TDC/EDC limits may *not* be CPU limits, but CCD (or CCX) limits instead?! What I mean is that the same limits are set for all CCDs, but the measurements/consequences of reaching said limits seem to be applied per CCD (or CCX)?!
 

Martin

HWiNFO Author
Staff member
Capturing the minimum clock is not possible in classic polling mode and low-power states, because reading the actual clock/ratio of each core/thread wakes the core up to run the code performing the readout.
Analyzing the idle state at different loads also requires taking into account several operating system threads that cause uneven load.
 

Martin

HWiNFO Author
Staff member
Current clock is determined as the actual multiplier * BCLK. So determining actual clock is in fact reading the multiplier per core/thread.
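As a small illustration of the formula above, here is a sketch that computes the reported clock from a multiplier and BCLK. The 100 MHz BCLK default is the nominal value and is an assumption here, and the function name and sample multiplier are hypothetical.

```python
def core_clock_mhz(multiplier, bclk_mhz=100.0):
    """Reported core clock: the multiplier read from the core times BCLK."""
    return multiplier * bclk_mhz


# Hypothetical reading: a 44.5x multiplier at the nominal 100 MHz BCLK.
print(core_clock_mhz(44.5))  # 4450.0
```

A value computed this way only changes when the reported multiplier or BCLK changes.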
 

Timur Born

Well-Known Member
What I mean is: Even at 50 ms polling the minimum Multiplier reading never hits as low as the Effective Clock reading. That is even with constant 100% load, C-states disabled and effective clock constantly decreased due to hitting one of the PBO limits.

EVyZgyA.png


At such high polling rate and such constant clock decrease I would expect the Multiplier sensor to catch at least some down-clocks, but it does not. Because of this I assume that this is not a problem of polling the sensor, but a problem of the sensor not reporting any of these limiter induced down-clocks at all?! On various forums people call this "clock stretching", but I do not know if this is indeed the proper technical term. I assume that the CPU just lowers the multiplier internally without reporting the change to its Multiplier register?!

Furthermore you can see that my CCDs are affected differently by hitting the same PBO limit, with CCD0 decreasing further than CCD1 under several specific loads that use all logical cores. This in turn leads me to believe that PPT/TDC/EDC limits may be per CCD (or CCX) limits instead of CPU limits. As in, the same limits affect different CCDs differently.
 

Zach

Well-Known Member
@Timur Born
First, what I have realized from using the sensors window for quite some time now is that setting the polling rate below 1000 ms messes with the CPU state. It keeps it above what it's supposed to fall to for multiplier, effective clocks and CPU voltage.
Second, did you try enabling "Snapshot CPU Polling" from main settings? It will disable all T1 effective readings but what will remain is more accurate.

Last but not least, Ryzen CPUs (as mentioned already) are highly dynamic chips. Meaning that the actual changing rate of the CPU states/frequencies and voltage is between 1 and 20ms (AMD statement) depending on the power plan you're using. I think there is no way (by software) to see such a changing rate, and be able to add up all CPU parameters without introducing the observer effect, even with snapshot CPU polling enabled.
 

Martin

HWiNFO Author
Staff member
Again - the classic mode of polling the clock is affected by the observer effect. Reading the actual multiplier induces load on the core that drives the active state and can also increase the P-State/clock.
As Zach already mentioned - enable the "Snapshot CPU Polling" which reduces the observer effect to a minimum possible.
 

Zach

Well-Known Member
Also, it's been observed that dual-chiplet CPUs show differences between the 2 CCDs/CCXs. It's not clear why, but for me it has to be differently binned CCDs/CCXs = different electrical characteristics = different frequencies.
 

Timur Born

Well-Known Member
Snapshot CPU polling makes no difference. Using different polling intervals makes no difference.

If the CPU changes multipliers between 1 and 20 ms then I still would expect to catch at least one such decreased multiplier at a polling rate of 50 ms (or longer polls), even more so if measurements run long enough.
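The expectation above can be put into numbers with a rough sketch. It assumes each poll is an independent sample and that the core spends a fixed fraction of its time at the reduced multiplier; both are simplifying assumptions, and the function name and sample values are hypothetical.

```python
def chance_of_catching_drop(fraction_low, polls):
    """Probability that at least one poll sees the reduced multiplier.

    Assumptions: polls are independent samples, and the core spends
    `fraction_low` (0..1) of its time at the reduced multiplier.
    """
    if not 0.0 <= fraction_low <= 1.0:
        raise ValueError("fraction_low must be between 0 and 1")
    if polls < 0:
        raise ValueError("polls must not be negative")
    return 1.0 - (1.0 - fraction_low) ** polls


# A 60 s run polled every 50 ms gives 1200 polls (integer math in ms).
polls = 60_000 // 50

# Even if the core were at the reduced multiplier only 1% of the time,
# missing it on every single poll would be very unlikely.
print(chance_of_catching_drop(0.01, polls))
```

Under these assumptions, never seeing a single reduced multiplier over a long run suggests the reported value itself does not drop, rather than the polls just missing it.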

If the observer effect is to be blamed then HWiNFO - doing a single-threaded poll - pulls the whole CPU out of its 100% TDC + EDC limit while 24 threads of P95 load keep running on all other cores. Indeed the average effective clock increases by 10 Hz when HWiNFO polling is changed from 5000 ms to 50 ms, but that still seems rather unexpected.

I will try to induce a temperature limit instead and measure again, because I can keep temps up even while HWiNFO does its short poll.
 

Timur Born

Well-Known Member
About the different binned CCDs: This is to be expected, with CCD0 being the "better" one and CCD1 being the "worse" one. According to CPPC the best core of CCD1 is worse than the worst core of CCD0. What is surprising, though, is that CCD0 is the one that is limited further down in effective clocks, even though it should have "better" characteristics.

And with Prime95 load we cannot even explain this by CPPC scheduling more load to CCD1, because P95 sets fixed affinities for its threads (contrary to what CB does).

One guess would be that the electrical characteristics are not so different between CCDs and that the more aggressive internal curves of CCD0 cause this more aggressive limiting. I will do some tests with different Curve Optimizer settings (including positive offsets) to see how this changes the limiting behavior.
 

Zach

Well-Known Member
That's interesting indeed.
Can we see a fully expanded sensors window while running P95? Or at least one with the following marked sections fully visible,
showing the "perf # x/x" numbering order and individual "Core Powers" among the others.

HWiNFO_16_05_2021.png
 

Timur Born

Well-Known Member
I will do that tomorrow, but I can already tell you that there are no C1 or C6 residencies at any time, because I specifically turned off C-states for these tests. It is 100% C0 state. And my screenshot already shows core usage to be 100%. "Perf # x/x" is not available here (and it is not just hidden).
 

Zach

Well-Known Member
I'm not so much interested in C-state residencies. Even if you hadn't disabled them, a P95 run on nCores would still have all cores at 100% C0 anyway.
What I wanted to see the most is the discrete clock, perf numbering order (AGESA and Win scheduler core selection, hence = x/x), effective clock, Power Reporting Deviation, individual core powers and CPU related temps (CCD1/2) all at once.

What exactly do you mean by:
"Perf # x/x" is not available here

You don't have these?

HWiNFO_16_05_2021_b.png
 

Timur Born

Well-Known Member
Nope, I don't have these, but CTR lists CPPC order and I already confirmed that Windows thread scheduler makes use of the correct order (both with and without core parking). I also mentioned earlier that Prime95 fixes affinities of its threads to specific cores (likely for cache coherency) and that CPPC is effectively overruled by this. So perf numbering order does not matter for P95.

Zsim423.png


Meanwhile I checked "Core Power" for each core to verify the following theory from an Anandtech article:

Some users might be scratching their heads – why is the second chiplet in both of these chips using less power, and therefore being more efficient? Wouldn’t it be better to use that chiplet as the first chiplet for lower power consumption at low loads? I suspect the answer here is nuanced – this first chipet likely has cores that enable a higher leakage profile, and then could arguably hit the higher frequencies at the expense of the power.

As it turns out, though, my weakest core of CCD1 is *not* the most power efficient, it's rather the least efficient of the CCD. Looking at all cores there does not seem to be a direct correlation between power efficiency and CPPC. But: the sum of cores on CCD1 indeed uses less power than the cores of CCD0, so that may contribute to limits applying differently to CCDs. It might be helpful if HWiNFO offered a power sum for each CCD/CCX instead of just all cores of the CPU, in order to compare CCD efficiencies/consumption more easily.
 

Martin

HWiNFO Author
Staff member
How come HWiNFO doesn't show the CPPC order? Is it shown in the main window under the CPU node?
It's not possible to show per-CCD power consumption as the rails are shared between CCDs and don't feature dedicated per-CCD current monitors.
 

Timur Born

Well-Known Member
Sorry, I found the perf sensors collapsed under "Core Clocks". I did not expect them to be there and usually don't care for clocks but prefer to watch multipliers instead.

What I meant by the per-CCD power was a simple sum of the per-core powers in blocks of CCDs. It's likely not worth the effort, though, as it may not come up often.
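Such a block-wise sum could look like the sketch below. It assumes the per-core powers are listed in core order with 6 cores per CCD, as on the 5900X; the readings are made-up numbers and the function name is hypothetical. This is a plain sum of per-core values, not a measured per-CCD rail power.

```python
def power_per_ccd(core_powers_w, cores_per_ccd=6):
    """Sum per-core power readings (watts) in consecutive blocks, one per CCD."""
    if cores_per_ccd <= 0:
        raise ValueError("cores_per_ccd must be positive")
    if len(core_powers_w) % cores_per_ccd != 0:
        raise ValueError("core count must be a multiple of cores_per_ccd")
    return [
        sum(core_powers_w[i:i + cores_per_ccd])
        for i in range(0, len(core_powers_w), cores_per_ccd)
    ]


# Made-up per-core readings in watts: cores 0-5 on CCD0, cores 6-11 on CCD1.
readings = [10.5, 10.0, 10.5, 10.0, 10.5, 10.0,
            9.0, 9.5, 9.0, 9.5, 9.0, 9.5]
print(power_per_ccd(readings))  # [61.5, 55.5]
```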

gXN68eu.png
 