Core effective clocks at 100% C0 state

Timur Born

Well-Known Member
So let me simplify my main question: Are the effective clocks reduced by means of the multiplier changing quickly or by other means (with C-states disabled)? If it is the multiplier, shouldn't we be able to catch that unless the multiplier register does not report it?
 

Timur Born

Well-Known Member
So at 100% C0 state the effective clock is lowered by the core being turned off in between, akin to low C-states (C1-C3 non deep)? Wouldn't that cause on/off current spikes? Does this reduce voltages or just stop the clock?

Thanks for pointing to custom sensors, that looks useful. :)
 

Martin

HWiNFO Author
Staff member
It reflects the AVERAGE clock during the refresh interval.
If the clock between (t) and (t - 100 ms) was 1000 MHz for 50 ms and 500 MHz for the remaining 50 ms, the resulting effective clock is 750 MHz.
If the clock between (t) and (t - 100 ms) was 1000 MHz for 50 ms and the core entered a stopped state (higher C-State) for the remaining 50 ms, the resulting effective clock is 500 MHz.
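
The two cases above are the same time-weighted average, with a stopped core counted as 0 MHz. A minimal sketch of that arithmetic (the function name and the segment format are illustrative assumptions, not how HWiNFO computes the value internally):

```python
def effective_clock(segments):
    """Time-weighted average clock over one refresh interval.

    segments: list of (clock_mhz, duration_ms) pairs.
    A core that is stopped (higher C-state) is entered as 0 MHz.
    """
    total_ms = sum(duration for _, duration in segments)
    if total_ms <= 0:
        raise ValueError("interval must have positive length")
    return sum(clock * duration for clock, duration in segments) / total_ms


# 1000 MHz for 50 ms, then 500 MHz for 50 ms
two_speeds = effective_clock([(1000, 50), (500, 50)])   # 750.0

# 1000 MHz for 50 ms, then stopped for 50 ms
half_halted = effective_clock([(1000, 50), (0, 50)])    # 500.0
```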
 

Timur Born

Well-Known Member
This is understood. The question is how is the clock lowered when all C-states are disabled? If it is by multiplier, should we not be able to catch that with a high enough polling rate and long enough measurement time?
 

Zach

Well-Known Member
How much of a difference are we talking about between the discrete clock and the effective clock when all C-states above C0 are disabled?
 

Martin

HWiNFO Author
Staff member
It should be by multiplier, but I wouldn't take the discrete clock values into account as those might not catch the right peak.
 

Timur Born

Well-Known Member
Difference varies by load profile, but in the screenshot on page 1 it is about 0.25 MHz on CCD0.

 

Timur Born

Well-Known Member
Martin said:
It should be by multiplier, but I wouldn't take the discrete clock values into account as those might not catch the right peak.
But the multiplier sensors do not measure any effective minimums at all. At a 50 ms polling rate and with long enough measuring, we should expect at least some multiplier drops to be measured, even if those are done within 1-20 ms.

This seems to suggest that either the multiplier register (?) does not reflect the internal drops or that the drops are done by other means than the multiplier (whatever "multiplier" means in technical practice). Forum people keep calling this "clock stretching", but the whole thing is rather wishy-washy on various forums. Thus my questions here to get more insight from HWiNFO's point of view.
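
A rough way to put numbers on the "we should expect at least some drops" argument: if each poll reads an instantaneous multiplier value, and polls are independent of when the drops happen, the chance of catching at least one drop grows quickly with the number of polls. This is only a sketch under those assumptions; the function name and the 1% figure are made up for illustration, and the whole estimate is moot if the register never shows the drops.

```python
def chance_to_catch_dip(dip_fraction, polling_ms, measuring_s):
    """Probability that at least one poll lands inside a multiplier drop.

    dip_fraction: share of wall time spent at the reduced multiplier (0..1).
    Assumes instantaneous reads and polls uncorrelated with the drops.
    """
    samples = int(measuring_s * 1000 // polling_ms)
    return 1.0 - (1.0 - dip_fraction) ** samples


# Hypothetical: drops take up 1% of the time, 50 ms polling, 60 s of measuring
p = chance_to_catch_dip(0.01, polling_ms=50, measuring_s=60)
```

With 1200 polls, even a 1% share of time at a lower multiplier would be close to certain to show up at least once, which is why never seeing a drop points to the register not reporting it or to a mechanism other than the multiplier.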
 

Timur Born

Well-Known Member
Over at the Overclock.net forum, someone posted a link to the "Adaptive Clocking in AMD's Steamroller" article describing how AMD's clock-stretching works.

So based on even the small cost of C1 (latency + enter/exit power cost), the explanation of AMD's clock-stretching, and our discussion here, my current understanding is: "Effective Clocks" (or RyzenMaster) read out a CPU counter register, which in turn is affected by both (optionally) enabled C-states and clock-stretching. In the absence of C-states (or at 100% C0 CPU load) only clock-stretching is accounted for.

This has some implications: when C-states are enabled then "Effective Clocks" are to be taken with a grain of salt for performance/vcore measurements, because the counter may account for pauses (C-states) instead of clock-stretching.
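
To illustrate the implication, here is a minimal sketch of how the two effects could be separated if C0 residency is known. It assumes that time outside C0 counts as 0 MHz in the effective clock (as in Martin's averaging example), that residency and effective clock cover the same interval, and a 100 MHz reference clock. The function names are hypothetical, and in practice polling latency and rounding will blur the result.

```python
def clock_while_in_c0(effective_mhz, c0_residency):
    """Average clock during the time the core was actually in C0.

    c0_residency: fraction of the interval spent in C0 (0 < r <= 1).
    """
    if not 0.0 < c0_residency <= 1.0:
        raise ValueError("c0_residency must be in (0, 1]")
    return effective_mhz / c0_residency


def stretch_fraction(effective_mhz, c0_residency, multiplier, bclk_mhz=100.0):
    """Share of the nominal clock lost by means other than C-states."""
    nominal_mhz = multiplier * bclk_mhz
    return 1.0 - clock_while_in_c0(effective_mhz, c0_residency) / nominal_mhz


# 100% C0, x40 multiplier, 3600 MHz effective: about 10% lost without C-states
loss_full_load = stretch_fraction(3600.0, 1.0, 40)

# 50% C0, x40 multiplier, 2000 MHz effective: fully explained by C-states
loss_half_idle = stretch_fraction(2000.0, 0.5, 40)
```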
 

Martin

HWiNFO Author
Staff member
I have heard about clock stretching before, but what applied to Steamroller might be different for Zen. I tried to gather more information about this, even in higher-grade documents, but no luck yet. I will try to check around even more.
 

Timur Born

Well-Known Member
The main reason why I am currently preferring clock stretching as an explanation for non C-state induced frequency limiting is that C1 likely is still slower and more costly than clock stretching. And even changing the multiplier could be slightly more costly, which may be the reason why AMD even came up with clock stretching in the past.

I found this: https://patents.google.com/patent/US8941420B2/en

In practice, latency (delay) can be incurred at each frequency transition as frequency-multiplier circuitry stabilizes the system clock at its new frequency following each frequency change.
This pertains to systems with broad input frequency ranges, though, while our CPUs only have to deal with one input frequency (100 MHz).

Conversely, injection-locked oscillators exhibit fast lock times, but tend to have a narrow input frequency range and thus limited frequency agility.
So unfortunately still not entirely clear for those of us who don't know how frequency multiplying is implemented in modern CPUs.
 

Timur Born

Well-Known Member
Here are my custom "sensors" summing up the already existing "Core X Power" sensors. This more easily demonstrates how CCD0 consumes more power under the equally distributed load (P95) compared to CCD1.

2j8GkFY.png


Since CCD0's effective clock is lower while consuming more power I will add average effective clocks and multipliers per CCD for even better visualization.

PS: I was surprised about the 2 ms profiling time under load, which drops to 1 ms idle. Does the custom sensor not just sum up the already available values, but poll each existing sensor again?
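
For readers who want to reproduce the per-CCD sums and averages outside of HWiNFO, the grouping described in this post can be sketched as follows. This is not HWiNFO's custom-sensor syntax; the function name, the assumption of 8 consecutive cores per CCD, and the readings are placeholders for illustration only.

```python
def per_ccd(readings, cores_per_ccd=8):
    """Return a (sum, average) pair per CCD for a list of per-core readings.

    readings: values indexed by core number; cores are assumed to be
    numbered consecutively per CCD (cores 0-7 on CCD0, 8-15 on CCD1).
    """
    groups = [readings[i:i + cores_per_ccd]
              for i in range(0, len(readings), cores_per_ccd)]
    return [(sum(group), sum(group) / len(group)) for group in groups]


# Placeholder values, not measurements: per-core power in watts
core_power_w = [10.0] * 8 + [8.0] * 8
ccd_power = per_ccd(core_power_w)   # [(80.0, 10.0), (64.0, 8.0)]
```

The same helper works for effective clocks or multipliers, where the average per CCD is the more useful of the two numbers.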
 

Zach

Well-Known Member
I guess it's common for 2-CCD Zen 3 CPUs to have differently binned chiplets.
Here is an example of a 5950X with a ~25% difference in power consumption between the cores of its two CCDs.
 

Martin

HWiNFO Author
Staff member
The custom sensor does not poll the sensors again, the increased time is most likely due to polling of registry data.
 

Timur Born

Well-Known Member
Thanks Martin. I am currently trying to create a "Clock Stretching" sensor that calculates C-states out of effective clocks, but polling latency and rounding errors seem to be a problem.

@Zach Yes, it's common, but one would expect the "better" CCD to be more power efficient, instead of the other way around. Anandtech also pondered about that.

https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/8 said:
Some users might be scratching their heads – why is the second chiplet in both of these chips using less power, and therefore being more efficient? Wouldn’t it be better to use that chiplet as the first chiplet for lower power consumption at low loads? I suspect the answer here is nuanced – this first chipet likely has cores that enable a higher leakage profile, and then could arguably hit the higher frequencies at the expense of the power.
This seems to be a per CCD correlation, though, not a per Core one (my "worst" core is not the most efficient).
 

Timur Born

Well-Known Member
Is it expected that HWiNFO displays different effective clocks than RyzenMaster when cores are not 100% in C0 state? RM does not offer averages over time, so it's not easy to compare, but the current readings are different. Sometimes RM is higher, sometimes lower, but for a specific load it seems consistent (mostly higher or mostly lower, not shifting much).
 

Martin

HWiNFO Author
Staff member
It can be different if the Snapshot Polling mode is not enabled. If it is enabled, then the difference is only due to a different interval and fluctuations.
 

Timur Born

Well-Known Member
HWiNFO reacts more strongly to C-states than RM (which hardly reacts at all). In the following screenshot and animated GIF (attachment) both are set to a 1 second polling interval.

C0 Residency, Core 0 Effective Clock, RM Core 0

8SLLenE.png


That being said, I actually prefer HWiNFO's behavior, but only if HI counts time spent in C-states as 0 MHz (zero). Otherwise it's impossible to work out how much of the reduction is real clock stretching (not induced by C-states). But it does not seem like HI does that, or does it?
 

Attachments

  • HI_vs_RM.gif

Martin

HWiNFO Author
Staff member
Sorry, I can't follow what you're asking.
HWiNFO is highly optimized for sensor polling and best results are obtained with the Snapshot Polling mode enabled. I also performed some additional optimizations for polling latency in the last version 7.04 released a few hours ago.
 