[SOLVED] Please Help: Random reboots and crashes

jhelmer25

Member
Hello all, I am turning to this community because I am at a total loss.

I recently built a new PC that randomly reboots and/or crashes fairly frequently. It reboots/crashes without warning (no BSOD, and seemingly no Windows Event Viewer entries that point to a cause beyond the "Critical Kernel-Power (41)" error logged as a result of the crash itself). There is no consistency to when it crashes: I went 5 days with no crash, and just had 2 crashes today.
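
For anyone who wants to sanity-check my read of the logs, this is roughly how I pulled those Kernel-Power entries (a quick sketch wrapping the built-in wevtutil tool in Python; bump /c: to see more events):

```python
import subprocess

# Query the System log for Kernel-Power (EventID 41) entries, newest first.
# wevtutil ships with Windows; /c:5 caps the output at the 5 most recent hits.
cmd = [
    "wevtutil", "qe", "System",
    "/q:*[System[Provider[@Name='Microsoft-Windows-Kernel-Power'] and (EventID=41)]]",
    "/f:text", "/c:5", "/rd:true",
]
print(subprocess.run(cmd, capture_output=True, text=True).stdout)
```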

These crashes do not appear to be related to system load. They occur when I am gaming, but also when I am just browsing the web (e.g. watching YouTube). I cannot say with 100% certainty that it has ever crashed at idle, or when there wasn't some form of video playing; this would be difficult to validate because I don't use the PC for other purposes very frequently, or for long enough at a stretch to trigger a crash.

Things I have tried:
- Disabling automatic restarts on system failure, in hopes of getting a better read (it still restarts without notification)
- Undoing/redoing all cable connections
- Getting all OS + driver updates (latest drivers direct from manufacturer)
- Updating to the latest BIOS
- Replacing cables and reseating hardware
- Running a complete memtest86 (100% pass with 0 failures)
- Disabling all "performance" features in BIOS for vanilla boot configuration
- Replacing PSU with a brand new unit

I have been actively monitoring the temperatures and load for the CPU, GPU, memory, and storage. Everything is running within a reasonable range. I have attached a log file captured through the latest crash, which occurred after a gaming session (no games were running during the crash). If you go back ~30m you will see the system under load while playing Cyberpunk at max settings. The metrics all look good, and since the crash occurred while everything was in even more optimal ranges, it is all the more mysterious.

I used GenericLogViewer (v6.4) to look at the graphs from 1 hour prior to the crash (see screenshot), and nothing looks anomalous, even after clicking through every sensor metric.
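
In case it helps anyone reproduce that click-through programmatically, here is roughly the check I ran over the CSV (a sketch; the column names, date format, and encoding are assumptions based on my HWiNFO export):

```python
import pandas as pd

# Flag any sensor whose session-wide peak landed in the final pre-crash hour,
# instead of eyeballing each graph one at a time.
log = pd.read_csv("hwinfo_log.csv", encoding="latin-1", low_memory=False)
log["Timestamp"] = pd.to_datetime(log["Date"] + " " + log["Time"], dayfirst=True)

last_hour = log[log["Timestamp"] >= log["Timestamp"].max() - pd.Timedelta(hours=1)]
for col in log.select_dtypes("number").columns:
    if last_hour[col].max() >= log[col].max():
        print(f"{col}: session peak of {log[col].max()} occurred in the final hour")
```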

I am truly hoping that someone in the community can spot something I am missing, or offer other suggestions. The only remaining things I can think to do are replacing the MOBO, GPU, and CPU, but I am truly hoping it doesn't come to that.

System specs:
- Motherboard: ASUS TUF Gaming B650-PLUS WiFi Socket AM5
- CPU: AMD Ryzen 7 7800X3D
- GPU: ZOTAC Gaming GeForce RTX 4070 Ti AMP Extreme AIRO
- RAM: CORSAIR VENGEANCE RGB DDR5 RAM 64GB (2x32GB) 6000MHz CL30
- Storage: Corsair MP700 2TB PCIe Gen5 x4 NVMe 2.0 M.2 SSD
- PSU: be quiet! Dark Power 13 1000W Quiet Performance Power Supply | 80 Plus Titanium Efficiency | ATX 3.0 | PCIe 5
- OS: Windows 11 Home Edition (although this issue also occurred with W10 before I upgraded).

PLEASE HELP! I am going crazy because the $$$ for this system is causing so much anxiety and I don't know what to do. Whoever solves this, I will buy you lunch ;-)

Thanks in advance.
 

Attachments

  • graphs.png (810.1 KB)
  • hwinfo.zip (1 MB)
First thing that came to mind was a transient-load reboot. But then I noticed something when looking at your specs + troubleshooting steps.

7000X3D chips are sensitive to SOC voltage. When an XMP/DOCP or EXPO RAM profile is activated, the motherboard firmware will in most cases raise the SOC voltage (ASUS boards do). For a 64GB 6000MHz CL30 kit, the BIOS could very well have raised this voltage to the AMD 7000X3D limit of 1.30v. If for some reason the actual SOC voltage fed to the CPU by the VRM goes above 1.30v, the newer BIOS versions will trigger OCP (over-current protection)... Which is a good thing, because the alternative on the older BIOS versions (as in the SOC voltage debacle) was that the CPU chip would literally melt. When OCP kicks in, you get a hard reboot just like you described.

What is your SOC voltage set to in the BIOS when DOCP is activated? If it is at 1.3v, lower it to 1.25v manually.

If that does not fix the issue, I would test the system with DOCP/XMP deactivated, just to see whether the RAM "overclock" is the issue or it's something else.
 
@Dom Thanks for the quick reply. I am impressed by your wealth of knowledge for such a niche thing.

When I disable/restart/enable the DOCP profile (I had it enabled, but wanted to cycle it to get a better understanding of which BIOS settings are changed), the following settings are modified:

IMG20231203185342.jpg

Although there is nothing about SOC voltage mentioned, I do see the VDD/VDDQ voltages changed from "Auto" to 1.40v -- which is above the 1.30v limit you mentioned for the AMD 7000X3D. I apologize for being so ignorant, and I don't know whether this is related, but I thought it worth sharing. I did not want to modify these voltages without first getting a better understanding of the impact. I am intentionally trying to keep things mostly vanilla (the DOCP profile is the only setting changed in the entire BIOS), because I want to eliminate as many variables as possible.

When I then restart and load into the BIOS again, I can see the SOC voltage reported as 1.272v (or 1.232v, depending on which number you look at). This is below the 1.30v limit. Again, my knowledge is essentially zero here, and I don't know whether this value is constant or could fluctuate to something higher (and dangerous).

IMG20231203185520.jpg

In comparison, w/o the DOCP profile, the voltage is reported at < 1.1v.

In the interim, I will disable the DOCP profile in favor of "Auto" timings. Unfortunately these timings don't come close to what the RAM is capable of (40-40-40-77 vs 30-36-36-76), but I much prefer system stability to squeezing out more juice.

If the system is stable without any unexpected reboots or crashes with DOCP disabled, then your intuition is spot-on.

--------------------------------------------

On a related note, I am surprised there is no way, within the BIOS, to know whether OCP was triggered. It seems like some BIOS logs would be super helpful here for identifying the root cause.

Once again, thanks a ton for your input here. I have spent many stressful hours on this, and your response is the first suggestion that has shown any promise of resolving it.

--------------------------------------------

EDIT: I am not sure if this is relevant, but the CPU VDDCR_SOC Voltage (SVI3 TFN) reported by HWiNFO in the attached log file from the original post indicates that the voltage never went above 1.235v during the reporting period. I am not sure whether there happened to be a "spike" right at the end that was not recorded (HWiNFO presumably writes to disk on some interval, and would not have captured the last "second" before the crash), but the general range seems to fall within a healthy threshold, and it aligns with what I see for the SOC Local Voltage Reading in the BIOS (above) after enabling DOCP:

Screenshot 2023-12-03 191927.png
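
For anyone who wants to run the same check on their own log, this is essentially what I did (the column header is copied from my CSV and may differ on other boards):

```python
import pandas as pd

# Pull the peak SoC voltage out of the HWiNFO CSV and compare it to the
# 1.30v limit discussed above.
log = pd.read_csv("hwinfo_log.csv", encoding="latin-1", low_memory=False)
vsoc = pd.to_numeric(log["CPU VDDCR_SOC Voltage (SVI3 TFN) [V]"], errors="coerce")
print(f"Max VSOC: {vsoc.max():.3f} V (limit: 1.300 V)")
```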
 
Hi JHelmer,

DRAM VDD/VDDQ are your RAM voltages, and they are expected to go up to 1.4v based on your XMP profile. CPU VDDIO/MC is the CPU memory controller voltage, and it is also expected to go up; the BIOS raises it to match the DRAM VDD/VDDQ voltages. The 3 are expected to be in sync (you can run them async, but that is not the default behavior). 1.4v is on the high side, but not in the danger zone.

Over-current protection is basically the motherboard telling the system to shut down NOW to prevent damage, so there is no way it would take the time to log anything when it happens. The symptom is pretty simple: a hard reboot out of nowhere.

My bad, I just realized there is a log viewer for those CSVs. Well, 1.235v is within spec, and the BIOS setting of 1.272v is within the safe range. Ideally it would sit between 1.22v and 1.25v during operation, which is pretty much what your HWiNFO log reads.

I know it sucks, but I would recommend running your RAM without XMP for a couple of days and seeing if that solves the issue. If it does, then you know you have either an unstable memory kit (it happens, regardless of the specs they are sold at) or an unstable memory controller. Seeing that the XMP profile already raised the VDD/VDDQ/VDDIO-MC voltages to 1.4v, I would not recommend pushing them any higher.

I would recommend either lowering the RAM speed manually (you would need to do some tinkering) or swapping the kit for something else. On that note, I would recommend an EXPO kit instead of XMP (https://www.corsair.com/us/en/explorer/diy-builder/memory/amd-expo-vs-docp). 6000MHz 1.35v kits are the sweet spot; this is the kit I have (2x16): https://www.amazon.com/G-Skill-Trident-288-Pin-CL30-38-38-96-F5-6000J3038F16GX2-TZ5NR/dp/B0BF8FVLSL, but they also make 2x32 (1.4v) kits: https://www.amazon.com/G-Skill-Trident-288-Pin-CL30-40-40-96-F5-6000J3040G32GX2-TZ5N/dp/B0BJP3MRW1.

Also, I do see your storage temp climbing pretty fast at idle. I am not sure why; 56c is still well within spec, but I would guess the drive is located under the motherboard heatsink right behind the graphics card, whose fans stop at idle. 56c at idle is a bit high for my taste, but most likely not the issue here.

Apart from this: are you using a riser for the graphics card? Most likely not, but just asking.
 
@Dom - I appreciate the detailed summary of information for my edification. This all makes sense.

In the interest of gathering more data and comparing w/ and w/o DOCP enabled:
  • Last night I ran a Prime95 (blended) test to put load on the CPU and RAM, with DOCP disabled. It ran for 10 hours and produced 0 errors across all workers.
  • This morning I ran the same test with DOCP enabled. Although it didn't result in a system crash, sure enough, 4 of the 8 workers had already encountered fatal errors at 5 hours in (less than half the duration of the previous run):
Screenshot 2023-12-04 122233.png

FATAL ERROR: Rounding was X, expected less than 0.4
Hardware failure detected [...]
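
To tally the failures without scrolling through each worker window, I ran a quick scan over the results log Prime95 writes in its folder (a sketch; the install path is an assumption):

```python
from pathlib import Path

# Collect the FATAL ERROR lines Prime95 appends to results.txt.
results = Path(r"C:\prime95\results.txt").read_text(errors="ignore").splitlines()
fatals = [line for line in results if "FATAL ERROR" in line]
print(f"{len(fatals)} fatal error(s) logged:")
print("\n".join(fatals))
```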

In my mind, this 100% confirms that your intuition was right, and that the faster RAM timings are causing the issues (in some way).

Here is a summary of my understanding (please correct me if I am missing something): the faster timings (i.e. CL30) require a higher SOC voltage. The BIOS is (correctly) increasing that voltage from 1.0XX to 1.2XX, and (correctly) keeping it below the 1.3v limit of the Ryzen chip. This likely means that OCP is not being triggered, and something else is going on.

The possibilities you mentioned in this case are:
  1. Unstable memory chip (need new RAM hardware)
  2. Unstable memory controller (need new MOBO hardware)
Is it also possible that the stability issues arise because the RAM with the faster timings requires a higher SOC voltage? Or should it only be a function of VDD/VDDQ & VDDIO/MC?

If the low SOC voltage is a potential issue, it (naively) seems like there is also a possibility (3): the RAM and MOBO are doing everything right (i.e. neither (1) nor (2) is the issue), but the Ryzen chip's voltage limit keeps the SOC voltage too low for the RAM to be stable at the faster timings. In other words, it is essentially impossible to run CL30 timings with the Ryzen chip without crashes. If that is the case, it seems like swapping out the RAM could result in the same issue. WDYT?

Also: Does the Prime95 test run give you any more intuition as to whether it is (1), (2), or something else? For example, if these errors definitively point to the memory controller as the primary issue, then it likely would not help to swap out the RAM.

As a next step, I have purchased the RGB version of the RAM you recommended, which should arrive by tomorrow night. Once it comes, I will run the Prime95 blended test again with the new DIMMs and EXPO enabled. Then if it still looks unstable, and there are no other possibilities to consider, it sounds like we are looking at a full rebuild with a new mobo :-(

------------------------------

Thanks for pointing out the storage temp concerns. You are right, it is under a heatsink, and although it is not directly behind the GPU, it is near enough that it might be affected by ambient temperatures:

Screenshot 2023-12-04 121855.png

And no riser is being used.

I will keep monitoring the M.2 temps and add some more case fans if it continues to be an issue. Since all the other temps are looking really good even under load, this is not my immediate concern; one thing at a time :) But I will keep it in the front of my mind.

------------------------------

Thanks so much for the continued dedication to help me solve this. You are a true gentleman.
 
Hi Jhelmer,

Ok so this is definitely a step forward.

The possibilities you mentioned in this case are:
  1. Unstable memory chip (need new RAM hardware) Correct, since the VDD/VDDQ RAM voltages are pretty much at the safe limit.
  2. Unstable memory controller (need new MOBO hardware) Incorrect; the memory controller is part of the CPU chip itself. So that would mean a new CPU with better binning, or more CPU VDDIO/MC voltage, which is not an option since that voltage is also pretty much at the safe limit.
SOC voltage is a whole different thing; it has to do with your cores (another part of the CPU).

I would stick with option 1, since 2 is a rabbit hole.

It's all interconnected: the speed, the tightness of the timings, the size (32 vs 64GB), and the quantity of RAM sticks all determine how much voltage the kit requires to run, and how much pressure it puts on the controller (which is inside the CPU). That is why the kit you just purchased needs more voltage to run (1.4v vs the 2x16 kit that runs at 1.35v), and why all 3 voltages (VDD/VDDQ & VDDIO/MC) are kept in sync: all 3 need to be raised depending on those 4 factors (speed, timings, size, quantity of sticks). So my educated guess is that changing the RAM to a kit that works at its advertised speed should most likely fix the issue. If not, then... yeah, you guessed it, CPU. Mobo is less likely. Speaking of the mobo, I just realized I forgot to mention it: you could try different RAM slots on it. It could also be a defective RAM slot.

As for your Prime95 test, looking at the FFT sizes that failed (I assume you were running Prime95 in blend), it looks like it failed on FFTs ranging from 2K to 4K, which is pretty much on the large side. So again, that points at the memory controller or the RAM.

Last but not least, it could be a defective GPU, but I very much doubt it. When a GPU fails, it gives a blue screen or a black screen. Reboots can happen, but they are most of the time due to the PSU being insufficient or defective, and you ruled that one out since it is new.
 
Thanks for clearing up my misunderstanding. Glad things are all pointing to RAM right now! It feels like it is being narrowed down each time you post :-)

I just realized I forgot to mention it: you could try different RAM slots on it. It could also be a defective RAM slot.
Are you suggesting this is possible even though the slower timings resulted in 0 errors? i.e. could a defective RAM slot be sensitive to the timings? If so, I will try reducing to a single stick and running a bunch of blended tests with each individual stick in the different RAM slots.

I will report back after I run tests with the new RAM.
 
Are you suggesting this is possible even though the slower timings resulted in 0 errors? i.e. could a defective RAM slot be sensitive to the timings?

I doubt it would matter, since it seems like disabling DOCP fixed the issue (well, as far as 10 hours of blend Prime95 can be called stable, but that is another debate I don't feel like going into). But who knows, it might be a possibility.

I would run Prime95 with large FFTs instead of blend. Blend is a good general stress test, but you need to run it for much longer to pinpoint an issue. In your case you want to stress the RAM and the controller, so I would go with large FFTs.
 
I have some compelling news to share.

I popped in the new RAM last night, reset CMOS, and then enabled the EXPO profile.

I then ran an overnight test using large FFTs as you recommended. It ran for 10.5 hours with no failures*!

Here's where the asterisk comes in: when I turned on my monitors this morning, I noticed the system was locked up. The timestamps on the worker-thread tests were about an hour in the past; roughly 11.75 hours had actually elapsed, but the output stopped at 10.5 hours. I was able to move the mouse, but nothing was clickable, and my keyboard didn't even work to open Task Manager, for example.

SO - the fact that there was no issue for 10.5 hours is very compelling. The fact that it was then locked up for the following ~1.25 hours is less compelling. But since there were no errors, I am going to chalk it up to an anomaly. Maybe it was the Windows night light or something else kicking in at 5:30am.

Tonight I will run one more test (blended this time, just for the sake of an A/B test) and run with these settings for about a week. If there are no crashes and no lock ups, then the issue is solved!

EDIT: One thing that is very suspicious, and that I don't like at all, is that the CPU is running much hotter now - roughly 20c hotter than before - even under very low load. For example, in this screenshot, all I am doing is verifying the integrity of game files in Steam, and the temp jumps up to 65c:

Screenshot 2023-12-06 072312.png

Meanwhile, the CPU load is only ~15%. Compare that to the original data I posted, where even under high load (gaming) the CPU temp averaged around 45c and never topped 60c.

Any ideas what would cause the temp to be so high even when the usage is low? This seems really funky. Launching nearly any program now causes these high temps, and my chassis fans spin up to nearly max. I understand that this is still within a "safe range", but the fact that it was only 45c before tells me something is up. I doubt it is any of the standard stuff (dust, thermal paste, etc.) because I have had this computer for only 1 month, and it was running in much more reasonable ranges right before swapping the RAM. I don't think the RAM itself would create the heat - hence my confusion.
 
Very hard to tell.

Higher temps usually come down to either a cooling issue (cooler mounting, fans, thermal paste) or an increase in power (higher voltage or current = higher power consumption).

What are your voltages in the BIOS after enabling the EXPO profile?

Also, the freezing after 10hrs is not good. Have you looked at your Event Viewer logs for errors? Is this a fresh install of Windows? Are you running anything else in the background?
 
Good questions.

Higher temps usually come down to either a cooling issue (cooler mounting, fans, thermal paste) or an increase in power (higher voltage or current = higher power consumption).
I can do an A/B test with the old RAM, but the cooler/fans/thermal paste are presumably in the same state they were in 3 days ago when I opened this thread. Maybe all the stress testing wore down the thermal paste in some way? I can try reapplying it, but it's such a new system, and it was fine just a few days ago, so I am skeptical.

What are your voltages in the BIOS after enabling the EXPO profile?
Based on what you taught me, they are looking good:

Screenshot 2023-12-06 152422.png

Also, the freezing after 10hrs is not good. Have you looked at your Event Viewer logs for errors?
There is nothing suspicious around the time of the "freeze", which I assume to be ~5:35am because that is the last timestamp I saw in the P95 output, except for these 7 eerily timed info messages:
Screenshot 2023-12-06 153009.png
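
For reference, this is roughly how I pulled everything the System log recorded around that window (another wevtutil sketch; the timestamps are illustrative, not my exact window, and TimeCreated is matched in UTC, so shift them from local time):

```python
import subprocess

# Dump every System-log event inside a time window around the freeze.
query = ("*[System[TimeCreated[@SystemTime>='2023-12-06T10:30:00.000Z' "
         "and @SystemTime<='2023-12-06T10:45:00.000Z']]]")
cmd = ["wevtutil", "qe", "System", f"/q:{query}", "/f:text", "/rd:true"]
print(subprocess.run(cmd, capture_output=True, text=True).stdout)
```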

Is this a fresh install of Windows?
I intentionally haven't touched the OS since the start of this debugging journey on Sunday (although automatic updates are on), specifically because I am trying to limit the variables for A/B testing.

Are you running anything else in the background?
Nothing comes to mind other than the night light. There are also some Asus/Corsair/Nvidia services that I may be able to disable (iCUE comes to mind as a potential issue, for example). I could kill Armoury Crate, iCUE, and everything else possible to benchmark moving forward, but this feels like grasping at straws.

-----------------------------------------------------------------------

Stepping back a level, I think there are now 3 active issues at play that we have discussed:

1. The random reboots/crashes - this is my ultimate/main concern, and I think the 10 hours of testing with no errors (vs the 5 hours of testing with 4/8 workers failing) proves that the previous RAM was faulty. So there may not be further cause for concern here. If I am able to use the system for a week with no issue and EXPO enabled, then this confirms your intuition and debugging that the Corsair Vengeance RAM was bad.

2. Much hotter temperatures at low load since replacing the RAM. This is definitely a separate, net-new problem and is cause for concern. I would not have expected the temps to climb so much higher on average from changing the RAM. As a test, I will put the old RAM back in and see if it indeed drops back down to the mid 40s instead of the mid 60s. It seems unlikely, but it is the only factor I know to have changed since the last run (outside of resetting the CMOS).
EDIT: For congruency, I went ahead and swapped in the old (broken) RAM to see if the temps went back down. They did not; the idle and load temps are (roughly) the same with both sets of RAM. Although this is not ideal, the fact that it is consistent is good, and the timing of the temperature spikes therefore seems coincidental (maybe the stress tests also stress tested the thermal paste). Even the "spikes" are well within operating temperature; the system idles at around 45c and spikes at around 70c - not ideal, but acceptable. For the sake of converging on a solution in this thread, as opposed to branching out and creating more problems, I will investigate this separately and apply the standard tricks (more/better fans, different PWM profiles, reapplying thermal paste, etc.); a quick sketch of how I am comparing the two logging sessions is at the end of this post. It is just so odd that it went up so significantly through this debugging effort, but I digress.

3. The random freeze after 10 hours of Prime95 with large FFTs. This is certainly bad, but it seems really hard to track down. Since it has only happened once, I don't want to take up too much of your time bottoming out on this.
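
And here is the quick comparison sketch mentioned in point 2 (file names and the sensor label are placeholders standing in for my actual logs):

```python
import pandas as pd

# Average a CPU temperature column across the two HWiNFO sessions
# (old RAM vs. new RAM) to see if the kits really behave the same.
def avg_temp(path, col="CPU (Tctl/Tdie) [°C]"):
    log = pd.read_csv(path, encoding="latin-1", low_memory=False)
    return pd.to_numeric(log[col], errors="coerce").mean()

for name in ("hwinfo_old_ram.csv", "hwinfo_new_ram.csv"):
    print(f"{name}: {avg_temp(name):.1f} c average")
```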
 
1. The random reboots/crashes [...]
2. Much hotter temperatures at low load since replacing the RAM [...]
3. The random freeze after 10 hours of Prime95 with large FFTs [...]

Hi Jhelmer,

Sorry it took me a bit of time to answer; I've been quite busy. OK, so here are my answers:

1- Agreed, real use is by far the biggest/most reliable stability test. In real use, transient spikes (load going from low to high to low very quickly) occur more often than any synthetic load can simulate (OCCT has some options, but it can be tedious to use if you don't know how). Make it a month of gaming in CPU-intensive games and call it a day... Enjoy your system.

2- Reviewing the temps you originally posted, they are quite normal for a 7800X3D CPU. The AM5 IHS (the CPU heat spreader you apply thermal paste on) is very thick in order to preserve compatibility with AM4 coolers (see this video). This is even more true for X3D CPUs, since the extra V-Cache makes them run even warmer. Nothing to worry about. My 7800X3D has similar temps, and I have a full custom loop with two 280mm radiators. The only way to lower them would be delidding and going direct-die cooling, which I advise against unless you really know what you are doing (and are willing to lose your warranty).

3- Up to you, but yeah, it could be unrelated.

At this point, if you still have issues, I would suspect the CPU memory controller is the cause (sadly, not all CPUs have the same silicon quality; just google "silicon lottery"). Dialing the memory timings back a bit (e.g. CL30 to CL32) might be something to look into, or even better, learning to manually adjust memory speed/timings could be a project of interest ;)
 