A Stability Problem Is Brewing With Nvidia RTX 3080, 3090 GPUs

It’s not unusual to see some users posting problems when a new GPU or CPU launches, but early data suggests that some RTX 3080 and RTX 3090 GPUs become unstable when they boost close to or above 2GHz. Most reports have focused on 2GHz, but at least one user said his GPU was running below that clock when it crashed.

Reports have begun popping up online of stability problems likely tied to GPU boost frequencies. Known-affected models include the Zotac RTX 3080 Trinity and the MSI RTX Ventus 3X OC. MSI’s Gaming Trio is mentioned, as is EVGA’s RTX 3080 XC. These reports make approximately the same claim: The GPU crashes in one or more titles, often at around 2GHz. Reducing clock speeds can resolve the problem.

Igor of Igor’s Lab believes he has an explanation for the problem. After referencing Nvidia’s documents for the RTX 3080’s PCB design, he writes:

The BoM and the drawing from June leave it open whether large-area POSCAPs (Conductive Polymer Tantalum Solid Capacitors) are used (marked in red), or rather the somewhat more expensive MLCCs (Multilayer Ceramic Chip Capacitor). The latter are smaller and have to be grouped for a higher capacity.

[Image: Bottom-POSCAP-vs-MLCC]

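The areas blocked off with boxes and the set of 10 green rectangles are all power rails, and the capacitors on them filter the power delivered to the GPU. If the RTX 3080 and RTX 3090 are being fed dirty power, meaning power that carries too much electrical noise, it would explain why the cards destabilize and crash at high frequencies.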

Zotac, for example, used POSCAPs for all six rails.

Nvidia used four POSCAPs and two groups of MLCCs, as shown below. Igor notes he’s been unable to crash his Founders Edition GPU, which suggests that an all-POSCAP layout may be the cause of the crashes. We don’t know for certain that the 3080 FE doesn’t have this problem, but so far nobody reporting an issue appears to have an RTX 3080 FE.

[Image: Founders-1]

We don’t know how many cards are affected by this issue. Even if using POSCAPs instead of MLCCs is the cause of the problem, it doesn’t automatically follow that every POSCAP-equipped card has a problem. It seems more likely that a certain percentage of POSCAP cards would have issues than that Nvidia would fail to notice it had misinformed all of its AIBs and handed them bad manufacturing documents. The former could cause some elevated repair rates and grumbling, while the latter would require the repair of every RTX 3080 and RTX 3090 manufactured to date.

If you run into this problem, the first thing to try is lowering the clock speed on your GPU. Users are reporting that a reduction of 80-100MHz consistently resolves the crashes. While overclocking your GPU is dangerous if you don’t know what you’re doing, it’s hard to muck up a part by running it slower.

This situation is going to evolve over the next few days as Nvidia discusses it with AIBs and investigates how best to fix the problem. I advise keeping an open mind about what the cause might be, not because I doubt Igor, but because manufacturing analysis sometimes reveals additional problems that weren’t previously known and couldn’t be seen from a surface-level examination.

No matter what the problem turns out to be, Nvidia or its partners have a mess they’ll need to clean up while also dealing with extreme card shortages. We’ll keep you posted on whatever the cause turns out to be.