How Do You Cool a 2300 W GPU? NVIDIA's cooling problem and what it could mean for GPU design.
Rumors suggest that the next-generation Rubin Ultra AI GPUs will reach a total graphics power (TGP) of up to 2300 W, which would demand a fundamentally different approach to cooling. At such power levels, conventional heat sinks, vapor chambers, and even simple liquid cooling may not be able to keep the chip within safe and efficient thermal limits.
Why traditional cooling could fail.
At multi-kilowatt power levels, the thermal resistance between the silicon die and the coolant becomes the main enemy. The more intervening layers there are (TIMs, spreaders, cold plates, piping), the higher the temperature rise. Common liquid cooling systems rely on cooling blocks attached to a heat spreader, and they may not be able to remove enough heat without creating steep temperature gradients or hotspots. As commentary in the semiconductor engineering field suggests, cooling is becoming a design bottleneck for scaled-up chips.
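To make the layer argument concrete, here is a minimal sketch of a series thermal-resistance model, where the temperature rise is simply power multiplied by the sum of the layer resistances. The function name, the coolant temperature, and every resistance value are hypothetical placeholders chosen for illustration, not measured figures for Rubin Ultra or any other GPU.

```python
# Illustrative series thermal-resistance model. All resistance values below
# are hypothetical placeholders, not measured data for any real GPU.

def die_temperature_c(power_w, coolant_c, resistances_k_per_w):
    """Steady-state die temperature for heat flowing through layers in series."""
    total_resistance = sum(resistances_k_per_w.values())  # K/W, series sum
    return coolant_c + power_w * total_resistance


POWER_W = 2300.0    # rumored TGP quoted in the article
COOLANT_C = 35.0    # assumed coolant temperature

# Assumed stack: die -> TIM -> heat spreader -> TIM -> cold plate -> coolant
conventional = {
    "tim1": 0.005,
    "heat_spreader": 0.003,
    "tim2": 0.005,
    "cold_plate": 0.010,
}

# Assumed stack with the spreader and second TIM layer removed
direct_to_chip = {
    "tim1": 0.005,
    "cold_plate": 0.010,
}

for name, stack in (("conventional", conventional),
                    ("direct-to-chip", direct_to_chip)):
    temp = die_temperature_c(POWER_W, COOLANT_C, stack)
    print(f"{name}: {temp:.1f} C")
```

With these assumed numbers the conventional stack lands near 88 °C and the shorter stack near 70 °C. The absolute values mean little, but the sketch shows why removing even a few thousandths of a kelvin per watt matters at 2300 W.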
Microchannel cold plates and direct-to-chip cooling.
In response, NVIDIA is reportedly considering microchannel cold plates and direct-to-chip cooling (that is, running coolant right beside or even directly over the active silicon) for these Rubin Ultra GPUs. Microchannels, small coolant channels cut into metal blocks, offer very high heat transfer coefficients and lower thermal resistance. They come with tradeoffs, however: higher pressure drop, more complicated fabrication, a risk of clogging and fouling, and the difficulty of making sure every part of the die receives the same flow.
Recent thermal-hydrodynamic modeling work stresses that a microchannel design has to balance flow resistance against heat transfer benefits; as the channel geometry becomes more aggressive, the pressure losses rise nonlinearly. At the same time, cold plate reviews warn that fouling, dryout, and manufacturing tolerances remain significant barriers to widespread adoption.
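The nonlinear pressure penalty can be illustrated with the Hagen-Poiseuille relation for fully developed laminar flow in a circular tube, where pressure drop scales with the inverse fourth power of diameter. This is a simplified sketch: real microchannels are usually rectangular and need a geometry-specific friction factor, and the flow rate, channel length, and diameters below are assumed values, not figures from any NVIDIA design.

```python
import math

# Simplified sketch: laminar flow in a circular channel (Hagen-Poiseuille).
# Real microchannels are typically rectangular; all numbers are assumptions.

WATER_VISCOSITY_PA_S = 1.0e-3   # approximate dynamic viscosity of water
WATER_DENSITY_KG_M3 = 1000.0    # approximate density of water


def laminar_pressure_drop_pa(flow_m3_s, diameter_m, length_m,
                             viscosity_pa_s=WATER_VISCOSITY_PA_S):
    """Pressure drop along one circular channel in laminar flow."""
    return (128.0 * viscosity_pa_s * length_m * flow_m3_s
            / (math.pi * diameter_m ** 4))


def reynolds_number(flow_m3_s, diameter_m,
                    density_kg_m3=WATER_DENSITY_KG_M3,
                    viscosity_pa_s=WATER_VISCOSITY_PA_S):
    """Reynolds number, to check that the laminar assumption is plausible."""
    return (4.0 * density_kg_m3 * flow_m3_s
            / (math.pi * viscosity_pa_s * diameter_m))


FLOW_PER_CHANNEL = 1.0e-7   # m^3/s per channel (assumed)
CHANNEL_LENGTH = 0.02       # m (assumed)

for diameter_um in (400, 200, 100):
    d = diameter_um * 1e-6
    dp_kpa = laminar_pressure_drop_pa(FLOW_PER_CHANNEL, d, CHANNEL_LENGTH) / 1e3
    re = reynolds_number(FLOW_PER_CHANNEL, d)
    print(f"{diameter_um} um: {dp_kpa:.1f} kPa, Re = {re:.0f}")
```

At a fixed flow per channel, halving the diameter multiplies the pressure drop by 16 in this model, which is why narrower channels quickly run into pump and sealing limits.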
What this means for GPU architecture.
If NVIDIA succeeds, Rubin Ultra could drive a new generation of GPU package and cold plate co-design. GPUs may need more evenly distributed power to prevent local hot spots, along with pressure-resistant substrates and optimized coolant manifolds. It also raises the bar for data center infrastructure: pumps, leak detection, coolant loops, and overall reliability all have to improve.
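For a rough sense of what this asks of pumps and coolant loops, the energy balance P = m_dot * c_p * dT gives the coolant flow needed to carry away a given heat load. The sketch below assumes water as the coolant, an allowed 10 K coolant temperature rise, and that all of the GPU's power ends up in the liquid; each of these is an illustrative assumption rather than a figure from the rumors.

```python
# Rough energy-balance sketch: coolant flow needed to absorb a heat load.
# Assumes water properties and that all power ends up in the coolant.

WATER_CP_J_PER_KG_K = 4186.0   # approximate specific heat of water
WATER_DENSITY_KG_M3 = 997.0    # approximate density of water near 25 C


def required_mass_flow_kg_s(power_w, coolant_rise_k,
                            cp_j_per_kg_k=WATER_CP_J_PER_KG_K):
    """Mass flow from P = m_dot * c_p * dT."""
    return power_w / (cp_j_per_kg_k * coolant_rise_k)


def required_flow_l_min(power_w, coolant_rise_k,
                        density_kg_m3=WATER_DENSITY_KG_M3):
    """Same flow expressed in litres per minute."""
    mass_flow = required_mass_flow_kg_s(power_w, coolant_rise_k)
    return mass_flow / density_kg_m3 * 1000.0 * 60.0


GPU_POWER_W = 2300.0     # rumored TGP quoted in the article
COOLANT_RISE_K = 10.0    # assumed allowed coolant temperature rise

per_gpu = required_flow_l_min(GPU_POWER_W, COOLANT_RISE_K)
print(f"Per GPU: {per_gpu:.2f} L/min")
```

Under these assumptions a single 2300 W GPU needs a little over 3 L/min of water. Multiplying that across every GPU in a rack gives a feel for why pumps and manifolds become part of the design problem.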
To sum up, cooling a 2300 W GPU is not simply a matter of fitting larger radiators; it requires rethinking how heat, fluid, and compute are designed together. If the rumors are true, Rubin Ultra could accelerate the shift from the familiar chip-plus-heatsink arrangement to complete liquid thermal ecosystems.