Almost every piece of modern electronics generates heat, whether we notice it or not. Without properly managing that heat, our electronic systems would either destroy themselves or force us to severely limit their computing capabilities.
The average TechSpot reader will think, of course, CPU and GPU cooling, but why does RAM not need fans to keep it cool? Why is there such a huge disparity between the performance of a mobile processor and a desktop processor even though the dies are pretty similar in size? Why have recent performance gains from new generations of chips started to slow down?
The answer to all of these has to do with heat and the physics of how digital computers work on the nanoscale. This article will touch on the basic science of heat, how and why it is generated in electronics, and the various methods we have developed to control it.
It's getting hot in here: the basics of heat
If you remember high school physics, heat is just the random motions of the atoms and molecules that make up our world. If one molecule has a higher kinetic energy than another molecule, we say it is hotter. This heat can be transferred from one object to another if they come into contact until the two reach equilibrium. That means the hotter object will transfer some of its heat to the cooler object with the end result being a temperature in between the two.
The time it takes to transfer this heat depends on the thermal conductivity of the two materials. Thermal conductivity is the measure of a material's ability to conduct heat, expressed in watts per meter-kelvin (W/m·K). An insulator like Styrofoam has a relatively low thermal conductivity of about 0.03 W/m·K, while a conductor like copper has a high thermal conductivity of about 400 W/m·K. At the two extremes, a true vacuum has a thermal conductivity of 0 and diamond has the highest known thermal conductivity at over 2000 W/m·K.
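To put numbers on that gap, here is a minimal sketch of steady-state conduction through a flat slab (Fourier's law). It assumes the conductivities quoted above are in W/(m·K); the slab dimensions and temperature difference are made up purely for illustration.

```python
def conduction_watts(k, area_m2, thickness_m, delta_t_kelvin):
    """Steady-state heat flow through a flat slab (Fourier's law)."""
    return k * area_m2 * delta_t_kelvin / thickness_m

# Conductivities quoted in the text, assumed to be in W/(m·K).
K_STYROFOAM = 0.03
K_COPPER = 400.0

# Hypothetical slab: 10 cm x 10 cm, 1 cm thick, one face 50 K hotter.
area, thickness, delta_t = 0.01, 0.01, 50.0

styrofoam_w = conduction_watts(K_STYROFOAM, area, thickness, delta_t)
copper_w = conduction_watts(K_COPPER, area, thickness, delta_t)
print(f"Styrofoam: {styrofoam_w:.2f} W, copper: {copper_w:.0f} W")
```

With identical geometry, the copper slab moves roughly 13,000 times more heat than the Styrofoam one, which is simply the ratio of the two conductivities.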
One thing to remember is that heat always goes to cold, but there's no such thing as "cold." We just view things as "cold" if they have less heat than their surroundings. Another important definition we'll need is thermal mass which represents an object's inertia against temperature fluctuations. With the same sized furnace, it is much easier to heat a single room in a house than it is to heat the entire house. This is because the thermal mass of a room is much less than the thermal mass of an entire house.
We can put all these concepts together in the simple example of boiling water. When you turn on the stove, the hot flame will come into contact with the cooler pot. Since the material making up the pot is a good thermal conductor, heat from the fire will be transferred into the water until it boils.
The time it takes to boil will depend on the method of heating, the pot material, and the amount of water. If you tried to boil a pot of water with a small lighter, it would take forever compared to the large fire from a stove. This is because the stove has a much higher thermal output, measured in Watts, than the small lighter. Next, your water will boil faster if the pot has a higher thermal conductivity because more of the heat will be transferred to the water. If you were rich enough, a diamond pot would be the holy grail. Finally, we all know a small pot of water will boil faster than a much larger pot of water. This is because with the smaller pot, there is less thermal mass to heat up.
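A rough sketch of the arithmetic behind this: the energy needed is the mass of water times its specific heat times the temperature rise, and the time is that energy divided by the heater's power. The wattages for the stove and the lighter are illustrative guesses, and the model ignores losses to the air and the pot itself.

```python
SPECIFIC_HEAT_WATER = 4186.0  # J/(kg·K)

def seconds_to_boil(mass_kg, start_c, power_w):
    """Idealized time to bring water from start_c to 100 °C (no losses)."""
    energy_j = mass_kg * SPECIFIC_HEAT_WATER * (100.0 - start_c)
    return energy_j / power_w

# Assumed heat outputs: 2000 W for the stove, 50 W for the lighter.
stove = seconds_to_boil(1.0, 20.0, 2000.0)
lighter = seconds_to_boil(1.0, 20.0, 50.0)
small_pot = seconds_to_boil(0.25, 20.0, 2000.0)

print(f"1 kg on the stove:      {stove / 60:.1f} min")
print(f"1 kg over a lighter:    {lighter / 60:.1f} min")
print(f"0.25 kg on the stove:   {small_pot / 60:.1f} min")
```

The pot's conductivity does not appear here because the sketch assumes all of the heater's output reaches the water; a poorly conducting pot would act like a lower effective power.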
Once you're done cooking, you can let the water cool down naturally. When this happens, the heat from the water is being dumped into the cooler room. Since the room has a much higher thermal mass than the pot, the temperature won't change by much.
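The same idea can be sketched as a weighted average: two objects settle at a temperature weighted by their thermal masses (heat capacities). The room figures below are assumptions that count only the air in a 50 cubic meter room; walls and furniture would add far more thermal mass and shrink the change further.

```python
def equilibrium_temp(capacity_1, temp_1, capacity_2, temp_2):
    """Final temperature of two bodies exchanging heat only with each other."""
    total = capacity_1 + capacity_2
    return (capacity_1 * temp_1 + capacity_2 * temp_2) / total

# Two equal thermal masses meet exactly in the middle.
midpoint = equilibrium_temp(1.0, 100.0, 1.0, 20.0)

# 1 kg of boiling water versus the air in a room (assumed values).
pot_capacity = 1.0 * 4186.0            # J/K
room_capacity = 50.0 * 1.2 * 1005.0    # m^3 * kg/m^3 * J/(kg·K)
final = equilibrium_temp(pot_capacity, 100.0, room_capacity, 20.0)

print(f"Equal masses settle at {midpoint:.0f} °C")
print(f"Pot and room air settle at {final:.1f} °C")
```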
The three amigos of heat in digital electronics
Now that we know how heat works and moves between objects, let's talk about where it comes from in the first place. All digital electronics are made up of millions and billions of transistors. For a more detailed look at how they work, check out Part 3 of our study on modern CPU design.
Essentially, transistors are electrically controlled switches that turn on and off billions of times a second. We can connect a bunch of them together to form the structures of a computer chip.
As these transistors operate, they dissipate power from three sources known as switching, short-circuit, and leakage. Switching and short-circuit power are both known as dynamic sources of heat since they are affected by the transistors turning on and off. Leakage power is known as static since it is constant and is not affected by the transistor's operation.
Two transistors connected together to form a NOT gate. The nMOS (bottom) allows current to flow when the input is on and the pMOS (top) allows current to flow when the input is off.
We'll start with switching power. To turn a transistor on or off, we have to set its gate to ground (logic 0) or Vdd (logic 1). It's not as simple as just flipping a switch though since this input gate has a very small amount of capacitance. We can think of this as a tiny rechargeable battery. In order to activate the gate, we must charge the battery past a certain threshold level. Once we're ready to turn the gate off again, we need to dump that charge to ground. Although these gates are microscopic, there are billions of them in modern chips and they are switching billions of times a second.
A small bit of heat is generated every time that gate charge is dumped to ground. To find the switching power, we multiply the activity factor (the average proportion of transistors switching at any given cycle), the frequency, the gate capacitance, and the voltage squared together.
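As a quick sketch, that product can be written out directly. The chip parameters below are hypothetical round numbers chosen for illustration, with all the gate capacitance lumped into one figure; some textbooks also include a factor of one half depending on how the activity factor is defined.

```python
def switching_power(activity, frequency_hz, capacitance_f, vdd):
    """Switching power: activity factor * frequency * capacitance * voltage^2."""
    return activity * frequency_hz * capacitance_f * vdd ** 2

# Hypothetical chip: 10% of gates switch per cycle, 3 GHz clock,
# 100 nF of total gate capacitance, 1.0 V supply.
base = switching_power(0.1, 3e9, 100e-9, 1.0)

# Same chip with the supply voltage doubled.
doubled_voltage = switching_power(0.1, 3e9, 100e-9, 2.0)

print(f"Base: {base:.0f} W, at double the voltage: {doubled_voltage:.0f} W")
```

Because voltage is squared, doubling it quadruples the switching power, while doubling the frequency only doubles it.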
Let's look at short-circuit power now. Modern digital electronics use a technique called Complementary Metal-Oxide-Semiconductor (CMOS). Transistors are arranged in such a way that, once the gate has settled, there is never a direct path for current to flow to ground. In the above example of a NOT gate, there are two complementary transistors. Whenever the top one is on, the bottom one is off and vice versa. This ensures that the output is either at a 0 or 1 and is the inverse of the input. As we switch transistors on and off, however, there is a very short amount of time when both transistors are conducting at the same time. When one set is turning off and another is turning on, they will both conduct when they reach the midpoint. This is unavoidable and provides a temporary path for current to flow directly to ground. We can try to limit this by making the transistors transition between the on and off states faster, but we can't fully eliminate it.
As the operating frequency of a chip increases, there are more state changes and more instantaneous short-circuits. This increases the heat output of a chip. To find short-circuit power, we multiply the short-circuit current, the brief time that current flows during each transition, the operating voltage, and the switching frequency together.
Both of these are examples of dynamic power. If we want to reduce it, the easiest way is to just decrease the frequency of the chip. That's often not practical since it would slow down the performance of the chip. Another option is to decrease the chip's operating voltage. Chips used to run at 5 V and above while modern CPUs operate around 1 V. By designing the transistors to operate at a lower voltage, we can reduce the heat lost through dynamic power. Dynamic power is also the reason your CPU and GPU get hotter when you overclock. You are increasing the operating frequency and often the voltage, too. The higher these go, the more heat is generated every second.
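To get a feel for how strongly these two knobs matter, here is a small sketch of how switching power scales relative to a baseline. It covers only the switching term (frequency times voltage squared) and holds capacitance and activity fixed; the 10% overclock is an arbitrary example.

```python
def switching_power_scale(freq_ratio, voltage_ratio):
    """Relative switching power after scaling frequency and voltage."""
    return freq_ratio * voltage_ratio ** 2

# Dropping the supply from 5 V to 1 V at the same clock speed.
voltage_drop = switching_power_scale(1.0, 1.0 / 5.0)

# A hypothetical overclock: 10% more frequency and 10% more voltage.
overclock = switching_power_scale(1.1, 1.1)

print(f"5 V -> 1 V: {voltage_drop:.2f}x the switching power")
print(f"+10% clock and voltage: {overclock:.3f}x the switching power")
```

Going from 5 V to 1 V cuts switching power to one twenty-fifth, while a seemingly modest 10% bump in both clock and voltage adds about a third more heat.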
The last type of heat generated in digital electronics is leakage power. We like to think of transistors as being either completely on or off, but that's not how they work in reality. There will always be a tiny amount of current that flows through even when the transistor is in the non-conducting state. The formula for leakage power is very complicated, and the effect is only getting worse as we continue to shrink the transistors.
When they get smaller, there is less and less material to block the flow of electrons when we want them to be off. This is one of the main factors limiting the performance of new generations of chips as the proportion of leakage power keeps increasing each generation. The laws of physics have put us in a corner and we've used up all of our get-out-of-jail-free cards.
Take a chill pill: how to keep chips cool
So we know where heat comes from in electronics, but what can we do with it? We need to get rid of it because if things get too hot, the transistors can start to break down and become damaged. Thermal throttling is a chip's built-in method of cooling off if we don't provide adequate cooling ourselves. If the internal temperature sensors think it's getting a bit too toasty, the chip can automatically lower its operating frequency to reduce the amount of heat generated. This isn't something you want to happen though and there are many better ways to deal with unwanted heat in a computer system.
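Conceptually, throttling is a simple feedback loop. The sketch below is a toy model, not how any particular chip implements it: the temperature limit, step size, and frequency range are all invented for illustration.

```python
def next_frequency(temp_c, freq_mhz, limit_c=95.0, step_mhz=100,
                   min_mhz=800, max_mhz=4000):
    """Toy throttle: step the clock down when too hot, back up otherwise."""
    if temp_c >= limit_c:
        return max(min_mhz, freq_mhz - step_mhz)
    return min(max_mhz, freq_mhz + step_mhz)

# Hypothetical sensor readings taken one after another, in °C.
readings = [80.0, 96.0, 97.0, 94.0]

freq = 4000
history = []
for temp in readings:
    freq = next_frequency(temp, freq)
    history.append(freq)

print(history)  # [4000, 3900, 3800, 3900]
```

Real hardware adds hysteresis and multiple trip points, but the principle is the same: less frequency means less dynamic power, which means less heat.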
Some chips don't actually need fancy cooling solutions. Take a look around your motherboard and you'll see dozens of small chips without heatsinks. How do they not overheat and destroy themselves? The reason is that they probably don't generate much heat in the first place. Big beefy CPUs and GPUs can dissipate hundreds of Watts of power while a small network or audio chip may only use a fraction of a Watt. If this is the case, the motherboard itself or the chip's outer packaging can be enough of a heatsink to keep the chip cool. Generally though, once you get above 1 Watt, you'll need to think about proper thermal management.
The name of the game here is keeping the thermal resistance between materials as low as possible. We want to build the shortest path for the heat from a chip to get to the ambient air. This is why CPU and GPU dies come with integrated heat spreaders (IHS) on top. The actual chip inside is much smaller than the size of the package, but by spreading the heat out over a larger area, we can more efficiently cool it. It's also important to use a good thermal compound between the chip and the cooler. Without this high thermal conductivity path, the heat would not be able to easily flow from the IHS to the heatsink.
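One way to picture this is as a chain of thermal resistances in series, much like resistors in a circuit: the chip's temperature is the ambient temperature plus power times the total resistance. The resistance values below are hypothetical placeholders, not measurements of any real cooler.

```python
def junction_temp(ambient_c, power_w, resistances_c_per_w):
    """Chip temperature for heat flowing through thermal resistances in series."""
    return ambient_c + power_w * sum(resistances_c_per_w)

# Assumed resistances in °C per watt: die to IHS, thermal compound, heatsink to air.
with_paste = [0.10, 0.05, 0.25]
# Same stack with a poor interface (for example, an air gap) instead of compound.
without_paste = [0.10, 0.50, 0.25]

good = junction_temp(25.0, 100.0, with_paste)
bad = junction_temp(25.0, 100.0, without_paste)

print(f"With compound: {good:.0f} °C, without: {bad:.0f} °C")
```

Every link in the chain adds to the total, so a single bad interface can undo the benefit of an excellent heatsink.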
There are two main forms of cooling: passive and active. Passive cooling is just a simple heatsink attached to the chip that is cooled with ambient airflow. The material will be something with a high thermal conductivity and a high surface area. This allows it to transfer the heat from the chip to the surrounding air.
Voltage regulators and memory chips can typically get away with passive cooling since they don't generate much heat. Mobile phone processors are typically passively cooled since they are designed to be very low power. The higher the performance of a chip, the more heat it will generate and the larger the heatsink it will require. This is why phone processors are less powerful than desktop-class processors. There simply isn't enough cooling to keep up.
Thermal image of a mobile phone CPU with passive cooling plate
Once you get into the tens of Watts, you'll likely start thinking about active cooling. This uses a fan or other method to force air across a heatsink and can handle up to a few hundred Watts. In order to take advantage of this much cooling, we need to ensure the heat is spread from the chip to the entire surface of the cooler. It wouldn't be very useful if we had a huge heatsink but no way to get the heat to it.
This is where liquid cooling and heat pipes come in. They both perform the same task of transferring as much heat as possible from a chip to a heatsink or radiator. In a liquid cooling setup, heat is transferred from the chip to a waterblock through a high thermal conductivity thermal compound. The waterblock is often copper or some other material that conducts heat well. The liquid gets hotter and stores the heat until it reaches the radiator where it can be dissipated. For smaller systems like laptops that can't fit a full liquid cooling setup, heat pipes are very common. Compared to a basic copper tube, a heat pipe setup can be 10-100x more efficient at transferring heat away from a chip.
A heat pipe is very similar to liquid cooling, but it also employs a phase transition to increase the thermal transfer. Inside heat pipes, there is a liquid that turns to vapor when heated. The vapor travels along the heat pipe until it reaches the cold end and condenses back into a liquid. The liquid returns to the hot end through gravity or capillary action. This evaporative cooling is the same reason you feel cold when getting out of the shower or the pool. In all these scenarios, the liquid absorbs heat in the process of turning into a vapor and then releases the heat once it condenses.
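A back-of-the-envelope sketch shows why the phase change matters so much. It assumes water as the working fluid and uses its textbook properties near 100 °C; real heat pipes run at reduced pressure and lower temperatures, so treat the figures as approximate.

```python
SPECIFIC_HEAT_WATER = 4186.0        # J/(kg·K)
LATENT_HEAT_VAPORIZATION = 2.26e6   # J/kg, approximate value near 100 °C

def sensible_heat(mass_kg, delta_t_kelvin):
    """Heat absorbed by liquid water that only warms up."""
    return mass_kg * SPECIFIC_HEAT_WATER * delta_t_kelvin

def latent_heat(mass_kg):
    """Heat absorbed by water that turns to vapor."""
    return mass_kg * LATENT_HEAT_VAPORIZATION

one_gram = 0.001
warmed = sensible_heat(one_gram, 10.0)   # 1 g warmed by 10 K
evaporated = latent_heat(one_gram)       # 1 g evaporated

print(f"Warming 1 g by 10 K: {warmed:.1f} J")
print(f"Evaporating 1 g:     {evaporated:.0f} J")
print(f"Ratio: about {evaporated / warmed:.0f}x")
```

Gram for gram, evaporating the liquid carries away dozens of times more heat than merely warming it by a few degrees.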
Heat pipe demonstration - Zootalures: Wikipedia
Now that we can get the heat out of the chip and into a heat pipe or liquid, how do we dump that heat into the air? That's where fins and radiators come in. A tube of water or a heat pipe will transfer some of its heat into the surrounding air, but not very much. To really cool things down, we need to increase the surface area that is in contact with the cooler air.
Thin fins in a heatsink or radiator spread the heat out over a large surface area which allows a fan to efficiently carry it away. The thinner the fins, the more surface area can fit into a given size. However, if they are too thin, there won't be enough contact made with the heat pipe to get the heat into the fins in the first place. It's a very fine balance which is why in certain scenarios, a larger cooler can perform worse than a smaller, more optimized cooler. Steve over at Gamers Nexus put together a great diagram of how this all works in a typical heatsink.
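The surface-area half of that trade-off is easy to sketch with simple geometry. The heatsink dimensions below are hypothetical, and the model deliberately ignores the other half of the balance: how well heat actually gets from the heat pipe into each fin.

```python
def fin_stack(width_um, fin_um, gap_um, height_m, depth_m):
    """Number of fins that fit across a width, and their total two-sided area."""
    count = width_um // (fin_um + gap_um)     # whole fins only
    area_m2 = count * 2 * height_m * depth_m  # both faces of every fin
    return count, area_m2

# Hypothetical 100 mm wide heatsink with 40 mm x 100 mm fins and 1 mm air gaps.
thick_count, thick_area = fin_stack(100_000, 1_000, 1_000, 0.04, 0.10)
thin_count, thin_area = fin_stack(100_000, 500, 1_000, 0.04, 0.10)

print(f"1.0 mm fins: {thick_count} fins, {thick_area:.3f} m^2")
print(f"0.5 mm fins: {thin_count} fins, {thin_area:.3f} m^2")
```

Halving the fin thickness fits about a third more fins into the same width in this example, but each thinner fin also has less metal to carry heat along its length.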
Heatsink operation - Gamers Nexus
But I want to get colder: going sub-ambient!
All of the cooling methods we have talked about work by the simple transfer of heat from a hot chip to the surrounding air. This means the chip can never get colder than the ambient temperature of the room it's in. If we want to cool to sub-ambient temperatures or have something huge like an entire data center to cool, we need to add some more science. This is where chillers and thermoelectric coolers come in.
Thermoelectric coolers, also known as Peltier devices, are currently not very popular, but have the potential to be very useful. These devices transfer heat from one side of a cooling plate to the other with the consumption of electricity. They use a special thermoelectric material which can create a temperature difference via an electric potential. When a DC current flows through the device, heat is carried from one side to the other. This allows the "cool" side to go below ambient temperature. Currently these devices are very niche since they require a lot of energy to achieve any substantial cooling. However, researchers are working to create more efficient versions for larger markets.
Just like phase transitions transfer heat, changing the pressure of a fluid can also be used to transfer heat. This is how refrigerators, air conditioners, and most other cooling systems work.
A special refrigerant flows through a closed loop in which it starts as a vapor, is compressed, condensed into a liquid, expanded, and evaporated back into a vapor. This cycle repeats and transfers heat in the process. The compressor does require energy, but a system like this can cool to sub-ambient temperatures. That's how datacenters and buildings can stay cool even on the hottest day of the summer.
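Two pieces of bookkeeping help frame what such a system can do. The sketch below uses the ideal (Carnot) limit on cooling efficiency and a simple energy balance; the temperatures and wattages are arbitrary examples, and real systems fall well short of the ideal figure.

```python
def carnot_cop_cooling(cold_c, hot_c):
    """Upper bound on heat moved per unit of work for an ideal refrigerator."""
    cold_k = cold_c + 273.15
    hot_k = hot_c + 273.15
    return cold_k / (hot_k - cold_k)

def heat_rejected(heat_removed_w, compressor_w):
    """Everything removed from the cold side plus the compressor's work ends up outside."""
    return heat_removed_w + compressor_w

# Example: keep a room at 20 °C while it is 40 °C outside.
ideal_cop = carnot_cop_cooling(20.0, 40.0)

# Example: remove 1000 W of heat using an assumed 300 W of compressor power.
dumped_outside = heat_rejected(1000.0, 300.0)

print(f"Ideal COP: {ideal_cop:.1f}")
print(f"Heat dumped outside: {dumped_outside:.0f} W")
```

The energy balance is a reminder that the compressor's own power becomes heat too, so the outdoor side always has to shed more than the indoor side gave up.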
Standard refrigeration cycle - Keenan Pepper: Wikipedia
Systems like these are typically a second stage when it comes to electronics. You'll first dump the heat from the chip into the room and then dump the heat from the room to outside via a vapor compression system. However, extreme overclockers and performance enthusiasts may connect dedicated chillers to their CPUs if they need extra cooling performance. Temporary methods of extreme cooling are also possible via consumables like liquid nitrogen or dry ice.
I'm cold: let's wrap up
Cooling is something all electronics require, but can take many forms. The aim of the game is to move the heat from the hot chip or system to the cooler surroundings. There's no way to actually get rid of heat, so all we can do is move it somewhere that it won't be an issue.
All digital electronics generate heat due to the nature of how their internal transistors operate. If we don't get rid of that heat, the semiconductor material starts to break down and the chip can become damaged. Heat is the enemy of all electronics designers and is one of the key limiting factors of performance growth. We can't make CPUs and GPUs much bigger because there is no good way to cool something that powerful. You just can't get the heat out.
Hopefully you'll now have a greater appreciation for all the science that happens to keep your electronics cool.