Memory power management
One technology which is finding its way into some systems is called "partial array self refresh" or PASR. On a PASR-enabled system, memory is divided into banks, each of which can be powered down independently. If (say) half of memory is not needed, that memory (and its self-refresh mechanism) can be turned off; the result is a reduction in power use, but also the loss of any data stored in the affected banks. The amount of power actually saved is a bit unclear; estimates seem to run in the range of 5-15% of the total power used by the memory subsystem.
The key to powering down a bank of memory, naturally, is to be sure that there is no important data stored therein first. That means that the system must either evacuate a bank to be powered down, or it must take care not to allocate memory there in the first place. So the memory management subsystem will have to become aware of the power topology of main memory and take that information into account when satisfying allocation requests. It will also have to understand the desired power management policy and make decisions to power banks up or down depending on the current level of memory pressure. This is going to be fun: memory management is already a complicated set of heuristics which attempt to provide reasonable results for any workload; adding power management into the mix can only complicate things further.
A recent patch set from Ankita Garg does not attempt to solve the whole problem; instead, it creates an initial infrastructure which can be used for future power management decisions. Before looking at that patch, though, a bit of background will be helpful.
The memory management subsystem already splits available memory at two different levels. On non-uniform memory access (NUMA) systems, memory which is local to a specific processor will be faster to access than memory on a different processor. The kernel's memory management code takes NUMA nodes into account to implement specific allocation policies. In many cases, the system will try to keep a process and all of its memory on the same NUMA node in the hope of maximizing the number of local accesses; other times, it is better to spread allocations evenly across the system. The point is that the NUMA node must be taken into account for all allocation and reclaim decisions.
The other important concept is that of a "zone"; zones are present on all systems. The primary use of zones is to categorize memory by accessibility; 32-bit systems, for example, will have "low memory" and "high memory" zones to contain memory which can and cannot (respectively) be directly accessed by the kernel. Systems may have a zone for memory accessible with a 32-bit address; many devices can only perform DMA to such addresses. Zones are also used to separate memory which can readily be relocated (user-space pages accessed through page tables, for example) from memory which is hard to move (kernel memory for which there may be an arbitrary number of pointers). Every NUMA node has a full set of zones.
PASR has been on the horizon for a little while, so a few people have been thinking about how to support it; one of the early works would appear to be this paper by Henrik Kjellberg, though that work didn't result in code submitted upstream. Henrik pointed out that the kernel already has a couple of mechanisms which could be used to support PASR. One of those is memory hotplug, wherein memory can be physically removed from the system. Turning off a bank of memory can be thought of as being something close to removing that memory, so it makes sense to consider hotplug. Hotplug is a heavyweight operation, though; it is not well suited to power management, where decisions to power banks of memory up or down may be made fairly often.
Another approach would be to use zones; the system could set up a separate zone for each memory bank which could be powered down independently. Powering down a bank would then be a matter of moving needed data out of the associated zone and marking that zone so that no further allocations would be made from it. The problem with this approach is a number of important memory management operations happen at the zone level; in particular, each zone has a set of limits on how many free pages must exist. Adding more zones would increase memory management overhead and create balancing problems which don't need to exist.
That is the approach that Ankita has taken, though; the patch adds another level of description called "regions" interposed between nodes and zones, essentially creating not just one new zone for each bank of memory, but a complete set of zones for each. The page allocator will always try to obtain pages from the lowest-numbered region it can in the hope that the higher regions will remain vacant. Over time, of course, this simple approach will not work and it will become necessary to migrate pages out of regions before they can be powered down. The initial patch does not address that issue, though - or any of the associated policy issues that come up.
Your editor is not a memory management hacker, but ignorance has never kept him from having an opinion on things. To a naive point of view, it would almost seem like this design has been done backward - that regions should really be contained within zones. That would avoid multiplying the number of zones in the system and the associated balancing costs. Also, importantly, it would allow regions to be controlled by the policy of a single enclosing zone. In particular, regions inside a zone used for movable allocations would be vacated with relative ease, allowing them to be powered down when memory pressure is light. Placing multiple zones within each region, instead, would make clearing a region harder.
The patch set has not gotten a lot of review attention; the people who know
what they are talking about in this area have mostly kept silent.
There are numerous
memory management patches circulating at the moment, so time for review is
probably scarce. Andrew Morton did ask
about the overhead of this work on machines which lack the PASR capability
and about how much power might actually be saved; answers to those
questions don't seem to be available at the moment. So one might conclude
that this patch set, while demonstrating an approach to memory power
management, will not be ready for mainline inclusion in the near future.
But, then, adding power management to such a tricky subsystem was never
going to be done in a hurry.
| Index entries for this article | |
|---|---|
| Kernel | Memory management/Power management |
| Kernel | Partial array self refresh (PASR) |
| Kernel | Power management |
