Memory power management [LWN.net]

By Jonathan Corbet
June 7, 2011

Efforts to reduce power consumption on Linux systems have often focused on the CPU; that emphasis does make sense, since the CPU draws a significant portion of the power used on most systems. After years of effort to improve the kernel's power behavior, add instrumentation to track wakeups, and fix misbehaving applications, Linux does quite well when it comes to CPU power. So now attention is moving to other parts of the system in search of power savings to be had; one of those places is main memory. Contemporary DRAM requires power for its self-refresh cycles even if it is not being used; might there be a way to reduce its consumption?

One technology which is finding its way into some systems is called "partial array self refresh" or PASR. On a PASR-enabled system, memory is divided into banks, each of which can be powered down independently. If (say) half of memory is not needed, that memory (and its self-refresh mechanism) can be turned off; the result is a reduction in power use, but also the loss of any data stored in the affected banks. The amount of power actually saved is a bit unclear; estimates seem to run in the range of 5-15% of the total power used by the memory subsystem.

The key to powering down a bank of memory, naturally, is to be sure that there is no important data stored therein first. That means that the system must either evacuate a bank to be powered down, or it must take care not to allocate memory there in the first place. So the memory management subsystem will have to become aware of the power topology of main memory and take that information into account when satisfying allocation requests. It will also have to understand the desired power management policy and make decisions to power banks up or down depending on the current level of memory pressure. This is going to be fun: memory management is already a complicated set of heuristics which attempt to provide reasonable results for any workload; adding power management into the mix can only complicate things further.

A recent patch set from Ankita Garg does not attempt to solve the whole problem; instead, it creates an initial infrastructure which can be used for future power management decisions. Before looking at that patch, though, a bit of background will be helpful.

The memory management subsystem already splits available memory at two different levels. On non-uniform memory access (NUMA) systems, memory which is local to a specific processor will be faster to access than memory on a different processor. The kernel's memory management code takes NUMA nodes into account to implement specific allocation policies. In many cases, the system will try to keep a process and all of its memory on the same NUMA node in the hope of maximizing the number of local accesses; other times, it is better to spread allocations evenly across the system. The point is that the NUMA node must be taken into account for all allocation and reclaim decisions.
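
The two allocation policies described above can be illustrated with a minimal sketch. It is not the kernel's implementation: the function names, the two-node machine, and the distance table are all assumptions made up for illustration.

```python
def pick_node_local(cpu_node, free_pages, distance):
    """Local-first policy: try the caller's node, then the nearest others."""
    for node in sorted(free_pages, key=lambda n: distance[cpu_node][n]):
        if free_pages[node] > 0:
            return node
    raise MemoryError("no free pages on any node")


def pick_node_interleave(alloc_index, nodes):
    """Interleave policy: spread successive allocations round-robin."""
    return nodes[alloc_index % len(nodes)]


# Hypothetical two-node machine; the distances are invented for this example.
distance = {0: {0: 10, 1: 20}, 1: {0: 20, 1: 10}}
free_pages = {0: 0, 1: 5}  # node 0 has no free pages left

print(pick_node_local(0, free_pages, distance))             # falls back to node 1
print([pick_node_interleave(i, [0, 1]) for i in range(4)])  # [0, 1, 0, 1]
```

Either way, the node is an input to every allocation decision, which is the point the paragraph above makes.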

The other important concept is that of a "zone"; zones are present on all systems. The primary use of zones is to categorize memory by accessibility; 32-bit systems, for example, will have "low memory" and "high memory" zones to contain memory which can and cannot (respectively) be directly accessed by the kernel. Systems may have a zone for memory accessible with a 32-bit address; many devices can only perform DMA to such addresses. Zones are also used to separate memory which can readily be relocated (user-space pages accessed through page tables, for example) from memory which is hard to move (kernel memory for which there may be an arbitrary number of pointers). Every NUMA node has a full set of zones.
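
As a rough illustration of zones as an accessibility classification, the sketch below maps a physical address to a zone name. The boundaries (16MiB, 4GiB, and 896MiB) are the commonly cited x86 defaults and are an assumption here; the article does not give them, and other architectures and configurations differ.

```python
MiB = 1 << 20
GiB = 1 << 30


def zone_for_x86_64(phys_addr):
    """Classify a physical address on a 64-bit x86 system (assumed defaults)."""
    if phys_addr < 16 * MiB:
        return "ZONE_DMA"      # reachable by legacy devices with 24-bit DMA
    if phys_addr < 4 * GiB:
        return "ZONE_DMA32"    # reachable with a 32-bit address
    return "ZONE_NORMAL"


def zone_for_x86_32(phys_addr):
    """Classify a physical address on a 32-bit x86 system (assumed defaults)."""
    if phys_addr < 16 * MiB:
        return "ZONE_DMA"
    if phys_addr < 896 * MiB:
        return "ZONE_NORMAL"   # "low memory", directly mapped by the kernel
    return "ZONE_HIGHMEM"      # "high memory", not directly mapped


print(zone_for_x86_64(2 * GiB))   # ZONE_DMA32
print(zone_for_x86_32(2 * GiB))   # ZONE_HIGHMEM
```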

PASR has been on the horizon for a little while, so a few people have been thinking about how to support it; one of the early works would appear to be this paper by Henrik Kjellberg, though that work didn't result in code submitted upstream. Henrik pointed out that the kernel already has a couple of mechanisms which could be used to support PASR. One of those is memory hotplug, wherein memory can be physically removed from the system. Turning off a bank of memory can be thought of as being something close to removing that memory, so it makes sense to consider hotplug. Hotplug is a heavyweight operation, though; it is not well suited to power management, where decisions to power banks of memory up or down may be made fairly often.

Another approach would be to use zones; the system could set up a separate zone for each memory bank which could be powered down independently. Powering down a bank would then be a matter of moving needed data out of the associated zone and marking that zone so that no further allocations would be made from it. The problem with this approach is that a number of important memory management operations happen at the zone level; in particular, each zone has a set of limits on how many free pages must exist. Adding more zones would increase memory management overhead and create balancing problems which don't need to exist.
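
The balancing concern can be seen in a toy model. In the sketch below, each zone has a floor on its free pages; the `Zone` fields and all of the page counts are hypothetical, chosen only to show how splitting one zone into per-bank zones can leave one of them under pressure while plenty of memory is free overall.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str
    free_pages: int
    min_free_pages: int  # hypothetical per-zone floor on free pages


def zones_needing_reclaim(zones):
    """Return the names of zones that have fallen below their own floor."""
    return [z.name for z in zones if z.free_pages < z.min_free_pages]


# One large zone: 2000 pages free against a floor of 400.
single = [Zone("normal", free_pages=2000, min_free_pages=400)]

# The same memory as four per-bank zones, filled lowest bank first.
# Total free pages are still 2000, but bank0 is below its floor.
per_bank = [
    Zone("bank0", free_pages=50, min_free_pages=100),
    Zone("bank1", free_pages=450, min_free_pages=100),
    Zone("bank2", free_pages=500, min_free_pages=100),
    Zone("bank3", free_pages=1000, min_free_pages=100),
]

print(zones_needing_reclaim(single))    # []
print(zones_needing_reclaim(per_bank))  # ['bank0']
```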

That is the approach that Ankita has taken, though; the patch adds another level of description called "regions" interposed between nodes and zones, essentially creating not just one new zone for each bank of memory, but a complete set of zones for each. The page allocator will always try to obtain pages from the lowest-numbered region it can in the hope that the higher regions will remain vacant. Over time, of course, this simple approach will not work and it will become necessary to migrate pages out of regions before they can be powered down. The initial patch does not address that issue, though - or any of the associated policy issues that come up.
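
A minimal sketch of that hierarchy, with regions holding a full set of zones and the allocator preferring the lowest-numbered region, might look like the following. The class and function names are assumptions for illustration and are not taken from the patch set.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str
    total_pages: int
    free_pages: int


@dataclass
class Region:
    index: int
    zones: list  # every region carries its own complete set of zones


def make_node(num_regions, zone_names, pages_per_zone):
    """Build a node as a list of regions, each with one zone per name."""
    return [
        Region(i, [Zone(n, pages_per_zone, pages_per_zone) for n in zone_names])
        for i in range(num_regions)
    ]


def alloc_page(regions, zone_name):
    """Take one page from the lowest-numbered region that still has one."""
    for region in sorted(regions, key=lambda r: r.index):
        for zone in region.zones:
            if zone.name == zone_name and zone.free_pages > 0:
                zone.free_pages -= 1
                return region.index
    return None  # nothing free in that zone type anywhere on the node


def vacant_regions(regions):
    """Regions with no pages in use are candidates for powering down."""
    return [
        r.index for r in regions
        if all(z.free_pages == z.total_pages for z in r.zones)
    ]


demo = make_node(2, ["ZONE_DMA", "ZONE_NORMAL"], pages_per_zone=2)
print([alloc_page(demo, "ZONE_NORMAL") for _ in range(3)])  # [0, 0, 1]
print(vacant_regions(demo))                                 # []
```

The sketch only covers allocation; as noted above, pages that end up in higher regions would eventually have to be migrated out, which neither this example nor the initial patch attempts.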

Your editor is not a memory management hacker, but ignorance has never kept him from having an opinion on things. To a naive point of view, it would almost seem like this design has been done backward - that regions should really be contained within zones. That would avoid multiplying the number of zones in the system and the associated balancing costs. Also, importantly, it would allow regions to be controlled by the policy of a single enclosing zone. In particular, regions inside a zone used for movable allocations would be vacated with relative ease, allowing them to be powered down when memory pressure is light. Placing multiple zones within each region, instead, would make clearing a region harder.

The patch set has not gotten a lot of review attention; the people who know what they are talking about in this area have mostly kept silent. There are numerous memory management patches circulating at the moment, so time for review is probably scarce. Andrew Morton did ask about the overhead of this work on machines which lack the PASR capability and about how much power might actually be saved; answers to those questions don't seem to be available at the moment. So one might conclude that this patch set, while demonstrating an approach to memory power management, will not be ready for mainline inclusion in the near future. But, then, adding power management to such a tricky subsystem was never going to be done in a hurry.



Memory power management

Posted Jun 9, 2011 13:33 UTC (Thu) by ejr (subscriber, #51652) [Link] (4 responses)

Another question: How does this feed back into page cache decisions? Seems the cache that may be dropped (clean cache? forgotten the name) is a candidate for being turned off, but I'm not sure if anyone can make sensible rules for when to turn off associated memory other than immediately.

Memory power management

Posted Jun 10, 2011 10:19 UTC (Fri) by Ankita (guest, #39147) [Link] (3 responses)

Handling caches is a challenge. When the system is not under memory pressure, depending upon the memory v cache footprint of the application, heuristics would need to be designed to decide on the right time to power off memory. The overhead would be to write out dirty pages from the cache before actually turning off the memory. The regions framework would enable doing this targeted reclaim of areas of memory that need to be evacuated.
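
One way to picture the targeted reclaim described here is as a per-page plan for emptying a region. The sketch below is purely hypothetical: the page kinds and action names are invented for illustration and do not come from the regions patches.

```python
from collections import Counter

# Hypothetical mapping from page kind to the work needed to vacate it.
ACTIONS = {
    "clean_cache": "drop",                 # contents can be reread from disk
    "dirty_cache": "writeback_then_drop",  # must be written out first
    "movable": "migrate",                  # copy to a region that stays on
}


def plan_evacuation(pages):
    """Count the actions needed to empty a region, or None if it cannot be."""
    plan = Counter()
    for kind in pages:
        action = ACTIONS.get(kind)
        if action is None:  # e.g. an unmovable kernel page pins the region
            return None
        plan[action] += 1
    return dict(plan)


print(plan_evacuation(["clean_cache", "dirty_cache", "dirty_cache", "movable"]))
print(plan_evacuation(["clean_cache", "unmovable"]))  # None
```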

Memory power management

Posted Jun 10, 2011 13:40 UTC (Fri) by ejr (subscriber, #51652) [Link] (2 responses)

Thank you.

Unfortunately, there is always memory pressure in my areas (massive graph analysis, numerics, etc.). I'm working on a memory concurrency v. performance model for some graph tasks with a thought towards powering down unneeded cores (think SCC) when they cannot contribute. Now I'm wondering if there's some way to consider the memory side, again assuming everything fits (which it doesn't).

There also is much, *much* work going into dropping NAND flash into DRAM slots (phase change, plus DRAM cache). That will change the power usage characteristics drastically. If you can turn off the DRAM cache without having to flush out all the data...

Memory power management

Posted Jun 11, 2011 0:29 UTC (Sat) by giraffedata (guest, #1954) [Link]

I think it's usually the case that essentially all the memory is "used," so I don't get how the proposed policy can be effective. Memory is "used" to varying degrees, i.e. how important the contents are. Some is actually indispensable because there's no practical way to recreate its contents. But other memory is a cache of file contents, cache of dentries, memory that could be moved to swap space, and the like.

There's also internal fragmentation in memory allocation pools -- memory that's used just to anticipate future allocations and save time.

So I don't see any policy that just says power down totally free memory as being terribly useful. We need a policy that weighs for each page the value of having that data in memory vs the cost of powering that page.

Memory power management

Posted Jun 11, 2011 2:21 UTC (Sat) by Ankita (guest, #39147) [Link]

Ah yes, the proposal aims to conserve memory power when the system is not under too much memory pressure. We are targeting scenarios where minimal consolidation of references/allocations can result in significant power savings.

Memory power management

Posted Jun 9, 2011 22:07 UTC (Thu) by djm1021 (guest, #31130) [Link]

One of the best ways to minimize power consumed for RAM is to not have so %$*(%&(& much of it to begin with! Every machine is overprovisioned with RAM because the cost of swapping is so horrible. Is it possible to solve that problem instead? Can half of RAM be not just unpowered but actually removed from systems permanently (or never purchased to begin with)? That's the goal of RAMster (shameless plug, sorry)... which is built on top of cleancache and frontswap and zcache. Basically, it load-balances memory utilization across systems to achieve a maximum memory usage which is "max of sum" instead of "sum of max". Clearly that only works if you have a number of systems that statistically vary in working set size (i.e. doesn't help Amazon during the holiday shopping season), but that's still a big step forward.
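
The "max of sum" versus "sum of max" distinction can be worked through with a small example. The usage figures below are invented for illustration; each inner list is one machine's memory use (in GiB) at three points in time.

```python
def sum_of_max(usage):
    """RAM needed if every machine is provisioned for its own peak."""
    return sum(max(series) for series in usage)


def max_of_sum(usage):
    """RAM needed if the machines share one pool sized for the joint peak."""
    return max(sum(step) for step in zip(*usage))


# Hypothetical working-set sizes for three machines at three moments.
usage = [
    [2, 8, 3],
    [7, 2, 4],
    [3, 3, 9],
]

print(sum_of_max(usage))  # 24
print(max_of_sum(usage))  # 16
```

The gap between the two numbers exists only because the peaks fall at different times; if all machines peak together, the two figures are equal.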

Memory power management

Posted Jun 10, 2011 8:13 UTC (Fri) by Ankita (guest, #39147) [Link] (8 responses)

Hi John,

You are right that creating regions inside zones results in a bloat in the number of zones. But the difficulty we faced is that zones already encapsulate some boundary information that might be at a level lower than regions. A region could span multiple zones, in which case we would need another mechanism to group these sub-regions into one region that maps to an independently managed power unit. For instance, a single NUMA node with 8GB RAM will come up with two zones: ZONE_DMA and ZONE_NORMAL. But if this NUMA node has support for power management at a different level, say 2GB, then we would create 4 regions, thus spanning the two zones. This would make targeted allocation and reclaim now depend on another piece of information to unite regions that form one single unit. Further, zone policies, like movable allocations, could still be leveraged when zones are under regions. However, as Dave pointed out, it is important to understand the performance impact of this change.
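
The 8GB example can be checked with a short calculation. The sketch below splits a node into 2GB regions and lists which zones each region overlaps; the 16MiB boundary between ZONE_DMA and ZONE_NORMAL is an assumption, since the comment does not say where the boundary lies.

```python
MiB = 1 << 20
GiB = 1 << 30


def split_regions(node_bytes, region_bytes):
    """Divide a node's address range into fixed-size power regions."""
    return [
        (start, min(start + region_bytes, node_bytes))
        for start in range(0, node_bytes, region_bytes)
    ]


def zones_in_region(region, zones):
    """Names of the zones whose address range overlaps the given region."""
    start, end = region
    return [name for name, (zs, ze) in zones.items() if zs < end and start < ze]


# Assumed zone layout for a single 8GB node.
zones = {
    "ZONE_DMA": (0, 16 * MiB),
    "ZONE_NORMAL": (16 * MiB, 8 * GiB),
}

regions = split_regions(8 * GiB, 2 * GiB)
for index, region in enumerate(regions):
    print(index, zones_in_region(region, zones))
```

Under these assumptions region 0 overlaps both zones while ZONE_NORMAL is spread across all four regions, which is the grouping problem described above.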

Also, besides PASR, there are other mechanisms by which memory power could be conserved. The Samsung Exynos 4210, for instance, has support for automatic power down of memory, i.e., if there are no references to certain areas of memory for a certain threshold of time, the hardware would automatically put that area of memory into a lower power state, without losing the content. A basic infrastructure to make the VM aware of the hardware topology would aid the hardware in placing memory into lower power states.

-Ankita

Memory power management power savings test

Posted Jun 10, 2011 14:24 UTC (Fri) by ccurtis (guest, #49713) [Link] (7 responses)

Is it possible to measure memory power savings by comparing power usage with the full amount of RAM against usage with a restricted amount of RAM set via a kernel boot parameter? The unused RAM would probably still need to be powered off manually but this seems like a simpler patch to start with. (For testing this I'm assuming an idle base load system that fits completely within the available RAM with all swap disabled, of course.)

Memory power management power savings test

Posted Jun 10, 2011 14:30 UTC (Fri) by mjg59 (subscriber, #23239) [Link] (4 responses)

We don't have any way to power down memory on existing x86 hardware, as far as I know.

Memory power management power savings test

Posted Jun 10, 2011 15:05 UTC (Fri) by ccurtis (guest, #49713) [Link] (3 responses)

Perhaps I was a bit too terse.

>> The amount of power actually saved is a bit unclear; estimates seem to run in the range of 5-15% of the total power used by the memory subsystem. [...] A recent patch set from Ankita Garg does not attempt to solve the whole problem; instead, it creates an initial infrastructure which can be used for future power management decisions.

Before creating this extensive infrastructure, perhaps it would be better to get an idea of what kind of power savings this actually provides. Code would still need to be written to power down the memory bank, and code would also likely need to be written to isolate the excluded RAM from the boot parameter, but this seems like a relatively easy way to answer the question before embarking on the endeavor.

Of course, this may be a done deal and it's just a matter of time before the code gets written, but it would still be interesting to see how much power is actually going to be saved. A patch like this would allow individuals to measure the power savings of their own systems as well, in case they wanted to weigh those savings against any overhead the new memory management changes might impose.

Memory power management power savings test

Posted Jun 10, 2011 16:20 UTC (Fri) by etienne (guest, #25256) [Link] (2 responses)

Maybe just reboot your PC with a different number of memory modules and measure the power difference?

Memory power management power savings test

Posted Jun 11, 2011 3:19 UTC (Sat) by willy (subscriber, #9762) [Link] (1 responses)

The problem is that pages are interleaved across DIMMs. If you have 3 channels with a single DIMM each, cacheline 0 is on DIMM 0, cacheline 1 on DIMM 1, cacheline 2 on DIMM 2 and cacheline 3 on DIMM 0. Removing a DIMM causes the interleaving to change, which will also cause performance to change, and your measurements are now invalid.
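
The interleaving described here can be written down as a simple round-robin mapping. This is a sketch of the scheme as described in the comment, not of any particular memory controller, and the function names are made up for illustration.

```python
def dimm_for_cacheline(cacheline, num_dimms):
    """Round-robin interleaving: consecutive cachelines go to consecutive DIMMs."""
    return cacheline % num_dimms


def layout(num_cachelines, num_dimms):
    """DIMM index for each of the first num_cachelines cachelines."""
    return [dimm_for_cacheline(c, num_dimms) for c in range(num_cachelines)]


three = layout(6, 3)
two = layout(6, 2)
moved = [c for c in range(6) if three[c] != two[c]]

print(three)  # [0, 1, 2, 0, 1, 2]
print(two)    # [0, 1, 0, 1, 0, 1]
print(moved)  # [2, 3, 4, 5]
```

Going from three DIMMs to two changes where most cachelines live, so a power measurement taken after pulling a DIMM also reflects a different access pattern.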

As I understand PASR, one would not power down an entire DIMM, but rather sections of each DIMM, thus preserving the performance benefits of interleaving.

Memory power management power savings test

Posted Jun 16, 2011 4:54 UTC (Thu) by Ankita (guest, #39147) [Link]

Yes, PASR support seems to be at bank level. From a few documents I have read, I find that interleaving can be configured to control the number of banks that have to be kept open for every memory access, typically the minimum being 2 banks. Thus, if only two banks are interleaved, the other banks can potentially be turned off or not refreshed. Power v performance benchmarking will be needed to decide on the best interleaving scheme though.

Memory power management power savings test

Posted Jun 13, 2011 14:11 UTC (Mon) by nye (guest, #51576) [Link] (1 responses)

So, back in 2003 I built my first system with 1GB of RAM. Initially I thought I would save money by re-using a number of existing components, including the power supply.

Sadly, the existing power supply proved inadequate and the system would crash shortly after booting, if it even booted at all. The memory was a single DIMM so there wasn't anything I could remove to reduce power consumption; however, it turned out that telling the kernel to use only about 800M allowed the system to remain completely stable for the few days it took to get a new power supply.

Thus I conclude that the most likely explanation is that RAM which the OS doesn't believe even exists actually does use less power than RAM which is simply not in use at the time.

Memory power management power savings test

Posted Jun 16, 2011 18:26 UTC (Thu) by Pc5Y9sbv (guest, #41328) [Link]

You may have just been lucky that the most voltage-sensitive circuits were active in the "upper" region you excluded. It's not likely that the memory was statically powered down because it wasn't in use, but there are many dynamic power loads in digital circuits as they change states.

It could even have to do with certain combinations of address and data bits that required more power to configure the addressing logic and route the data signals, and it didn't stabilize within the configured access timings.

The end result is that certain memory addresses in a given module will tend to show corruption before others as the power supply sags or the timings get too tight. This is why people advocate long runs with a dedicated memory test program to try to validate parts in situ. Just running an OS may not exercise combinations of address and data bits with sufficient testing coverage, at least not for many hours (or weeks!) of operation.

misbehaving applications

Posted Jun 12, 2011 17:45 UTC (Sun) by pipipen (guest, #56099) [Link]

I'm interested in what has been done to fix the misbehaving applications. Do we have an article introducing it?

>> After years of effort to improve the kernel's power behavior, add instrumentation to track wakeups, and fix misbehaving applications,

Memory power management

Posted Jul 7, 2011 6:56 UTC (Thu) by henrik.kjellberg (guest, #61640) [Link]

I happened to see that you only got the summary of my paper.
The whole thesis can be read here: http://sam.cs.lth.se/ExjobGetFile?id=239 and contains more background information about the system structure. It can be good to read for those that are curious about the technology but lack the knowledge of the memory structure.

Best regards
Henrik Kjellberg