
The folio pull-request pushback

By Jonathan Corbet
September 10, 2021
When we last caught up with the page folio patch set, it appeared to be on track to be pulled into the mainline during the 5.15 merge window. Matthew Wilcox duly sent a pull request in August to make that happen. While it is possible that folios could still end up in 5.15, that has not happened as of this writing and appears increasingly unlikely. What we got instead was a lengthy discussion on the merits of the folio approach.

The kernel's memory-management subsystem deals with memory in pages, a fundamental unit that is generally set by the hardware (and is usually 4KB in size). These "base" pages are small, though, so the kernel often needs to deal with memory in larger units. To do so, it groups sets of physically contiguous pages together into compound pages, which are made up of a head page (the first base page of many in the compound page) and some number of tail pages. That leads to a situation where kernel code that is passed a struct page pointer often does not know if it is dealing with a head or a tail page without explicitly checking.

It turns out that the "make sure this is a head page" checks add up to a certain amount of expense in a running kernel. The use of struct page everywhere also makes kernel APIs unclear — it can be difficult to know if a given function can cope with tail pages or not. To address this problem, Wilcox created the concept of a "folio", which is like a struct page but which is known not to be a tail page. By changing internal functions to use folios, Wilcox is able to speed up the kernel and clean up the API at the same time. The patch set is huge and intrusive, but it appeared to have overcome most resistance and be ready to head into the mainline kernel.

Objections

Memory-management developer Johannes Weiner quickly responded to the request to express his "strong reservations" about the folio concept. Over the course of the ensuing discussion he described his objections in a number of ways, but it seems to come down to a core point: a folio is just a collection of physically contiguous pages, and that is going to make it hard to deal with a number of challenges facing memory management.

To start with, the folio design leaks too much information about memory-management internals to other users of folios, filesystems in particular. The current page-oriented APIs have the same problem, of course, but a massive API change should, he said, be the time to address that issue. So Weiner has asked, more than once, for the creation of a more abstract representation of memory that would be used in the higher levels of the kernel. This abstraction would hide a lot of details; it would also eliminate the assumption that a folio is a physically contiguous, power-of-two-sized chunk of memory.

The assumption of physical contiguity, he continued, is a serious problem because the memory-management subsystem has never been good at allocating larger, contiguous chunks of memory. At some point fragmentation takes hold and those larger chunks simply aren't there. Techniques like page compaction can help to an extent, but that comes at the cost of excessive allocation latency. "We've effectively declared bankruptcy on this already", he said. There is no point in adopting folios to represent larger chunks of memory without thinking about how those chunks will be allocated.

The other problem, he said, is that the folio concept could make it much harder to change the memory-management subsystem to use a larger base-page size. That change would make the system more efficient in many ways, including less memory wasted for page structures and less CPU time dedicated to dealing with them. There is one problem that has kept the kernel from increasing the base-page size for many years, though: internal fragmentation.

When a file's contents (or a portion thereof) are stored in the page cache, they require a certain number of full pages. Unix-like systems have a lot of small files, but even a one-line file will occupy a full page in the page cache; all of the memory in that page beyond the end of the file is simply wasted. Increasing the size of a base page will necessarily increase the amount of memory lost to this internal fragmentation as well. In a previous folio discussion, Al Viro did a quick calculation showing just how much more memory it would take to keep the kernel source in memory with a larger page size. A 64KB size would quadruple the memory used, for example; it is not a small cost.

For this reason, Weiner argued that the kernel will need to be able to manage file caching in small units (the existing 4KB size, for example) even when the memory-management subsystem moves to a larger base-page size. In other words, the page cache will need to be able to work with sub-page units of memory. A new abstraction for memory might facilitate that; the current folio concept, being firmly tied to underlying pages, cannot.

Wilcox's answer to this criticism seems to be that it makes little sense to manage memory in units other than the allocation size. The way to use larger units in the memory-management subsystem is thus to allocate compound pages, represented by folios; if everything in the kernel is using larger pages, memory will fragment less and those pages will become easier to allocate. For cases where smaller sizes are needed, such as page-cache entries for small files, simple base pages could be used. With careful allocation, the single pages could be packed together, further avoiding fragmentation.

Weiner disagreed with that approach, though, saying that it puts the fragmentation-avoidance problem in the wrong place. The kernel has two levels of memory allocation: the page allocator (which deals in full pages and is where folios are relevant) and the slab allocator, which normally deals with smaller units. They have different strengths:

The page allocator is good at cranking out uniform, slightly big memory blocks. The slab allocator is good at subdividing those into smaller objects, neatly packed and grouped to facilitate contiguous reclaim, while providing detailed breakdowns of per-type memory usage and internal fragmentation to the user and to kernel developers.

According to Weiner, Wilcox's approach forces the page allocator to deal with problems that are currently well solved in the slab allocator. "As long as this is your ambition with the folio, I'm sorry but it's a NAK from me".

The real problem with folios

The ultimate decision on the merging of folios is, of course, up to Linus Torvalds. Early in the conversation, he wrote positively about the API improvements, but also noted that the patch set does bring a lot of churn to the memory-management subsystem. He concluded: "So I don't hate the patches. I think they are clever, I think they are likely worthwhile, but I also certainly don't love them."

He also, however, noted that he wasn't entirely happy with the "folio" name, thus touching off one of the more predictable dynamics of kernel-community discussions: when the technical discussion becomes difficult and intractable, it must be time to have an extended debate on naming. So David Howells suggested "sheaf" or "ream". Torvalds favored something more directly descriptive, like "head_page". Ted Ts'o thought "mempages" would work, or maybe "pageset". Nicholas Piggin favored "cluster" or "superpage". Given the discussion, Vlastimil Babka concluded that the only suitable name was "pageshed".

Wilcox has made it abundantly clear that he doesn't care about the name and will accept just about anything if that gets the code merged. He redid the pull request with everything renamed to "pageset" just to prove that point. Needless to say, no real conclusion came from that branch of the conversation.

At the end of August, Howells posted a plea for a quick resolution on the issue; there is a lot of other pending memory-management work that either depends on the folio patches or conflicts with them. He asked:

Is it possible to take the folios patchset as-is and just live with the name, or just take Willy's rename-job (although it hasn't had linux-next soak time yet)? Or is the approach fundamentally flawed and in need of redoing?

David Hildenbrand added that he would like to see folios move out of linux-next one way or the other; sooner would be better.

Now what?

For now, at least, the conversations have wound down. The 5.15 merge window is nearing its close and the folio patches have not been pulled. Chances are that means that folios will, at best, have to wait for another development cycle. That said, Torvalds has been known to hold controversial pulls until — or even past — the end of the merge window, when he has a bit more time to think them through. So even the closing of the merge window might not be an indication that the decision has been made.

The final chapter has not been written here, but either way it seems clear that there is a lot of work yet to be done in the memory-management subsystem. Much of what needs to happen has yet to be designed, much less written and debugged; that adds some strength to one last argument from Wilcox: "The folio patch is here now". Or, as Babka asked: "should we just do nothing until somebody turns that hypothetical future into code and we see whether it works or not?" If folios go down in flames, work to improve memory-management internals at this level will have to restart from the beginning, and it's not clear that there is anybody out there who is ready to take up that challenge.

Index entries for this article
Kernel: Memory management/Folios



The folio pull-request pushback

Posted Sep 11, 2021 4:23 UTC (Sat) by koverstreet (subscriber, #4296) [Link] (1 responses)

The folio pull-request pushback

Posted Sep 11, 2021 18:29 UTC (Sat) by flussence (guest, #85566) [Link]

This seems like it's becoming the next BKL... a huge amount of churn; even though the improvements are measurable and significant, it's going to take some time to convince everyone. Hopefully not as long as that, though!

The folio pull-request pushback

Posted Sep 11, 2021 19:51 UTC (Sat) by smurf (subscriber, #17840) [Link] (1 responses)

Ugh. Whatever happened to the idea that the proof is in the code?

Yes, folios may be imperfect and not-abstracted enough, but they *are* a step in the right direction and further improvements can be built on top of them.

I don't see an argument that folios block any of the other work that somebody else might be doing (but, or so it seems, currently isn't).

The folio pull-request pushback

Posted Sep 15, 2021 18:32 UTC (Wed) by marcH (subscriber, #57642) [Link]

> Yes, folios may be imperfect and not-abstracted enough, but they *are* a step in the right direction and further improvements can be built on top of them.
> I don't see an argument that folios block any of the other work that somebody else might be doing (but, or so it seems, currently isn't).

This is the key point IMHO. If some short-term code change makes more difficult some hypothetical, longer term plans, then the long term vision should be detailed enough to at least demonstrate how it conflicts with the short term change. Otherwise it's far beyond vaporware.

The folio pull-request pushback

Posted Sep 11, 2021 22:38 UTC (Sat) by dvdeug (subscriber, #10998) [Link] (11 responses)

> In a previous folio discussion, Al Viro did a quick calculation showing just how much more memory it would take to keep the kernel source in memory with a larger page size. A 64KB size would quadruple the memory used, for example; it is not a small cost.

Is that a realistic cost, though? 64KB would be 4832MB(? *); if you're someone running grep over the kernel source repeatedly, you've likely got 5GB to spare. A kernel compilation is going to be a lot messier, with a lot more temporary files and executables fighting for file cache. I've got about 3000 files open on this system according to lsof, which is far less than the 71000 kernel source files. That's an additional 192MB if each of them adds 64KB, which wouldn't be noticed.

There's a lot more to be studied, but I'm not sure that quick calculation reflects anything that really matters.

* Al Viro wrote "64Kb 4832Mb"; since he starts at 4Kb and everyone else says the base page is 4KB, not 4Kb, I assume that's careless capitalization.

The folio pull-request pushback

Posted Sep 12, 2021 11:40 UTC (Sun) by Sesse (subscriber, #53779) [Link] (3 responses)

> if you're someone running grep over the kernel source repeatedly, you've likely got 5GB to spare

I fail to find the logic in this?

But to take another case: What if I want to read my email, which is on Maildir, and would like the mailbox to be in cache so that I can open it quickly? Is it reasonable to waste gigabytes of RAM (which I would prefer to use on opening a few extra tabs in my browser…) on 64 kB pages for each email?

The folio pull-request pushback

Posted Sep 12, 2021 22:37 UTC (Sun) by dvdeug (subscriber, #10998) [Link] (2 responses)

> What if I want to read my email, which is on Maildir, and would like the mailbox to be in cache so that I can open it quickly? Is it reasonable to waste gigabytes of RAM (which I would prefer to use on opening a few extra tabs in my browser…) on 64 kB pages for each email?

Catting a small, cold file to Konsole with time -v yielded "Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.01". Dumping a 60k cold file to Konsole with time -v yielded "Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.26". In both cases, this was off hard drive. I'm going to say that demanding your entire mailbox with thousands of emails to be in cache so you can save hundredths of a second in opening old emails is unreasonable, and demanding the entire kernel for everyone to be optimized for that occurrence is even more unreasonable.

What if the reduced kernel memory and reduced data handling by the kernel for 64KB pages speeds up opening new tabs in your browser and switching between them? I'd like hard numbers on a variety of real-life situations, not handwaves based off contrived situations.

The folio pull-request pushback

Posted Sep 13, 2021 9:32 UTC (Mon) by Sesse (subscriber, #53779) [Link] (1 responses)

Opening old emails? Remember, I'm talking about opening a _mailbox_, reading every single email to check its headers. (No, not all systems can maintain a separate header cache.)

Also, please note that when you dismiss others' (real!) use cases as “handwaving based off contrived situations”, it does not come across as the most friendly way to make your case. At the very least, it's no less handwavy than the original assertion that you'll surely have 5 GB of free RAM.

The folio pull-request pushback

Posted Sep 13, 2021 22:37 UTC (Mon) by dvdeug (subscriber, #10998) [Link]

If your system does not have a separate header cache, there's a problem you might like to fix. When opening the mailbox up cold, you're going to be paying that cost anyway. Once you've opened up the mailbox, your mail program certainly can and should store the headers in memory (much more reliable than the kernel cache, and maybe even noticeably faster), and possibly should store the messages in memory, to avoid all this cache mess to begin with. (It'll cost less memory than storing them in kernel cache, even with 4KB pages.)

> it's no less handwavy than the original assertion that you'll surely have 5 GB of free RAM.

Kernel programmers are well-paid professionals. They don't have used Dell Optiplexes as their main PC; their programming boxes are almost certainly high-end hardware. $1000 will buy you a computer with 16GB. There's a lot of websites that tell you 8 GB is fine for programming, but generally kernel programmers aren't going to need or want to skimp out on their hardware.

The folio pull-request pushback

Posted Sep 13, 2021 21:50 UTC (Mon) by Paf (subscriber, #91811) [Link] (4 responses)

Ok, but think about all kinds of server use cases with many small files. They don’t necessarily have “spare” RAM laying around. They’re specced to the system. This is a truly huge overhead in a lot of real world cases which would cost a lot of real world money.

The folio pull-request pushback

Posted Sep 13, 2021 23:32 UTC (Mon) by dvdeug (subscriber, #10998) [Link] (3 responses)

How many small files are actually stored in cache? Conf files are generally read and abandoned. If it is a gain, then major server operators would be quick to code systems to store data in larger data structures; even now, tossing a bunch of small files into a database could save space and time.

I'm sure increasing the page size to 64KB would be an improvement in some cases, and hurt others. I just think that storing every single kernel file in cache is a bad comparison. (If nothing else, if grepping the entire kernel is something you're actually doing frequently, you're doing it wrong; efficient searching via precomputed indexes has been around for 50 years, or if we ignore the electronic computer part, for centuries.) There should be some sort of measurement on real situations.

The folio pull-request pushback

Posted Sep 14, 2021 5:28 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

> How many small files are actually stored in cache? Conf files are generally read and abandoned. If it is a gain, then major server operators would be quick to code systems to store data in larger data structures; even now, tossing a bunch of small files into a database could save space and time.

Hasn't the kernel just added a "read and abandon" facility? User space may read and abandon, but the kernel caches EVERYTHING by default aiui. The cost of that is measurable, and big.

Cheers,
Wol

The folio pull-request pushback

Posted Sep 20, 2021 0:37 UTC (Mon) by jwarnica (subscriber, #27492) [Link]

Kernel will cache everything, but discard LRU when needed.

A config file that is read, parsed, stored in working memory, and closed: caching it is a "waste" of cache memory either way.

It's only painful if caching that new file displaces an older cached file which will be read again in some human meaningful time.

The folio pull-request pushback

Posted Sep 14, 2021 11:05 UTC (Tue) by cpitrat (subscriber, #116459) [Link]

A computer is made for general-purpose use. Dismissing any use case that needs to open a large number of small files is a weird way to answer the concern. The kernel files are representative of what most devs would work with: small files. Is that representative of all users and all workloads? Certainly not, but it represents an existing scenario, and quadrupling the memory consumed is problematic.

Many files smaller than 64KiB are opened by the system at some point and stored in cache (all configuration files, log files ...). In /etc alone I have 2500 files smaller than 8KiB, and I'm pretty sure most of them have been opened at one point and cached by the kernel. Same for files under $HOME/.config and other dotted directories.

The folio pull-request pushback

Posted Sep 17, 2021 9:31 UTC (Fri) by arnd (subscriber, #8866) [Link]

I did some measurements a while ago, using linux-5.4 at the time, to see the effect of the additional memory usage of the different page sizes. I tried running the linux-5.14-rc1 folio kernel as well, and put the results in a graph:

https://docs.google.com/spreadsheets/d/1Y-eeXEHr8Tud2ul4i...

I did this on a 16-core Arm machine that supports 4KB, 16KB, and 64KB pages, giving 4GB to a virtual machine and pinning down part of that memory before building a fixed kernel source tree.

Since the compiler uses mostly contiguous anonymous memory, the effect of the page size is not as strong as it would be if only the page cache (which wastes more memory) were considered, but you can definitely see that the 64KB kernel needs around double the RAM compared to a 4KB kernel, and it also suffers more when it does start paging. The 4KB kernel seems to work much better when it's already deep into swap, while the 64KB kernel becomes unusable almost instantly as soon as it runs out of free pages.

The 16KB-page kernel works better than expected -- not only is it almost as fast as the 64KB version when it has enough RAM available, it also copes with out-of-memory conditions almost as well as the 4KB version.

The folio-enabled kernel also seems to have a problem with running into swap, but I don't know if that's a result of something different in the folio patches, or a difference between the old 5.4 kernel and the new 5.14-rc1 version. If I find the time to run another test with 5.14-rc1 without the folio patches, I'll add the data to the graph.

The folio pull-request pushback

Posted Sep 28, 2021 16:13 UTC (Tue) by immibis (subscriber, #105511) [Link]

Okay, what if I have 32GB of RAM and I want to grep a bigger project that was using 30GB before? You can't really justify a 300% overhead(!!!) by saying it's okay on one particular workload.

The folio pull-request pushback

Posted Sep 23, 2021 8:22 UTC (Thu) by SomeOtherGuy (guest, #151918) [Link]

There's a lot of focus on the name here, this reference is often overused but I think you guys are having a bit of a "bikeshed" moment - the name is the most trivial part of this.

This is quite a common situation, so much so that the object-orientated analogy is common (but slightly altered).

We have a Bird class, later Birds gain the ability to fly, our Penguin subclass can't fly - we have to handle that.

At some point the sane action is to create a FlyingBird subclass and stick your flying birds there and leave Penguin under the now-implied-flightless Bird base class, or (which you can do here as structs don't inherit) have a FlightlessBird name we put Penguins under.

You have a third option: both, you now have FlightlessBird and FlyingBird and the implied-property-of-Bird (whichever it was, flightless or not) is now attached to the name.

That's it, those are your options up to name isomorphism - bickering about FlyingBird vs flightful_bird doesn't change what's going on.

The folio pull-request pushback

Posted Oct 10, 2021 9:00 UTC (Sun) by scientes (guest, #83068) [Link]

Maybe the TLB could be put *after* the cache, by using globally addressable memory, and having ranged memory permissions for such large slabs of memory, instead of paged memory be the only option (you can still defrag by putting a TLB and MMU *behind* the cache, which like an IOMMU has minimal overhead). We have 64-bit of address space now—there is no shortage of address space.


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds