
Ghosts of Unix past, part 2: Conflated designs

November 4, 2010

This article was contributed by Neil Brown

In the first article in this series, we commenced our historical search for design patterns in Linux and Unix by illuminating the "Full exploitation" pattern which provides a significant contribution to the strength of Unix. In this second part we will look at the first of three patterns which characterize some design decisions that didn't work out so well.

The fact that these design decisions are still with us and worth talking about shows that their weaknesses were not immediately obvious and, additionally, that these designs lasted long enough to become sufficiently entrenched that simply replacing them would cause more harm than good. With these types of design issues, early warning is vitally important. The study of these patterns is only of value if it helps us to avoid similar mistakes early enough. If it only allowed us to classify that which we cannot avoid, there would be little point in studying them at all.

These three patterns are ordered from the one which seems to give most predictive power to that which is least valuable as an early warning. But hopefully the ending note will not be one of complete despair - any guidance in preparing for the future is surely better than none.

Conflated Designs

This week's pattern is exposed using two design decisions which were present in early Unix and have been followed by a series of fixes which have addressed most of the resulting difficulties. By understanding the underlying reason that the fixes were needed, we can hope to avoid future designs which would need such fixing. The first of these design decisions is taken from the implementation of the single namespace discussed in part 1.

The mount command

The central tool for implementing a single namespace is the 'mount' command, which makes the contents of a disk drive available as a filesystem and attaches that filesystem to the existing namespace. The flaw in this design which exemplifies this pattern is the word 'and' in that description. The 'mount' command performs two separate actions in one command. Firstly it makes the contents of a storage device appear as a filesystem, and secondly it binds that filesystem into the namespace. These two steps must always be done together, and cannot be separated. Similarly the unmount command performs the two reverse actions of unbinding from the namespace and deactivating the filesystem. These are, or at least were, inextricably combined and if one failed for some reason, the other would not be attempted.
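
The conflation is visible in the C interface itself. Here is a minimal sketch (the device, mount point and filesystem type are invented for the example) of the single call that both activates a filesystem and attaches it at a name:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* One call performs both halves: activate the ext4 filesystem on
         * /dev/sda1 and bind it into the namespace at /mnt.  If either
         * half fails, the whole operation fails. */
        if (mount("/dev/sda1", "/mnt", "ext4", 0, NULL) < 0)
            perror("mount");
        return 0;
    }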

It may seem at first that it is perfectly natural to combine these two operations and there is no value in separating them. History, however, suggests otherwise. Considerable effort has gone into separating these operations from each other.

Since version 2.4.11 (released in 2001), Linux has a 'lazy' version of unmount. This unbinds a filesystem from the namespace without insisting on deactivating it at the same time. This goes some way to splitting out the two functional aspects of the original unmount. The 'lazy' unmount is particularly useful when a filesystem has started to fail for some reason, a common example being an NFS filesystem from a server which is no longer accessible. It may not be possible to deactivate the filesystem as there could well be processes with open files on the filesystem. But at least with a lazy unmount it can be removed from the namespace, so new processes won't try to open files on it and get stuck.
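
For example (the mount point is invented), a lazy unmount is requested by passing MNT_DETACH to umount2(), the call behind 'umount -l':

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* Detach the (possibly dead) filesystem from the namespace now;
         * actual deactivation happens whenever the last user goes away. */
        if (umount2("/mnt/stale-nfs", MNT_DETACH) < 0)
            perror("umount2");
        return 0;
    }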

As well as 'lazy' unmounts, Linux developers have found it useful to add 'bind' mounts and 'move' mounts. These allow one part of the namespace to be bound to another part of the namespace (so it appears twice) or a filesystem to be moved from one location to another — effectively a 'bind' mount followed by a 'lazy' unmount. Finally we have a pivot_root() system call which performs a slightly complicated dance between two filesystems, starting out with the first being the root filesystem and the second being a normally mounted filesystem, and ending with the second being the root and the first being mounted somewhere else in that root.
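
Both 'bind' and 'move' are expressed as flags to the same mount() call; a small sketch with invented paths:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* Bind mount: the tree at /data also appears at /srv/export. */
        if (mount("/data", "/srv/export", NULL, MS_BIND, NULL) < 0)
            perror("bind mount");

        /* Move mount: detach the attachment at /srv/export and
         * re-attach it at /srv/archive. */
        if (mount("/srv/export", "/srv/archive", NULL, MS_MOVE, NULL) < 0)
            perror("move mount");
        return 0;
    }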

It might seem that all of the issues with combining the two functions into a single 'mount' operation have been adequately resolved in the natural course of development, but it is hard to be convinced of this. The collection of namespace manipulation functions that we now have is quite ad hoc and so, while it seems to meet current needs, there can be no certainty that it is in any sense complete. A hint of this incompleteness can be seen in the fact that, once you perform a lazy unmount, the filesystem may well still exist, but it is no longer possible to manipulate it as it does not have a name in the global namespace, and all current manipulation operations require such a name. This makes it difficult to perform a 'forced' unmount after a 'lazy' unmount.

To see what a complete interface would look like we would need to exploit the design concept discussed last week: "everything can have a file descriptor". Had that pattern been imposed on the design of the mount system call we would likely have:

  • A mount call that simply returned a file descriptor for the file system.
  • A bind call that connected a file descriptor into the namespace, and
  • An unmount call that disconnected a filesystem and returned a file descriptor.

This simple set would easily provide all the functionality that we currently have in an arguably more natural way. For example, the functionality currently provided by the special-purpose pivot_root() system call could be achieved with the above calls plus, at most, the addition of fchroot(), an obvious analogue of fchdir() and chroot().
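
To be clear, none of the calls below exist; this is only a sketch of what that three-operation interface might look like, with all names and signatures invented for illustration:

    /* Hypothetical interface only -- these system calls do not exist. */
    int fsfd = mount_fd("/dev/sda1", "ext4", 0);  /* activate, return an fd  */
    bind_fd(fsfd, "/mnt");                        /* attach the fd at a name */
    int oldroot = unmount_fd("/");                /* detach, but keep the fd */
    /* With an fchroot() analogous to fchdir(), the effect of pivot_root()
     * would fall out of these pieces rather than needing a special call. */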

One of the many strengths of Unix - particularly seen in the set of tools that came with the kernel - is the principle of building and then combining tools. Each tool should do one thing and do it well. These tools can then be combined in various ways, often to achieve ends that the tool developer could not have foreseen. Unfortunately the same discipline was not maintained with the mount() system call.

So this pattern is to some extent the opposite of the 'tools approach'. It needs a better name than that, though; a good choice seems to be to call it a "conflated design". One dictionary (PJC) defines "conflate" as "to ignore distinctions between, by treating two or more distinguishable objects or ideas as one", which seems to sum up the pattern quite well.

The open() system call.

Our second example of a conflated design is found in the open() system call. This system call (in Linux) takes 13 distinct flags which modify its behavior, adding or removing elements of functionality - multiple concepts are thus combined in the one system call. Much of this combination does not imply a conflated design. Several of the flags can be set or cleared independently of the open() using the F_SETFL option to fcntl(). Thus while they are commonly combined, they are easily separated and so need not be considered to be conflated.
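
For instance, O_APPEND and (in its read/write sense, discussed below) O_NONBLOCK can be turned on or off at any time after the open; a minimal sketch:

    #include <fcntl.h>

    /* Set an extra status flag (e.g. O_NONBLOCK or O_APPEND) on an
     * already-open descriptor; returns -1 on error. */
    int set_status_flag(int fd, int flag)
    {
        int flags = fcntl(fd, F_GETFL);

        if (flags < 0)
            return -1;
        return fcntl(fd, F_SETFL, flags | flag);
    }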

Three elements of the open() call are worthy of particular attention in the current context. They are O_TRUNC, O_CLOEXEC and O_NONBLOCK.

In early versions of Unix, up to and including Level 7, opening with O_TRUNC was the only way to truncate a file and, consequently, it could only be truncated to become empty. Partial truncation was not possible. Having truncation intrinsically tied to open() is exactly the sort of conflated design that should be avoided and, fortunately, it is easy to recognize. BSD Unix introduced the ftruncate() system call which allows a file to be truncated after it has been opened and, additionally, allows the new size to be any arbitrary value, including values greater than the current file size. Thus that conflation was easily resolved.
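
A sketch of the separated form (the file name is invented): the descriptor is obtained first, and truncation, to any size at all, is a distinct operation on it:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.log", O_WRONLY);

        /* Truncate (or extend) to 4096 bytes, quite separately from the
         * open; with ftruncate() the new size need not be zero. */
        if (fd < 0 || ftruncate(fd, 4096) < 0)
            perror("truncate");
        return 0;
    }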

O_CLOEXEC has a more subtle story. The standard behavior of the exec() system call (which causes a process to stop running one program and to start running another) is that all file descriptors available before the exec() are equally available afterward. This behavior can be changed, quite separately from the open() call which created the file descriptor, with another fcntl() call. For a long time this appeared to be a perfectly satisfactory arrangement.

However the advent of threads, where multiple processes could share their file descriptors (so when one thread or process opens a file, all threads in the group can see the file descriptor immediately), made room for a potential race. If one process opens a file with the intent of setting the close-on-exec flag immediately, and another process performs an exec() (which causes the file table to not be shared any more), the new program in the second process will inherit a file descriptor which it should not. In response to this problem, the recently-added O_CLOEXEC flag causes open() to mark the file descriptor as close-on-exec atomically with the open so there can be no leakage.
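
A sketch of the difference: in the two-step version, the window between the two calls is where a concurrent fork()/exec() can leak the descriptor; the one-step version leaves no such window.

    #include <fcntl.h>

    int open_private_racy(const char *path)
    {
        int fd = open(path, O_RDONLY);

        if (fd >= 0)
            fcntl(fd, F_SETFD, FD_CLOEXEC);  /* too late if an exec() already happened */
        return fd;
    }

    int open_private_safe(const char *path)
    {
        return open(path, O_RDONLY | O_CLOEXEC);  /* marked atomically with the open */
    }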

It could be argued that creating a file descriptor and allowing it to be preserved across an exec() should be two separate operations. That is, the default should have been to not keep a file descriptor open across exec(), and a special request would be needed to preserve it. However foreseeing the problems of threads when first designing open() would be beyond reasonable expectations, and even to have considered the effects on open() when adding the ability to share file tables would be a bit much to ask.

The main point of the O_CLOEXEC example, then, is to acknowledge that recognizing a conflated design early can be very hard, which hopefully will be an encouragement to put more effort into reviewing a design for these sorts of problems.

The third flag of interest is O_NONBLOCK. This flag is itself conflated, but also shows conflation within open(). In Linux, O_NONBLOCK has two quite separate, though superficially similar, meanings.

Firstly, O_NONBLOCK affects all read or write operations on the file descriptor, allowing them to return immediately after processing less data than requested, or even none at all. This functionality can separately be enabled or disabled with fcntl() and so is of little further interest.

The other function of O_NONBLOCK is to cause the open() itself not to block. This has a variety of different effects depending on the circumstances. When opening a named pipe for write, the open will fail rather than block if there are no readers. When opening a named pipe for read, the open will succeed rather than block, and reads will then return an error until some process writes something into the pipe. On CDROM devices, an open for read with O_NONBLOCK will also succeed, but no disk checks will be performed and so no reads will be possible. Rather, the file descriptor can only be used for ioctl() commands such as to poll for the presence of media or to open or close the CDROM tray.
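
A sketch of the CDROM case (the device path is an assumption; the ioctl() names are those from linux/cdrom.h): the open succeeds with no medium present, and the descriptor is then useful only for control operations:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/cdrom.h>

    int main(void)
    {
        /* Succeeds even with an empty drive; no media access is attempted. */
        int fd = open("/dev/cdrom", O_RDONLY | O_NONBLOCK);

        if (fd >= 0) {
            int status = ioctl(fd, CDROM_DRIVE_STATUS, CDSL_CURRENT);
            printf("drive status: %s\n",
                   status == CDS_DISC_OK ? "disc present" : "no usable disc");
        }
        return 0;
    }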

The last gives a hint concerning another aspect of open() which is conflated. Allocating a file descriptor to refer to a file and preparing that file for I/O are conceptually two separate operations. They certainly are often combined and including them both in the one system call can make sense. Requiring them to be combined is where the problem lies.

If it were possible to get a file descriptor on a given file (or device) without waiting for or triggering any action within that file, and, subsequently, to request the file be readied for I/O, then a number of subtle issues would be resolved. In particular there are various races possible between checking that a file is of a particular type and opening that file. If the file was renamed between these two operations, the program might suffer unexpected consequences of the open. The O_DIRECTORY flag was created precisely to avoid this sort of race, but it only serves when the program is expecting to open a directory. This race could be simply and universally avoided if these two stages of opening a file were easily separable.
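
O_DIRECTORY closes that window for the one case it covers; the type check and the open become a single operation, as in this sketch:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>

    int open_dir(const char *path)
    {
        /* Fails with ENOTDIR if the name no longer refers to a directory,
         * so no separate (and racy) type check is needed beforehand. */
        int fd = open(path, O_RDONLY | O_DIRECTORY);

        if (fd < 0 && errno == ENOTDIR)
            fprintf(stderr, "%s is not a directory\n", path);
        return fd;
    }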

A strong parallel can be seen between this issue and the 'socket' API for creating network connections. Sockets are created almost completely uninitialized; thereafter a number of aspects of the socket can be tuned (with e.g. bind() or setsockopt()) before the socket is finally connected.

In both the file and socket cases there is sometimes value in being able to set up or verify some aspects of a connection before the connection is effected. However with open() it is not really possible in general to separate the two.
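
For comparison, a sketch of the staged socket sequence: the descriptor exists, and can be inspected and tuned, before any connection is attempted:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/socket.h>

    int connect_tuned(const char *ip, int port)
    {
        int one = 1;
        struct sockaddr_in sa;
        int fd = socket(AF_INET, SOCK_STREAM, 0);     /* step 1: bare descriptor */

        if (fd < 0)
            return -1;
        /* step 2: tune the still-unconnected socket */
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_port = htons(port);
        inet_pton(AF_INET, ip, &sa.sin_addr);
        /* step 3: only now is the connection actually effected */
        return connect(fd, (struct sockaddr *)&sa, sizeof(sa)) == 0 ? fd : -1;
    }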

It is worth noting here that opening a file with the 'flags' set to '3' (which is normally an invalid value) can sometimes have a similar meaning to O_NONBLOCK in that no particular read or write access is requested. Clearly developers see a need here but we still don't have a uniform way to be certain of getting a file descriptor without causing any access to the device, or a way to upgrade a file descriptor from having no read/write access to having that access.

As we saw, most of the difficulties caused by conflated design, at least in these two examples, have been addressed over time. It could therefore be argued that, as there is minimal ongoing pain, the pattern should not be a serious concern. That argument, though, would miss two important points. Firstly, these conflated designs have already caused pain over many years. This could well have discouraged people from using the whole system and so reduced the overall involvement in, and growth of, the Unix ecosystem.

Secondly, though the worst offenses have largely been fixed, the result is not as neat and orthogonal as it could be. As we saw during the exploration, there are some elements of functionality that have not yet been separated out. This is largely because there is no clear need for them. However we often find that a use for a particular element of functionality only presents itself once the functionality is already available. So by not having all the elements cleanly separated we might be missing out on some particularly useful tools without realizing it.

There are undoubtedly other areas of Unix or Linux design where multiple concepts have been conflated into a single operation; however, the point here is not to enumerate all of the flaws in Unix. Rather it is to illustrate the ease with which separate concepts can be combined without even noticing it, and the difficulty (in some cases) of separating them after the fact. This hopefully will be an encouragement to future designers to be aware of the separate steps involved in a complex operation and to allow - where meaningful - those steps to be performed separately if desired.

Next week we will continue this exploration and describe a pattern of misdesign that is significantly harder to detect early, and appears to be significantly harder to fix late. Meanwhile, following are some exercises that may be used to explore conflated designs more deeply.

Exercises.

  1. Explain why open() with O_CREAT benefits from an O_EXCL flag, but other system calls which create filesystem entries (mkdir(), mknod(), link(), etc) do not need such a flag. Determine if there is any conflation implied by this difference.

  2. Explore the possibilities of the hypothetical bind() call that attaches a file descriptor to a location in the namespace. What other file descriptor types might this make sense for, and what might the result mean in each case?

  3. Identify one or more design aspects in the IP protocol suite which show conflated design and explain the negative consequences of this conflation.

Next article: Ghosts of Unix past, part 3: Unfixable designs


Ghosts of Unix past, part 2: Conflated designs

Posted Nov 4, 2010 18:51 UTC (Thu) by pj (subscriber, #4506) [Link]

I think someone else did a bunch of this kind of analysis... it resulted in plan9.

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 5, 2010 10:58 UTC (Fri) by Yorick (guest, #19241) [Link] (10 responses)

O_CLOEXEC is only a hack needed because of the mistake of not having close-on-exec as the default state of any newly created descriptor. Requiring an explicit fcntl(~CLOEXEC) would have saved us much grief.

However, CLOEXEC isn't really a natural property of the descriptor but of the execve() call which perhaps should take a list of descriptors to preserve (and how they should map into the descriptor number space afterwards) instead.

This over-adornment of descriptors is even more obvious for O_NONBLOCK which doesn't belong there at all, because it governs how individual read and write calls work; this is not a deep property of the descriptor. Anyone who has tried to do, say, non-blocking reads and blocking writes on the same socket (especially in different threads) knows about this. Having a flag in read()/write() (etc) would be better in all respects, not least for code understanding and review.

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 5, 2010 11:45 UTC (Fri) by RobSeace (subscriber, #4435) [Link] (2 responses)

> Anyone who has tried to do, say, non-blocking reads and blocking writes on
> the same socket (especially in different threads) knows about this. Having
> a flag in read()/write() (etc) would be better in all respects, not least
> for code understanding and review.

Well, for sockets you already have this as a recv()/send() flag: MSG_DONTWAIT... So, it's really only an issue for non-socket FDs; and, that would require some kind of new I/O functions which took flags like recv()/send() do, in order to solve... (Or, just make recv()/send() work on non-socket FDs?)
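
For example (sketch only):

    #include <sys/socket.h>

    /* A per-call non-blocking read on a socket, independent of any
     * O_NONBLOCK state on the descriptor itself. */
    ssize_t try_recv(int sockfd, void *buf, size_t len)
    {
        /* Returns -1 with errno EAGAIN/EWOULDBLOCK if no data is
         * available right now; the call never blocks. */
        return recv(sockfd, buf, len, MSG_DONTWAIT);
    }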

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 5, 2010 12:58 UTC (Fri) by Yorick (guest, #19241) [Link] (1 responses)

Yes, it is a bit annoying that send/recv only work for sockets. And the mere existence of a per-descriptor NONBLOCK flag is enough to make it much harder to see whether a given call to read()/write() is likely to block or not.

More to the point, there is an ever-growing set of I/O syscalls doing essentially the same thing, none being a clear super-set of the rest: read/write, pread/pwrite, readv/writev, preadv/pwritev, send/recv, sendto/recvfrom, sendmsg/recvmsg... There were good reasons for every addition, but there is a clear lack of orthogonality and no single general call that the rest can be defined in terms of.

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 12, 2010 23:08 UTC (Fri) by giraffedata (guest, #1954) [Link]

I agree that in many cases, the non-blockingness is a property of the read, and not the file descriptor. But making it an argument of read() instead of a file descriptor attribute would violate a fundamental Unix principle, covered in the first article of this series: that of the generic byte stream.

In some cases, you want a piece of code to be agnostic of blocking, just as it is agnostic to socket vs tape drive. The code neither knows nor cares whether its read will block or return zero bytes. It's the caller's business alone.

So I would like to see both. The pread() situation is quite analogous: pread() extends the simple byte stream with the concept of stream position, but another program can still remain agnostic to position, using classic read() while its caller manipulates position with lseek().

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 8, 2010 19:54 UTC (Mon) by dd9jn (✭ supporter ✭, #4459) [Link] (6 responses)

> perhaps should take a list of descriptors to preserve

An easier and compatible fix would be a system call to close all file descriptors except those given to that call. The workaround everyone uses today, figuring out the maximum number of file descriptors possible and calling close(2) for each of them, is quite expensive in terms of system calls (~1000 calls in standard situations).
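
I.e. something like this sketch (the fallback limit is just a guess):

    #include <unistd.h>

    /* Close every descriptor that could possibly be open, except
     * stdin/stdout/stderr -- one close() per possible descriptor. */
    void close_all_but_stdio(void)
    {
        long maxfd = sysconf(_SC_OPEN_MAX);

        if (maxfd < 0)
            maxfd = 1024;           /* no reliable answer; guess */
        for (long fd = 3; fd < maxfd; fd++)
            close(fd);
    }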

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 8, 2010 21:13 UTC (Mon) by Yorick (guest, #19241) [Link] (5 responses)

Reading /proc/self/fd/ and closing only those found there is a bit better, but not quite as portable and still a lot slower than something like BSD closefrom(), which surely would be handy.
But it would still be inferior to having CLOEXEC by default (or passing a list of descriptors to preserve to exec). It is way too easy to leak descriptors by accident or because of careless code in a library.

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 11, 2010 1:10 UTC (Thu) by jonabbey (guest, #2736) [Link] (3 responses)

Reading from /proc/self/fd and closing descriptors based on that is still subject to race conditions in threaded programs.

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 11, 2010 2:38 UTC (Thu) by foom (subscriber, #14868) [Link]

Nope, it's not, because you do the closing of fds after forking (but before exec). You are guaranteed that there will be no other threads running at that point.

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 11, 2010 3:30 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

After fork(), a program has one thread.

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 17, 2010 0:23 UTC (Wed) by mhelsley (guest, #11324) [Link]

And if your process links with libraries that internally use fds then you could easily trample the "internal" workings of those libraries by closing the fds. So a naive process can't assume it knows how to handle each fd in /proc/self/fd* without blurring the lines between library and application.

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 11, 2010 11:28 UTC (Thu) by dd9jn (✭ supporter ✭, #4459) [Link]

Thanks for the pointer to closefrom - not exactly what I want but helpful. Time for a new configure test. And, yes, /proc/foo is not portable enough.

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 5, 2010 13:33 UTC (Fri) by dskoll (subscriber, #1630) [Link] (1 responses)

...hypothetical bind() call that attaches a file descriptor to a location in the namespace...

Isn't that like fattach() from the STREAMS system?

The hypothetical `bind'...

Posted Nov 9, 2010 17:43 UTC (Tue) by civodul (guest, #58311) [Link]

How hypothetical is this: http://man.cat-v.org/plan_9/2/bind ? :-)

Typo

Posted Nov 11, 2010 2:32 UTC (Thu) by dw (subscriber, #12017) [Link] (1 responses)

s/is going to into the/is going to go into the/

Typo

Posted Nov 11, 2010 2:33 UTC (Thu) by dw (subscriber, #12017) [Link]

Managed to click on the wrong Comments link, argh!

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 11, 2010 8:32 UTC (Thu) by mti (subscriber, #5390) [Link] (3 responses)

I do agree with the author that some parts of the Unix API are less than optimal for a modern computer. But I very much do not agree that it is bad design, quite the opposite. Calls like open(), read(), write(), mount() were designed ~40 years ago and are still quite usable (if not perfect). This is really good design.

There is not much we can learn from the "design flaws" mentioned in the article. It would have been very hard for Thompson and Ritchie to predict CDROM, sockets or distributed file systems.

The beauty of the original design was its simplicity. This made it possible to later extend the design.

If we want to look at bad designs there are better examples. I would suggest SYSV IPC or curses. Or maybe look outside Unix. See how the equivalent of read(), write() and mount() were done on CP/M or DOS 1.0.

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 11, 2010 23:51 UTC (Thu) by bronson (subscriber, #4806) [Link] (2 responses)

It's a waste of time to look at a bad design and discuss all the ways it could be better. Anybody can do that. Taking an excellent design and finding ways to improve it, now that is interesting. SysV IPC and curses (and device numbers and terminfo/tty and nondeterministic exec and...) have been beaten to death for decades. No need for that on LWN.

Not sure where you see the author saying anything is bad design. He's just considering how good things can be made better. Show me something that can't!

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 12, 2010 22:58 UTC (Fri) by giraffedata (guest, #1954) [Link]

I believe the author does say that conflating is bad design.

And the point of looking at design patterns is that one shouldn't expect to predict things like CDROMs, sockets, and network filesystems. Instead of trying to list all the ways your thing will be used, just follow certain patterns and things will work out by themselves. Even if you can't see, or there doesn't exist, any present downside to conflating two designs, don't conflate them anyway and you will be more successful.

We may still be able to excuse Thompson and Ritchie with a hindsight argument by saying that the way things looked at the time, creating a filesystem image and adding it to the namespace were fundamentally a single gestalt, and it is only since then that we have learned to think of it as two things.

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 18, 2010 15:32 UTC (Thu) by renox (guest, #23785) [Link]

> Taking an excellent design and finding ways to improve it, now that is interesting.

I agree but Plan9's designers already did this to Unix..
So improving on Plan9 would be interesting, but I'm not sure I see the point for Unix!

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 11, 2010 23:55 UTC (Thu) by bronson (subscriber, #4806) [Link] (11 responses)

Just a quick nod to the unix gurus for NOT conflating fork/exec, something that intuition would suggest should be conflated (if you think fork/exec actually is intuitive, just look at all the OSes that got it wrong).

Not a week goes by that I don't find my life made better by forking, tweaking, then execing. Mad props.
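
E.g. the classic shape of it, roughly (the program and log file are just examples):

    #include <fcntl.h>
    #include <sys/wait.h>
    #include <unistd.h>

    void run_ls_logged(const char *logfile)
    {
        pid_t pid = fork();

        if (pid == 0) {                         /* child: tweak, then exec */
            int fd = open(logfile, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            dup2(fd, STDOUT_FILENO);            /* redirect stdout */
            close(fd);
            execlp("ls", "ls", "-l", (char *)NULL);
            _exit(127);                         /* only reached if exec failed */
        } else if (pid > 0) {
            waitpid(pid, NULL, 0);              /* parent just waits */
        }
    }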

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 17, 2010 12:42 UTC (Wed) by brinkmd (guest, #45122) [Link] (1 responses)

But fork() is very much a conflated system call, so is exec(). Fork duplicates the address space, and the descriptor table, and a bunch of other stuff. exec() loads a binary image into an address space and creates a thread and makes that thread runnable in the address space.

That fork() is conflated is even visible within the limited world view of Linux, see clone().

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 17, 2010 12:52 UTC (Wed) by brinkmd (guest, #45122) [Link]

Sorry, of course exec does not create a thread, that would be spawn(), another conflated call. By the way, fork/exec is an example of bad design, see the interaction of fork with pthreads, or the problems of open file descriptors being inherited unwillingly (FD_CLOEXEC). People who write portable software learn to forget about fork and exec very quickly.

Ghosts of Unix past, part 2: Conflated designs

Posted Jan 4, 2011 22:27 UTC (Tue) by lwn555 (guest, #72175) [Link] (8 responses)

Admittedly, the fork design pattern is appealing for achieving parallelism in a simple way. However it has many shortcomings which are not immediately obvious.

The fork design pattern is terribly inefficient on systems without virtual memory hardware. Even on those with a VMM unit, copying all the page tables just to perform a simple task is often needlessly expensive. This encourages applications to manage process pools which defeats the simplicity of using fork for simple tasks.

As mentioned by another poster, unfortunately unix file descriptors default to inheritable, which is the opposite of what is desired. In just about 100% of cases, the code doing the fork knows exactly which file handles it wants to pass into a child, yet this code knows nothing about the file descriptors opened in 3rd party libraries. In fact, even if the third party code sets CLOEXEC correctly for itself, a process wishing to spawn multiple children has no way to set the flags correctly for all children. This problem is amplified for multithreaded programs, which can be cloned with file handles and mutexes in invalid states, necessitating the kludge which is pthread_atfork.

This is exactly the reason it's common for security-minded Linux apps to cycle through closing the first 1024 file descriptors immediately before calling exec. This is the only way to be reasonably confident (but not 100%) that handles are not inadvertently leaked to children.

In order to be efficient, the operating system must over commit resources to accommodate all processes using fork. Consider a web browser session occupying some 100MB of ram. Suppose it forks children to do parallel processing, such as downloading files. Now, the main browser continues to fetch new pages and media, which fits into the same 100MB of ram, however the existence of forked children means the kernel cannot free the old unused 100MB of ram since it belongs to a child.

Fork just gets more problematic as the parent processes get larger.
In principle it's not unreasonable for a 1.5GB database process to spawn a 5MB job, yet the fork implies over-committing 1.5GB of ram to this single child at least temporarily. In practice, over-committing can lead to insufficient memory conditions, which is why kernel developers invented the dreaded "Out of memory process killer" to kill otherwise well behaved processes under Linux.

Consider that without fork, the fundamental need to overcommit disappears.

Combine all this with the fact that fork isn't very portable, and one must come to the conclusion that fork should generally be avoided in large scale projects. Or, if it is used, the parent's role should be limited to forking and monitoring children. This largely precludes the benefits of the fork programming pattern in the first place.

Ghosts of Unix past, part 2: Conflated designs

Posted Jan 4, 2011 23:28 UTC (Tue) by dlang (guest, #313) [Link] (7 responses)

what you are missing is that linux has for years not actually allocated all that extra ram for a fork, instead it has marked the ram as being shared, but copy-on-write (COW), so that if the memory is not written to, it is never duplicated.

there is some overhead in changing the page tables, but it's pretty low.

Ghosts of Unix past, part 2: Conflated designs

Posted Jan 5, 2011 4:40 UTC (Wed) by khc (guest, #45209) [Link]

I don't think he missed it, isn't that all the overcommit stuff he was talking about?

Ghosts of Unix past, part 2: Conflated designs

Posted Jan 5, 2011 5:04 UTC (Wed) by lwn555 (guest, #72175) [Link]

"what you are missing is that linux has for years not actually allocated all that extra ram for a fork, instead it has marked the ram as being shared, but copy-on-write (COW)"

I don't believe I've said anything to contradict this.
On systems with a MMU, fork copies the page tables and not the pages themselves such that the new processes share physical ram until they are written to.

" so that if the memory is not written to, it is never duplicated."

Whether you've realized it or not, the problem of over-committed memory remains present. At the time the kernel receives the "fork()" syscall from a large process (imagine 1.5GB working set) which uses more ram than is available to the child, it has to choose between two bad choices:
1. Either deny the request up front due to low memory constraints.
or
2. over-commit memory in a gamble that neither the parent nor the child will change too many pages.

Both answers are seriously flawed. I gave two examples of applications which demonstrate either the inefficiency of fork(), or the risky over commit behavior.

Most administrators will agree that the "OOM Killer" has no place in stable production environments. The only way to guarantee that well behaved processes are not killed is for the kernel to guarantee resources by not over-committing them. This spells trouble for interfaces like fork(), which depend on over-committed memory to work efficiently.

Without over-committed memory, a large process would find itself unable to issue fork/exec calls to spawn a small process.

If the parent is a tiny daemon whose only purpose is to spawn children, this isn't such a big deal. However, it is a disappointment that the fork syscall is either very risky, or a resource hog when called from large parents.

Even if fork had no other problems, this is an excellent reason to seek alternatives.

Ghosts of Unix past, part 2: Conflated designs

Posted Jan 5, 2011 6:32 UTC (Wed) by lwn555 (guest, #72175) [Link] (4 responses)

Obviously the following link is for Solaris rather than linux, but it provides additional insight into the problems of forking which I've attempted to explain.

http://developers.sun.com/solaris/articles/subprocess/sub...

Ghosts of Unix past, part 2: Conflated designs

Posted Jan 7, 2011 0:32 UTC (Fri) by bronson (subscriber, #4806) [Link] (3 responses)

From the paper:

> Even though fork() has been improved over the years to use the COW (copy-on-write) semantics

If the years the author is referring to is the 70s, then sure! Otherwise, the paper appears to be little more than an indictment of a poor implementation of fork.

Ghosts of Unix past, part 2: Conflated designs

Posted Jan 7, 2011 23:53 UTC (Fri) by lwn555 (guest, #72175) [Link] (2 responses)

I would not know when COW fork was implemented in various kernels.
Presumably not long after MMU hardware became available.

Still, a 1GB process needs 244,140 * 4KB page entries to be copied for the child. That's a lot of baggage if the child's sole purpose is to call exec(). Better to use vfork/exec when possible.

I'd like to be clear that the over commit issues with fork() are not an implementation problem but are a fundamental consequence of what fork does.

If the parent has a working data set of 100MB, and the child only needs 5MB from the parent, fork() still marks the remaining 95MB as needed by the child.

Assume the parent modifies its entire 100MB working set while the child continues running with its 5MB working set; then eventually both processes will consume 200MB instead of the 105MB which is technically needed.

So, regardless of the fork implementation, 95MB out of 200MB is wasted. As the parent spawns more children over time, the % wasted only gets worse.

Of course there are workarounds, but they come at the expense of forgoing the semantics which make fork appealing in the first place: inheriting context and data structures from the parent without IPC.

Ghosts of Unix past, part 2: Conflated designs

Posted Jan 8, 2011 0:12 UTC (Sat) by dlang (guest, #313) [Link] (1 responses)

if the child really only needs the 5MB, it can free the rest of the allocations and you are back to the 105MB total.

if the programmer isn't sure if the child needs 5MB of data of the entire 100MB of data then they would need to keep everything around in any case.

the worst-case of COW is that you use as much memory as you would without it. In practice this has been shown empirically to be a very large savings. some people are paranoid about this and turn off overcommit so that even in this worst case they would have the memory, but even they benefit from the increased speed, and from the fact that almost all the time the memory isn't needed.

so I disagree with your conclusion that there is so much memory wasted.

Ghosts of Unix past, part 2: Conflated designs

Posted Jan 8, 2011 9:12 UTC (Sat) by lwn555 (guest, #72175) [Link]

"if the child really only needs the 5MB, it can free the rest of the allocations and you are back to the 105MB total."

Easily said. While it's technically possible to free all unused memory pages after a fork, it's unusual to actually do this. The piece of code calling fork() may not be aware of, or related to, the memory allocated by the rest of the process.

Consider how difficult it would be for one library to deallocate the structures of other libraries after performing a fork.

Even if we did track all objects to free after forking, malloc may or may not actually be able to free the pages back to the system, particularly with pages allocated linearly via sbrk() since objects needed by the child are likely to be near the end.

"the worst-case of COW is that you use as much memory as you would without it."

We can agree there are no reasons not to use copy on write to implement fork.

"so I disagree with your conclusion that there is so much memory wasted."

Then I think you misunderstood the example. No matter which way you cut it, so long as the child doesn't do anything to explicitly free unused pages, it is stuck with 95MB of unusable ram. If the parent updates its entire working set, then the child will be the sole owner of the data. If the parent quits and the child is allowed to continue, then the useless 95MB is still there. And this is only for one child.

You may feel this is a contrived example, but I can think of many instances where it would be desirable for a large parent to branch work into child processes such that this is a problem.

Fork works great in academic examples and programs where the parent is small, doesn't touch its data, or the children are short lived. But there are applications where the fork paradigm in and of itself leads to excessive memory consumption.

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 17, 2010 12:45 UTC (Wed) by brinkmd (guest, #45122) [Link] (1 responses)

Amazingly, yet another article in the series that completely fails to recognize the contributions that have been made by Plan 9, GNU Hurd, and microkernels from Mach to L4 to KeyKOS/EROS/Coyotos in the last two decades. These problems have already been analyzed in much detail by these projects, and fixed. Does the author not know this, or does he have a hidden agenda of secretly educating his readers without them being scared by too big a world out of their comfort zone? I can't wait to see the answer, hopefully in one of the later parts of the series.

Ghosts of Unix past, part 2: Conflated designs

Posted Nov 19, 2010 22:23 UTC (Fri) by Zizzle (guest, #67739) [Link]

Well from the article:

"The fact that these design decisions are still with us and worth talking about shows that their weaknesses were not immediately obvious and, additionally, that these designs lasted long enough to become sufficiently entrenched that simply replacing them would cause more harm than good."

Those who care deeply about such things have probably already gone to those other OSes. Not a very popular decision, but good for them.

For the rest of us, apps are more important than the OS, and we prefer to keep our existing apps running. On Linux.

So they are pretty much irrelevant to the bulk of us here reading Linux Weekly News.

Which I think is implied in the "more harm than good" phrase.

The fact that the OSes you mention occupy such small niches, it could be argued, shows that caring about backwards compatibility matters to most users.

I have to disagree about O_CLOEXEC

Posted Nov 26, 2010 12:47 UTC (Fri) by Ross (guest, #4065) [Link] (1 responses)

Well, maybe I don't have to... I just want to. :)

I don't think creating a file descriptor and allowing it to be duplicated on fork() or preserved through exec should be separate operations.

I think changing fork() to close file descriptors by default would abuse its currently clean design. It's so beautiful, like a cell dividing. The only special cases which have to exist are things like PID and PPID. Of course people added locks and asynchronous IO. I don't think complicating it more would be nice.

Similarly exec() takes an existing process and environment and just changes the running program. I don't see file descriptors as part of which program is running but as part of the environment that it runs in. The simplest thing is that the exec'd program sees exactly the same set of file descriptors that there were right before the system call.

So I'd argue that the defaults make sense as they are. File descriptors are real by default. I'd say the problem is O_CLOEXEC. It seems really useful, but it doesn't fit into this model.

As someone else pointed out, a system call to close all file descriptors (or maybe ones from a given range?) would be a more orthogonal way to handle it, and would probably be useful elsewhere. I've seen lots of code that loops from 0 to 255 closing everything just to be sure it wasn't leaking something.

Now if the automatic-trigger part of O_CLOEXEC is an important feature (maybe you don't trust your code to close things before calling exec), there are some solutions entirely in userspace. First, the C library could make special versions of the exec functions available which closed everything first. You could presumably use grep, macros, or other tricks to make sure you only used those versions. Second, the C library could even make provisions to track file descriptors that should be closed on exec without any help from the kernel.

In summary fork() and exec() are two well-designed parts of Unix. Making them uglier to get rid of this flag to open() would not have been an improvement.

I have to disagree about O_CLOEXEC

Posted Nov 26, 2010 22:51 UTC (Fri) by neilbrown (subscriber, #359) [Link]

I don't think anyone is suggesting changes to fork, though of course it has already been noted that fork shows signs of conflation which 'clone' and 'unshare' help to remove.

However 'exec' is very special. Unlike fork and everything else, the calling process has no control over what happens after the exec call succeeds, so it needs to do everything before.

It could close some file descriptors before without racing with other threads by using 'unshare' to have a private file-table, then closing whatever has been marked in libc as 'close on exec'.
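
Roughly this sketch (it assumes a libc-maintained list of descriptors marked close-on-exec, which is hypothetical):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <unistd.h>

    /* Hypothetical libc bookkeeping of descriptors marked close-on-exec. */
    extern int libc_cloexec_fds[];
    extern int libc_cloexec_count;

    void close_marked_then_exec(char *const argv[], char *const envp[])
    {
        /* Private copy of the file-descriptor table: other threads no
         * longer see (or race with) the close() calls below. */
        if (unshare(CLONE_FILES) == 0)
            for (int i = 0; i < libc_cloexec_count; i++)
                close(libc_cloexec_fds[i]);
        execve(argv[0], argv, envp);
    }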

But there are (or at least 'could be') times when you want some file descriptor to still be open if 'exec' fails, but you don't want it to be open after the exec succeeds. For that you really need close-on-exec.

And if it is necessary to have close-on-exec, then it makes most sense for it to default to 'set' as that is commonly what is wanted, and that is easiest to manage in a race-free way.

The main point that I got from your comment is that while it might be clear that something isn't right with this whole design area, it is open for debate which bits are 'right' and which bits are 'wrong'. I would certainly agree with that.

