
Leading items

Welcome to the LWN.net Weekly Edition for November 11, 2021

This edition contains the following feature content:

  • Late-bound argument defaults for Python: a python-ideas discussion of default arguments that would be evaluated when the function is called.
  • 5.16 Merge window, part 1: the first set of changes pulled for the next kernel release.
  • The balance between features and performance in the block layer: when is a new feature worth a cost in the I/O fast path?
  • Intel AMX support in 5.16: how the kernel supports Intel's Advanced Matrix Extensions.
  • Concurrency in Julia: a look at the language's multithreading and distributed-computing facilities.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Late-bound argument defaults for Python

By Jake Edge
November 10, 2021

Python supports default values for arguments to functions, but those defaults are evaluated at function-definition time. A proposal to add defaults that are evaluated when the function is called has been discussed at some length on the python-ideas mailing list. The idea came about, in part, due to yet another resurrection of the proposal for None-aware operators in Python. Late-bound defaults would help with one use case for those operators, but there are other, stronger reasons to consider their addition to the language.

In Python, the defaults for arguments to a function can be specified in the function definition, but, importantly, they are evaluated in the scope where the function is defined. So default arguments cannot refer to other arguments to the function, as those are only available in the scope of the function itself when it gets called. For example:

    def foo(a, b = None, c = len(a)):
        ...

That definition specifies that a has no default, b defaults to None if no argument gets passed for it, and c defaults to the value of len(a). But that expression will not refer to the a in the argument list; it will instead look for an a in the scope where the function is defined. That is probably not what the programmer intended. If no a is found, the function definition will fail with a NameError.
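
The consequence of definition-time evaluation can be shown with a short example (mine, not taken from the discussion); the well-known mutable-default pitfall stems from the same rule:

    a = "hello"

    def foo(b=None, c=len(a)):    # len(a) is evaluated now, using the global a
        return b, c

    print(foo())          # (None, 5): c was fixed when foo() was defined

    def append_to(item, target=[]):   # one list, created at definition time
        target.append(item)
        return target

    print(append_to(1))   # [1]
    print(append_to(2))   # [1, 2], not [2]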

Idea

On October 24, Chris Angelico introduced his proposal for late-bound arguments. He used an example function, derived from the bisect.bisect_right() function in the standard library, to demonstrate the idea. The function's arguments are specified as follows:

def bisect(a, x, lo=0, hi=None):

He notes that there is a disparity between lo and hi: "It's clear what value lo gets if you omit it. It's less clear what hi gets." Early in his example function, hi is actually set to len(a) if it is None. Effectively None is being used as a placeholder (or sentinel value) because Python has no way to directly express the idea that hi should default to the length of a. He proposed new syntax to identify hi as a late-bound argument:

def bisect(a, x, lo=0, hi=:len(a)):

The "=:" would indicate that if no argument is passed for hi, the expression would be evaluated in the context of the call and assigned to hi before any of the function's code is run. It is interesting to note that the documentation for bisect.bisect_right() linked above looks fairly similar to Angelico's idea (just lacking the colon) even though the actual code in the library uses a default value of None. It is obviously useful to know what the default will be without having to dig into the code.

In his post, Angelico said that in cases where None is a legitimate value, there is another way to handle the default, but it also obscures what the default will be:

And the situation only gets uglier if None is a valid argument, and a unique sentinel is needed; this standard idiom makes help() rather unhelpful:
_missing = object()
def spaminate(thing, count=_missing):
    if count is _missing: count = thing.getdefault()
Proposal: Proper syntax and support for late-bound argument defaults.
def spaminate(thing, count=:thing.getdefault()):
    ...
[...]

The purpose of this change is to have the function header define, as fully as possible, the function's arguments. Burying part of that definition inside the function is arbitrary and unnecessary.

The first order of business in these kinds of discussions is the inevitable bikeshedding about how the operator is spelled. Angelico chose a "deliberately subtle" syntax, noting that in many cases it will not matter when the argument is bound. It is visually similar to the walrus operator (":="), but that is not legal in a function definition, so there should be no ambiguity, he said.

Ethan Furman liked the idea but would rather see a different operator (perhaps "?=") used because of the potential confusion with the walrus operator. Guido van Rossum was also in favor of the feature, but had his spelling suggestion as well:

I like that you're trying to fix this wart! I think that using a different syntax may be the only way out. My own bikeshed color to try would be `=>`, assuming we'll introduce `(x) => x+1` as the new lambda syntax, but I can see problems with both as well :-).

New syntax for lambda expressions has also been discussed, with most settling on "=>" as the best choice, in part because "->" is used for type annotations; some kind of "arrow" operator is commonly used in other languages for defining anonymous functions. Several others were similarly in favor of late-bound defaults and many seemed to be happy with Van Rossum's spelling, but Brendan Barnwell was opposed to both; he was concerned that it would "encourage people to cram complex expressions into the function definition". Since it would only truly be useful—readable—for a simpler subset of defaults, it should not be added, he said. Furthermore:

To me, this is definitely not worth adding special syntax for. I seem to be the only person around here who detests "ASCII art" "arrow" operators but, well, I do, and I'd hate to see them used for this. The colon or alternatives like ? or @ are less offensive but still too inscrutable to be used for something that can already be handled in a more explicit way.

But Steven D'Aprano did not think that the addition of late-bound defaults would "cause a large increase in the amount of overly complex default values". Angelico was also skeptical that the feature was some sort of bad-code attractant. "It's like writing a list comprehension; technically you can put any expression into the body of it, but it's normally going to be short enough to not get unwieldy." In truth, any feature can be abused; this one does not look to them to be particularly worse in that regard.

PEP 671

Later that same day, Angelico posted a draft of PEP 671 ("Syntax for late-bound function argument defaults"). In it, he adopted the "=>" syntax, though he noted a half-dozen other possibilities. He also fleshed out the specification of the default expression and some corner cases:

The expression is saved in its source code form for the purpose of inspection, and bytecode to evaluate it is prepended to the function's body.

Notably, the expression is evaluated in the function's run-time scope, NOT the scope in which the function was defined (as are early-bound defaults). This allows the expression to refer to other arguments.

Self-referential expressions will result in UnboundLocalError:

    def spam(eggs=>eggs): # Nope

Multiple late-bound arguments are evaluated from left to right, and can refer to previously-calculated values. Order is defined by the function, regardless of the order in which keyword arguments may be passed.

But one case, which had been raised by Ricky Teachey in the initial thread, was discussed at some length when Jonathan Fine asked about the following function definition:

def puzzle(*, a=>b+1, b=>a+1):
    return a, b

Angelico was inclined to treat that as a syntax error, "since permitting it would open up some hard-to-track-down bugs". Instead it could be some kind of run-time error in the case where neither argument is passed, he said. He is concerned that allowing "forward references" to arguments that have yet to be specified (e.g. b in a=>b+1 above) will be confusing and hard to explain. D'Aprano suggested handling early-bound argument defaults before their late-bound counterparts and laid out a new process for argument handling that was "consistent and understandable". In particular, he saw no reason to make some kinds of late-bound defaults into a special case:

Note that step 4 (evaluating the late-bound defaults) can raise *any* exception at all (it's an arbitrary expression, so it can fail in arbitrary ways). I see no good reason for trying to single out UnboundLocalError for extra protection by turning it into a syntax error.

Angelico noted that it was still somewhat difficult for even experienced Python programmers to keep straight, but, in addition, he had yet to hear of a real use case. Erik Demaine offered two examples, "though they are a bit artificial"; he said that simply evaluating the defaults in left-to-right order (based on the function definition) was reasonably easy to understand. Angelico said that any kind of reordering of the evaluation was not being considered; as he sees it:

The two options on the table are:

1) Allow references to any value that has been provided in any way
2) Allow references only to parameters to the left

Option 2 is a simple SyntaxError on compilation (you won't even get as far as the def statement). Option 1 allows everything all up to the point where you call it, but then might raise UnboundLocalError if you refer to something that wasn't passed.

The permissive option allows mutual references as long as one of the arguments is provided, but will give a peculiar error if you pass neither. I think this is bad API design.
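
The tradeoff is easier to see when the permissive option is emulated in current Python with the sentinel idiom; this sketch is mine, not code from the thread:

    _missing = object()

    def puzzle(*, a=_missing, b=_missing):
        # Late-bound defaults emulated by hand, evaluated left to right
        if a is _missing:
            a = b + 1    # fails if b was omitted too (the PEP would raise UnboundLocalError)
        if b is _missing:
            b = a + 1
        return a, b

    print(puzzle(a=5))    # (5, 6)
    print(puzzle(b=5))    # (6, 5)
    puzzle()              # the "peculiar error" when neither argument is passed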

Van Rossum pointed out that the syntax-error option would break new ground: "Everywhere else in Python, undefined names are runtime errors (NameError or UnboundLocalError)." Angelico sees the error in different terms, though, noting that mismatches in global and local scope are a syntax error; he gave an example:

>>> def spam():
...     ham
...     global ham
...
  File "<stdin>", line 3
SyntaxError: name 'ham' is used prior to global declaration

He also gave a handful of different function definitions that were subtly different using the new feature; he was concerned about the "bizarre inconsistencies" that can arise, because they "are difficult to explain unless you know exactly how everything is implemented internally". He would prefer to see real-world use cases for the feature to decide whether it should be supported at all, but was adamant that the strict left-to-right interpretation was easier to understand:

If this should be permitted, there are two plausible semantic meanings for these kinds of constructs:

1) Arguments are defined left-to-right, each one independently of each other
2) Early-bound arguments and those given values are defined first, then late-bound arguments

The first option is much easier to explain [...]

D'Aprano explained that the examples cited were not particularly hard to understand and fell far short of the "bizarre inconsistencies" bar. There is a clear need to treat the early-bound and late-bound defaults differently:

However there is a real, and necessary, difference in behaviour which I think you missed:
    def func(x=x, y=>x)  # or func(x=x, @y=x)
The x=x parameter uses global x as the default. The y=x parameter uses the local x as the default. We can live with that difference. We *need* that difference in behaviour, otherwise these examples won't work:
    def method(self, x=>self.attr)  # @x=self.attr

    def bisect(a, x, lo=0, hi=>len(a))  # @hi=len(a)
Without that difference in behaviour, probably fifty or eighty percent of the use-cases are lost. (And the ones that remain are mostly trivial ones of the form arg=[].) So we need this genuine inconsistency.

As can be seen, D'Aprano prefers a different color for the bikeshed: using "@" to prepend late-bound default arguments. He also said that Angelico had perfectly explained the "harder to explain" option in a single sentence; both are equally easy to explain, D'Aprano said. Beyond that, it does not make sense to "prohibit something as a syntax error because it *might* fail at runtime". In a followup message, he spelled that out further:

We don't do this:
    y = x+1  # Syntax error, because x might be undefined
and we shouldn't make this a syntax error
    def func(@spam=eggs+1, @eggs=spam-1):
either just because `func()` with no arguments raises. So long as you pass at least one argument, it works fine, and that may be perfectly suitable for some uses.

Winding down

While many of the participants in the threads seem reasonably happy—or at least neutral—on the idea, there is some difference of opinion on the details as noted above. But several thread participants are looking for a more general "deferred evaluation" feature, and are concerned that late-bound argument defaults will preclude the possibility of adding such a feature down the road. Beyond that, Eric V. Smith wondered about how late-bound defaults would mesh with Python's function-introspection features. Those parts of the discussion got a little further afield from Angelico's proposal, so they merit further coverage down the road.

At first blush, Angelico's idea to fix this "wart" in Python seems fairly straightforward, but the discussion has shown that there are multiple facets to consider. It is not quite as simple as "let's add a way to evaluate default arguments when the function is called"—likely how it was seen at the outset. That is often the case when looking at new features for an established language like Python; there is a huge body of code that needs to stay working, but there are also, sometimes conflicting, aspirations for features that could be added. It is a tricky balancing act.

As with many python-ideas conversations, there were multiple interesting sub-threads, touching on language design, how to teach Python (and this feature), how other languages handle similar features (including some discussion of ALGOL thunks), the overall complexity of Python as it accretes more and more features, and, of course, additional bikeshedding over the spelling. Meanwhile, Angelico has been working on a proof-of-concept implementation, so PEP 671 (et al.) seems likely to be under discussion for some time to come.

Comments (22 posted)

5.16 Merge window, part 1

By Jonathan Corbet
November 4, 2021
As of this writing, Linus Torvalds has pulled exactly 6,800 non-merge changesets into the mainline repository for the 5.16 kernel release. That is probably a little over half of what will arrive during this merge window, so this is a good time to catch up on what has been pulled so far. There are many significant changes and some large-scale restructuring of internal kernel code, but relatively few ground-breaking new features.

Changes pulled in the first half of the 5.16 merge window include:

Architecture-specific

  • The Arm 8.6 timer extensions are now supported.
  • MIPS systems have a new just-in-time compiler for the BPF virtual machine.
  • KFENCE is now supported on PA-RISC machines.
  • Hitting the TOC button ("transfer of control") on a PA-RISC machine will now cause the kernel to enter the debugger.
  • RISC-V now has support for virtualization with KVM, an improvement that took rather longer than the developers would have liked.
  • The kernel has gained support for Intel's Advanced Matrix Extension (AMX) feature. This required extensive discussion and a major reworking of the existing floating-point support code.

Core kernel

  • The futex_waitv() system call (described in this article) has been merged.
  • The CPU scheduler has gained an understanding of "clusters", a hardware arrangement where multiple cores share the same L2 cache. The cluster-aware scheduler will take pains to distribute tasks across all clusters in the system to balance the load on caches across the machine.
  • The tracefs interface to the tracing mechanism now supports the setting of owner and group permissions; this feature can be used to give a specific group access to tracing functionality. The "other" permission bits still cannot be set to allow access to the world as a whole, though.
  • As usual there is a pile of BPF changes. The new bpf_trace_vprintk() helper can output information without the three-argument limit of bpf_trace_printk(). Support for calling kernel functions in loadable modules from BPF has been added. There is a new bloom-filter map type. Unprivileged BPF is now disabled by default. There is also a new document describing the licensing of various BPF components and the requirements for users.

Filesystems and block I/O

  • The block layer continues to see a series of performance optimizations resulting in significantly better per-core operation rates.
  • There is now support for multi-actuator (rotating) disks that can access multiple sectors at the same time. This commit documents the sysfs interface for these drives.
  • There is a new ioctl() command (CDROM_TIMED_MEDIA_CHANGE) for detecting media-change events in CDROM drives. Evidently people are still using CDROM drives...
  • The EROFS filesystem has gained simple multiple-device support.

Hardware support

  • Media: OmniVision OV13B10 sensors, Hynix Hi-846 sensors, and R-Car image signal processors.
  • Miscellaneous: Microchip external interrupt controllers, Apple mailbox controllers, Ingenic JZ47xx SoCs SPI controllers, Cadence XSPI controllers, Maxim MAX6620 fan controllers, Intel Keem Bay OCS elliptic-curve encryption accelerators, ACPI WMAA backlight interfaces, Intel ISHTP eclite embedded controllers, Barco P50 GPIO, and Samsung S6D27A1 DPI panels.
  • Networking: Realtek RTL8365MB-VC Ethernet switches, Realtek 802.11ax wireless chips, Asix AX88796C-SPI Ethernet adapters, and Mellanox MSN4800-XX line cards.

Networking

  • There is a new, user-settable socket option called SO_RESERVE_MEM. It has the effect of reserving some kernel memory permanently for the relevant socket; that, in turn, should speed networking operations, especially when the system is under memory pressure. Note that the feature is only available when memory control groups are in use, and the reserved memory is charged against the group's quota.
  • The In-situ Operations, Administration, and Maintenance (IOAM) support has been enhanced to support the encapsulation of IOAM data into in-transit packets. This commit has a little bit of further information.
  • The ethtool netlink API has gained the ability to control transceiver modules; see this commit and this commit for more information.
  • The netfilter subsystem can now classify packets at egress time; see this commit for some more information.
  • Support for Automatic Multicast Tunneling (RFC 7450) has been added.
  • There are two new sysctl knobs controlling what happens to the ARP cache when a network device loses connectivity. arp_evict_nocarrier says whether ARP cache entries should be deleted when an interface loses carrier, while ndisc_evict_nocarrier does the same thing for the neighbor discovery table. Both exist to allow cache entries to be retained when a WiFi interface moves between access points on the same network. This commit and this one contain more information.

Security-related

  • Most of the work for strict memcpy() bounds checking has been merged. The actual patches enabling bounds checking across the kernel have not (yet) been merged, though, pending fixes for some remaining problems.
  • The io_uring subsystem has gained audit support.
  • The SELinux and Smack security modules can now apply security policies to io_uring operations.
  • Auditing will now record the contents of the open_how structure passed to openat2().
  • The integrity measurement architecture (IMA) can now apply rules based on the group ID of files and the users accessing them.
  • The default Spectre-mitigation behavior for seccomp() threads has changed, resulting in fewer mitigations being applied and a corresponding performance increase. One can read this commit for the reasoning behind this change, but the short version is that the extra mitigations weren't really buying any more security.

Internal kernel changes

  • The folio patch set, which has been the topic of much discussion over the past several months, was the very first thing to be merged for 5.16. This work adds a new "folio" type to represent pages that are known not to be tail pages, then reworks the internal memory-management APIs to use this type. The result is better type clarity and even a small performance increase — and a lot more work to do in the future.
  • There is a new internal function, cc_platform_has(), that provides a generic interface for kernel code to query the presence of confidential-computing features. Its first use is to replace mem_encrypt_active() for checking whether memory encryption is in use.

Assuming that the usual two-week schedule holds, the 5.16 merge window can be expected to close on November 14. Once that happens, we'll be back with a summary of the remaining changes pulled for the next kernel release.

Comments (none posted)

The balance between features and performance in the block layer

By Jonathan Corbet
November 5, 2021
Back in September, LWN reported on a series of block-layer optimizations that enabled a suitably equipped system to sustain 3.5 million I/O operations per second (IOPS). That optimization work has continued since then, and those 3.5 million IOPS would be a deeply disappointing result now. A recent disagreement over the addition of a new feature has highlighted the potential cost of a heavily optimized block layer, though; when is a feature deemed important enough to outweigh the drive for maximum performance?

Block subsystem maintainer Jens Axboe has continued working to make block I/O operations go faster. A recent round of patches tweaked various fast paths, changed the plugging mechanism to use a singly linked list, and made various other little changes. Each is a small optimization, but the work adds up over time; the claimed level of performance is now 8.2 million IOPS — well over September's rate, which looked good at the time. This work has since found its way into the mainline as part of the block pull request for 5.16.

So far, so good; few people will argue with massive performance improvements. But they might argue with changes that threaten to interfere, even in a tiny way, with those improvements.

Consider, for example, this patch set from Jane Chu. It adds a new flag (RWF_RECOVERY_DATA) to the preadv2() and pwritev2() system calls that can be used by applications trying to recover from nonvolatile-memory "poisoning". Implementations of nonvolatile memory have different ways of detecting and coping with data corruption; Intel memory, it seems, will poison the affected range, meaning that it cannot be accessed without generating an error (which then turns into a SIGBUS signal). An application can respond to that error by reading or writing the poisoned range with the new flag; a read will replace the poisoned data with zeroes (allowing the application to obtain whatever data is still readable), while a write will overwrite that data and attempt to clear the poisoned status. Either way, the application can attempt to recover from the problem and continue running.

Christoph Hellwig objected to this new flag on the grounds that it would slow down the I/O fast path:

Well, my point is doing recovery from bit errors is by definition not the fast path. Which is why I'd rather keep it away from the pmem read/write fast path, which also happens to be the (much more important) non-pmem read/write path.

Pavel Begunkov also complained, saying that each flag adds a bit of overhead that piles up over time: "default config kernels are already sluggish when it comes to really fast devices and it's not getting better". That caused Darrick Wong to ask: "So we can't have data recovery because moving fast [is] the only goal?". Begunkov denied saying that, but wasn't really clear on what he was saying.

The cost of this flag is tiny — perhaps not even measurable — in cases where it is not used. But even that overhead can look unacceptable to developers who are working to get the sustained IOPS numbers as high as possible. One flag leads to another and another, and someday in the future the performance cost becomes significant — or that is the argument, anyway. To avoid this kind of problem, the argument continues, niche features like nonvolatile memory poison recovery should be restricted to parts of the kernel that are not seen as being the fast path. In this case, adding the needed functionality to fallocate() has been suggested and tried, but it was eventually decided that fallocate() is not the right place for hardware-management features like poison-clearing.

Thus the current implementation, which has run into fast-path concerns. That, in turn, has provoked an extended outburst from Dave Chinner, who thinks that much of the current optimization work is misplaced:

The current approach of hyper-optimising the IO path for maximum per-core IOPS at the expense of everything else has been proven in the past to be exactly the wrong approach to be taking for IO path development. Yes, we need to be concerned about performance and work to improve it, but we should not be doing that at the cost of everything else that the IO stack needs to be able to do.

Optimization, he continued, should come after the needed functionality is present; "using 'fast path optimisation' as a blunt force implement to prevent new features from being implemented is just ... obnoxious".

The conversation stalled out shortly after that. This kind of disagreement over features and performance has been heard in the kernel community many times over the decades, though. Going faster is a popular goal, and the developers who are focused on performance have been known to get tired of working through performance regressions caused by feature additions that accumulate over time. But functionality, too, is important, and developers tire of having needed features blocked on seemingly insignificant performance concerns.

Often, performance-related pushback leads to revised solutions that do not have the same performance cost; along those lines, it's worth noting that Hellwig has some ideas for improved ways of handling I/O to nonvolatile memory. Other times, it just leads to the delay or outright blocking of needed functionality. What will happen in this case is not clear at this point, but debates like this will almost certainly be with us in the coming decades as well. It is, in the end, how the right balance between performance and functionality is (hopefully) found.

Comments (9 posted)

Intel AMX support in 5.16

By Jonathan Corbet
November 8, 2021
The x86 instruction set is large, but that doesn't mean it can't get bigger yet. Upcoming Intel processors will feature a new set of instructions under the name of "Advanced Matrix Extensions" (AMX) that can be used to operate on matrix data. After a somewhat bumpy development process, support for AMX has found its way into the upcoming 5.16 kernel. Using it will, naturally, require some changes by application developers.

AMX (which is described in this document) is meant to be a general architecture for the acceleration of matrix operations on x86 processors. In its initial form, it implements a set of up to eight "tiles", which are arrays of 16 64-byte rows. Programmers can store matrices of any dimensions that fit into these tiles; a matrix of 16x16 32-bit floating-point values would work, but other geometries are supported too. The only operation currently supported will multiply the matrices stored in two tiles, then add the result to a third tile. By chaining these operations, multiplication of matrices of any size can be implemented. Evidently other operations are meant to come in the future.

While AMX may seem like a feature aimed at numerical analysis, the real target use case would appear to be machine-learning applications. That would explain why 16-bit floating-point arithmetic is supported, but 64-bit is not.

The design of AMX gives the kernel control over whether these features can be used by any given process. There are a couple of reasons for this, one being that AMX instructions, as one might imagine, use a lot of processor resources. A process doing heavy AMX work on a shared computer may adversely affect other processes. But AMX also cannot be supported properly unless both the kernel and the user-space process are ready for it.

Development process

Support for AMX was first posted by Chang Bae in October 2020, but got relatively few review comments. By the time version 4 came out in February, more developers were paying attention, and they were not entirely pleased with how this feature was being integrated into the kernel's existing floating-point unit (FPU) code. Various versions followed, with the frustration level seeming to increase; at the end of September, Len Brown posted minutes from a conversation that, seemingly, showed a path forward.

Unfortunately, version 11, posted the very next day, seemed to ignore many of the decisions that had been made. This posting drew a sharp rebuke from Thomas Gleixner, who felt that the feature was being shoved into the kernel without listening to the complaints that were being raised. Things weren't looking great for AMX, but work was happening behind the scenes; in mid-October, Gleixner posted a massive reworking of the FPU code meant to ease the task of supporting AMX. A new AMX patch set followed shortly thereafter, and that, more or less, is what ended up in 5.16.

Gleixner's pull request for this code acknowledged its relatively unseasoned nature:

Note, this is relatively new code despite the fact that AMX support is in the works for more than a year now.

The big refactoring of the FPU code, which allowed to do a proper integration has been started exactly 3 weeks ago. Refactoring of the existing FPU code and of the original AMX patches took a week and has been subject to extensive review and testing. The only fallout which has not been caught in review and testing right away was restricted to AMX enabled systems, which is completely irrelevant for anyone outside Intel and their early access program. There might be dragons lurking as usual, but so far the fine grained refactoring has held up and eventual yet undetected fallout is bisectable and should be easily addressable before the 5.16 release. Famous last words...

The FPU code is relatively tricky, low-level stuff, so it would indeed be unsurprising to find a dragon or two lurking in the new work.

Using AMX

As noted above, the kernel is able to control which processes are able to use the AMX instructions. The first step for a user-space process would be to use a new arch_prctl() command (ARCH_GET_XCOMP_SUPP) to get a list of supported features; if the appropriate bit is set in the result, AMX is available. Then, another arch_prctl() command (ARCH_REQ_XCOMP_PERM) can be used to request permission to use AMX. Some checks are made (one to be described shortly), and there is an opportunity for security modules to express an opinion as well. Normally, though, the request will be granted. Permissions apply to all threads in a process and are carried over a fork; calling execve() will reset them, though.
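
In rough terms, the user-space side of that handshake might look like the following sketch (my illustration, not code from the kernel tree or its documentation; the constant values are taken from the 5.16 uapi headers, and real code would want more careful error handling):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #define ARCH_GET_XCOMP_SUPP  0x1021
    #define ARCH_REQ_XCOMP_PERM  0x1023
    #define XFEATURE_XTILEDATA   18

    int main(void)
    {
        unsigned long features = 0;

        /* Which dynamically enabled xfeatures does the kernel support? */
        if (syscall(SYS_arch_prctl, ARCH_GET_XCOMP_SUPP, &features) ||
            !(features & (1UL << XFEATURE_XTILEDATA))) {
            fprintf(stderr, "AMX is not available\n");
            return 1;
        }

        /* Ask for permission to use AMX tile data; the kernel checks
           signal-stack sizes at this point. */
        if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)) {
            perror("ARCH_REQ_XCOMP_PERM");
            return 1;
        }

        /* This process (and its threads) may now execute AMX instructions. */
        return 0;
    }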

One challenge presented by AMX is that processors can create a great deal of internal state while the AMX instructions are running. If the CPU is interrupted in the middle of an operation, that state must be saved somewhere or a lot of work could be lost. So, if a process is using AMX, the kernel must be prepared to save up to about 10KB of data in its interrupt handlers before doing much of anything else. This saving is done using the XSAVE instruction.

The kernel allocates memory for each process that can be used for this purpose. Allocating 10KB for every process in the system would waste a lot of memory, though; most processes will never use AMX instructions. Happily, the processor can be configured to trap into the kernel the first time any process executes an AMX instruction; the kernel can then check whether permission to use those instructions has been granted and, if so, allocate an appropriately sized buffer to hold the FPU state and allow the operation to continue.

One potential problem has to do with the sigaltstack() system call, which allows a thread to establish a new stack for signal handling. That stack, too, must be large enough to hold the FPU state if the process involved is using AMX. For years, developers have been told to use MINSIGSTKSZ as the minimum size for this stack; that size, which is 2KB, is nowhere near large enough for AMX-using processes. Indeed, it is not even large enough to use the AVX-512 extensions, a fact that has caused some corrupted-stack problems in the past.

To avoid this problem for AMX, the kernel will check to ensure that all signal stacks are sufficiently large. This check is done at each call to sigaltstack(), but a check of existing stacks is also done when a process requests permission to use AMX in the first place. Processes not using AMX will not need the larger stack and, thus, will not be broken by these checks. Processes that do want to use AMX will not be allowed to unless all of their signal stacks are big enough.

Once the infrastructure to perform these checks was put in place, the kernel also gained the ability to ensure that processes using AVX-512 have adequately sized signal stacks. Enforcing that condition, though, has the potential to break applications that seemingly work now, perhaps because their signal handlers are never actually called. To avoid this problem, there is a kernel configuration option (STRICT_SIGALTSTACK_SIZE) and a command-line option (strict_sas_size=), either of which can be used to control whether the strict checks are carried out when AVX-512 is in use.

Assuming that all of the pieces hold together, this is the form that AMX support will take in 5.16. Those wanting more information can look at the commits containing AMX test cases and some documentation on the arch_prctl() commands. Meanwhile, keep an eye out for dragons for the next nine weeks or so.

Comments (13 posted)

Concurrency in Julia

November 9, 2021

This article was contributed by Lee Phillips

The Julia programming language has its roots in high-performance scientific computing, so it is no surprise that it has facilities for concurrent processing. Those features are not well-known outside of the Julia community, though, so it is interesting to see the different types of parallel and concurrent computation that the language supports. In addition, the upcoming release of Julia version 1.7 brings an improvement to the language's concurrent-computation palette, in the form of "task migration".

Multithreading

Julia supports a variety of styles of concurrent computation. A multithreaded computation is a type of concurrency that involves simultaneous work on different processing units. Multithreading, and the related term parallel computation, usually imply that the various threads have access to the same main memory; therefore it involves multiple cores or processing units on a single computer.

When Julia is started without the -t flag it will use only one thread. To allow it to use all of the available hardware threads, it accepts the -t auto argument. This allows access to all of the logical threads on the machine. This is often not optimal, and a better choice can be -t n, where n is the number of physical cores. For example, the popular Intel Core processors provide two logical cores for each physical core, doubling up using a technique called hyperthreading, which can yield anything from a modest speedup to an actual slowdown, depending on the type of calculation.
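
The thread count actually in use can be checked from the REPL; for example, in a session started with two threads:

    $ julia -t 2
    julia> Threads.nthreads()
    2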

To demonstrate multithreading in Julia, we will use the function below that tests whether a number is prime. As with all of the examples in the article, isprime() leaves many potential optimizations on the table in the interests of simplicity.

    function isprime(x::Int)
        if x < 2
            return false
        end
        u = isqrt(x) # integer square root
        for n in 2:u
            if x % n == 0
                return false
            end
        end
        return true
    end

    function isprime(x::Any)
        return "Integers only, please"
    end

The function should look only at integers (Int is an alias for Int64), so we've taken advantage of multiple dispatch to create two methods: one that tests for primality and one that complains if we give it anything besides an integer. The most specific method is always dispatched, so this lets us handle incorrect data types without writing explicit tests in the function.
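
With both methods defined, calling the function in the REPL shows the dispatch at work; for example:

    julia> isprime(13)
    true

    julia> isprime(15)
    false

    julia> isprime(3.5)
    "Integers only, please"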

A useful package for multithreading that performed well in my tests is Folds. This is not in the standard library so you must install it using Julia's package manager. The Folds package provides a small collection of high-level macros and functions for concurrency. The name derives from the use of "fold" to mean a reduction over an array. Folds provides multithreaded versions of map(), sum(), maximum(), minimum(), reduce(), collect(), and a handful of other automatically parallelized operations over collections.

Here we use map() to check a range of numbers for primality. If f() is a function of one variable and A is an array, the expression map(f, A) applies f() to each element of A, returning an array with the same shape as A. Replacing a map() operation with a multithreaded one is as easy as this:

    julia> @btime map(isprime, 1000:9999999);
      8.631 s (2 allocations: 9.54 MiB)

    julia> @btime Folds.map(isprime, 1000:9999999);
      5.406 s (45 allocations: 29.48 MiB)

In Julia, the syntax a:n:b defines an iterator called a "range" that steps from a to b by n (which defaults to 1). For the map() call, those ranges get turned into a vector.

Folds.map() divides the array into equal segments, one for each available thread, and performs the map() operation in parallel over each segment. As with all the experiments in this article, the timings were performed using Julia version 1.7rc1. For this test I started the REPL with julia -t2.

The @btime macro, from the BenchmarkTools package, runs the job several times and reports an average run time. The timings indicate a speedup not all that far from the ideal factor of two, on my two-core machine, at the cost of higher memory consumption. A CPU-usage meter confirms that only one core was working during the first calculation, while two were busy during the second.

Another form of multithreading is use of graphics processing units (GPUs). Originally designed to accelerate the inherently parallel graphics calculations in 3D games and other graphics-heavy applications, GPUs, with their hundreds or thousands of floating-point processors, are now used as array coprocessors for a wide variety of applications. The best candidates for GPU computing are inherently parallel calculations with a large ratio of arithmetic to memory operations. There is an organization called JuliaGPU that serves as an umbrella for Julia packages that implement or rely on this style of parallel processing.

Map operations are inherently parallel; they involve a function acting on each element of a collection in isolation, neither reading from nor writing to other elements. Because of this, conventional maps can be replaced with multithreaded versions without having to change anything else in the program. The complications that come with simultaneous access to the same data don't arise.

Real-life programs will contain parts that can be easily parallelized, others where greater care must be taken, and parts that can not be parallelized at all. Consider a simplistic simulation of a galaxy, with bodies moving under the influence of their mutual gravitational forces. The force on each body must be calculated by looking at the positions of all the others (out to some distance, depending on the accuracy sought); although this part of the calculation can be parallelized, the pattern is more complicated than the map() patterns considered above. But after all the forces are stored, each body's change in velocity can be calculated independently of the others, in an easily parallelizable procedure with the structure of a simple map().

Julia has all the facilities for the more complex forms of concurrent computation. A for loop can be made multithreaded using the @threads macro this way:

    Threads.@threads for i = 1:N
	<do something>        
    end

In this case, if, for example, we have two threads, one thread will get the loop from 1 to N/2 and the other from N/2 to N. If more than one thread needs access to the same data, that data must be protected by locking, which is provided by the lock() function. Julia prevents some race conditions through the use of an Atomic data type. Programs are not allowed to change data of this type except through a set of atomic operations such as Threads.atomic_add!() and Threads.atomic_sub!() instead of normal addition and subtraction.
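
As a small sketch (not one of the article's benchmarks), the two mechanisms can be combined to count primes with an atomic accumulator:

    function count_primes(n)
        count = Threads.Atomic{Int}(0)
        Threads.@threads for i in 1:n
            if isprime(i)
                Threads.atomic_add!(count, 1)   # safe concurrent update
            end
        end
        return count[]   # read the final value
    end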

Distributed processing

A different style of concurrency involves the cooperatively multitasking asynchronous coroutines that Julia calls Tasks. These coroutines can share a single thread, be distributed among threads, or be assigned to different nodes in a cluster or to geographically distributed machines communicating over the internet. All of these forms of distributed computing share the characteristic that processors do not have direct access to the same memory, so the data must be transported to the places where computation is to happen.

Tasks are well suited to the situation where a program needs to start a process that may take a while, but the program can continue because it doesn't need the result immediately. Examples are downloading data from the internet, copying a file, or performing a long-running calculation whose result is not needed until later.

When Julia is invoked with the -p n flag, n worker processes are initialized, in addition to the main executive process. The two multiprocessing flags can be combined, as in julia -p2 -t2. This would start up with two worker processes, each of which can compute with two threads. For the examples in this section, I used julia -p2.

The "manual" way to launch a task is by using the macro call @spawnat :any expr. This assigns a task to compute the expression expr on a worker process chosen by the system. The worker processes are identified by integers starting with 1; using an integer in place of :any spawns the task in that specific process.

[Leibniz sum equation]

More common is the use of a distributed version of map() called pmap(), which gets imported by the -p flag. pmap() performs the same calculation, but first breaks the array into pieces, sending an equally-sized piece to each worker process where the maps are performed in parallel, then gathers and assembles the results. For an example calculation we'll estimate π using the notoriously slowly converging Leibniz sum (seen at right).

The function leibniz(first, last, nterms) returns an array with the results for every nterms terms between the first two arguments:

    @everywhere function leibniz(first, last, nterms)
        r = [] 
        for i in first:nterms:last # Step from first to last by nterms
            i1 = i
            i2 = i+nterms-1
            # Append the next partial sum to r:
            push!(r, (i1, i2, 4*sum((-1)^(n+1) * 1/(2n-1) for n in i1:i2)))
        end  
        return r  
    end

The @everywhere macro, also imported by the -p flag, broadcasts function and variable definitions to all workers, which is necessary because the worker processes do not share memory with the main process. The leibniz() function is a direct translation of the Leibniz sum, using the sum() function to add up the terms in a generator expression. After every group of nterms terms, it pushes the cumulative partial result into the r array, which it returns when done.

In order to test the performance of pmap(), we can use it to simultaneously calculate the first 2 × 10⁸ terms:

    julia> @btime r = pmap(n -> leibniz(n, n+10^8-1, 1000), [1, 10^8+1]);
      7.127 s (1600234 allocations: 42.21 MiB)

The version using the normal map, which uses a single process, takes almost twice as long (but uses 23% of the memory):

    julia> @btime r = map(n -> leibniz(n, n+10^8-1, 1000), [1, 10^8+1]);
      13.630 s (200024 allocations: 9.76 MiB)

The figure below shows the results of summing the first 2 × 10⁸ terms of the Leibniz series.

[Leibniz sum convergence]

The distributed version consumed significantly more memory than the normal map(), something that I observed in most cases. pmap() seems best suited to coarse-grained multiprocessing; it's an effective way to speed up an expensive operation over a small collection of inputs.

Julia's machinery for distributed processing also works across machines. When given a file of internet addresses, Julia will spawn jobs over the internet using SSH and public-key authentication, allowing distributed code developed locally to run with little modification on a network of suitably equipped computers.

Tasks and task migration

Another type of concurrency that Julia supports can be described as asynchronous multithreading. It computes with Julia tasks, but assigned to multiple threads within a single process. Therefore all tasks can access the same memory and cooperatively multitask. For the examples in this section I started Julia with julia -t2, so I had one thread for each CPU core.

We can start a task, and automatically schedule it on an available thread, with a macro: a = Threads.@spawn expr. This returns a Task object; we can wait for the Task and read the return value of expr with fetch(a).
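
For example, spawning a small summation task and then collecting its result:

    julia> t = Threads.@spawn sum(1:10^8);   # schedule the task on an available thread

    julia> fetch(t)                          # wait for it and retrieve its result
    5000000050000000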

For calculations involving a set of tasks with roughly equal run times, we should expect a close-to-linear speedup with respect to the number of CPU threads. In previous versions of Julia, once a task was scheduled on a thread, it was stuck there until it finished. This could create scheduling problems in cases where some tasks take longer to run than others; the threads hosting the lighter tasks may become idle while the heavy tasks continue to compete for time on their original threads. In such cases involving tasks with very unequal run times, we might no longer observe linear speedups; there are resources going to waste.

In version 1.7 of Julia, tasks are allowed to migrate among threads. Tasks can be rescheduled to available threads rather than remaining stuck where they were originally assigned. This feature is still largely undocumented, but I verified that tasks launched with Threads.@spawn can migrate upon calling yield(), which is the function that suspends a task and allows another scheduled task to run.

In order to directly observe the effect of thread migration, I needed to be able to turn it on and off. It's possible to compile a version of Julia without the feature enabled, but there is an easier way. The ThreadPools package provides a set of useful macros for concurrency. I used its @tspawnat macro in my experiments. Tasks spawned with this macro can migrate, just as tasks spawned using Threads.@spawn, but using @tspawnat allowed me to easily make a modified version that spawns tasks that are not allowed to migrate.

Monte Carlo physics simulations or probabilistic modeling are ideal types of calculations for asynchronous multithreading. This approach involves running a set of simulations that are identical aside from the series of pseudo-random numbers that control the details of each task's calculation. All the tasks are independent of each other, and when they are complete, various averages and their distributions can be examined.

For my experiments I wrote a simple program to look at the statistical properties of Julia's pseudo-random number generation function rand(), which returns a uniformly distributed random number between 0 and 1 when called with no arguments. The function below, when passed a number n, generates a random number n times, reporting the mean for each group of n/10 numbers:

    function darts(n)
        if n % 10 != 0 || n <= 0
            return nothing
        end
        a = Threads.threadid()
        means = zeros(10)
        for cyc in 1:10
            yield()
            s = 0
            for i in 1:Int(n/10)
                s += rand()
            end
            means[cyc] = s/(n/10)
        end
        return (a, Threads.threadid(), time() - t0, n, means)
    end

The program only works with an n that is a multiple of 10. It has a few additional lines for keeping track of which thread it ran on and to record the completion time, which it calculates as an elapsed time from a global t0. Before each one of the 10 cycles, it calls yield(), which allows the scheduler to switch to another task on the thread, and, with thread migration, to move the task to another thread.

As an example, if we run darts(1000), we'll get 10 mean values back, each one the result of picking 100 random numbers. According to the central limit theorem these mean values should have a normal distribution with a mean that is the same as that of the underlying (uniform) distribution.

In order to get a sample and plot its distribution, I launched 20 tasks, each running the darts() program with n = 10⁷. This ran 200 random trials, each one picking one million random numbers, then calculating and recording their mean. We also need a highly accurate estimate of the distribution's mean, so I launched one additional task with n = 10⁹ to calculate this. We know that the theoretical mean should be 0.5, so this will check for bias in rand(). With a trivial modification the procedure can be used with other underlying distributions whose means we may not know analytically.

I launched the tasks with the following:

    begin
        t0 = time()
        tsm = [@tspawnat rand(1:2) darts(n) for n in [repeat([1e7], 20); 1e9]]
    end

All of the tasks have access to t0, so they can all report the time at which they finished. The call rand(1:2) assigns each task initially to one thread or the other with equal probability. I repeated this experiment five times, and ran another experiment five times, identical except for the use of my modified @tspawnat macro that disables thread migration.

To look at the distribution from any one of the experiments we can collect the sample means into one array and plot it with the histogram() function that comes with the Plots package; an overview of the plotting system explains how to use histogram() and the other plotting functions used in this article. The result is shown in the next figure, with a bar superimposed showing the mean from the task using 109 numbers.

[Random number distribution]

The distribution is approximately normal with the correct mean.

The experiment consists of 20 short tasks and one task that takes 100 times as long. I was optimistic that this would exhibit the effects of thread migration, as the scheduler would have the opportunity to move tasks off of the thread containing the long-running calculation. This is the type of situation that thread migration was intended to improve. Calculations involving a group of tasks with similar resource requirements would not be expected to benefit from migration.

The average of the 21 times for completion, over five experiments, with thread migration enabled was 0.71 seconds, and 1.5 seconds with migration disabled. A detailed look at the timings shows how thread migration allows the job to complete twice as quickly. The next figure shows the return times for all 10 experiments, with a line drawn to guide the eye through the times for the fastest one.

[Task migration plot]

A separate timing showed that single runs take 0.024 seconds for darts(1e7) and 2.4 seconds for darts(1e9), the ratio of run times that we would expect. Trial number 21 is the task calling darts(1e9), which takes about 2.5 seconds when it is run in parallel. The plot makes it clear that, without migration, about half the tasks get stuck sharing the thread with the single heavy task. The other half return quickly, after which an available thread goes to waste. In contrast, thread migration moves tasks onto the free thread, leading to better load balancing and a quicker overall completion. Looking at the values returned by the Threads.threadid() calls confirms that about half the tasks migrate.

This experiment shows the new thread migration feature is working as intended. It can speed up asynchronous multithreaded calculations in cases where tasks have significantly unequal run times, or where some tasks block waiting for I/O or other events.

In calculations like the ones in this and the previous section, the tasks proceeded completely independently of each other; no coordination was required. If a program has locations where it must wait for all the asynchronous tasks to complete before moving on to the next stage, it can use the @sync macro. This will wait until all tasks spawned in the scope where the macro appears have finished.
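
A minimal sketch of that pattern:

    @sync begin
        t1 = Threads.@spawn sum(rand(10^7))   # two independent tasks
        t2 = Threads.@spawn sum(rand(10^7))
    end
    # both tasks have completed here; fetch(t1) and fetch(t2) return immediately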

Conclusion

Concurrency is inherently difficult. Concurrent programs are harder to debug and to reason about. Julia doesn't erase this difficulty; the programmer still needs to beware of race conditions, consistency, and correctness. But Julia's macros and functions for tasks and data parallelism make simple things easy in many cases, and more complex problems approachable.

This is also an area where development is active, from packages such as Folds to implementation details like the operation of the scheduler, which has most recently led to the task migration feature. We can expect Julia's concurrency situation to continue to improve. We can also expect some of the best parts of the concurrency ecosystem to remain outside of the standard library and the official documentation; the programmer who wants to take the best advantage of these offerings will need to make an effort to keep up with the changing landscape.

Comments (14 posted)

Page editor: Jake Edge
Next page: Brief items>>


Copyright © 2021, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds