user-defined allocators #538
Comments
I've played around with the concept a bit in Jurily/rust-allocator, but many of my use cases would be suboptimal if allocation failure simply panics. While it might make sense to kill everything when we run out of global heap, it would be unnecessarily fragile for size-limited pools (sandbox) or allocators that don't implement some operations (an arena cannot reallocate). What exactly is the scope of this API?
All collection and smart pointer types should absolutely support the custom allocator API, as far as I'm aware. This results in some... nasty... details, as you're finding.
I don't think it's reasonable to expect allocation errors to bubble up all the way through to users of collections and smart pointers, since in the vast majority of cases the only reasonable thing to do is abort or panic, and we'll just end up with a million meaningless and annoying error paths to propagate.
There's often a sane way for the application to handle it, and in some niches it must be handled. Rust isn't entirely ruling out those niches by doing this, because the standard library collections can be bypassed or reimplemented, but that shouldn't be necessary.
@thestinger The allocator API should definitely support handling OOM - I'm just saying it's unreasonable for most collections and smart pointers to propagate those errors in any way other than panicking or aborting. You could still write custom collections or smart pointers which exposed the error.
@reem You can do anything under OOM as long as you don't allocate more memory. For example, you can free memory or wait for some other thread to do so. I find it a perfectly reasonable expectation of a modern systems language to not need to fork the standard library for the cases where that's important. It's not like we don't have enough precedents on how to solve this. We even have default type params now. The main problem is that bubbling up errors from multiple sources is an unsolved ergonomic problem in Rust.
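As an illustration of that last point, here is a minimal sketch of a fallible allocation entry point and a caller that reacts to OOM by releasing memory and retrying rather than aborting. The `RawAlloc` trait, `AllocError`, and `alloc_with_retry` are hypothetical names for this sketch, not a proposed standard API.

```rust
use std::ptr::NonNull;

#[derive(Debug)]
pub struct AllocError;

pub trait RawAlloc {
    /// Try to allocate `size` bytes at the given alignment; Err on exhaustion.
    fn allocate(&mut self, size: usize, align: usize) -> Result<NonNull<u8>, AllocError>;
    /// Return a previously allocated block to this allocator.
    unsafe fn deallocate(&mut self, ptr: NonNull<u8>, size: usize, align: usize);
}

/// On failure, release some cached buffers we can live without, then retry
/// once instead of aborting the whole process.
fn alloc_with_retry<A: RawAlloc>(
    alloc: &mut A,
    cache: &mut Vec<Box<[u8]>>,
    size: usize,
    align: usize,
) -> Result<NonNull<u8>, AllocError> {
    match alloc.allocate(size, align) {
        Ok(p) => Ok(p),
        Err(_) => {
            cache.clear(); // free memory the caller doesn't strictly need
            alloc.allocate(size, align)
        }
    }
}
```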
I agree with @reem. Rust can't enforce a constraint like "doesn't allocate memory" in destructors for generic collection types, and panicking can allocate memory too, so if you really need to handle OOM I don't think you can rely on the standard library anyway. |
The inability to enforce error handling in the type system certainly doesn't prevent handling that class of error correctly. We can't enforce that other types of errors are handled correctly either...
This means that panicking is not a solution to handling out-of-memory errors, which is already a given considering that the niches where this needs to be handled are often the same ones where unwinding isn't supported / tolerable.
My point was that an application that cares about OOM won't be able to use any data structures without auditing the code to make sure it can't allocate memory in destructors, since Rust can't enforce that behavior. Once that code is audited, it has to be forked anyway, unless there are actual guarantees that it won't change in the future. The standard library (outside of libcore and liballoc) could be changed to guarantee it wouldn't allocate in destructors as long as the generic destructors didn't allocate, I guess. I'm pretty sure that's not currently a guarantee.
I don't know why you're getting caught up on destructors. Rust can't enforce effects for any functions but that doesn't mean that effects can't be documented / guaranteed.
No, it just has to make the guarantee that it won't allocate. It's not possible to write any code at all if you're unable to trust APIs to uphold their contracts... this is no different than any other part of the API.
It's already a strong guarantee in [...]
"it's not implemented" makes no sense as the rationale for "it shouldn't be implemented" |
Okay, you've convinced me, as long as the guarantee only holds if unwinding is turned off.
Are there plans for Box to be able to reference the allocator that allocated it? I am thinking of arena-style allocators. Drop would need to be self.myalloc.free(self.ptr) instead of free(self.ptr).
@rrichardson one example from RFC PR 244 shows embedding "the allocator" into the allocated object. (And then the embedded "allocator" is really just a reference to the actual shared state of the allocator itself.) I don't know how much the eventual allocator design will continue to look like RFC PR 244, but I think we do intend to continue to allow the encoding of such back-references (with the expectation that many allocators will be represented by zero-sized types to avoid unnecessary overhead in the allocated state).
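A minimal sketch of that back-reference shape, assuming a hypothetical `Alloc` trait and `MyBox` type (this is not RFC PR 244's actual API): the handle stores a reference to its allocator and frees through it in Drop, which is exactly the `self.myalloc.free(self.ptr)` pattern asked about above. With a zero-sized allocator type the back-reference could instead be held by value at no space cost.

```rust
use std::ptr::NonNull;

pub trait Alloc {
    fn alloc(&self, size: usize, align: usize) -> Option<NonNull<u8>>;
    unsafe fn free(&self, ptr: NonNull<u8>, size: usize, align: usize);
}

/// A Box-like handle carrying a back-reference to the allocator that made it.
pub struct MyBox<'a, T, A: Alloc> {
    ptr: NonNull<T>,
    alloc: &'a A,
}

impl<'a, T, A: Alloc> MyBox<'a, T, A> {
    pub fn new_in(value: T, alloc: &'a A) -> Option<Self> {
        let raw = alloc.alloc(std::mem::size_of::<T>(), std::mem::align_of::<T>())?;
        let ptr = raw.cast::<T>();
        unsafe { std::ptr::write(ptr.as_ptr(), value) };
        Some(MyBox { ptr, alloc })
    }
}

impl<'a, T, A: Alloc> Drop for MyBox<'a, T, A> {
    fn drop(&mut self) {
        unsafe {
            // Run T's destructor, then hand the block back to the allocator
            // that produced it rather than to a global `free`.
            std::ptr::drop_in_place(self.ptr.as_ptr());
            self.alloc.free(
                self.ptr.cast::<u8>(),
                std::mem::size_of::<T>(),
                std::mem::align_of::<T>(),
            );
        }
    }
}
```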
We have a simulation benchmark, with plausible memory usage patterns, that in some circumstances runs eight times faster with custom allocators than with global malloc, on modern hardware. The program spends a negligible amount of time actually allocating and freeing memory; effectively all of the runtime difference is the result of better or worse cache locality, depending on whether the elements used together are allocated together, despite being allocated at widely different times.

The simulation benchmark models a system composed of subsystems that are activated, one at a time, in random order. When a subsystem is activated, it runs for a while touching memory allocated during previous activations, and does some further allocating and freeing before returning control to the dispatcher. This usage pattern is far from unusual in real systems: imagine a web service that handles various POSTs that touch different database tables, or a client with multiple (tab or window) workspaces.

Fully integrating allocators into a foundation library has profound architectural costs, particularly as regards safety. E.g., moving an object with parts that live in one allocator into a container constructed to use another typically has catastrophic consequences. The need to store an allocator pointer in the container and in elements adds space overhead and cache pressure. Sharing an allocator object among different containers or elements in a container presents complex ownership questions. Nonetheless, the profound performance consequences (8x!) justify quite a lot of pain. These consequences are not likely to fade away with future changes in system design.
Out of curiosity, was this in the BDE? :) Threading allocators through object hierarchies was a bit of a PITA, but sometimes useful. It is an interesting design choice to make most objects take pointers to allocator instances at construction time. Some questions: Were the allocators at each subsystem arena, slab, or otherwise? Thread-safe, or intended to be used thread-locally? Cache locality is nice when you can get it, but if the objects are too large to fit in a couple of cache lines that benefit is gone.
Good guess. Yes, if you clone bloomberg/bde-allocator-benchmarks here, you can see the program, benchmarks/allocators/locality.cc. It's a little program with 5 knobs. Runtimes for a series of runs are in (e.g.) benchmarks/allocators/results/intel-clang-20150521/locality.csv. You can see, in the line 4th from the bottom (arguments "-21 17 1 5 256"), a 6x speedup vs. malloc, i.e. 7.5s vs. 45s. (This is on an Ivybridge Xeon. The 8x result was from a series run on a Haswell desktop box.) It's notable that there are plenty of cases where using separate allocators was actually slower, and you would use the default global allocator in those circumstances. Trafficking in raw pointers to allocators is admittedly retro, but the CPU doesn't care.

The allocators used in the benchmarks were what we call "multipools", which segregate allocated blocks by size and reuse freed memory only for subsequent allocations of the same size. Each block has a size prefix, which Rust wouldn't need, so its gains might be larger. The allocators we used are thread-safe, but the simulation was single-threaded, and anyway time spent actually in the allocator was negligible. Most of the time in the slow runs was spent waiting on cache misses. The amount of memory involved is a lot more than a few cache lines. We have runs that straddle L1, L2, and L3 cache size boundaries, with visible consequences. The elements allocated are 20 bytes; 28 bytes with the prefix, or two to a cache line. I think malloc overhead nowadays is the same.

There are legitimate reasons to want polymorphic allocators in real programs, so that the same container implementation (i.e. one that lives in a ".o" file) may be passed allocators for different strategies in different circumstances. The strategies we find most commonly used for performance tuning are multipool and monotonic. In the latter, free is a no-op and the memory is all reclaimed when the allocator goes away; people make a buffer on the stack that is usually big enough. Polymorphic dispatch turns out to add negligible overhead with modern branch predictors.

Transitively, anything that is put in a container and allocates must use the very same allocator instance as the container itself, or catastrophic things happen at free time. E.g., in a hash (string -> pair(vector, string)), the hash, the strings, and the vector must all be made to use the same allocator instance, even though the pair does not itself allocate. Compiler support would go a long way toward making that requirement practical. It's hard to get any help from C on that point.
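For readers unfamiliar with the "monotonic" strategy described above, here is a hedged Rust sketch of what such an allocator might look like: allocation just bumps a cursor through one backing buffer, free is a no-op, and everything is reclaimed together when the allocator is dropped. All names and signatures are illustrative.

```rust
pub struct Monotonic {
    buf: Vec<u8>, // backing storage; reclaimed when the allocator goes away
    used: usize,  // bump cursor, in bytes from the start of `buf`
}

impl Monotonic {
    pub fn with_capacity(bytes: usize) -> Self {
        Monotonic { buf: vec![0u8; bytes], used: 0 }
    }

    /// Bump-allocate `size` bytes aligned to `align`; None when the arena is full.
    pub fn alloc(&mut self, size: usize, align: usize) -> Option<*mut u8> {
        debug_assert!(align.is_power_of_two());
        let base = self.buf.as_mut_ptr() as usize;
        // Round the cursor up to the requested alignment, then check for space.
        let start = (base + self.used + align - 1) & !(align - 1);
        let end = start.checked_add(size)?;
        if end > base + self.buf.len() {
            return None; // out of arena space; the caller decides how to react
        }
        self.used = end - base;
        Some(start as *mut u8)
    }

    /// Deliberately a no-op: individual frees do nothing under this strategy.
    pub fn free(&mut self, _ptr: *mut u8, _size: usize) {}
}
```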
The primary usefulness of polymorphic allocators is to avoid code duplication for every allocator type. In C++ it also serves the purpose of letting you avoid templating everything that requires an allocator. This is very important and should definitely be considered as such.

In my experience, allocator state comes in three flavors: global (e.g. jemalloc, like we have now), local (e.g. monotonic on the stack), and thread-local (e.g. a stack frame allocator). While global/thread-local allocators do not require a pointer to the allocator all the time, one is still required when mixed allocator strategies are used (in a function or structure) or when freeing does not happen on the same thread that allocated. A pointer is always required when the allocator is hidden from the user via polymorphic allocators, which is typically required for things like containers and high-level data structures.

While having contained objects use the same allocator as the container object is probably best for the TLB/cache, I'm not sure this is really an issue that needs to be enforced by the language/library. Surely the increased TLB/cache cost of having two or more separate allocators involved is similar in cost to polymorphic dispatch, or am I missing something? I have never experienced major issues with mixed allocators between containers and containees. I presume how your containers are implemented might be a large part of the reason. The impact on the TLB is probably much more important than on the cache, since data shared between the container and containee is unlikely to sit on identical cache lines.

Adding a pointer to every allocation simply to point to the owning allocator is a bit overkill. I see at least two other alternatives (which could be mixed): the allocator must be provided when freeing memory (using the wrong allocator is cause for panic; omitting the allocator, we might assume the global heap and use that instead), or a management structure performs a lookup based on the pointer address being freed to find which allocator it belongs to. The latter approach is how I believe jemalloc finds which segment/block of memory owns an allocation. In practice, something like a judy array might be used.

Additionally, containers might ideally not always store an allocator pointer inline, since in practice it is rarely accessed and it would simply bloat all owner objects that contain such containers. It might be best to hide the pointer as part of the allocated memory, either as a header or a footer, once the memory is allocated. For example, an uninitialized vector has no data and no allocator associated; an initialized vector has no data, but the data pointer points to the allocator to use (vector capacity is 0); and an allocated vector has data, with the pointer stored at data - sizeof(ptr).

Also, having a single allocator interface might be hard. Alignment is a must for embedded platforms and many high-performance applications, and simply aligning to natural alignment is not always desirable or possible. Further flags are also required, and it is not clear how many such flags are enough or required. For example, embedded platforms can routinely specify which page size to use and other things such as write combining, CPU/GPU memory, etc. There are also the CPU/GPU access flags (which might be separate).
An allocator might be created to handle a certain specific set of flags (e.g. r/w GPU memory, write combining), at which point adding these to the interface might not make as much sense. Another common flag is one of lifetime: EASTL, for example, allows you to specify whether an allocation is long-lived or short-lived, in which case different heap sections are used to help reduce fragmentation. Adding a typed flags argument to the API might not be feasible, but if it is not typed, it is certainly less safe.

To complicate things further, not all memory allocation strategies agree on what freeing memory entails. Heap and pool type allocators typically support freeing at pointer granularity, while monotonic and frame-based allocators do not and instead free in frames (typically in C applications, destructors are not even called on objects allocated by these allocators, since it is generally not known how many objects are freed when it happens). Whether or not this is an issue is debatable.

I have begun discussing similar issues on my blog, and I offer a separate C API that differs from jemalloc's and Bloomberg's approach a bit in that I offer polymorphic allocators but avoid using the vtable as much as possible. Details can be found here. It is nowhere near finished nor production-ready, but it offers different trade-offs.
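One way to picture the header idea floated above is to over-allocate by one word and stash the owner just in front of the payload handed to the caller. This is only a sketch under that assumption; `Header`, `alloc_with_owner`, and `owner_of` are made-up names, and it deliberately sidesteps the alignment concerns raised in the same comment except to flag them.

```rust
use std::alloc::{alloc, dealloc, Layout};

/// One machine word stashed in front of every block, recording its owner.
/// In a real design this would be a pointer or handle to the owning allocator.
#[repr(C)]
struct Header {
    owner: *const u8,
}

const HDR: usize = std::mem::size_of::<Header>();

/// Over-allocate by one header, record the owner, and hand out the payload.
/// Caveat: the payload is only as aligned as the header, so real code must
/// honor the caller's requested alignment.
unsafe fn alloc_with_owner(size: usize, owner: *const u8) -> *mut u8 {
    let layout = Layout::from_size_align(HDR + size, std::mem::align_of::<Header>()).unwrap();
    let base = alloc(layout);
    assert!(!base.is_null(), "allocation failed");
    (base as *mut Header).write(Header { owner });
    base.add(HDR)
}

/// Step back one header to recover which allocator owns this payload.
unsafe fn owner_of(payload: *mut u8) -> *const u8 {
    (*(payload.sub(HDR) as *const Header)).owner
}

/// Return the whole block (header included). A real implementation would
/// dispatch through `owner_of` here; this sketch frees via the global heap.
unsafe fn free_with_owner(payload: *mut u8, size: usize) {
    let layout = Layout::from_size_align(HDR + size, std::mem::align_of::<Header>()).unwrap();
    dealloc(payload.sub(HDR), layout);
}
```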
(I read the blog entry, but Disqus does not let me comment there... Merging allocation, deallocation, and re-sizing into one function is clever, and support for resizing is welcome, but it is probably not necessary to second-guess the virtual-function dispatch mechanism: modern CPUs cache recent destinations, tagged to the addresses of branch instructions, eliminating most pipeline stalls in the loops where they matter most. The vtbl itself eats just one cache line. Since each operation would, in practice, always be one or the other, merging them adds overhead. Also: "IsOwnerOf" (which seems like a long way to say "Owns") seems like it could be a very expensive operation to be obliged to support universally. Is an allocator that can't tell allowed to return false?)

In C we have burnt bridges that Rust has not (yet), such as the need for objects to have a recoverable size. In practice, making the allocator track size need not add overhead, as multiple small objects in a page can share a size annotation (located by address arithmetic), while larger objects can afford their own annotations, and their own pages. Still, it is better for the user of storage to keep track of where they got it and how much is there--presuming they can be trusted to get it right. Can enforcing correctness be automated? E.g. each allocation with a static size is matched to a corresponding deallocation, and each pair of dynamic-sized operations identifies a machine word shared between them that records the size. (Tracking reallocation would need some trickery, because there is temporarily a second size that, at least momentarily, lives elsewhere.) Memory allocation seems important and hazardous enough to deserve core language support to verify correctness.

The matter of contained objects obliged to "bloat" themselves with copies of an allocator identifier is difficult. In principle, contained objects could instead point to the outermost container, and keep one copy of the allocator identifier there, but that doesn't seem to save any bloat. It may be that objects which do no re-allocation during their lifetime could avoid keeping such an identifier by requiring somebody else (e.g. the container) to provide it at destruction time.
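A small sketch of the shared size annotation mentioned above, assuming the allocator carves small blocks out of pages of a fixed, power-of-two size allocated at that alignment; the constant and names are invented for illustration.

```rust
const PAGE_SIZE: usize = 4096; // pages are assumed to be allocated at this alignment

#[repr(C)]
struct PageHeader {
    block_size: usize, // size of every block carved from this page
}

/// Recover an allocation's size from its address alone.
unsafe fn block_size(ptr: *const u8) -> usize {
    // Round the block address down to its page boundary to find the shared header.
    let page = (ptr as usize) & !(PAGE_SIZE - 1);
    (*(page as *const PageHeader)).block_size
}
```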
Very strange about the commenting, I'll have to double-check what is up. The main idea of inlining the realloc function pointer is, as I mention in the blog post, [...]. Indeed, memory allocation is tricky to get right and there are many edge cases. Dealing with OOM is also a major pain point... I am unsure what the best approach is.
(When we think about performance, we don't worry about the first cache miss or the first pipeline stall. We worry about the following million accesses: how many operations do I get per stall or miss? Runtime dispatching in your pointed-to function guarantees plenty of stalls in not-unusual programs. Second-guessing your caches without comprehensive measurements is always wrong. But let's not get distracted.)

My touchpoint for allocators in containers is map : string -> pair ( string, vector ). It seems to me a meaningful distinction whether an object uses an allocator only at construction and destruction time, vs. possibly at other times. In the example, the key string is presumably constant, the value maybe not. The pair doesn't know about allocators, but its components do, and they (or, maybe, whoever mutates them?) need to know the map's allocator. When the key is constructed and when it's destroyed, the container is in control and can supply an allocator. When the vector is mutated, finding and conveying to it the right allocator to use for resizing, if it doesn't store one itself, might be prohibitively tricky.

This suggests to me that mutable objects which may allocate again, after construction is over, need space to store (a pointer to) an allocator, but when the same sort of object is not mutable, it should not need to store an allocator. (This implies it has a different, but related, type.) The constructor and destructor of a const object that might allocate during construction (including aggregators of such types) need an allocator argument; if mut, it might allocate at other times and needs to keep a copy, too. A container or aggregator of a mutable object (anywhere down the hierarchy) needs to be equipped to ensure, on construction, that the contained objects' allocators match, and that they match whatever container it's moved into.

The type system needs to be able to answer, for any given object, whether any subobject might need an allocator, and when. Types that never use an allocator, then, are easy. Those that do need an allocator, but only at construction and destruction (such as constant objects, but many mut objects too), need to be supplied one at the right times. (Maybe some mut subobjects can share a copy if the definitions of allocating operations are visible.) Separately allocated mut subobjects that might need to allocate again each need their own copy. C++ is stuck with keeping copies of allocator pointers everywhere. In Rust it might not be too late to do something smarter.
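A hedged Rust sketch of the "different, but related type" idea: the growable variant stores a reference to its arena because it may allocate again, while the frozen variant drops that field and takes the allocator back only at destruction time. `Arena` and all other names here are illustrative, not a proposed design.

```rust
pub struct Arena;

impl Arena {
    pub fn alloc_bytes(&self, _n: usize) -> *mut u8 { unimplemented!() }
    pub unsafe fn free_bytes(&self, _p: *mut u8, _n: usize) { unimplemented!() }
}

/// May re-allocate after construction, so it keeps a reference to its arena.
pub struct GrowableBuf<'a> {
    ptr: *mut u8,
    len: usize,
    arena: &'a Arena,
}

/// Immutable after construction: no allocator reference is stored inline.
pub struct FrozenBuf {
    ptr: *mut u8,
    len: usize,
}

impl<'a> GrowableBuf<'a> {
    /// Freezing discards the back-reference; the owner now carries the
    /// obligation to hand the same arena back in at destruction time.
    pub fn freeze(self) -> FrozenBuf {
        FrozenBuf { ptr: self.ptr, len: self.len }
    }
}

impl FrozenBuf {
    /// The allocator is supplied again only at the point of destruction.
    pub unsafe fn destroy(self, arena: &Arena) {
        arena.free_bytes(self.ptr, self.len);
    }
}
```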
Keeping pointers to allocators is a way to abstract the allocator away from the usage locations. It means that only the code that creates the objects needs to know about the allocator used, and everything else doesn't need to care (whether it mutates or destroys them). Requiring the user to provide the correct allocator for all containers at the destruction site as well would be an additional burden to manage and a potential source of error. Unless the compiler can manage this on its own, it is likely a bad idea to force all code to do this.

While it is true that it is less efficient in theory to keep a pointer per container, in practice it isn't necessarily a clear win to store the allocator pointer externally. For objects larger than 16/32 bytes (depending on AVX usage/support), if the alignment is not provided to the allocator, it cannot make any assumptions and as such must return memory aligned to at least 16 bytes to avoid potential issues with SSE types. In these cases, depending on the size, padding will remain and be left unused; 8 extra bytes could very well fit inside that padding. Even if the alignment is provided and it is less than 16 bytes, heap type allocators might elect to only support alignment on 16/32 bytes anyway in order to simplify management or reduce bookkeeping overhead.

For something like a standard library, a clean and simple API might be better: something that imposes the least amount of burden on the programmer. Perhaps a separate container implementation might support a more verbose API. For example, keep the current container APIs and simply add an allocator type (either typed on the container or provided to the constructor), while providing a low-level variant which requires the allocator at destruction (and perhaps whenever required) and which returns options/results to indicate OOM situations (where appropriate). The first would be intended for everyday usage by applications that do not care about or do not need that extra performance boost, while the latter would be intended for embedded/low-level systems that need that fine-grained control. Perhaps three types might be provided: the current one, left unchanged, which would use the global heap; a second using a custom allocator but with an API identical to the first; and a third with the verbose API (see the sketch below). The first two could be built on the third underneath, to ensure as little code duplication as possible and ease maintenance.

IMO the best way to offer power, control and simplicity is to package it in separate types. The hardest part might be to name them in a clear way to avoid confusion.
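A hedged sketch of that layering, with every name (`Alloc`, `RawVec`, `EasyVec`, `try_push`) invented for illustration: a verbose core that takes the allocator at each call and reports OOM via Result, and a convenience wrapper with the familiar panicking API built on top of it.

```rust
pub struct AllocError;

pub trait Alloc {
    fn alloc(&mut self, size: usize, align: usize) -> Result<*mut u8, AllocError>;
    unsafe fn free(&mut self, ptr: *mut u8, size: usize, align: usize);
}

/// Verbose layer: the allocator is passed in at each call and OOM surfaces as
/// a Result for callers that must handle it.
pub struct RawVec<T> {
    ptr: *mut T,
    len: usize,
    cap: usize,
}

impl<T> RawVec<T> {
    pub fn try_push<A: Alloc>(&mut self, value: T, alloc: &mut A) -> Result<(), AllocError> {
        if self.len == self.cap {
            self.grow(alloc)?; // may fail; the error is surfaced, not swallowed
        }
        unsafe { self.ptr.add(self.len).write(value) };
        self.len += 1;
        Ok(())
    }

    fn grow<A: Alloc>(&mut self, _alloc: &mut A) -> Result<(), AllocError> {
        // Reallocate to a larger capacity through `_alloc`; elided in this sketch.
        unimplemented!()
    }
}

/// Convenience layer with today's API shape: owns its allocator, panics on OOM,
/// and is implemented entirely on top of the verbose layer.
pub struct EasyVec<T, A: Alloc> {
    raw: RawVec<T>,
    alloc: A,
}

impl<T, A: Alloc> EasyVec<T, A> {
    pub fn push(&mut self, value: T) {
        self.raw
            .try_push(value, &mut self.alloc)
            .unwrap_or_else(|_| panic!("out of memory"))
    }
}
```

In the same spirit, the convenience layer could default its allocator parameter to the global heap, giving the unchanged everyday API the comment asks for while leaving the verbose layer for code that must handle OOM.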
All of that sidesteps the difficult problem of ensuring that the same allocator is used throughout the container and all its elements, transitively, including elements that, at top level, appear not to need an allocator at all. Obviously everything can be left to be done manually, with massive duplication and no help for correctness. That is what we are left with if nothing smart surfaces. The topic here is what a smart design might look like, with help from the core language if necessary.
This can be closed now that the RFC is accepted?
@Ericson2314 I think that the merge of #1398 probably means we can close this, though I want to review the thread here and make sure that other concerns that were not addressed by that RFC (e.g. swapping out the default global allocator) are addressed by either an RFC or a follow-up issue.
What's the status of this? It's been three years =/
Triage ping @pnkfelix - have you reviewed this so that we can close the issue?
We will want a user-defined allocator API in our standard library. It has been moved off of the 1.0 requirements, but it remains an important post-1.0 task.
See for reference:
Also, it is good to be aware of whatever happens with placement new and the box syntax: see e.g. #413 (though that is arguably a draft RFC that has been filed as an issue).