Spotting and avoiding heap fragmentation in Rust applications (svix.com)
138 points by tasn on April 6, 2023 | 68 comments


> Gaps that are too small and scattered throughout the heap can lead to new "fresh" blocks of memory being allocated to accommodate a new value that won’t fit otherwise. Though unfortunately because of how memory management works a "defrag" is not possible.

This is how memory management works now, but some older systems like classic Mac OS and Palm OS used a design that did make it possible to compact the heap.

See https://en.wikipedia.org/wiki/Classic_Mac_OS_memory_manageme...

When you allocated memory, instead of getting a pointer back, you'd get a handle. While the handle was unlocked, the operating system was free to move the memory around. If you wanted to access the memory, you'd lock the handle, giving a pointer with the actual address, then unlock the handle when done.

To the extent that you kept handles unlocked, the system could fight fragmentation.

It was tedious, and your code had lots of lock/unlock clutter in it. It also led to bugs where you'd accidentally use a pointer value after unlocking its handle, which makes the pointer value invalid because your data might have been moved somewhere else. Worse, usually your data was not moved, so these bugs were hard to detect, much like use-after-free bugs.

But, when memory is very limited and you also don't have an MMU, life isn't easy.


Rust's borrow checker should make it impossible to reference an unlocked pointer in safe code. I suspect it should be possible to have a compacting heap as a library.
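Something like that could plausibly be built in safe Rust: handles are indices into an arena, and "locking" is just an ordinary borrow, so the borrow checker stops you from holding a pointer across a compaction. A minimal sketch (all names made up, not a real crate):

    // Hypothetical sketch of a compacting heap hidden behind handles.
    struct Arena<T> {
        slots: Vec<Option<T>>, // simplistic backing store
    }

    #[derive(Clone, Copy)]
    struct Handle(usize);

    impl<T> Arena<T> {
        fn new() -> Self {
            Arena { slots: Vec::new() }
        }

        fn alloc(&mut self, value: T) -> Handle {
            self.slots.push(Some(value));
            Handle(self.slots.len() - 1)
        }

        // "Locking" is just borrowing: while the returned reference is
        // alive, `compact` (which needs &mut self) cannot be called.
        fn get(&self, h: Handle) -> &T {
            self.slots[h.0].as_ref().expect("stale handle")
        }

        fn compact(&mut self) {
            // A real implementation would move live values together and
            // translate handles through an indirection table; omitted here.
        }
    }

    fn main() {
        let mut arena = Arena::new();
        let h = arena.alloc(String::from("hello"));
        println!("{}", arena.get(h)); // shared borrow ends here...
        arena.compact();              // ...so compaction is allowed again
    }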


Yeah. I wonder if something like this might be interesting for a pure Rust program, or what computing would look like if Rust had been available back then (we now have a path dependency on modern memory allocation patterns, where we rely on the allocator to do really heavy lifting, that is unlikely to change). I wonder how expensive the extra handle indirection would be (I imagine modern CPUs should be able to speculate through it without much difficulty).


Early Windows versions worked like that too.


If you're doing memory-intensive tasks that are clearly bounded (e.g., handling an http request), I wonder if it's worth looking at something like an arena allocator [0] that can free the entire task's memory at once.

[0] https://docs.rs/bumpalo/latest/bumpalo/
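For the HTTP-request case the rough shape would be something like this (the request-handling details here are just for illustration; Bump and alloc_slice_copy are from bumpalo's documented API):

    use bumpalo::Bump;

    fn handle_request(body: &[u8]) {
        // One arena per request: everything allocated from it is freed
        // together when `bump` is dropped (or it can be reset and reused).
        let bump = Bump::new();

        // Copy the request body into the arena and work from there.
        let scratch: &mut [u8] = bump.alloc_slice_copy(body);
        let _ = scratch; // ... parse / process using arena-backed data ...

        // Dropping `bump` releases the whole arena in one go.
    }

    fn main() {
        handle_request(b"{\"example\": true}");
    }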


> In the case of this old PC hard drive, files of varying sizes were written to disk then later moved or deleted, leaving a "hole" of available space between other used regions. As the disk starts to fill up, you might try to create a new file that doesn’t quite fit in one of those smaller areas, and you’d be out of luck. You’d need to "defrag" in order to reclaim those open blocks which are too small to hold the new file when they are split up, but could be big enough when they are contiguous.

While that is pretty much what heap fragmentation is about, the failure mode of disk fragmentation is less drastic. The file system will just split the file contents across multiple smaller free spots, making it possible to use the whole disk no matter your write pattern. The issue is that now the file is no longer contiguous, so reading the entire file takes longer. Much longer if we are talking about old HDDs.


Good point, thanks! Will update the article to clarify the difference.


The failure mode for filesystems is to get slow. Same as for heap allocation.


No - heap fragmentation actually leaves memory unusable by the workload. It may or may not slow down allocation depending on the design of the allocator.

(The analogy to file system fragmentation for memory is that when physical pages are allocated in a discontinuous manner, it prevents some optimizations like coalescing them into hugepages, which for some workloads can help with TLB hit rate.)


Filesystem may also run out of inodes, even though there is still plenty of space.


I don't think either FAT32 or NTFS suffers this problem; they allow metadata to become fragmented instead. Certainly a different design philosophy from the unixy file systems.


NTFS most definitely suffers from this[1]. We ran into it with a customer who had a particularly fragmented database file.

IIRC it's due to a combination of relying on fixed-size "pages" to hold fragment pointers and a limited number of page indirections. That is, the root page can point to sub-pages, which again can point to sub-pages, but those sub-sub-pages have to point to the actual fragments. Or something along those lines.

[1]: https://support.microsoft.com/en-au/topic/a-heavily-fragment...


Or you should use ZFS and not worry about that.


In some ways Rust is in the worst possible position in terms of language design when it comes to fragmentation.

In a completely manually managed language like C or C++, you can handle fragmentation problems yourself by writing your own allocators or doing object pools to reuse previously allocated memory. You have control over fragmentation.

In a garbage collected language like Java or C#, the runtime is able to move objects around in memory and update their pointers. So, as long as your language has a sufficiently advanced implementation, it may defragment on the fly for you.

But Rust is sort of stuck in the middle. It's safe enough that it's hard to write your own allocators or easily reuse previously allocated memory. But it's low level enough that the runtime doesn't have the freedom to move things around in memory under the program.

It's a hard problem.


>In a completely manually managed language like C or C++, you can handle fragmentation problems yourself by writing your own allocators or doing object pools to reuse previously allocated memory. You have control over fragmentation.

It's done constantly in Rust. Create a vector of the items you want to allocate and refer to them by their id, i.e. an int representing them. There are libraries supporting this style of development, for example slab: https://crates.io/crates/slab
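A small example of that style with slab (keys are plain usize indices, and freed slots get reused by later inserts):

    use slab::Slab;

    fn main() {
        let mut nodes: Slab<String> = Slab::new();

        // Store values once and refer to them by integer keys, not pointers.
        let a = nodes.insert("first".to_string());
        let b = nodes.insert("second".to_string());
        assert_eq!(nodes[a], "first");

        // Removing frees the slot for reuse, keeping allocations packed
        // instead of scattering them across the heap.
        nodes.remove(b);
        let _c = nodes.insert("third".to_string());
    }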


I assume there might be performance issues using a vector that’s gonna reallocate when capacity is full?

That being said vectors seem like a gift in rust. I feel like it’s the easiest way to get around some ownership rules when you need to work with many pointers. You can just use indices.
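For the reallocation worry, Vec::with_capacity helps if you have a rough upper bound. A toy example of the index-as-pointer pattern:

    // Indices stand in for references, so the borrow checker has nothing
    // to complain about, and pre-sizing avoids most reallocation.
    struct Node {
        value: u32,
        next: Option<usize>, // index into `nodes` instead of a pointer
    }

    fn main() {
        let mut nodes: Vec<Node> = Vec::with_capacity(1024);
        nodes.push(Node { value: 1, next: None });
        nodes.push(Node { value: 2, next: Some(0) });
        assert_eq!(nodes[nodes[1].next.unwrap()].value, 1);
    }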


You can just allocate a large buffer and serialize/deserialize everything at various indices in that buffer


I hear what you're saying, and I think I agree - I just frame it slightly differently. As others have mentioned, you can use custom allocators & object pooling in rust. I'm a big fan of bumpalo.

But it's awkward. I think this awkwardness comes because Rust is trying to straddle the gap between being a low-level systems language and a high-level language for application developers.

In a systems language, I want full control over the allocator for each allocation my program does. Do I want to use the system allocator, or an arena or an object pool? Different tasks call for different tools. I really like Zig's approach here where every collection type which allocates takes an allocator as a parameter when the object is created. Rust's borrow checker can be super helpful here in making all of this stuff correct.

But application developers don't want to think about allocators and memory fragmentation. Adding allocator traits to everything will add yet more things to learn in rust. And rust's standard library doesn't support that (yet).
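For reference, this is roughly what the unstable allocator_api looks like today (nightly only, and the details may well change before it ever stabilizes):

    #![feature(allocator_api)]

    use std::alloc::System;

    fn main() {
        // Collections parameterized by an allocator; `System` is just a
        // stand-in here, an arena or pool type could be passed instead.
        let mut v: Vec<i32, _> = Vec::new_in(System);
        v.push(1);
        v.push(2);
        assert_eq!(v.len(), 2);
    }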

The result is that allocation crates like bumpalo need to ship with & maintain their own copy of the standard library's collection types. It's kind of a mess. This might eventually be fixed, but at the rate Rust's development has been going lately, I suspect it'll take years for something to land on stable.

But people in the community are talking about it, at least. Eg this article from the rust subreddit: https://nical.github.io/posts/rust-custom-allocators.html


It's not really that hard of a problem. You just use a different allocator (like the author did in this article). In fact, you can even write your own allocator with unsafe if you want (jemalloc is already using a lot of unsafe in the code anyway, you need to use it to build any real allocator).

'unsafe' doesn't equal 'don't use'. I'll admit that unsafe Rust has a lot of rough and unergonomic features compared to C, but it's conceptually and algorithmically just as easy (or hard) in Rust to build an allocator as it is in C and C++.

You also still get the help of the borrow checker in unsafe rust, so there are even some advantages.


Doesn't the normal approach in C++ run into similar issues? The workarounds in C++ are more ergonomic than in Rust, but can still require a lot of refactoring.

One of the performance bottlenecks that I ran into while reimplementing clox in C++ was using `new` and `delete` instead of `realloc` for arrays and hash tables. By the time that I figured out it was an issue, I was already using `operator new` to track heap allocations for the garbage collector.


Rust lets you specify your own allocator. You can also swap it deeper down as this article describes. It’s no worse than C or C++.
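The swap itself is only a couple of lines. I don't know exactly which jemalloc wrapper the article used, but with the jemallocator crate the pattern looks like this:

    use jemallocator::Jemalloc;

    // Every heap allocation in the program now goes through jemalloc.
    #[global_allocator]
    static GLOBAL: Jemalloc = Jemalloc;

    fn main() {
        let v: Vec<u8> = vec![0; 1024];
        println!("{}", v.len());
    }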

Also unsafe Rust is a thing and exists precisely to enable programming at the extremely low level when abstractions tend to leak badly. It’s considered bad form to have it in web apps and other normal code but seeing it in a low level data structure or an allocator would not be unusual.

Managed languages and high level languages definitely do have an advantage here.


This article would've been a bit cooler if the conclusion wasn't "switch from default allocator to jemalloc" but instead "use jemalloc to prove something is wrong in the default allocator and track down + find a fix for what's wrong in the default allocator"

Unless I misunderstood that the default Rust allocator, with high request bodies and concurrency, is always going to suffer unfixable heap fragmentation like displayed in the article?


I agree that there could have been a more satisfying conclusion, but it is worth noting that jemalloc isn't a panacea. I've seen issues similar to Svix in both Rust and C++ applications that were heavy on ephemeral allocations, and have fixed it by doing all of the following, depending on the specific process:

  * Switching from libc malloc to jemalloc
  * Switching from libc malloc to tcmalloc (dating myself a little bit)
  * Switching from libc malloc to mimalloc
  * Switching from jemalloc to mimalloc
  * Switching from jemalloc to libc malloc
  * Switching from mimalloc to jemalloc
Possibly others; I only want to list cases I'm 100% certain of.

Heap fragmentation is just a reality of some allocation patterns without a GC runtime.

One certainly can (and, in some cases, should) make their application more allocator-friendly, but - aside from some often-low-hanging fruit - this is a time-intensive process involving a bit of, for lack of a better word, arcane knowledge (I should inline all my fields and allocate on the stack as much as possible, right? Yes, well, except ...)

If you already have a halfway decent benchmark suite or workload generator, which you'll want for other purposes anyway, it's often a lot quicker to just try a few other allocators and select the one that handles your workload best.


> * Switching from libc malloc to tcmalloc (dating myself a little bit)

If you think of tcmalloc as an old crusty allocator, you've probably only seen the gperftools version of it.

This is the version Google now uses internally: https://github.com/google/tcmalloc

It's worth a fresh look. In particular, it supports per-CPU caches as an alternative to per-thread caches. Those are fantastic if you have a lot more threads than CPUs. I haven't checked if it's been adapted for the latest upstream kernel API, but there's also the idea of "vcpu"-based caches: basically rather than a physical cpu id, it's an (optionally per-numa-node-based) dense id assigned to active threads, so that it still works well if you have a small cpu allocation for this process on a many-core machine.


I think we can safely say that jemalloc has much better marketing than tcmalloc. If you search on github there is only 1 real project with a bazel WORKSPACE that mentions tcmalloc ... the rest are either forks of tcmalloc itself or trivial toy programs that I personally put on github to share with someone. There are zillions of users of jemalloc and everyone has heard of it.


We use it here at Cloudflare on every single machine as part of Workers. So that’s two major hyperscalers running large RAM multi tenant workloads.

Jemalloc may have more recognition in the broader community, but the largest workloads seem to be running mimalloc / tcmalloc (I don't know what Facebook uses internally). The libc malloc probably has even more users than either, as it's the default allocator for iOS and Android.


Agreed. I think it's a wider phenomenon: Google is culturally uninterested in / incapable of marketing an open source project that's less ambitious / strategic than e.g. TensorFlow, Kubernetes, Go, Dart/Flutter, or (looking back a few years...) Angular. The next tier I guess is Abseil or Bazel, which have a nice website, nice docs, and some papers/talks but I think still aren't super widely used outside Google. I can't think of anything smaller than those that's gotten any marketing at all. Can you?

tcmalloc at least gets internal changes regularly synced to github (since just a couple years ago iirc), vs. the ancient gperftools snapshot that's more widely known. And there are many other projects of potential interest to the outside world (Fibers...) that have been mentioned publicly but not open sourced at all.


> Heap fragmentation is just a reality of some allocation patterns without a GC runtime.

It is also a reality with a GC runtime, even a compacting one. Tracing GCs have acceptable performance only if the available memory is many times larger than the memory used. Technically this maybe isn't called fragmentation, but the effects are just as bad - the application uses much more memory than really needed.


All of the allocators are improved over time (against their own goals). It is always important to specify the libc version and the version of the alternative allocators; otherwise the results might not be reproducible a year later.


The default allocator is whatever is provided by the platform libc. In this case, I would guess that's the glibc allocator.

It's possible that the glibc allocator contains some simple bug. It's more likely that it doesn't contain any simple bugs, but makes different tradeoffs to jemalloc, which make it less suitable to this particular slice of applications.


I'm not sure it's the case that there's something "wrong" with the default allocator, but rather that there are different tradeoffs at play. There's an old but good discussion of the issue here https://github.com/rust-lang/rfcs/blob/master/text/1183-swap...


A sharp increase that never comes back down over time seems more "wrong" than right to me, no?


It's not good, but only looking at the bad half of a tradeoff will always look bad. Without diving into the glibc source code to understand what it's doing, it is hard to say if this is actually wrong.


I’ve recently been doing a lot of work that involves algorithmic design with non-technical people, and this is the fun thing. It’s very easy to identify expected scenarios, harder to identify how to achieve them, and sometimes actually impossible once you consider things like “the algorithm doesn’t know what happens next”.


Without more context I think it's unclear. I know linux tends to avoid freeing memory until/unless the system is near capacity, so this test may be running on a system with low memory pressure.


All allocators make different trade-offs in terms of memory usage, min / max / mean time to allocate and optimizations for certain allocation patterns (small, large, mixed, ...).

There might not be anything wrong with the default allocator, it just isn't the best suited for that particular use.


Reading the article reminded me of https://sourceware.org/bugzilla/show_bug.cgi?id=23416 and I wouldn't be surprised if it's the same underlying issue.


Shouldn't this also be a much larger problem when deploying rust on embedded systems or in the kernel? How is that solved in these cases? Or is this not a problem at all?


Rust by default uses the platform allocator. Anything that's true of its allocator performance is also true of C.

Also, as the sibling comment said, when you are doing tight embedded, the solution is not to allocate things dynamically.


I’m curious: when not allocating things dynamically, does that just get around most memory-safety issues (other than temporal ones, with array bounds checking covering the rest)?

Essentially any language that does array bounds checking would be as safe as rust? Or am I missing anything?

C does not technically do that, so rust is still safer. But zig would be the same in the embedded space as rust? Or is there anything I am missing?


Rust's compiler also makes sure you never modify a place in memory through more than one reference at a time.


That’s true, but that would be fairly trivial to do in a C program with only stack variables. Just always have only 1 mutable reference.

Something like C++ or Zig could enforce those rules even more easily.


Yeah, I don't disagree. Shame they never went that direction and instead were like "REAL MEN DON'T PROTECT AGAINST BUGS! WE DON'T WRITE BUGS!". But oh well. That's why we have the more modern languages these days.


Both the kernel and embedded systems manage memory very differently to "normal" applications. The kernel has its own allocation functions, and embedded systems often don't "allocate" but rather just have fixed memory regions they use for things.
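Typically that looks something like this: fixed-capacity storage whose size is known at compile time, so nothing ever touches a heap (just an illustrative sketch):

    struct MessageQueue {
        buf: [u8; 256],
        len: usize,
    }

    impl MessageQueue {
        const fn new() -> Self {
            MessageQueue { buf: [0; 256], len: 0 }
        }

        fn push(&mut self, byte: u8) -> Result<(), ()> {
            if self.len == self.buf.len() {
                return Err(()); // full: the caller decides what to drop
            }
            self.buf[self.len] = byte;
            self.len += 1;
            Ok(())
        }
    }

    fn main() {
        // In real firmware this would usually live in a static, behind a
        // critical section or mutex, rather than on the stack.
        let mut q = MessageQueue::new();
        q.push(42).unwrap();
    }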


I recall from playing with Arduino that using heap-allocated strings will cause heap fragmentation, which is why they should be avoided. That was C++; doesn't this apply to Rust as well?

Do kmalloc/kzalloc do magic to circumvent this problem, similar to jemalloc?


The embedded code I write doesn’t use an allocator at all, so it’s not a problem there.


When I started learning Rust (a while ago) I had trouble learning how to map Error(s) and how dynamic dispatch works. My first thought was: "if I cannot escape dynamic dispatch (for example with crates that use it), I will never understand how to use Rust in embedded, because of Box<T>". As an embedded C developer, this is something I never understood. For example:

"Values can be boxed (allocated on the heap) by creating a Box<T>" (https://doc.rust-lang.org/rust-by-example/std/box.html)

But then I see

    Box<dyn std::error::Error>
everywhere, for example in https://github.com/oxidecomputer/hubris/search?q=dyn

Is Box really allocating here? Is the "Rust By Example" text incomplete?

Then I had to stop learning Rust for other reasons, but this doubt really hit me at the time.


Dynamic dispatch (the `dyn` keyword) and dynamic allocation are distinct in Rust. You'll see Box<dyn SomeTrait> a lot in application-level code because the developer UX of boxed dynamic resembles a GC'd language like Java/Go, but they're not really coupled under the hood.

Dynamic dispatch allows you to call trait methods on values without knowing the concrete type of that value, similar to C++ virtual methods or a C-style manually constructed vtable. In an embedded context this will mostly show up as function parameters, since you can't move things around on the stack without knowing their size. You might also see references to a dyn value, like `&dyn SomeTrait` -- this is similar to `Box<dyn SomeTrait>`, but is a borrowed reference and therefore doesn't imply allocation.

Your link to the Oxide Hubris repo is misleading you because the hits seem to be pretty much all in build-time helper scripts (`build.rs`) and other minor tooling. You'd want to look through the source of their main binary, which probably wouldn't use allocation just for error propagation.


> Your link to the Oxide Hubris repo is misleading you

Sorry about this. It wasn't intentional.

But just to confirm, Box<T> always allocates? So is dynamic dispatch not recommended in embedded, and thus neither are crates that use dynamic dispatch and/or Box<T>?


Dynamic dispatch (dyn) in embedded is fine.

Dynamic allocation (Box) in embedded is not recommended.

`&dyn SomeTrait` is dynamic dispatch without dynamic allocation.

`Box<i32>` is dynamic allocation without dynamic dispatch.
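To make the distinction concrete (a minimal illustration, not embedded code per se):

    use std::fmt::Debug;

    // Dynamic dispatch without allocation: a borrowed trait object.
    fn log(value: &dyn Debug) {
        println!("{:?}", value);
    }

    // Dynamic allocation without dynamic dispatch: a Box of a concrete type.
    fn boxed() -> Box<i32> {
        Box::new(42)
    }

    fn main() {
        let x = 7;
        log(&x);         // no heap allocation involved
        let b = boxed(); // heap allocation, statically dispatched
        log(&*b);
    }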


Same way you solve it in C. Don't use the heap.


"The specific cause for the fragmentation could be any number of things: JSON parsing with serde, something at the framework-level in axum, something deeper in tokio, or even just a quirk of the specific allocator implementation for the given system. Even without knowing the root cause (if there is such a thing) the behavior is observable in our environment and somewhat reproducible in a bare-bones app."

So what is the ultimate cause of the fragmentation in this case? You can blame your allocator implementation, but how do you know it's not your use of the allocator? It feels like you are just slapping jemalloc on to solve this, which I suppose shows that the allocator you were using was causing problems before, but it doesn't really explain how? I suppose what I'm wondering is why specifically the allocator you were using before was causing this fragmentation... isn't that a bigger problem than you're making it sound?

Also, what else can an allocator do other than coalescing free blocks to decrease fragmentation? Does it involve occasional checks to defragment the heap ie moving separated blocks so they are adjacent?


Our assumption, which turned out to be true, is that it's due to the JSON parsing code. Rust (serde) is very efficient at parsing JSON into a predefined structure, but when it comes to parsing into a "generic object", which we need for part of the payload, it's not as efficient. We are going to deploy a full fix for this issue too, but jemalloc already solved it as well.

Though I disagree with calling it "just slapping jemalloc on to solve this". The piece of code in question definitely made the fragmentation issue worse, as it was making a lot of allocations of varying sizes, but the underlying issue of memory fragmentation because of the allocator was still there, and it would just have been triggered later by a different code path.
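To illustrate the difference (the Payload struct here is made up, not our actual schema): parsing into a typed struct gives serde a fixed layout to fill in, while parsing into serde_json::Value builds a whole tree of enum nodes with lots of small, variably sized allocations.

    use serde::Deserialize;

    #[derive(Deserialize)]
    struct Payload {
        id: u64,
        name: String,
    }

    fn main() -> Result<(), serde_json::Error> {
        let json = r#"{"id": 1, "name": "example"}"#;

        // Typed parse: fields land in a known layout, few allocations.
        let typed: Payload = serde_json::from_str(json)?;

        // Generic parse: a tree of Value nodes, many allocations of
        // varying sizes, which is much harder on the allocator.
        let generic: serde_json::Value = serde_json::from_str(json)?;

        println!("{} {}", typed.id, generic["name"]);
        Ok(())
    }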


Heap fragmentation often comes from allocating objects with different lifetimes at the same time on the same pages. Parsing is a common case of this because you allocate the whole object tree then only keep some of it.


Not in the context of an HTTP server, as we just parse, use it in the request, and then return (freeing all the memory). I think the problem is that we have multiple requests being handled in tandem and Rust doesn't know it's probably better off allocating all of a request's data together and then freeing it as one big block.

That's what's nice about jemalloc, it has a more generic algorithm for reusing allocated blocks.


Well, it's good as a drop in solution but an arena allocator or malloc zones would really be best.


The gold standard for avoiding fragmentation is what jemalloc does, that is, only allocating objects of similar size from the same chunk of memory. Instead of a single global heap there exists a pool for every valid object size (and to keep the numbers low, object sizes are rounded up to some set of buckets).

This means that there is more memory wasted for small programs, but as memory use grows the wastage caused by this remains constant and allocation and deallocation will always remain fast.
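The rounding works roughly like this (illustrative numbers only, not jemalloc's actual size classes):

    // Round a requested size up to the nearest size class so that freed
    // slots can be reused by any later allocation of the same class.
    fn size_class(requested: usize) -> usize {
        const CLASSES: &[usize] = &[8, 16, 32, 48, 64, 96, 128, 192, 256];
        CLASSES
            .iter()
            .copied()
            .find(|&c| c >= requested)
            .unwrap_or_else(|| requested.next_power_of_two())
    }

    fn main() {
        assert_eq!(size_class(20), 32);
        assert_eq!(size_class(100), 128);
        assert_eq!(size_class(300), 512);
    }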


This isn't good enough, because the size of an allocation says nothing about its lifetime. If you know lifetimes or types then you can segregate on those, and it does help.

(It does help in that if you have fixed size slabs, you can't waste space on that page, but you can still waste the entire page.)


Any thoughts about snmalloc as an alternative? Can't quite recall the differences but seemed to look good on benchmarks.


Not OP, but I've tried it in a handful of projects and haven't seen a measurable improvement to performance, memory utilization, or heap fragmentation. That said, it's easy to try out different allocators; I recommend giving it a shot for your specific workload, because IME they tend to perform very differently in different applications.


was curious about

> When services exit abruptly, this can lower availability... which is bad for business.

interesting company

> Do you offer SLAs? We offer uptime SLAs of 99.999% for our enterprise customers, 99.99% for our business tier customers, and 99.9% for our startup tier users.

99.999 is miserable to support. How do you even get the granularity to measure, or the freedom to try anything? I don't think cloud services even provide SLAs like that for their services.

>Can you handle our scale? We process billions of webhooks a year for our customers,

1 billion a year would average out to about 32 a second. If it's 10 billion? 320/second. Surprising to see Rust for this; GC'd languages should be able to handle it.

Would have been interesting to see how much fragmentation there was. Reads like there was a debugging step missing.


> 99.999 is miserable to support. How do you even get the granularity to measure, or the freedom to try anything? I don't think cloud services even provide SLAs like that for their services.

A lot of redundancies and testing.

> 1 billion a year would average out to about 32 a second. If it's 10 billion? 320/second. Surprising to see Rust for this; GC'd languages should be able to handle it.

This assumes even distribution, though traffic is much much more spiky than this. We started with Python and switched to Rust. FWIW, Rust is great for many other reasons, not just the efficiency. For example, we absolutely love the type system.

> Would have been interesting to see how much fragmentation there was. Reads like there was a debugging step missing.

I'm not the engineer who did the investigation, so I can't comment directly on what he checked. Maybe it's not written there, but he also measured allocator statistics and indeed saw that the memory used according to the OS was much higher than what the allocator stats said.


Oh, maybe I missed it, but I didn't think that info was there. Would have been interesting to see the extent of the fragmentation or what the cause was.

Five 9s of uptime means roughly 860ms of downtime a day. Isn't this in a cloud? A load balancer health check can't react that fast, and VMs will get randomly killed if something is wrong.


Yet again, averages don't always represent real life accurately.


They're usually close enough, especially for estimating needs. 1 billion/year averages out to about 32/second like I said before. You think they're bursting to 1,000? Maybe I could believe that. 10,000? 100,000/second? It's unlikely. This is all still scale that can be handled by stable, GC'd languages.


Speed is not the only advantage of using Rust on the backend.


The SLA is monthly, not daily.


That's still only about 30 seconds a month, i.e. fewer than 1 in 100,000 requests ever failing. I'm surprised the company finds that worth the engineering cost of attempting to provide something like that.



