I have a fair few years of BEAM development under my belt, but in moving most of my work over to Pony, I missed a lot of the creature comforts that the BEAM provided. What I miss most from that ecosystem is the interactive shell that lets you examine your application while it is running.
So, I’m building it.
To do this, we need:
This post describes how I built a package to do per-object memory instrumentation akin to the BEAM’s erts_debug:size/1 and friends. On the way I’ll give a short tour of how the Pony allocator does its thing, and how it does it safely and quickly.
Safety Warning
This package reads the Pony runtime’s private internal data structures, and that is inherently dangerous. It is not safe. It is not stable. It must never go into a production build.
If you choose to use this package, please be aware that it (ab)uses some of Pony’s internal structures and other implementation details that are NOT considered a part of the API. In other words, after a version update it could very easily work perfectly, return incorrect data, or crash so badly your pet rabbit becomes pregnant again. This is a tool sane people could run while debugging, while insane people YOLO in prod.
As cute as they are, I regret to inform you that I can’t help you rehouse any baby rabbits that result from using this package.
Pony’s Existing Instrumentation
Pony’s raison d’être is Correctness > Performance > Everything Else
Therefore, Pony’s existing instrumentation is disabled by default. To get these values, you have to compile your own version of ponyc with certain compile-time options enabled. It can measure how much memory an actor has requested and how much it has allocated on its heap. What I really want is finer granularity: I want to know per object.
Plus, I wanted to avoid needing a special build. Using a stock ponyc and runtime, just a package.
Memory Allocation Primer
Note: I am going to explicitly describe how allocation works under 64-bit Linux. If you’re on any other OS/architecture, the principles are the same but the numbers and alloc functions may differ.
- Memory allocation is expensive: It’s slow. Allocating and de-allocating memory via mmap and friends kills performance.
- Objects in memory need to be aligned: When objects are written to memory they are sized and positioned on powers of two. This means that there is almost always “wasted space”. We trade a little wasted memory for efficiency in reading / writing in single operations.
- Smaller Allocations are significantly more frequent than large: One size does not fit all. Different sizes of allocations use different strategies, even in glibc’s malloc.
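To put a number on the "wasted space" point, here is a minimal C sketch. It is not code from the Pony runtime; `round_up_pow2` and `wasted_bytes` are names I made up for illustration. It rounds a request up to the next power of two and reports the padding that results.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: smallest power of two that is >= size. */
size_t round_up_pow2(size_t size)
{
  /* Guard against 0 and against values too large to round up. */
  assert(size > 0 && size <= SIZE_MAX / 2 + 1);

  size_t p = 1;
  while(p < size)
    p <<= 1;

  return p;
}

/* Hypothetical helper: padding left over after rounding up. */
size_t wasted_bytes(size_t size)
{
  return round_up_pow2(size) - size;
}
```

With this model a 40-byte request occupies 64 bytes and wastes 24 of them, which is the trade the second bullet describes.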
Pony’s Memory Allocator
I’m going to <handwave> Pony’s “pool allocator”</handwave>. Just model it in your mind as a way for an actor to get a chunk of memory that could have been recycled from elsewhere.
Almost every high-performance program above a certain size maintains its own allocator. Pony’s allocation strategy branches into three size-based buckets:
flowchart TD
A["Allocation request
(size bytes)"] --> B{"size ≤ 512?"}
B -->|yes| C["Small tier
slotted into a shared 1 KB chunk
size class: 32 / 64 / 128 / 256 / 512"]
B -->|no| D{"size ≤ 1 MB?"}
D -->|yes| E["Large tier
its own pooled chunk"]
D -->|no| F["Huge tier
straight to the OS via mmap"]
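The decision in the flowchart can be sketched in a few lines of C. This is my own toy model rather than the runtime's code: the thresholds are the ones from the diagram, and `classify`, `small_slot_size` and the `TIER_*` names are invented for this post.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; thresholds copied from the flowchart. */
typedef enum { TIER_SMALL, TIER_LARGE, TIER_HUGE } tier_t;

#define SMALL_MAX ((size_t)512)
#define LARGE_MAX ((size_t)1024 * 1024)

/* Pick the tier that would serve a request of `size` bytes. */
tier_t classify(size_t size)
{
  if(size <= SMALL_MAX)
    return TIER_SMALL;

  if(size <= LARGE_MAX)
    return TIER_LARGE;

  return TIER_HUGE;
}

/* Slot size for a small request: 32, 64, 128, 256 or 512. */
size_t small_slot_size(size_t size)
{
  assert(size <= SMALL_MAX);

  size_t slot = 32;
  while(slot < size)
    slot <<= 1;

  return slot;
}
```

For example, `classify(513)` lands in the large tier, while `small_slot_size(40)` returns 64.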
Small Allocator
By far the most common. Almost every object created is represented by a struct that fits here. This memory has the most “churn”.
Pony snags a 1 KB chunk and carves it up into slots, from 32 bytes (32 slots) to 512 bytes (2 slots). A 40-byte allocation goes into a 64-byte slotted chunk, and a 400-byte allocation goes into a 512-byte slotted chunk. Which slots in a chunk are active versus available is stored in a bitmap.
Chunks are threaded onto per-size-class linked lists hanging off the actor’s heap — small_free[] for chunks with room left, small_full[] for the ones that have filled up. Finding a slot to allocate is a simple and exceptionally fast heap->size->free lookup.
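Here is a toy C model of a slotted chunk, to show why a bitmap makes slot bookkeeping cheap. Everything in it (`toy_chunk_t`, `toy_chunk_alloc` and so on) is an assumption made for illustration; the real runtime structures differ and also track the chunk's memory, which this sketch leaves out.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_BYTES 1024u

/* Hypothetical model of one slotted 1 KB chunk. */
typedef struct toy_chunk_t
{
  uint32_t slot_size;  /* 32, 64, 128, 256 or 512 */
  uint32_t free_slots; /* bit i set means slot i is available */
} toy_chunk_t;

void toy_chunk_init(toy_chunk_t* c, uint32_t slot_size)
{
  assert(slot_size >= 32 && slot_size <= 512);
  assert((slot_size & (slot_size - 1)) == 0);

  uint32_t n = CHUNK_BYTES / slot_size; /* 32 slots down to 2 */
  c->slot_size = slot_size;
  /* Shifting a 32-bit value by 32 is undefined, so special-case it. */
  c->free_slots = (n == 32) ? UINT32_MAX : ((UINT32_C(1) << n) - 1);
}

/* Claim the lowest free slot; returns its index, or -1 if the chunk is full. */
int toy_chunk_alloc(toy_chunk_t* c)
{
  for(int i = 0; i < 32; i++)
  {
    uint32_t bit = UINT32_C(1) << i;
    if(c->free_slots & bit)
    {
      c->free_slots &= ~bit;
      return i;
    }
  }
  return -1;
}

/* Mark a slot as available again. */
void toy_chunk_free(toy_chunk_t* c, int slot)
{
  assert(slot >= 0 && slot < (int)(CHUNK_BYTES / c->slot_size));
  c->free_slots |= UINT32_C(1) << slot;
}

/* A full chunk is one that would move from the free list to the full list. */
int toy_chunk_is_full(const toy_chunk_t* c)
{
  return c->free_slots == 0;
}
```

A 512-byte chunk hands out slots 0 and 1 and then reports full, at which point the allocator would move it from the free list to the full list for that size class.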
Large Allocations
No sharing, each gets its <handwave> own chunk sized to a power of two </handwave>. As these are not slotted, they are directly acquired and released to the pool allocator.
<handwave> Chunks are in a linked list </handwave>
Huge Allocations
A huge allocation falls straight through to mmap and is threaded onto the same large linked list described above.
Measuring A Whole Actor’s Heap
flowchart LR
subgraph H["actor heap_t"]
direction TB
SF["small_free[HEAP_SIZECLASSES]"]
SU["small_full[HEAP_SIZECLASSES]"]
LG["large"]
end
SF -. "64B class head" .-> A1
A1["chunk
(2 slots free)"] --> A2["chunk
(5 slots free)"] --> AN(("NULL"))
SU -. "64B class head" .-> B1
B1["chunk
(0 slots free)"] --> BN(("NULL"))
LG --> C1["large chunk"] --> C2["large chunk"] --> CN(("NULL"))
By simply walking all 11 linked lists, we can calculate the memory utilization in an Actor’s heap.
But here’s the issue: it is only safe to do this to yourself. Yes, in theory, if you have a reference to another actor you could reach in and read its heap. BUT, if that heap is modified while you are reading, it’s baby bunny time. It is safe(r)™ to read our own heap, as we control our actor’s thread of execution mid-behaviour, so no changes to the heap can be made. Note that we have to do all the math in C: if we did the analysis in Pony, we would be modifying the heap as we went, spawning even more bunnies.
We never cross an actor boundary.
We do all our work in C-FFI.
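The walk itself is nothing exotic. Below is a simplified C sketch of the idea, using stand-in types (`list_chunk_t`, `toy_heap_t`, `toy_usage_t`) that I invented to mirror the diagram; the real heap and chunk layouts are private to the runtime and look different. The point to notice is that the walk only reads and only uses stack variables, so it never allocates on the heap it is measuring.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SIZECLASSES 5 /* 32, 64, 128, 256, 512 */

/* Hypothetical stand-in for a chunk on one of the lists. */
typedef struct list_chunk_t
{
  size_t bytes_in_use;   /* bytes handed out to live objects */
  size_t bytes_reserved; /* bytes the chunk occupies in total */
  struct list_chunk_t* next;
} list_chunk_t;

/* Hypothetical stand-in for the actor heap: 5 + 5 + 1 = 11 lists. */
typedef struct toy_heap_t
{
  list_chunk_t* small_free[TOY_SIZECLASSES];
  list_chunk_t* small_full[TOY_SIZECLASSES];
  list_chunk_t* large;
} toy_heap_t;

typedef struct toy_usage_t
{
  size_t in_use;
  size_t reserved;
  size_t chunks;
} toy_usage_t;

/* Add every chunk on one list to the running totals. */
static void walk_list(const list_chunk_t* c, toy_usage_t* out)
{
  for(; c != NULL; c = c->next)
  {
    out->in_use += c->bytes_in_use;
    out->reserved += c->bytes_reserved;
    out->chunks++;
  }
}

/* Read-only walk over all 11 lists; no allocation happens here. */
toy_usage_t toy_heap_usage(const toy_heap_t* heap)
{
  assert(heap != NULL);

  toy_usage_t u = {0, 0, 0};
  for(size_t i = 0; i < TOY_SIZECLASSES; i++)
  {
    walk_list(heap->small_free[i], &u);
    walk_list(heap->small_full[i], &u);
  }
  walk_list(heap->large, &u);

  return u;
}
```

Returning the totals by value keeps the result on the caller's stack, which matters when the thing being measured is the caller's own heap.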
Measuring a single data structure
Actor-level numbers are useful, but the main driver of course was per-object measurement.
Pony data structures can of course take arbitrary shapes, have arbitrary depths, and, most fun of all, have circular references. That seems like a really hard problem to solve correctly until we look at Pony’s source code for ~~cut and paste~~ inspiration:
/// Describes a type to the runtime.
typedef const struct _pony_type_t
{
  uint32_t id;                // unique type id (used for pattern matching / type tests)
  uint32_t size;              // instance size in bytes
  uint32_t field_count;       // number of fields (used by the tracing GC)
  uint32_t field_offset;      // byte offset where the traced fields begin
  void* instance;             // singleton instance, for types that have one (e.g. primitives)
  pony_trace_fn trace;        // GC trace function for this type's fields
  pony_dispatch_fn dispatch;  // actor message dispatch (NULL for non-actors)
  pony_final_fn final;        // finaliser, or NULL
  uint32_t event_notify;      // field index of the ASIO event notifier, or -1
  bool might_reference_actor; // GC optimisation: skip if it can't reach an actor
  uintptr_t** traits;         // trait ids this type implements (for trait-based matching)
  void* fields;               // field descriptor table (field type descriptors)
  void* vtable;               // method vtable; variable-length, must be last
} pony_type_t;
Won’t you look at that. The trace field (a pony_trace_fn) already exists and gives every type a function that traces all of its fields. The GC already knows how to do this, so we simply ~~copy~~ become inspired by the existing GC logic and, instead of doing GC “stuff”, do our accounting. It will navigate the object graph for us. We don’t need to know how; we just reuse what’s already there. Whatever the GC would consider to be a part of the object graph, we enumerate.
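To show the shape of the idea without touching the real runtime, here is a self-contained toy in C. None of these names (`toy_type_t`, `acct_visit`, `acct_size_of`, `toy_node_t`) exist in Pony; they are assumptions standing in for the type descriptor and trace callback. Each type supplies a trace function that reports its reference fields, and the accounting visitor keeps a "seen" list so a circular reference is counted once and does not loop forever.

```c
#include <assert.h>
#include <stddef.h>

typedef struct acct_t acct_t;
typedef void (*toy_trace_fn)(acct_t* acct, void* obj);

/* Hypothetical, cut-down type descriptor: just size and trace. */
typedef struct toy_type_t
{
  size_t size;        /* instance size in bytes */
  toy_trace_fn trace; /* NULL when the type has no reference fields */
} toy_type_t;

#define ACCT_MAX_SEEN 64

/* Accounting state lives on the caller's stack; nothing is heap allocated. */
struct acct_t
{
  void* seen[ACCT_MAX_SEEN];
  size_t seen_count;
  size_t total_bytes;
};

/* Called for the root and, by trace functions, for each reference field. */
void acct_visit(acct_t* acct, void* p)
{
  if(p == NULL)
    return;

  /* Already counted: this is what makes cycles terminate. */
  for(size_t i = 0; i < acct->seen_count; i++)
    if(acct->seen[i] == p)
      return;

  assert(acct->seen_count < ACCT_MAX_SEEN);
  acct->seen[acct->seen_count++] = p;

  /* In this toy every object starts with a pointer to its descriptor. */
  const toy_type_t* type = *(const toy_type_t* const*)p;
  acct->total_bytes += type->size;

  if(type->trace != NULL)
    type->trace(acct, p);
}

/* Total instance bytes reachable from root, each object counted once. */
size_t acct_size_of(void* root)
{
  acct_t acct = {{NULL}, 0, 0};
  acct_visit(&acct, root);
  return acct.total_bytes;
}

/* Example type: a linked node with one reference field. */
typedef struct toy_node_t
{
  const toy_type_t* type;
  struct toy_node_t* next;
} toy_node_t;

void toy_node_trace(acct_t* acct, void* obj)
{
  acct_visit(acct, ((toy_node_t*)obj)->next);
}

const toy_type_t toy_node_type = {sizeof(toy_node_t), toy_node_trace};
```

Two nodes that point at each other come out as exactly `2 * sizeof(toy_node_t)`. Note that this toy sums declared instance sizes; a fuller version would presumably also look up the slot each object actually occupies.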
In Conclusion
In writing this I did find some runtime optimizations. Specifically I noticed that when people wrote code like this:
var buf = recover String end
for f in /* some loop here */ do
  /* some stuff */
  buf.push(/* some character */)
end
Since small Pony allocations had a minimum size of 32 bytes, and String grew its buffer in powers of two starting from 2, building a 16-character string cost five separate 32-byte allocations totalling 160 bytes. Four of those were unnecessary: the very first 32-byte allocation already had room for all of it.
| String “Size” | String “_alloc” | Small allocation (bytes) | Total allocated (bytes) |
|---|---|---|---|
| 0 | 2 | 32 new | 32 |
| 1 | 2 | 32 | 32 |
| 2 | 4 | 32 new | 64 |
| 3 | 4 | 32 | 64 |
| 4 | 8 | 32 new | 96 |
| 5 | 8 | 32 | 96 |
| 6 | 8 | 32 | 96 |
| 7 | 8 | 32 | 96 |
| 8 | 16 | 32 new | 128 |
| 9 | 16 | 32 | 128 |
| 10 | 16 | 32 | 128 |
| 11 | 16 | 32 | 128 |
| 12 | 16 | 32 | 128 |
| 13 | 16 | 32 | 128 |
| 14 | 16 | 32 | 128 |
| 15 | 16 | 32 | 128 |
| 16 | 32 | 32 new | 160 |
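The table can be reproduced with a small simulation. This is a simplified model in C, not the String implementation: `simulate_growth` and `slot_for` are hypothetical names, the 32-byte minimum comes from the text above, and the "grow when size + 1 exceeds the capacity" rule is my reading of the table (one spare byte for a terminator).

```c
#include <assert.h>
#include <stddef.h>

#define MIN_SLOT ((size_t)32)

typedef struct growth_t
{
  size_t allocations; /* how many separate allocations were made */
  size_t total_bytes; /* cumulative bytes handed out across all of them */
} growth_t;

/* Bytes actually handed out for a buffer of `capacity` bytes in this model. */
size_t slot_for(size_t capacity)
{
  size_t slot = MIN_SLOT;
  while(slot < capacity)
    slot <<= 1;
  return slot;
}

/* Grow a buffer one byte at a time up to final_size, doubling when full. */
growth_t simulate_growth(size_t initial_capacity, size_t final_size)
{
  assert(initial_capacity > 0);

  size_t capacity = initial_capacity;
  growth_t g = {1, slot_for(capacity)}; /* the initial allocation */

  for(size_t size = 0; size <= final_size; size++)
  {
    /* Assumption: one spare byte is always kept beyond `size`. */
    while(size + 1 > capacity)
    {
      capacity *= 2;
      g.allocations++;
      g.total_bytes += slot_for(capacity);
    }
  }

  return g;
}
```

Starting from a capacity of 2, `simulate_growth(2, 16)` reports 5 allocations and 160 bytes, matching the last row of the table. Starting from a hypothetical capacity of 32, `simulate_growth(32, 16)` reports a single 32-byte allocation.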
Fixed in PR#5518
How much of a difference did that make?
In one of my production applications that parses stupidly large JSON files, it reduced the maximum RSS for the process from 3.2G to 1.8G.
Substantial.
What’s next?
Now we have all the pieces for remote instrumentation of applications. Watch this space!