We Forked Our Serialization Library. It Was 21% Slower Than It Needed to Be.


Photo by Caleb Jones on Unsplash

When you're processing 18 million records per second, everything matters. Every allocation. Every interface conversion. Every byte copied.

Arc uses MessagePack as its binary serialization format for WAL writes and HTTP ingestion. The Go implementation we use — https://github.com/vmihailenco/msgpack — is the most popular msgpack library in the Go ecosystem. It's well-designed, feature-rich, and has been battle-tested across thousands of projects.

But when we started profiling Arc's hot path, something jumped out.

The Profiler Doesn't Lie

Arc's ingestion pipeline is straightforward: receive an HTTP body, deserialize it from msgpack, validate the data, and write it to the WAL. The msgpack step should be fast — it's a binary format, after all. No JSON parsing, no string escaping, no Unicode normalization.

But pprof told a different story.

Every call to msgpack.Unmarshal() was allocating a bytes.NewReader and wrapping it in a bufio.NewReader. Every single call. Even though the library pools its Decoder objects via sync.Pool, the reader it wraps around your input is freshly allocated every time.

On the encode side, msgpack.Marshal() was allocating a new bytes.Buffer per call. And WriteByte() — called for every nil, bool, fixint, and type code in the msgpack format — was heap-allocating a []byte{c} slice on every invocation.

At 37,000 Unmarshal calls per second with 500-record batches, these allocations add up fast.

Why Fork Instead of Contributing Upstream?

The honest answer: we tried.

The upstream repository has 35 open issues, some dating back years. Pull requests sit without review. The maintainer has moved on to other projects — which is completely understandable. Open source maintenance is thankless work, and we respect that vmihailenco/msgpack has served the Go community well for a long time.

But we needed changes now. Not in six months. Not after a review cycle that might never happen. Arc is a commercial product with customers who care about throughput and memory efficiency.

So we forked it: https://github.com/Basekick-Labs/msgpack.

What We Changed: Decode Path

Zero-Allocation Byte-Slice Reader

The original Unmarshal() creates a bytes.NewReader (which allocates) and wraps it in a bufio.NewReader (which allocates again, plus a 4KB internal buffer). We replaced both with a byteSliceReader — a struct that holds a pointer to the input []byte and tracks a read offset. No allocations. No copies.

*interface{} Fast Path

Arc's most common decode pattern is msgpack.Unmarshal(body, &result) where result is an interface{}. The original code goes through reflect.ValueOf(), does a sync.Map lookup for the encoder function, and then calls back into the same decode logic anyway. We added a type-switch check for *interface{} that skips all the reflection machinery.
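The shape of that fast path looks like this (a sketch, not the fork's code — the decode helpers here are stand-ins):

```go
package main

import "fmt"

// decodeAny stands in for the decoder's generic "decode next value" step.
// For the sketch it just returns the input length.
func decodeAny(data []byte) interface{} {
	return len(data)
}

// decodeViaReflection stands in for the reflect.ValueOf + sync.Map path.
func decodeViaReflection(data []byte, v interface{}) error {
	return fmt.Errorf("slow path (elided in this sketch)")
}

// unmarshalSketch checks for *interface{} with a plain type switch
// before touching reflection at all.
func unmarshalSketch(data []byte, v interface{}) error {
	switch out := v.(type) {
	case *interface{}:
		// Fast path: no reflect.ValueOf, no encoder-function lookup.
		*out = decodeAny(data)
		return nil
	default:
		return decodeViaReflection(data, v)
	}
}

func main() {
	var result interface{}
	_ = unmarshalSketch([]byte{0x01, 0x02}, &result)
	fmt.Println(result)
}
```

A type switch on a concrete pointer type compiles to a cheap type-word comparison, which is why skipping the reflection machinery pays off on every call.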

Goroutine Leak Fix in Value Pools

The upstream uses a goroutine-per-type pattern for caching decoded values — spawning a goroutine with a channel for each type encountered. This leaks goroutines and introduces channel synchronization overhead on the decode hot path. We replaced it with a straightforward sync.Pool, which is what Go provides for exactly this use case.
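The replacement is just the standard pattern (a sketch — the pooled value type here is an assumption):

```go
package main

import (
	"fmt"
	"sync"
)

// A plain sync.Pool replaces the goroutine-plus-channel cache:
// no goroutine per type, no channel synchronization on the hot path.
var valuePool = sync.Pool{
	New: func() interface{} {
		s := make([]interface{}, 0, 16)
		return &s
	},
}

func getValues() *[]interface{} {
	return valuePool.Get().(*[]interface{})
}

func putValues(s *[]interface{}) {
	*s = (*s)[:0] // reset length, keep capacity for the next borrower
	valuePool.Put(s)
}

func main() {
	s := getValues()
	*s = append(*s, "a", "b")
	fmt.Println(len(*s))
	putValues(s)
}
```

sync.Pool also integrates with the garbage collector, so idle cached values are eventually reclaimed instead of being pinned by a parked goroutine.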

Recording Buffer Pool

Unmarshaler.UnmarshalMsgpack needs to buffer the raw msgpack bytes before passing them to the user's implementation. The upstream allocates a fresh []byte every time. We pool these buffers via sync.Pool and drop oversized ones (>4KB) to prevent memory leaks from occasional large payloads.
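A sketch of that pool (the 4KB cap comes from the text above; the 512-byte initial size is an assumption):

```go
package main

import (
	"fmt"
	"sync"
)

const maxPooledBuf = 4096 // anything larger is dropped, not pooled

var bufPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 0, 512)
		return &b
	},
}

func getBuf() *[]byte {
	return bufPool.Get().(*[]byte)
}

// putBuf returns a buffer to the pool unless it grew past the cap,
// so one occasional large payload can't pin megabytes of memory.
func putBuf(b *[]byte) {
	if cap(*b) > maxPooledBuf {
		return // let the GC take oversized buffers
	}
	*b = (*b)[:0]
	bufPool.Put(b)
}

func main() {
	b := getBuf()
	*b = append(*b, []byte("raw msgpack bytes")...)
	fmt.Println(len(*b))
	putBuf(b)
}
```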

Inline Nil Checks for Byte-Slice Reader

hasNilCode() is called on every field during struct decoding to check for nil values. The upstream implementation does two interface method calls (ReadByte + UnreadByte) per check. When the input is a byte slice (the Unmarshal path), we peek directly at the underlying data — no interface dispatch, no virtual calls.
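Reusing the byteSliceReader shape from above, the peek collapses to a bounds check and an index (a sketch; the fork's method set is larger):

```go
package main

import "fmt"

const nilCode = 0xc0 // the msgpack nil type code

type byteSliceReader struct {
	data []byte
	off  int
}

// hasNilCode peeks directly at the underlying data instead of doing
// ReadByte + UnreadByte through an interface: no dispatch, no state
// to undo, and an empty input is simply "not nil".
func (r *byteSliceReader) hasNilCode() bool {
	return r.off < len(r.data) && r.data[r.off] == nilCode
}

func main() {
	r := &byteSliceReader{data: []byte{nilCode, 0x01}}
	fmt.Println(r.hasNilCode())
}
```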

Microbenchmarks

Benchmark            Before        After         Change
Unmarshal (time)     358 ns/op     283 ns/op     -21%
Unmarshal (memory)   160 B/op      80 B/op       -50%
Unmarshal (allocs)   6 allocs/op   4 allocs/op   -33%

What We Changed: Encode Path

Pooled Byte Buffer

Instead of allocating a bytes.Buffer on every Marshal() call, we embedded a reusable []byte buffer directly into the pooled Encoder struct. Since encoders are returned to sync.Pool between calls, the buffer capacity persists across invocations. A byteSliceWriter wraps the buffer and implements the writer interface with zero allocations.
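The writer side is symmetric to the decode-path reader (a sketch; the type name matches the post, the details are assumptions):

```go
package main

import "fmt"

// byteSliceWriter appends into a buffer that lives on the pooled
// Encoder. Write and WriteByte never allocate beyond the buffer's
// own growth, and that growth is amortized across Marshal calls.
type byteSliceWriter struct {
	buf []byte
}

func (w *byteSliceWriter) Write(p []byte) (int, error) {
	w.buf = append(w.buf, p...)
	return len(p), nil
}

func (w *byteSliceWriter) WriteByte(c byte) error {
	w.buf = append(w.buf, c)
	return nil
}

// Reset keeps capacity across calls — which is what makes returning
// the Encoder to sync.Pool pay off.
func (w *byteSliceWriter) Reset() {
	w.buf = w.buf[:0]
}

func main() {
	w := &byteSliceWriter{}
	_ = w.WriteByte(0xc0)
	_, _ = w.Write([]byte{0x01, 0x02})
	fmt.Println(len(w.buf))
}
```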

MarshalAppend API

We added MarshalAppend(dst, v) — a new API that appends encoded bytes to a caller-provided buffer. This eliminates the final make+copy that Marshal() performs to return a fresh []byte. For callers who reuse buffers (like Arc's WAL writer), this is a significant win: -26% latency, -94% memory.
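The contract is easiest to see with one concrete case. This sketch encodes a msgpack fixstr (length < 32: type code 0xa0|len, then the raw bytes) in append style — it illustrates the MarshalAppend contract, not the library's encoder:

```go
package main

import "fmt"

// appendFixstr extends the caller's dst and returns it. When dst has
// spare capacity, the encode lands in the existing backing array —
// no final make+copy to produce the result.
func appendFixstr(dst []byte, s string) []byte {
	dst = append(dst, 0xa0|byte(len(s)))
	return append(dst, s...)
}

func main() {
	// Reuse one buffer across encodes, as a WAL writer would.
	buf := make([]byte, 0, 64)
	for _, s := range []string{"cpu", "mem"} {
		buf = buf[:0]
		buf = appendFixstr(buf, s)
		fmt.Printf("% x\n", buf)
	}
}
```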

WriteByte Scratch Fix

For the streaming encoder path (when users call NewEncoder(w) with a custom writer), the original byteWriter shim allocated a []byte{c} on every WriteByte() call. We replaced it with a [1]byte scratch field on the struct — same semantics, zero allocations.

Encode Type-Switch Fast Paths

Arc's most common Marshal payloads are map[string]interface{}, map[string]string, and []interface{}. These types weren't in the Encode() type-switch, so they fell through to reflect.ValueOf() followed by a sync.Map lookup for the appropriate encoder function. Adding three cases to the type-switch eliminates all that overhead. The map[string]string fast path alone is 15% faster.
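Here is what reflection-free encoding looks like for the simplest of those cases, map[string]string — a sketch restricted to small maps (< 16 entries, strings < 32 bytes) so the fixmap/fixstr codes apply; the real encoder handles all sizes:

```go
package main

import "fmt"

// appendMapStringString encodes a small map[string]string directly:
// fixmap header (0x80|size), then alternating fixstr keys and values.
// No reflect.ValueOf, no sync.Map lookup.
func appendMapStringString(dst []byte, m map[string]string) []byte {
	dst = append(dst, 0x80|byte(len(m)))
	for k, v := range m {
		dst = append(dst, 0xa0|byte(len(k)))
		dst = append(dst, k...)
		dst = append(dst, 0xa0|byte(len(v)))
		dst = append(dst, v...)
	}
	return dst
}

func main() {
	out := appendMapStringString(nil, map[string]string{"host": "db1"})
	fmt.Printf("% x\n", out)
}
```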

Two-Pass OmitEmpty

The upstream OmitEmpty allocates a []*field slice on every struct encode to hold the filtered fields. But in Arc's analytical workloads, most records have all fields populated — meaning the filter is a no-op that allocates for nothing. We added a two-pass approach: first count how many fields survive, and if all do, return the original slice with zero allocations. When fields are actually omitted, we pool the filtered slice via sync.Pool.
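The two passes, sketched (the field type here is a stand-in for the library's metadata):

```go
package main

import "fmt"

type field struct {
	name  string
	empty bool // stand-in for the real omitempty check
}

// filterFields counts survivors first. If every field survives — the
// common case in Arc's workloads — the original slice is returned
// unchanged, with zero allocations.
func filterFields(fields []*field) []*field {
	kept := 0
	for _, f := range fields {
		if !f.empty {
			kept++
		}
	}
	if kept == len(fields) {
		return fields
	}
	// Some fields are omitted; the fork pools this slice via sync.Pool.
	out := make([]*field, 0, kept)
	for _, f := range fields {
		if !f.empty {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	all := []*field{{name: "ts"}, {name: "value"}}
	fmt.Println(len(filterFields(all)))
}
```

The counting pass is branch-predictable and touches memory that's about to be read anyway, so it costs far less than the allocation it avoids.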

Skip reflect.Convert for Exact Types

The upstream calls reflect.Convert() on every map and slice encode to normalize types. But when the value's type already matches the target (e.g., it's literally map[string]string, not a named alias), the conversion is wasted work. We added v.Type() == targetType checks for map[string]string, map[string]bool, map[string]interface{}, and []string. This saves 7-8% on map-heavy benchmarks.

Cache isZeroer Interface Check

The upstream checks whether each struct field implements the isZeroer interface during every OmitEmpty evaluation, which involves v.Interface() boxing. We pre-compute this check once at struct-discovery time and store a boolean flag on the field metadata.

Pool Sorted Map Key Slices

When SetSortMapKeys(true) is enabled, the upstream allocates a []string slice for keys on every map encode. We pool these via sync.Pool, eliminating one allocation per sorted map encode.

Microbenchmarks

Benchmark               Before        After         Change
StructMarshal (time)    461 ns/op     405 ns/op     -12%
StructMarshal (memory)  1456 B/op     1224 B/op     -16%
StructMarshal (allocs)  7 allocs/op   4 allocs/op   -43%
Discard (time)          20 ns/op      8.6 ns/op     -57%
Discard (allocs)        2 allocs/op   0 allocs/op   -100%

Bug Fixes: 15 Issues the Upstream Missed

While optimizing, we found and fixed 15 bugs — some security-critical, some causing panics, all open for years.

Security: OOM from Untrusted Input

decodeSlice() reads the array length from the wire and calls make([]interface{}, 0, n) with no bounds check. A malicious payload claiming an array of 2 billion elements triggers an immediate out-of-memory crash. Same issue for DecodeMap(). We capped both at 1M elements. The actual data is still decoded element-by-element, so legitimate large arrays work fine — the cap only limits the initial pre-allocation.
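The fix is a one-line cap on the capacity hint before it reaches make() (a sketch; the constant and function names are assumptions):

```go
package main

import "fmt"

const maxPreallocElems = 1 << 20 // 1M-element cap on the pre-allocation

// preallocLen caps the length claimed on the wire before it is used
// as a make() capacity. Decoding still proceeds element-by-element,
// so a legitimate array longer than the cap simply grows via append
// as real data arrives — only the attacker-controlled up-front
// allocation is bounded.
func preallocLen(wireLen int) int {
	if wireLen > maxPreallocElems {
		return maxPreallocElems
	}
	return wireLen
}

func main() {
	// A header claiming 2 billion elements no longer drives an
	// immediate multi-gigabyte allocation.
	fmt.Println(preallocLen(2_000_000_000)) // 1048576
	fmt.Println(preallocLen(500))           // 500
}
```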

Security: Allocation Limit That Never Worked

There's a disableAllocLimitFlag that lets users opt out of allocation limits. The flag is defined as 1 << 3 (value 8). But the check was != 1 — since d.flags & 8 yields either 0 or 8, never 1, the allocation limit was never applied for reflect-based slice decoding.
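The bug and the fix, side by side (the flag name and value come from the text above; the flags type is an assumption):

```go
package main

import "fmt"

const disableAllocLimitFlag = 1 << 3 // value 8

// Buggy shape: flags&8 yields 0 or 8, never 1, so "!= 1" is always
// true and the limit was treated as disabled unconditionally.
func allocLimitDisabledBuggy(flags uint32) bool {
	return flags&disableAllocLimitFlag != 1
}

// Fixed: test the bit against zero.
func allocLimitDisabled(flags uint32) bool {
	return flags&disableAllocLimitFlag != 0
}

func main() {
	fmt.Println(allocLimitDisabledBuggy(0))                  // true — the bug
	fmt.Println(allocLimitDisabled(0))                       // false — limit applies
	fmt.Println(allocLimitDisabled(disableAllocLimitFlag))   // true — explicit opt-out
}
```

It's a classic bitmask pitfall: comparing a masked value against 1 instead of against 0 (or against the flag itself).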

Correctness Fixes

Production Results: Arc Benchmarks

After deploying v6.0.0 to Arc's ingestion pipeline, we ran our standard 60-second throughput benchmark. The results speak for themselves:

Metric            Before (upstream v5)   After (v6.0.0)   Change
Avg throughput    16.78M rec/s           18.23M rec/s     +8.6%
p50 latency       0.52ms                 0.47ms           -9.6%
p99 latency       3.72ms                 3.58ms           -3.8%
60s degradation   22%                    13%              -41% relative

The degradation metric is the one we care about most. It measures how much throughput drops over a sustained 60-second burst as GC pressure builds, the CPU caches get cold, and sync.Pool victims get collected. Going from 22% to 13% means Arc holds its peak throughput significantly longer under sustained load.

v6.0.0: A Proper Release

With 15 performance optimizations and 15 bug fixes shipped, we tagged v6.0.0 — https://github.com/Basekick-Labs/msgpack/releases/tag/v6.0.0 — the first stable release of the fork. The module path is now github.com/Basekick-Labs/msgpack/v6, and it's available on the Go module proxy.

To migrate from the upstream:

// Before
import "github.com/vmihailenco/msgpack/v5"

// After
import "github.com/Basekick-Labs/msgpack/v6"

Then pull the module:

go get github.com/Basekick-Labs/msgpack/v6

The API is fully compatible — it's a drop-in replacement with the same types, methods, and behavior (minus the bugs).

What's Next

We're already working on v6.1 with three more optimizations targeting the decode hot path:

  • readCode byteSliceReader fast path — the single most-called function in the decoder reads one byte via interface dispatch. When the input is a byte slice, we can skip the interface and read directly from the underlying array. Early microbenchmarks show -7.5% on struct unmarshal.
  • PeekCode byteSliceReader fast path — same pattern for PeekCode(), which currently does ReadByte + UnreadByte (two interface calls) when a single array index check suffices.
  • Pool OmitEmpty filtered slices — when OmitEmpty does filter out fields, the allocated slice is now returned to a sync.Pool for reuse.

Beyond that, we have a backlog of open performance issues (https://github.com/Basekick-Labs/msgpack/issues?q=is%3Aissue+is%3Aopen+label%3Aperformance), from decode-side type-switch fast paths to zero-copy byte slice decoding to readN inlining. The goal is to push that 60-second degradation curve below 10%.

The fork is open source under the same BSD-2 license. If you're using vmihailenco/msgpack in a performance-sensitive Go application, give the fork a try: https://github.com/Basekick-Labs/msgpack.

Ready to handle billion-record workloads?

Deploy Arc in minutes. Own your data in Parquet. Use for analytics, observability, AI, IoT, or data warehousing.

Get Started ->