We Forked Our Serialization Library. It Was 21% Slower Than It Needed to Be.


Photo by Caleb Jones on Unsplash

When you're processing 18 million records per second, everything matters. Every allocation. Every interface conversion. Every byte copied.

Arc uses MessagePack as its binary serialization format for WAL writes and HTTP ingestion. The Go implementation we use — https://github.com/vmihailenco/msgpack — is the most popular msgpack library in the Go ecosystem. It's well-designed, feature-rich, and has been battle-tested across thousands of projects.

But when we started profiling Arc's hot path, something jumped out.

The Profiler Doesn't Lie

Arc's ingestion pipeline is straightforward: receive an HTTP body, deserialize it from msgpack, validate the data, and write it to the WAL. The msgpack step should be fast — it's a binary format, after all. No JSON parsing, no string escaping, no Unicode normalization.

But pprof told a different story.

Every call to msgpack.Unmarshal() was allocating a bytes.NewReader and wrapping it in a bufio.NewReader. Every single call. Even though the library pools its Decoder objects via sync.Pool, the reader it wraps around your input is freshly allocated every time.

On the encode side, msgpack.Marshal() was allocating a new bytes.Buffer per call. And WriteByte() — called for every nil, bool, fixint, and type code in the msgpack format — was heap-allocating a []byte{c} slice on every invocation.

At 37,000 Unmarshal calls per second with 500-record batches, these allocations add up fast.

Why Fork Instead of Contributing Upstream?

The honest answer: we tried.

The upstream repository has 35 open issues, some dating back years. Pull requests sit without review. The maintainer has moved on to other projects — which is completely understandable. Open source maintenance is thankless work, and we respect that vmihailenco/msgpack has served the Go community well for a long time.

But we needed changes now. Not in six months. Not after a review cycle that might never happen. Arc is a commercial product with customers who care about throughput and memory efficiency.

So we forked it: https://github.com/Basekick-Labs/msgpack.

What We Changed: Decode Path

Zero-Allocation Byte-Slice Reader

The original Unmarshal() creates a bytes.NewReader (which allocates) and wraps it in a bufio.NewReader (which allocates again, plus a 4KB internal buffer). We replaced both with a byteSliceReader — a struct that holds a pointer to the input []byte and tracks a read offset. No allocations. No copies.

*interface{} Fast Path

Arc's most common decode pattern is msgpack.Unmarshal(body, &result) where result is an interface{}. The original code goes through reflect.ValueOf(), does a sync.Map lookup for the encoder function, and then calls back into the same decode logic anyway. We added a type-switch check for *interface{} that skips all the reflection machinery.
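The shape of that fast path looks like this (a sketch, not the fork's code — the decode helpers here are stand-ins):

```go
package main

import "fmt"

// decodeAny stands in for the decoder's generic "decode next value" step.
// For the sketch it just returns the input length.
func decodeAny(data []byte) interface{} {
	return len(data)
}

// decodeViaReflection stands in for the reflect.ValueOf + sync.Map path.
func decodeViaReflection(data []byte, v interface{}) error {
	return fmt.Errorf("slow path (elided in this sketch)")
}

// unmarshalSketch checks for *interface{} with a plain type switch
// before touching reflection at all.
func unmarshalSketch(data []byte, v interface{}) error {
	switch out := v.(type) {
	case *interface{}:
		// Fast path: no reflect.ValueOf, no encoder-function lookup.
		*out = decodeAny(data)
		return nil
	default:
		return decodeViaReflection(data, v)
	}
}

func main() {
	var result interface{}
	_ = unmarshalSketch([]byte{0x01, 0x02}, &result)
	fmt.Println(result)
}
```

A type switch on a concrete pointer type compiles to a cheap type-word comparison, which is why skipping the reflection machinery pays off on every call.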

Goroutine Leak Fix in Value Pools

The upstream uses a goroutine-per-type pattern for caching decoded values — spawning a goroutine with a channel for each type encountered. This leaks goroutines and introduces channel synchronization overhead on the decode hot path. We replaced it with a straightforward sync.Pool, which is what Go provides for exactly this use case.
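The replacement is just the standard pattern (a sketch — the pooled value type here is an assumption):

```go
package main

import (
	"fmt"
	"sync"
)

// A plain sync.Pool replaces the goroutine-plus-channel cache:
// no goroutine per type, no channel synchronization on the hot path.
var valuePool = sync.Pool{
	New: func() interface{} {
		s := make([]interface{}, 0, 16)
		return &s
	},
}

func getValues() *[]interface{} {
	return valuePool.Get().(*[]interface{})
}

func putValues(s *[]interface{}) {
	*s = (*s)[:0] // reset length, keep capacity for the next borrower
	valuePool.Put(s)
}

func main() {
	s := getValues()
	*s = append(*s, "a", "b")
	fmt.Println(len(*s))
	putValues(s)
}
```

sync.Pool also integrates with the garbage collector, so idle cached values are eventually reclaimed instead of being pinned by a parked goroutine.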

Recording Buffer Pool

Unmarshaler.UnmarshalMsgpack needs to buffer the raw msgpack bytes before passing them to the user's implementation. The upstream allocates a fresh []byte every time. We pool these buffers via sync.Pool and drop oversized ones (>4KB) to prevent memory leaks from occasional large payloads.
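A sketch of that pool (the 4KB cap comes from the text above; the 512-byte initial size is an assumption):

```go
package main

import (
	"fmt"
	"sync"
)

const maxPooledBuf = 4096 // anything larger is dropped, not pooled

var bufPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 0, 512)
		return &b
	},
}

func getBuf() *[]byte {
	return bufPool.Get().(*[]byte)
}

// putBuf returns a buffer to the pool unless it grew past the cap,
// so one occasional large payload can't pin megabytes of memory.
func putBuf(b *[]byte) {
	if cap(*b) > maxPooledBuf {
		return // let the GC take oversized buffers
	}
	*b = (*b)[:0]
	bufPool.Put(b)
}

func main() {
	b := getBuf()
	*b = append(*b, []byte("raw msgpack bytes")...)
	fmt.Println(len(*b))
	putBuf(b)
}
```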

Inline Nil Checks for Byte-Slice Reader

hasNilCode() is called on every field during struct decoding to check for nil values. The upstream implementation does two interface method calls (ReadByte + UnreadByte) per check. When the input is a byte slice (the Unmarshal path), we peek directly at the underlying data — no interface dispatch, no virtual calls.
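Reusing the byteSliceReader shape from above, the peek collapses to a bounds check and an index (a sketch; the fork's method set is larger):

```go
package main

import "fmt"

const nilCode = 0xc0 // the msgpack nil type code

type byteSliceReader struct {
	data []byte
	off  int
}

// hasNilCode peeks directly at the underlying data instead of doing
// ReadByte + UnreadByte through an interface: no dispatch, no state
// to undo, and an empty input is simply "not nil".
func (r *byteSliceReader) hasNilCode() bool {
	return r.off < len(r.data) && r.data[r.off] == nilCode
}

func main() {
	r := &byteSliceReader{data: []byte{nilCode, 0x01}}
	fmt.Println(r.hasNilCode())
}
```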

Microbenchmarks

Benchmark            Before        After         Change
Unmarshal (time)     358 ns/op     283 ns/op     -21%
Unmarshal (memory)   160 B/op      80 B/op       -50%
Unmarshal (allocs)   6 allocs/op   4 allocs/op   -33%

What We Changed: Encode Path

Pooled Byte Buffer

Instead of allocating a bytes.Buffer on every Marshal() call, we embedded a reusable []byte buffer directly into the pooled Encoder struct. Since encoders are returned to sync.Pool between calls, the buffer capacity persists across invocations. A byteSliceWriter wraps the buffer and implements the writer interface with zero allocations.
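The writer side is symmetric to the decode-path reader (a sketch; the type name matches the post, the details are assumptions):

```go
package main

import "fmt"

// byteSliceWriter appends into a buffer that lives on the pooled
// Encoder. Write and WriteByte never allocate beyond the buffer's
// own growth, and that growth is amortized across Marshal calls.
type byteSliceWriter struct {
	buf []byte
}

func (w *byteSliceWriter) Write(p []byte) (int, error) {
	w.buf = append(w.buf, p...)
	return len(p), nil
}

func (w *byteSliceWriter) WriteByte(c byte) error {
	w.buf = append(w.buf, c)
	return nil
}

// Reset keeps capacity across calls — which is what makes returning
// the Encoder to sync.Pool pay off.
func (w *byteSliceWriter) Reset() {
	w.buf = w.buf[:0]
}

func main() {
	w := &byteSliceWriter{}
	_ = w.WriteByte(0xc0)
	_, _ = w.Write([]byte{0x01, 0x02})
	fmt.Println(len(w.buf))
}
```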

MarshalAppend API

We added MarshalAppend(dst, v) — a new API that appends encoded bytes to a caller-provided buffer. This eliminates the final make+copy that Marshal() performs to return a fresh []byte. For callers who reuse buffers (like Arc's WAL writer), this is a significant win: -26% latency, -94% memory.
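The contract is easiest to see with one concrete case. This sketch encodes a msgpack fixstr (length < 32: type code 0xa0|len, then the raw bytes) in append style — it illustrates the MarshalAppend contract, not the library's encoder:

```go
package main

import "fmt"

// appendFixstr extends the caller's dst and returns it. When dst has
// spare capacity, the encode lands in the existing backing array —
// no final make+copy to produce the result.
func appendFixstr(dst []byte, s string) []byte {
	dst = append(dst, 0xa0|byte(len(s)))
	return append(dst, s...)
}

func main() {
	// Reuse one buffer across encodes, as a WAL writer would.
	buf := make([]byte, 0, 64)
	for _, s := range []string{"cpu", "mem"} {
		buf = buf[:0]
		buf = appendFixstr(buf, s)
		fmt.Printf("% x\n", buf)
	}
}
```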

WriteByte Scratch Fix

For the streaming encoder path (when users call NewEncoder(w) with a custom writer), the original byteWriter shim allocated a []byte{c} on every WriteByte() call. We replaced it with a [1]byte scratch field on the struct — same semantics, zero allocations.

Encode Type-Switch Fast Paths

Arc's most common Marshal payloads are map[string]interface{}, map[string]string, and []interface{}. These types weren't in the Encode() type-switch, so they fell through to reflect.ValueOf() followed by a sync.Map lookup for the appropriate encoder function. Adding three cases to the type-switch eliminates all that overhead. The map[string]string fast path alone is 15% faster.
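Here is what reflection-free encoding looks like for the simplest of those cases, map[string]string — a sketch restricted to small maps (< 16 entries, strings < 32 bytes) so the fixmap/fixstr codes apply; the real encoder handles all sizes:

```go
package main

import "fmt"

// appendMapStringString encodes a small map[string]string directly:
// fixmap header (0x80|size), then alternating fixstr keys and values.
// No reflect.ValueOf, no sync.Map lookup.
func appendMapStringString(dst []byte, m map[string]string) []byte {
	dst = append(dst, 0x80|byte(len(m)))
	for k, v := range m {
		dst = append(dst, 0xa0|byte(len(k)))
		dst = append(dst, k...)
		dst = append(dst, 0xa0|byte(len(v)))
		dst = append(dst, v...)
	}
	return dst
}

func main() {
	out := appendMapStringString(nil, map[string]string{"host": "db1"})
	fmt.Printf("% x\n", out)
}
```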

Two-Pass OmitEmpty

The upstream OmitEmpty allocates a []*field slice on every struct encode to hold the filtered fields. But in Arc's analytical workloads, most records have all fields populated — meaning the filter is a no-op that allocates for nothing. We added a two-pass approach: first count how many fields survive, and if all do, return the original slice with zero allocations. When fields are actually omitted, we pool the filtered slice via sync.Pool.
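The two passes, sketched (the field type here is a stand-in for the library's metadata):

```go
package main

import "fmt"

type field struct {
	name  string
	empty bool // stand-in for the real omitempty check
}

// filterFields counts survivors first. If every field survives — the
// common case in Arc's workloads — the original slice is returned
// unchanged, with zero allocations.
func filterFields(fields []*field) []*field {
	kept := 0
	for _, f := range fields {
		if !f.empty {
			kept++
		}
	}
	if kept == len(fields) {
		return fields
	}
	// Some fields are omitted; the fork pools this slice via sync.Pool.
	out := make([]*field, 0, kept)
	for _, f := range fields {
		if !f.empty {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	all := []*field{{name: "ts"}, {name: "value"}}
	fmt.Println(len(filterFields(all)))
}
```

The counting pass is branch-predictable and touches memory that's about to be read anyway, so it costs far less than the allocation it avoids.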

Skip reflect.Convert for Exact Types

The upstream calls reflect.Convert() on every map and slice encode to normalize types. But when the value's type already matches the target (e.g., it's literally map[string]string, not a named alias), the conversion is wasted work. We added v.Type() == targetType checks for map[string]string, map[string]bool, map[string]interface{}, and []string. This saves 7-8% on map-heavy benchmarks.

Cache isZeroer Interface Check

The upstream checks whether each struct field implements the isZeroer interface during every OmitEmpty evaluation, which involves v.Interface() boxing. We pre-compute this check once at struct-discovery time and store a boolean flag on the field metadata.

Pool Sorted Map Key Slices

When SetSortMapKeys(true) is enabled, the upstream allocates a []string slice for keys on every map encode. We pool these via sync.Pool, eliminating one allocation per sorted map encode.

Microbenchmarks

Benchmark               Before        After         Change
StructMarshal (time)    461 ns/op     405 ns/op     -12%
StructMarshal (memory)  1456 B/op     1224 B/op     -16%
StructMarshal (allocs)  7 allocs/op   4 allocs/op   -43%
Discard (time)          20 ns/op      8.6 ns/op     -57%
Discard (allocs)        2 allocs/op   0 allocs/op   -100%

Bug Fixes: 15 Issues the Upstream Missed

While optimizing, we found and fixed 15 bugs — some security-critical, some causing panics, all open for years.

Security: OOM from Untrusted Input

decodeSlice() reads the array length from the wire and calls make([]interface{}, 0, n) with no bounds check. A malicious payload claiming an array of 2 billion elements triggers an immediate out-of-memory crash. Same issue for DecodeMap(). We capped both at 1M elements. The actual data is still decoded element-by-element, so legitimate large arrays work fine — the cap only limits the initial pre-allocation.
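The fix is a one-line cap on the capacity hint before it reaches make() (a sketch; the constant and function names are assumptions):

```go
package main

import "fmt"

const maxPreallocElems = 1 << 20 // 1M-element cap on the pre-allocation

// preallocLen caps the length claimed on the wire before it is used
// as a make() capacity. Decoding still proceeds element-by-element,
// so a legitimate array longer than the cap simply grows via append
// as real data arrives — only the attacker-controlled up-front
// allocation is bounded.
func preallocLen(wireLen int) int {
	if wireLen > maxPreallocElems {
		return maxPreallocElems
	}
	return wireLen
}

func main() {
	// A header claiming 2 billion elements no longer drives an
	// immediate multi-gigabyte allocation.
	fmt.Println(preallocLen(2_000_000_000)) // 1048576
	fmt.Println(preallocLen(500))           // 500
}
```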

Security: Allocation Limit That Never Worked

There's a disableAllocLimitFlag that lets users opt out of allocation limits. The flag is defined as 1 << 3 (value 8). But the check was != 1 — since d.flags & 8 yields either 0 or 8, never 1, the allocation limit was never applied for reflect-based slice decoding.
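The bug and the fix, side by side (the flag name and value come from the text above; the flags type is an assumption):

```go
package main

import "fmt"

const disableAllocLimitFlag = 1 << 3 // value 8

// Buggy shape: flags&8 yields 0 or 8, never 1, so "!= 1" is always
// true and the limit was treated as disabled unconditionally.
func allocLimitDisabledBuggy(flags uint32) bool {
	return flags&disableAllocLimitFlag != 1
}

// Fixed: test the bit against zero.
func allocLimitDisabled(flags uint32) bool {
	return flags&disableAllocLimitFlag != 0
}

func main() {
	fmt.Println(allocLimitDisabledBuggy(0))                  // true — the bug
	fmt.Println(allocLimitDisabled(0))                       // false — limit applies
	fmt.Println(allocLimitDisabled(disableAllocLimitFlag))   // true — explicit opt-out
}
```

It's a classic bitmask pitfall: comparing a masked value against 1 instead of against 0 (or against the flag itself).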

Correctness Fixes

Production Results: Arc Benchmarks

After deploying v6.0.0 to Arc's ingestion pipeline, we ran our standard 60-second throughput benchmark. The results speak for themselves:

Metric            Before (upstream v5)   After (v6.0.0)   Change
Avg throughput    16.78M rec/s           18.23M rec/s     +8.6%
p50 latency       0.52ms                 0.47ms           -9.6%
p99 latency       3.72ms                 3.58ms           -3.8%
60s degradation   22%                    13%              -41% relative

The degradation metric is the one we care about most. It measures how much throughput drops over a sustained 60-second burst as GC pressure builds, the CPU caches get cold, and sync.Pool victims get collected. Going from 22% to 13% means Arc holds its peak throughput significantly longer under sustained load.

v6.0.0: A Proper Release

With 15 performance optimizations and 15 bug fixes shipped, we tagged v6.0.0 — https://github.com/Basekick-Labs/msgpack/releases/tag/v6.0.0 — the first stable release of the fork. The module path is now github.com/Basekick-Labs/msgpack/v6, and it's available on the Go module proxy.

To migrate from the upstream:

// Before
import "github.com/vmihailenco/msgpack/v5"

// After
import "github.com/Basekick-Labs/msgpack/v6"

Then pull the module:

go get github.com/Basekick-Labs/msgpack/v6

The API is fully compatible — it's a drop-in replacement with the same types, methods, and behavior (minus the bugs).

What's Next

We're already working on v6.1 with three more optimizations targeting the decode hot path:

  • readCode byteSliceReader fast path — the single most-called function in the decoder reads one byte via interface dispatch. When the input is a byte slice, we can skip the interface and read directly from the underlying array. Early microbenchmarks show -7.5% on struct unmarshal.
  • PeekCode byteSliceReader fast path — same pattern for PeekCode(), which currently does ReadByte + UnreadByte (two interface calls) when a single array index check suffices.
  • Pool OmitEmpty filtered slices — when OmitEmpty does filter out fields, the allocated slice is now returned to a sync.Pool for reuse.

Beyond that, we have a backlog of open performance issues (https://github.com/Basekick-Labs/msgpack/issues?q=is%3Aissue+is%3Aopen+label%3Aperformance), from decode-side type-switch fast paths to zero-copy byte slice decoding to readN inlining. The goal is to push that 60-second degradation curve below 10%.

The fork is open source under the same BSD-2 license. If you're using vmihailenco/msgpack in a performance-sensitive Go application, give the fork a try: https://github.com/Basekick-Labs/msgpack.

Ready to handle billion-record workloads?

Deploy Arc in minutes. Own your data in Parquet. Use for analytics, observability, AI, IoT, or data warehousing.

Get Started ->