We Forked Our Serialization Library. It Was 21% Slower Than It Needed to Be.

When you're processing 18 million records per second, everything matters. Every allocation. Every interface conversion. Every byte copied.
Arc uses MessagePack as its binary serialization format for WAL writes and HTTP ingestion. The Go implementation we use — https://github.com/vmihailenco/msgpack — is the most popular msgpack library in the Go ecosystem. It's well-designed, feature-rich, and has been battle-tested across thousands of projects.
But when we started profiling Arc's hot path, something jumped out.
The Profiler Doesn't Lie
Arc's ingestion pipeline is straightforward: receive an HTTP body, deserialize it from msgpack, validate the data, and write it to the WAL. The msgpack step should be fast — it's a binary format, after all. No JSON parsing, no string escaping, no Unicode normalization.
But pprof told a different story.
Every call to msgpack.Unmarshal() was allocating a bytes.NewReader and wrapping it in a bufio.NewReader. Every single call. Even though the library pools its Decoder objects via sync.Pool, the reader it wraps around your input is freshly allocated every time.
On the encode side, msgpack.Marshal() was allocating a new bytes.Buffer per call. And WriteByte() — called for every nil, bool, fixint, and type code in the msgpack format — was heap-allocating a []byte{c} slice on every invocation.
At 37,000 Unmarshal calls per second with 500-record batches, these allocations add up fast.
Why Fork Instead of Contributing Upstream?
The honest answer: we tried.
The upstream repository has 35 open issues, some dating back years. Pull requests sit without review. The maintainer has moved on to other projects — which is completely understandable. Open source maintenance is thankless work, and we respect that vmihailenco/msgpack has served the Go community well for a long time.
But we needed changes now. Not in six months. Not after a review cycle that might never happen. Arc is a commercial product with customers who care about throughput and memory efficiency.
So we forked it: https://github.com/Basekick-Labs/msgpack.
What We Changed: Decode Path
Zero-Allocation Byte-Slice Reader
The original Unmarshal() creates a bytes.NewReader (which allocates) and wraps it in a bufio.NewReader (which allocates again, plus a 4KB internal buffer). We replaced both with a byteSliceReader — a struct that holds a pointer to the input []byte and tracks a read offset. No allocations. No copies.
*interface{} Fast Path
Arc's most common decode pattern is msgpack.Unmarshal(body, &result) where result is an interface{}. The original code goes through reflect.ValueOf(), does a sync.Map lookup for the encoder function, and then calls back into the same decode logic anyway. We added a type-switch check for *interface{} that skips all the reflection machinery.
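The shape of the fast path looks roughly like this. The helper name `decodeInto` is hypothetical; it stands in for the library's internal dispatch.

```go
package main

import (
	"fmt"
	"reflect"
)

// decodeInto checks for the hot *interface{} case with a type switch
// before falling back to the reflection machinery.
func decodeInto(dst, decoded any) {
	switch p := dst.(type) {
	case *interface{}:
		*p = decoded // fast path: no reflect.ValueOf, no sync.Map lookup
	default:
		// slow-path stand-in for the reflection-based assignment
		reflect.ValueOf(dst).Elem().Set(reflect.ValueOf(decoded))
	}
}

func main() {
	var result interface{}
	decodeInto(&result, map[string]any{"rows": 500})
	fmt.Println(result)
}
```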
Goroutine Leak Fix in Value Pools
The upstream uses a goroutine-per-type pattern for caching decoded values — spawning a goroutine with a channel for each type encountered. This leaks goroutines and introduces channel synchronization overhead on the decode hot path. We replaced it with a straightforward sync.Pool, which is what Go provides for exactly this use case.
Recording Buffer Pool
Unmarshaler.UnmarshalMsgpack needs to buffer the raw msgpack bytes before passing them to the user's implementation. The upstream allocates a fresh []byte every time. We pool these buffers via sync.Pool and drop oversized ones (>4KB) to prevent memory leaks from occasional large payloads.
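A sketch of the pool-with-size-cap pattern, assuming the 4KB threshold from the post (helper names are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

const maxPooledBuf = 4 << 10 // 4KB cap on pooled buffers

var bufPool = sync.Pool{
	New: func() any {
		b := make([]byte, 0, 256)
		return &b
	},
}

func getBuf() *[]byte { return bufPool.Get().(*[]byte) }

// putBuf drops oversized buffers so one large payload can't pin
// memory in the pool indefinitely.
func putBuf(b *[]byte) {
	if cap(*b) > maxPooledBuf {
		return // let the GC reclaim it
	}
	*b = (*b)[:0]
	bufPool.Put(b)
}

func main() {
	b := getBuf()
	*b = append(*b, "raw msgpack bytes"...)
	fmt.Println(len(*b))
	putBuf(b)
}
```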
Inline Nil Checks for Byte-Slice Reader
hasNilCode() is called on every field during struct decoding to check for nil values. The upstream implementation does two interface method calls (ReadByte + UnreadByte) per check. When the input is a byte slice (the Unmarshal path), we peek directly at the underlying data — no interface dispatch, no virtual calls.
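The direct peek is trivial once the reader owns the slice. A minimal sketch (type and method names are illustrative; 0xc0 is the real msgpack nil code):

```go
package main

import "fmt"

const nilCode = 0xc0 // msgpack nil type code

type bsReader struct {
	data []byte
	off  int
}

// hasNilCode peeks at the next byte without advancing the offset:
// one bounds check and one array load, no interface dispatch.
func (r *bsReader) hasNilCode() bool {
	return r.off < len(r.data) && r.data[r.off] == nilCode
}

func main() {
	r := &bsReader{data: []byte{nilCode, 0x01}}
	fmt.Println(r.hasNilCode())
}
```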
Microbenchmarks
| Benchmark | Before | After | Change |
|---|---|---|---|
| Unmarshal (time) | 358 ns/op | 283 ns/op | -21% |
| Unmarshal (memory) | 160 B/op | 80 B/op | -50% |
| Unmarshal (allocs) | 6 allocs/op | 4 allocs/op | -33% |
What We Changed: Encode Path
Pooled Byte Buffer
Instead of allocating a bytes.Buffer on every Marshal() call, we embedded a reusable []byte buffer directly into the pooled Encoder struct. Since encoders are returned to sync.Pool between calls, the buffer capacity persists across invocations. A byteSliceWriter wraps the buffer and implements the writer interface with zero allocations.
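A condensed sketch of the layout, assuming the struct shape implied by the description (the real encoder does far more; 0xc3/0xc2 are the actual msgpack true/false codes):

```go
package main

import (
	"fmt"
	"sync"
)

// byteSliceWriter appends into a buffer owned by the pooled encoder.
type byteSliceWriter struct{ buf []byte }

func (w *byteSliceWriter) Write(p []byte) (int, error) {
	w.buf = append(w.buf, p...)
	return len(p), nil
}

func (w *byteSliceWriter) WriteByte(c byte) error {
	w.buf = append(w.buf, c)
	return nil
}

type encoder struct{ w byteSliceWriter }

var encPool = sync.Pool{New: func() any { return new(encoder) }}

func marshalBool(v bool) []byte {
	e := encPool.Get().(*encoder)
	e.w.buf = e.w.buf[:0] // capacity survives from earlier calls
	if v {
		e.w.WriteByte(0xc3)
	} else {
		e.w.WriteByte(0xc2)
	}
	out := append([]byte(nil), e.w.buf...) // copy out before pooling
	encPool.Put(e)
	return out
}

func main() {
	fmt.Printf("% x\n", marshalBool(true))
}
```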
MarshalAppend API
We added MarshalAppend(dst, v) — a new API that appends encoded bytes to a caller-provided buffer. This eliminates the final make+copy that Marshal() performs to return a fresh []byte. For callers who reuse buffers (like Arc's WAL writer), this is a significant win: -26% latency, -94% memory.
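The calling pattern mirrors Go's built-in append. Below is a toy stand-in for the real function, just to show the buffer-reuse shape; the fork's actual MarshalAppend encodes arbitrary values, and its exact signature may differ from this sketch.

```go
package main

import "fmt"

// marshalAppend mimics the append-style API with a one-byte bool
// encoder (0xc3/0xc2 are the real msgpack true/false codes).
func marshalAppend(dst []byte, v bool) ([]byte, error) {
	if v {
		return append(dst, 0xc3), nil
	}
	return append(dst, 0xc2), nil
}

func main() {
	buf := make([]byte, 0, 64) // caller-owned, reused across records
	for _, rec := range []bool{true, false} {
		buf = buf[:0] // reuse capacity; no make+copy per record
		buf, _ = marshalAppend(buf, rec)
		fmt.Printf("% x\n", buf)
	}
}
```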
WriteByte Scratch Fix
For the streaming encoder path (when users call NewEncoder(w) with a custom writer), the original byteWriter shim allocated a []byte{c} on every WriteByte() call. We replaced it with a [1]byte scratch field on the struct — same semantics, zero allocations.
Encode Type-Switch Fast Paths
Arc's most common Marshal payloads are map[string]interface{}, map[string]string, and []interface{}. These types weren't in the Encode() type-switch, so they fell through to reflect.ValueOf() followed by a sync.Map lookup for the appropriate encoder function. Adding three cases to the type-switch eliminates all that overhead. The map[string]string fast path alone is 15% faster.
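The dispatch order is the whole trick. In this sketch the return strings stand in for the real encode functions:

```go
package main

import "fmt"

// dispatch matches the hot concrete types with a type switch before
// any reflection happens.
func dispatch(v any) string {
	switch v.(type) {
	case map[string]interface{}:
		return "encodeMapStringInterface"
	case map[string]string:
		return "encodeMapStringString"
	case []interface{}:
		return "encodeSliceInterface"
	default:
		return "reflectEncode" // reflect.ValueOf + sync.Map lookup
	}
}

func main() {
	fmt.Println(dispatch(map[string]string{"host": "arc-01"}))
	fmt.Println(dispatch(struct{}{}))
}
```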
Two-Pass OmitEmpty
The upstream OmitEmpty allocates a []*field slice on every struct encode to hold the filtered fields. But in Arc's analytical workloads, most records have all fields populated — meaning the filter is a no-op that allocates for nothing. We added a two-pass approach: first count how many fields survive, and if all do, return the original slice with zero allocations. When fields are actually omitted, we pool the filtered slice via sync.Pool.
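A sketch of the two-pass filter (types simplified; the fork pools the filtered slice, which this sketch replaces with a plain make for brevity):

```go
package main

import "fmt"

type field struct {
	name  string
	empty bool
}

// filterOmitEmpty counts survivors first; if nothing is omitted, the
// original slice is returned and no allocation happens.
func filterOmitEmpty(fields []*field) []*field {
	keep := 0
	for _, f := range fields {
		if !f.empty {
			keep++
		}
	}
	if keep == len(fields) {
		return fields // common case in Arc's workloads: zero allocations
	}
	out := make([]*field, 0, keep)
	for _, f := range fields {
		if !f.empty {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	all := []*field{{name: "ts"}, {name: "val"}}
	fmt.Println(len(filterOmitEmpty(all)))
}
```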
Skip reflect.Convert for Exact Types
The upstream calls reflect.Convert() on every map and slice encode to normalize types. But when the value's type already matches the target (e.g., it's literally map[string]string, not a named alias), the conversion is wasted work. We added v.Type() == targetType checks for map[string]string, map[string]bool, map[string]interface{}, and []string. This saves 7-8% on map-heavy benchmarks.
Cache isZeroer Interface Check
The upstream checks whether each struct field implements the isZeroer interface during every OmitEmpty evaluation, which involves v.Interface() boxing. We pre-compute this check once at struct-discovery time and store a boolean flag on the field metadata.
Pool Sorted Map Key Slices
When SetSortMapKeys(true) is enabled, the upstream allocates a []string slice for keys on every map encode. We pool these via sync.Pool, eliminating one allocation per sorted map encode.
Microbenchmarks
| Benchmark | Before | After | Change |
|---|---|---|---|
| StructMarshal (time) | 461 ns/op | 405 ns/op | -12% |
| StructMarshal (memory) | 1456 B/op | 1224 B/op | -16% |
| StructMarshal (allocs) | 7 allocs/op | 4 allocs/op | -43% |
| Discard (time) | 20 ns/op | 8.6 ns/op | -57% |
| Discard (allocs) | 2 allocs/op | 0 allocs/op | -100% |
Bug Fixes: 15 Issues the Upstream Missed
While optimizing, we found and fixed 15 bugs — some security-critical, some causing panics, all open for years.
Security: OOM from Untrusted Input
decodeSlice() reads the array length from the wire and calls make([]interface{}, 0, n) with no bounds check. A malicious payload claiming an array of 2 billion elements triggers an immediate out-of-memory crash. Same issue for DecodeMap(). We capped both at 1M elements. The actual data is still decoded element-by-element, so legitimate large arrays work fine — the cap only limits the initial pre-allocation.
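The fix in sketch form, using the 1M cap from above (function name is illustrative):

```go
package main

import "fmt"

const maxPreallocElems = 1 << 20 // 1M-element pre-allocation cap

// preallocSlice caps only the up-front make(); the wire length still
// drives the element-by-element decode loop, so legitimate large
// arrays simply grow via append.
func preallocSlice(wireLen int) []interface{} {
	n := wireLen
	if n > maxPreallocElems {
		n = maxPreallocElems
	}
	return make([]interface{}, 0, n)
}

func main() {
	s := preallocSlice(2_000_000_000) // hostile array32 header
	fmt.Println(cap(s))
}
```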
Security: Allocation Limit That Never Worked
There's a disableAllocLimitFlag that lets users opt out of allocation limits. The flag is defined as 1 << 3 (value 8). But the check was != 1 — since d.flags & 8 yields either 0 or 8, never 1, the allocation limit was never applied for reflect-based slice decoding.
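A two-function demo of why the comparison could never work (the constant is from the post; the helper names are ours):

```go
package main

import "fmt"

const disableAllocLimitFlag = 1 << 3 // value 8

// Buggy upstream check: flags&8 yields 0 or 8, never 1, so this
// comparison is true for every input and the limit never applies.
func limitDisabledBuggy(flags uint32) bool {
	return flags&disableAllocLimitFlag != 1
}

// Fixed check: compare the masked bit against zero.
func limitDisabledFixed(flags uint32) bool {
	return flags&disableAllocLimitFlag != 0
}

func main() {
	for _, flags := range []uint32{0, disableAllocLimitFlag} {
		fmt.Printf("flags=%d buggy=%v fixed=%v\n",
			flags, limitDisabledBuggy(flags), limitDisabledFixed(flags))
	}
}
```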
Correctness Fixes
- Float-to-integer decoding — float64-encoded values now decode into int64/uint64 with proper validation, rejecting NaN, Inf, fractional, and out-of-range values (https://github.com/Basekick-Labs/msgpack/issues/2)
- Float64-to-float32 narrowing — float64-encoded values decode into float32 with an overflow check (https://github.com/Basekick-Labs/msgpack/issues/12)
- Non-addressable pointer encode — types with pointer receivers now encode correctly via ensureAddr instead of returning an error (https://github.com/Basekick-Labs/msgpack/issues/3)
- reflect.Value marshal panic — marshalling a reflect.Value now unwraps and encodes the underlying value instead of panicking (https://github.com/Basekick-Labs/msgpack/issues/15)
- Custom error types preserved — encoding error types no longer reduces them to plain strings via .Error() (https://github.com/Basekick-Labs/msgpack/issues/22)
- Non-string map keys — decoding maps with integer keys into interface{} now works correctly instead of panicking (https://github.com/Basekick-Labs/msgpack/issues/21)
- OmitEmpty with unexported fields — structs with non-zero unexported fields are no longer incorrectly omitted (https://github.com/Basekick-Labs/msgpack/issues/6)
- Nested integer-keyed maps — interface{} value type is now used for non-string-keyed typed maps (https://github.com/Basekick-Labs/msgpack/issues/20)
- TextUnmarshaler priority — TextUnmarshaler is now chosen over BinaryUnmarshaler when the wire format is str (https://github.com/Basekick-Labs/msgpack/issues/10)
- Decoder memory leak — oversized decoder buffers (>32KB) are now dropped from sync.Pool to prevent memory leaks from large decode operations (https://github.com/Basekick-Labs/msgpack/issues/19)
- Float64 error message — DecodeFloat64 no longer says "decoding float32" (https://github.com/Basekick-Labs/msgpack/issues/13)
Production Results: Arc Benchmarks
After deploying v6.0.0 to Arc's ingestion pipeline, we ran our standard 60-second throughput benchmark. The results speak for themselves:
| Metric | Before (upstream v5) | After (v6.0.0) | Change |
|---|---|---|---|
| Avg throughput | 16.78M rec/s | 18.23M rec/s | +8.6% |
| p50 latency | 0.52ms | 0.47ms | -9.6% |
| p99 latency | 3.72ms | 3.58ms | -3.8% |
| 60s degradation | 22% | 13% | -41% relative |
The degradation metric is the one we care about most. It measures how much throughput drops over a sustained 60-second burst as GC pressure builds, the CPU caches get cold, and sync.Pool victims get collected. Going from 22% to 13% means Arc holds its peak throughput significantly longer under sustained load.
v6.0.0: A Proper Release
With 15 performance optimizations and 15 bug fixes shipped, we tagged https://github.com/Basekick-Labs/msgpack/releases/tag/v6.0.0 — the first stable release of the fork. The module path is now github.com/Basekick-Labs/msgpack/v6, and it's available on the Go module proxy.
To migrate from the upstream:
// Before
import "github.com/vmihailenco/msgpack/v5"
// After
import "github.com/Basekick-Labs/msgpack/v6"

Then update your module:

go get github.com/Basekick-Labs/msgpack/v6

The API is fully compatible — it's a drop-in replacement with the same types, methods, and behavior (minus the bugs).
What's Next
We're already working on v6.1 with three more optimizations targeting the decode hot path:
- readCode bsr fast path — the single most-called function in the decoder reads one byte via interface dispatch. When the input is a byte slice, we can skip the interface and read directly from the underlying array. Early microbenchmarks show -7.5% on struct unmarshal.
- PeekCode bsr fast path — same pattern for PeekCode(), which currently does ReadByte + UnreadByte (two interface calls) when a single array index check suffices.
- Pool OmitEmpty filtered slices — when OmitEmpty does filter out fields, the allocated slice is returned to a sync.Pool for reuse.
Beyond that, we have a labeled backlog of performance work (https://github.com/Basekick-Labs/msgpack/issues?q=is%3Aissue+is%3Aopen+label%3Aperformance) — from decode-side type-switch fast paths to zero-copy byte-slice decoding to readN inlining. The goal is to push that 60-second degradation curve below 10%.
The fork is open source under the same BSD-2 license. If you're using vmihailenco/msgpack in a performance-sensitive Go application, give it a try: https://github.com/Basekick-Labs/msgpack.
Links
- https://github.com/Basekick-Labs/msgpack (branch v6)
- https://github.com/Basekick-Labs/msgpack/releases/tag/v6.0.0
- https://github.com/Basekick-Labs/msgpack/blob/v6/CHANGELOG.md
- https://github.com/Basekick-Labs/msgpack/issues?q=is%3Aissue+is%3Aopen+label%3Aperformance
- https://github.com/Basekick-Labs/arc
- Discord
Ready to handle billion-record workloads?
Deploy Arc in minutes. Own your data in Parquet. Use for analytics, observability, AI, IoT, or data warehousing.
