
Why Your AI Agent Observability Tool Is Wrong About Memory

Tags: AI agents, memory, observability, Memtrace, Arc, LLM, time-series, Parquet, architecture, infrastructure

I have an agent running in production right now. It's a CRM I built for myself called Aura. It uses memory to remember everything: every conversation I've had with a prospect, every deal I've moved, every activity I've logged. It runs across days, not within a single request. It learns. It forgets things I want it to forget. It has state that persists.

I also have a personal trainer agent called Voya. It remembers my last workout, what I ate, how I slept, what I said about my knee three weeks ago. Every session is informed by every previous session.

Both of these agents need something that almost nobody in the AI agent observability space is talking about: memory as a first-class data object. Not traces. Not spans. Not request-response logs. Memory.

This post is about why current agent observability tools have the wrong data model, and what the right one looks like.

What Current Agent Observability Tools Optimize For

Read the marketing pages of LangSmith, Langfuse, Helicone, Arize Phoenix, AgentOps, Maxim AI, Latitude, or Datadog LLM Observability and you'll see the same primitives over and over:

  • Traces. A timeline of LLM calls, tool invocations, retrieval steps, and final responses for a single agent run.
  • Spans. Individual operations inside a trace, with latency and cost attribution.
  • Evaluations. Scoring agent outputs against quality criteria, either via LLM-as-judge or human review.
  • Prompts. Versioned templates that produce LLM inputs.

Every one of these is a primitive for observing and improving a single agent run. They all assume the same data model: an agent receives a request, executes a chain of operations, returns a response, and then gets observed.

This is the request-response model of agent behavior. It's a natural extension of how we think about web applications. It works well when your agent answers one question.

It fails when your agent has to remember things.

The Model Break

A real agent, the kind I have running in production, doesn't operate like a web request. It operates like a person.

When I sit down with my CRM agent, I don't restart it. It already knows that a contact at one of our aerospace customers responded "compelling" to my proposal three weeks ago. It already knows the technical lead at another customer got Arc running on Azure last Friday. It already knows that customer's deal closes May 15 and my new business ARR is at 30% of the year's target.

When my trainer agent talks to me, it knows that yesterday I rode my motorcycle for two hours and that's why my back is tight today. It knows I prefer kettlebells to dumbbells. It knows I had a bad night of sleep and to scale today's intensity accordingly.

These agents have memory. Not context windows. Memory. The kind that persists across sessions, across days, across version upgrades. The kind that grows. The kind that gets pruned when something becomes irrelevant.

And when I want to debug or improve these agents, the question I ask is rarely "what happened in this trace?" The question is almost always:

  • What does this agent know about this user as of right now?
  • How did its understanding of this customer evolve over the last three weeks?
  • Why did it think this prospect was high-priority? When did it form that opinion?
  • What memories is it using to make this recommendation?

None of these are trace questions. They're all memory queries.
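Under an assumed schema, those questions map onto ordinary SQL. Here's a minimal sketch with sqlite3 standing in for Arc's SQL engine; the table layout and column names are illustrative, not Memtrace's actual schema:

```python
import sqlite3

# Illustrative memory table: agent, entity, type, content, importance, time.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE memories (
        agent_id TEXT, entity TEXT, mem_type TEXT,
        content TEXT, importance REAL, created_at TEXT
    )
""")
conn.executemany("INSERT INTO memories VALUES (?, ?, ?, ?, ?, ?)", [
    ("aura", "acme", "episodic",
     "Contact replied 'compelling' to proposal", 0.9, "2025-04-14T10:02:00"),
    ("aura", "acme", "decision",
     "Marked deal high-priority after demo", 0.8, "2025-04-20T16:30:00"),
    ("aura", "acme", "episodic",
     "Sent follow-up email", 0.2, "2025-05-01T09:47:00"),
])

# "What does this agent know about this entity as of right now?"
known = conn.execute(
    "SELECT content FROM memories WHERE agent_id = ? AND entity = ? "
    "ORDER BY importance DESC",
    ("aura", "acme"),
).fetchall()

# "Why did it think this prospect was high-priority, and when?"
decision = conn.execute(
    "SELECT content, created_at FROM memories "
    "WHERE entity = ? AND mem_type = 'decision'",
    ("acme",),
).fetchone()
```

Each question becomes a filter plus an ordering, nothing more exotic than that.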

The Shape of the Gap

Look at the current tools through this lens:

LangSmith stores traces. Sessions are groupings of traces. There's no first-class memory primitive. Memory is whatever your agent's framework saves to a database, and that database is opaque to LangSmith.

Langfuse does the same, with sessions and traces. It was recently acquired by ClickHouse, which gives it serious infrastructure for trace storage. But trace storage is not memory storage.

Latitude is the most sophisticated of the new generation. They explicitly call out cross-turn state and tool-call failures as a gap, and they auto-generate evaluations from production traces. But the data model is still trace-centric.

Maxim AI focuses on the full lifecycle (simulation, evaluation, observability) but also fundamentally treats each agent run as a discrete, observable event.

Datadog LLM Observability and New Relic are bolt-ons to existing APM. Even less suited to memory.

The ones that come closest to acknowledging memory as a thing:

  • AgentOps has "session management" that groups related agent interactions into sessions. Still trace-shaped underneath.
  • Arize Phoenix has embedding clustering and drift detection over time. Useful for ML drift; not useful for "what does this agent remember about this user."

There is no tool in this list that lets me ask: "Show me everything Aura knows about this customer, ordered by importance, as of right now."

What Memory Actually Needs From Infrastructure

When I built Memtrace, the memory layer for Aura, Voya, and a few internal projects, I had to decide what memory infrastructure looks like for a production AI agent. Here's what I landed on, and why.

1. Memory Is Temporal, Not Semantic

Vector databases are great for semantic search ("find documents similar to this query"). They're terrible for temporal queries ("what happened in the last two hours"). Most operational memory is fundamentally a time-windowed lookup, not a similarity search.

When my trainer agent wants to know "what did the user say about their knee in the last 30 days," that's a temporal query with a filter. It does not need an embedding. It does not need a vector index. It needs a fast time-range scan over structured data.

This is why I built Memtrace on top of Arc. Arc is a columnar analytical database optimized for time-series queries. Memtrace memories are stored as time-partitioned Parquet files, queried with SQL, with native time-window predicates. A query like "memories about <customer> in the last week with importance > 0.7" runs in 20–50 milliseconds.
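The knee question above is a plain time-range scan, which any SQL engine handles well. A minimal sketch, with sqlite3 standing in for Arc and assumed column names:

```python
import sqlite3
from datetime import datetime, timedelta

# Illustrative memory rows with an ISO-8601 timestamp column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memories (content TEXT, tags TEXT, created_at TEXT)")
now = datetime(2025, 6, 1)
conn.executemany("INSERT INTO memories VALUES (?, ?, ?)", [
    ("Knee felt stiff after squats", "knee",
     (now - timedelta(days=12)).isoformat()),
    ("Knee pain gone after rest week", "knee",
     (now - timedelta(days=3)).isoformat()),
    ("Old knee injury from 2019", "knee",
     (now - timedelta(days=90)).isoformat()),
])

# "What did the user say about their knee in the last 30 days":
# a time-window predicate, no embedding, no vector index.
cutoff = (now - timedelta(days=30)).isoformat()
recent = conn.execute(
    "SELECT content FROM memories WHERE tags = 'knee' AND created_at >= ? "
    "ORDER BY created_at DESC",
    (cutoff,),
).fetchall()
```

The 90-day-old memory falls outside the window and never enters the result set; time does the filtering that similarity search can't.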

2. Memory Has Types

Not all memories are the same. A decision (the agent chose path A over path B because of X) is structurally different from an episodic event (something happened at this time), which is different from an entity record (this customer is an aerospace company with these attributes), which is different from a session (this conversation happened, here's what was discussed).

Memtrace has four memory types: Episodic, Decision, Entity, and Session. Each has its own schema. Each has its own query patterns. Treating them all as undifferentiated trace events loses information.
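Here's one way those four types could be modeled. The fields are my assumptions based on what each type needs to capture, not Memtrace's actual schemas:

```python
from dataclasses import dataclass, field

@dataclass
class Episodic:
    """Something happened at a point in time."""
    agent_id: str
    timestamp: str
    content: str
    importance: float = 0.5

@dataclass
class Decision:
    """The agent chose one path over another, with a rationale."""
    agent_id: str
    timestamp: str
    chosen: str
    rejected: str
    rationale: str
    importance: float = 0.5

@dataclass
class Entity:
    """A durable record about a person, company, or thing."""
    agent_id: str
    name: str
    attributes: dict = field(default_factory=dict)

@dataclass
class Session:
    """A conversation happened; here's what was discussed."""
    agent_id: str
    started_at: str
    summary: str

d = Decision("aura", "2025-04-20T16:30:00",
             chosen="prioritize Acme", rejected="deprioritize Acme",
             rationale="contact called the proposal compelling")
```

Flattening all four into one undifferentiated event type would throw away the `chosen`/`rejected`/`rationale` structure that makes a decision queryable as a decision.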

3. Memory Has Importance

Not all memories deserve equal weight. The fact that the customer's contact responded "compelling" to my proposal is more important than the fact that I sent them an email at 9:47 AM. Importance is a feature of the memory, not just a metadata tag.

Every Memtrace memory has an importance score from 0 to 1. Queries can filter by it. Retention can prune by it. The agent can reason about its own most-important memories without re-processing the entire log.
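Filtering and pruning by importance can be sketched in a few lines; the record shape here is illustrative:

```python
memories = [
    {"content": "Contact replied 'compelling'", "importance": 0.9},
    {"content": "Sent email at 9:47 AM", "importance": 0.1},
    {"content": "Deal moved to negotiation", "importance": 0.7},
]

def top_memories(mems, min_importance):
    """Query path: keep memories at or above a threshold, highest first."""
    hits = [m for m in mems if m["importance"] >= min_importance]
    return sorted(hits, key=lambda m: m["importance"], reverse=True)

def prune(mems, threshold):
    """Retention path: drop anything below the threshold outright."""
    return [m for m in mems if m["importance"] >= threshold]
```

The same score serves both purposes: at query time it ranks what the agent sees, and at retention time it decides what survives.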

4. Memory Is Shared, Sometimes

When two agents need to coordinate, they need a shared memory pool. The customer support agent that hands off to the technical specialist agent needs the new agent to know everything the previous agent knew. That's not a trace handoff. That's a memory namespace that both agents read from and write to.

Memtrace has shared memory pools. Two agents collaborate without needing custom pub/sub or messaging infrastructure between them.
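The handoff pattern reduces to two agents writing to and reading from the same namespace. A toy sketch; the API shape is an assumption, not Memtrace's actual interface:

```python
class MemoryStore:
    """In-memory stand-in for a shared memory pool keyed by namespace."""

    def __init__(self):
        self._pools = {}

    def write(self, namespace, agent_id, content):
        self._pools.setdefault(namespace, []).append(
            {"agent": agent_id, "content": content})

    def read(self, namespace):
        return list(self._pools.get(namespace, []))

store = MemoryStore()
# The support agent records context during its session.
store.write("customer-42", "support", "User reports auth failures on Azure")
# The specialist agent picks up the same namespace and sees everything.
handoff = store.read("customer-42")
```

No pub/sub, no message bus: the shared pool itself is the coordination mechanism.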

5. Memory Is Plain Text for the LLM, Not Embeddings

LLMs read text. Every memory in Memtrace can be rendered as plain markdown and dropped directly into a prompt. No embedding API call. No transformation layer. No vector retrieval step before the agent can use its own memory.

This is the part that quietly removes the most cost from production agent infrastructure. Every embedding API call is money. Every vector lookup is latency. Memtrace has neither.
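Rendering memories into a prompt can be as simple as string formatting. The markdown layout below is one illustrative choice, not a prescribed format:

```python
def render_markdown(memories):
    """Render memory records as a markdown block, most important first."""
    lines = ["## Relevant memories"]
    for m in sorted(memories, key=lambda m: m["importance"], reverse=True):
        lines.append(
            f"- [{m['timestamp']}] ({m['importance']:.1f}) {m['content']}")
    return "\n".join(lines)

block = render_markdown([
    {"timestamp": "2025-04-14", "importance": 0.9,
     "content": "Contact called the proposal compelling"},
    {"timestamp": "2025-05-01", "importance": 0.2,
     "content": "Sent follow-up email"},
])
```

The output drops straight into the context window; the only "retrieval pipeline" is a sort and a join.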

What This Looks Like in Practice

Aura, the CRM agent I use every day, runs on Memtrace. Here's roughly what happens when I ask it about a customer:

  1. Aura queries Memtrace: "Memories tagged with <customer>, ordered by importance, last 90 days."
  2. Memtrace returns ~12 memories: the contact's "compelling" reply, a colleague joining the May 6 call, the license extension email, the spoofed telemetry flag, my pre-call prep notes, the deal stage transitions, and so on.
  3. Aura formats those memories as plain markdown.
  4. Aura drops them into the LLM context window.
  5. The LLM responds with knowledge that feels coherent across weeks of activity.

There is no embedding step. There is no vector search. There is a SQL query against time-partitioned Parquet files, and the results are ready in 20–50ms.
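The five steps above compress into a short script. This is a sketch with sqlite3 standing in for Arc and hard-coded rows standing in for Aura's real memories:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memories (tag TEXT, content TEXT, importance REAL)")
conn.executemany("INSERT INTO memories VALUES (?, ?, ?)", [
    ("acme", "Contact replied 'compelling' to the proposal", 0.9),
    ("acme", "License extension email sent", 0.6),
    ("other", "Unrelated note", 0.8),
])

# Steps 1-2: query memories tagged with the customer, ordered by importance.
rows = conn.execute(
    "SELECT content, importance FROM memories WHERE tag = 'acme' "
    "ORDER BY importance DESC").fetchall()

# Step 3: format the results as plain markdown bullets.
context = "\n".join(f"- {content}" for content, _ in rows)

# Steps 4-5 would drop `context` into the LLM prompt as-is:
# no embedding call, no vector lookup between memory and model.
```

Everything between storage and prompt is one SQL query and one string join.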

When I want to debug Aura, I don't need a trace viewer. I need a memory browser. "Show me what you know about this person." "Show me your decisions in the last 24 hours, sorted by importance." "Show me which memories you used to generate this last response."

These are SQL queries against Memtrace. They're trivial. They don't require special tooling beyond what any analyst would expect.

The Data Architecture

Here's the architecture I ended up with:

  • Storage: Apache Parquet files, partitioned by hour, on S3 (or local disk, or Azure Blob). Columnar, compressed, immutable. The same format every analytical tool on earth can read.
  • Query engine: Arc, which uses DuckDB underneath. Standard SQL. Time-window predicates are first-class.
  • Memory schema: Strongly typed memory records with type (episodic, decision, entity, session), agent ID, importance score, timestamp, content, and metadata.
  • Ingestion: Plain HTTP endpoint. Insert a memory. It's queryable in real time.
  • Retention: Policies that delete memories older than X days, or with importance below Y, or matching some predicate. Filesystem-level deletion of partition files. No row-level rewrites.
  • Sharing: Namespaces. Multiple agents can read and write to the same namespace.
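The retention bullet is worth making concrete: because storage is partitioned by hour, a policy runs as directory deletion, never a row-level rewrite. A sketch using local directories; the `hour=YYYY-MM-DDTHH` layout is my assumption about the partition naming, not Arc's documented scheme:

```python
import shutil
import tempfile
from datetime import datetime, timedelta
from pathlib import Path

# Fake partition tree: one directory per hour of data.
root = Path(tempfile.mkdtemp())
now = datetime(2025, 6, 1, 12)
for age_hours in (1, 30, 800):
    ts = now - timedelta(hours=age_hours)
    (root / f"hour={ts:%Y-%m-%dT%H}").mkdir()

def apply_retention(base, cutoff):
    """Delete whole partition directories older than the cutoff."""
    removed = []
    for part in sorted(base.iterdir()):
        ts = datetime.strptime(part.name.split("=", 1)[1], "%Y-%m-%dT%H")
        if ts < cutoff:
            shutil.rmtree(part)       # filesystem-level delete, no rewrites
            removed.append(part.name)
    return removed

removed = apply_retention(root, now - timedelta(days=30))
remaining = sorted(p.name for p in root.iterdir())
```

Only the partition older than 30 days is removed; the recent ones are never touched, which is what makes retention cheap at scale.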

This is not a vector database. It's not a trace database. It's a time-series database for agent memory, and it turns out time-series is the natural shape for what agents actually need.

If you want a concrete example of an agent running on this architecture, earlier this year I wrote about an autonomous agent that posted to social media for three days using Arc as its memory. Same primitives, different surface.

Why Current Observability Tools Don't Fit This

If you're building agent observability today, you're working backward from a request-response model. Your data structure is the trace. Your queries are "show me this trace's spans" or "show me traces with errors in the last hour."

If you're building agent memory infrastructure, you're working backward from a temporal-state model. Your data structure is the memory. Your queries are "show me this agent's relevant memories about this entity" or "show me all decisions made in the last 24 hours."

These are not the same problem. The fact that the industry is grouping them under "agent observability" is a category error. Memory needs its own infrastructure layer. Observability sits on top of it.

What This Means for AI Infrastructure in 2026

The next wave of agent infrastructure is going to need three things current tools don't provide:

  1. A first-class memory layer that's queryable, durable, and not stuck inside an opaque framework.
  2. Time-series-shaped storage for agent state, because agents are temporal entities, not request-response handlers.
  3. Standard formats (Parquet, SQL) so memory can be queried by any tool, not just the framework that wrote it.

I built Memtrace to do these three things. It's open source under Apache 2.0, available now. The managed cloud version is on the waitlist.

But more importantly: whatever you build, get the data model right. Don't graft memory on top of trace tools. Don't store memory in vector databases. Don't keep memory inside your agent framework's process memory. Build it as a first-class data layer with proper temporal queries.

Your agents will thank you. So will the next agent you build, when it can read what the previous one remembered.



If you're running production agents and want to talk about memory architecture, find me on Discord.
