Memory, self-tuning caching, and retrieval for your AI agents

Durable memory with scoped, ranked recall; multi-tier caching that answers repeat work in under a millisecond; typed retrieval over your own data.
One SDK, observable at every layer, deployed anywhere you run Valkey.

Sub-ms cache hits • Redis-compatible • TypeScript + Python • No vendor lock-in

Long-term memory with scoped, ranked recall across sessions.

@betterdb/agent-memory

npm install @betterdb/agent-memory iovalkey

Docs → Examples →

The context layer for AI agents

Memory, cache, and retrieval in one SDK, on open Valkey. TypeScript and Python, observable at every layer, and yours to run anywhere.

@betterdb/agent-memory

Agent Memory

Long-term memory for agents. remember() and recall() scoped to threads and agents, ranked by similarity, recency, and importance, with consolidation so your agent remembers across sessions.

View on GitHub →

@betterdb/semantic-cache

Semantic Cache

Similarity-based response caching on valkey-search. Sub-millisecond lookups, bundled cost tracking, and a threshold engine that tunes itself as traffic shifts.

View on GitHub →

@betterdb/retrieval

Retrieval SDK

Typed vector retrieval over Valkey: upsert, filter, and query with hybrid dense + lexical rerank. The same engine behind memory recall, exposed directly for RAG.

View on GitHub →

Agent memory

Memory your agent actually keeps

Most agents forget everything between turns. BetterDB gives them durable, long-term memory with remember() and recall() - scoped to the right thread, ranked by what matters, and consolidated over time so it stays useful as history grows.

Give your agent memory in a few lines

One client, two methods. Point it at any Valkey and your agent remembers across sessions, no new database to operate.

@betterdb/agent-memory

import { MemoryStore } from '@betterdb/agent-memory'
import Valkey from 'iovalkey'

const memory = new MemoryStore({
  client: new Valkey(process.env.VALKEY_URL),
  name: 'support-agent',
  embedFn, // your embedding function, defined elsewhere
})

// Remember a fact, scoped to a thread
await memory.remember('Prefers metric units and dark mode', {
  threadId: 'user-42',
  importance: 0.8,
})

// Recall what matters for the next turn
const hits = await memory.recall('user preferences?', {
  threadId: 'user-42',
  k: 5,
})

Scoped recall
Partition memory by thread, agent, and namespace so every recall stays on-topic.
Ranked, not just nearest
A composite score blends similarity, recency, and importance, then reinforces what gets used.
Observable by default
Every remember and recall is traced and exported, so you can see exactly what your agent knows.

Built for how agents actually remember

Vector search alone is not memory. Recall has to be scoped, ranked, kept fresh, and observable.

Scoped recall

Partition memory by thread, agent, and namespace so every recall stays on-topic instead of dragging in unrelated history.

Ranked, not just nearest

A composite score blends semantic similarity, recency, and importance, then reinforces the memories that actually get used.
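A minimal sketch of how a composite score like this could be computed. It is not the library's actual formula: the weights, the exponential recency decay, the half-life, and the field names are all assumptions chosen for illustration.

```typescript
// Hypothetical shape of a recall candidate; field names are illustrative.
interface ScoredMemory {
  text: string;
  similarity: number; // cosine similarity to the query, 0..1
  ageSeconds: number; // time since the memory was written or last recalled
  importance: number; // 0..1, as passed when the memory was stored
}

// Assumed weights and half-life; a real deployment would tune these.
const WEIGHTS = { similarity: 0.6, recency: 0.25, importance: 0.15 };
const HALF_LIFE_SECONDS = 7 * 24 * 3600;

// Recency decays exponentially: 1 for a fresh memory, 0.5 after one half-life.
function recencyScore(ageSeconds: number): number {
  return Math.pow(0.5, ageSeconds / HALF_LIFE_SECONDS);
}

// Weighted blend of the three signals.
function compositeScore(m: ScoredMemory): number {
  return (
    WEIGHTS.similarity * m.similarity +
    WEIGHTS.recency * recencyScore(m.ageSeconds) +
    WEIGHTS.importance * m.importance
  );
}

// Return the top-k memories by composite score, highest first.
function rank(memories: ScoredMemory[], k: number): ScoredMemory[] {
  return [...memories]
    .sort((a, b) => compositeScore(b) - compositeScore(a))
    .slice(0, k);
}
```

With these assumed weights, a fresh, important memory can outrank an older one that is slightly closer to the query, which is the difference between ranked recall and plain nearest-neighbor search.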

Consolidation

consolidate() summarizes and dedupes accumulated memories over time, so recall stays sharp as history grows instead of degrading.

Forgetting and TTL

forget() removes memories on demand and per-memory TTL expires them automatically, so stale or sensitive context does not linger.

Observable by default

Every remember and recall emits OpenTelemetry spans and Prometheus metrics. You can see exactly what your agent knows and why it recalled it.

Yours to run anywhere

No new managed service. Point it at any Valkey in your own infrastructure - ElastiCache, Memorystore, MemoryDB, or self-hosted. No vendor lock-in.

Measured, not asserted

Recall drops only about five points as the haystack grows 10× on LongMemEval, the benchmark the category is measured by, while lookups stay sub-millisecond on infrastructure you control.

~93%: recall on LongMemEval-M with hybrid rerank, against ~98% on the small split. LongMemEval-M is a ~10× larger haystack: 500 questions, ~475 sessions each
0.7ms: semantic-cache hit latency, roughly 100× faster than a hosted cache
Anywhere: run it on a Valkey you already operate, or let us provision a managed one — open core, built on open-source Valkey, no lock-in

Retrieval measured on the public LongMemEval-M split (k=10, hybrid rerank, text-embedding-3-small); cache latency from our Upstash comparison. Read the benchmarks →

How BetterDB memory compares

Assume comparable recall and answer quality. The differences that persist are structural - Valkey-native, open, self-hosted, observable.

Capability	@betterdb/agent-memory	Mem0	Zep	Redis Iris
Runs on Valkey and Redis you operate		Pluggable backends	Graph DB	Redis Enterprise / Cloud
Open core, self-host with a real exit	MIT core	Apache 2.0; graph gated to Pro	Graphiti OSS; CE deprecated	Proprietary
Own your data, no lock-in		Cloud for full features	Graphiti OSS; product is Cloud	Proprietary, hosted only
Bundled semantic LLM cache	Exact + semantic, multi-tier	None	None	LangCache, separate
Typed retrieval in the same SDK		Via memory API	Via memory/graph API	Separate retriever
Observability: OTel + Prometheus	Every layer	Dashboard analytics	Enterprise governance	Redis tooling
TypeScript + Python parity	Full parity	Python primary	Python-first; Cloud multi-language	Client ecosystem
Temporal knowledge graph (what was true when)	Vector recall	Graph memory (Pro)	Bi-temporal	None
Managed-service maturity	Managed newer; self-host GA	Mem0 Cloud, GA	Zep Cloud, GA	Agent Memory in preview
Ecosystem and distribution	Newer	Large community, broad integrations		Redis ecosystem

Caching

Stop paying for the same answer twice

The other half of the context layer: multi-tier caching that returns repeated LLM calls, tool results, and session state from Valkey in under a millisecond - exact-match and semantic, with cost tracking and self-tuning thresholds built in.

See what caching saves you

Built-in cost tracking shows exactly how much you're saving per model and per tool.

const stats = await cache.stats();
// {
//   llm:  { hits: 150, misses: 50, hitRate: 0.75 },
//   tool: { hits: 300, misses: 100, hitRate: 0.75 },
//   session: { reads: 1000, writes: 500 },
//   costSavedMicros: 12500000,  // $12.50 saved
//   perTool: {
//     get_weather: { hits: 200, misses: 50, hitRate: 0.8 },
//   }
// }

75%

LLM hit rate in this example

$12.50

Saved from 150 cache hits at gpt-4o pricing

<1ms

Cache hit latency vs seconds for a full LLM call
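The arithmetic behind those numbers, as a small sketch. The helper names are hypothetical, and returning 0 for a tier with no traffic is an assumption.

```typescript
// Hit rate as it appears in the stats output: hits / (hits + misses).
// Assumption: a tier with no traffic reports 0.
function hitRate(hits: number, misses: number): number {
  const total = hits + misses;
  return total === 0 ? 0 : hits / total;
}

// costSavedMicros is expressed in millionths of a dollar.
function microsToDollars(micros: number): number {
  return micros / 1_000_000;
}

console.log(hitRate(150, 50));            // 0.75
console.log(microsToDollars(12_500_000)); // 12.5
```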

agent-cache also includes toolEffectiveness(), which ranks your cached tools by hit rate and recommends a TTL adjustment for each (increase_ttl, optimal, or decrease_ttl_or_disable), so caching stays efficient as your workload evolves.

You can also see the benefits live — we use this caching in our own BetterDB Chat.

The cache tunes itself

No other Valkey or Redis cache library does this.

Agent MCP call

// Agent MCP call
await mcp.callTool(
  'cache_propose_threshold_adjust',
  {
    cache_name: 'production-semantic',
    new_threshold: 0.075,
    reasoning:
      'hit rate 28% over 7d,'
      + ' tighten threshold',
  }
)

Pending proposal

// Pending proposal (API response)
{
  "id": "prop_01jwx3krq5",
  "status": "pending",
  "cache_name": "production-semantic",
  "new_threshold": 0.075,
  "expires_at": "2026-05-06T12:00:00Z",
  "warnings": []
}

After approval

// Dispatcher writes to Valkey:
HSET production-semantic:__config \
  threshold 0.075

// Library picks up the change
// within seconds. No restart.

The agent observes live cache metrics via MCP read tools, proposes a config change with reasoning, and a human approves it in BetterDB Monitor. The cache library polls its config key in Valkey and swaps the policy atomically - no restart, no redeploy. Config polling is live in @betterdb/semantic-cache@0.4.0 and @betterdb/agent-cache@0.6.0. See the full closed-loop example
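A rough sketch of the poll-and-swap pattern described above, not the library's internals. `PolicyHolder`, `ConfigReader`, and `startPolling` are hypothetical names; the reader stands in for whatever fetches the fields of the config hash from Valkey.

```typescript
// Immutable policy object: replacing the reference swaps the policy in one step.
interface CachePolicy {
  readonly threshold: number;
}

// Hypothetical reader: resolves to the fields of the config hash.
type ConfigReader = () => Promise<Record<string, string>>;

class PolicyHolder {
  private policy: CachePolicy;

  constructor(initial: CachePolicy) {
    this.policy = Object.freeze({ ...initial });
  }

  get current(): CachePolicy {
    return this.policy;
  }

  // Read the config once; swap only if it holds a valid, changed threshold.
  async refresh(read: ConfigReader): Promise<boolean> {
    const fields = await read();
    const raw = fields['threshold'];
    if (raw === undefined || raw.trim() === '') return false;
    const parsed = Number(raw);
    if (!Number.isFinite(parsed) || parsed === this.policy.threshold) return false;
    this.policy = Object.freeze({ threshold: parsed });
    return true;
  }
}

// Poll on an interval; the returned function stops polling.
function startPolling(holder: PolicyHolder, read: ConfigReader, intervalMs: number): () => void {
  const timer = setInterval(() => {
    holder.refresh(read).catch(() => {
      // Keep the last known policy if a read fails.
    });
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Because callers only ever read `holder.current`, a lookup in flight sees either the old policy or the new one, never a half-applied mix.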

TTL policies and self-optimization

Hit rate drives TTL. No manual tuning required.

const effectiveness = await cache.toolEffectiveness();
// [
//   { tool: 'get_weather', hitRate: 0.85, costSaved: 5.00,
//     recommendation: 'increase_ttl' },
//   { tool: 'search', hitRate: 0.6, costSaved: 2.50,
//     recommendation: 'optimal' },
//   { tool: 'rare_api', hitRate: 0.1, costSaved: 0.10,
//     recommendation: 'decrease_ttl_or_disable' },
// ]

Recommendation	Criteria
increase_ttl	Hit rate > 80% and current TTL < 1 hour
optimal	Hit rate 40-80%
decrease_ttl_or_disable	Hit rate < 40%
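The table above can be read as a small decision function. This is a sketch, not the library's code: the table does not say what happens above 80% when the TTL is already an hour or more, or exactly at 40% and 80%, so treating those cases as optimal is an assumption.

```typescript
type Recommendation = 'increase_ttl' | 'optimal' | 'decrease_ttl_or_disable';

const ONE_HOUR_SECONDS = 3600;

// Maps a tool's hit rate (0..1) and current TTL to a recommendation.
// Assumption: anything the table leaves unspecified falls through to 'optimal'.
function recommend(hitRate: number, currentTtlSeconds: number): Recommendation {
  if (hitRate < 0.4) return 'decrease_ttl_or_disable';
  if (hitRate > 0.8 && currentTtlSeconds < ONE_HOUR_SECONDS) return 'increase_ttl';
  return 'optimal';
}

console.log(recommend(0.85, 300)); // increase_ttl
console.log(recommend(0.6, 300));  // optimal
console.log(recommend(0.1, 300));  // decrease_ttl_or_disable
```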

TTL follows a clear precedence: per-call TTL overrides per-tool policy, which overrides tier default, which overrides global default. When toolEffectiveness() recommends increase_ttl, apply it with cache.tool.setPolicy('get_weather', { ttl: 3600 }) - the policy persists to Valkey and takes effect immediately without restarting your application.
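The precedence chain can be pictured as a first-set-wins lookup. A minimal sketch with hypothetical field names, not the library's implementation:

```typescript
// Hypothetical container for the four places a TTL (in seconds) can come from.
interface TtlSources {
  perCall?: number;       // TTL passed on an individual store call
  perTool?: number;       // TTL from a per-tool policy
  tierDefault?: number;   // TTL configured for the tier
  globalDefault?: number; // cache-wide fallback
}

// Returns the first TTL that is set, in precedence order, or undefined if none is.
function resolveTtl(s: TtlSources): number | undefined {
  return s.perCall ?? s.perTool ?? s.tierDefault ?? s.globalDefault;
}

console.log(resolveTtl({ perTool: 3600, tierDefault: 300 })); // 3600
console.log(resolveTtl({ perCall: 60, perTool: 3600 }));      // 60
```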

Agent-driven cache optimization

The cache is one more system the agent should be able to operate, not just call. An agent reads live cache state via MCP, proposes config changes with reasoning attached, and the system applies them - with built-in safety logic that prevents the tuning loop from making things worse.

What the agent sees

cache_list: List all caches visible to the agent with basic metadata

cache_health: Hit rate, miss rate, latency, and key counts for a named cache

cache_threshold_recommendation: Multi-signal threshold recommendation with adaptive dampening, oscillation detection, outcome tracking, and recall cost guards

cache_tool_effectiveness: Per-tool hit rates, cost-weighted savings in dollars, and TTL recommendations

cache_similarity_distribution: Rolling histogram of similarity scores for semantic caches

cache_recent_changes: Audit trail of recent config changes with measured outcomes - did the last adjustment actually improve the metric it targeted?

What the agent can propose

cache_propose_threshold_adjust: Propose a new similarity threshold with machine-generated reasoning

cache_propose_tool_ttl_adjust: Propose a TTL change for a specific tool based on hit rate data

cache_propose_invalidate: Propose targeted invalidation for a cache namespace or key pattern

Built-in safety logic

The recommendation engine does not just pick a direction. It tracks whether its own adjustments worked and stops when they do not.

Outcome tracking: Each adjustment is evaluated against the metric it targeted. The engine requires a 20% improvement to continue in the same direction.

Adaptive dampening: Step sizes shrink with each consecutive adjustment. After five same-direction moves, the engine declares optimal and stops.

Oscillation detection: If the threshold flips direction three or more times in the recent window, the loop caps itself rather than chasing noise.

Recall cost guard: The engine will not tighten the threshold if doing so would lose more than 15% of current cache hits.

Cost-weighted decisions: Threshold recommendations factor in dollars saved per hit, not just similarity scores, so adjustments account for cost impact.

Human override: Proposals can be reviewed, edited, approved, or rejected via MCP before they take effect. The loop does not require human approval to run, but supports it.
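Four of these guards reduce to simple checks. The sketch below uses the numbers stated above (20%, five moves, three flips, 15%), but everything else is an assumption: the function names, treating the 20% as a relative gain on a higher-is-better metric, and halving the step on each move.

```typescript
type Direction = 'tighten' | 'loosen';

// Outcome tracking: continue only if the targeted metric improved by at least 20%.
// Assumption: higher is better and the improvement is relative to the old value.
function improvedEnough(before: number, after: number): boolean {
  if (before <= 0) return after > 0;
  return (after - before) / before >= 0.2;
}

// Adaptive dampening: the step shrinks with each consecutive same-direction move
// (halving is an assumption) and the engine declares optimal after five.
function nextStep(baseStep: number, consecutiveMoves: number): number | 'optimal' {
  if (consecutiveMoves >= 5) return 'optimal';
  return baseStep / Math.pow(2, consecutiveMoves);
}

// Oscillation detection: three or more direction flips in the recent window.
function isOscillating(recent: Direction[]): boolean {
  let flips = 0;
  for (let i = 1; i < recent.length; i++) {
    if (recent[i] !== recent[i - 1]) flips++;
  }
  return flips >= 3;
}

// Recall cost guard: refuse to tighten if more than 15% of current hits would be lost.
function tighteningAllowed(currentHits: number, projectedHits: number): boolean {
  if (currentHits === 0) return true;
  return (currentHits - projectedHits) / currentHits <= 0.15;
}
```

Each guard is independent, so a tuning loop can run them in sequence and stop at the first one that says no.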

Full working examples for both packages: semantic-cache example and agent-cache example. Cache intelligence requires the Feature.CACHE_INTELLIGENCE entitlement, which is part of the Pro tier.

Semantic prompt caching for similar queries

“What is the capital of France?” and “Capital city of France?” are the same question. Prompt caching via semantic similarity catches what exact-match misses.
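To make the idea concrete, here is a toy sketch of a similarity check. It assumes the threshold is a cosine distance (lower is stricter) and uses made-up three-dimensional vectors in place of real embeddings; the function names are hypothetical.

```typescript
// Cosine distance between two vectors: 0 for the same direction, 1 for orthogonal.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 1; // treat a zero vector as unrelated
  return 1 - dot / Math.sqrt(normA * normB);
}

// A lookup is a hit when the stored prompt is within the distance threshold.
function isSemanticHit(query: number[], stored: number[], threshold: number): boolean {
  return cosineDistance(query, stored) <= threshold;
}

// Toy "embeddings"; real ones come from an embedding model.
const capitalOfFrance = [0.9, 0.1, 0.0];
const capitalCityOfFrance = [0.85, 0.15, 0.05];
const weatherInSofia = [0.1, 0.2, 0.95];

console.log(isSemanticHit(capitalCityOfFrance, capitalOfFrance, 0.15)); // true
console.log(isSemanticHit(weatherInSofia, capitalOfFrance, 0.15));      // false
```

An exact-match cache would miss the paraphrase because the strings differ; a distance check accepts it and still rejects the unrelated question.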

Valkey-native

Handles valkey-search API differences explicitly. Works on ElastiCache, Memorystore, or self-hosted. Not a Redis port. Visualized in BetterDB Monitor's Vector / AI tab.

9 framework adapters

OpenAI, OpenAI Responses, OpenAI Agents SDK, Pydantic AI, Anthropic, LangChain, LlamaIndex, LangGraph, and Vercel AI SDK — no framework lock-in for direct SDK use.

Full observability

Every check() and store() emits OTel spans and Prometheus metrics. Hit rate, similarity scores, latency - zero extra instrumentation.

Cost tracking, zero config

Bundled LiteLLM price table, 1,900+ models. Store token counts at cache time and get exact dollars saved on every hit — including cumulative stats via cache.stats().

TypeScript + Python

Same API shape and features in both languages, each with adapters for its framework ecosystem. Install with npm or pip.

Auto-tuning thresholds

thresholdEffectiveness() analyzes the rolling similarity score window and returns a tighten/loosen/optimal recommendation. With MCP-driven cache intelligence, an agent reads this recommendation and proposes a threshold adjustment - a human approves in BetterDB Monitor, and the library picks it up within seconds. See the closed-loop example.

No other semantic cache library checks all six.

Capability	RedisVL SemanticCache	LangChain RedisSemanticCache	LiteLLM redis-semantic	Upstash semantic-cache	Redis LangCache
Agent-tunable via MCP
Live config updates (no restart)
Valkey-native	Redis only	Redis only	Redis only	Upstash only	Redis Cloud only
Standalone		Requires LangChain	Requires LiteLLM		Managed only
Built-in OTel + Prometheus			Partial		Dashboard only
TypeScript + Python	Python only	Requires LangChain	Python only	JS/TS only	Managed only
Cost tracking (bundled)			Via LiteLLM only

import Valkey from 'iovalkey';
import { SemanticCache } from '@betterdb/semantic-cache';
import { createOpenAIEmbed } from '@betterdb/semantic-cache/embed/openai';

const cache = new SemanticCache({
  client: new Valkey({ host: 'localhost', port: 6399 }),
  embedFn: createOpenAIEmbed(), // or Voyage, Cohere, Bedrock, Ollama
  defaultThreshold: 0.15,       // catch paraphrases with high confidence
});

await cache.initialize();
await cache.store('What is the capital of France?', 'Paris', {
  model: 'gpt-4o', inputTokens: 20, outputTokens: 5,
});

const result = await cache.check('Capital city of France?');
// result.hit === true
// result.confidence === 'high'
// result.costSaved === 0.000105

Five embedding helpers included: createOpenAIEmbed, createVoyageEmbed, createCohereEmbed, createBedrockEmbed, createOllamaEmbed. Requires valkey-search (Valkey 8+ or via modules). For environments without search modules, use @betterdb/agent-cache for exact-match caching.

View @betterdb/semantic-cache on npm → View betterdb-semantic-cache on PyPI →

AI SDK caching for every framework

Drop-in adapters for LangChain, LangGraph (caching and checkpointer), LlamaIndex, OpenAI, OpenAI Responses, and Anthropic — plus the Vercel AI SDK (TypeScript) and the OpenAI Agents SDK and Pydantic AI (Python). One cache library across your entire AI stack. Pick a framework to see the code and what the adapter does.

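import OpenAI from 'openai';
import { hashOpenAIRequest } from '@betterdb/agent-cache/openai';

const openai = new OpenAI();

// `cache` is an AgentCache instance, created as in the Quick start.
// Usage: await ask('What is Valkey?')
async function ask(content: string) {
  const params = {
    model: 'gpt-4o',
    messages: [{ role: 'user' as const, content }],
  };
  // Hash once and reuse the key for both check and store
  const key = hashOpenAIRequest(params);

  const cached = await cache.llm.check(key);
  if (cached.hit) return cached.response;

  // Miss: call the model, then store the response under the same key
  const response = await openai.chat.completions.create(params);
  const text = response.choices[0].message.content;
  await cache.llm.store(key, text);
  return text;
}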

OpenAI cache

Hash chat completion requests — model, messages, temperature, and tools — into a stable cache key with hashOpenAIRequest (TypeScript) or prepare_params (Python). Check before the call, store the response after. Text, image, audio, and tool blocks are all part of the key. TypeScript and Python at parity.

Three cache tiers behind one connection

LLM Response Cache

Cache LLM responses by exact match on model, messages, temperature, and tools. Handles text, images, audio, and file content natively, and caches tool_use and tool_result blocks the same way. Second call returns from Valkey in under 1ms. Cost tracking per model built in.

{prefix}:llm:{sha256}

Tool Result Cache

Cache tool/function call results by tool name and argument hash. Per-tool TTL policies. Invalidate by tool or by specific arguments.

{prefix}:tool:{name}:{sha256}

Session State

Key-value storage for agent session state with sliding window TTL. Individual field expiry. LangGraph checkpoint support on vanilla Valkey - no RedisJSON, no RediSearch.

{prefix}:session:{thread}:{field}
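A sketch of how keys in these three shapes could be built. Only the key patterns come from the text above; the helper names and the canonical-JSON step (sorting object keys before hashing so that argument order does not matter) are assumptions, and the library's own hashing may differ.

```typescript
import { createHash } from 'node:crypto';

// Serialize with sorted object keys so { a: 1, b: 2 } and { b: 2, a: 1 } hash the same.
function canonicalJson(value: unknown): string {
  if (value === undefined) return 'null';
  if (Array.isArray(value)) {
    return '[' + value.map(canonicalJson).join(',') + ']';
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort()
      .map((k) => JSON.stringify(k) + ':' + canonicalJson(obj[k]));
    return '{' + entries.join(',') + '}';
  }
  return JSON.stringify(value);
}

function sha256Hex(input: string): string {
  return createHash('sha256').update(input).digest('hex');
}

// {prefix}:llm:{sha256}
function llmKey(prefix: string, params: unknown): string {
  return `${prefix}:llm:${sha256Hex(canonicalJson(params))}`;
}

// {prefix}:tool:{name}:{sha256}
function toolKey(prefix: string, name: string, args: unknown): string {
  return `${prefix}:tool:${name}:${sha256Hex(canonicalJson(args))}`;
}

// {prefix}:session:{thread}:{field}
function sessionKey(prefix: string, thread: string, field: string): string {
  return `${prefix}:session:${thread}:${field}`;
}
```

Keeping the tool name outside the hash is what makes invalidating every entry for one tool a simple prefix match.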

Pluggable binary normalizer. Images, audio, and file content in multi-modal requests are included in the cache key by default. For image-heavy workloads, swap in a custom BinaryNormalizer to store blobs externally (S3, object storage) and cache by reference instead of by content - so Valkey memory stays bounded even as your multi-modal traffic grows.

Quick start

Up and running in under five minutes. No modules required.

import Valkey from 'iovalkey';
import { AgentCache } from '@betterdb/agent-cache';

const client = new Valkey({ host: 'localhost', port: 6379 });

const cache = new AgentCache({
  client,
  tierDefaults: {
    llm:     { ttl: 3600 },
    tool:    { ttl: 300 },
    session: { ttl: 1800 },
  },
});

// LLM response caching
const params = {
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'What is Valkey?' }],
  temperature: 0,
};

const result = await cache.llm.check(params);
if (!result.hit) {
  const response = await callLlm(params);
  await cache.llm.store(params, response);
}

// Tool result caching
const weather = await cache.tool.check('get_weather', { city: 'Sofia' });
if (!weather.hit) {
  const data = await getWeather({ city: 'Sofia' });
  await cache.tool.store('get_weather', { city: 'Sofia' }, JSON.stringify(data));
}

// Session state
await cache.session.set('thread-1', 'last_intent', 'book_flight');
const intent = await cache.session.get('thread-1', 'last_intent');

Works on vanilla Valkey 7+, ElastiCache, Memorystore, MemoryDB, and any Redis-compatible endpoint.

Why teams choose BetterDB for agent caching

Three cache tiers behind one Valkey connection. No modules required.

Capability	@betterdb/agent-cache	LangChain RedisCache	LangGraph checkpoint-redis	AutoGen RedisStore	LiteLLM Redis	Upstash + Vercel AI SDK
Agent-tunable via MCP
Live config updates (no restart)
Multi-tier (LLM + Tool + State)		LLM only	State only	LLM only	LLM only	LLM only
Built-in OTel + Prometheus					Partial
No modules required			Redis 8 + modules			Upstash only
Base SDK support (OpenAI, Anthropic)
Multi-modal (images, audio, files)
Language support	TypeScript + Python	TS only	TS only	Python only	Python only	TS only
Framework adapters	OpenAI, Anthropic, LangChain, LangGraph, LlamaIndex, Vercel	LC only	LG only	AutoGen only	LiteLLM only	AI SDK only
Zero-config cost tracking	Bundled LiteLLM table, 1,900+ models

View on npm View on PyPI View in action

Full observability out of the box

Every cache operation emits an OpenTelemetry span and updates Prometheus metrics. Zero additional instrumentation.

OpenTelemetry spans

agent_cache.llm.check
agent_cache.llm.store
agent_cache.tool.check
agent_cache.tool.store
agent_cache.session.get
agent_cache.session.set
agent_cache.session.destroyThread

Prometheus metrics

agent_cache_requests_total - Total cache requests (hit/miss by tier)
agent_cache_operation_duration_seconds - Operation latency histogram
agent_cache_cost_saved_total - Estimated cost saved in dollars
agent_cache_stored_bytes_total - Total bytes stored
agent_cache_active_sessions - Approximate active session count

Cost tracking out of the box

Most caching libraries make you maintain your own pricing table. We ship one.

1,900+ models, zero config

Bundled pricing table sourced from LiteLLM's model_prices_and_context_window.json, refreshed on every release. GPT-4o, Claude, Gemini, and everything else LiteLLM tracks.

Override what you need

Pass a costTable to override pricing for specific models. Your entries merge on top of the defaults. Other models keep working.
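The merge behaviour can be pictured as an object spread. This is an illustrative sketch only: the table shape, the field names, the model names, and every price in it are made up, not taken from the bundled table.

```typescript
// Hypothetical price entry, in dollars per token. The numbers are not real prices.
interface ModelPrice {
  inputPerToken: number;
  outputPerToken: number;
}
type CostTable = Record<string, ModelPrice>;

const defaultTable: CostTable = {
  'example-small': { inputPerToken: 1e-6, outputPerToken: 2e-6 },
  'example-large': { inputPerToken: 5e-6, outputPerToken: 10e-6 },
};

// Overrides win for the models they name; every other model keeps its default.
function mergeCostTables(defaults: CostTable, overrides: CostTable): CostTable {
  return { ...defaults, ...overrides };
}

// Dollars saved by one cache hit: the cost of the call that did not happen.
function costSaved(
  table: CostTable,
  model: string,
  inputTokens: number,
  outputTokens: number,
): number {
  const price = table[model];
  if (!price) return 0; // unknown model: nothing to attribute
  return inputTokens * price.inputPerToken + outputTokens * price.outputPerToken;
}
```

Overriding one model leaves every other entry untouched, which matches the "other models keep working" behaviour described above.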

Turn it off if you want to

Set useDefaultCostTable: false (TypeScript) or use_default_cost_table=False (Python) to bring your own table. Same behaviour as before 0.4.0.

See your cache working in BetterDB Monitor

Cache hit rate, similarity latency, and index health show up natively in BetterDB Monitor's Vector / AI tab. @betterdb/semantic-cache uses FT.SEARCH under the hood, so the monitor sees it automatically. One instance, no extra wiring.

Explore the monitor

Known limitations

These apply to @betterdb/agent-cache and betterdb-agent-cache.

LangChain streaming is not cached. The BetterDBLlmCache adapter caches complete generation results including token counts. If your LangChain model uses .stream() instead of .invoke(), responses bypass the cache. Use .invoke() for cacheable calls.

LangGraph list() loads all checkpoint data for a thread into memory before filtering. Fine for typical agent deployments. For threads with thousands of large checkpoints, consider langgraph-checkpoint-redis with Redis 8+.

Add memory and caching to your agents

Install the SDK and get agent memory, semantic caching, and retrieval in one library. Self-host on a Valkey you already run — or let us provision a managed Valkey with the search module, no setup required.

Get Started for Free View Documentation View on GitHub