Self-tuning Valkey for AI agents
Exact-match and semantic caching for AI agents, backed by Valkey. Works with the OpenAI and Anthropic SDKs directly, plus adapters for LangChain, LangGraph, LlamaIndex, and (TypeScript only) the Vercel AI SDK.
npm install @betterdb/agent-cache iovalkey
TypeScript. LLM, tool, and session tiers.
pip install betterdb-agent-cache
Python. Same three tiers, same adapters.
npm install @betterdb/semantic-cache iovalkey
TypeScript. Similarity-based caching with valkey-search.
pip install betterdb-semantic-cache
Python. Full feature parity.
AI SDK caching for every framework
Drop-in adapters for LangChain, LangGraph (caching and checkpointer), LlamaIndex, Vercel AI SDK, OpenAI, and Anthropic. TypeScript and Python at full parity. One cache library across your entire AI stack.
LangChain
LangGraph
LlamaIndex
Vercel AI SDK
OpenAI
Anthropic
Vercel AI SDK cache
The Vercel AI SDK adapter caches complete responses from generateText and generateObject calls. TypeScript only, since that is where the Vercel AI SDK lives. Known limitation: streaming responses via streamText are not cached - accumulate the full response before storing.
LangGraph cache and checkpointer
LLM and tool result caching for LangGraph agents, plus a full checkpoint saver that runs on vanilla Valkey - no RedisJSON, no RediSearch modules required. Thread state, channel data, and pending writes are persisted to Valkey with per-thread TTL. TypeScript and Python at parity.
LangChain caching
The BetterDBLlmCache adapter plugs into LangChain's cache interface and stores complete generation results including token counts. Works with any LangChain model via .invoke(). Streaming via .stream() bypasses the cache - use .invoke() for cacheable calls. TypeScript and Python.
Reduce LLM costs without changing your prompts
AI agent costs are dominated by repeated LLM calls and tool invocations. Exact-match caching returns identical responses from Valkey instead of calling the model again. Semantic caching catches paraphrases that exact-match misses. Together they typically produce cache hit rates of 20-70%, depending on how repetitive your traffic is.
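To make the hit-rate range concrete, here is a minimal sketch of the arithmetic. It assumes every call costs the same and that a cache hit costs nothing. The helper name `expectedCostWithCache` and the sample numbers are illustrative and not part of either package.

```typescript
// Hypothetical helper: expected spend once a fraction of calls is served from cache.
function expectedCostWithCache(
  calls: number,
  costPerCallUsd: number,
  hitRate: number,
): { baselineUsd: number; withCacheUsd: number; savedUsd: number } {
  if (hitRate < 0 || hitRate > 1) {
    throw new RangeError("hitRate must be between 0 and 1");
  }
  const baselineUsd = calls * costPerCallUsd;
  // Only misses reach the model; hits are treated as free in this sketch.
  const withCacheUsd = baselineUsd * (1 - hitRate);
  return { baselineUsd, withCacheUsd, savedUsd: baselineUsd - withCacheUsd };
}

// The two ends of the 20-70% range, for 10,000 calls at an assumed $0.01 each.
const lowEnd = expectedCostWithCache(10000, 0.01, 0.2);
const highEnd = expectedCostWithCache(10000, 0.01, 0.7);
console.log(lowEnd.savedUsd.toFixed(2), highEnd.savedUsd.toFixed(2));
```

Real savings depend on per-model pricing and on how long responses are, so treat this as an order-of-magnitude estimate.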
LLM cost tracking in BetterDB Monitor
BetterDB Monitor surfaces your cache hit rate and cumulative cost saved in real time, so you can see the dollar impact of caching without wiring up a separate dashboard. See it live at chat.betterdb.com.
Three cache tiers behind one connection
LLM Response Cache
Cache LLM responses by exact match on model, messages, temperature, and tools. Handles text, images, audio, and file content natively, and caches tool_use and tool_result blocks the same way. Second call returns from Valkey in under 1ms. Cost tracking per model built in.
{prefix}:llm:{sha256}
Tool Result Cache
Cache tool/function call results by tool name and argument hash. Per-tool TTL policies. Invalidate by tool or by specific arguments.
{prefix}:tool:{name}:{sha256}
Session State
Key-value storage for agent session state with sliding window TTL. Individual field expiry. LangGraph checkpoint support on vanilla Valkey - no RedisJSON, no RediSearch.
{prefix}:session:{thread}:{field}
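The three key patterns above can be sketched as follows. This is an illustration of the general technique (canonical serialization followed by SHA-256), not the library's actual hashing code. The helper names and the exact fields fed into the hash are assumptions.

```typescript
import { createHash } from "node:crypto";

// Serialize with object keys sorted so that logically equal inputs
// always produce the same string, and therefore the same hash.
function canonicalJson(value: unknown): string {
  if (value === undefined) return "null";
  if (Array.isArray(value)) {
    return "[" + value.map((v) => canonicalJson(v)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .filter((k) => obj[k] !== undefined)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + canonicalJson(obj[k]))
      .join(",");
    return "{" + body + "}";
  }
  return JSON.stringify(value);
}

function sha256Hex(input: string): string {
  return createHash("sha256").update(input).digest("hex");
}

// Illustrative request shape covering the fields the LLM tier matches on.
interface LlmParams {
  model: string;
  messages: unknown[];
  temperature?: number;
  tools?: unknown[];
}

const llmKey = (prefix: string, p: LlmParams): string =>
  `${prefix}:llm:${sha256Hex(canonicalJson(p))}`;

const toolKey = (prefix: string, name: string, args: unknown): string =>
  `${prefix}:tool:${name}:${sha256Hex(canonicalJson(args))}`;

const sessionKey = (prefix: string, thread: string, field: string): string =>
  `${prefix}:session:${thread}:${field}`;

const keyA = toolKey("app", "get_weather", { city: "Sofia", units: "c" });
const keyB = toolKey("app", "get_weather", { units: "c", city: "Sofia" });
console.log(keyA === keyB); // true: argument order does not change the key
```

Sorting keys before hashing is what makes exact-match caching robust to incidental differences in how callers build their argument objects.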
Pluggable binary normalizer. Images, audio, and file content in multi-modal requests are included in the cache key by default. For image-heavy workloads, swap in a custom BinaryNormalizer to store blobs externally (S3, object storage) and cache by reference instead of by content - so Valkey memory stays bounded even as your multi-modal traffic grows.
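The difference between caching by content and caching by reference can be sketched like this. The `BinaryPart` and `NormalizerSketch` types below are hypothetical stand-ins chosen for illustration. They are not the library's actual BinaryNormalizer interface, which may look quite different.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes for illustration only.
interface BinaryPart {
  bytes?: Uint8Array; // inline content
  ref?: string;       // external location, e.g. an object-storage URL
}
type NormalizerSketch = (part: BinaryPart) => string;

// Content-based: identical bytes give identical tokens, wherever they came from.
const byContent: NormalizerSketch = (part) => {
  if (!part.bytes) throw new Error("byContent needs inline bytes");
  return "sha256:" + createHash("sha256").update(part.bytes).digest("hex");
};

// Reference-based: the cache key only sees a short pointer; the blob stays external.
const byReference: NormalizerSketch = (part) => {
  if (!part.ref) throw new Error("byReference needs an external ref");
  return "ref:" + part.ref;
};

const image = Uint8Array.from([137, 80, 78, 71]);
console.log(byContent({ bytes: image }));
console.log(byReference({ ref: "s3://example-bucket/cat.png" }));
```

The trade-off in this sketch is that a reference-based token treats two uploads of the same image under different paths as different inputs, while a content hash does not.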
Quick start
Up and running in under five minutes. No modules required.
import Valkey from 'iovalkey';
import { AgentCache } from '@betterdb/agent-cache';
const client = new Valkey({ host: 'localhost', port: 6379 });
const cache = new AgentCache({
client,
tierDefaults: {
llm: { ttl: 3600 },
tool: { ttl: 300 },
session: { ttl: 1800 },
},
});
// LLM response caching
const params = {
model: 'gpt-4o',
messages: [{ role: 'user', content: 'What is Valkey?' }],
temperature: 0,
};
const result = await cache.llm.check(params);
if (!result.hit) {
const response = await callLlm(params);
await cache.llm.store(params, response);
}
// Tool result caching
const weather = await cache.tool.check('get_weather', { city: 'Sofia' });
if (!weather.hit) {
const data = await getWeather({ city: 'Sofia' });
await cache.tool.store('get_weather', { city: 'Sofia' }, JSON.stringify(data));
}
// Session state
await cache.session.set('thread-1', 'last_intent', 'book_flight');
const intent = await cache.session.get('thread-1', 'last_intent');
Works on vanilla Valkey 7+, ElastiCache, Memorystore, MemoryDB, and any Redis-compatible endpoint.
Drop-in framework adapters
Works with the tools you already use. No framework lock-in.
import OpenAI from 'openai';
import { hashOpenAIRequest } from '@betterdb/agent-cache/openai';
const openai = new OpenAI();
const params = {
model: 'gpt-4o',
messages: [{ role: 'user', content: 'What is Valkey?' }],
};
const cached = await cache.llm.check(hashOpenAIRequest(params));
if (cached.hit) return cached.response;
const response = await openai.chat.completions.create(params);
await cache.llm.store(hashOpenAIRequest(params), response.choices[0].message.content);
See what caching saves you
Built-in cost tracking shows exactly how much you're saving per model and per tool.
const stats = await cache.stats();
// {
// llm: { hits: 150, misses: 50, hitRate: 0.75 },
// tool: { hits: 300, misses: 100, hitRate: 0.75 },
// session: { reads: 1000, writes: 500 },
// costSavedMicros: 12500000, // $12.50 saved
// perTool: {
// get_weather: { hits: 200, misses: 50, hitRate: 0.8 },
// }
// }
75%
LLM hit rate in this example
$12.50
Saved from 150 cache hits at gpt-4o pricing
<1ms
Cache hit latency vs seconds for a full LLM call
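The headline numbers follow directly from the stats object. Here is a small sketch of the arithmetic, with helper names (`hitRate`, `microsToUsd`) that are illustrative rather than part of the package.

```typescript
interface TierCounts {
  hits: number;
  misses: number;
}

// Fraction of lookups answered from cache; 0 when there has been no traffic.
function hitRate({ hits, misses }: TierCounts): number {
  const total = hits + misses;
  return total === 0 ? 0 : hits / total;
}

// costSavedMicros is expressed in millionths of a dollar.
function microsToUsd(micros: number): number {
  return micros / 1000000;
}

console.log(hitRate({ hits: 150, misses: 50 })); // 0.75
console.log(microsToUsd(12500000)); // 12.5
```

Keeping money in integer micros until display time avoids accumulating floating-point error across many small per-call savings.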
agent-cache also includes toolEffectiveness() which ranks your cached tools by hit rate and recommends TTL adjustments - increase, optimal, or decrease/disable - so caching stays efficient as your workload evolves.
You can also see the benefits live — we use this caching in our own BetterDB Chat.
TTL policies and self-optimization
Hit rate drives TTL. No manual tuning required.
const effectiveness = await cache.toolEffectiveness();
// [
// { tool: 'get_weather', hitRate: 0.85, costSaved: 5.00,
// recommendation: 'increase_ttl' },
// { tool: 'search', hitRate: 0.6, costSaved: 2.50,
// recommendation: 'optimal' },
// { tool: 'rare_api', hitRate: 0.1, costSaved: 0.10,
// recommendation: 'decrease_ttl_or_disable' },
// ]
| Recommendation | Criteria |
|---|---|
| increase_ttl | Hit rate > 80% and current TTL < 1 hour |
| optimal | Hit rate 40-80% |
| decrease_ttl_or_disable | Hit rate < 40% |
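The criteria in the table can be read as a simple decision function. The sketch below is one interpretation, not the library's implementation. The table does not say what happens when the hit rate is above 80% and the TTL is already an hour or more, so this version assumes that case is reported as optimal.

```typescript
type Recommendation = "increase_ttl" | "optimal" | "decrease_ttl_or_disable";

function recommend(hitRate: number, currentTtlSeconds: number): Recommendation {
  if (hitRate < 0.4) return "decrease_ttl_or_disable";
  if (hitRate > 0.8 && currentTtlSeconds < 3600) return "increase_ttl";
  // Assumption: a high hit rate with a TTL of one hour or more counts as optimal.
  return "optimal";
}

console.log(recommend(0.85, 300)); // increase_ttl
console.log(recommend(0.6, 300));  // optimal
console.log(recommend(0.1, 300));  // decrease_ttl_or_disable
```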
TTL follows a clear precedence: per-call TTL overrides per-tool policy, which overrides tier default, which overrides global default. When toolEffectiveness() recommends increase_ttl, apply it with cache.tool.setPolicy('get_weather', { ttl: 3600 }) - the policy persists to Valkey and takes effect immediately without restarting your application.
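The precedence chain can be expressed as a first-defined-wins lookup. This is a minimal sketch of that idea. The `TtlSources` shape and `resolveTtl` name are assumptions made for illustration.

```typescript
interface TtlSources {
  perCall?: number;       // TTL passed on an individual store call
  toolPolicy?: number;    // per-tool policy
  tierDefault?: number;   // default for the tier (llm, tool, session)
  globalDefault?: number; // fallback for everything else
}

// The most specific setting that is defined wins.
function resolveTtl(s: TtlSources): number | undefined {
  return s.perCall ?? s.toolPolicy ?? s.tierDefault ?? s.globalDefault;
}

console.log(resolveTtl({ toolPolicy: 3600, tierDefault: 300 })); // 3600
console.log(resolveTtl({ perCall: 60, toolPolicy: 3600 }));      // 60
```

Using `??` rather than `||` means an explicit TTL of 0 is respected instead of falling through to the next level.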
Agent-driven cache optimization
The cache is one more system the agent should be able to operate, not just call. An agent reads live cache state via MCP, proposes config changes with reasoning attached, and the system applies them - with built-in safety logic that prevents the tuning loop from making things worse.
What the agent sees
- cache_list: List all caches visible to the agent with basic metadata
- cache_health: Hit rate, miss rate, latency, and key counts for a named cache
- cache_threshold_recommendation: Multi-signal threshold recommendation with adaptive dampening, oscillation detection, outcome tracking, and recall cost guards
- cache_tool_effectiveness: Per-tool hit rates, cost-weighted savings in dollars, and TTL recommendations
- cache_similarity_distribution: Rolling histogram of similarity scores for semantic caches
- cache_recent_changes: Audit trail of recent config changes with measured outcomes - did the last adjustment actually improve the metric it targeted?
What the agent can propose
- cache_propose_threshold_adjust: Propose a new similarity threshold with machine-generated reasoning
- cache_propose_tool_ttl_adjust: Propose a TTL change for a specific tool based on hit rate data
- cache_propose_invalidate: Propose targeted invalidation for a cache namespace or key pattern
Built-in safety logic
The recommendation engine does not just pick a direction. It tracks whether its own adjustments worked and stops when they do not.
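The page does not spell out the algorithm, so the following is only a sketch of how outcome tracking and oscillation detection could work in principle. The `Adjustment` shape, the window of four, and both function names are assumptions, not the product's actual logic.

```typescript
interface Adjustment {
  delta: number;        // signed change applied to the setting
  metricBefore: number; // e.g. hit rate before the change
  metricAfter: number;  // same metric measured afterwards
}

// Oscillation: each of the last `window` adjustments reversed the previous one.
function isOscillating(history: Adjustment[], window = 4): boolean {
  const recent = history.slice(-window);
  if (recent.length < window) return false;
  return recent.every(
    (adj, i) => i === 0 || Math.sign(adj.delta) !== Math.sign(recent[i - 1]!.delta),
  );
}

// Stop proposing changes if the last one hurt, or if the loop is flip-flopping.
function shouldKeepTuning(history: Adjustment[]): boolean {
  const last = history[history.length - 1];
  if (last === undefined) return true;
  if (last.metricAfter < last.metricBefore) return false;
  return !isOscillating(history);
}

const flipFlop: Adjustment[] = [0.02, -0.02, 0.02, -0.02].map((delta) => ({
  delta,
  metricBefore: 0.5,
  metricAfter: 0.51,
}));
console.log(isOscillating(flipFlop), shouldKeepTuning(flipFlop));
```

A production version would likely also dampen the step size over time, which this sketch leaves out.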
Full working examples for both packages: semantic-cache example and agent-cache example. Cache intelligence requires the Feature.CACHE_INTELLIGENCE entitlement, which is part of the Pro tier.
Why teams choose BetterDB for agent caching
Three cache tiers behind one Valkey connection. No modules required.
| Capability | @betterdb/agent-cache | LangChain RedisCache | LangGraph checkpoint-redis | AutoGen RedisStore | LiteLLM Redis | Upstash + Vercel AI SDK |
|---|---|---|---|---|---|---|
| Agent-tunable via MCP | ||||||
| Live config updates (no restart) | ||||||
| Multi-tier (LLM + Tool + State) | | LLM only | State only | LLM only | LLM only | LLM only |
| Built-in OTel + Prometheus | Partial | |||||
| No modules required | | | Redis 8 + modules | | | Upstash only |
| Base SDK support (OpenAI, Anthropic) | ||||||
| Multi-modal (images, audio, files) | ||||||
| Language support | TypeScript + Python | TS only | TS only | Python only | Python only | TS only |
| Framework adapters | OpenAI, Anthropic, LangChain, LangGraph, LlamaIndex, Vercel | LC only | LG only | AutoGen only | LiteLLM only | AI SDK only |
| Zero-config cost tracking | Bundled LiteLLM table, 1,900+ models |
Full observability out of the box
Every cache operation emits an OpenTelemetry span and updates Prometheus metrics. Zero additional instrumentation.
OpenTelemetry spans
- agent_cache.llm.check
- agent_cache.llm.store
- agent_cache.tool.check
- agent_cache.tool.store
- agent_cache.session.get
- agent_cache.session.set
- agent_cache.session.destroyThread
Prometheus metrics
- agent_cache_requests_total: Total cache requests (hit/miss by tier)
- agent_cache_operation_duration_seconds: Operation latency histogram
- agent_cache_cost_saved_total: Estimated cost saved in dollars
- agent_cache_stored_bytes_total: Total bytes stored
- agent_cache_active_sessions: Approximate active session count
Cost tracking out of the box
Most caching libraries make you maintain your own pricing table. We ship one.
1,900+ models, zero config
Bundled pricing table sourced from LiteLLM's model_prices_and_context_window.json, refreshed on every release. GPT-4o, Claude, Gemini, and everything else LiteLLM tracks.
Override what you need
Pass a costTable to override pricing for specific models. Your entries merge on top of the defaults. Other models keep working.
Turn it off if you want to
Set useDefaultCostTable: false (TypeScript) or use_default_cost_table=False (Python) to bring your own table. Same behaviour as before 0.4.0.
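The merge behaviour described above can be sketched with a plain object spread. The model names and prices below are placeholders, not real pricing, and `buildCostTable` and `costMicros` are illustrative helpers rather than the library's API.

```typescript
interface ModelPrice {
  inputMicrosPerToken: number;
  outputMicrosPerToken: number;
}
type CostTable = Record<string, ModelPrice>;

// Placeholder numbers for illustration only; these are not real prices.
const bundledDefaults: CostTable = {
  "model-a": { inputMicrosPerToken: 2, outputMicrosPerToken: 8 },
  "model-b": { inputMicrosPerToken: 1, outputMicrosPerToken: 4 },
};

// Overrides win per model; untouched models keep their default price.
function buildCostTable(overrides: CostTable = {}, useDefaults = true): CostTable {
  return useDefaults ? { ...bundledDefaults, ...overrides } : { ...overrides };
}

function costMicros(
  table: CostTable,
  model: string,
  inputTokens: number,
  outputTokens: number,
): number | undefined {
  const price = table[model];
  if (price === undefined) return undefined; // unknown model: no cost estimate
  return inputTokens * price.inputMicrosPerToken + outputTokens * price.outputMicrosPerToken;
}

const override: CostTable = { "model-a": { inputMicrosPerToken: 3, outputMicrosPerToken: 9 } };
const merged = buildCostTable(override);
const ownOnly = buildCostTable(override, false);
console.log(costMicros(merged, "model-a", 20, 5), costMicros(merged, "model-b", 20, 5));
console.log(costMicros(ownOnly, "model-b", 20, 5));
```

With the defaults switched off, only models present in your own table produce a cost estimate.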
See your cache working in the BetterDB monitor
Cache hit rate, similarity latency, and index health show up natively in BetterDB Monitor's Vector / AI tab. @betterdb/semantic-cache uses FT.SEARCH under the hood, so the monitor sees it automatically. One instance, no extra wiring.
Semantic prompt caching for similar queries
“What is the capital of France?” and “Capital city of France?” are the same question. Prompt caching via semantic similarity catches what exact-match misses.
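As a rough illustration of how a similarity check decides on a hit, the sketch below compares two embedding vectors by cosine distance. It assumes the threshold is a distance, so lower means stricter. The toy two-dimensional vectors stand in for real embeddings, and the actual lookup in the library runs inside valkey-search rather than in application code.

```typescript
// Cosine distance: 0 for vectors pointing the same way, 1 for orthogonal ones.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i]!;
    const y = b[i]!;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) throw new Error("zero vector has no direction");
  return 1 - dot / Math.sqrt(normA * normB);
}

// Assumption: a hit is any stored entry within `threshold` distance of the query.
function isSemanticHit(query: number[], stored: number[], threshold = 0.15): boolean {
  return cosineDistance(query, stored) <= threshold;
}

console.log(isSemanticHit([1, 0.1], [1, 0])); // near-identical direction
console.log(isSemanticHit([1, 0], [0, 1]));   // unrelated direction
```

Raising the threshold catches looser paraphrases at the cost of more false hits, which is the trade-off the tuning features are meant to manage.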
Valkey-native
Handles valkey-search API differences explicitly. Works on ElastiCache, Memorystore, or self-hosted. Not a Redis port. Visualized in BetterDB Monitor's Vector / AI tab.
7 framework adapters
OpenAI, OpenAI Responses, Anthropic, LangChain, LlamaIndex, LangGraph, and Vercel AI SDK — no framework lock-in for direct SDK use.
Full observability
Every check() and store() emits OTel spans and Prometheus metrics. Hit rate, similarity scores, latency - zero extra instrumentation.
Cost tracking, zero config
Bundled LiteLLM price table, 1,900+ models. Store token counts at cache time and get exact dollars saved on every hit — including cumulative stats via cache.stats().
TypeScript + Python
Full parity. Same adapters, same API shape, same features in both languages. Install with npm or pip.
Auto-tuning thresholds
thresholdEffectiveness() analyzes the rolling similarity score window and returns a tighten/loosen/optimal recommendation. With MCP-driven cache intelligence, an agent reads this recommendation and proposes a threshold adjustment - a human approves in BetterDB Monitor, and the library picks it up within seconds. See the closed-loop example.
No other semantic cache library checks all six.
| Capability | @betterdb/semantic-cache | RedisVL SemanticCache | LangChain RedisSemanticCache | LiteLLM redis-semantic | Upstash semantic-cache | Redis LangCache |
|---|---|---|---|---|---|---|
| Agent-tunable via MCP | ||||||
| Live config updates (no restart) | ||||||
| Valkey-native | | Redis only | Redis only | Redis only | Upstash only | Redis Cloud only |
| Standalone | | | Requires LangChain | Requires LiteLLM | | Managed only |
| Built-in OTel + Prometheus | Partial | Dashboard only | ||||
| TypeScript + Python | | Python only | Requires LangChain | Python only | JS/TS only | Managed only |
| Cost tracking (bundled) | | | | Via LiteLLM only | | |
import Valkey from 'iovalkey';
import { SemanticCache } from '@betterdb/semantic-cache';
import { createOpenAIEmbed } from '@betterdb/semantic-cache/embed/openai';
const cache = new SemanticCache({
client: new Valkey({ host: 'localhost', port: 6399 }),
embedFn: createOpenAIEmbed(), // or Voyage, Cohere, Bedrock, Ollama
defaultThreshold: 0.15, // catch paraphrases with high confidence
});
await cache.initialize();
await cache.store('What is the capital of France?', 'Paris', {
model: 'gpt-4o', inputTokens: 20, outputTokens: 5,
});
const result = await cache.check('Capital city of France?');
// result.hit === true
// result.confidence === 'high'
// result.costSaved === 0.000105
Five embedding helpers included: createOpenAIEmbed, createVoyageEmbed, createCohereEmbed, createBedrockEmbed, createOllamaEmbed. Requires valkey-search (Valkey 8+ or via modules). For environments without search modules, use @betterdb/agent-cache for exact-match caching.
Known limitations
These apply to @betterdb/agent-cache and betterdb-agent-cache.
Streaming responses are not cached by the Vercel AI SDK adapter. Accumulate the full response before caching.
LangChain streaming is similarly not cached. The BetterDBLlmCache adapter caches complete generation results including token counts. If your LangChain model uses .stream() instead of .invoke(), responses bypass the cache. Use .invoke() for cacheable calls.
LangGraph list() loads all checkpoint data for a thread into memory before filtering. Fine for typical agent deployments. For threads with thousands of large checkpoints, consider langgraph-checkpoint-redis with Redis 8+.
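For the two streaming limitations above, "accumulate the full response" can look like the following sketch. It uses an in-memory Map and a fake stream as stand-ins so that it runs on its own. In practice the Map would be replaced by the cache's check and store calls, and `fakeStream` by the SDK's streaming call.

```typescript
// Drain a stream of text chunks into one complete string.
async function collectText(stream: AsyncIterable<string>): Promise<string> {
  let full = "";
  for await (const chunk of stream) {
    full += chunk;
  }
  return full;
}

// Stand-in for a streaming model call; counts how often it is invoked.
let modelCalls = 0;
async function* fakeStream(): AsyncGenerator<string> {
  modelCalls += 1;
  yield "Val";
  yield "key";
}

// Stand-in for the cache; a real setup would store in Valkey instead.
const memoryStore = new Map<string, string>();

async function cachedStreamedCall(
  key: string,
  startStream: () => AsyncIterable<string>,
): Promise<string> {
  const hit = memoryStore.get(key);
  if (hit !== undefined) return hit;
  // Only the complete text is stored, never partial chunks.
  const text = await collectText(startStream());
  memoryStore.set(key, text);
  return text;
}

const demo = cachedStreamedCall("q1", fakeStream).then(async (first) => {
  const second = await cachedStreamedCall("q1", fakeStream);
  console.log(first, second, modelCalls);
  return { first, second };
});
```

The caller loses incremental output on a miss in this sketch. Forwarding chunks to the user while also accumulating them is possible, but the store should still happen only once the stream has finished.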
Ready to get started?
Start monitoring in minutes - no infrastructure to maintain. Team collaboration, agent-based monitoring for private databases, and more. Or self-host - open source core, zero lock-in.