Part 1 of 2 in the “AI Agent Memory” series. Short-Term Memory with Spring Boot + LangChain4j.
The Demo that wasn’t
We got back from our offsite pretty excited. We had just walked through a POC showing how metadata actually earns its keep in an AI-first world. With a solid semantic layer, our analysts could spend more time cooking and less time wandering around the data pantry asking, “where did we put the salt again?” The demo was slick. Agents could find the right data, make suggestions, even tap into technical metadata to build and run queries for users. It felt like we cracked something.
Naturally, we fast-tracked it into our test environment so the product team could start showing it off. And then… the call came in from my product owner. Turns out, with multiple users, the system had the memory of a goldfish. Each interaction politely erased the previous one. Users were basically stepping on each other’s conversations like it was a group chat gone wrong. Nothing like a demo working perfectly… until more than one person uses it.
Good reminder: getting a POC to work is the easy part. Getting it to behave in the real world – hold context, handle multiple users – that’s where the real engineering starts.
The solution sounds easy: add memory. This two-part series covers how we tackled it: the difference between short-term and long-term memory, when you actually need each, and what it takes to build them properly (so your system remembers more than a goldfish).
Why Local?
At work, I use OpenAI. I’ve gravitated toward local models for tutorials for a few reasons. First, it removes the “I don’t have an API key” barrier for readers. Second, running locally forces you to think about latency and resource constraints – things that matter enormously in production. Third, Ollama has gotten remarkably good, and qwen3 is a genuinely capable model for this kind of work.
The patterns in this series work identically with any LLM provider. Swapping Ollama for OpenAI or Anthropic is a trivial change. So learn the patterns here, then deploy against whatever your organization uses in production. The assumption here is that you have a GPU that Ollama can use.
Prerequisites: Getting Ollama Running
At some point, these prerequisites will disappear. I should probably follow the DRY principle on my own blog. If you’ve been following along, you’ve seen this step more times than you signed up for; it’s starting to feel like déjà vu.
Before we write a line of code, you need Ollama installed and the models pulled:
# Install Ollama (Linux/macOS)
curl -fsSL https://ollama.ai/install.sh | sh
# Pull the chat model
ollama pull qwen3
# Verify it works
ollama run qwen3 "Say hello in one sentence."
Ollama runs as a local HTTP server on http://localhost:11434. LangChain4j talks to it through the same REST API it would use for any remote provider.
Resource note: qwen3 (8B parameters) needs about 6GB of VRAM/RAM. If you’re memory-constrained, qwen3:0.6b or qwen3:1.7b run on almost anything. I run the latter on my Nvidia GTX 1065 with 4GB of VRAM.
A Quick Note on qwen3’s Thinking Mode
qwen3 is a reasoning model. By default it wraps its internal reasoning in <think>...</think> blocks before the final response. Most of the time you’ll want to strip those. We’ll handle this in our service layer with a one-liner:
private static String clean(String response) {
    return response.replaceAll("(?s)<think>.*?</think>\\s*", "").trim();
}
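To see what that regex actually removes, here is a small standalone sketch. The class name `ThinkTagDemo` and the sample model output are made up for illustration; only the pattern comes from the one-liner above.

```java
// Standalone sketch: shows what the think-tag regex removes.
class ThinkTagDemo {

    // Same pattern as clean(): (?s) lets '.' match newlines, and '.*?' is
    // non-greedy, so each <think> block is removed on its own.
    static String clean(String response) {
        return response.replaceAll("(?s)<think>.*?</think>\\s*", "").trim();
    }

    public static void main(String[] args) {
        // Hypothetical raw model output, invented for this example.
        String raw = "<think>\nThe user wants a greeting.\nKeep it short.\n</think>\n\nHello there!";
        System.out.println(clean(raw)); // prints: Hello there!
    }
}
```

Without the `(?s)` flag the pattern would not match reasoning that spans several lines, which is the usual case.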
What Is Short-Term Memory in an AI Agent?
Think of short-term memory as the agent’s working notepad: a scratchpad for whatever is going on in the current conversation. It’s the digital equivalent of scribbling things down on a sticky note (for anyone who still remembers what paper notepads look like). It’s the difference between an assistant that follows the thread of a conversation and one that treats every message as if it were the first.
Every LLM, at its core, is stateless. It processes a prompt and returns a completion – no memory of anything before that prompt. Short-term memory is the engineering layer we build around the model to simulate continuity. We do this by injecting conversation history back into each request.
[System Prompt]
[Previous Turn 1: User]
[Previous Turn 1: Assistant]
[Previous Turn 2: User]
[Previous Turn 2: Assistant]
...
[Current User Message]
The model reads the whole thread every time. From its perspective, it has “always known” what was said earlier. We’re cheating, in the best possible way.
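The replay idea can be sketched in plain Java with no framework at all. This is a minimal illustration, not how LangChain4j implements it internally; `ConversationReplay` and `Message` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of "memory by replay": every request carries the full history.
class ConversationReplay {

    // A chat message with a role ("system", "user", "assistant") and its text.
    record Message(String role, String text) {}

    private final Message systemPrompt;
    private final List<Message> history = new ArrayList<>();

    ConversationReplay(String systemPrompt) {
        this.systemPrompt = new Message("system", systemPrompt);
    }

    // Builds the message list that would be sent to the model for this turn:
    // system prompt, then all previous turns, then the new user message.
    List<Message> buildRequest(String userMessage) {
        List<Message> request = new ArrayList<>();
        request.add(systemPrompt);
        request.addAll(history);
        request.add(new Message("user", userMessage));
        return request;
    }

    // Records a completed turn so that the next request includes it.
    void recordTurn(String userMessage, String assistantReply) {
        history.add(new Message("user", userMessage));
        history.add(new Message("assistant", assistantReply));
    }
}
```

The first request holds two messages (system prompt plus the user message). After one recorded turn, the next request holds four. The request grows with every turn, which is why the windowing strategies later in this post exist.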
Short-Term vs. Long-Term Memory: When to Use Which
Before we write a single line of code, let’s be precise about when short-term memory is the right tool:
| Scenario | Short-Term | Long-Term |
|---|---|---|
| Multi-turn conversation in one session | ✅ | ❌ |
| Remembering user said “I prefer Python” last week | ❌ | ✅ |
| A customer support chat resolving one ticket | ✅ | ❌ |
| A personal assistant that knows your name and goals | ❌ | ✅ |
| A coding assistant mid-refactor | ✅ | ❌ |
| Storing summaries of past user interactions | ❌ | ✅ |
Use short-term memory when:
- Context is session-scoped and ephemeral
- The interaction is a focused, bounded task
- You don’t want the overhead of embedding calls on every turn
- Privacy is a concern (in-session data, not persisted)
Skip short-term memory (or truncate aggressively) when:
- Conversations get very long (even with a local model, every extra token costs context space, memory, and latency)
- The task is single-turn by nature
- You’re running batch processing pipelines
Long-term memory will be covered in Part 2. The punchline: most production agents need both, layered together.
LangChain4j’s Memory Model
LangChain4j gives us two concrete strategies for short-term memory, both implementing the ChatMemory interface:
1. MessageWindowChatMemory
Keeps the last N messages (user + assistant turns). Simple, predictable, and the most common choice. When the window is full, the oldest messages are evicted.
Window size: 6 messages
[Msg 1 - User] ← evicted when Msg 7 arrives
[Msg 2 - AI] ← evicted when Msg 8 arrives
[Msg 3 - User]
[Msg 4 - AI]
[Msg 5 - User]
[Msg 6 - AI]
When to use: Default choice. Customer support, conversational Q&A, interactive agents where conversation depth is predictable.
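The eviction behaviour in the diagram above can be sketched with a plain deque. This is a simplified illustration with a hypothetical class name, using strings in place of chat messages; the real MessageWindowChatMemory has extra rules (for example around system messages) that are not modelled here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Simplified sketch of a message-count window: keep only the newest N messages.
class SlidingWindowSketch {

    private final int maxMessages;
    private final Deque<String> messages = new ArrayDeque<>();

    SlidingWindowSketch(int maxMessages) {
        this.maxMessages = maxMessages;
    }

    // Appends a message and evicts from the oldest end while over the limit.
    void add(String message) {
        messages.addLast(message);
        while (messages.size() > maxMessages) {
            messages.removeFirst();
        }
    }

    // Returns the current window, oldest first.
    List<String> messages() {
        return List.copyOf(messages);
    }
}
```

With a window of 6, adding "Msg 7" drops "Msg 1" and adding "Msg 8" drops "Msg 2", matching the diagram.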
2. TokenWindowChatMemory
Keeps messages up to a token budget. More accurate for controlling cost and staying within model context limits. Requires a Tokenizer.
When to use: When message length is highly variable (some turns are one word, others are paragraphs) or when working with models that have tight context windows. With local models, context window size is a hard constraint – qwen3’s context window is 32k tokens, but running at the full 32k will max out your RAM.
Tutorial: A Multi-User AI Finance Advisor
Let’s build something real: a personal finance advisor that can hold stateful conversations with multiple concurrent users. Each user gets their own isolated conversation memory. The complete code lives in the companion repository at part1-short-term-memory. Here’s the walkthrough.
Project Structure
part1-short-term-memory/
├── pom.xml
└── src/
    ├── main/
    │   ├── java/me/johnra/tutorial/advisor/
    │   │   ├── AdvisorApplication.java
    │   │   ├── config/
    │   │   │   ├── AiConfig.java
    │   │   │   └── WebConfig.java
    │   │   ├── service/
    │   │   │   ├── FinanceAdvisor.java
    │   │   │   └── AdvisorService.java
    │   │   └── web/
    │   │       └── AdvisorController.java
    │   └── resources/
    │       └── application.yml
    └── test/
        └── java/me/johnra/tutorial/advisor/
            └── AdvisorIntegrationTest.java
Step 1: Dependencies
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <!-- LangChain4j core + Spring Boot integration -->
    <dependency>
        <groupId>dev.langchain4j</groupId>
        <artifactId>langchain4j-spring-boot-starter</artifactId>
        <version>${langchain4j.version}</version>
    </dependency>
    <!-- Ollama provider: no API key needed -->
    <dependency>
        <groupId>dev.langchain4j</groupId>
        <artifactId>langchain4j-ollama-spring-boot-starter</artifactId>
        <version>${langchain4j.version}</version>
    </dependency>
</dependencies>
Note what’s not here: no OpenAI dependency, no Spring AI dependency, no API key management. This runs entirely on localhost.
Step 2: Application Configuration
langchain4j:
  ollama:
    chat-model:
      base-url: http://localhost:11434
      model-name: qwen3
      temperature: 0.7
      timeout: PT120S # local models can be slow on first load
advisor:
  memory:
    max-messages: 20 # 10 user + 10 assistant turns
The PT120S timeout follows ISO-8601 duration format. Local models cold-start slowly – especially on first request when the model loads into memory. After the first call, subsequent responses are much faster.
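If the duration format is unfamiliar, `java.time.Duration` reads the same ISO-8601 notation, which makes it easy to check a value before putting it in the config. A minimal sketch (the class and method names are just for illustration):

```java
import java.time.Duration;

// Sketch: how an ISO-8601 duration string such as the timeout value is read.
class TimeoutFormatDemo {

    // "PT120S" reads as: P (period), T (time part), 120S (120 seconds).
    static Duration parseTimeout(String iso8601) {
        return Duration.parse(iso8601);
    }

    public static void main(String[] args) {
        Duration timeout = parseTimeout("PT120S");
        System.out.println(timeout.getSeconds());                 // prints: 120
        System.out.println(timeout.equals(parseTimeout("PT2M"))); // prints: true
    }
}
```

So `PT120S` and `PT2M` describe the same two-minute timeout; use whichever reads better in your config.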
Step 3: The AI Service Interface
This is where LangChain4j’s magic lives. The @MemoryId annotation tells the framework to maintain separate memory per unique ID. Without it, all users share one conversation history – a spectacular way to leak PII.
// FinanceAdvisor.java
import dev.langchain4j.service.MemoryId;
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;

public interface FinanceAdvisor {

    @SystemMessage("""
        You are a knowledgeable and empathetic personal finance advisor.
        Your role is to help users understand budgeting, investments,
        debt management, and financial planning.
        Guidelines:
        - Ask clarifying questions to understand the user's situation
        - Reference details the user has shared earlier in this conversation
        - Never provide specific stock picks or guaranteed returns
        - Keep advice grounded in widely accepted financial principles
        - Be concise and direct in your responses
        """)
    // sessionId selects which conversation memory is used for this call
    String advise(@MemoryId String sessionId, @UserMessage String userMessage);
}
Step 4: Wiring the Memory Provider
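// AiConfig.java
import dev.langchain4j.memory.chat.ChatMemoryProvider;
import dev.langchain4j.memory.chat.MessageWindowChatMemory;
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.service.AiServices;
import dev.langchain4j.store.memory.chat.InMemoryChatMemoryStore;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class AiConfig {

    @Value("${advisor.memory.max-messages}")
    private int maxMessages;

    // Shared backing store for every session's messages
    @Bean
    public InMemoryChatMemoryStore chatMemoryStore() {
        return new InMemoryChatMemoryStore();
    }

    // Factory: called once per unique @MemoryId value
    @Bean
    public ChatMemoryProvider chatMemoryProvider(InMemoryChatMemoryStore store) {
        return memoryId -> MessageWindowChatMemory.builder()
                .id(memoryId)
                .maxMessages(maxMessages)
                .chatMemoryStore(store)
                .build();
    }

    // Ties the model and the per-session memory to the FinanceAdvisor interface
    @Bean
    public FinanceAdvisor financeAdvisor(ChatLanguageModel model,
                                         ChatMemoryProvider memoryProvider) {
        return AiServices.builder(FinanceAdvisor.class)
                .chatLanguageModel(model)
                .chatMemoryProvider(memoryProvider)
                .build();
    }
}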
What’s happening here:
- InMemoryChatMemoryStore is a shared, thread-safe map: memoryId → List<ChatMessage>
- ChatMemoryProvider is a functional interface – a factory that LangChain4j calls the first time it sees a new @MemoryId value
- The ChatLanguageModel bean is auto-configured by the Ollama Spring Boot starter from your application.yml
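The store-plus-factory arrangement described above can be sketched in plain Java. This is an illustration of the idea rather than LangChain4j’s actual implementation: `PerSessionMemorySketch` is a hypothetical class, and plain strings stand in for ChatMessage objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: one isolated history per memory id, created lazily on first use.
class PerSessionMemorySketch {

    private final Map<String, List<String>> memories = new ConcurrentHashMap<>();
    private final AtomicInteger factoryCalls = new AtomicInteger();

    // Plays the role of the provider: runs only the first time an id is seen.
    private List<String> memoryFor(String memoryId) {
        return memories.computeIfAbsent(memoryId, id -> {
            factoryCalls.incrementAndGet();
            return Collections.synchronizedList(new ArrayList<>());
        });
    }

    // Appends a message to the history that belongs to this id only.
    void remember(String memoryId, String message) {
        memoryFor(memoryId).add(message);
    }

    // Returns a snapshot of one session's history; unknown ids yield an empty list.
    List<String> history(String memoryId) {
        List<String> messages = memories.get(memoryId);
        if (messages == null) {
            return List.of();
        }
        synchronized (messages) {
            return List.copyOf(messages);
        }
    }

    // Drops a session's history entirely.
    void forget(String memoryId) {
        memories.remove(memoryId);
    }

    int factoryCalls() {
        return factoryCalls.get();
    }
}
```

Two users writing under different ids never see each other’s messages, and the factory runs once per id no matter how many messages follow. That isolation is exactly what was missing in the goldfish demo.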
Step 5: The Service Layer
The service layer wraps the AI interaction and owns session lifecycle. Notice where we strip qwen3’s think tags:
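// AdvisorService.java
import dev.langchain4j.store.memory.chat.InMemoryChatMemoryStore;
import org.springframework.stereotype.Service;

import java.util.UUID;

@Service
public class AdvisorService {

    private final FinanceAdvisor advisor;
    private final InMemoryChatMemoryStore memoryStore;

    // Constructor injection: Spring supplies both beans defined in AiConfig
    public AdvisorService(FinanceAdvisor advisor, InMemoryChatMemoryStore memoryStore) {
        this.advisor = advisor;
        this.memoryStore = memoryStore;
    }

    // A session is just a random id; its memory is created on first use
    public String startSession() {
        return UUID.randomUUID().toString();
    }

    public String chat(String sessionId, String message) {
        String raw = advisor.advise(sessionId, message);
        return stripThinkTags(raw);
    }

    // Removes the stored history for this session
    public void clearSession(String sessionId) {
        memoryStore.deleteMessages(sessionId);
    }

    // qwen3 wraps reasoning in <think>...</think>; strip it for clean output
    private static String stripThinkTags(String response) {
        return response.replaceAll("(?s)<think>.*?</think>\\s*", "").trim();
    }
}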
Step 6: Watching It Work
Fire up the application (mvn spring-boot:run) and run this sequence:
# 1. Start a session
SESSION_ID=$(curl -s -X POST http://localhost:8080/api/advisor/sessions \
| jq -r '.sessionId')
echo "Session: $SESSION_ID"
# 2. Share some context
curl -s -X POST "http://localhost:8080/api/advisor/sessions/$SESSION_ID/messages" \
-H "Content-Type: application/json" \
-d '{"message": "Hi! I earn $95,000 a year and have $12,000 in credit card debt at 22% APR."}' \
| jq -r '.reply'
# 3. Follow-up — no re-introduction of context needed
curl -s -X POST "http://localhost:8080/api/advisor/sessions/$SESSION_ID/messages" \
-H "Content-Type: application/json" \
-d '{"message": "Should I focus on paying that off before investing in my 401k?"}' \
| jq -r '.reply'
# 4. A third turn that references earlier details
curl -s -X POST "http://localhost:8080/api/advisor/sessions/$SESSION_ID/messages" \
-H "Content-Type: application/json" \
-d '{"message": "What if I get a balance transfer card at 0% for 18 months?"}' \
| jq -r '.reply'
# 5. Clean up
curl -X DELETE "http://localhost:8080/api/advisor/sessions/$SESSION_ID"
On Turn 3, the advisor references “your $12,000 at 22% APR” and your income – because the conversation history is in the memory window. The goldfish problem is solved.
Token Window Memory: When Message Lengths Vary
MessageWindowChatMemory works well when turns are roughly uniform. But sometimes one turn is “yes” and the next is a 500-token explanation. TokenWindowChatMemory solves this:
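// Use this in place of the message-window provider from Step 4; two
// ChatMemoryProvider beans would make the injection into financeAdvisor ambiguous.
@Bean
public ChatMemoryProvider tokenAwareChatMemoryProvider(InMemoryChatMemoryStore store) {
    // Simple approximation: 1 token ≈ 4 characters.
    // Tokenizer declares several estimate* methods, so it cannot be written as a
    // one-line lambda. ApproximateTokenizer stands for a small hand-written
    // implementation (not shown) whose text estimate is text.length() / 4.
    // Use a proper tokenizer if you need precision.
    Tokenizer approximateTokenizer = new ApproximateTokenizer();
    return memoryId -> TokenWindowChatMemory.builder()
            .id(memoryId)
            .maxTokens(2048, approximateTokenizer)
            .chatMemoryStore(store)
            .build();
}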
Note on local model context: qwen3 supports up to 32k token context, but that means ~32k tokens stay in RAM while the model processes. For conversational use, 2k–4k for history is a pragmatic limit that leaves plenty of headroom for the response.
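To make the token-budget idea concrete, here is a framework-free sketch of the eviction logic. It is not TokenWindowChatMemory itself: the class name is hypothetical, strings stand in for chat messages, and the estimate rounds up (so a short message such as “yes” costs 1 token rather than 0), which is a small departure from the plain division by 4 used above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Sketch: keep the newest messages that fit inside a token budget.
class TokenBudgetWindowSketch {

    private final int maxTokens;
    private final Deque<String> messages = new ArrayDeque<>();
    private int usedTokens = 0;

    TokenBudgetWindowSketch(int maxTokens) {
        this.maxTokens = maxTokens;
    }

    // Rough heuristic: 1 token ≈ 4 characters, rounded up.
    static int estimateTokens(String text) {
        return (text.length() + 3) / 4;
    }

    // Appends a message, then evicts the oldest ones until the budget is met.
    // The newest message is always kept, even if it alone exceeds the budget.
    void add(String message) {
        messages.addLast(message);
        usedTokens += estimateTokens(message);
        while (usedTokens > maxTokens && messages.size() > 1) {
            usedTokens -= estimateTokens(messages.removeFirst());
        }
    }

    // Returns the current window, oldest first.
    List<String> messages() {
        return List.copyOf(messages);
    }

    int usedTokens() {
        return usedTokens;
    }
}
```

Unlike the message-count window, one long message can push out several short ones here, which is the point when turn lengths vary widely.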
Production Considerations
Scaling Across Instances
InMemoryChatMemoryStore dies with the JVM. For production with horizontal scaling, back the memory store with Redis:
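// RedisChatMemoryStore here is the custom class from the companion code
@Bean
public ChatMemoryStore redisChatMemoryStore(RedisTemplate<String, String> redis) {
    // Second argument: TTL applied to each session's stored messages
    return new RedisChatMemoryStore(redis, Duration.ofHours(4));
}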
The Redis implementation needs to serialize/deserialize List<ChatMessage> to JSON and set a TTL. The part1-short-term-memory companion code includes a working RedisChatMemoryStore implementation.
Memory Eviction and TTL
Always set a TTL aligned to your session timeout. For customer support, 30 minutes of inactivity is a good default. Without TTL, sessions accumulate indefinitely.
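With Redis, the TTL is handled by the server. For an in-process store, inactivity-based expiry could look roughly like the sketch below. All names are hypothetical, strings stand in for chat messages, and the time source is injected so the behaviour can be checked without waiting 30 minutes.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Sketch: sessions expire after a period with no new messages.
class ExpiringSessionStoreSketch {

    private static final class Session {
        final List<String> messages = new ArrayList<>();
        Instant lastWrite;
    }

    private final Map<String, Session> sessions = new HashMap<>();
    private final Duration ttl;
    private final Supplier<Instant> now;

    ExpiringSessionStoreSketch(Duration ttl, Supplier<Instant> now) {
        this.ttl = ttl;
        this.now = now;
    }

    // Appends a message and resets the inactivity timer for this session.
    // An expired session is replaced by a fresh, empty one first.
    synchronized void append(String sessionId, String message) {
        Instant t = now.get();
        Session s = sessions.get(sessionId);
        if (s == null || isExpired(s, t)) {
            s = new Session();
            sessions.put(sessionId, s);
        }
        s.messages.add(message);
        s.lastWrite = t;
    }

    // Returns the session's messages, or an empty list if missing or expired.
    // Reading does not reset the timer; only writes count as activity here.
    synchronized List<String> messages(String sessionId) {
        Session s = sessions.get(sessionId);
        if (s == null || isExpired(s, now.get())) {
            return List.of();
        }
        return List.copyOf(s.messages);
    }

    // Removes all expired sessions and returns how many were dropped.
    // Meant to be called periodically, for example from a scheduled task.
    synchronized int evictExpired() {
        Instant t = now.get();
        int before = sessions.size();
        sessions.values().removeIf(s -> isExpired(s, t));
        return before - sessions.size();
    }

    private boolean isExpired(Session s, Instant t) {
        return Duration.between(s.lastWrite, t).compareTo(ttl) > 0;
    }
}
```

In production code the time source would simply be `Instant::now`. The periodic `evictExpired()` sweep is what stops abandoned sessions from accumulating.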
What’s missing?
What we don’t have is any memory of the user across sessions. Come back tomorrow and the advisor asks your name again like you’ve never met. That’s the problem Part 2 solves.
In Part 2, we add persistent, semantic memory – the kind that lets your advisor say: “Welcome back. Last time we talked about your student loans and whether refinancing made sense. Have you looked into that?” We’ll use vector embeddings, semantic search via nomic-embed-text, and a persistent store to build memory that survives restarts, scales horizontally, and retrieves the most relevant past context rather than just the most recent.
