Part 1: Don’t Repeat Yourself – Teach It to Remember

Part 1 of 2 in the “AI Agent Memory” series. Short-Term Memory with Spring Boot + LangChain4j.

The Demo that wasn’t

We got back from our offsite pretty excited. We had just walked through a POC showing how metadata actually earns its keep in an AI-first world. With a solid semantic layer, our analysts could spend more time cooking and less time wandering around the data pantry asking, “where did we put the salt again?” The demo was slick. Agents could find the right data, make suggestions, even tap into technical metadata to build and run queries for users. It felt like we cracked something.

Naturally, we fast-tracked it into our test environment so the product team could start showing it off. And then… the call came in from my product owner. Turns out, with multiple users, the system had the memory of a goldfish. Each interaction politely erased the previous one. Users were basically stepping on each other’s conversations like it was a group chat gone wrong. Nothing like a demo working perfectly… until more than one person uses it.

Good reminder: getting a POC to work is the easy part. Getting it to behave in the real world – hold context, handle multiple users – that’s where the real engineering starts.

The solution is easy: add memory. This is a 2-part series on how we tackled memory: the difference between short-term and long-term memory, when you actually need each, and what it takes to build them properly (so your system remembers more than a goldfish).

Why Local?

At work, I use OpenAI. I’ve gravitated toward local models for tutorials for a few reasons. First, it removes the “I don’t have an API key” barrier for readers. Second, running locally forces you to think about latency and resource constraints – things that matter enormously in production. Third, Ollama has gotten remarkably good, and qwen3 is a genuinely capable model for this kind of work.

The patterns in this series work identically with any LLM provider. Swapping Ollama for OpenAI or Anthropic is a trivial change. So learn the patterns here, then deploy against whatever your organization uses in production. The assumption here is that you have a GPU that Ollama can use.

Prerequisites: Getting Ollama Running

At some point, these prerequisites will disappear. I should probably follow the DRY principle on my own blog. If you’ve been following along, you’ve seen this step more times than you signed up for; it’s starting to feel like déjà vu.

Before we write a line of code, you need Ollama installed and the models pulled:

# Install Ollama (Linux; on macOS, use the installer from the Ollama site)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull the chat model
ollama pull qwen3

# Verify it works
ollama run qwen3 "Say hello in one sentence."

Ollama runs as a local HTTP server on http://localhost:11434. LangChain4j talks to it over HTTP, the same way it would talk to a remote provider.

Resource note: qwen3 (the default tag is the 8B-parameter model) needs about 6GB of VRAM/RAM. If you’re memory-constrained, qwen3:0.6b or qwen3:1.7b run on almost anything. I run the latter on my Nvidia GTX 1650 with 4GB of VRAM.

A Quick Note on qwen3’s Thinking Mode

qwen3 is a reasoning model. By default it wraps its internal reasoning in <think>...</think> blocks before the final response. Most of the time you’ll want to strip those. We’ll handle this in our service layer with a one-liner:

private static String clean(String response) {
    return response.replaceAll("(?s)<think>.*?</think>\\s*", "").trim();
}
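
To see what that regex actually does, here is a small standalone sketch. The class name ThinkTagDemo and the sample strings are made up for illustration; the pattern is the same one as above. The (?s) flag lets the dot match newlines, and the reluctant .*? stops at the first closing tag instead of swallowing everything up to the last one.

```java
class ThinkTagDemo {

    // Same pattern as clean(): (?s) lets '.' match newlines,
    // '.*?' is reluctant so it stops at the first </think>
    static String clean(String response) {
        return response.replaceAll("(?s)<think>.*?</think>\\s*", "").trim();
    }

    public static void main(String[] args) {
        String raw = "<think>\nThe user wants a greeting.\n</think>\n\nHello there!";
        System.out.println(clean(raw));            // prints: Hello there!
        System.out.println(clean("No tags here")); // prints: No tags here
    }
}
```

Responses without a think block pass through unchanged, so the helper should be safe to apply to every reply.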

What Is Short-Term Memory in an AI Agent?

Think of short-term memory as the agent’s working notepad: its scratchpad for whatever’s going on in the current conversation. Basically, it’s the digital equivalent of scribbling things down on a sticky note (for anyone who still remembers what paper notepads look like). It’s the difference between:

The smart version

User: My order number is 98234.

Agent: Got it! What can I help you with regarding order 98234?

The goldfish version (same user, one turn later)

User: What’s the status of my order?

Agent: Sure! Could you please provide your order number?

Every LLM, at its core, is stateless. It processes a prompt and returns a completion – no memory of anything before that prompt. Short-term memory is the engineering layer we build around the model to simulate continuity. We do this by injecting conversation history back into each request.

[System Prompt]
[Previous Turn 1: User]
[Previous Turn 1: Assistant]
[Previous Turn 2: User]
[Previous Turn 2: Assistant]
...
[Current User Message]

The model reads the whole thread every time. From its perspective, it has “always known” what was said earlier. We’re cheating, in the best possible way.
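
The layout above can be sketched in a few lines of plain Java. This is not how LangChain4j does it internally; the Message record and the buildRequest method are hypothetical names I’m using to show the idea: every request is rebuilt from the system prompt, the stored turns, and the new message.

```java
import java.util.ArrayList;
import java.util.List;

class PromptAssembly {

    // Hypothetical message type: role is "system", "user", or "assistant"
    record Message(String role, String text) {}

    // Rebuild the full request: system prompt, then prior turns, then the new message
    static List<Message> buildRequest(String systemPrompt, List<Message> history, String userMessage) {
        List<Message> request = new ArrayList<>();
        request.add(new Message("system", systemPrompt));
        request.addAll(history);
        request.add(new Message("user", userMessage));
        return request;
    }

    public static void main(String[] args) {
        List<Message> history = List.of(
                new Message("user", "My order number is 98234."),
                new Message("assistant", "Got it! What can I help you with regarding order 98234?"));

        List<Message> request = buildRequest("You are a support agent.", history,
                "What's the status of my order?");

        // The model sees all four messages, so it can answer without asking for the number again
        request.forEach(m -> System.out.println("[" + m.role() + "] " + m.text()));
    }
}
```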

Short-Term vs. Long-Term Memory: When to Use Which

Before we write a single line of code, let’s be precise about when short-term memory is the right tool:

Scenario	Short-Term	Long-Term
Multi-turn conversation in one session	✅	❌
Remembering user said “I prefer Python” last week	❌	✅
A customer support chat resolving one ticket	✅	❌
A personal assistant that knows your name and goals	❌	✅
A coding assistant mid-refactor	✅	❌
Storing summaries of past user interactions	❌	✅

Use short-term memory when:

Context is session-scoped and ephemeral
The interaction is a focused, bounded task
You don’t want the overhead of embedding calls on every turn
Privacy is a concern (in-session data, not persisted)

Skip short-term memory (or truncate aggressively) when:

Conversations get very long (even with a local model, every extra token of history costs context length and latency)
The task is single-turn by nature
You’re running batch processing pipelines

Long-term memory will be covered in Part 2. The punchline: most production agents need both, layered together.

LangChain4j’s Memory Model

LangChain4j gives us two concrete strategies for short-term memory, both implementing the ChatMemory interface:

1. `MessageWindowChatMemory`

Keeps the last N messages (user + assistant turns). Simple, predictable, and the most common choice. When the window is full, the oldest messages are evicted.

Window size: 6 messages
[Msg 1 - User]   ← evicted when Msg 7 arrives
[Msg 2 - AI]     ← evicted when Msg 8 arrives
[Msg 3 - User]
[Msg 4 - AI]
[Msg 5 - User]
[Msg 6 - AI]

When to use: Default choice. Customer support, conversational Q&A, interactive agents where conversation depth is predictable.
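
If the eviction diagram feels abstract, here is the same idea as a plain-Java sketch. It is a simplification under my own assumptions, not the library’s code: SlidingWindowSketch is a hypothetical class, messages are plain strings, and it ignores the special handling the real class gives to system messages.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class SlidingWindowSketch {

    private final int maxMessages;
    private final Deque<String> messages = new ArrayDeque<>();

    SlidingWindowSketch(int maxMessages) {
        this.maxMessages = maxMessages;
    }

    // Append a message, then evict from the front until the window fits
    void add(String message) {
        messages.addLast(message);
        while (messages.size() > maxMessages) {
            messages.removeFirst();
        }
    }

    // Oldest-to-newest snapshot of what would be sent to the model
    List<String> messages() {
        return List.copyOf(messages);
    }

    public static void main(String[] args) {
        SlidingWindowSketch window = new SlidingWindowSketch(6);
        for (int i = 1; i <= 8; i++) {
            window.add("Msg " + i);
        }
        System.out.println(window.messages()); // prints: [Msg 3, Msg 4, Msg 5, Msg 6, Msg 7, Msg 8]
    }
}
```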

2. `TokenWindowChatMemory`

Keeps messages up to a token budget. More accurate for controlling cost and staying within model context limits. Requires a Tokenizer.

When to use: When message length is highly variable (some turns are one word, others are paragraphs) or when working with models that have tight context windows. With local models, context window size is a hard constraint – qwen3’s context window is 32k tokens, but running at 32k will max out your RAM.
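
Here is a rough sketch of budget-based eviction in plain Java. Everything in it is an assumption for illustration: TokenBudgetSketch is a hypothetical class, the “1 token ≈ 4 characters” estimate is a crude stand-in for a real tokenizer, and keeping the newest message even when it alone exceeds the budget is my simplification rather than documented library behavior.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class TokenBudgetSketch {

    private final int maxTokens;
    private final Deque<String> messages = new ArrayDeque<>();

    TokenBudgetSketch(int maxTokens) {
        this.maxTokens = maxTokens;
    }

    // Rough assumption: 1 token ≈ 4 characters, rounded up. Not a real tokenizer.
    static int estimateTokens(String text) {
        return (text.length() + 3) / 4;
    }

    int totalTokens() {
        return messages.stream().mapToInt(TokenBudgetSketch::estimateTokens).sum();
    }

    // Append, then evict the oldest messages until the estimated total fits the budget
    void add(String message) {
        messages.addLast(message);
        while (messages.size() > 1 && totalTokens() > maxTokens) {
            messages.removeFirst();
        }
    }

    List<String> messages() {
        return List.copyOf(messages);
    }

    public static void main(String[] args) {
        TokenBudgetSketch memory = new TokenBudgetSketch(10);
        memory.add("yes");            // ~1 token
        memory.add("x".repeat(32));   // ~8 tokens, total 9
        memory.add("ok then");        // ~2 tokens, total 11: over budget, so "yes" is evicted
        System.out.println(memory.messages().size()); // prints: 2
        System.out.println(memory.totalTokens());     // prints: 10
    }
}
```

Unlike the fixed-count window, one long message here can push out several short ones, which is the behavior you want when turn lengths vary a lot.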

Tutorial: A Multi-User AI Finance Advisor

Let’s build something real: a personal finance advisor that can hold stateful conversations with multiple concurrent users. Each user gets their own isolated conversation memory. The complete code lives in the companion repository at part1-short-term-memory. Here’s the walkthrough.

Project Structure

part1-short-term-memory/
├── pom.xml
└── src/
    ├── main/
    │   ├── java/me/johnra/tutorial/advisor/
    │   │   ├── AdvisorApplication.java
    │   │   ├── config/
    │   │   │   ├── AiConfig.java
    │   │   │   └── WebConfig.java
    │   │   ├── service/
    │   │   │   ├── FinanceAdvisor.java
    │   │   │   └── AdvisorService.java
    │   │   └── web/
    │   │       └── AdvisorController.java
    │   └── resources/
    │       └── application.yml
    └── test/
        └── java/me/johnra/tutorial/advisor/
            └── AdvisorIntegrationTest.java

Step 1: Dependencies

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>

    <!-- LangChain4j core + Spring Boot integration -->
    <dependency>
        <groupId>dev.langchain4j</groupId>
        <artifactId>langchain4j-spring-boot-starter</artifactId>
        <version>${langchain4j.version}</version>
    </dependency>

    <!-- Ollama provider — no API key needed -->
    <dependency>
        <groupId>dev.langchain4j</groupId>
        <artifactId>langchain4j-ollama-spring-boot-starter</artifactId>
        <version>${langchain4j.version}</version>
    </dependency>
</dependencies>

Note what’s not here: no OpenAI dependency, no Spring AI dependency, no API key management. This runs entirely on localhost.

Step 2: Application Configuration

langchain4j:
  ollama:
    chat-model:
      base-url: http://localhost:11434
      model-name: qwen3
      temperature: 0.7
      timeout: PT120S      # local models can be slow on first load

advisor:
  memory:
    max-messages: 20       # 10 user + 10 assistant turns

The PT120S timeout follows ISO-8601 duration format. Local models cold-start slowly – especially on first request when the model loads into memory. After the first call, subsequent responses are much faster.
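
If the ISO-8601 notation is unfamiliar, java.time.Duration parses the same format, so you can check a value quickly. DurationDemo and toSeconds are hypothetical names for this sketch.

```java
import java.time.Duration;

class DurationDemo {

    // Parse an ISO-8601 duration string and return its length in seconds
    static long toSeconds(String iso) {
        return Duration.parse(iso).getSeconds();
    }

    public static void main(String[] args) {
        // P = period designator, T = start of the time part, 120S = 120 seconds
        System.out.println(toSeconds("PT120S")); // prints: 120
        System.out.println(Duration.parse("PT120S").equals(Duration.ofMinutes(2))); // prints: true
    }
}
```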

Step 3: The AI Service Interface

This is where LangChain4j’s magic lives. The @MemoryId annotation tells the framework to maintain separate memory per unique ID. Without it, all users share one conversation history – a spectacular way to leak PII.

// FinanceAdvisor.java
import dev.langchain4j.service.MemoryId;
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;

public interface FinanceAdvisor {

    @SystemMessage("""
        You are a knowledgeable and empathetic personal finance advisor.
        Your role is to help users understand budgeting, investments,
        debt management, and financial planning.

        Guidelines:
        - Ask clarifying questions to understand the user's situation
        - Reference details the user has shared earlier in this conversation
        - Never provide specific stock picks or guaranteed returns
        - Keep advice grounded in widely accepted financial principles
        - Be concise and direct in your responses
        """)
    // sessionId selects which conversation memory this call reads and writes
    String advise(@MemoryId String sessionId, @UserMessage String userMessage);
}

Step 4: Wiring the Memory Provider

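// AiConfig.java
import dev.langchain4j.memory.chat.ChatMemoryProvider;
import dev.langchain4j.memory.chat.MessageWindowChatMemory;
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.service.AiServices;
import dev.langchain4j.store.memory.chat.InMemoryChatMemoryStore;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class AiConfig {

    @Value("${advisor.memory.max-messages}")
    private int maxMessages;

    // One shared store holds the messages for every session, keyed by memory ID
    @Bean
    public InMemoryChatMemoryStore chatMemoryStore() {
        return new InMemoryChatMemoryStore();
    }

    // Factory: called once per unique @MemoryId value
    @Bean
    public ChatMemoryProvider chatMemoryProvider(InMemoryChatMemoryStore store) {
        return memoryId -> MessageWindowChatMemory.builder()
                .id(memoryId)
                .maxMessages(maxMessages)
                .chatMemoryStore(store)
                .build();
    }

    @Bean
    public FinanceAdvisor financeAdvisor(ChatLanguageModel model,
                                         ChatMemoryProvider memoryProvider) {
        return AiServices.builder(FinanceAdvisor.class)
                .chatLanguageModel(model)
                .chatMemoryProvider(memoryProvider)
                .build();
    }
}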

What’s happening here:

InMemoryChatMemoryStore is a shared, thread-safe map: memoryId → List<ChatMessage>
ChatMemoryProvider is a functional interface – a factory that LangChain4j calls the first time it sees a new @MemoryId value
The ChatLanguageModel bean is auto-configured by the Ollama Spring Boot starter from your application.yml
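
The “factory called the first time an ID shows up” idea is easy to see in isolation. The sketch below is plain Java with made-up names (PerSessionMemorySketch, memoryFor) and plain strings for messages; it illustrates the pattern, not LangChain4j’s internals.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class PerSessionMemorySketch {

    private final Map<String, List<String>> memories = new ConcurrentHashMap<>();
    private final Function<String, List<String>> provider;

    PerSessionMemorySketch(Function<String, List<String>> provider) {
        this.provider = provider;
    }

    // The factory runs only the first time an ID is seen; later calls reuse the same memory
    List<String> memoryFor(String memoryId) {
        return memories.computeIfAbsent(memoryId, provider);
    }

    public static void main(String[] args) {
        AtomicInteger factoryCalls = new AtomicInteger();
        PerSessionMemorySketch sketch = new PerSessionMemorySketch(id -> {
            factoryCalls.incrementAndGet();
            return new CopyOnWriteArrayList<>();
        });

        sketch.memoryFor("alice").add("I earn $95,000 a year.");
        sketch.memoryFor("bob").add("I have student loans.");
        sketch.memoryFor("alice").add("Should I invest?");

        System.out.println(sketch.memoryFor("alice").size()); // prints: 2
        System.out.println(sketch.memoryFor("bob").size());   // prints: 1
        System.out.println(factoryCalls.get());               // prints: 2
    }
}
```

Alice and Bob never see each other’s messages because the ID is the only way into the map, which is exactly the isolation the goldfish demo was missing.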

Step 5: The Service Layer

The service layer wraps the AI interaction and owns session lifecycle. Notice where we strip qwen3’s think tags:

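// AdvisorService.java
import dev.langchain4j.store.memory.chat.InMemoryChatMemoryStore;
import org.springframework.stereotype.Service;

import java.util.UUID;

@Service
public class AdvisorService {

    private final FinanceAdvisor advisor;
    private final InMemoryChatMemoryStore memoryStore;

    // Constructor injection: Spring supplies both beans defined in AiConfig
    public AdvisorService(FinanceAdvisor advisor, InMemoryChatMemoryStore memoryStore) {
        this.advisor = advisor;
        this.memoryStore = memoryStore;
    }

    // A random UUID doubles as the session ID and the @MemoryId value
    public String startSession() {
        return UUID.randomUUID().toString();
    }

    public String chat(String sessionId, String message) {
        String raw = advisor.advise(sessionId, message);
        return stripThinkTags(raw);
    }

    public void clearSession(String sessionId) {
        memoryStore.deleteMessages(sessionId);
    }

    // qwen3 wraps reasoning in <think>...</think>; strip it for clean output
    private static String stripThinkTags(String response) {
        return response.replaceAll("(?s)<think>.*?</think>\\s*", "").trim();
    }
}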

Step 6: Watching It Work

Fire up the application (mvn spring-boot:run) and run this sequence:

# 1. Start a session
SESSION_ID=$(curl -s -X POST http://localhost:8080/api/advisor/sessions \
  | jq -r '.sessionId')
echo "Session: $SESSION_ID"

# 2. Share some context
curl -s -X POST "http://localhost:8080/api/advisor/sessions/$SESSION_ID/messages" \
  -H "Content-Type: application/json" \
  -d '{"message": "Hi! I earn $95,000 a year and have $12,000 in credit card debt at 22% APR."}' \
  | jq -r '.reply'

# 3. Follow-up — no re-introduction of context needed
curl -s -X POST "http://localhost:8080/api/advisor/sessions/$SESSION_ID/messages" \
  -H "Content-Type: application/json" \
  -d '{"message": "Should I focus on paying that off before investing in my 401k?"}' \
  | jq -r '.reply'

# 4. A third turn that references earlier details
curl -s -X POST "http://localhost:8080/api/advisor/sessions/$SESSION_ID/messages" \
  -H "Content-Type: application/json" \
  -d '{"message": "What if I get a balance transfer card at 0% for 18 months?"}' \
  | jq -r '.reply'

# 5. Clean up
curl -X DELETE "http://localhost:8080/api/advisor/sessions/$SESSION_ID"

On Turn 3, the advisor references “your $12,000 at 22% APR” and your income – because the conversation history is in the memory window. The goldfish problem is solved.

Token Window Memory: When Message Lengths Vary

MessageWindowChatMemory works well when turns are roughly uniform. But sometimes one turn is “yes” and the next is a 500-token explanation. TokenWindowChatMemory solves this:

// Use this in place of the MessageWindowChatMemory provider from Step 4
@Bean
public ChatMemoryProvider tokenAwareChatMemoryProvider(InMemoryChatMemoryStore store,
                                                       Tokenizer tokenizer) {
    // Tokenizer declares several estimate methods, so it cannot be written as a lambda.
    // Supply an implementation as a bean. A rough one can approximate
    // 1 token ≈ 4 characters; use a proper tokenizer if you need precision.
    return memoryId -> TokenWindowChatMemory.builder()
            .id(memoryId)
            .maxTokens(2048, tokenizer)
            .chatMemoryStore(store)
            .build();
}

Note on local model context: qwen3 supports up to 32k token context, but that means ~32k tokens stay in RAM while the model processes. For conversational use, 2k–4k for history is a pragmatic limit that leaves plenty of headroom for the response.

Production Considerations

Scaling Across Instances

InMemoryChatMemoryStore dies with the JVM. For production with horizontal scaling, back the memory store with Redis:

@Bean
public ChatMemoryStore redisChatMemoryStore(RedisTemplate<String, String> redis) {
    return new RedisChatMemoryStore(redis, Duration.ofHours(4));
}

The Redis implementation needs to serialize/deserialize List<ChatMessage> to JSON and set a TTL. The part1-short-term-memory companion code includes a working RedisChatMemoryStore implementation.

Memory Eviction and TTL

Always set a TTL aligned to your session timeout. For customer support, 30 minutes of inactivity is a good default. Without TTL, sessions accumulate indefinitely.
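
For the in-memory store, nothing expires on its own, so you would need to track inactivity yourself. Below is a minimal sketch of idle-based eviction in plain Java. SessionTtlSketch, touch, and evictExpired are hypothetical names; timestamps are passed in explicitly so the logic is easy to test. In a real service you would call something like this from a scheduled job and also delete the evicted session’s messages from the memory store. With Redis, the key TTL does this work for you.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SessionTtlSketch {

    private final Duration ttl;
    private final Map<String, Instant> lastAccess = new ConcurrentHashMap<>();

    SessionTtlSketch(Duration ttl) {
        this.ttl = ttl;
    }

    // Record activity; call this on every chat message
    void touch(String sessionId, Instant now) {
        lastAccess.put(sessionId, now);
    }

    // Remove sessions idle longer than the TTL; returns how many were evicted
    // (the count is approximate if other threads touch sessions at the same time)
    int evictExpired(Instant now) {
        int before = lastAccess.size();
        lastAccess.entrySet().removeIf(e -> Duration.between(e.getValue(), now).compareTo(ttl) > 0);
        return before - lastAccess.size();
    }

    boolean isActive(String sessionId) {
        return lastAccess.containsKey(sessionId);
    }

    public static void main(String[] args) {
        SessionTtlSketch sessions = new SessionTtlSketch(Duration.ofMinutes(30));
        Instant start = Instant.parse("2024-01-01T09:00:00Z");

        sessions.touch("alice", start);
        sessions.touch("bob", start.plus(Duration.ofMinutes(20)));

        // At 09:31 alice has been idle 31 minutes (evicted), bob only 11 (kept)
        int evicted = sessions.evictExpired(start.plus(Duration.ofMinutes(31)));
        System.out.println(evicted);                    // prints: 1
        System.out.println(sessions.isActive("alice")); // prints: false
        System.out.println(sessions.isActive("bob"));   // prints: true
    }
}
```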

What’s missing?

What we don’t have is any memory of the user across sessions. Come back tomorrow and the advisor asks your name again like you’ve never met. That’s the problem Part 2 solves.

In Part 2, we add persistent, semantic memory – the kind that lets your advisor say: “Welcome back. Last time we talked about your student loans and whether refinancing made sense. Have you looked into that?” We’ll use vector embeddings, semantic search via nomic-embed-text, and a persistent store to build memory that survives restarts, scales horizontally, and retrieves the most relevant past context rather than just the most recent.