Skip to content

temporal-spring-ai: add side-effect replay tests for chat, activity tools, and @SideEffectTool#2856

Merged
donald-pinckney merged 10 commits intomasterfrom
spring-ai/sideeffect-replay-tests
Apr 24, 2026
Merged

temporal-spring-ai: add side-effect replay tests for chat, activity tools, and @SideEffectTool#2856
donald-pinckney merged 10 commits intomasterfrom
spring-ai/sideeffect-replay-tests

Conversation

@donald-pinckney
Copy link
Copy Markdown
Contributor

What was changed

Three new tests under src/test/.../replay/:

  • ChatModelSideEffectTest — register a ChatModel with an AtomicInteger counter. Run a workflow that makes one chat call, assert counter=1. Replay the captured history, assert counter still 1 — the activity result comes from history, not from re-invoking the ChatModel.
  • ActivityToolSideEffectTest — activity-backed @Tool whose impl increments a counter. A tool-calling stub ChatModel asks for the tool on the first call and returns final text on the second. Same shape: counter=1 after run, counter=1 after replay.
  • SideEffectToolReplayTest@SideEffectTool body increments a file-scope counter. The workflow drives one tool call. Assertion proves that Workflow.sideEffect's marker is consulted on replay rather than re-invoking the @Tool method.

MCP is intentionally omitted for this PR. spring-ai-mcp is compileOnly, and adding it just for one replay test isn't worth the dep weight — MCP tool calls go through the same Temporal activity machinery as ChatModel, which ChatModelSideEffectTest already exercises. The plan leaves this as an acceptable substitution.

Why?

The existing WorkflowDeterminismTest catches history-vs-command mismatches but does not prove the plugin isn't re-invoking user side effects on replay. Temporal's AI partner review standards require side-effect safety tests, and a regression — someone adding an in-workflow cache or dropping the Workflow.sideEffect wrap — would be easy to slip in without these.

I verified the SideEffectToolReplayTest actually catches a real regression: temporarily dropping the Workflow.sideEffect(...) wrap in SideEffectToolCallback.call(...) produces expected: <1> but was: <2> exactly as intended. Restored before this commit.

donald-pinckney and others added 3 commits April 21, 2026 15:54
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ools, and @SideEffectTool

Three new tests under src/test/.../replay/:

- ChatModelSideEffectTest: register a ChatModel with an AtomicInteger
  counter. Run a workflow that makes one chat call, assert counter=1.
  Replay the captured history, assert counter still 1 — the activity
  result comes from history, not from re-invoking the ChatModel.
- ActivityToolSideEffectTest: activity-backed @tool whose impl
  increments a counter. ToolCallingStubChatModel asks for the tool on
  the first call and returns final text on the second. Same assertion
  shape: counter=1 after run, counter=1 after replay.
- SideEffectToolReplayTest: @SideEffectTool body increments a counter
  via a file-scope static. Workflow drives a tool call through
  ToolCallingStubChatModel. The assertion proves that
  Workflow.sideEffect's marker is what's consulted on replay rather
  than re-invoking the @tool method.

MCP is intentionally omitted — spring-ai-mcp is compileOnly and adding
it just for one test isn't worth the dep weight. MCP tool calls go
through the same Temporal activity machinery as ChatModel, which
ChatModelSideEffectTest already covers.

I verified the SideEffectToolReplayTest catches a real regression by
temporarily dropping the Workflow.sideEffect wrap in
SideEffectToolCallback; the test correctly failed with
`expected: <1> but was: <2>`. Restored before this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Planning scratchpad — not part of the shipped artifact. Removed before merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@donald-pinckney donald-pinckney marked this pull request as ready for review April 22, 2026 19:52
@donald-pinckney donald-pinckney requested a review from a team as a code owner April 22, 2026 19:52
assertEquals(
1,
model.callCount.get(),
"ChatModel must not be re-invoked during replay — activity results come from history");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like you're testing Temporal itself rather than the plugin's code.

I understand that's not really your intent, but the line is thin, and future readers of this code (both humans and agents) might easily misunderstand this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agreed. Assertion comments have been re-worked to make it more clear.

ChatWorkflow.class, WorkflowOptions.newBuilder().setTaskQueue(TASK_QUEUE).build());
assertEquals("pong", workflow.chat("ping"));
assertEquals(
1, model.callCount.get(), "ChatModel should be called once during the initial run");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test methodology isn't exact, which may result in flakes. It is actually possible that the activity function might end up being executed more than once, due to retry, and so the counter might be higher than one here.

A scheduled activity with maxAttempts > 1 will run exactly once to completion as observed by the Temporal server, but the activity itself has no way to know whether the server will have effectively observed the task completion.

Instead of this atomic counter, you can use a concurrency-safe Set that records the activity ID of all activity tasks that were executed. The size of that set is what you are looking for.

Alternatively, you could use a very different test approach: look at the workflow history and filter on ActivityTaskScheduled events. The expectation is that there should be one ActivityTaskScheduled event for this specific activity type.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats a very good point. I'll fix this, but I'll also work on updating the docs, since they are really vague about the right way to test this:

Check for duplicate side effects or other types of failures.

https://docs.temporal.io/develop/plugins-guide#side-effects-tests

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, ChatModelSideEffectTest and ActivityToolSideEffectTest have been changed to inspecting workflow history.

SideEffectToolReplayTest has been left as-is, I think its fine for Workflow.sideEffect.

@donald-pinckney
Copy link
Copy Markdown
Contributor Author

This PR definitely needs more work on the tests. Will re-work it some.

donald-pinckney and others added 5 commits April 23, 2026 16:08
Configure TestWorkflowEnvironment with WorkflowCacheSize(0) so the
worker replays from history on every workflow task instead of
resuming from in-memory cached state. That is the regime in which
side-effect safety actually has to hold: a missing Workflow.sideEffect
wrap or an un-guarded in-workflow mutation would run on each replay,
which cached resumes mask.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer pointed out that "ChatModel must not be re-invoked during replay"
reads as testing Temporal's activity replay guarantee rather than the
plugin's behavior. Reword class-level javadoc and assertion messages so
the property under test is explicit: the plugin places ChatModel calls,
activity-stub tool calls, and @SideEffectTool bodies behind the correct
Temporal boundary. Temporal's replay/memoization semantics are assumed
correct. The counter-stays-at-1 observation is the signal we use to
detect a plugin regression, not the thing being tested.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…unters

Reviewer noted that counting invocations of the backing impl can
conflate two signals: the plugin inlining a call (what we want to
catch) vs. Temporal re-delivering an activity task to the worker
(which can legitimately happen with maxAttempts > 1, producing >1
impl executions for a single scheduled activity).

Switch ChatModelSideEffectTest and ActivityToolSideEffectTest to
scan workflow history for ActivityTaskScheduled events of the
expected activity type. That count is invariant under activity-task
redelivery and speaks directly to the plugin property under test
("the plugin scheduled the call as an activity"). Drops the impl
counters and the WorkflowReplayer pass — the former is no longer
needed, the latter is redundant with WorkflowDeterminismTest.

SideEffectToolReplayTest intentionally keeps its counter: the
Workflow.sideEffect path runs in workflow code under deterministic
replay semantics, not as an activity, so the retry-induced flake
mode doesn't apply.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@donald-pinckney
Copy link
Copy Markdown
Contributor Author

@mjameswh PR has been improved, and I also opened a PR on docs to explain more about how to write these types of tests: temporalio/documentation#4481

@donald-pinckney donald-pinckney merged commit 9eae4a8 into master Apr 24, 2026
18 checks passed
@donald-pinckney donald-pinckney deleted the spring-ai/sideeffect-replay-tests branch April 24, 2026 19:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants