Skip to content

[pull] main from danny-avila:main#68

Merged
pull[bot] merged 2 commits intoinnFactory:mainfrom
danny-avila:main
Apr 26, 2026
Merged

[pull] main from danny-avila:main#68
pull[bot] merged 2 commits intoinnFactory:mainfrom
danny-avila:main

Conversation

@pull
Copy link
Copy Markdown

@pull pull Bot commented Apr 26, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

* feat: add tool output reference registry with FIFO eviction

Introduces a run-scoped registry that stores successful tool outputs
under stable `tool<idx>turn<turn>` keys and resolves `{{…}}` placeholders
in string args. Keeps the registry disabled by default; opt-in via
`RunConfig.toolOutputReferences.enabled`. Size caps mirror
`calculateMaxToolResultChars` so substituted content cannot exceed what
the model has already seen.

* feat: wire tool output references through ToolNode and graphs

Thread `RunConfig.toolOutputReferences` from `Run` into both
`StandardGraph` and `MultiAgentGraph`, and from there into every
compiled `ToolNode`. On each batch, ToolNode increments a per-run turn
counter, resolves `{{tool<i>turn<n>}}` placeholders in args before
invoking, registers the un-annotated output under `tool<i>turn<n>`, and
annotates the LLM-visible `ToolMessage.content` via a `_ref` object key
(for JSON objects) or a `[ref: …]` prefix line (everything else).
Unresolved placeholders are surfaced to the model via a trailing
`[unresolved refs: …]` line on the same tool message. The event-driven
dispatch path gets the same treatment so subagent-as-tool calls benefit
automatically.

* docs: document tool output references in bash tool description

Adds a dedicated section to the bash tool's prompt-facing description
explaining the `tool<idx>turn<turn>` key format, where it surfaces
(prefix line vs `_ref` field), how to embed `{{…}}` placeholders in the
`command` string, and what happens when a reference is unknown. Any
tool can reference previous outputs, but bash is the primary beneficiary
since piping into `jq`/`grep`/`awk` is the main use case.

* fix: gate bash reference docs on feature enablement

The tool-output reference guidance lived inside the bash tool's static
description constant, so the LLM was told to emit `{{tool0turn0}}`
placeholders on every run regardless of whether
`RunConfig.toolOutputReferences.enabled` was set. Those placeholders
passed through unresolved and leaked into the shell command, producing
confusing errors.

Extract the guidance into a sibling `BashToolOutputReferencesGuide`
constant plus a `buildBashExecutionToolDescription({ enableToolOutputReferences })`
composer. Hosts enable the guidance only when they also enable the
registry; the default `BashExecutionToolDescription` no longer mentions
references.

* refactor: tighten ToolOutputReferenceRegistry internals

Addressing review findings on the registry module:

- Drop the redundant `key` field from registry entries; store plain
  strings in `Map<string, string>` since the Map key is already the ref.
- Move the stateful `/g` matcher off the module-level export. The
  exported `TOOL_OUTPUT_REF_PATTERN` is now non-global (no `lastIndex`
  footgun for future callers); the global variant lives as a private
  static on the class.
- Short-circuit `resolve()` via a `hasAnyPlaceholder` pre-scan so
  placeholder-free args skip the deep tree walk and object allocations
  entirely.
- Document the `maxTotalSize` soft-cap semantics on
  `ToolOutputReferencesConfig` (most-recent entry is never self-evicted).
- Fix "appended" → "prepended" in the prefix helper's JSDoc.

Tests updated to match the non-global regex shape.

* fix: harden ToolNode reference lifecycle for edge cases

Addresses the codex review plus the deeper audit:

- Re-resolve `{{…}}` placeholders after a `PreToolUse` hook rewrites
  args via `updatedInput`. Previously a hook that introduced new
  placeholders would bypass substitution and leak raw text to the tool.
  The unresolved-refs record is refreshed on hook replacement so the
  LLM warning reflects the actually-sent args.
- Reset the registry and per-run turn counter when `config.configurable.run_id`
  changes between `run()` invocations. A host that reuses the same Run
  across `processStream` calls no longer serves a stale
  `tool<i>turn<n>` from a previous run.
- Handle `ToolMessage` returns from tools: the LangChain `tool()` helper
  auto-wraps string returns in a `ToolMessage`, which was taking the
  early-return branch and skipping annotation/registration entirely.
  Now the `tool`-typed branch also applies the reference key (string
  content only) or logs a warning and passes through (array content).
- Rename the public accessor to `_unsafeGetToolOutputRegistry` and mark
  it `@internal`: exposing a mutable registry on a published library
  invites external corruption. The accessor remains for test observation.

Adds integration tests for the event-driven dispatch path (previously
untested), including placeholder substitution in `ON_TOOL_EXECUTE`
requests, unresolved-ref reporting, PostToolUse hook output overriding
registered content, and PreToolUse hook re-resolution. Plus a runId-change
reset test for the direct path.

* fix: harden registry caps and _ref injection

Addresses two codex P2 findings:

- **maxTotalSize is now a hard cap.** The per-output cap is clamped to
  `min(maxOutputSize, maxTotalSize)` at construction so a single stored
  entry can never exceed the aggregate bound, and the eviction loop
  no longer keeps the most-recent entry around when it alone violates
  the cap. Fixes a case where `maxOutputSize: 1000, maxTotalSize: 100`
  silently retained 1000 chars. Updated the config JSDoc to drop the
  soft-cap language.
- **`_ref: null` no longer clobbers the injected key.** The guard that
  treated an existing `_ref: null` as non-conflicting was correct, but
  the spread order `{ _ref: key, ...obj }` then let the null re-win.
  Reversed to `{ ...obj, _ref: key }` so the injected value always
  lands last.

Adds tests for both: per-output clamping to aggregate bound, and
`_ref: null` → overwritten with the generated key.

* fix: surface unresolved refs on error paths and refine annotation UX

Addresses codex round 3 (P2) plus three follow-up NITs:

- **Unresolved refs on errors** (codex P2): `[unresolved refs: …]`
  is now appended to error ToolMessages too, across all three error
  paths — thrown exceptions in `runTool`, tool-returned
  `ToolMessage({status:'error'})`, and host-returned error results
  in `dispatchToolEvents`. Without this, an LLM using a bad reference
  key saw only the generic error and lost the self-correction signal
  this feature exists to provide.
- **Preserve `_ref` as the first key** (NIT N-1): the round-2
  `{ ...obj, _ref: key }` ordering fixed the null-clobber issue but
  regressed the key position. Destructure the conflicting key out
  first, then place our injected ref at the front, so the LLM reads
  the reference identifier before the payload.
- **Warn-once per tool per run for non-string content** (NIT N-3):
  a tracked `Set` on the ToolNode prevents the per-invocation log
  spam when a tool consistently returns multi-part content.
  Cleared alongside the registry on runId change.
- **JSDoc trail to the builder** (NIT N-2): `BashExecutionToolDefinition`
  now documents that reference-aware hosts should compose their own
  description via `buildBashExecutionToolDescription`.

Adds tests for all three error paths carrying unresolved-ref hints
and for the first-key placement invariant.

* fix: clear registry state on anonymous runs to prevent leakage

Addresses codex round 4 P2:

When `config.configurable.run_id` is absent, `incomingRunId === lastRunId`
(both `undefined`), so the previous reset branch never fired. A ToolNode
reused across independent sessions that omit `run_id` could therefore
serve stale `{{tool<i>turn<n>}}` substitutions from a prior session —
cross-session data leakage.

Treat a missing `run_id` as "a new, distinct run" on every batch: clear
the registry, warn-set, and turn counter on entry. Production callers
going through `Run` always set `run_id`, so this only affects direct
ToolNode usage (tests, library embedders).

Tests updated to pass an explicit `run_id` through the helper so
cross-batch persistence (`{{tool0turn0}}` substitution into a later
call) continues to work under the new semantics. Added a dedicated test
showing that anonymous callers get fresh state each batch (the
placeholder resolves to an unresolved-ref warning in the second call
because the first call's output is not retained).

* fix: thread turn per invocation and preserve raw outputs in registry

Two related changes:

**Codex round 5 (P2): per-invocation turn.**
`applyOutputReference` previously read `this.currentTurn` after the
tool `await`, which is mutable shared state. Two concurrent `invoke()`
calls on the same ToolNode could step on the field between sync-prefix
assignment and post-await registration, producing cross-batch key
collisions. `run()` now captures `turn` in a local and threads it
through `runTool` and `dispatchToolEvents`; `applyOutputReference`
takes a precomputed `refKey` instead of building one from shared
state.

**User-requested: preserve full tool outputs in the registry.**
The registry no longer mirrors the LLM-visible truncation. It stores
the raw, untruncated output so a `{{tool<i>turn<n>}}` substitution
pipes the complete payload into downstream bash/jq even when the LLM
only saw a head+tail-truncated preview in `ToolMessage.content`.
Separate size caps govern the registry:

- `HARD_MAX_TOTAL_TOOL_OUTPUT_SIZE` is now 5 MB (was 800 KB).
- `ToolOutputReferencesConfig.maxOutputSize` defaults to the 5 MB total
  cap, not `maxToolResultChars`.
- `applyOutputReference(llmContent, registryContent, refKey, unresolved)`
  takes both views explicitly: the truncated, annotated form for the
  model and the full, pristine form for the registry.

The event-driven dispatch path does the same split — the pre-hook raw
and post-hook raw are tracked separately from the truncated LLM view,
so the registered value is the hook-authoritative raw payload.

Adds tests for:
- Raw preservation: a tool returns 8 KB of output with
  `maxToolResultChars: 200`; the LLM sees the truncated form, but a
  later `{{tool0turn0}}` substitution delivers the full 8 KB verbatim.
- Concurrent invocations within a run: two overlapping batches of the
  same run resolve gates in reverse order; each registers under its
  own turn index (tool0turn0 and tool0turn1) with correct content.

* fix: share tool output registry across all ToolNodes in a run

Addresses codex round 6 (P1): multi-agent graphs construct one ToolNode
per agent, each with its own registry instance. A reference key
produced in agent A (for example `tool0turn0`) therefore wasn't
available when agent B tried to substitute `{{tool0turn0}}`, breaking
cross-agent piping.

Centralize run-scoped state onto the registry itself and share one
instance across every ToolNode a `Graph` compiles:

- `ToolOutputReferenceRegistry` now owns `turnCounter`, `lastRunId`,
  and `warnedNonStringTools`. New methods `nextTurn(runId)` and
  `claimWarnOnce(toolName)` handle the run-crossing reset and the
  warn-once memo respectively, atomically from JS's perspective so
  concurrent ToolNode invocations within one run see distinct turns.
- `ToolNodeOptions` gains `toolOutputRegistry?: ToolOutputReferenceRegistry`
  — a host-supplied instance that takes precedence over the config
  object. Passing the config still works for direct ToolNode usage.
- `Graph` lazily constructs one registry per run via
  `getOrCreateToolOutputRegistry()` and threads the same instance into
  every `CustomToolNode` it builds (both the event-driven and direct
  branches of `initializeTools`).
- `ToolNode.run()` shrinks to `const turn = this.toolOutputRegistry?.nextTurn(runId) ?? 0;`
  — the registry handles reset bookkeeping so ToolNode no longer
  duplicates any run-scoped fields.

Adds a test that registers output in one ToolNode and resolves
`{{tool0turn0}}` in a *different* ToolNode sharing the same registry
— cross-node substitution now works, and the second node's own
output lands under `tool0turn1` (proving the turn counter is shared).

* fix: preserve JSON validity and lower per-output default for ref annotation

Addresses two codex round-7 P2 findings:

**JSON validity with unresolved refs.** `applyOutputReference`
appended the unresolved-refs trailer as a trailing text line even
when the tool output was a JSON object, producing invalid JSON that
consumers parsing `ToolMessage.content` couldn't handle.

`annotateToolOutputWithReference` now takes the `unresolved[]`
list and routes both the ref key and the unresolved refs through
the same JSON injection path. New `TOOL_OUTPUT_UNRESOLVED_KEY`
constant (`_unresolved_refs`) surfaces unresolved keys as an array
field on the object instead of a trailing line. Non-object content
keeps the old `[ref: …]\n…\n[unresolved refs: …]` shape.

**Per-output cap default.** The registry's default per-output cap
was `HARD_MAX_TOTAL_TOOL_OUTPUT_SIZE` (5 MB), letting a single
`{{…}}` substitution inject multi-megabyte strings into bash
`command` args and blowing typical shell `ARG_MAX` limits. Default
now `HARD_MAX_TOOL_RESULT_CHARS` (~400 KB) — matches the
standard tool-result budget and the LLM-visible truncation size.
Hosts wanting fuller fidelity can still raise it up to the 5 MB
total budget via config.

Added tests covering:
- unresolved refs carried as a JSON field for parseable objects;
- trailer-line form for non-object content;
- unresolved-only annotation (no ref key, for error paths);
- JSON preservation when only unresolved refs are present;
- no-op when nothing to annotate.

* fix: emit resolved args and surface unresolved on non-string content

Codex round 8 flagged two remaining gaps:

**P2 — Resolved args leak to audit logs as `{{…}}` templates.**
When `runTool` substituted a placeholder, the post-substitution args
only lived in the local `args` variable. `handleRunToolCompletions`
then emitted `ON_RUN_STEP_COMPLETED.tool_call.args` from the original
`call.args`, so hosts observing the event saw the unresolved template
rather than what the tool actually received.

ToolNode now tracks `resolvedArgsByCallId` (cleared at the start of
each batch, populated after `registry.resolve`). The completion
handler prefers that map and falls back to `call.args` when no
substitution happened. Event-driven dispatch already emitted the
resolved args through `entry.args`, so only the direct path needed
the fix.

**P3 — Non-string ToolMessage content drops unresolved-ref warnings.**
The non-string branch (multi-part `ToolMessage.content` blocks) only
warn-logged the skipped registration; unresolved placeholder keys
were silently dropped, denying the LLM the self-correction signal
that the string and error paths already provide. Now prepend a
`{ type: 'text', text: '[unresolved refs: …]' }` block to the array
content so the model sees the warning before the data blocks.

Adds tests for:
- `ON_RUN_STEP_COMPLETED.tool_call.args` carrying the substituted
  command on a batch that resolved `{{tool0turn0}}`.
- A multi-part-content tool that gets the `[unresolved refs: …]`
  text block prepended to its content array, with original blocks
  preserved after.

* fix: preserve existing _ref/_unresolved_refs payload fields on annotation

Codex round 9 (P2): `tryInjectRefIntoJsonObject` unconditionally
stripped both `_ref` and `_unresolved_refs` before rebuilding the
object, which silently deleted legitimate tool payload data on two
paths:

- plain-annotation (`key` set, no unresolved): an existing
  `_unresolved_refs` payload field was dropped because we always
  stripped it even though we weren't injecting one.
- unresolved-only / error (`key` undefined, unresolved > 0): an
  existing `_ref` field was dropped for the same reason.

Now only strip the key we're actually injecting. Pre-existing values
for the *other* framework-owned field pass through untouched, turning
the helper back into a pure annotation that mutates nothing outside
its intended slot.

Added a deep-equal collision guard for `_unresolved_refs` mirroring
the existing `_ref` guard: when an existing array value differs from
the one we'd inject, fall back to the prefix form rather than
silently overwriting. Deep-equal arrays are accepted idempotently.

New tests cover: `_unresolved_refs` preserved on the ref-only path;
`_ref` preserved on the unresolved-only path; collision fallback
when existing `_unresolved_refs` is a non-matching array; and
deep-equal idempotence.

* fix: parallelize mixed direct+event dispatch to isolate same-turn refs

Codex round 10 (P2): in the mixed event-driven + direct path,
`run()` awaited direct calls first and only then dispatched event
calls. Because `runTool` registered its output immediately, event
calls ran `registry.resolve` against a registry that already held
the direct branch's just-registered outputs — meaning event args
could resolve same-turn `{{tool<i>turn<n>}}` references (even
forward refs to same-batch direct calls). Pure-direct or
pure-event batches didn't allow this, so the semantic was
mode-dependent.

Run both branches in parallel. Both sync prefixes — which include
all `registry.resolve(args)` passes — now complete against the
*pre-batch* registry state before either branch's async bodies get
a chance to register. JS single-threaded execution guarantees no
microtask (= no registration) runs between the two sync prefixes,
so same-turn refs remain unresolved regardless of how calls are
split across direct/event in a mixed batch. Cross-batch refs still
resolve as before.

Added a test that exercises a mixed batch with a direct call at
index 0 and an event call at index 1 whose args reference
`{{tool0turn0}}`. First batch: the placeholder does NOT resolve
(host receives the literal template). Second batch: same ref now
points across a batch boundary and resolves to the direct call's
registered output from batch 1.

* fix: restore fail-fast and partition registry state per runId

Codex round 11 flagged two P1 regressions:

**Fail-fast under mixed direct+event mode.** My round-10
`Promise.all([directPromise, eventPromise])` started the host-
dispatch branch *immediately*, so a direct tool throwing (via
`handleToolErrors=false` or `GraphInterrupt`) could no longer abort
event dispatch before side effects ran.

Fix: snapshot event calls' args synchronously against the pre-batch
registry state, then serialize — await directs first, dispatch
events only if directs succeeded. Same-turn isolation stays because
the event args were captured before any direct registration. Added
a test that confirms a thrown direct tool prevents any host
`ON_TOOL_EXECUTE` dispatch.

**Run-scoped registry state leaked across concurrent runs.** The
registry's `nextTurn(runId)` cleared global state whenever `runId`
changed, so two overlapping runs reusing the same ToolNode could
clobber each other: run B's first batch wiped run A's in-flight
state, and run A's later registrations landed in run B's bucket.

Fix: partition the registry internally by `runId`. Each bucket
owns its own entries/totalSize/turnCounter/warnMemo; all public
methods (`set`/`get`/`resolve`/`nextTurn`/`claimWarnOnce`) now take
`runId` and route to the correct bucket. Anonymous (`run_id` absent)
batches still reset per-call. FIFO-evicts oldest runId bucket when
`maxActiveRuns` (default 32) is exceeded; new `releaseRun(runId)`
API for deterministic cleanup. ToolNode threads runId through every
registry call site.

Added a test that runs two runs back-to-back on a shared registry
and verifies: run A's output survives run B starting; each run
resolves `{{tool0turn0}}` against its own partition.

* fix: clear registry on run cleanup and give anonymous batches unique scopes

Codex round 12 flagged two more issues:

**P2 — Graph.clearHeavyState didn't clear registry contents.** The
Graph nulled its own `_toolOutputRegistry` field, but each compiled
ToolNode had already captured the instance at construction time.
So stored `tool<i>turn<n>` entries (up to 5 MB of raw output) stayed
reachable across subsequent `processStream()` calls on the same
Run. Now `clearHeavyState()` calls `registry.clear()` before
dropping the reference so ToolNodes holding the same instance see a
wiped registry on the next batch.

**P3 — Concurrent anonymous invocations clobbered each other.**
`nextTurn(undefined)` deleted the shared anonymous bucket before
creating a new one; two overlapping invocations without `run_id`
could wipe each other's in-flight state. ToolNode now mints a
unique `batchScopeId` (`\0anon-<N>`) per anonymous `run()` and
threads it through every registry call in place of the raw
`run_id`. Each anonymous batch gets its own partition, so
concurrent anonymous calls are fully isolated.

Tests added:
- `clear()` drops every run bucket; `releaseRun(runId)` drops only
  the named one.
- Two concurrent `node.invoke(...)` calls without `run_id`, gates
  resolved in reverse order — each observes its own registered
  output under `[ref: tool0turn0]` with no cross-contamination.

* fix: clamp caller-supplied maxTotalSize to the documented hard cap

Codex round 14 (P2): `HARD_MAX_TOTAL_TOOL_OUTPUT_SIZE` is documented
as an absolute upper bound, but the constructor's user-provided
branch was bypassing it — `calculateMaxTotalToolOutputSize` already
clamps the computed default, but a host passing
`maxTotalSize: 50_000_000` would have a 50 MB registry instead of
5 MB. Apply the same `Math.min(..., HARD_MAX_TOTAL_TOOL_OUTPUT_SIZE)`
to caller-supplied values so the cap actually holds. Per-output is
already clamped to the (now hard-bounded) total, so it inherits the
fix.

Added a test that requests `maxTotalSize: 50_000_000` and asserts
both `totalLimit` and `perOutputLimit` end up ≤ 5 MB.

* fix: clamp registry caps, snapshot for hook re-resolve, batch-local args sink

Addresses two new codex P2 findings plus follow-up audit items:

**Codex P2 — Clamp configured maxTotalSize to hard limit.** A
caller-supplied `maxTotalSize` bypassed the documented 5 MB
absolute cap (`HARD_MAX_TOTAL_TOOL_OUTPUT_SIZE`). The constructor
now clamps via `Math.min(options.maxTotalSize, HARD_MAX_…)` so the
registry can never retain more than the documented bound,
regardless of host config.

**Codex P2 — Preserve pre-batch ref snapshot after PreToolUse
rewrites.** In the mixed direct+event path the PreToolUse hook's
`updatedInput` re-resolved against the live registry, which by
that point already had same-turn direct outputs registered. A
hook-introduced `{{tool<idx>turn<current>}}` could therefore leak
a same-batch direct result, breaking the same-turn isolation
guarantee.

Added `ToolOutputReferenceRegistry.snapshot(runId)` returning a
frozen `ToolOutputResolveView` that resolves against a defensive-
copied entries map. The mixed path takes a snapshot before launching
directs and threads it into `dispatchToolEvents`; the hook re-resolve
uses the snapshot when set, falling back to live-registry resolution
in the pure-event path (where no in-batch registrations have happened
yet).

**Audit follow-ups:**
- Renamed inner `turn` (per-tool usage count) to `usageCount` in
  `runTool` to stop shadowing the outer batch-turn parameter.
- Switched ToolNode's `./toolOutputReferences` import to the `@/`
  alias used everywhere else in the file.
- Removed instance-level `resolvedArgsByCallId` in favor of a batch-
  local `Map` threaded through `runTool` and
  `handleRunToolCompletions`. Concurrent `run()` calls on the same
  ToolNode no longer race on shared sink state — each batch gets
  its own map.

Tests added:
- `snapshot()` resolves against captured state and ignores later
  mutations / additions; multiple snapshots are independent.
- `maxActiveRuns` LRU eviction drops the oldest bucket when capped.
- `calculateMaxTotalToolOutputSize` boundary cases (undefined, 0,
  negative, exact-cap, just-past-cap).
- `buildBashExecutionToolDescription` returns base by default and
  appends the references guide when `enableToolOutputReferences:true`.
- New ToolNode integration test: mixed direct+event batch with a
  PreToolUse hook injecting `{{tool0turn0}}` — host receives the
  literal template (snapshot prevents same-turn resolution).

* refactor: collapse runTool/dispatch params into batch-context options object

Audit follow-ups from the post-round-14 comprehensive review:

**#1 — Parameter explosion.** `dispatchToolEvents` had grown to 8
positional parameters (5 optional) and `runTool` to 6 (4 optional);
`executeViaEvent` was passing trailing `undefined` placeholders to
skip slots. Replaced with two typed options bags
(`RunToolBatchContext`, `DispatchBatchContext`) so each method takes
just `(call/calls, config, batchContext)`. Eliminates the
`undefined`-placeholder anti-pattern and stops future context fields
from re-bloating the signatures.

**#2 — Import-type grouping.** `import type { ToolOutputResolveView,
… }` was placed after the local value imports, separated from the
other `import type` block. Moved into the type-import section per
AGENTS.md's "import type imports should be in their own section"
rule.

**#3 — Test file location.** `src/utils/truncation.test.ts` was
co-located with the source while sibling tests use the
`__tests__/` subdirectory convention. Moved to
`src/utils/__tests__/truncation.test.ts` and switched the import to
the `@/utils/truncation` alias.

**#4 — Missing snapshot edge case.** Added `snapshot()` test for a
`runId` with no bucket — confirms placeholders are reported as
unresolved and content passes through unchanged when nothing has
been registered.

Also exported `PreResolvedArgsMap` and `ResolvedArgsByCallId` aliases
from the registry module so the new context types reuse a single
source of truth instead of duplicating shape definitions.

* chore: bump to 3.1.71-dev.0 and tighten handleRunToolCompletions typing

- Use the exported `ResolvedArgsByCallId` alias for the
  `handleRunToolCompletions` `resolvedArgsByCallId` parameter so it
  matches the batch-context types instead of redeclaring the inline
  shape; the parameter's docstring moved to a `@param` block on the
  method JSDoc.
- Bump `@librechat/agents` to `3.1.71-dev.0` for the
  tool-output-references release (rebased onto 3.1.70 release).
* refactor: lazy-annotate tool output refs at LLM request time

Demote `_ref` annotation from a durable mutation of `ToolMessage.content`
to a transient projection applied right before each provider call.

`ToolNode` now stamps `additional_kwargs._refKey` /
`_unresolvedRefs` and leaves `content` clean (truncated only). At
request time, `attemptInvoke` pulls the run-scoped registry off the
graph and applies `annotateMessagesForLLM` to project a copy of the
relevant ToolMessages with the same `[ref: …]` / `_ref` shape the
model was already trained on. Annotations never persist back into
graph state, so downstream consumers (LibreChat tool-output panel,
audit logs, persisted history) see raw output, and stale `_refKey`
values from prior runs silently no-op via a registry-presence gate.

The `RunConfig.toolOutputReferences.enabled` flag and the
`{{tool<i>turn<n>}}` substitution semantics are unchanged.

* chore: bump version to 3.1.71-dev.1

* perf: drop strip + lazy-allocate output array in annotateMessagesForLLM

Two complementary cleanups:

1. Drop `stripRefMetadata`. LangChain's provider serializers don't
   transmit `additional_kwargs` to provider HTTP APIs, and the
   projected message array is discarded after the request — so the
   metadata never reaches the wire and never round-trips back into
   graph state. The defensive clone was wasted work: an O(n) walk
   plus a fresh object per annotated ToolMessage on every invoke.

2. Lazy-allocate `out`. The previous `messages.map(...)` + trailing
   reference-equality scan allocated an m-length array on every call
   even when no ToolMessage needed annotation (the common case
   mid-conversation, when refs go stale). Switching to a `for` loop
   that copies via `messages.slice()` only on first projection makes
   the no-op path truly allocation-free.

* fix: stamp _refScope so anonymous-run refs annotate correctly

ToolNode partitions the registry by scope id, falling back to a
synthetic per-batch key (`\0anon-<n>`) when `config.configurable.run_id`
is absent. The lazy annotation transform looked up via the
config-derived runId, which is undefined for anonymous callers — so
the registry-presence gate missed and `[ref: …]` markers were
silently dropped, even though the underlying output had been
registered.

Stamp the registry scope alongside `_refKey` in
`ToolMessage.additional_kwargs` and read it back in
`annotateMessagesForLLM`. For named runs `_refScope` equals the
run_id; for anonymous batches it carries the synthetic scope so the
lookup hits the right bucket. Falls back to the runId param when
`_refScope` is absent (legacy / pre-this-fix messages).

Caught by codex review on PR #117.

* review: address comprehensive audit findings on lazy ref annotation

Resolves all valid findings from the codex review (#117):

#1 (MAJOR) - Defensive `additional_kwargs` strip on the projected
ToolMessage. The earlier reasoning (LangChain serializers don't
transmit `additional_kwargs`) is correct today but fragile against
custom adapters and future LangChain changes. Strip our three
framework-owned keys (`_refKey`, `_refScope`, `_unresolvedRefs`) on
the projection so the metadata never reaches the wire under any
serializer behavior. Cost: one shallow object spread per annotated
message. Early-out when none of the keys are present preserves the
no-strip path.

#2 (MINOR) - New test for JSON-object content carrying both a live
`_refKey` and `_unresolvedRefs`, exercising
`annotateToolOutputWithReference`'s collision-detection logic
through the projection entry point.

#3 (MINOR) - New cross-run hydration smoke test in `invoke.test.ts`:
runs ToolNode under run-1, persists a real tool message, then
hydrates it into run-2's state alongside a fresh run-2 tool message.
Verifies run-2's annotation transform annotates only run-2's live
ref and leaves run-1's hydrated message clean — the headline
scenario the lazy-annotation refactor was designed for.

#4 (MINOR) - `Graph.getOrCreateToolOutputRegistry` JSDoc now carries
`@internal`, mirroring `_unsafeGetToolOutputRegistry` on ToolNode.

#7 (NIT) - Extracted `cloneToolMessageWithContent` helper to
deduplicate the 7-field ToolMessage construction across the
string-content and array-content branches.

#6/#8/#9 (NIT) - Removed unused `streamMessages` field from
`buildCapturingModel`, replaced `any` with a `StubModel` interface
in test helpers, fixed eslint indentation warnings.

#5 from the review (anonymous-run silently skipped annotation) was
already resolved in the prior `_refScope` fix.

* fix: defensively coerce ref metadata in annotateMessagesForLLM

Treat persisted/hydrated `ToolMessage.additional_kwargs` as untrusted
input. The previous transform trusted `_unresolvedRefs` to be a
`string[]` and `_refKey`/`_refScope` to be strings, then immediately
called `.length` / `.join(...)` and used the values as registry keys.
A single hydrated message carrying a malformed shape (e.g.
`_unresolvedRefs: 'tool0turn0'` from an upstream serialization
quirk) would crash `attemptInvoke` before the provider call,
turning one bad message into a hard run failure.

Add three small reader helpers — `readRefKey`, `readRefScope`,
`readUnresolvedRefs` — that coerce non-string scalars to undefined
and non-array `_unresolvedRefs` to `[]`. Inside a real array,
filter out non-string entries rather than silently failing later
in the join.

Caught by codex review on PR #117 (P2: Guard lazy annotation against
malformed _unresolvedRefs).

* test: add _refScope coverage at ToolNode unit level + clarify test title

Two follow-up review findings:

#1 (MINOR) - ToolNode unit tests had `getRefKey` and `getUnresolvedRefs`
helpers but no `getRefScope`, so a regression in
`recordOutputReference` that dropped `_refScope` stamping would only
surface via the integration test in `invoke.test.ts`. Add
`getRefScope` and spot-check it at two key sites: the basic stamping
path (named run, scope == run_id) and the concurrent anonymous
invocations path (each batch gets a distinct synthetic
`\0anon-<n>` scope).

#2 (NIT) - The test "drops additional_kwargs entirely when only
framework keys were present" actually asserts `toEqual({})` because
LangChain's ToolMessage constructor normalizes the `undefined` we
pass to `{}`. Renamed to "leaves additional_kwargs empty when
stripping removed every framework key" with a JSDoc clarifying the
chain.

* fix: strip framework ref metadata even on stale-ref-only messages

Codex caught that the previous lazy-projection branch returned the
original `ToolMessage` reference-equal when the only metadata was a
stale `_refKey` (and no `_unresolvedRefs`), so `_refKey` /
`_refScope` survived in `additional_kwargs` on the payload reaching
`model.invoke`/`model.stream`. That contradicts the defensive goal
of `stripFrameworkRefMetadata`: framework keys must never reach the
wire under any custom or future serializer that transmits
`additional_kwargs`.

Project unconditionally whenever any framework key is present,
regardless of whether annotation applies. The annotation step itself
remains conditional on a live registry hit / `_unresolvedRefs`. Net
effect: stale-ref messages keep their content unchanged but get a
clean projected `additional_kwargs`.

Also harden the feature-disabled fast path: hosts that haven't
enabled `RunConfig.toolOutputReferences` get an explicit
short-circuit at O(1) without touching a single ToolMessage. Added
a regression test that spies on `BaseMessage._getType` to assert
not a single message is iterated when the registry is undefined.

Existing tests for stale/malformed ref metadata updated to assert
the new strip-on-projection contract instead of reference-equality.

* docs: nudge model toward heredoc in BashToolOutputReferencesGuide

Real-world testing against LibreChat + ClickHouse + bash hit a
predictable pattern: model emits `echo '{{ref}}' | wc -c`, the
substituted ClickHouse output contains literal single quotes /
parens / arbitrary bytes (e.g. `randomString` results), and the
shell breaks before `wc` ever runs. Model self-corrected on retry
with `wc -c << 'EOF' … EOF`, but burned a tool-call round trip to
discover it.

Substitution itself is doing the right thing — the SDK can't know
the intended quoting context (jq, grep -F, here-string, python -c,
…) so auto-escaping would break legitimate inline cases more often
than it'd help. Same failure mode hits any inline literal data; refs
just make it easier to trip on.

Cheap fix: surface the heredoc-with-quoted-delimiter pattern in the
guide upfront, so the model reaches for it without having to learn
the lesson the hard way. Pure ASCII line-oriented output keeps the
shorter `echo '…' | …` form.

* chore: release 3.1.71

* fix: guard `in` probes against non-object additional_kwargs

Codex P1: `annotateMessagesForLLM` cast `m.additional_kwargs` to
`Record<string, unknown>` and probed with `'_refKey' in meta`. The
`in` operator throws `TypeError` on primitives, so a hydrated or
client-supplied ToolMessage carrying `additional_kwargs: 'string'`
(or any non-object) would crash `attemptInvoke` before the provider
call instead of being safely skipped.

Add a runtime `typeof rawMeta !== 'object'` guard before the `in`
probes. Malformed messages now pass through unchanged and the rest
of the array continues to project normally — same defensive posture
already applied to `_refKey` / `_unresolvedRefs` field shapes.

Regression test forces a primitive past LangChain's typed setter via
cast and asserts (a) no throw, (b) the malformed message passes
through, and (c) a subsequent live-ref message still annotates.

* chore: sync package-lock.json to 3.1.71

* review: address remaining review nits on lazy ref annotation

Comprehensive audit follow-up:

#1 (MINOR) - `InvokeContext` previously resolved indirectly through
`Parameters<ChatModelStreamHandler['handle']>[3]`, hiding the
`getOrCreateToolOutputRegistry()` accessor that `attemptInvoke` calls.
Reshape the type as `NonNullable<...> & { getOrCreateToolOutputRegistry?(): ... }`
so the contract is visible at this site without chasing the
upstream handler signature. NonNullable strips `undefined` so the
intersection doesn't collapse on the undefined branch; callers
express optionality via `context?: InvokeContext`.

#3 (MINOR) - New unit test for artifact preservation through
`annotateMessagesForLLM`. Hosts attach `artifact` via the
`content_and_artifact` response format (code-execution sessions,
MCP tools); the projection must round-trip it untouched so audit
logs and code-session tracking keep working. `makeToolMessage`
helper extended to accept `artifact`.

#4 (NIT) - Extracted test-helper return types `CapturingModel` and
`StreamingCapturingModel` to named aliases. Closing-brace
indentation is no longer ambiguous against property lines, and the
helpers' return shape is reusable.

#5 (NIT) - Documented the `as unknown as ToolMessage['content']`
double-cast inline. LangChain's content union doesn't accept a
freshly built mixed array literal even though the structural shape
is valid at runtime; the comment now records why the cast is
unavoidable.

Skipped #2 (`tryFallbackProviders` test fragility): the reviewer
themselves marked it "Accept as pragmatic." Adding a model-factory
parameter to production code purely for testability is an
anti-pattern, and deleting the test would lose meaningful coverage
of the fallback branch.

Skipped #6 (version bump bundling): the release version bump was
explicitly requested in the PR conversation; the bundling is
intentional for this project's workflow.

* docs: align InvokeContext JSDoc with the actual call site

The JSDoc example used `context?.getOrCreateToolOutputRegistry?.()`
(double optional chain), but the actual call site is single-chain:
`context?.getOrCreateToolOutputRegistry()`. The method is required
on the `StandardGraph` branch of the intersection, so once `context`
is non-null the call is unconditional. Note this inline so the docs
don't drift from the code.

Cosmetic-only follow-up to review of #117.
@pull pull Bot locked and limited conversation to collaborators Apr 26, 2026
@pull pull Bot added the ⤵️ pull label Apr 26, 2026
@pull pull Bot merged commit 644ec2d into innFactory:main Apr 26, 2026
1 check passed
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant