perf: vectorize KV cache prefix matching with numpy #2179
Open
nausicaalii wants to merge 2 commits into abetlen:main from
Conversation
Replace the O(n) Python for-loop in KV cache prefix matching and longest_token_prefix() with a numpy vectorized comparison. The element-wise numpy comparison runs in optimized C/SIMD instead of Python's interpreter loop, which matters as conversation history grows (10K+ tokens). No change in behavior: both paths find the first position where the cached and new token sequences diverge.
Replace the inline prefix matching in generate() with a call to longest_token_prefix(). Remove .tolist() conversions in _create_completion() so numpy arrays are compared directly, avoiding list conversion overhead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
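The vectorized prefix matching described above can be sketched as follows. This is a minimal sketch, not the PR's actual code: the name mirrors longest_token_prefix(), but the real signature and details in llama-cpp-python may differ.

```python
import numpy as np

def longest_token_prefix(a, b):
    """Length of the longest common prefix of two token sequences.

    Sketch of the vectorized approach: compare element-wise in numpy,
    then locate the first mismatch with np.argmin.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    n = min(len(a), len(b))
    if n == 0:
        return 0
    eq = a[:n] == b[:n]          # boolean equality array, computed in C
    # np.argmin returns the index of the first False (mismatch).
    # If every element matches, argmin would return 0, so check all() first.
    if eq.all():
        return n
    return int(np.argmin(eq))
```

np.argmin works here because False < True, so the first False in the boolean array is the first minimum; the eq.all() guard handles the full-match case where argmin would otherwise misreport 0.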
Summary
- Replace the token-by-token Python loops in generate() and longest_token_prefix() with numpy vectorized element-wise comparison
- Use np.argmin on a boolean equality array to find the first mismatch position in a single vectorized pass
- Collapse the inline prefix matching in generate() into a single longest_token_prefix() call
- Remove .tolist() conversions in _create_completion() so numpy arrays are compared directly

Motivation
The current prefix matching iterates token-by-token in Python to find where the cached prompt diverges from the new prompt. This is fine for short prompts, but becomes a bottleneck as conversation history grows: multi-turn chat sessions can accumulate 10K–100K+ tokens in input_ids, and the linear Python loop runs on every generate() call. Numpy's vectorized comparison runs in optimized C/SIMD, giving a significant speedup for large token sequences while preserving identical behavior.
Profiling results
Benchmarked on Apple M3 Pro, Python 3.12, numpy 2.2. Mismatch placed at 90% through the sequence. 500 trials.
generate() hot path: self._input_ids (numpy array) vs tokens (Python list)

_create_completion() cache lookup: both inputs are numpy arrays (eliminated .tolist() conversion)

Benchmark script
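The benchmark script itself did not survive extraction. A hypothetical micro-benchmark of the comparison being measured (mismatch placed at 90% through the sequence, Python loop vs numpy) might look like this; it is an illustration, not the PR's actual script.

```python
import timeit
import numpy as np

def prefix_loop(a, b):
    # Baseline: token-by-token comparison in the Python interpreter.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def prefix_numpy(a, b):
    # Vectorized: element-wise equality in C, first mismatch via argmin.
    m = min(len(a), len(b))
    eq = np.asarray(a[:m]) == np.asarray(b[:m])
    return int(m if eq.all() else np.argmin(eq))

tokens = np.arange(100_000)
other = tokens.copy()
other[90_000] = -1  # mismatch at 90% through the sequence

loop_t = timeit.timeit(lambda: prefix_loop(tokens.tolist(), other.tolist()), number=10)
vec_t = timeit.timeit(lambda: prefix_numpy(tokens, other), number=10)
print(f"loop: {loop_t:.4f}s  numpy: {vec_t:.4f}s")
```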
Test plan
- longest_token_prefix correctness across edge cases: empty sequences, full match, partial match, single element, no match, different lengths, large sequences (10K tokens)
- test_real_model: passes (low-level batch decode)
- test_real_llama: passes (multiple sequential create_completion calls that exercise prefix matching)
- test_real_llama_embeddings: passes

🤖 Generated with Claude Code
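For reference, the edge cases listed in the test plan could be exercised with a self-contained sketch like the one below. The stand-in longest_token_prefix here is a hypothetical reimplementation matching the behavior the PR describes, not the actual code or test suite.

```python
import numpy as np

def longest_token_prefix(a, b):
    # Stand-in implementation: numpy element-wise comparison,
    # first mismatch located with np.argmin on the boolean array.
    m = min(len(a), len(b))
    eq = np.asarray(a[:m]) == np.asarray(b[:m])
    return int(m if eq.all() else np.argmin(eq))

assert longest_token_prefix([], []) == 0                    # empty sequences
assert longest_token_prefix([1, 2, 3], [1, 2, 3]) == 3      # full match
assert longest_token_prefix([1, 2, 3], [1, 2, 9]) == 2      # partial match
assert longest_token_prefix([7], [7]) == 1                  # single element
assert longest_token_prefix([1, 2], [9, 9]) == 0            # no match
assert longest_token_prefix([1, 2, 3, 4], [1, 2]) == 2      # different lengths
big = list(range(10_000))
assert longest_token_prefix(big, big) == 10_000             # large sequences
```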