[Feature] Add Distributed Posting Router for SPANN #446
Draft
TerrenceZhangX wants to merge 53 commits into microsoft:users/qiazh/merge-spfresh-tikv from
Conversation
…c, benchmarks

Core routing (PostingRouter.h):
- Hash routing: GetOwner uses headID % NumNodes for deterministic assignment
- RemoteLock RPC for cross-node Merge serialization (try_lock + retry)
- BatchAppend, HeadSync, InsertBatch packet types and handlers
- TCP-based server/client for inter-node communication

ExtraDynamicSearcher.h integration:
- EnableRouter/AdoptRouter for index lifecycle management
- Split: BroadcastHeadSync after creating/deleting heads
- MergePostings: cross-node lock when the neighbor headID lives on a different node
- MergePostings: BroadcastHeadSync for the deleted head after merge
- Reassign: route Append to the owner node + FlushRemoteAppends
- AddIndex: route appends to the owner node via QueueRemoteAppend
- SetHeadSyncCallback: wire up HeadSync + RemoteLock callbacks

Infrastructure:
- IExtraSearcher/Index/VectorIndex: add routing virtual method chain
- Options/ParameterDefinitionList: RouterEnabled, RouterLocalNodeIndex, RouterNodeAddrs, RouterNodeStores config params
- CMakeLists: link Socket sources and Boost into SPTAGLibStatic
- Connection.cpp: safe remote_endpoint() with error_code (no throw)
- Packet.h: Append, BatchAppend, InsertBatch, HeadSync, RemoteLock types
- SPFreshTest.cpp: ApplyRouterParams, FlushRemoteAppends, WorkerNode test
- Benchmark configs: 100k/1m/10m x 1/2/3 node
- run_scale_benchmarks.sh: automated benchmark runner
- docker/tikv: TiKV cluster docker-compose + config

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix index dir creation: create the parent dir only (not the spann_index subdir) so the build code creates spann_index and the safety check passes
- Clear checkpoint after the build phase so the driver re-runs all insert batches
- Add VectorIndex::AdoptRouter to transfer the router between batch clones instead of creating a new TCP server per batch (port conflict fix)
- Fix ExtraDynamicSearcher::AdoptRouter to override the IExtraSearcher interface

100k results (routing works, ~50/50 local/remote split):
- 1-node steady-state: 90.2 vps
- 2-node steady-state: 80.5 vps
- 3-node steady-state: 77.7 vps

No scaling at 100k due to the small batch size (100 vectors) and shared TiKV.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
12 slides covering: problem statement, solution architecture, full flow comparison (single vs distributed), hash routing, append write path, split/merge/reassign routing, HeadSync broadcast, 3-node sequence diagram, design decisions, network protocol, config, and 100k results. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix ini thread params: replace NumThreads with NumSearchThreads/NumInsertThreads
- Add 100M benchmark ini files (1-node, 2-node, 3-node)
- Update data paths to /mnt/data_disk/sift1b in all ini files
- Add BENCHMARK_GUIDE_SCALE_COMPUTE.md (English)
- Add BENCHMARK_RESULTS_SCALE_COMPUTE.md with 100K/1M/10M results
- Update docker-compose and tikv.toml for 3-PD/3-TiKV cluster
- Update run_scale_benchmarks.sh with multi-scale orchestration
- Add .gitignore entries for generated benchmark artifacts
- Fix FullSearch routing for multi-node search (per-node build)
- Update 10M benchmark: insert throughput 2-node 1.65x, 3-node 1.98x
- Search latency 10M: 2-node -35%, 3-node -50% vs 1-node
- Near-linear insert scaling across all data sizes (100K, 1M, 10M)
- Update benchmark configs, test harness, and scale benchmark script
- Delete all 24 benchmark INI files from Test/
- Replace section 5 in BENCHMARK_GUIDE_SCALE_COMPUTE.md with complete deterministic generation rules (scale table, topology rules, template, Python generator script)
- INI files can be regenerated on demand via the guide
- Embed full docker-compose.yml and tikv.toml contents in BENCHMARK_GUIDE_SCALE_COMPUTE.md section 4.1
- Remove docker/tikv/ from git tracking (files stay on disk)
- Use <NVME_DIR> placeholder instead of absolute paths

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bug fixes:
- SPANNIndex.cpp: remove redundant SortResult() in FullSearchCallback that corrupted remote search results (heapsort on already-sorted data)
- TestDataGenerator.cpp: fix EvaluateRecall truth NN stride from 1 to K

Feature:
- SPFreshTest.cpp: add BuildOnly parameter to skip insert batches

Benchmark results (Float32/dim64, 10M scale):
- 1-node: 93.8 vps, 2-node: 200.0 vps (2.13x), 3-node: 271.4 vps (2.89x)
- Recall stable within each config after the double-sort fix

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Measure RPC round-trip time (sendTime → future.get()) and assign per-query latency for remote search results. Previously p50/min latency showed 0 for remote queries. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Revert TiKVVersionMap node-scoped prefix optimization (vc:{nodeIndex}:...)
that broke cross-node version check correctness. Version map now uses
shared namespace (vc:{layer}:...) so all nodes can read/write the same
version data. Node-scoped optimization deferred to future branch.
Also add explanatory comments for AdoptRouter and HeadSync broadcast.
Restore commented-out debug log lines.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
These were only used by the reverted node-scoped version map. LocalToGlobalVID is kept (used in insert path for VID uniqueness). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
InsertVectors now uses dual path: per-vector multi-threaded insert for single-node (original behavior), bulk AddIndex for router-enabled multi-node (amortizes RPC overhead via batched remote appends). Also add comment on AddIndex explaining caller-side shard partitioning and LocalToGlobalVID purpose. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restore the original vidIdx counter loop for post-heap version filtering instead of the candidateIndices array approach. Both are functionally equivalent but the original pattern is simpler. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Restore 'For quantized index, pass GetFeatureDim()' comment
- Remove else { func(); } branch that was not in the original code
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use ConvertToString(valueType) and INI parameters to build the perftest_* filenames, matching TestDataGenerator::RunLargeBatches convention. Moves filename construction out of the per-batch loop. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Document the 3-file INI pattern for multi-node benchmarks:
- _build.ini: Rebuild=true, no Router (build phase)
- _driver.ini: Rebuild=false, Router enabled (driver/n0)
- _n{i}.ini: worker nodes (n1, n2, ...)
Update Python generator and shell script to match. Remove the
sed-based approach of patching _n0.ini at runtime.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move Router parameters (RouterEnabled, RouterLocalNodeIndex, RouterNodeAddrs, RouterNodeStores) from DefineSSDParameter to DefineRouterParameter with its own [Router] section in:
- ParameterDefinitionList.h: new DefineRouterParameter macro
- Options.h: SetParameter/GetParameter handle the 'Router' section
- SPANNIndex.cpp: SaveConfig outputs a [Router] block
- SPFreshTest.cpp: read the [Router] INI section; ApplyRouterParams uses the 'Router' section
- Benchmark guide: updated INI template and Python generator

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
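As a rough illustration, a [Router] section at this point in the PR might look like the following (addresses and node counts are placeholders, not taken from the repo; the parameter names are the four listed above, and a later commit in this PR renames the section to [Distributed]):

```ini
[Router]
RouterEnabled=true
; Index of this process in RouterNodeAddrs (0 = driver)
RouterLocalNodeIndex=0
; Router TCP endpoints, one per compute node
RouterNodeAddrs=10.0.0.1:9100,10.0.0.2:9100,10.0.0.3:9100
; Backing store endpoints, one per node
RouterNodeStores=10.0.0.1:2379,10.0.0.2:2379,10.0.0.3:2379
```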
Distributed scale results belong in BENCHMARK_RESULTS_SCALE_COMPUTE.md, not the single-node 10M results file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
DefineRouterParameter was nested inside #ifdef DefineSSDParameter, causing router parameters to be silently ignored. Moved the #endif to the correct position so DefineRouterParameter is at top-level scope. Also removed a debug log line from the DefineRouterParameter macro in Options.h. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implement benchmark-level query distribution: each node independently searches its contiguous partition of the query set, coordinated by barrier files (the same mechanism as insert distribution). This replaces the previous RPC-based approach, eliminating RPC overhead and serial head search.

Key changes in SPFreshTest.cpp:
- BenchmarkQueryPerformance: partition queries across nodes, use barrier files for synchronization, compute QPS = totalQueries / max(wallTime)
- WorkerNode: unified command loop handling both search and insert commands via a shared index directory

Results (10M Float32, 200 queries, TopK=5):
- 1-node: 194 QPS baseline
- 2-node: 404 QPS (2.08x speedup, super-linear due to cache effects)
- 3-node: 488 QPS (2.52x speedup)
- Insert scaling: 1→2→3 nodes = 119→211→314 vec/s (1.77x, 2.64x)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Fix remote lock leak in MergePostings (ExtraDynamicSearcher.h)
- RAII RemoteLockGuard ensures remote lock is released on all exit
paths (continue/return/exception), preventing distributed deadlock
2. Fix buffer overflow in BatchRouteSearch (SPANNIndex.cpp)
- Validate response array sizes before accessing result vectors
- Fall back to local search on size mismatch
3. Fix missing send-failure callback in SendRemoteLock (PostingRouter.h)
- Add failure callback to complete the promise on send error,
matching the pattern used by all other SendPacket call sites
- Prevents 5-second stall on every send failure
4. Normalize atomic operation in SendRemoteLock (PostingRouter.h)
- Change m_nextResourceId++ to fetch_add(1) for consistency
5. Fix uninitialized workerTime in barrier coordination (SPFreshTest.cpp)
- Initialize workerTime and validate ifstream read
- Skip worker timing on parse failure instead of using garbage value
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove BatchRouteSearch, SetFullSearchCallback, GetSearchNodeCount, and all supporting RPC infrastructure (SearchPostingRequest/Response, FullSearchBatchRequest/Response structs, search callbacks, handler methods, and related member variables) from PostingRouter, SPANNIndex, Index, VectorIndex, ExtraDynamicSearcher, SPFresh, and Packet. Tests use barrier-based distributed search exclusively; the RPC-based search routing is dead code. Existing SearchRequest/SearchResponse packet types are preserved as they are used by the pre-existing Aggregator/Client/Server code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Critical fixes:
- Fix integer overflow in DecodeVectorSearchResponse (ExtraTiKVController.h). Prevents OOB reads from corrupt TiKV responses with large numResults.
- Fix m_mergeJobsInFlight counter underflow in MergePostings retry paths (ExtraDynamicSearcher.h). Add an increment before each re-enqueued MergeAsyncJob to match the unconditional decrement in exec().

High fixes:
- Add FlushRemoteAppends after Split reassignment (ExtraDynamicSearcher.h). Ensures queued remote appends are sent after CollectReAssign in Split().
- Fix data race on m_nodeAddrs in ConnectToPeer (PostingRouter.h). Snapshot the address under m_connMutex before the retry loop.
- Fix BroadcastHeadSync reading m_nodeAddrs without a lock (PostingRouter.h). Snapshot the node count under m_connMutex before iterating.

Medium fixes:
- Fix m_storeToNodes race in AddNode: move it inside the m_connMutex scope.
- Fix unvalidated entryCount in HandleHeadSyncRequest with buffer-end tracking to prevent overruns from corrupt packets.
- Add buffer-end tracking in BatchRemoteAppendRequest::Read to catch overruns during per-item deserialization.
- Make m_asyncStatus atomic to fix a race between async jobs and Checkpoint; use exchange() for an atomic read-and-reset in Checkpoint.
- Make the shared ErrorCode ret atomic in the LoadIndex and WriteDownAllPostingToDB parallel loops.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR adds a peer-to-peer distributed “posting router” layer to SPANN over TiKV, enabling routed inserts (by headID ownership) and distributed query benchmarking across multiple compute processes, along with benchmark documentation updates and a small recall-evaluation correctness fix.
Changes:
- Add a TCP RPC-based PostingRouter with consistent-hash routing, batch append RPCs, head sync broadcast, and remote lock RPCs.
- Extend SPANN/VectorIndex APIs and options/config plumbing to enable/inspect router behavior and persist router params.
- Update SPFRESH test/benchmark harness to support multi-process distributed insert/search runs and add benchmark guide/results docs; fix truth-stride bug in recall evaluation.
Reviewed changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| Test/src/TestDataGenerator.cpp | Fix truth nearest-neighbor stride in EvaluateRecall. |
| Test/src/SPFreshTest.cpp | Add router-aware bulk insert path, distributed benchmark barriers, worker-node test case, and new benchmark config knobs. |
| BENCHMARK_RESULTS_SCALE_COMPUTE.md | Add distributed scale benchmark results write-up. |
| BENCHMARK_GUIDE_SCALE_COMPUTE.md | Add detailed guide for running distributed scale benchmarks. |
| AnnService/src/Socket/Connection.cpp | Make Start/Stop endpoint logging resilient to disconnected sockets. |
| AnnService/src/Core/SPANN/SPANNIndex.cpp | Persist router section in config and add internal multi-threaded bulk AddIndex path. |
| AnnService/inc/Socket/Packet.h | Add new packet types for append/batch append/head sync/remote lock RPCs. |
| AnnService/inc/SPFresh/SPFresh.h | Refactor local multi-threaded search block (logic-preserving). |
| AnnService/inc/Core/VectorIndex.h | Add router-related virtual methods (no-op defaults). |
| AnnService/inc/Core/SPANN/PostingRouter.h | Introduce the distributed router implementation and wire protocol structs. |
| AnnService/inc/Core/SPANN/ParameterDefinitionList.h | Add Router parameter definitions. |
| AnnService/inc/Core/SPANN/Options.h | Parse/store Router options from [Router] section. |
| AnnService/inc/Core/SPANN/Index.h | Implement router-related VectorIndex overrides for SPANN index. |
| AnnService/inc/Core/SPANN/IExtraSearcher.h | Add router-related virtual hooks for extra searchers. |
| AnnService/inc/Core/SPANN/ExtraDynamicSearcher.h | Integrate router into dynamic posting operations (routing, head sync, remote locks, remote append batching). |
| AnnService/CMakeLists.txt | Compile Socket sources into core libs to support PostingRouter. |
| .gitignore | Ignore benchmark-generated artifacts (json outputs, logs, generated data). |
Add DispatchCommand/DispatchResult packet types and protocol structs to PostingRouter for driver→worker command dispatch over TCP. The driver (node 0) uses BroadcastDispatchCommand/WaitForAllResults to coordinate search rounds and insert batches across worker nodes. Workers register a DispatchCallback that handles Search/Insert/Stop commands on dedicated threads.

Key changes:
- Packet.h: new DispatchCommand (0x09) and DispatchResult (0x89) types
- PostingRouter.h: dispatch protocol (~370 lines), shutdown-safe with ClearDispatchCallback (waits for in-flight threads), once_flag for Stop, active dispatch thread tracking
- VectorIndex.h / Index.h: GetRouter() virtual for test access
- SPFreshTest.cpp: BenchmarkQueryPerformance uses router dispatch, RunBenchmark sends TCP commands instead of writing barrier files, WorkerNode rewritten from filesystem polling to callback-based

This enables true multi-machine distributed benchmarking without requiring a shared filesystem (NFS) for barrier synchronization.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
New evaluation/distributed/ directory with:
- run_scale_benchmarks.sh: single-machine multi-process benchmark with graceful worker shutdown (TCP Stop + timeout fallback)
- run_distributed.sh: multi-machine orchestrator (deploy, start-tikv, run, stop, cleanup subcommands via SSH/rsync)
- cluster.conf.example: cluster topology config template
- configs/: per-scale per-node INI files and templates
- README.md: architecture, usage, dispatch protocol docs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
WorkerNode::LoadVectorSet for the query set was hardcoded to use TestDataGenerator<float>, which fails when ValueType is UInt8 or Int8 because the DefaultReader expects float-sized elements but the binary file contains 1-byte elements. This caused 'Failed to read VectorSet!' errors and crashed the worker in distributed benchmarks with UInt8 data. Add type dispatch based on ValueType (matching the pattern already used for insert vector loading at line 2546-2557). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Results for SIFT1B UInt8/128d across 1-node, 2-node, and 3-node configs. Includes per-batch insert/QPS tables and cross-scale comparison. Files moved from repo root to evaluation/distributed/results/. JSON output files excluded (data captured in markdown). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PostingRouter now retries failed peer connections in a background thread instead of a one-shot attempt during initialization. This handles the case where workers start before the driver's router is listening.

- Add a background connect thread with exponential backoff (500 ms to 5 s)
- Add WaitForAllPeersConnected() for the driver to block until peers are ready
- Register the DispatchResult handler on the server socket
- Add a destructor to join the background thread on shutdown

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolve conflicts:
- ExtraDynamicSearcher.h: keep router/remote lock logic, adopt the upstream MergeAsyncJob signature (removed disableReassign param)
- SPFreshTest.cpp: keep the buildOnly param (distributed), adopt the upstream numSearchDuringInsertThreads param, keep perNodeBatch for distributed insert

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
These static configs were for single-machine testing with multiple containers sharing the same host. Now superseded by run_distributed.sh which auto-generates per-node configs from templates + cluster.conf. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add 10m template and TiKV tuning config (tikv.toml)
- Update 1m/100m templates: paths from /mnt_ssd/merged to /mnt_ssd/data
- Revert UseMultiChunkPosting to false in 10m template
- Add cluster.conf.example with placeholder IPs for new users
- Gitignore actual cluster.conf files (contain environment-specific IPs)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tentHashRing.h

Extract protocol structs (RemoteAppendRequest, RemoteAppendResponse, RouteTarget, BatchRemoteAppendRequest, BatchRemoteAppendResponse, HeadSyncEntry, DispatchCommand, DispatchResult, RemoteLockRequest, RemoteLockResponse) into DistributedProtocol.h. Extract the ConsistentHashRing class into ConsistentHashRing.h. PostingRouter.h now includes the two new headers. All behavior is preserved; this is a pure move refactor with no logic changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extract driver↔worker dispatch coordination into a standalone DispatchCoordinator class (327 lines). PostingRouter now implements DispatchCoordinator::PeerNetwork interface and delegates all dispatch operations (BroadcastDispatchCommand, WaitForAllResults, Handle*). PostingRouter.h: 1436 → 1204 lines (-232) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extract all internal posting RPC mechanics (Append, BatchAppend, HeadSync, RemoteLock) into a standalone RemotePostingOps class (583 lines). PostingRouter implements the RemotePostingOps::NetworkAccess interface and delegates all Send*/Handle* operations. PostingRouter.h: 1204 → 643 lines (-561).

File breakdown after the full refactoring (originally 1864 lines):

| File | Lines | Role |
|---|---|---|
| PostingRouter.h | 643 | Core routing + connection management |
| RemotePostingOps.h | 583 | Internal posting RPCs |
| DistributedProtocol.h | 374 | Wire protocol structs |
| DispatchCoordinator.h | 327 | External dispatch coordination |
| ConsistentHashRing.h | 84 | Hash ring |

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add m_nodeIndex to DispatchResult (wire protocol v1.1, backward compatible via mirrorVer check). Track pending nodes in PendingDispatch and log the specific node IDs on timeout. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The expression 'data.size() < 12 + numResults * 12' is more readable than the equivalent 'numResults > (data.size() - 12) / 12' and avoids potential unsigned underflow if data.size() < 12. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GetNumNodes, GetLocalNodeIndex, FlushRemoteAppends, and GetRemoteQueueSize were pure pass-throughs polluting VectorIndex with distributed concerns. Callers now use PostingRouter* directly (obtained via GetRouter()) for data-plane operations. Only lifecycle methods that need ExtraDynamicSearcher internals remain on VectorIndex: EnableRouter, AdoptRouter, SetHeadSyncCallback, GetRouter. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The ring is no longer statically built from config. Instead:
- The dispatcher starts with an empty ring
- Workers connect and send NodeRegisterRequest
- The dispatcher adds each worker to the ring and broadcasts RingUpdate
- Workers replace their local ring atomically on receiving an update

New protocol messages: NodeRegisterMsg, RingUpdateMsg
New packet types: NodeRegisterRequest (0x0A), RingUpdate (0x0B)
New config: RouterIsDispatcher (bool, default false)

Workers use WaitForRing() to block until the ring is received. The dispatcher waits for all workers to register before proceeding.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add RingUpdateACK (0x0C) packet type and RingUpdateACKMsg
- Add ring versioning (m_ringVersion) to RingUpdateMsg
- Worker sends an ACK after receiving RingUpdate
- Dispatcher tracks per-worker ACK versions (m_workerAckedVersion)
- Dispatcher retries RingUpdate to unACK'd workers in the bg thread
- Worker retries NodeRegister until the ring is received (replaces the one-shot flag)
- Bg thread exits only when all peers are connected AND the ring is synchronized

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move DBKey/DBKeys back to original location (before multi-chunk helpers) and remove the 'Moved from bottom' comment. Pure reorder adds noise to PR. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Extract a NetworkNode base class into NetworkNode.h (server/client, connections, ring, bg thread)
- Rename PostingRouter.h to WorkerNode.h (routing, remote ops, append queue)
- DispatcherNode.h now includes NetworkNode.h
- ExtraDynamicSearcher holds a non-owning WorkerNode* via SetRouter()
- Remove 4 routing virtual methods from the VectorIndex/Index/IExtraSearcher chain
- SPFreshTest creates DispatcherNode + WorkerNode directly
- Update all PostingRouter references in comments
…fixes

- Rename the [Router] config section to [Distributed] with proper fields: DispatcherAddr, WorkerAddrs, StoreAddrs, PDAddrs
- Merge WorkerNode into the BenchmarkFromConfig test driver
- Add per-worker TiKV PD address support (each worker uses its local PD)
- Fix DispatchCoordinator to skip the local worker in broadcasts
- Fix SO_REUSEADDR in Server.cpp for port TIME_WAIT reuse
- Skip distributed setup when BuildOnly=true
- Add 1M/10M/100M x 1-node/2-node benchmark configs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… TiKV

- Rewrite run_distributed.sh with a 'bench' command for one-click benchmarks:
  * bench cluster.conf 1m 10m 100m runs 1-node + N-node per scale
  * Independent TiKV per node (standalone PD, max-replicas=1)
  * generate_ini fills the [Distributed] section from cluster.conf topology
  * SSH key support via the cluster.conf ssh_key field
- Remove run_scale_benchmarks.sh (replaced by run_distributed.sh bench)
- Remove hardcoded *_1node.ini/*_2node.ini (now generated from templates)
- Add 100m template; fix templates to use the [Distributed] format
- Move cluster.conf.example to configs/
- Add bulk insert progress logging in SPANNIndex::AddIndex
- Move benchmark docs to evaluation/distributed/results/

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… WorkerNode

Worker mode is triggered by WORKER_INDEX>0 inside BenchmarkFromConfig, not a separate test case. Also relax the ready-check grep pattern.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…workers

Workers need to connect to the dispatcher for ring registration. The old order (workers first, then driver) caused a deadlock: workers timed out waiting for a dispatcher that hadn't started yet. New flow: start the driver in the background → wait for the dispatcher port → start workers → wait for the driver to complete.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move 7 new distributed-specific headers into a Distributed/ subdirectory to separate networking/dispatch concerns from core SPANN index algorithm: - ConsistentHashRing.h, DispatchCoordinator.h, DispatcherNode.h - DistributedProtocol.h, NetworkNode.h, RemotePostingOps.h, WorkerNode.h Only moves files added by our PR (vs upstream main). Original SPANN files (Index.h, ExtraDynamicSearcher.h, storage controllers, etc.) stay in place. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove 'distributed' skip condition in BenchmarkQueryPerformance - Set TruthPath=/mnt_ssd/data/sift1b/bigann_truth.bin in templates - Remove stale cluster.conf.example and benchmark.ini from repo root Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ExtraRocksDBController.h: tuned RocksDB 8.11.3 with optimized params (max_subcompactions=4, LZ4/ZSTD compression, kMinOverlappingRatio, partitioned index/filters, ioprio, separate blob cache for postings)
- Made 5 high-impact RocksDB params INI-configurable: RocksDBBlockCacheMB, RocksDBBlobCacheMB, RocksDBMaxSubCompactions, RocksDBAsyncIO, RocksDBLowPriorityCompaction
- CMakeLists.txt: added Folly linkage for RocksDB coroutine MultiGet
- ExtraDynamicSearcher.h: weak attribute on RocksDbIOUringEnable
- SPANNIndex.cpp: fixed include path, pass tuning params to RocksDBIO
- evaluation/backend_comparison/: 4 INI configs (TiKV/RocksDB x L1/L2) and a runner script with auto TiKV deployment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds peer-to-peer distributed insert & search to SPANN over TiKV. Each compute node owns a hash-range of posting lists; vectors are routed to the correct owner at insert time, queries are round-robin load-balanced across nodes at search time. No central coordinator — nodes communicate via custom TCP RPCs.
Results
Float32, 64d, 10M scale (2× Xeon Gold 6530, 128GB DDR5, 1× NVMe SSD)
Near-linear insert and query scaling. Query latency decreases with more nodes (CPU parallelism + reduced gRPC contention on shared TiKV).
Design
Three ideas:
Route by headID ownership. A consistent hash ring (FNV-1a, 150 vnodes/node) maps each headID to one owner node. Insert, Split, and Merge operations on a posting list go to its owner. Queries are round-robin load-balanced (each node has the full head index and can access all posting lists via shared TiKV).
Broadcast head changes. When Split/Merge creates or removes a head, the owner broadcasts a HeadSync message to all peers so they update their local BKT index. Fire-and-forget, no response needed.

Batch RPCs. QueueRemoteAppend accumulates appends in memory; FlushRemoteAppends sends one batch per target node after each insert round. Cross-node Merge uses a synchronous remote lock RPC per headID.

Known Limitations
Future Work
Directed query routing — BatchRouteSearch currently sends each query to one random node for full search. It should route to the nodes owning the relevant posting lists (scatter-gather). This is the prerequisite for multi-machine query scaling.

Dynamic node discovery — InitializeRouter reads a static address list from the INI config. For elastic scaling, nodes need a service-discovery mechanism (e.g. etcd, ZooKeeper, or gossip) so new nodes can join and be discovered automatically.

Fault detection & failover — no heartbeat or health check exists. If a node crashes, queries routed to it simply fail (with fallback to local search). A proper failure detector plus replica promotion or a re-routing strategy is needed.

Dynamic rebalancing — AddNode/RemoveNode/ComputeMigration APIs exist on the consistent hash ring, but there is no online migration path that moves posting list data between nodes while serving traffic.

Replace file-based barriers — the filesystem barrier (start_batch_N/done_N_batch_N polling) only works when all nodes share a filesystem. Multi-machine coordination needs a network-based barrier (e.g. RPC-based or via the TiKV cluster itself).

Bug Fixes
One correctness bug was discovered and fixed during benchmark validation:

1. EvaluateRecall stride (TestDataGenerator.cpp)

Truth NN access used stride=1 (truth->GetData() + i) instead of stride=K (truth->GetVector(i)). For K=5, query 0 was correct by luck (offset=0), but queries 1+ read the wrong truth entries. The distance-matching fallback partially compensated, making the numbers approximately correct but not exact.