
[Feature] Add Distributed Posting Router for SPANN#446

Draft
TerrenceZhangX wants to merge 53 commits into microsoft:users/qiazh/merge-spfresh-tikv from TerrenceZhangX:users/zhangt/merge-distributed-to-tikv

Conversation


@TerrenceZhangX TerrenceZhangX commented Apr 11, 2026

Adds peer-to-peer distributed insert & search to SPANN over TiKV. Each compute node owns a hash-range of posting lists; vectors are routed to the correct owner at insert time, queries are round-robin load-balanced across nodes at search time. No central coordinator — nodes communicate via custom TCP RPCs.

Results

Float32, 64d, 10M scale (2× Xeon Gold 6530, 128GB DDR5, 1× NVMe SSD)

| Metric | 1-node | 2-node | 3-node |
|---|---|---|---|
| Insert throughput | 94 vec/s | 200 vec/s (2.13×) | 271 vec/s (2.89×) |
| Query QPS | 194 | 404 (2.08×) | 488 (2.52×) |

Near-linear insert and query scaling. Query latency decreases with more nodes (CPU parallelism + reduced gRPC contention on shared TiKV).

Design

Three ideas:

  1. Route by headID ownership. A consistent hash ring (FNV-1a, 150 vnodes/node) maps each headID to one owner node. Insert, Split, and Merge operations on a posting list go to its owner. Queries are round-robin load-balanced (each node has the full head index and can access all posting lists via shared TiKV).

  2. Broadcast head changes. When Split/Merge creates or removes a head, the owner broadcasts a HeadSync message to all peers so they update their local BKT index. Fire-and-forget, no response needed.

  3. Batch RPCs. QueueRemoteAppend accumulates appends in memory; FlushRemoteAppends sends one batch per target node after each insert round. Cross-node Merge uses a synchronous remote lock RPC per headID.

Known Limitations

  • Cross-config recall varies due to BKT build non-determinism (different random trees produce different head selection and posting assignment). Within each config, recall is stable across all insert batches and matches between local and remote search.
  • Single-machine benchmark only: all "nodes" are processes on the same box sharing the same TiKV cluster. Query latency scaling comes from CPU parallelism and reduced gRPC contention, not I/O locality. True multi-machine testing is future work.

Future Work

  1. Directed query routing — BatchRouteSearch currently sends each query to one random node for full search. It should instead route each query to the nodes owning the relevant posting lists (scatter-gather). This is the prerequisite for multi-machine query scaling.

  2. Dynamic node discovery — InitializeRouter reads a static address list from the INI config. For elastic scaling, nodes need a service-discovery mechanism (e.g. etcd, ZooKeeper, or gossip) so new nodes can join and be discovered automatically.

  3. Fault detection & failover — no heartbeat or health-check exists. If a node crashes, queries routed to it simply fail (with fallback to local search). A proper failure detector + replica promotion or re-routing strategy is needed.

  4. Dynamic rebalancing — AddNode/RemoveNode/ComputeMigration APIs exist on the consistent hash ring, but there is no online migration path that moves posting list data between nodes while serving traffic.

  5. Replace file-based barriers — the filesystem barrier (start_batch_N / done_N_batch_N polling) only works when all nodes share a filesystem. Multi-machine coordination needs a network-based barrier (e.g. RPC-based or via the TiKV cluster itself).

Bug Fixes

One correctness bug was discovered and fixed during benchmark validation:

1. EvaluateRecall stride (TestDataGenerator.cpp)

Truth NN access used stride=1 (truth->GetData() + i) instead of stride=K (truth->GetVector(i)). For K=5, query 0 was correct by luck (offset=0) but query 1+ read wrong truth entries. Distance-matching fallback partially compensated, making numbers approximately correct but not exact.

TerrenceZhangX and others added 10 commits April 10, 2026 08:08
…c, benchmarks

Core routing (PostingRouter.h):
- Hash routing: GetOwner uses headID % NumNodes for deterministic assignment
- RemoteLock RPC for cross-node Merge serialization (try_lock + retry)
- BatchAppend, HeadSync, InsertBatch packet types and handlers
- TCP-based server/client for inter-node communication

ExtraDynamicSearcher.h integration:
- EnableRouter/AdoptRouter for index lifecycle management
- Split: BroadcastHeadSync after creating/deleting heads
- MergePostings: Cross-node lock for neighbor headID on different node
- MergePostings: BroadcastHeadSync for deleted head after merge
- Reassign: Route Append to owner node + FlushRemoteAppends
- AddIndex: Route appends to owner node via QueueRemoteAppend
- SetHeadSyncCallback: Wire up HeadSync + RemoteLock callbacks

Infrastructure:
- IExtraSearcher/Index/VectorIndex: Add routing virtual method chain
- Options/ParameterDefinitionList: RouterEnabled, RouterLocalNodeIndex,
  RouterNodeAddrs, RouterNodeStores config params
- CMakeLists: Link Socket sources and Boost into SPTAGLibStatic
- Connection.cpp: Safe remote_endpoint() with error_code (no throw)
- Packet.h: Append, BatchAppend, InsertBatch, HeadSync, RemoteLock types
- SPFreshTest.cpp: ApplyRouterParams, FlushRemoteAppends, WorkerNode test
- Benchmark configs: 100k/1m/10m x 1/2/3 node
- run_scale_benchmarks.sh: Automated benchmark runner
- docker/tikv: TiKV cluster docker-compose + config

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix index dir creation: create parent dir only (not spann_index subdir)
  so the build code creates spann_index and the safety check passes
- Clear checkpoint after build phase so driver re-runs all insert batches
- Add VectorIndex::AdoptRouter to transfer router between batch clones
  instead of creating new TCP server per batch (port conflict fix)
- Fix ExtraDynamicSearcher::AdoptRouter to override IExtraSearcher interface

100k results (routing works, ~50/50 local/remote split):
  1-node steady-state: 90.2 vps
  2-node steady-state: 80.5 vps
  3-node steady-state: 77.7 vps
No scaling at 100k due to small batch size (100 vectors) and shared TiKV.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
12 slides covering: problem statement, solution architecture, full flow
comparison (single vs distributed), hash routing, append write path,
split/merge/reassign routing, HeadSync broadcast, 3-node sequence
diagram, design decisions, network protocol, config, and 100k results.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix ini thread params: replace NumThreads with NumSearchThreads/NumInsertThreads
- Add 100M benchmark ini files (1-node, 2-node, 3-node)
- Update data paths to /mnt/data_disk/sift1b in all ini files
- Add BENCHMARK_GUIDE_SCALE_COMPUTE.md (English)
- Add BENCHMARK_RESULTS_SCALE_COMPUTE.md with 100K/1M/10M results
- Update docker-compose and tikv.toml for 3-PD/3-TiKV cluster
- Update run_scale_benchmarks.sh with multi-scale orchestration
- Add .gitignore entries for generated benchmark artifacts
- Fix FullSearch routing for multi-node search (per-node build)
- Update 10M benchmark: insert throughput 2-node 1.65x, 3-node 1.98x
- Search latency 10M: 2-node -35%, 3-node -50% vs 1-node
- Near-linear insert scaling across all data sizes (100K, 1M, 10M)
- Update benchmark configs, test harness, and scale benchmark script
- Delete all 24 benchmark INI files from Test/
- Replace section 5 in BENCHMARK_GUIDE_SCALE_COMPUTE.md with complete
  deterministic generation rules (scale table, topology rules, template,
  Python generator script)
- INI files can be regenerated on demand via the guide
- Embed full docker-compose.yml and tikv.toml contents in
  BENCHMARK_GUIDE_SCALE_COMPUTE.md section 4.1
- Remove docker/tikv/ from git tracking (files stay on disk)
- Use <NVME_DIR> placeholder instead of absolute paths

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bug fixes:
- SPANNIndex.cpp: Remove redundant SortResult() in FullSearchCallback that
  corrupted remote search results (heapsort on already-sorted data)
- TestDataGenerator.cpp: Fix EvaluateRecall truth NN stride from 1 to K

Feature:
- SPFreshTest.cpp: Add BuildOnly parameter to skip insert batches

Benchmark results (Float32/dim64, 10M scale):
- 1-node: 93.8 vps, 2-node: 200.0 vps (2.13x), 3-node: 271.4 vps (2.89x)
- Recall stable within each config after double-sort fix

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@TerrenceZhangX TerrenceZhangX changed the title Add disributed router to SPFresh [Feature] Add Distributed Posting Router for SPANN Apr 13, 2026
TerrenceZhangX and others added 15 commits April 12, 2026 19:26
Measure RPC round-trip time (sendTime → future.get()) and assign
per-query latency for remote search results. Previously p50/min
latency showed 0 for remote queries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Revert TiKVVersionMap node-scoped prefix optimization (vc:{nodeIndex}:...)
that broke cross-node version check correctness. Version map now uses
shared namespace (vc:{layer}:...) so all nodes can read/write the same
version data. Node-scoped optimization deferred to future branch.

Also add explanatory comments for AdoptRouter and HeadSync broadcast.
Restore commented-out debug log lines.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
These were only used by the reverted node-scoped version map.
LocalToGlobalVID is kept (used in insert path for VID uniqueness).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
InsertVectors now uses dual path: per-vector multi-threaded insert
for single-node (original behavior), bulk AddIndex for router-enabled
multi-node (amortizes RPC overhead via batched remote appends).

Also add comment on AddIndex explaining caller-side shard partitioning
and LocalToGlobalVID purpose.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restore the original vidIdx counter loop for post-heap version
filtering instead of the candidateIndices array approach. Both are
functionally equivalent but the original pattern is simpler.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Restore 'For quantized index, pass GetFeatureDim()' comment
- Remove else { func(); } branch that was not in the original code

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use ConvertToString(valueType) and INI parameters to build the
perftest_* filenames, matching TestDataGenerator::RunLargeBatches
convention. Moves filename construction out of the per-batch loop.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Document the 3-file INI pattern for multi-node benchmarks:
- _build.ini: Rebuild=true, no Router (build phase)
- _driver.ini: Rebuild=false, Router enabled (driver/n0)
- _n{i}.ini: worker nodes (n1, n2, ...)

Update Python generator and shell script to match. Remove the
sed-based approach of patching _n0.ini at runtime.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move Router parameters (RouterEnabled, RouterLocalNodeIndex,
RouterNodeAddrs, RouterNodeStores) from DefineSSDParameter to
DefineRouterParameter with its own [Router] section in:
- ParameterDefinitionList.h: new DefineRouterParameter macro
- Options.h: SetParameter/GetParameter handle 'Router' section
- SPANNIndex.cpp: SaveConfig outputs [Router] block
- SPFreshTest.cpp: read [Router] INI section, ApplyRouterParams
  uses 'Router' section
- Benchmark guide: updated INI template and Python generator

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Distributed scale results belong in BENCHMARK_RESULTS_SCALE_COMPUTE.md,
not the single-node 10M results file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
DefineRouterParameter was nested inside #ifdef DefineSSDParameter, causing
router parameters to be silently ignored. Moved the #endif to the correct
position so DefineRouterParameter is at top-level scope. Also removed a
debug log line from the DefineRouterParameter macro in Options.h.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implement benchmark-level query distribution where each node independently
searches its contiguous partition of the query set, coordinated by barrier
files (same mechanism as insert distribution). This replaces the previous
RPC-based approach, eliminating RPC overhead and serial head search.

Key changes in SPFreshTest.cpp:
- BenchmarkQueryPerformance: partition queries across nodes, use barrier
  files for synchronization, compute QPS = totalQueries / max(wallTime)
- WorkerNode: unified command loop handling both search and insert commands
  via shared index directory

Results (10M Float32, 200 queries, TopK=5):
- 1-node: 194 QPS baseline
- 2-node: 404 QPS (2.08x speedup, super-linear due to cache effects)
- 3-node: 488 QPS (2.52x speedup)
- Insert scaling: 1→2→3 node = 119→211→314 vec/s (1.77x, 2.64x)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Fix remote lock leak in MergePostings (ExtraDynamicSearcher.h)
   - RAII RemoteLockGuard ensures remote lock is released on all exit
     paths (continue/return/exception), preventing distributed deadlock

2. Fix buffer overflow in BatchRouteSearch (SPANNIndex.cpp)
   - Validate response array sizes before accessing result vectors
   - Fall back to local search on size mismatch

3. Fix missing send-failure callback in SendRemoteLock (PostingRouter.h)
   - Add failure callback to complete the promise on send error,
     matching the pattern used by all other SendPacket call sites
   - Prevents 5-second stall on every send failure

4. Normalize atomic operation in SendRemoteLock (PostingRouter.h)
   - Change m_nextResourceId++ to fetch_add(1) for consistency

5. Fix uninitialized workerTime in barrier coordination (SPFreshTest.cpp)
   - Initialize workerTime and validate ifstream read
   - Skip worker timing on parse failure instead of using garbage value

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove BatchRouteSearch, SetFullSearchCallback, GetSearchNodeCount, and all
supporting RPC infrastructure (SearchPostingRequest/Response,
FullSearchBatchRequest/Response structs, search callbacks, handler methods,
and related member variables) from PostingRouter, SPANNIndex, Index,
VectorIndex, ExtraDynamicSearcher, SPFresh, and Packet.

Tests use barrier-based distributed search exclusively; the RPC-based search
routing is dead code. Existing SearchRequest/SearchResponse packet types are
preserved as they are used by the pre-existing Aggregator/Client/Server code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Critical fixes:
- Fix integer overflow in DecodeVectorSearchResponse (ExtraTiKVController.h)
  Prevents OOB reads from corrupt TiKV responses with large numResults.
- Fix m_mergeJobsInFlight counter underflow in MergePostings retry paths
  (ExtraDynamicSearcher.h) Add increment before re-enqueued MergeAsyncJob
  to match the unconditional decrement in exec().

High fixes:
- Add FlushRemoteAppends after Split reassignment (ExtraDynamicSearcher.h)
  Ensures queued remote appends are sent after CollectReAssign in Split().
- Fix data race on m_nodeAddrs in ConnectToPeer (PostingRouter.h)
  Snapshot address under m_connMutex before retry loop.
- Fix BroadcastHeadSync reading m_nodeAddrs without lock (PostingRouter.h)
  Snapshot node count under m_connMutex before iterating.

Medium fixes:
- Fix m_storeToNodes race in AddNode - move inside m_connMutex scope.
- Fix unvalidated entryCount in HandleHeadSyncRequest with buffer-end
  tracking to prevent overruns from corrupt packets.
- Add buffer-end tracking in BatchRemoteAppendRequest::Read to catch
  overruns during per-item deserialization.
- Make m_asyncStatus atomic to fix race between async jobs and Checkpoint.
  Use exchange() for atomic read-and-reset in Checkpoint.
- Make shared ErrorCode ret atomic in LoadIndex and WriteDownAllPostingToDB
  parallel loops.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment


Pull request overview

This PR adds a peer-to-peer distributed “posting router” layer to SPANN over TiKV, enabling routed inserts (by headID ownership) and distributed query benchmarking across multiple compute processes, along with benchmark documentation updates and a small recall-evaluation correctness fix.

Changes:

  • Add a TCP RPC-based PostingRouter with consistent-hash routing, batch append RPCs, head sync broadcast, and remote lock RPCs.
  • Extend SPANN/VectorIndex APIs and options/config plumbing to enable/inspect router behavior and persist router params.
  • Update SPFRESH test/benchmark harness to support multi-process distributed insert/search runs and add benchmark guide/results docs; fix truth-stride bug in recall evaluation.

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| Test/src/TestDataGenerator.cpp | Fix truth nearest-neighbor stride in EvaluateRecall. |
| Test/src/SPFreshTest.cpp | Add router-aware bulk insert path, distributed benchmark barriers, worker-node test case, and new benchmark config knobs. |
| BENCHMARK_RESULTS_SCALE_COMPUTE.md | Add distributed scale benchmark results write-up. |
| BENCHMARK_GUIDE_SCALE_COMPUTE.md | Add detailed guide for running distributed scale benchmarks. |
| AnnService/src/Socket/Connection.cpp | Make Start/Stop endpoint logging resilient to disconnected sockets. |
| AnnService/src/Core/SPANN/SPANNIndex.cpp | Persist router section in config and add internal multi-threaded bulk AddIndex path. |
| AnnService/inc/Socket/Packet.h | Add new packet types for append/batch append/head sync/remote lock RPCs. |
| AnnService/inc/SPFresh/SPFresh.h | Refactor local multi-threaded search block (logic-preserving). |
| AnnService/inc/Core/VectorIndex.h | Add router-related virtual methods (no-op defaults). |
| AnnService/inc/Core/SPANN/PostingRouter.h | Introduce the distributed router implementation and wire protocol structs. |
| AnnService/inc/Core/SPANN/ParameterDefinitionList.h | Add Router parameter definitions. |
| AnnService/inc/Core/SPANN/Options.h | Parse/store Router options from [Router] section. |
| AnnService/inc/Core/SPANN/Index.h | Implement router-related VectorIndex overrides for SPANN index. |
| AnnService/inc/Core/SPANN/IExtraSearcher.h | Add router-related virtual hooks for extra searchers. |
| AnnService/inc/Core/SPANN/ExtraDynamicSearcher.h | Integrate router into dynamic posting operations (routing, head sync, remote locks, remote append batching). |
| AnnService/CMakeLists.txt | Compile Socket sources into core libs to support PostingRouter. |
| .gitignore | Ignore benchmark-generated artifacts (json outputs, logs, generated data). |


Comment thread Test/src/SPFreshTest.cpp Outdated
Comment thread Test/src/SPFreshTest.cpp
Comment thread AnnService/inc/Core/SPANN/PostingRouter.h Outdated
Comment thread AnnService/inc/Core/SPANN/ExtraDynamicSearcher.h Outdated
Comment thread AnnService/CMakeLists.txt
Comment thread AnnService/src/Socket/Connection.cpp
@TerrenceZhangX TerrenceZhangX marked this pull request as ready for review April 17, 2026 03:32
@TerrenceZhangX TerrenceZhangX marked this pull request as draft April 17, 2026 05:09
TerrenceZhangX and others added 26 commits April 16, 2026 22:49
Add DispatchCommand/DispatchResult packet types and protocol structs
to PostingRouter for driver→worker command dispatch over TCP.

Driver (node 0) uses BroadcastDispatchCommand/WaitForAllResults to
coordinate search rounds and insert batches across worker nodes.
Workers register a DispatchCallback that handles Search/Insert/Stop
commands on dedicated threads.

Key changes:
- Packet.h: new DispatchCommand (0x09) and DispatchResult (0x89) types
- PostingRouter.h: dispatch protocol (~370 lines), shutdown-safe with
  ClearDispatchCallback (waits for in-flight threads), once_flag for
  Stop, active dispatch thread tracking
- VectorIndex.h / Index.h: GetRouter() virtual for test access
- SPFreshTest.cpp: BenchmarkQueryPerformance uses router dispatch,
  RunBenchmark sends TCP commands instead of writing barrier files,
  WorkerNode rewritten from filesystem polling to callback-based

This enables true multi-machine distributed benchmarking without
requiring a shared filesystem (NFS) for barrier synchronization.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
New evaluation/distributed/ directory with:
- run_scale_benchmarks.sh: Single-machine multi-process benchmark
  with graceful worker shutdown (TCP Stop + timeout fallback)
- run_distributed.sh: Multi-machine orchestrator (deploy, start-tikv,
  run, stop, cleanup subcommands via SSH/rsync)
- cluster.conf.example: Cluster topology config template
- configs/: Per-scale per-node INI files and templates
- README.md: Architecture, usage, dispatch protocol docs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
WorkerNode::LoadVectorSet for the query set was hardcoded to use
TestDataGenerator<float>, which fails when ValueType is UInt8 or Int8
because the DefaultReader expects float-sized elements but the binary
file contains 1-byte elements. This caused 'Failed to read VectorSet!'
errors and crashed the worker in distributed benchmarks with UInt8 data.

Add type dispatch based on ValueType (matching the pattern already used
for insert vector loading at line 2546-2557).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Results for SIFT1B UInt8/128d across 1-node, 2-node, and 3-node configs.
Includes per-batch insert/QPS tables and cross-scale comparison.

Files moved from repo root to evaluation/distributed/results/.
JSON output files excluded (data captured in markdown).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PostingRouter now retries failed peer connections in a background thread
instead of a one-shot attempt during initialization. This handles the
case where workers start before the driver's router is listening.

- Add background connect thread with exponential backoff (500ms-5s)
- Add WaitForAllPeersConnected() for driver to block until peers ready
- Register DispatchResult handler on server socket
- Add destructor to join background thread on shutdown

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolve conflicts:
- ExtraDynamicSearcher.h: Keep router/remote lock logic, adopt upstream
  MergeAsyncJob signature (removed disableReassign param)
- SPFreshTest.cpp: Keep buildOnly param (distributed), adopt upstream
  numSearchDuringInsertThreads param, keep perNodeBatch for distributed insert

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
These static configs were for single-machine testing with multiple
containers sharing the same host. Now superseded by run_distributed.sh
which auto-generates per-node configs from templates + cluster.conf.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add 10m template and TiKV tuning config (tikv.toml)
- Update 1m/100m templates: paths from /mnt_ssd/merged to /mnt_ssd/data
- Revert UseMultiChunkPosting to false in 10m template
- Add cluster.conf.example with placeholder IPs for new users
- Gitignore actual cluster.conf files (contain environment-specific IPs)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tentHashRing.h

Extract protocol structs (RemoteAppendRequest, RemoteAppendResponse, RouteTarget,
BatchRemoteAppendRequest, BatchRemoteAppendResponse, HeadSyncEntry, DispatchCommand,
DispatchResult, RemoteLockRequest, RemoteLockResponse) into DistributedProtocol.h.

Extract ConsistentHashRing class into ConsistentHashRing.h.

PostingRouter.h now includes the two new headers. All behavior is preserved;
this is a pure move refactor with no logic changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extract driver↔worker dispatch coordination into a standalone
DispatchCoordinator class (327 lines). PostingRouter now implements
DispatchCoordinator::PeerNetwork interface and delegates all dispatch
operations (BroadcastDispatchCommand, WaitForAllResults, Handle*).

PostingRouter.h: 1436 → 1204 lines (-232)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extract all internal posting RPC mechanics (Append, BatchAppend,
HeadSync, RemoteLock) into standalone RemotePostingOps class (583 lines).
PostingRouter implements RemotePostingOps::NetworkAccess interface and
delegates all Send*/Handle* operations.

PostingRouter.h: 1204 → 643 lines (-561)

File breakdown after full refactoring (originally 1864 lines):
  PostingRouter.h          643  Core routing + connection management
  RemotePostingOps.h       583  Internal posting RPCs
  DistributedProtocol.h    374  Wire protocol structs
  DispatchCoordinator.h    327  External dispatch coordination
  ConsistentHashRing.h      84  Hash ring

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add m_nodeIndex to DispatchResult (wire protocol v1.1, backward
compatible via mirrorVer check). Track pending nodes in PendingDispatch
and log the specific node IDs on timeout.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The expression 'data.size() < 12 + numResults * 12' is more readable
than the equivalent 'numResults > (data.size() - 12) / 12' and avoids
potential unsigned underflow if data.size() < 12.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GetNumNodes, GetLocalNodeIndex, FlushRemoteAppends, and
GetRemoteQueueSize were pure pass-throughs polluting VectorIndex
with distributed concerns.  Callers now use PostingRouter* directly
(obtained via GetRouter()) for data-plane operations.

Only lifecycle methods that need ExtraDynamicSearcher internals
remain on VectorIndex: EnableRouter, AdoptRouter,
SetHeadSyncCallback, GetRouter.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Ring is no longer statically built from config. Instead:
- Dispatcher starts with empty ring
- Workers connect and send NodeRegisterRequest
- Dispatcher adds each worker to ring, broadcasts RingUpdate
- Workers replace local ring atomically on receiving update

New protocol messages: NodeRegisterMsg, RingUpdateMsg
New packet types: NodeRegisterRequest (0x0A), RingUpdate (0x0B)
New config: RouterIsDispatcher (bool, default false)

Workers use WaitForRing() to block until ring is received.
Dispatcher waits for all workers to register before proceeding.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add RingUpdateACK (0x0C) packet type and RingUpdateACKMsg
- Add ring versioning (m_ringVersion) to RingUpdateMsg
- Worker sends ACK after receiving RingUpdate
- Dispatcher tracks per-worker ACK versions (m_workerAckedVersion)
- Dispatcher retries RingUpdate to unACK'd workers in bg thread
- Worker retries NodeRegister until ring is received (replaces one-shot flag)
- Bg thread exits only when all connected AND ring synchronized

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move DBKey/DBKeys back to original location (before multi-chunk helpers)
and remove the 'Moved from bottom' comment. Pure reorder adds noise to PR.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Extract NetworkNode base class into NetworkNode.h (server/client, connections, ring, bg thread)
- Rename PostingRouter.h to WorkerNode.h (routing, remote ops, append queue)
- DispatcherNode.h now includes NetworkNode.h
- ExtraDynamicSearcher holds non-owning WorkerNode* via SetRouter()
- Remove 4 routing virtual methods from VectorIndex/Index/IExtraSearcher chain
- SPFreshTest creates DispatcherNode + WorkerNode directly
- Update all PostingRouter references in comments

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…fixes

- Rename [Router] config section to [Distributed] with proper fields:
  DispatcherAddr, WorkerAddrs, StoreAddrs, PDAddrs
- Merge WorkerNode into BenchmarkFromConfig test driver
- Add per-worker TiKV PD address support (each worker uses local PD)
- Fix DispatchCoordinator to skip local worker in broadcasts
- Fix SO_REUSEADDR in Server.cpp for port TIME_WAIT reuse
- Skip distributed setup when BuildOnly=true
- Add 1M/10M/100M x 1-node/2-node benchmark configs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… TiKV

- Rewrite run_distributed.sh with 'bench' command for one-click benchmarks:
  * bench cluster.conf 1m 10m 100m — runs 1-node + N-node per scale
  * Independent TiKV per node (standalone PD, max-replicas=1)
  * generate_ini fills [Distributed] section from cluster.conf topology
  * SSH key support via cluster.conf ssh_key field
- Remove run_scale_benchmarks.sh (replaced by run_distributed.sh bench)
- Remove hardcoded *_1node.ini/*_2node.ini (now generated from templates)
- Add 100m template, fix templates to use [Distributed] format
- Move cluster.conf.example to configs/
- Add bulk insert progress logging in SPANNIndex::AddIndex
- Move benchmark docs to evaluation/distributed/results/

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… WorkerNode

Worker mode is triggered by WORKER_INDEX>0 inside BenchmarkFromConfig,
not a separate test case. Also relax ready-check grep pattern.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…workers

Workers need to connect to the dispatcher for ring registration. The old
order (workers first, then driver) caused a deadlock — workers timed out
waiting for a dispatcher that hadn't started yet.

New flow: start driver in background → wait for dispatcher port → start
workers → wait for driver to complete.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move 7 new distributed-specific headers into a Distributed/ subdirectory
to separate networking/dispatch concerns from core SPANN index algorithm:
- ConsistentHashRing.h, DispatchCoordinator.h, DispatcherNode.h
- DistributedProtocol.h, NetworkNode.h, RemotePostingOps.h, WorkerNode.h

Only moves files added by our PR (vs upstream main). Original SPANN files
(Index.h, ExtraDynamicSearcher.h, storage controllers, etc.) stay in place.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove 'distributed' skip condition in BenchmarkQueryPerformance
- Set TruthPath=/mnt_ssd/data/sift1b/bigann_truth.bin in templates
- Remove stale cluster.conf.example and benchmark.ini from repo root

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ExtraRocksDBController.h: Tuned RocksDB 8.11.3 with optimal params
  (max_subcompactions=4, LZ4/ZSTD compression, kMinOverlappingRatio,
  partitioned index/filters, ioprio, separate blob cache for postings)
- Made 5 high-impact RocksDB params INI-configurable:
  RocksDBBlockCacheMB, RocksDBBlobCacheMB, RocksDBMaxSubCompactions,
  RocksDBAsyncIO, RocksDBLowPriorityCompaction
- CMakeLists.txt: Added Folly linkage for RocksDB coroutine MultiGet
- ExtraDynamicSearcher.h: weak attribute on RocksDbIOUringEnable
- SPANNIndex.cpp: Fixed include path, pass tuning params to RocksDBIO
- evaluation/backend_comparison/: 4 INI configs (TiKV/RocksDB x L1/L2)
  and runner script with auto TiKV deployment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
