fix(cassandra): auto-recover session after Cassandra restart#2997
dpol1 wants to merge 1 commit into apache:master
Conversation
- Register ExponentialReconnectionPolicy on the Cluster builder so the
Datastax driver keeps retrying downed nodes in the background.
- Wrap every Session.execute() in executeWithRetry() with exponential
backoff on transient connectivity failures.
- Implement reconnectIfNeeded()/reset() so the pool reopens closed
sessions and issues a lightweight health-check (SELECT now() FROM
system.local) before subsequent queries.
- Add tunable options: cassandra.reconnect_base_delay,
cassandra.reconnect_max_delay, cassandra.reconnect_max_retries,
cassandra.reconnect_interval.
- Add unit tests covering defaults, overrides, disabling retries and
option keys.
Fixes apache#2740
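The retry-with-backoff loop described above can be sketched generically; the helper below is a hypothetical simplification (the real code operates on `Statement`/`ResultSet` and catches driver-specific exceptions, not plain `Exception`):

```java
import java.util.concurrent.Callable;

public class RetrySketch {

    // Hypothetical generic analogue of executeWithRetry(): retry a query
    // with capped exponential backoff, preserving the last error as cause.
    static <T> T executeWithRetry(Callable<T> query, int maxRetries,
                                  long baseDelayMs, long maxDelayMs)
                                  throws Exception {
        Exception lastError = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return query.call();
            } catch (Exception e) {
                lastError = e;
                if (attempt == maxRetries) {
                    break; // retries exhausted
                }
                // Double the delay each attempt, capped at maxDelayMs
                long delay = Math.min(baseDelayMs << Math.min(attempt, 30),
                                      maxDelayMs);
                Thread.sleep(delay);
            }
        }
        throw new Exception("Failed after " + maxRetries + " retries",
                            lastError);
    }
}
```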
Force-pushed from 97de8e9 to fc3d291
```java
 * See issue #2740.
 */
private ResultSet executeWithRetry(Statement statement) {
    int retries = CassandraSessionPool.this.maxRetries;
```
lastError is passed as the first vararg instead of as the cause
The constructor signature being called here is BackendException(String message, Object... args), which means lastError is used as a format argument (%s), not as the exception cause. The original exception's stack trace is lost entirely.
The format string has two %s placeholders but the args are (lastError, retries, message) — three args. The first %s will be filled by lastError.toString() (the exception object), the second %s by retries (an int). The message string becomes a dangling unused arg.
To preserve the cause chain and format the message correctly, use the (String, Throwable, Object...) constructor:
```java
int retries = CassandraSessionPool.this.maxRetries;
throw new BackendException("Failed to execute Cassandra query " +
                           "after %s retries: %s",
                           lastError, retries,
                           lastError == null ? "<null>" :
                           lastError.getMessage());
```
Should be:

```java
throw new BackendException(
        "Failed to execute Cassandra query after %s retries: %s",
        lastError, retries,
        lastError == null ? "<null>" : lastError.getMessage());
```

Wait — looking more carefully at the available constructors: `BackendException(String message, Throwable cause, Object... args)` — this is what you need.

The fix:

```java
throw new BackendException(
        "Failed to execute Cassandra query after %s retries",
        lastError, retries);
```

This passes `lastError` as the cause (`Throwable`) and `retries` as the format arg. The cause's message is already accessible via `getCause().getMessage()`, so repeating it in the format string is unnecessary.
```java
 * with a lightweight query. Any failure here is swallowed so the
 * caller can still issue the real query, which will drive retries via
 * {@link #executeWithRetry(Statement)}.
 */
```
reconnectIfNeeded() checks this.opened but never sets it to false on failure
When the health check (HEALTH_CHECK_CQL) fails, the session stays in opened = true state. On the next call to detectSession() from BackendSessionPool, the idle-time check (now - session.updated() > interval) may skip reconnectIfNeeded() entirely because session.update() was called right after.
This means if the health check catches a failure, the session is marked as "recently updated" but is actually broken — and reconnectIfNeeded() won't be called again until the detect interval elapses.
Also: if tryOpen() fails inside reconnectIfNeeded(), the method silently returns. But this.opened was already true (the guard at the top), and this.session is now null. Any subsequent execute() call will NPE on this.session.execute(...) before executeWithRetry can even catch a DriverException.
Consider setting this.opened = false when the session is nulled out, so that tryOpen() is properly re-triggered on the next access.
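One possible shape for that fix, sketched with plain Java stand-ins (field and method names mirror the review discussion but are hypothetical, not the PR's actual code):

```java
public class SessionSketch {

    volatile boolean opened = true;
    AutoCloseable session = () -> { /* pretend driver session */ };

    // Sketch of the suggested fix: when the health check fails, tear the
    // session down AND clear 'opened', so the next access re-runs
    // tryOpen() instead of NPE-ing on a null session.
    void reconnectIfNeeded(Runnable healthCheck) {
        if (!this.opened) {
            return; // tryOpen() will be triggered on next access
        }
        try {
            healthCheck.run(); // e.g. SELECT now() FROM system.local
        } catch (RuntimeException e) {
            try {
                this.session.close();
            } catch (Exception ignored) {
                // best-effort close
            }
            this.session = null;
            this.opened = false; // key fix: mark the session as closed
        }
    }
}
```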
```java
return this.executeWithRetry(new SimpleStatement(statement, args));
}

/**
```
🧹 Minor: Thread.sleep() in executeWithRetry blocks the calling thread
This is acceptable for low-QPS scenarios, but for high-throughput workloads with many concurrent queries hitting a downed Cassandra, all request threads will pile up sleeping here (up to 10 retries × 60s max delay = 10 minutes worst case per thread).
Consider whether the retry count and max delay defaults might be too aggressive for production. A thread blocked for minutes can cascade into thread pool exhaustion. An alternative would be to fail faster (e.g., 3 retries with shorter delays) and let the caller/user retry at a higher level.
Not a blocker — just worth considering for the default values.
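For context, the 10-minute figure is the simple `retries × maxDelay` upper bound; with doubling backoff the exact worst case is somewhat lower and easy to compute (helper name hypothetical):

```java
public class BackoffMath {

    // Exact worst-case time one thread spends sleeping across 'retries'
    // attempts, assuming the delay doubles from baseMs and caps at maxMs.
    static long worstCaseBlockMs(int retries, long baseMs, long maxMs) {
        long total = 0;
        for (int i = 0; i < retries; i++) {
            total += Math.min(baseMs << Math.min(i, 30), maxMs);
        }
        return total;
    }
}
```

With 10 retries, a 1 s base and a 60 s cap this comes to 303 s (about 5 minutes) of sleeping per thread, still easily enough to exhaust a bounded request pool.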
The PR wraps the synchronous execute path but not the async one. Consider wrapping the async path as well, or at minimum adding a TODO/comment explaining why async commits are deliberately left un-retried (e.g., if retry semantics for async batches are too complex for this PR).
Purpose of the PR
closes #2740
HugeGraphServer stops responding after Cassandra is restarted and never
recovers without a full server restart.
Root cause:
`CassandraSessionPool` builds the Datastax `Cluster` without a `ReconnectionPolicy`, `CassandraSession.execute(...)` calls the driver once with no retry, and thread-local sessions are never probed for liveness.

Once Cassandra goes down, transient `NoHostAvailableException` / `OperationTimedOutException` errors surface to the user and the pool stays dead even after Cassandra comes back online.
Main Changes

- Register `ExponentialReconnectionPolicy(baseDelay, maxDelay)` on the `Cluster` builder so the Datastax driver keeps retrying downed nodes in the background.
- Wrap every `Session.execute(...)` in `executeWithRetry(Statement)` with exponential backoff on transient connectivity failures.
- Implement `reconnectIfNeeded()` / `reset()` on `CassandraSession` so the pool reopens closed sessions and issues a lightweight health-check (`SELECT now() FROM system.local`) before subsequent queries.
- Add four tunables in `CassandraOptions` (defaults preserve previous behavior for healthy clusters):
  - `cassandra.reconnect_base_delay`: 1000 ms
  - `cassandra.reconnect_max_delay`: 60000 ms
  - `cassandra.reconnect_max_retries`: 10 (0 disables)
  - `cassandra.reconnect_interval`: 5000 ms
- Add unit tests covering defaults, overrides, disabling retries and option keys.
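As a rough illustration only (Datastax Java driver 3.x API; the contact point is a placeholder and the delays match the option defaults above), the policy registration might look like:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.ExponentialReconnectionPolicy;

public class ClusterSetup {
    static Cluster build(String host) {
        return Cluster.builder()
                .addContactPoint(host)
                // base delay 1 s doubling up to a 60 s cap; the driver then
                // keeps probing downed nodes in the background
                .withReconnectionPolicy(
                        new ExponentialReconnectionPolicy(1000L, 60000L))
                .build();
    }
}
```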
Verifying these changes

`mvn -pl hugegraph-server/hugegraph-test -am test -Dtest=CassandraTest` — 13/13 pass

Does this PR potentially affect the following parts?
Documentation Status
Doc - TODO