fix(cassandra): auto-recover session after Cassandra restart#2997
dpol1 wants to merge 1 commit into apache:master
Conversation
- Register ExponentialReconnectionPolicy on the Cluster builder so the
Datastax driver keeps retrying downed nodes in the background.
- Wrap every Session.execute() in executeWithRetry() with exponential
backoff on transient connectivity failures.
- Implement reconnectIfNeeded()/reset() so the pool reopens closed
sessions and issues a lightweight health-check (SELECT now() FROM
system.local) before subsequent queries.
- Add tunable options: cassandra.reconnect_base_delay,
cassandra.reconnect_max_delay, cassandra.reconnect_max_retries,
cassandra.reconnect_interval.
- Add unit tests covering defaults, overrides, disabling retries and
option keys.
Fixes apache#2740
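The retry-with-backoff loop described above can be sketched generically; the helper below is a hypothetical simplification (the real code operates on `Statement`/`ResultSet` and catches driver-specific exceptions, not plain `Exception`):

```java
import java.util.concurrent.Callable;

public class RetrySketch {

    // Hypothetical generic analogue of executeWithRetry(): retry a query
    // with capped exponential backoff, preserving the last error as cause.
    static <T> T executeWithRetry(Callable<T> query, int maxRetries,
                                  long baseDelayMs, long maxDelayMs)
                                  throws Exception {
        Exception lastError = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return query.call();
            } catch (Exception e) {
                lastError = e;
                if (attempt == maxRetries) {
                    break; // retries exhausted
                }
                // Double the delay each attempt, capped at maxDelayMs
                long delay = Math.min(baseDelayMs << Math.min(attempt, 30),
                                      maxDelayMs);
                Thread.sleep(delay);
            }
        }
        throw new Exception("Failed after " + maxRetries + " retries",
                            lastError);
    }
}
```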
Force-pushed from 97de8e9 to fc3d291
```java
 * See issue #2740.
 */
private ResultSet executeWithRetry(Statement statement) {
    int retries = CassandraSessionPool.this.maxRetries;
```
lastError is passed as the first vararg instead of as the cause
The constructor signature being called here is BackendException(String message, Object... args), which means lastError is used as a format argument (%s), not as the exception cause. The original exception's stack trace is lost entirely.
The format string has two %s placeholders but the args are (lastError, retries, message) — three args. The first %s will be filled by lastError.toString() (the exception object), the second %s by retries (an int). The message string becomes a dangling unused arg.
To preserve the cause chain and format the message correctly, use the (String, Throwable, Object...) constructor:
```java
int retries = CassandraSessionPool.this.maxRetries;
throw new BackendException("Failed to execute Cassandra query " +
                           "after %s retries: %s",
                           lastError, retries,
                           lastError == null ? "<null>" :
                           lastError.getMessage());
```
Should be:

```java
throw new BackendException(
        "Failed to execute Cassandra query after %s retries: %s",
        lastError, retries,
        lastError == null ? "<null>" : lastError.getMessage());
```

Wait — looking more carefully at the available constructors: `BackendException(String message, Throwable cause, Object... args)` — this is what you need.

The fix:

```java
throw new BackendException(
        "Failed to execute Cassandra query after %s retries",
        lastError, retries);
```

This passes `lastError` as the cause (`Throwable`) and `retries` as the format arg. The cause's message is already accessible via `getCause().getMessage()`, so repeating it in the format string is unnecessary.
```java
 * with a lightweight query. Any failure here is swallowed so the
 * caller can still issue the real query, which will drive retries via
 * {@link #executeWithRetry(Statement)}.
 */
```
reconnectIfNeeded() checks this.opened but never sets it to false on failure
When the health check (HEALTH_CHECK_CQL) fails, the session stays in opened = true state. On the next call to detectSession() from BackendSessionPool, the idle-time check (now - session.updated() > interval) may skip reconnectIfNeeded() entirely because session.update() was called right after.
This means if the health check catches a failure, the session is marked as "recently updated" but is actually broken — and reconnectIfNeeded() won't be called again until the detect interval elapses.
Also: if tryOpen() fails inside reconnectIfNeeded(), the method silently returns. But this.opened was already true (the guard at the top), and this.session is now null. Any subsequent execute() call will NPE on this.session.execute(...) before executeWithRetry can even catch a DriverException.
Consider setting this.opened = false when the session is nulled out, so that tryOpen() is properly re-triggered on the next access.
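One possible shape for that fix, sketched with plain Java stand-ins (field and method names mirror the review discussion but are hypothetical, not the PR's actual code):

```java
public class SessionSketch {

    volatile boolean opened = true;
    AutoCloseable session = () -> { /* pretend driver session */ };

    // Sketch of the suggested fix: when the health check fails, tear the
    // session down AND clear 'opened', so the next access re-runs
    // tryOpen() instead of NPE-ing on a null session.
    void reconnectIfNeeded(Runnable healthCheck) {
        if (!this.opened) {
            return; // tryOpen() will be triggered on next access
        }
        try {
            healthCheck.run(); // e.g. SELECT now() FROM system.local
        } catch (RuntimeException e) {
            try {
                this.session.close();
            } catch (Exception ignored) {
                // best-effort close
            }
            this.session = null;
            this.opened = false; // key fix: mark the session as closed
        }
    }
}
```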
```java
return this.executeWithRetry(new SimpleStatement(statement, args));
}

/**
```
🧹 Minor: Thread.sleep() in executeWithRetry blocks the calling thread
This is acceptable for low-QPS scenarios, but for high-throughput workloads with many concurrent queries hitting a downed Cassandra, all request threads will pile up sleeping here (up to 10 retries × 60s max delay = 10 minutes worst case per thread).
Consider whether the retry count and max delay defaults might be too aggressive for production. A thread blocked for minutes can cascade into thread pool exhaustion. An alternative would be to fail faster (e.g., 3 retries with shorter delays) and let the caller/user retry at a higher level.
Not a blocker — just worth considering for the default values.
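For context, the 10-minute figure is the simple `retries × maxDelay` upper bound; with doubling backoff the exact worst case is somewhat lower and easy to compute (helper name hypothetical):

```java
public class BackoffMath {

    // Exact worst-case time one thread spends sleeping across 'retries'
    // attempts, assuming the delay doubles from baseMs and caps at maxMs.
    static long worstCaseBlockMs(int retries, long baseMs, long maxMs) {
        long total = 0;
        for (int i = 0; i < retries; i++) {
            total += Math.min(baseMs << Math.min(i, 30), maxMs);
        }
        return total;
    }
}
```

With 10 retries, a 1 s base and a 60 s cap this comes to 303 s (about 5 minutes) of sleeping per thread, still easily enough to exhaust a bounded request pool.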
The PR wraps the synchronous execute path but not the async one. Consider wrapping the async path as well, or at minimum adding a TODO/comment explaining why async commits are deliberately left un-retried (e.g., if retry semantics for async batches are too complex for this PR).
Purpose of the PR
closes #2740
HugeGraphServer stops responding after Cassandra is restarted and never
recovers without a full server restart.
Root cause:
`CassandraSessionPool` builds the Datastax `Cluster` without a `ReconnectionPolicy`, `CassandraSession.execute(...)` calls the driver once with no retry, and thread-local sessions are never probed for liveness.

Once Cassandra goes down, transient `NoHostAvailableException` / `OperationTimedOutException` errors surface to the user and the pool stays dead even after Cassandra comes back online.
Main Changes

- Register `ExponentialReconnectionPolicy(baseDelay, maxDelay)` on the `Cluster` builder so the Datastax driver keeps retrying downed nodes in the background.
- Wrap every `Session.execute(...)` in `executeWithRetry(Statement)` with exponential backoff on transient connectivity failures.
- Implement `reconnectIfNeeded()` / `reset()` on `CassandraSession` so the pool reopens closed sessions and issues a lightweight health-check (`SELECT now() FROM system.local`) before subsequent queries.
- Add four tunables in `CassandraOptions` (defaults preserve previous behavior for healthy clusters):
  - `cassandra.reconnect_base_delay`: 1000 ms
  - `cassandra.reconnect_max_delay`: 60000 ms
  - `cassandra.reconnect_max_retries`: 10 (0 disables)
  - `cassandra.reconnect_interval`: 5000 ms
- Add unit tests covering defaults, overrides, disabling retries and option keys.
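As a rough illustration only (Datastax Java driver 3.x API; the contact point is a placeholder and the delays match the option defaults above), the policy registration might look like:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.ExponentialReconnectionPolicy;

public class ClusterSetup {
    static Cluster build(String host) {
        return Cluster.builder()
                .addContactPoint(host)
                // base delay 1 s doubling up to a 60 s cap; the driver then
                // keeps probing downed nodes in the background
                .withReconnectionPolicy(
                        new ExponentialReconnectionPolicy(1000L, 60000L))
                .build();
    }
}
```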
Verifying these changes

`mvn -pl hugegraph-server/hugegraph-test -am test -Dtest=CassandraTest` — 13/13 pass

Does this PR potentially affect the following parts?
Documentation Status
Doc - TODO