
fix(cassandra): auto-recover session after Cassandra restart #2997

Open

dpol1 wants to merge 1 commit into apache:master from dpol1:fix/2740-cassandra-reconnect

Conversation


@dpol1 dpol1 commented Apr 18, 2026

Purpose of the PR

closes #2740

HugeGraphServer stops responding after Cassandra is restarted and never
recovers without a full server restart.

Root cause: CassandraSessionPool builds the Datastax Cluster without a
ReconnectionPolicy, CassandraSession.execute(...) calls the driver once
with no retry, and thread-local sessions are never probed for liveness.
Once Cassandra goes down, transient NoHostAvailableException /
OperationTimedOutException errors surface to the user and the pool stays
dead even after Cassandra comes back online.

Main Changes

  • Register ExponentialReconnectionPolicy(baseDelay, maxDelay) on the
    Cluster builder so the Datastax driver keeps retrying downed nodes in
    the background.

  • Wrap every Session.execute(...) in executeWithRetry(Statement) with
    exponential backoff on transient connectivity failures.

  • Implement reconnectIfNeeded() / reset() on CassandraSession so the
    pool reopens closed sessions and issues a lightweight health-check
    (SELECT now() FROM system.local) before subsequent queries.

  • Add four tunables in CassandraOptions (defaults preserve previous
    behavior for healthy clusters):

    | Option | Default | Meaning |
    | --- | --- | --- |
    | cassandra.reconnect_base_delay | 1000 ms | Initial backoff for driver reconnection policy |
    | cassandra.reconnect_max_delay | 60000 ms | Cap for reconnection backoff |
    | cassandra.reconnect_max_retries | 10 | Per-query retries on transient errors (0 disables) |
    | cassandra.reconnect_interval | 5000 ms | Base interval for per-query exponential backoff |
  • Add unit tests covering defaults, overrides, disabling retries and option keys.
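The per-query retry described in the second bullet can be sketched without the Datastax driver. This is a simplified, illustrative sketch (`RetryExecutor` and its field names are not the PR's actual classes); it assumes the delay doubles from the base interval and is capped at the max delay:

```java
import java.util.concurrent.Callable;

// Driver-free sketch of per-query retry with capped exponential backoff.
// RetryExecutor and its fields are illustrative names, not the PR's
// actual CassandraSessionPool code.
public final class RetryExecutor {
    private final int maxRetries;       // cf. cassandra.reconnect_max_retries
    private final long baseIntervalMs;  // cf. cassandra.reconnect_interval
    private final long maxDelayMs;      // cf. cassandra.reconnect_max_delay

    public RetryExecutor(int maxRetries, long baseIntervalMs, long maxDelayMs) {
        this.maxRetries = maxRetries;
        this.baseIntervalMs = baseIntervalMs;
        this.maxDelayMs = maxDelayMs;
    }

    /** Run the task, retrying failures with doubling, capped backoff. */
    public <T> T executeWithRetry(Callable<T> task) throws Exception {
        Exception lastError = null;
        for (int attempt = 0; attempt <= this.maxRetries; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                lastError = e;
                if (attempt == this.maxRetries) {
                    break;
                }
                // Delay doubles each attempt: base, 2*base, 4*base ... capped.
                long delay = Math.min(this.baseIntervalMs << attempt,
                                      this.maxDelayMs);
                Thread.sleep(delay);
            }
        }
        throw new RuntimeException("Failed after " + this.maxRetries +
                                   " retries", lastError);
    }
}
```

With `maxRetries = 0` the loop makes a single attempt and rethrows, which matches the "0 disables" semantics in the table above.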

Verifying these changes

  • This change adds tests and can be verified as follows:
    • mvn -pl hugegraph-server/hugegraph-test -am test -Dtest=CassandraTest — 13/13 pass

Does this PR potentially affect the following parts?

  • Modify configurations

Documentation Status

  • Doc - TODO

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working store Store module labels Apr 18, 2026
@dpol1 dpol1 force-pushed the fix/2740-cassandra-reconnect branch from 97de8e9 to fc3d291 Compare April 18, 2026 17:37
```java
 * See issue #2740.
 */
private ResultSet executeWithRetry(Statement statement) {
    int retries = CassandraSessionPool.this.maxRetries;
```
‼️ Critical: BackendException argument order is wrong — lastError is passed as the first vararg instead of as the cause

The constructor signature being called here is BackendException(String message, Object... args), which means lastError is used as a format argument (%s), not as the exception cause. The original exception's stack trace is lost entirely.

The format string has two %s placeholders but the args are (lastError, retries, message) — three args. The first %s will be filled by lastError.toString() (the exception object), the second %s by retries (an int). The message string becomes a dangling unused arg.

To preserve the cause chain and format the message correctly, use the (String, Throwable, Object...) constructor:

The code in question:

```java
int retries = CassandraSessionPool.this.maxRetries;
throw new BackendException("Failed to execute Cassandra query " +
                           "after %s retries: %s",
                           lastError, retries,
                           lastError == null ? "<null>" :
                           lastError.getMessage());
```

Among the available constructors, BackendException(String message, Throwable cause, Object... args) is the one needed here. The fix:

```java
throw new BackendException(
    "Failed to execute Cassandra query after %s retries",
    lastError, retries);
```

This passes lastError as the cause (Throwable) and retries as the format arg. The cause's message is already accessible via getCause().getMessage(), so repeating it in the format string is unnecessary.

```java
 * with a lightweight query. Any failure here is swallowed so the
 * caller can still issue the real query, which will drive retries via
 * {@link #executeWithRetry(Statement)}.
 */
```

⚠️ reconnectIfNeeded() checks this.opened but never sets it to false on failure

When the health check (HEALTH_CHECK_CQL) fails, the session stays in opened = true state. On the next call to detectSession() from BackendSessionPool, the idle-time check (now - session.updated() > interval) may skip reconnectIfNeeded() entirely because session.update() was called right after.

This means if the health check catches a failure, the session is marked as "recently updated" but is actually broken — and reconnectIfNeeded() won't be called again until the detect interval elapses.

Also: if tryOpen() fails inside reconnectIfNeeded(), the method silently returns. But this.opened was already true (the guard at the top), and this.session is now null. Any subsequent execute() call will NPE on this.session.execute(...) before executeWithRetry can even catch a DriverException.

Consider setting this.opened = false when the session is nulled out, so that tryOpen() is properly re-triggered on the next access.
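The missing state transition can be sketched as a tiny stand-in class. This is a hypothetical illustration of the suggested fix, not the PR's code; the names (`opened`, `session`, `tryOpen`) follow the review discussion, and `probeOk` stands in for the health-check result:

```java
// Hypothetical sketch: on health-check failure, the session must flip
// `opened` to false when it discards the session, so the next access
// re-runs tryOpen() instead of NPE-ing on a null session.
final class SessionState {
    Object session;    // stands in for com.datastax.driver.core.Session
    boolean opened;

    void tryOpen() {
        this.session = new Object();  // stands in for cluster.connect(...)
        this.opened = true;
    }

    void reconnectIfNeeded(boolean probeOk) {
        if (!this.opened) {
            this.tryOpen();           // closed sessions are reopened here
            return;
        }
        if (!probeOk) {               // probeOk stands in for the health check
            this.session = null;
            this.opened = false;      // the missing transition: without it,
                                      // a later execute() dereferences null
        }
    }
}
```

After a failed probe the state is consistently closed, and the very next call reopens rather than leaving a "recently updated" but broken session.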

```java
    return this.executeWithRetry(new SimpleStatement(statement, args));
}

/**
```

🧹 Minor: Thread.sleep() in executeWithRetry blocks the calling thread

This is acceptable for low-QPS scenarios, but for high-throughput workloads with many concurrent queries hitting a downed Cassandra, all request threads will pile up sleeping here (up to 10 retries × 60s max delay = 10 minutes worst case per thread).

Consider whether the retry count and max delay defaults might be too aggressive for production. A thread blocked for minutes can cascade into thread pool exhaustion. An alternative would be to fail faster (e.g., 3 retries with shorter delays) and let the caller/user retry at a higher level.

Not a blocker — just worth considering for the default values.
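The worst-case blocking time under the PR's defaults can be computed directly. Assuming the per-query backoff doubles from reconnect_interval (5000 ms) and is capped at reconnect_max_delay (60000 ms), with one sleep per failed attempt:

```java
// Worst-case cumulative sleep per thread under the default tunables,
// assuming doubling backoff from the base interval, capped at the max.
public class BackoffBudget {
    public static void main(String[] args) {
        long baseMs = 5_000, capMs = 60_000;
        int retries = 10;
        long totalMs = 0;
        for (int attempt = 0; attempt < retries; attempt++) {
            totalMs += Math.min(baseMs << attempt, capMs);
        }
        // 5s + 10s + 20s + 40s, then six sleeps capped at 60s
        System.out.println("worst-case sleep: " + totalMs / 1000 + "s");
    }
}
```

This prints `worst-case sleep: 435s`, i.e. roughly seven minutes per blocked request thread, which supports the concern even below the 10-minute upper bound quoted above.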


imbajin commented Apr 18, 2026

⚠️ commitAsync() bypasses retry — still calls this.session.executeAsync(s) directly

The PR wraps execute() and commit() with executeWithRetry, but commitAsync() (line 177 in the base file) still calls this.session.executeAsync(s) directly. If a Cassandra restart happens during an async batch commit, the same connectivity failure will surface without any retry.

Consider wrapping the async path as well, or at minimum adding a TODO/comment explaining why async commits are deliberately left un-retried (e.g., if retry semantics for async batches are too complex for this PR).
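One non-blocking way to retry the async path is to reschedule the operation on a delayed executor instead of sleeping. This is a hypothetical sketch (`AsyncRetry.withRetry` is an illustrative helper, not part of the PR or the driver), shown with plain `CompletableFuture` rather than the driver's `ResultSetFuture`:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: retry an async operation without blocking the
// caller, by rescheduling it on a delayed executor with doubling delay.
public final class AsyncRetry {
    public static <T> CompletableFuture<T> withRetry(
            Supplier<CompletableFuture<T>> op, int retriesLeft, long delayMs) {
        return op.get().handle((value, error) -> {
            if (error == null) {
                return CompletableFuture.completedFuture(value);
            }
            if (retriesLeft <= 0) {
                CompletableFuture<T> failed = new CompletableFuture<T>();
                failed.completeExceptionally(error);
                return failed;
            }
            // Schedule the next attempt after delayMs, doubling each round.
            Executor delayed = CompletableFuture.delayedExecutor(
                    delayMs, TimeUnit.MILLISECONDS);
            return CompletableFuture.runAsync(() -> { }, delayed)
                    .thenCompose(ignored ->
                            withRetry(op, retriesLeft - 1, delayMs * 2));
        }).thenCompose(f -> f);
    }
}
```

Whether this complexity is worth it for batch commits is exactly the trade-off the comment raises; a TODO documenting the deliberate omission would also be a reasonable outcome for this PR.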


Development

Successfully merging this pull request may close these issues.

[Bug] Hugegraph isn't responding after Cassandra restarted.

2 participants