Mike Sun / 2025-05-15

Repair Time Requirements to Prevent Data Resurrection in Cassandra & Scylla

Updated 2025-05-19.

Cassandra and ScyllaDB share a well-known issue: a race condition between repair and garbage collection processes that can cause deleted data to resurrect.

This post establishes the repair time requirement necessary to eliminate the risk of data resurrection: consecutive repairs must both begin and complete within the garbage collection grace period (gc_grace_seconds).

The Data Resurrection Problem

In Cassandra and ScyllaDB, deletes don't immediately remove data. Instead, they write a special marker called a tombstone that indicates that the data is deleted. The actual removal of the data and its corresponding tombstone happens later during a process called compaction, but only after a garbage collection grace period known as gc_grace_seconds has elapsed.
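
For reference, gc_grace_seconds is a per-table setting. The sketch below, assuming the DataStax Python driver and a hypothetical ks.events table, shows how the value could be inspected and changed:

  # Sketch: inspect and adjust gc_grace_seconds with the DataStax Python driver.
  # The keyspace/table ("ks"/"events") are hypothetical placeholders.
  from cassandra.cluster import Cluster

  session = Cluster(["127.0.0.1"]).connect()

  # The default grace period is 864000 seconds (10 days).
  session.execute("ALTER TABLE ks.events WITH gc_grace_seconds = 864000")

  # Read the current setting back from the schema tables.
  row = session.execute(
      "SELECT gc_grace_seconds FROM system_schema.tables "
      "WHERE keyspace_name = 'ks' AND table_name = 'events'"
  ).one()
  print(row.gc_grace_seconds)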

Cassandra and ScyllaDB use eventually consistent replication with last-write-wins conflict resolution, which means a tombstone may not always be propagated to all replicas at the time of deletion. If a tombstone is compacted away on some replica nodes after gc_grace_seconds, but before it has been successfully propagated to all other replicas, a critical race condition occurs:

  1. Data is deleted, creating a tombstone
  2. The tombstone exists on some replicas but not all
  3. The tombstone is compacted away after gc_grace_seconds on replicas that received it
  4. A replica that never received the tombstone is queried or streams its data to a new node
  5. The supposedly deleted data reappears because there's no tombstone to indicate it was removed

Hints and read repairs can help propagate tombstones to replicas, but only a repair can guarantee that all replicas have received the tombstone, and that repair must occur before the tombstone becomes eligible for garbage collection.

nodetool repair vs. Cluster Repair

The term "repair" is often used to refer to both nodetool repair operations that repair token ranges on individual nodes and a "cluster-level" repairs that repairs token ranges across all nodes.

Cluster repairs are generally provided by tools such as Cassandra Reaper and Scylla Manager, which schedule and manage individual nodetool repair operations across nodes to avoid putting too much load on the cluster and degrading performance.

Importantly, cluster repairs do not guarantee a deterministic ordering or timing of individual nodetool repair operations between cluster repair cycles.
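
As a rough illustration of this staggering (not how Reaper or Scylla Manager are actually implemented), the sketch below runs nodetool repair with the primary-range option against a hypothetical list of node addresses, one node at a time:

  # Sketch: a naive "cluster repair" loop that runs `nodetool repair -pr` on
  # one node at a time. Real tools split work into token subranges, retry
  # failures, and throttle load; this only shows the sequential staggering.
  import subprocess

  NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical node addresses

  def cluster_repair(nodes):
      for host in nodes:
          # -pr repairs only the primary token ranges owned by this node, so the
          # cluster's full token space is covered once every node has been repaired.
          subprocess.run(["nodetool", "-h", host, "repair", "-pr"], check=True)

  cluster_repair(NODES)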

Shortcomings in Doc Recommendations

The Cassandra and ScyllaDB docs provide recommendations for how frequently nodetool repair operations should be run to prevent data resurrection, but the guidance is not operationally straightforward and is potentially inadequate.

Cassandra (5.0):

At a minimum, repair should be run often enough that the gc grace period never expires on unrepaired data. Otherwise, deleted data could reappear. With a default gc_grace_seconds of 10 days, repairing every node in your cluster at least once every 7 days will prevent this, while providing enough slack to allow for delays.

Scylla (6.2):

Run the nodetool repair command regularly. If you delete data frequently, it should be more often than the value of gc_grace_seconds (by default: 10 days), for example, every week.

The Cassandra docs state an invariant that must hold to prevent data resurrection: "[nodetool repairs] should be run often enough that the gc grace period never expires on unrepaired data". In other words, nodetool repairs must be frequent enough to ensure that any tombstone will always be repaired before it expires. Both docs advise that running nodetool repair on every node at least every 7 days will prevent unrepaired tombstones from expiring.

Theoretical Issues

Though running a nodetool repair every 7 days with the default gc grace period of 10 days should be adequate in practice, it is not theoretically sufficient, because it doesn't account for the duration of the repair process itself or for the specific timing of when data ranges (tokens) are repaired. A tombstone created for a token after one nodetool repair operation has started and already scanned that token can expire before the next scheduled nodetool repair operation reaches and processes the token, resulting in potential data resurrection.

Example 1: If a new nodetool repair operation starts every 7 days with gc_grace_seconds = 10 days, and each operation takes 3 to 4 days to complete:

The tombstone for data in Token A was created after Token A was processed by nodetool repair 1. This tombstone expires on Day 10 at 14:00. nodetool repair 2 only begins on Day 7 at 00:00 and doesn't repair Token A until Day 10 at 16:00, after the tombstone has expired.
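
The timeline in Example 1 reduces to simple date arithmetic; here is a sketch in which Day 0 is mapped to an arbitrary calendar date:

  # Sketch: Example 1 as date arithmetic. Day 0 is represented by an arbitrary
  # calendar date; only the relative offsets matter.
  from datetime import datetime, timedelta

  gc_grace = timedelta(days=10)

  day0 = datetime(2025, 1, 1)
  tombstone_created = day0 + timedelta(hours=14)    # Day 0, 14:00, after repair 1 scanned Token A
  tombstone_expires = tombstone_created + gc_grace  # Day 10, 14:00

  repair2_starts = day0 + timedelta(days=7)                      # Day 7, 00:00
  repair2_reaches_token_a = day0 + timedelta(days=10, hours=16)  # Day 10, 16:00

  # Repair 2 reaches Token A two hours after the tombstone expired.
  print(repair2_reaches_token_a > tombstone_expires)  # True -> resurrection risk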

In reality, individual nodetool repair operations shouldn't take days to run.

Practical Shortcomings

Operators tend to reason about repairs as cluster-level jobs managed by tools like Cassandra Reaper and Scylla Manager, but the Cassandra and Scylla docs don't provide guidance on the timing requirements for these cluster-level repairs.

Cluster-level repairs can take days to run and generally need to stagger individual nodetool repair operations across different nodes to prevent overloading the cluster. They also do not guarantee a deterministic ordering and timing of individual nodetool repair operations between cluster repair cycles.

This means that the theoretical but unlikely risk of data resurrection under the recommendation to run nodetool repairs within every gc_grace_seconds window becomes a real, practical risk if you run cluster repairs only once every gc_grace_seconds.

Example 2: If a new cluster repair job starts every 7 days with gc_grace_seconds = 10 days, and each cluster repair takes 4 days:

Similar to Example 1, data deleted in Token A after it was covered by Cluster Repair 1 has its tombstone expire on Day 10 at 14:00, before Token A is covered by Cluster Repair 2 on Day 10 at 16:00, and is therefore at risk of data resurrection.

A Stricter Requirement

Every tombstone has a gc_grace_seconds time window from creation to expiration. To prevent data resurrection, we must ensure that at least one complete "repair" (either nodetool repair or cluster repair) occurs within this window—specifically, a repair that begins after the tombstone is written and finishes before it expires.

If consecutive repairs always start and complete within gc_grace_seconds, then for any newly created tombstone, the next repair will fully complete its pass over all tokens before that tombstone expires. This ensures that even the last token touched by the second repair is processed before any relevant tombstone (created after the first repair started) expires.

Denote repair cycle i by its start time S_i and end time E_i:

Consider a tombstone created for a token just after that token was processed by repair_i. This tombstone must be covered by the next repair (repair_{i+1}) before it expires. The latest point at which it will be covered by repair_{i+1} is E_{i+1}, and it was created at some point after S_i. The time elapsed between the tombstone's creation and its repair must not exceed gc_grace_seconds.

Therefore, the interval from the start of one repair (S_i) to the completion of the subsequent repair (E_{i+1}) must be within gc_grace_seconds.

That is, E_{i+1} - S_i < gc_grace_seconds
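
Here is a minimal sketch of this requirement expressed as a check over a planned repair schedule, using hypothetical (start, end) pairs for the repair cycles:

  # Sketch: verify that every pair of consecutive repair cycles satisfies
  # E_{i+1} - S_i < gc_grace_seconds.
  from datetime import datetime, timedelta

  def schedule_is_safe(cycles, gc_grace):
      """cycles: list of (start, end) datetimes ordered by start time."""
      return all(
          next_end - cur_start < gc_grace
          for (cur_start, _), (_, next_end) in zip(cycles, cycles[1:])
      )

  gc_grace = timedelta(days=10)
  cycles = [
      (datetime(2025, 1, 1), datetime(2025, 1, 4)),   # repair 1: days 0-3
      (datetime(2025, 1, 8), datetime(2025, 1, 11)),  # repair 2: ends 10 days after repair 1 started
  ]
  print(schedule_is_safe(cycles, gc_grace))  # False: exactly 10 days is not strictly less than gc_grace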

This diagram illustrates how, if consecutive complete repairs (10 days) always fall within gc_grace_seconds (10 days), there will always be a complete repair within any gc_grace_seconds time interval.

In this diagram, consecutive complete repairs (14 days) take longer than gc_grace_seconds (10 days), potentially allowing a tombstone to expire before it's repaired.

Practical Considerations

Since operators generally run and monitor repairs at the cluster level, ensuring that consecutive cluster repairs always start and complete within gc_grace_seconds presents practical performance challenges. For the default gc_grace_seconds of 10 days, it means repairs must start at least every 5 days and their durations must be shorter than that. It's not uncommon for cluster repairs to take 3 days on a cluster holding a lot of data, which doesn't allow much buffer time for other operations (e.g. cluster expansion, cleanups) that may require cluster repairs to be paused or stopped.
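
Concretely, if a new cluster repair starts every P days and each one takes D days, the worst-case span from one repair's start to the next repair's completion is roughly P + D, which must stay under gc_grace_seconds; a quick sketch of that arithmetic:

  # Sketch: back-of-the-envelope check of repair cadence against gc_grace_seconds.
  from datetime import timedelta

  gc_grace = timedelta(days=10)
  interval_between_starts = timedelta(days=5)  # P: a new cluster repair starts every 5 days
  repair_duration = timedelta(days=3)          # D: each cluster repair takes 3 days

  worst_case_span = interval_between_starts + repair_duration  # S_i -> E_{i+1}
  print(worst_case_span < gc_grace)  # True: 8 of the 10 days used, 2 days of slack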

If you extend gc_grace_seconds, more tombstones will accumulate, resulting in increased read latencies as reads must scan past these markers, while also consuming additional disk space since data marked for deletion cannot be permanently removed until the grace period expires.

Repairs themselves are resource-intensive operations that consume significant disk and network IO, and increasing their intensity or parallelism to complete them faster can cause query performance to degrade.

Cassandra is introducing a built-in cluster repair feature (Unified Repair Solution) in the upcoming 5.1 release that will track the oldest repaired node in the cluster and prioritize it for repair. It also provides a LongestUnrepairedSec metric that can be used to monitor the time until any potential tombstone will expire. With older versions of Cassandra, incremental repairs can help reduce repair time, though there are some caveats.

ScyllaDB has introduced a feature called Repair Based Tombstone Garbage Collection that allows tombstones to be compacted before gc_grace_seconds expires if they've been repaired. This allows gc_grace_seconds to be extended to a very long duration without performance degradation. The feature is actively being refined and is presently an optional setting.
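
In ScyllaDB this is exposed as a per-table tombstone_gc option; the sketch below (hypothetical ks.events table; syntax and availability may vary by ScyllaDB version) shows how repair mode could be enabled:

  # Sketch: enable ScyllaDB's repair-based tombstone garbage collection for a
  # table via CQL. The keyspace/table names are hypothetical; consult the
  # ScyllaDB docs for the option's availability in your version.
  from cassandra.cluster import Cluster

  session = Cluster(["127.0.0.1"]).connect()
  session.execute("ALTER TABLE ks.events WITH tombstone_gc = {'mode': 'repair'}")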

Another approach is to execute deletes using a consistency level of ALL (CL=ALL). This guarantees that a delete operation succeeds only if its tombstone is written to all replicas. If any replica fails to receive the tombstone, the delete operation will return an error. However, this approach reduces the system's availability for delete operations. Because every replica must be online for the delete to succeed, the operation will fail if even one replica is unavailable.
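
A minimal sketch of a CL=ALL delete using the DataStax Python driver, with a hypothetical ks.events table and key:

  # Sketch: issue a delete at consistency level ALL. If any replica is down or
  # fails to acknowledge the tombstone, the driver raises an error rather than
  # silently leaving an unpropagated tombstone behind.
  from cassandra import ConsistencyLevel
  from cassandra.cluster import Cluster
  from cassandra.query import SimpleStatement

  session = Cluster(["127.0.0.1"]).connect()

  delete = SimpleStatement(
      "DELETE FROM ks.events WHERE id = %s",
      consistency_level=ConsistencyLevel.ALL,
  )
  session.execute(delete, ("some-id",))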

A variant of the previous approach is to fall back to deletes at CL=QUORUM if a replica is unavailable, and then queue tasks that ensure the tombstones are propagated to all replicas once the replica is back online, using CL=ALL deletes (or CL=ALL reads, which trigger read repairs).
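
A sketch of that fallback, assuming the same driver and hypothetical ks.events table as above; the retry queue here is only an in-memory list, whereas a real implementation would persist it durably:

  # Sketch: delete at CL=ALL, falling back to CL=QUORUM when replicas are
  # unavailable, and queueing the key so the tombstone can be re-propagated
  # later with a CL=ALL delete. The queue is an in-memory list for illustration
  # only; a real system would persist it and drain it once replicas recover.
  from cassandra import ConsistencyLevel, Unavailable, WriteTimeout
  from cassandra.cluster import Cluster
  from cassandra.query import SimpleStatement

  session = Cluster(["127.0.0.1"]).connect()
  retry_queue = []  # keys whose tombstones still need full propagation

  def delete_at(consistency, event_id):
      stmt = SimpleStatement(
          "DELETE FROM ks.events WHERE id = %s",
          consistency_level=consistency,
      )
      session.execute(stmt, (event_id,))

  def delete_event(event_id):
      try:
          delete_at(ConsistencyLevel.ALL, event_id)
      except (Unavailable, WriteTimeout):
          # Not every replica acknowledged the tombstone: succeed at QUORUM now
          # and remember to re-issue the delete at ALL later.
          delete_at(ConsistencyLevel.QUORUM, event_id)
          retry_queue.append(event_id)

  def drain_retry_queue():
      # Run periodically once the unavailable replica is back online.
      while retry_queue:
          delete_at(ConsistencyLevel.ALL, retry_queue.pop())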