Mike Sun / 2025-05-15

Repair Time Requirements to Prevent Data Resurrection in Cassandra & Scylla

Updated 2025-05-19.

Cassandra and ScyllaDB share a well-known issue: a race condition between repair and garbage collection processes that can cause deleted data to resurrect.

This post establishes the repair time requirement necessary to eliminate the risk of data resurrection: consecutive repairs must both begin and complete within the garbage collection grace period (gc_grace_seconds).

The Data Resurrection Problem

In Cassandra and ScyllaDB, deletes don't immediately remove data. Instead, they write a special marker called a tombstone that indicates that the data is deleted. The actual removal of the data and its corresponding tombstone happens later during a process called compaction, but only after a garbage collection grace period known as gc_grace_seconds has elapsed.
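
For reference, gc_grace_seconds is a per-table setting. The sketch below, assuming the DataStax Python driver and a hypothetical ks.events table, shows how the value could be inspected and changed:

  # Sketch: inspect and adjust gc_grace_seconds with the DataStax Python driver.
  # The keyspace/table ("ks"/"events") are hypothetical placeholders.
  from cassandra.cluster import Cluster

  session = Cluster(["127.0.0.1"]).connect()

  # The default grace period is 864000 seconds (10 days).
  session.execute("ALTER TABLE ks.events WITH gc_grace_seconds = 864000")

  # Read the current setting back from the schema tables.
  row = session.execute(
      "SELECT gc_grace_seconds FROM system_schema.tables "
      "WHERE keyspace_name = 'ks' AND table_name = 'events'"
  ).one()
  print(row.gc_grace_seconds)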

Cassandra and ScyllaDB use eventually consistent replication with last-write-wins conflict resolution, which means a tombstone may not always be propagated to all replicas at the time of deletion. If a tombstone is compacted away on some replica nodes after gc_grace_seconds, but before it has been successfully propagated to all other replicas, a critical race condition occurs:

  1. Data is deleted, creating a tombstone
  2. The tombstone exists on some replicas but not all
  3. The tombstone is compacted away after gc_grace_seconds on replicas that received it
  4. A replica that never received the tombstone is queried or streams its data to a new node
  5. The supposedly deleted data reappears because there's no tombstone to indicate it was removed

Hints and read repairs can help propagate tombstones to replicas, but only a repair can guarantee that all replicas have received the tombstone, and that repair must occur before the tombstone becomes eligible for garbage collection.

nodetool repair vs. Cluster Repair

The term "repair" is often used to refer to both nodetool repair operations that repair token ranges on individual nodes and a "cluster-level" repairs that repairs token ranges across all nodes.

Cluster repairs are generally provided by tools such as Cassandra Reaper and Scylla Manager, which schedule and manage individual nodetool repair operations across nodes to avoid putting too much load on the cluster and degrading performance.

Importantly, cluster repairs do not guarantee a deterministic ordering or timing of individual nodetool repair operations between cluster repair cycles.
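
As a rough illustration of this staggering (not how Reaper or Scylla Manager are actually implemented), the sketch below runs nodetool repair with the primary-range option against a hypothetical list of node addresses, one node at a time:

  # Sketch: a naive "cluster repair" loop that runs `nodetool repair -pr` on
  # one node at a time. Real tools split work into token subranges, retry
  # failures, and throttle load; this only shows the sequential staggering.
  import subprocess

  NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical node addresses

  def cluster_repair(nodes):
      for host in nodes:
          # -pr repairs only the primary token ranges owned by this node, so the
          # cluster's full token space is covered once every node has been repaired.
          subprocess.run(["nodetool", "-h", host, "repair", "-pr"], check=True)

  cluster_repair(NODES)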

Shortcomings in Doc Recommendations

The Cassandra and ScyllaDB docs provide recommendations for how frequently nodetool repair operations should be run to prevent data resurrection, but the guidance is not operationally straightforward and is potentially inadequate.

Cassandra (5.0):

At a minimum, repair should be run often enough that the gc grace period never expires on unrepaired data. Otherwise, deleted data could reappear. With a default gc_grace_seconds of 10 days, repairing every node in your cluster at least once every 7 days will prevent this, while providing enough slack to allow for delays.

Scylla (6.2):

Run the nodetool repair command regularly. If you delete data frequently, it should be more often than the value of gc_grace_seconds (by default: 10 days), for example, every week.

The Cassandra docs state an invariant that must hold to prevent data resurrection: "[nodetool repairs] should be run often enough that the gc grace period never expires on unrepaired data". In other words, nodetool repairs must be frequent enough to ensure that any tombstone will always be repaired before it expires. Both docs advise that running nodetool repair on every node at least every 7 days will prevent unrepaired tombstones from expiring.

Theoretical Issues

Though running a nodetool repair every 7 days with the default gc grace period of 10 days should be adequate in practice, it is not theoretically sufficient, because it doesn't account for the duration of the repair process itself or for the specific timing of when data ranges (tokens) are repaired. A tombstone created for a token after one nodetool repair operation has started and already scanned that token can expire before the next scheduled nodetool repair operation reaches and processes the token, resulting in potential data resurrection.

Example 1: If a new nodetool repair operation starts every 7 days with gc_grace_seconds = 10 days, and each operation takes 3 to 4 days to complete:

The tombstone for data in Token A was created after Token A was processed by nodetool repair 1. This tombstone expires on Day 10 at 14:00. nodetool repair 2 only begins on Day 7 at 00:00 and doesn't repair Token A until Day 10 at 16:00, after the tombstone has expired.
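
The timeline in Example 1 reduces to simple date arithmetic; here is a sketch in which Day 0 is mapped to an arbitrary calendar date:

  # Sketch: Example 1 as date arithmetic. Day 0 is represented by an arbitrary
  # calendar date; only the relative offsets matter.
  from datetime import datetime, timedelta

  gc_grace = timedelta(days=10)

  day0 = datetime(2025, 1, 1)
  tombstone_created = day0 + timedelta(hours=14)    # Day 0, 14:00, after repair 1 scanned Token A
  tombstone_expires = tombstone_created + gc_grace  # Day 10, 14:00

  repair2_starts = day0 + timedelta(days=7)                      # Day 7, 00:00
  repair2_reaches_token_a = day0 + timedelta(days=10, hours=16)  # Day 10, 16:00

  # Repair 2 reaches Token A two hours after the tombstone expired.
  print(repair2_reaches_token_a > tombstone_expires)  # True -> resurrection risk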

In reality, individual nodetool repair operations shouldn't take days to run.

Practical Shortcomings

Operators tend to reason about repairs as cluster-level jobs managed by tools like Cassandra Reaper and Scylla Manager, but the Cassandra and Scylla docs don't provide guidance on the timing requirements for these cluster-level repairs.

Cluster-level repairs can take days to run and generally need to stagger individual nodetool repair operations across different nodes to prevent overloading the cluster. They also do not guarantee a deterministic ordering and timing of individual nodetool repair operations between cluster repair cycles.

This means that the theoretical but unlikely risk of data resurrection under the recommendation to run nodetool repairs within every gc_grace_seconds window becomes a real, practical risk if you run cluster repairs only once every gc_grace_seconds.

Example 2: If a new cluster repair job starts every 7 days with gc_grace_seconds = 10 days, and each cluster repair takes 4 days:

Similar to Example 1, data deleted in Token A after it was covered by Cluster Repair 1 has its tombstone expire on Day 10 at 14:00, before Token A is covered by Cluster Repair 2 on Day 10 at 16:00, and is therefore at risk of data resurrection.

A Stricter Requirement

Every tombstone has a gc_grace_seconds time window from creation to expiration. To prevent data resurrection, we must ensure that at least one complete "repair" (either nodetool repair or cluster repair) occurs within this window—specifically, a repair that begins after the tombstone is written and finishes before it expires.

If consecutive repairs always start and complete within gc_grace_seconds, then for any newly created tombstone, the next repair will fully complete its pass over all tokens before that tombstone expires. This ensures that even the last token touched by the second repair is processed before any relevant tombstone (created after the first repair started) expires.

Denote repair cycle i by its start time S_i and end time E_i:

Consider a tombstone created for a token just after that token was processed by repair_i. This tombstone must be covered by the next repair (repair_{i+1}) before it expires. The latest point at which it will be covered by repair_{i+1} is E_{i+1}, and it was created at some point after S_i. The time elapsed between the tombstone's creation and its repair must not exceed gc_grace_seconds.

Therefore, the interval from the start of one repair (S_i) to the completion of the subsequent repair (E_{i+1}) must be within gc_grace_seconds.

That is, E_{i+1} - S_i < gc_grace_seconds
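
Here is a minimal sketch of this requirement expressed as a check over a planned repair schedule, using hypothetical (start, end) pairs for the repair cycles:

  # Sketch: verify that every pair of consecutive repair cycles satisfies
  # E_{i+1} - S_i < gc_grace_seconds.
  from datetime import datetime, timedelta

  def schedule_is_safe(cycles, gc_grace):
      """cycles: list of (start, end) datetimes ordered by start time."""
      return all(
          next_end - cur_start < gc_grace
          for (cur_start, _), (_, next_end) in zip(cycles, cycles[1:])
      )

  gc_grace = timedelta(days=10)
  cycles = [
      (datetime(2025, 1, 1), datetime(2025, 1, 4)),   # repair 1: days 0-3
      (datetime(2025, 1, 8), datetime(2025, 1, 11)),  # repair 2: ends 10 days after repair 1 started
  ]
  print(schedule_is_safe(cycles, gc_grace))  # False: exactly 10 days is not strictly less than gc_grace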

This diagram illustrates how, if consecutive complete repairs (10 days) always fall within gc_grace_seconds (10 days), there will always be a complete repair within any gc_grace_seconds time interval.

In this diagram, consecutive complete repairs (14 days) take longer than gc_grace_seconds (10 days), potentially allowing a tombstone to expire before it's repaired.

Practical Considerations

Since operators generally run and monitor repairs at the cluster level, ensuring that consecutive cluster repairs always start and complete within gc_grace_seconds presents practical performance challenges. For the default gc_grace_seconds of 10 days, it means repairs must start at least every 5 days and their durations must be shorter than that. It's not uncommon for cluster repairs to take 3 days on a cluster holding a lot of data, which doesn't allow much buffer time for other operations (e.g. cluster expansion, cleanups) that may require cluster repairs to be paused or stopped.
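
Concretely, if a new cluster repair starts every P days and each one takes D days, the worst-case span from one repair's start to the next repair's completion is roughly P + D, which must stay under gc_grace_seconds; a quick sketch of that arithmetic:

  # Sketch: back-of-the-envelope check of repair cadence against gc_grace_seconds.
  from datetime import timedelta

  gc_grace = timedelta(days=10)
  interval_between_starts = timedelta(days=5)  # P: a new cluster repair starts every 5 days
  repair_duration = timedelta(days=3)          # D: each cluster repair takes 3 days

  worst_case_span = interval_between_starts + repair_duration  # S_i -> E_{i+1}
  print(worst_case_span < gc_grace)  # True: 8 of the 10 days used, 2 days of slack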

If you extend gc_grace_seconds, more tombstones will accumulate, resulting in increased read latencies as reads must scan past these markers, while also consuming additional disk space since data marked for deletion cannot be permanently removed until the grace period expires.

Repairs themselves are resource-intensive operations that consume significant disk and network IO, and increasing their intensity or parallelism to complete them faster can cause query performance to degrade.

Cassandra is introducing a built-in cluster repair feature (Unified Repair Solution) in the upcoming 5.1 release that will track the oldest repaired node in the cluster and prioritize it for repair. It also provides a LongestUnrepairedSec metric that can be used to monitor the time until any potential tombstone will expire. With older versions of Cassandra, incremental repairs can help reduce repair time, though there are some caveats.

ScyllaDB has introduced a feature called Repair Based Tombstone Garbage Collection that allows tombstones to be compacted before gc_grace_seconds expires if they've been repaired. This allows gc_grace_seconds to be extended to a very long duration without performance degradation. The feature is actively being refined and is presently an optional setting.
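
In ScyllaDB this is exposed as a per-table tombstone_gc option; the sketch below (hypothetical ks.events table; syntax and availability may vary by ScyllaDB version) shows how repair mode could be enabled:

  # Sketch: enable ScyllaDB's repair-based tombstone garbage collection for a
  # table via CQL. The keyspace/table names are hypothetical; consult the
  # ScyllaDB docs for the option's availability in your version.
  from cassandra.cluster import Cluster

  session = Cluster(["127.0.0.1"]).connect()
  session.execute("ALTER TABLE ks.events WITH tombstone_gc = {'mode': 'repair'}")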

Another approach is to execute deletes using a consistency level of ALL (CL=ALL). This guarantees that a delete operation succeeds only if its tombstone is written to all replicas. If any replica fails to receive the tombstone, the delete operation will return an error. However, this approach reduces the system's availability for delete operations. Because every replica must be online for the delete to succeed, the operation will fail if even one replica is unavailable.
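
A minimal sketch of a CL=ALL delete using the DataStax Python driver, with a hypothetical ks.events table and key:

  # Sketch: issue a delete at consistency level ALL. If any replica is down or
  # fails to acknowledge the tombstone, the driver raises an error rather than
  # silently leaving an unpropagated tombstone behind.
  from cassandra import ConsistencyLevel
  from cassandra.cluster import Cluster
  from cassandra.query import SimpleStatement

  session = Cluster(["127.0.0.1"]).connect()

  delete = SimpleStatement(
      "DELETE FROM ks.events WHERE id = %s",
      consistency_level=ConsistencyLevel.ALL,
  )
  session.execute(delete, ("some-id",))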

A variant of the previous approach is to fall back to deletes at CL=QUORUM if a replica is unavailable, and then queue tasks that ensure the tombstones are propagated to all replicas once the replica is back online, using CL=ALL deletes (or CL=ALL reads, which trigger read repairs).
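
A sketch of that fallback, assuming the same driver and hypothetical ks.events table as above; the retry queue here is only an in-memory list, whereas a real implementation would persist it durably:

  # Sketch: delete at CL=ALL, falling back to CL=QUORUM when replicas are
  # unavailable, and queueing the key so the tombstone can be re-propagated
  # later with a CL=ALL delete. The queue is an in-memory list for illustration
  # only; a real system would persist it and drain it once replicas recover.
  from cassandra import ConsistencyLevel, Unavailable, WriteTimeout
  from cassandra.cluster import Cluster
  from cassandra.query import SimpleStatement

  session = Cluster(["127.0.0.1"]).connect()
  retry_queue = []  # keys whose tombstones still need full propagation

  def delete_at(consistency, event_id):
      stmt = SimpleStatement(
          "DELETE FROM ks.events WHERE id = %s",
          consistency_level=consistency,
      )
      session.execute(stmt, (event_id,))

  def delete_event(event_id):
      try:
          delete_at(ConsistencyLevel.ALL, event_id)
      except (Unavailable, WriteTimeout):
          # Not every replica acknowledged the tombstone: succeed at QUORUM now
          # and remember to re-issue the delete at ALL later.
          delete_at(ConsistencyLevel.QUORUM, event_id)
          retry_queue.append(event_id)

  def drain_retry_queue():
      # Run periodically once the unavailable replica is back online.
      while retry_queue:
          delete_at(ConsistencyLevel.ALL, retry_queue.pop())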