Mike Sun / 2025-05-15

Repair Time Requirements to Prevent Data Resurrection in Cassandra & Scylla

Race conditions may be undeleting your data!

Cassandra and ScyllaDB share a well-known issue: race conditions between repair and tombstone garbage collection can cause deleted data to resurrect. This can happen even when you follow the repair and garbage collection recommendations in their official docs!

This post demonstrates that a stricter invariant is required to eliminate the risk of data resurrection: each pair of consecutive repairs must both begin and complete within the garbage collection grace period (gc_grace_seconds).

The Data Resurrection Problem

In Cassandra and ScyllaDB, deletes don't immediately remove data. Instead, they write a special marker called a tombstone that indicates that the data is deleted. The actual removal of the data and its corresponding tombstone happens later during a process called compaction, but only after a garbage collection grace period known as gc_grace_seconds has elapsed.
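
gc_grace_seconds is a per-table setting that you can inspect and change. Here is a minimal sketch using the Python cassandra-driver (which also works against ScyllaDB); the contact point, keyspace, and table names are placeholders:

    from cassandra.cluster import Cluster

    # Connect to the cluster (contact point is an assumption for this sketch).
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    # gc_grace_seconds is stored per table in the schema tables.
    row = session.execute(
        "SELECT gc_grace_seconds FROM system_schema.tables "
        "WHERE keyspace_name = %s AND table_name = %s",
        ("my_keyspace", "my_table"),  # hypothetical keyspace/table
    ).one()
    print(row.gc_grace_seconds)  # default: 864000 seconds (10 days)

    # It can be changed per table with an ALTER TABLE statement.
    session.execute("ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 864000")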

Cassandra and ScyllaDB use eventually consistent replication with last-write-wins conflict resolution, which means a tombstone may not be propagated to all replicas at the time of deletion. If a tombstone is compacted away on some replica nodes after gc_grace_seconds, but before it has been successfully propagated to all other replicas, a critical race condition occurs:

  1. Data is deleted, creating a tombstone
  2. The tombstone exists on some replicas but not all
  3. The tombstone is compacted away after gc_grace_seconds on replicas that received it
  4. A replica that never received the tombstone is queried or streams its data to a new node
  5. The supposedly deleted data reappears because there's no tombstone to indicate it was removed

Hints and read repairs can help propagate tombstones to replicas, but only a repair can guarantee that all replicas have received the tombstone, and that repair must occur before the tombstone becomes eligible for garbage collection.
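
To make the timeline concrete, here is a toy Python model of the race; the replica class, key, and timings are purely illustrative, not real Cassandra/ScyllaDB internals:

    # Toy model of the resurrection race with three replicas.
    GC_GRACE_SECONDS = 10 * 24 * 3600  # default gc_grace_seconds: 10 days

    class Replica:
        def __init__(self, name):
            self.name = name
            self.data = {}        # key -> value
            self.tombstones = {}  # key -> deletion time (seconds)

        def delete(self, key, now):
            self.data.pop(key, None)
            self.tombstones[key] = now

        def compact(self, now):
            # Compaction purges tombstones older than gc_grace_seconds.
            self.tombstones = {k: t for k, t in self.tombstones.items()
                               if now - t < GC_GRACE_SECONDS}

    r1, r2, r3 = Replica("r1"), Replica("r2"), Replica("r3")
    for r in (r1, r2, r3):
        r.data["k"] = "v"

    # Steps 1-2: the delete's tombstone lands on r1 and r2 but never reaches r3.
    r1.delete("k", now=0)
    r2.delete("k", now=0)

    # Step 3: compaction after gc_grace_seconds drops the tombstone on r1 and r2.
    later = GC_GRACE_SECONDS + 1
    r1.compact(later)
    r2.compact(later)

    # Steps 4-5: r3 still holds the value and no tombstone anywhere disputes it,
    # so a read from r3 or a repair at this point resurrects the deleted row.
    print(r3.data.get("k"))              # -> 'v'
    print(r1.tombstones, r2.tombstones)  # -> {} {}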

Repairing At Least Every gc_grace_seconds Is Insufficient

The Cassandra and ScyllaDB docs advise that repairs be run at least once every gc_grace_seconds to prevent data resurrection:

Cassandra (5.0):

At a minimum, repair should be run often enough that the gc grace period never expires on unrepaired data. Otherwise, deleted data could reappear. With a default gc grace period of 10 days, repairing every node in your cluster at least once every 7 days will prevent this, while providing enough slack to allow for delays.

Scylla (6.2):

Run the nodetool repair command regularly. If you delete data frequently, it should be more often than the value of gc_grace_seconds (by default: 10 days), for example, every week.

Running a complete repair every gc_grace_seconds is actually insufficient because it doesn’t account for the duration of the repair process itself and the specific timing of when data ranges (tokens) are repaired.

A tombstone created for a token after one repair has already scanned that token can expire before the next repair cycle reaches and processes the token, resulting in potential data resurrection.

Counterexample 1

Repair every 10 days, gc_grace_seconds = 10 days, and each repair takes 3 days.

In this scenario, the tombstone for data in Token A was created after Token A was processed by Repair 1. This tombstone expires on Day 10 at 14:00. Repair 2 only begins on Day 10 at 00:00 and doesn't repair Token A until Day 10 at 16:00, after the tombstone has expired. If any replica hadn't received the tombstone for Token A before its expiry and subsequent compaction on other replicas, that data could be resurrected and would continue to persist even after Token A is repaired by Repair 2 on Day 10 at 16:00.
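
The same timeline can be checked with a few lines of Python; the absolute dates are arbitrary and only the relative offsets from the scenario above matter:

    from datetime import datetime, timedelta

    GC_GRACE = timedelta(days=10)
    day0 = datetime(2025, 1, 1)                       # arbitrary "Day 0"

    # Repair 1 starts on Day 0 and has already passed Token A when the
    # tombstone for Token A is written on Day 0 at 14:00.
    tombstone_created = day0 + timedelta(hours=14)    # Day 0, 14:00
    tombstone_expires = tombstone_created + GC_GRACE  # Day 10, 14:00

    # Repair 2 starts 10 days after Repair 1 and reaches Token A 16 hours in.
    repair2_start = day0 + timedelta(days=10)         # Day 10, 00:00
    repair2_reaches_token_a = repair2_start + timedelta(hours=16)  # Day 10, 16:00

    # The tombstone expires before Repair 2 covers Token A.
    print(tombstone_expires < repair2_reaches_token_a)  # True -> resurrection risk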

Counterexample 2

Repair every 7 days, gc_grace_seconds = 10 days, and each repair takes 4 days. Even the more conservative recommendation of repairing every 7 days with a 10-day gc_grace_seconds can be problematic if repairs take a significant amount of time.

Data deleted in Token A after it was covered by Repair 1 has its tombstone expire on Day 10 at 14:00, before Token A is covered by Repair 2 on Day 10 at 16:00, and is at risk of data resurrection.

A Stricter Requirement

Every tombstone has a gc_grace_seconds time interval from when it's written to when it expires. Therefore, we must ensure there is always a repair that starts after the tombstone is written and completes before the tombstone expires, or in other words, there must always be a repair that starts and completes within any moving gc_grace_seconds window.

If consecutive repairs always start and complete within gc_grace_seconds, then for any newly created tombstone, the next repair will fully complete its pass over all tokens before that tombstone expires. This ensures that even the last token touched by the second repair is processed before any relevant tombstone (created after the first repair started) expires.

Denote repair cycle i by its start time S_i and end time E_i:

Consider any tombstone created just after its token was last processed by repair_i. This tombstone must be covered by the next repair (repair_{i+1}) before it expires. The latest point at which it will be covered by repair_{i+1} is E_{i+1}. The tombstone was created at some point after S_i. The time elapsed for this tombstone must not exceed gc_grace_seconds.

Therefore, the interval from the start of one repair (S_i) to the completion of the subsequent repair (E_{i+1}) must be within gc_grace_seconds.

That is, E_{i+1} - S_i < gc_grace_seconds
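
A small Python helper makes this invariant easy to check against a planned repair schedule; the function name and the example schedules are assumptions for illustration:

    from datetime import timedelta

    def schedule_is_safe(repairs, gc_grace):
        """repairs: list of (start, end) times of consecutive full repairs,
        ordered by start time. Safe iff E_{i+1} - S_i < gc_grace for every
        consecutive pair."""
        return all(next_end - start < gc_grace
                   for (start, _end), (_next_start, next_end)
                   in zip(repairs, repairs[1:]))

    day = timedelta(days=1)

    # Counterexample 1: repairs start every 10 days and take 3 days each,
    # so E_{i+1} - S_i = 13 days > 10 days -> unsafe.
    print(schedule_is_safe([(0 * day, 3 * day), (10 * day, 13 * day)],
                           gc_grace=10 * day))   # False

    # Repairs start every 5 days and take 3 days each,
    # so E_{i+1} - S_i = 8 days < 10 days -> safe.
    print(schedule_is_safe([(0 * day, 3 * day), (5 * day, 8 * day)],
                           gc_grace=10 * day))   # True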

This diagram illustrates that when each pair of consecutive complete repairs spans 10 days, always fitting within gc_grace_seconds (10 days), there is always a complete repair within any gc_grace_seconds window.

In this diagram, each pair of consecutive complete repairs spans 14 days, longer than gc_grace_seconds (10 days), potentially allowing a tombstone to expire before its token is repaired.

Practical Considerations

Maintaining the invariant that consecutive repairs always start and complete within gc_grace_seconds presents practical challenges. With the default gc_grace_seconds of 10 days, the interval between repair starts plus the repair duration must stay under 10 days; for example, if repairs start every 5 days, each must finish in under 5 days. It's not uncommon for repairs to take 3 days on a cluster with a lot of data, which doesn't leave much buffer for other operations (e.g., cluster expansion, cleanups) that may require repairs to be paused or stopped.
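
Operationally, the invariant can be restated as a deadline for starting the next repair. A rough sketch follows; the function and its inputs are assumptions, and it presumes the next repair takes about as long as the previous one:

    from datetime import datetime, timedelta

    def next_repair_start_deadline(prev_start, expected_duration, gc_grace):
        # The next repair must complete by prev_start + gc_grace, so it must
        # start no later than that point minus its expected duration.
        return prev_start + gc_grace - expected_duration

    deadline = next_repair_start_deadline(
        prev_start=datetime(2025, 1, 1),
        expected_duration=timedelta(days=3),
        gc_grace=timedelta(days=10),
    )
    # 2025-01-08 00:00: the next repair must start within ~4 days of the
    # previous 3-day repair finishing on 2025-01-04.
    print(deadline)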

If you extend gc_grace_seconds, more tombstones accumulate, increasing read latencies (reads must scan over these markers) and consuming additional disk space, since data marked for deletion cannot be permanently removed until the grace period expires.

Repairs themselves are resource-intensive operations that consume significant disk and network I/O, and increasing their intensity or parallelism to complete them faster can degrade query performance.

Cassandra supports incremental repairs, which can help reduce repair time. ScyllaDB has introduced a feature called Repair Based Tombstone Garbage Collection that allows tombstones to be compacted before gc_grace_seconds expires if they've been repaired. This allows gc_grace_seconds to be extended to a very long duration without performance degradation. The feature is actively being refined and is presently an optional setting.
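
In ScyllaDB this is exposed as a per-table tombstone_gc option; a hedged sketch using the Python driver follows (verify the option and its availability against your ScyllaDB version's documentation; the keyspace and table names are placeholders):

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])   # contact point is an assumption
    session = cluster.connect()

    # Ask ScyllaDB to garbage-collect tombstones only after they have been
    # repaired, instead of after a fixed gc_grace_seconds timeout.
    session.execute(
        "ALTER TABLE my_keyspace.my_table WITH tombstone_gc = {'mode': 'repair'}"
    )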

Another approach is to execute deletes using a consistency level of ALL (CL=ALL). This guarantees that a delete operation succeeds only if its tombstone is written to all replicas. If any replica fails to receive the tombstone, the delete operation will return an error. However, this approach reduces the system's availability for delete operations. Because every replica must be online for the delete to succeed, the operation will fail if even one replica is unavailable.
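
With the Python driver, the consistency level is set per statement; a minimal sketch, where the keyspace, table, and key are hypothetical:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("my_keyspace")   # hypothetical keyspace

    # The delete succeeds only if every replica acknowledges the tombstone;
    # it raises an error (e.g. Unavailable) if any replica is down.
    delete_all = SimpleStatement(
        "DELETE FROM users WHERE user_id = %s",
        consistency_level=ConsistencyLevel.ALL,
    )
    session.execute(delete_all, ("42",))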

A variant of the previous approach is to fall back to deletes at CL=QUORUM if a replica is unavailable, then queue tasks that propagate the tombstones to all replicas once the replica is back online, using CL=ALL deletes (or CL=ALL reads, which trigger read repairs).
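
Here is a sketch of that fallback, again using the Python driver; the queue_reconciliation_task helper is hypothetical and stands in for whatever task system you use:

    from cassandra import ConsistencyLevel, Unavailable
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("my_keyspace")   # hypothetical keyspace

    def queue_reconciliation_task(key):
        # Hypothetical: persist a task that later re-issues a CL=ALL delete
        # (or a CL=ALL read to trigger read repair) once all replicas are up.
        ...

    def delete_user(user_id):
        query = "DELETE FROM users WHERE user_id = %s"
        try:
            session.execute(SimpleStatement(
                query, consistency_level=ConsistencyLevel.ALL), (user_id,))
        except Unavailable:
            # A replica is down: fall back to QUORUM so the delete still
            # succeeds, and queue a task to propagate the tombstone later.
            session.execute(SimpleStatement(
                query, consistency_level=ConsistencyLevel.QUORUM), (user_id,))
            queue_reconciliation_task(user_id)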