Cassandra Deletes: An Introduction

In a distributed database, replication is key to ensuring high availability and performance. Because deleted data can linger on individual nodes, a good understanding of Cassandra deletes and tombstones helps operators and developers avoid stale reads and zombie data.

Because Cassandra is a distributed system with immutable SSTables, deletes work differently than in a relational database. Keep in mind that 1) writes can take time to reach all replicas (eventual consistency) and 2) Cassandra has a sparse data model, where a missing value for a column in a row in one SSTable does not mean that the value is absent from every other SSTable.

Deletes are accomplished by writing tombstones (markers that surface as nulls in query results) to the database. Like other writes, tombstones carry timestamps, which let Cassandra determine whether a tombstone is current or stale by following standard last-write-wins (LWW) semantics.
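The reconciliation rule can be sketched in a few lines of Python (a toy model, not Cassandra's actual implementation): each cell is a (timestamp, value) pair, a tombstone is a cell whose value is None, and merging replica responses simply keeps the newest write.

```python
# Toy model of last-write-wins (LWW) reconciliation. A tombstone is
# represented as a cell whose value is None; it wins like any other write.
def merge_cells(*cells):
    # On a timestamp tie the tombstone ranks higher, mirroring Cassandra's
    # rule that deletes beat regular writes with equal timestamps.
    return max(cells, key=lambda c: (c[0], c[1] is None))

live = (1000, "a")       # write at t=1000
tomb = (2000, None)      # delete at t=2000
assert merge_cells(live, tomb) == (2000, None)        # the delete wins

late = (3000, "b")       # a later write resurrects the column
assert merge_cells(live, tomb, late) == (3000, "b")
```

Note that the tombstone is just another versioned cell: a replica that never received it would keep serving (1000, "a"), which is exactly the stale-read hazard discussed below.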

We can also establish data retention policies via TTL (time to live) expiration, which can be configured at write time or at the table level.
Once a cell's TTL expires, it is treated as a tombstone for all intents and purposes.
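A minimal sketch of that behavior (hypothetical helper names, not Cassandra code): a cell written with a TTL carries its expiration time, and once the clock passes it, reads treat the cell exactly as if a tombstone had been written.

```python
import time

# Toy model: a TTL'd cell remembers when it expires; after that point the
# value is hidden from reads, just like a tombstoned cell.
def make_cell(value, ttl=None, now=None):
    now = time.time() if now is None else now
    return {"value": value, "expires_at": now + ttl if ttl else None}

def read_cell(cell, now=None):
    now = time.time() if now is None else now
    if cell["expires_at"] is not None and now >= cell["expires_at"]:
        return None            # expired: behaves like a tombstone
    return cell["value"]

c = make_cell("a", ttl=10, now=100)
assert read_cell(c, now=105) == "a"    # still live
assert read_cell(c, now=115) is None   # TTL passed: acts as a tombstone
```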

You may wonder: if deletes are just more writes, how do I ever reclaim disk space after deleting data? Cassandra does eventually clean up tombstones, but only once the data meets certain criteria, giving the system ample time to ensure the tombstone reaches all replicas before it is removed. Otherwise we could see scenarios where deleted values become readable again because a tombstone only reached a limited set of replicas before it was cleaned up. The lifetime of a tombstone is defined by gc_grace_seconds; after gc grace, the tombstone becomes eligible for deletion during compaction. There are additional compaction subproperties that determine whether an SSTable gets checked for tombstones; these can be found in the DataStax documentation.
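The core purge rule can be sketched as a simple predicate (a simplification: real compaction also checks that no overlapping SSTable outside the compaction could still shadow the tombstone):

```python
# Toy predicate: a tombstone may be dropped during compaction only after
# gc_grace_seconds have elapsed since its local deletion time, giving
# repair enough time to propagate it to every replica first.
GC_GRACE_SECONDS = 864000   # Cassandra's default: 10 days

def tombstone_droppable(local_deletion_time, now, gc_grace=GC_GRACE_SECONDS):
    return now >= local_deletion_time + gc_grace

deleted_at = 1_445_883_214                       # seconds since epoch
assert not tombstone_droppable(deleted_at, deleted_at + 3600)      # too soon
assert tombstone_droppable(deleted_at, deleted_at + GC_GRACE_SECONDS)
```

This is also why repair must run more often than every gc_grace_seconds: a replica that misses the tombstone for longer than that window can resurrect the deleted data.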

Implications of having too many tombstones

Tombstones protect us against stale reads, but they also consume additional disk space and can hurt read latencies. Compacting away excessive numbers of tombstones is also expensive.

Imagine a workload that constantly inserts and deletes a set of partitions. Regular deletes generate an individual tombstone for each cell in the partition being deleted. Reading these rows back would be slow due to the number of SSTables (and therefore disk operations) required to pull the latest value, and compaction would painstakingly clean up the tombstones (after gc grace and the other criteria above) one by one.

The rest of this article will demonstrate range tombstones, which can minimize the performance impact of deletes for a particular set of workloads.

Range tombstones for partition-wide deletes

Range tombstones are an efficient way to delete data by partition key. A single tombstone covers the entire partition, decreasing both the number of tombstones scanned by range reads and the number of tombstones that must be merged or cleaned up by compaction.
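The saving is easy to quantify with a back-of-the-envelope sketch: per-cell deletes create one tombstone for every deleted cell, while a partition-level delete creates a single range tombstone regardless of how many rows the partition holds.

```python
# Back-of-the-envelope: tombstones written when deleting a partition of
# `rows` rows with `cols` regular columns per row.
def per_cell_tombstones(rows, cols):
    return rows * cols          # one tombstone per deleted cell

def range_tombstones(rows, cols):
    return 1                    # one marker covers the whole partition

# The demo partition below has 4 rows with 1 regular column each:
assert per_cell_tombstones(4, 1) == 4
assert range_tombstones(4, 1) == 1

# A wide partition makes the difference dramatic:
assert per_cell_tombstones(10_000, 5) == 50_000
```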

The following code demonstrates how range tombstones and regular tombstones are created, laid out on disk, and eventually compacted.

1: Create a demo keyspace "autogeneratedtest" with a table "test" (the replication settings here assume a single-node demo cluster):

CREATE KEYSPACE autogeneratedtest
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE autogeneratedtest.test (
    a text,
    b text,
    c text,
    PRIMARY KEY (a, b)
) WITH CLUSTERING ORDER BY (b ASC);

2: Next, insert some data into the table test:

cqlsh> insert into autogeneratedtest.test (a, b, c) VALUES ('a', 'a', 'a');
cqlsh> insert into autogeneratedtest.test (a, b, c) VALUES ('a', 'b', 'a');
cqlsh> insert into autogeneratedtest.test (a, b, c) VALUES ('a', 'c', 'a');
cqlsh> insert into autogeneratedtest.test (a, b, c) VALUES ('a', 'd', 'a');
cqlsh> insert into autogeneratedtest.test (a, b, c) VALUES ('b', 'a', 'a');
cqlsh> insert into autogeneratedtest.test (a, b, c) VALUES ('b', 'b', 'a');
cqlsh> insert into autogeneratedtest.test (a, b, c) VALUES ('b', 'c', 'a');
cqlsh> insert into autogeneratedtest.test (a, b, c) VALUES ('b', 'd', 'a');

3: Read the data back

select * from autogeneratedtest.test;

 a | b | c
---+---+---
 a | a | a
 a | b | a
 a | c | a
 a | d | a
 b | a | a
 b | b | a
 b | c | a
 b | d | a

4: Perform a nodetool flush to write the data from the memtable to SSTables on disk

$ nodetool flush

5: Use sstable2json to introspect the SSTable for expiration times and tombstone counts.
This converts SSTables to human-readable JSON.

$ sstable2json autogeneratedtest-test-ka-2-Data.db | jq "."

The JSON-formatted SSTable:

{"key": "a",

 "cells": [["a:","",1445880794890189],
           ["a:c","a",1445880794890189],
           ["b:","",1445880802889147],
           ["b:c","a",1445880802889147],
           ["c:","",1445880808905308],
           ["c:c","a",1445880808905308],
           ["d:","",1445880813897010],
           ["d:c","a",1445880813897010]]},

{"key": "b",

 "cells": [["a:","",1445880819440662],
           ["a:c","a",1445880819440662],
           ["b:","",1445880828536178],
           ["b:c","a",1445880828536178],
           ["c:","",1445880833840070],
           ["c:c","a",1445880833840070],
           ["d:","",1445880839727743],
           ["d:c","a",1445880839727743]]}

6: Perform delete operations on the data. The first four statements are ordinary column deletes; the fifth deletes the complete partition using a range tombstone.

cqlsh> delete c from autogeneratedtest.test where a = 'a' AND b = 'a';
cqlsh> delete c from autogeneratedtest.test where a = 'a' AND b = 'b';
cqlsh> delete c from autogeneratedtest.test where a = 'a' AND b = 'c';
cqlsh> delete c from autogeneratedtest.test where a = 'a' AND b = 'd';
cqlsh> delete from autogeneratedtest.test where a = 'b';

cqlsh> select * from autogeneratedtest.test;

 a | b | c
---+---+------
 a | a | null
 a | b | null
 a | c | null
 a | d | null


7: Repeat steps 4 and 5 (flush and sstable2json)

8: Compare the SSTables after the delete operations

{"key": "a",
 "cells": [["a:c",1445883190,1445883190723238,"d"]
           ["b:c",1445883196,1445883196743264,"d"],
           ["c:c",1445883202,1445883202431044,"d"],
           ["d:c",1445883208,1445883208774698,"d"]]},

{"key": "b",
 "metadata": {"deletionInfo": {"markedForDeleteAt":1445883214918032,"localDeletionTime":1445883214}},

 "cells": []}

As seen from the SSTable output, the rows under key "b" do not have an individual tombstone for each cell; instead the whole partition is covered by a single range tombstone, recorded in the deletionInfo metadata.

Fewer tombstones yield faster reads and quicker, less expensive compactions.