Cassandra 2.1 Overview

DataStax Training


Cassandra 2.1 Overview

Contents

  • Approach
  • Forensic-entomology

Approach

  • There were > 900 Jiras associated with the 2.1 release
  • Today I will go over them one by one

Approach

  • There were > 900 Jiras associated with the 2.1 release
  • Today I will not go over them one by one
  • What we will do:

    • Macro Trends (top 100 Jiras — by watchers)
    • Deep dive for major Bugs, Improvements, and Features
    • Some will be overviews, some will be examples based on SE feedback

Forensic-Entomology

Loose definition: "The study of squashed bugs"

  • Long term vision: a tool that can be repeatably used for (cross project) Jira analysis
  • Requirements: Powered directly by the Jira API, interactive, aesthetically pleasing, easy to deploy

Forensic-Entomology

Current State: Top 100 Jiras for 2.1, visualized as a bubble chart:

  • Size - Watches
  • Y - Votes
  • X - Committer
  • Color - Issue type

Cassandra 2.1 Features

Contents

  • Methodology
  • Dive in

Methodology

  • Top features by watchers + special requests.
  • Got feedback from SEs for ranking

Counters 2.0 CASSANDRA-4775

TL;DR:

2.0 counters == painful death — 2.1 counters still think twice.

  • This is the Jira that birthed the other Counter Jiras summarized in this deck.
  • Read Aleksey’s blog post

There are more minor counter improvements in 2.1. And more pending for 3.0.

New Row Cache - Number of columns CASSANDRA-1956/CASSANDRA-2864/CASSANDRA-6487

  • The row cache used to be mostly inefficient and, in most cases, worse than the OS page cache.
  • 2.1 lets us configure the row cache to hold a certain number of rows per partition.
  • This works great for tables where you often want the top N rows of a partition (CLUSTERING ORDER BY); see the ALTER TABLE sketch after the example below.

Ryan McGuire’s example:

CREATE TABLE status (
    user text,
    status_id timeuuid,
    status text,
    PRIMARY KEY (user, status_id))
    WITH CLUSTERING ORDER BY (status_id DESC);
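
A minimal sketch of turning on the per-partition row cache for this table, using the 2.1 caching map syntax. The rows_per_partition value of 10 is just an illustrative choice:

-- Cache the 10 newest rows of each user's partition in the row cache
ALTER TABLE status
    WITH caching = { 'keys': 'ALL', 'rows_per_partition': '10' };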

DTCS CASSANDRA-6602

  • Backported to 2.0
  • Deserves its own presentation
  • The workload matters a lot (a minimal enable sketch follows this list)
  • Huge implications for Cassandra in terms of:

    • Data density
    • Decreased compaction pain
    • Addressing TTL tombstone issues
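
A minimal sketch of switching a time-series table to DTCS. The table name and option values here are illustrative only:

-- Illustrative: move an existing time-series table onto DTCS
ALTER TABLE sensor_readings
    WITH compaction = { 'class': 'DateTieredCompactionStrategy',
                        'base_time_seconds': '3600',
                        'max_sstable_age_days': '30' };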

Partially Off-heap memtables CASSANDRA-6689/CASSANDRA-6694

  • GC is a huge pain; we’ve all fought it
  • Memtables were the last major component to move off heap (caches, bloom filters, etc. already have)
  • This is not turned on by default; you have to configure the following (maybe it becomes the default in 3.0 or 2.2?)

Move the cell names and values into off-heap buffers (the biggest heap savings come with large cells and blobs):

memtable_allocation_type: offheap_buffers

Move the entire cell off heap:

memtable_allocation_type: offheap_objects

cassandra-stress 2.1 CASSANDRA-6146/CASSANDRA-7468

  • CQL native
  • Benchmark your own schema! Read Jake’s blog post
  • Takes a config (yaml) file with your table def, field distribution parameters, batching, and query patterns to generate a realistic workload (example invocation below)
  • For some reason, requires tons of threads to actually STRESS a cluster.
  • Notoriously mediocre docs / pretty complex. Use my data modeling tool to generate a yaml for your workload.
  • CASSANDRA-7468 gives us time-based execution for stress. Now you can set stress to run for X hours / days / etc.
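
A rough sketch of a 2.1 invocation driven by a profile yaml. The file name, query names, thread count, and node address are illustrative; check cassandra-stress help for the exact flags in your build:

# Illustrative: run a mixed workload defined in a profile yaml for 30 minutes
cassandra-stress user profile=./my_workload.yaml ops\(insert=1,read1=3\) \
    duration=30m -rate threads=200 -node 10.0.0.1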

Incremental Repairs CASSANDRA-5351

  • Avoid repairing already-repaired data by marking sstables as repaired, so new sstables only get repaired once.
  • Once you pop the top you can’t stop
  • Use -inc in your repairs (example below)
  • Is compatible with sync and parallel repairs
  • Missing data or corrupted sstables require a full repair
  • To manually mark an sstable as repaired, use tools/bin/sstablerepairedset --is-repaired <sstable>, but first verify using tools/bin/sstablemetadata and looking at repairedAt.
  • Impact on compaction:

    • Size tiered compacts repaired and un-repaired sstables separately.
    • Leveled does size tiered compaction on un-repaired tables.
    • DTCS, same as leveled?

  • Once you start inc you kind of have to keep doing it.
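
A minimal sketch of kicking off an incremental, parallel repair on a single table; the keyspace and table names are placeholders:

# Illustrative: incremental + parallel repair of one table
nodetool repair -par -inc my_keyspace my_table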

Separate flush directory CASSANDRA-6357

  • New configurable directory for flushing memtables: flush_directory
  • Idea is to place flushes on a different drive, like we do today with the commitlog. Why?

    • Keep flush writers from backing up due to io wait when under high load
    • Can avoid certain OOM edge cases under load

  • If you don’t configure this, it will default to the data directory, same as 2.0 (yaml sketch below)
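
A minimal cassandra.yaml sketch, assuming the setting name used on this slide (flush_directory) and an illustrative mount point; verify the exact key in the yaml shipped with your version:

# Assumed key name from this slide; the path is illustrative
flush_directory: /mnt/fast-ssd/cassandra/flush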

User Defined Types CASSANDRA-5590

  • Refresher from Summit:

    • Make a type, then use it in a table def
    • You cannot update only parts of a UDT value, you have to overwrite the whole thing every time.
    • You can add or rename fields on a UDT using ALTER TYPE, but you can’t delete fields.
    • You can select just some fields of a UDT, but C* will still pull and deserialize the whole thing.
    • You can, but probably don’t want to, use a UDT as a clustering column

User Defined Types CASSANDRA-5590

  • What UDT’s look like
 CREATE TYPE address (
      street text,
      city text,
      zip int
  );

  CREATE TABLE user_profiles (
      login text PRIMARY KEY,
      first_name text,
      last_name text,
      email text,
      addresses map<text, address>
  );

User Defined Types CASSANDRA-5590

  // Inserts a user with a home address
  INSERT INTO user_profiles(login, first_name, last_name, email, addresses)
  VALUES ('tsmith',
          'Tom',
          'Smith',
          'tsmith@gmail.com',
          { 'home': { street: '1021 West 4th St. #202',
                      city: 'San Francisco',
                      zip: 94110 }});

User Defined Types CASSANDRA-5590

  // Adds a work address for our user
  UPDATE user_profiles
     SET addresses = addresses
                   + { 'work': { street: '3975 Freedom Circle Blvd',
                                 city: 'Santa Clara',
                                 zip: 95050 }}
   WHERE login = 'tsmith';

  CREATE TYPE phone (
      number text,
      tags set<text>
  );

  // Add a 'phones' field to address that is a set of the 'phone' UDT above

  ALTER TYPE address ADD phones set<phone>;

Listen interface and RPC interface CASSANDRA-7417

  • Main Use Case: with vnodes and the *_interface settings, you can have one config file for the whole cluster (though I think you should still keep track of your tokens somewhere).
  • Can be useful if your machines have multiple NICs
  • Interface must map to a single address
  • Two new attributes in the yaml to specify listen and rpc interfaces.
  • Don’t set both an address and an interface; one will suffice
  • Notably, a broadcast interface option is missing…
#listen_interface: eth0
#rpc_interface: eth1

Cassandra 2.1 Improvements

Contents

  • Methodology
  • Dive in

Methodology

  • Top improvements by watchers + special requests.
  • Got feedback from SEs for ranking

counters++ CASSANDRA-6504

TL;DR:

2.0 counters == painful death — 2.1 counters still think twice.

Improvements and changes to counters:

  1. Replicate on write is gone; now we have COUNTER_MUTATION
  2. New configurable counter timeout
  3. New counter cache (both knobs sketched below)
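
A minimal cassandra.yaml sketch of the knobs behind items 2 and 3. The values shown are the 2.1 defaults as I recall them, so double-check your yaml:

# Counter write timeout (item 2) and counter cache (item 3)
counter_write_request_timeout_in_ms: 5000
counter_cache_size_in_mb:            # blank = auto-sized
counter_cache_save_period: 7200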

Note

There are more minor counter improvements in 2.1. And more pending for 3.0.

Warn on large batch sizes CASSANDRA-6487

Why? Monster batches hurt C*; even small unlogged batches don’t really help if they span partitions.

  • Configurable warning for large batches (yaml knob sketched below)
  • Backported to 2.0 so you may have seen this already.
  • For details read Ryan’s Blog
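
A minimal cassandra.yaml sketch; 5kb is the default threshold as I recall it, so adjust to taste:

# Warn in the logs when a single batch exceeds this size
batch_size_warn_threshold_in_kb: 5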

Compare and Delete (CAD) CASSANDRA-5832

Allow LWTs for DELETE.

Example:

DELETE FROM test1 WHERE k = 456 IF EXISTS;

(Backported to 2.0)

2i for Collections CASSANDRA-4511

  • These are still local indexes and have the same caveats as other secondary indexes.
  • Collections still can’t be part of your Primary Key

To create an index, use the usual syntax:

CREATE INDEX keyword_index ON test2 (keywords);

For lists and sets:

SELECT * FROM myTable WHERE tags CONTAINS 'awesome';

For Maps (note you can only index the keys not the values):

SELECT * FROM myTable WHERE tags CONTAINS KEY 'awesome';

For indexing map values or nested types, see CASSANDRA-6382 (future work).

Java 8 CASSANDRA-7028

  • Java 7 has been EOL since April…

Hot partitions CASSANDRA-7974

  • Have you ever had a customer issue and told them "maybe you have a wide row / hot partition" but had no way to prove it to them? Even if you could, what do you do next?
  • 2.1 has brand new tooling to identify hot partitions, accessible via nodetool and JMX. (Remind me to talk to the OpsCenter team about an alert for this)
nodetool toppartitions <keyspace> <table> <duration>
WRITES Sampler:
  Cardinality: ~0 (256 capacity)
  Top 10 partitions:
	Nothing recorded during sampling period...

READS Sampler:
  Cardinality: ~0 (256 capacity)
  Top 10 partitions:
	Nothing recorded during sampling period...

Don’t start without JNA CASSANDRA-6575

  • Historically, we couldn’t bundle JNA for licensing reasons but it’s now dual-licensed LGPL/APL.
  • CASSANDRA-5872 now gives us bundled JNA in 2.1

Impact:

  • You no longer have to worry about installing JNA separately.
  • Cassandra will actually fail to start if there’s an issue with JNA
  • Common problem: many customers lock down the /tmp directory, which prevents JNA from initializing.
  • Solution: Java option to redirect JNA to use a directory other than /tmp
JVM_OPTS="$JVM_OPTS -Djna.tmpdir=/var/tmp"

Outlaw running -pr and -local repairs together CASSANDRA-7450

  • Doing this results in missing some data in the repair
  • You will now see an error if you try this

Not sure why this is labeled as an Improvement. It’s a Bug! https://issues.apache.org/jira/browse/CASSANDRA-7450

Switch to logback CASSANDRA-5883

  • Logging is now configured in one of the following files (minimal snippet below):

logback-test.xml or logback.xml

  • You can still adjust levels on the fly with nodetool:
nodetool getlogginglevels

and

nodetool setlogginglevel <class> <level>
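
A minimal logback.xml sketch for bumping one package to DEBUG; the appender setup is omitted and the package name is just an example:

<configuration scan="true">
  <!-- Illustrative: raise one Cassandra package to DEBUG; keep your existing appenders -->
  <logger name="org.apache.cassandra.db.compaction" level="DEBUG"/>
  <root level="INFO"/>
</configuration>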

Bulk loading Improvements CASSANDRA-7405/CASSANDRA-3668

  • Production-ready COPY TO and COPY FROM. For a better UX and ~2x performance, check out Brian’s Loader, and watch and vote for CASSANDRA-9303
  • Parallel streaming for sstableloader: 2x improvement in early tests (will OpsCenter backup restore get wins here?)

cqlsh uses the python driver CASSANDRA-6307

  • Fewer thrift things

Netty performance things CASSANDRA-5663/CASSANDRA-6861/CASSANDRA-6236

  • Upgraded and optimized for Netty 4 (btw, this is what the native protocol / client communication uses)
  • Coalescing network packets: 2x improvement. Interesting but dense.
  • Disabled by default, depends on architecture, cloud, etc.

Cassandra 2.1 Bugs

Contents

  • Methodology
  • Dive in

Methodology

  • Focused on bugs present in 2.0 that are fixed in 2.1 (ignored bugs that were introduced in 2.1 and fixed)
  • Got feedback from SEs for ranking

Unique CF IDs CASSANDRA-5202

Q: Dropping and removing CF’s has historically caused problems in Cassandra, why?

A: Because the unique identifiers for tables were just the table name, so you could read old data in some edge cases.

Q: So is it safe to delete and recreate tables all the time?

A: Actually, it’s much worse now because concurrent creates can break your cluster and mess up your data (CASSANDRA-8387); also keep an eye out for CASSANDRA-9291

2.0:

/var/lib/cassandra/data/test/test$ ls
test-test-jb-1-CompressionInfo.db  test-test-jb-1-Index.db       test-test-jb-1-TOC.txt             test-test-jb-2-Filter.db      test-test-jb-2-Summary.db
test-test-jb-1-Data.db             test-test-jb-1-Statistics.db  test-test-jb-2-CompressionInfo.db  test-test-jb-2-Index.db       test-test-jb-2-TOC.txt
test-test-jb-1-Filter.db           test-test-jb-1-Summary.db     test-test-jb-2-Data.db             test-test-jb-2-Statistics.db

2.1:

/var/lib/cassandra/data/test/test-1262ef90ff3911e490a74f7807583d97# ls
test-test-ka-1-CompressionInfo.db  test-test-ka-1-Digest.sha1  test-test-ka-1-Index.db       test-test-ka-1-Summary.db
test-test-ka-1-Data.db             test-test-ka-1-Filter.db    test-test-ka-1-Statistics.db  test-test-ka-1-TOC.txt

Hadoop releases CASSANDRA-5201

TL;DR:

We now use Twitter’s Elephant Bird to support multiple versions of Hadoop.

This means people can use hadoop1 and hadoop2 with C* (backported to 2.0)

Consistent range movements CASSANDRA-2434

Potential pitfall

Historical Cassandra Edge Case:

Pre 2434:

  • There was a possibility for data loss if nodes were inconsistent at the time of bootstrap.
  • Bootstrapping multiple nodes at a time (in different racks) leads to consistency issues.
  • Bootstraps were done using the 2-minute rule

Post 2434:

  • Data loss edge case is fixed.
  • Bootstrap will throw an error when you bootstrap more than one node at a time (even after waiting 2 minutes for gossip to settle!)

Consistent range movements CASSANDRA-2434

Q: Darn, what if I’m in a hurry? Need capacity!!!

A: Never fear, like everything C* this is configurable.

Procedural TL;DR: Turn off consistent range movements. In your cassandra-env.sh set:

JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"

see DOC-362 for details.

This actually broke Paxos, which was fixed in CASSANDRA-8640 (backported to 2.0)

Hints & Metered Flusher private executors CASSANDRA-8285/CASSANDRA-8485

Put both hints and metered flushers on their own threads. That way they can’t get blocked!

Al Tobey has some awesome magic to make hints use DTCS

Flush on Truncate CASSANDRA-7511

In 2.0, the commitlog would continue to grow past its limit if you truncated a large table.

In 2.1, we now flush after TRUNCATE.

Multiple clustering column Where-In clause CASSANDRA-6875

In Thrift, you used to be able to select multiple specific clustering keys in a single query.

CQL now has parity. Given the following table:

CREATE TABLE test (
  k int,
  c1 int,
  c2 int,
  PRIMARY KEY (k, c1, c2)
);

Code:

SELECT * FROM test WHERE k = 0 AND (c1, c2) IN ((0, 0), (1,1)) ;

Not sure why this is labeled as a Bug! https://issues.apache.org/jira/browse/CASSANDRA-6875

Counter time bomb CASSANDRA-6405

TL;DR: 2.0 counters == painful death — 2.1 counters still think twice.

Idempotent: an operation that has the same outcome no matter how many times it is executed.

C* is good at idempotent operations and counters are not, and will never be, idempotent. As a result there are accuracy and performance issues with counters.

The 2.1 fixes address a little bit of both.

Counter time bomb CASSANDRA-6405

Reasons for bad counter performance:

  1. The majority of the CPU and IO required for counters does not occur on Write. It occurs later, on Replicate on Write. Keep an eye on this thread-pool.
  2. Counters do not get stored as a value, but rather as an increment on the existing value throughout much of the write path.
  3. When we reconcile counters we require a byte buffer that can hold the existing counter shards in sorted order, plus their union. Many of these at the same time consume resources exponentially.
  4. When a counter time bomb explodes, at random on a random node in your cluster, the side effects include high CPU utilization and a lot of garbage generation. Band-aids include pre-aggregating counters, using token-aware counter batches, the settings from 8150, and massive over-provisioning.

This ticket addresses incremental counter storage. As of 6405, we store final counter values in the commitlog. From the commitlog on, counters are idempotent.

The final counter fix will not get done until 3.x CASSANDRA-6506

Counter time bomb CASSANDRA-6405

Reasons for bad counter accuracy:

  1. In a distributed system, non-idempotent operations are not reliable
  2. The longer (in your write path) an operation is not idempotent, the more unreliable it will be. This means possibility of under or over counting.
  3. By making counters idempotent at the commitlog, counters become a lot more reliable than they used to be.
  4. Be careful making use case selections for counters. No money etc.
  5. You can err toward overcounting by always retrying failed increments, or toward undercounting by never retrying.

DTCS Improvement CASSANDRA-8243

One of the big hitters of DTCS is that if you have TTL, you can expire entire SSTables for free (as in beer) thanks to CASSANDRA-5228

Unfortunately, the conditions for expiry in 2.0 were unnecessarily strict, limiting this significantly.

8243 loosens up the condition to maximize free sstable expiry! (backported to 2.0)

Note: For details on DTCS see Shooky’s Guide

Not sure why this is labeled as a bug! https://issues.apache.org/jira/browse/CASSANDRA-8243

10k column families CASSANDRA-6977

TL;DR:

¡¡¡It’s still a bad idea to have thousands of tables in C*!!!

Too many tables results in performance degradation for the table-creation operation, regardless of whether the tables are split across separate keyspaces.

With 6977 performance goes from:

2.0

  • Linear wrt the total number of tables in the schema

to 2.1

  • Linear wrt the number of tables in a keyspace

    Q: So is it safe to create a ton of tables in C* for Multi-Tenant?

    A: No! The memory allocation per table, the schema_columns issues / CASSANDRA-9291, the compaction and repair pains still apply. Don’t do it!

Tuple type CASSANDRA-7248

Sylvain added this for the WHERE ... IN Thrift-parity ticket CASSANDRA-6875

You can use Tuples in tables now:

CREATE TABLE tupletable (k int PRIMARY KEY, t tuple<int, text, double>);
INSERT INTO tupletable (k, t) VALUES (0, (3, 'foo', 3.4));

Not sure why this is labeled as a Bug! https://issues.apache.org/jira/browse/CASSANDRA-7248

Cassandra 2.1 Summary

Summary

  • There were a lot of changes in 2.1, many to be excited about
  • From early field testing (Al Tobey):

    • Big IO and CPU improvements for 2.1 especially for writes
    • For the first time, C* hits other bottlenecks (not IO or CPU) like hints or even network!
    • G1GC shows better and better results for C*, and lately I heard a great review for a Search workload. C* 3.0 is looking at Al’s PR

THANK YOU

  • To you for listening
  • To the devs for giving me squashed bugs to study
  • To Al, Kennedy, Hess, Rich Reffner, Avinash, and others for helping me prioritize this monster list.
