Cassandra 2.1 Overview

DataStax Training


Cassandra 2.1 Overview

Contents

  • Approach
  • Forensic-entomology

Approach

  • There were > 900 Jiras associated with the 2.1 release
  • Today I will go over them one by one

Approach

  • There were > 900 Jiras associated with the 2.1 release
  • Today I will not go over them one by one
  • What we will do:

    • Macro Trends (top 100 Jiras — by watchers)
    • Deep dive for major Bugs, Improvements, and Features
    • Some will be overviews, some will be examples based on SE feedback

Forensic-Entomology

Loose definition: "The study of squashed bugs"

  • Long term vision: a tool that can be repeatably used for (cross project) Jira analysis
  • Requirements: Powered directly by the Jira API, interactive, aesthetically pleasing, easy to deploy

Forensic-Entomology

Current State: Top 100 Jiras for 2.1, visualized as a bubble chart:

  • Size - Watches
  • Y - Votes
  • X - Committer
  • Color - Issue type

Cassandra 2.1 Features

Contents

  • Methodology
  • Dive in

Methodology

  • Top features by watchers + special requests.
  • Got feedback from SEs for ranking

Counters 2.0 CASSANDRA-4775

TL;DR:

2.0 counters == painful death — 2.1 counters still think twice.

  • This is the Jira that birthed the other Counter Jiras summarized in this deck.
  • Read Aleksey’s blog post

There are more minor counter improvements in 2.1. And more pending for 3.0.

New Row Cache - Number of columns CASSANDRA-1956/CASSANDRA-2864/CASSANDRA-6487

  • The row cache used to be mostly inefficient and, in most cases, worse than the OS page cache.
  • 2.1 lets us configure the row cache to hold a certain number of rows per partition.
  • This works great for tables where you often want the top N rows of a partition (CLUSTERING ORDER BY); see the ALTER TABLE sketch after the example below.

Ryan McGuire’s example:

CREATE TABLE status (
    user text,
    status_id timeuuid,
    status text,
    PRIMARY KEY (user, status_id))
    WITH CLUSTERING ORDER BY (status_id DESC);
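
A minimal sketch of turning on the per-partition row cache for this table, using the 2.1 caching map syntax. The rows_per_partition value of 10 is just an illustrative choice:

-- Cache the 10 newest rows of each user's partition in the row cache
ALTER TABLE status
    WITH caching = { 'keys': 'ALL', 'rows_per_partition': '10' };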

DTCS CASSANDRA-6602

  • Backported to 2.0
  • Deserves its own presentation
  • The workload matters a lot (a minimal enable sketch follows this list)
  • Huge implications for Cassandra in terms of:

    • Data density
    • Decreased compaction pain
    • Addressing TTL tombstone issues
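
A minimal sketch of switching a time-series table to DTCS. The table name and option values here are illustrative only:

-- Illustrative: move an existing time-series table onto DTCS
ALTER TABLE sensor_readings
    WITH compaction = { 'class': 'DateTieredCompactionStrategy',
                        'base_time_seconds': '3600',
                        'max_sstable_age_days': '30' };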

Partially Off-heap memtables CASSANDRA-6689/CASSANDRA-6694

  • GC is a huge pain; we’ve all fought it
  • Memtables were the last major component to move off heap (caches, bloom filters, etc. already have)
  • This is not turned on by default; you have to configure the following (maybe it becomes the default in 3.0 or 2.2?)

Move the cell names and values into off-heap buffers (the biggest heap savings come with large cells and blobs):

memtable_allocation_type: offheap_buffers

Move the entire cell off heap:

memtable_allocation_type: offheap_objects

cassandra-stress 2.1 CASSANDRA-6146/CASSANDRA-7468

  • CQL native
  • Benchmark your own schema! Read Jake’s blog post
  • Takes a config (yaml) file with your table def, field distribution parameters, batching, and query patterns to generate a realistic workload (example invocation below)
  • For some reason, requires tons of threads to actually STRESS a cluster.
  • Notoriously mediocre docs / pretty complex. Use my data modeling tool to generate a yaml for your workload.
  • CASSANDRA-7468 gives us time-based execution for stress. Now you can set stress to run for X hours / days / etc.
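
A rough sketch of a 2.1 invocation driven by a profile yaml. The file name, query names, thread count, and node address are illustrative; check cassandra-stress help for the exact flags in your build:

# Illustrative: run a mixed workload defined in a profile yaml for 30 minutes
cassandra-stress user profile=./my_workload.yaml ops\(insert=1,read1=3\) \
    duration=30m -rate threads=200 -node 10.0.0.1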

Incremental Repairs CASSANDRA-5351

  • Avoid repairing already-repaired data by marking sstables as repaired, so new sstables only get repaired once.
  • Once you pop the top you can’t stop
  • Use -inc in your repairs (example below)
  • Is compatible with sync and parallel repairs
  • Missing data or corrupted sstables require a full repair
  • To manually mark an sstable as repaired, use tools/bin/sstablerepairedset --is-repaired <sstable>, but first verify using tools/bin/sstablemetadata and looking at repairedAt.
  • Impact on compaction:

    • Size tiered compacts repaired and un-repaired sstables separately.
    • Leveled does size tiered compaction on un-repaired tables.
    • DTCS, same as leveled?

  • Once you start inc you kind of have to keep doing it.
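
A minimal sketch of kicking off an incremental, parallel repair on a single table; the keyspace and table names are placeholders:

# Illustrative: incremental + parallel repair of one table
nodetool repair -par -inc my_keyspace my_table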

Separate flush directory CASSANDRA-6357

  • New configurable directory for flushing memtables: flush_directory
  • Idea is to place flushes on a different drive, like we do today with the commitlog. Why?

    • Keep flush writers from backing up due to io wait when under high load
    • Can avoid certain OOM edge cases under load

  • If you don’t configure this, it will default to the data directory, same as 2.0 (yaml sketch below)
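
A minimal cassandra.yaml sketch, assuming the setting name used on this slide (flush_directory) and an illustrative mount point; verify the exact key in the yaml shipped with your version:

# Assumed key name from this slide; the path is illustrative
flush_directory: /mnt/fast-ssd/cassandra/flush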

User Defined Types CASSANDRA-5590

  • Refresher from Summit:

    • Make a type, then use it in a table def
    • You cannot update only parts of a UDT value, you have to overwrite the whole thing every time.
    • You can add or rename fields on a UDT using ALTER TYPE, but you can’t delete fields.
    • You can select just some fields of a UDT, but C* will still pull and deserialize the whole thing.
    • You can, but probably don’t want to, use a UDT as a clustering column

User Defined Types CASSANDRA-5590

  • What UDT’s look like
 CREATE TYPE address (
      street text,
      city text,
      zip int
  );

  CREATE TABLE user_profiles (
      login text PRIMARY KEY,
      first_name text,
      last_name text,
      email text,
      addresses map<text, address>
  );

User Defined Types CASSANDRA-5590

  // Inserts a user with a home address
  INSERT INTO user_profiles(login, first_name, last_name, email, addresses)
  VALUES ('tsmith',
          'Tom',
          'Smith',
          'tsmith@gmail.com',
          { 'home': { street: '1021 West 4th St. #202',
                      city: 'San Francisco',
                      zip: 94110 }});

User Defined Types CASSANDRA-5590

  // Adds a work address for our user
  UPDATE user_profiles
     SET addresses = addresses
                   + { 'work': { street: '3975 Freedom Circle Blvd',
                                 city: 'Santa Clara',
                                 zip: 95050 }}
   WHERE login = 'tsmith';

  CREATE TYPE phone (
      number text,
      tags set<text>
  );

  // Add a 'phones' field to address that is a set of the 'phone' UDT above

  ALTER TYPE address ADD phones set<phone>;

Listen interface and RPC interface CASSANDRA-7417

  • Main Use Case: with vnodes and the *_interface settings, you can have one config file for the whole cluster (though I think you should still keep track of your tokens somewhere).
  • Can be useful if your machines have multiple NICs
  • Interface must map to a single address
  • Two new attributes in the yaml to specify listen and rpc interfaces.
  • Don’t set both an address and an interface; one will suffice
  • Notably, a broadcast interface option is missing…
#listen_interface: eth0
#rpc_interface: eth1

Cassandra 2.1 Improvements

Contents

  • Methodology
  • Dive in

Methodology

  • Top improvements by watchers + special requests.
  • Got feedback from SEs for ranking

counters++ CASSANDRA-6504

TL;DR:

2.0 counters == painful death — 2.1 counters still think twice.

Improvements and changes to counters:

  1. Replicate on write is gone; now we have COUNTER_MUTATION
  2. New configurable counter timeout
  3. New counter cache (both knobs sketched below)
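
A minimal cassandra.yaml sketch of the knobs behind items 2 and 3. The values shown are the 2.1 defaults as I recall them, so double-check your yaml:

# Counter write timeout (item 2) and counter cache (item 3)
counter_write_request_timeout_in_ms: 5000
counter_cache_size_in_mb:            # blank = auto-sized
counter_cache_save_period: 7200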

Note

There are more minor counter improvements in 2.1. And more pending for 3.0.

Warn on large batch sizes CASSANDRA-6487

Why? Monster batches hurt C*; even small unlogged batches don’t really help if they span partitions.

  • Configurable warning for large batches (yaml knob sketched below)
  • Backported to 2.0 so you may have seen this already.
  • For details read Ryan’s Blog
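
A minimal cassandra.yaml sketch; 5kb is the default threshold as I recall it, so adjust to taste:

# Warn in the logs when a single batch exceeds this size
batch_size_warn_threshold_in_kb: 5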

Compare and Delete (CAD) CASSANDRA-5832

Allow LWTs for DELETE.

Example:

DELETE FROM test1 WHERE k = 456 IF EXISTS;

(Backported to 2.0)

2i for Collections CASSANDRA-4511

  • These are still local indexes and have the same caveats as other secondary indexes.
  • Collections still can’t be part of your Primary Key

To create an index, use the usual syntax:

CREATE INDEX keyword_index ON test2 (keywords);

For lists and sets:

SELECT * FROM myTable WHERE tags CONTAINS 'awesome';

For Maps (note you can only index the keys not the values):

SELECT * FROM myTable WHERE tags CONTAINS KEY 'awesome';

For indexing map values or nested types, see CASSANDRA-6382 (future work).

Java 8 CASSANDRA-7028

  • Java 7 has been EOL since April…

Hot partitions CASSANDRA-7974

  • Have you ever had a customer issue and told them "maybe you have a wide row / hot partition" but had no way to prove it to them? Even if you could, what do you do next?
  • 2.1 has brand new tooling to identify hot partitions, accessible via nodetool and JMX. (Remind me to talk to the OpsCenter team about an alert for this)
nodetool toppartitions <keyspace> <table> <duration>
WRITES Sampler:
  Cardinality: ~0 (256 capacity)
  Top 10 partitions:
	Nothing recorded during sampling period...

READS Sampler:
  Cardinality: ~0 (256 capacity)
  Top 10 partitions:
	Nothing recorded during sampling period...

Don’t start without JNA CASSANDRA-6575

  • Historically, we couldn’t bundle JNA for licensing reasons but it’s now dual-licensed LGPL/APL.
  • CASSANDRA-5872 now gives us bundled JNA in 2.1

Impact:

  • You no longer have to worry about installing JNA separately.
  • Cassandra will actually fail to start if there’s an issue with JNA
  • Common problem: many customers lock down the /tmp directory, which prevents JNA from initializing.
  • Solution: Java option to redirect JNA to use a directory other than /tmp
JVM_OPTS="$JVM_OPTS -Djna.tmpdir=/var/tmp"

Outlaw running -pr and -local repairs together CASSANDRA-7450

  • Doing this results in missing some data in the repair
  • You will now see an error if you try this

Not sure why this is labeled as an Improvement. It’s a Bug! https://issues.apache.org/jira/browse/CASSANDRA-7450

Switch to logback CASSANDRA-5883

  • Logging is now configured in one of the following files (minimal snippet below):

logback-test.xml or logback.xml

  • You can still adjust levels on the fly with nodetool:
nodetool getlogginglevels

and

nodetool setlogginglevel <class> <level>
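
A minimal logback.xml sketch for bumping one package to DEBUG; the appender setup is omitted and the package name is just an example:

<configuration scan="true">
  <!-- Illustrative: raise one Cassandra package to DEBUG; keep your existing appenders -->
  <logger name="org.apache.cassandra.db.compaction" level="DEBUG"/>
  <root level="INFO"/>
</configuration>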

Bulk loading Improvements CASSANDRA-7405/CASSANDRA-3668

  • Production-ready COPY TO and COPY FROM. For a better UX and ~2x performance, check out Brian’s Loader, and watch and vote for CASSANDRA-9303
  • Parallel streaming for sstableloader: 2x improvement in early tests (will OpsCenter backup restore get wins here?)

cqlsh uses the python driver CASSANDRA-6307

  • Fewer thrift things

Netty performance things CASSANDRA-5663/CASSANDRA-6861/CASSANDRA-6236

  • Upgraded and optimized for Netty 4 (btw, this is what the native protocol / client communication uses)
  • Coalescing network packets: 2x improvement. Interesting but dense.
  • Disabled by default, depends on architecture, cloud, etc.

Cassandra 2.1 Bugs

Contents

  • Methodology
  • Dive in

Methodology

  • Focused on bugs present in 2.0 that are fixed in 2.1 (ignored bugs that were introduced in 2.1 and fixed)
  • Got feedback from SEs for ranking

Unique CF IDs CASSANDRA-5202

Q: Dropping and removing CF’s has historically caused problems in Cassandra, why?

A: Because the unique identifiers for tables were just the table name, so you could read old data in some edge cases.

Q: So is it safe to delete and recreate tables all the time?

A: Actually, it’s much worse now because concurrent creates can break your cluster and mess up your data (CASSANDRA-8387); also keep an eye out for CASSANDRA-9291

2.0:

/var/lib/cassandra/data/test/test$ ls
test-test-jb-1-CompressionInfo.db  test-test-jb-1-Index.db       test-test-jb-1-TOC.txt             test-test-jb-2-Filter.db      test-test-jb-2-Summary.db
test-test-jb-1-Data.db             test-test-jb-1-Statistics.db  test-test-jb-2-CompressionInfo.db  test-test-jb-2-Index.db       test-test-jb-2-TOC.txt
test-test-jb-1-Filter.db           test-test-jb-1-Summary.db     test-test-jb-2-Data.db             test-test-jb-2-Statistics.db

2.1:

/var/lib/cassandra/data/test/test-1262ef90ff3911e490a74f7807583d97# ls
test-test-ka-1-CompressionInfo.db  test-test-ka-1-Digest.sha1  test-test-ka-1-Index.db       test-test-ka-1-Summary.db
test-test-ka-1-Data.db             test-test-ka-1-Filter.db    test-test-ka-1-Statistics.db  test-test-ka-1-TOC.txt

Hadoop releases CASSANDRA-5201

TL;DR:

We now use Twitter’s Elephant Bird to support multiple versions of Hadoop.

This means people can use hadoop1 and hadoop2 with C* (backported to 2.0)

Consistent range movements CASSANDRA-2434

Potential pitfall

Historical Cassandra Edge Case:

Pre 2434:

  • There was a possibility for data loss if nodes were inconsistent at the time of bootstrap.
  • Bootstrapping multiple nodes at a time (in different racks) leads to consistency issues.
  • Bootstraps were done using the 2-minute rule

Post 2434:

  • Data loss edge case is fixed.
  • Bootstrap will throw an error when you bootstrap more than one node at a time (even after waiting 2 minutes for gossip to settle!)

Consistent range movements CASSANDRA-2434

Q: Darn, what if I’m in a hurry? Need capacity!!!

A: Never fear, like everything C* this is configurable.

Procedural TL;DR: Turn off consistent range movements. In your cassandra-env.sh set:

JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"

see DOC-362 for details.

This actually broke Paxos, which was fixed in CASSANDRA-8640 (backported to 2.0)

Hints & Metered Flusher private executors CASSANDRA-8285/CASSANDRA-8485

Put both hints and metered flushers on their own threads. That way they can’t get blocked!

Al Tobey has some awesome magic to make hints use DTCS

Flush on Truncate CASSANDRA-7511

In 2.0, the commitlog would continue to grow past its limit if you truncated a large table.

In 2.1, we now flush after TRUNCATE.

Multiple clustering column Where-In clause CASSANDRA-6875

In Thrift, you used to be able to select multiple specific clustering keys in a single query.

CQL now has parity. Given the following table:

CREATE TABLE test (
  k int,
  c1 int,
  c2 int,
  PRIMARY KEY (k, c1, c2)
);

Code:

SELECT * FROM test WHERE k = 0 AND (c1, c2) IN ((0, 0), (1,1)) ;

Not sure why this is labeled as a Bug! https://issues.apache.org/jira/browse/CASSANDRA-6875

Counter time bomb CASSANDRA-6405

TL;DR: 2.0 counters == painful death — 2.1 counters still think twice.

Idempotent: an operation that has the same outcome no matter how many times it is executed.

C* is good at idempotent operations and counters are not, and will never be, idempotent. As a result there are accuracy and performance issues with counters.

The 2.1 fixes address a little bit of both.

Counter time bomb CASSANDRA-6405

Reasons for bad counter performance:

  1. The majority of the CPU and IO required for counters does not occur on Write. It occurs later, on Replicate on Write. Keep an eye on this thread-pool.
  2. Counters do not get stored as a value, but rather as an increment on the existing value throughout much of the write path.
  3. When we reconcile counters we require a byte buffer that can hold the existing counter shards in sorted order, plus their union. Many of these at the same time consume resources exponentially.
  4. When a counter time bomb explodes, at random on a random node in your cluster, the side effects include high CPU utilization and a lot of garbage generation. Band-aids include pre-aggregating counters, using token-aware counter batches, the settings from 8150, and massive over-provisioning.

This ticket addresses incremental counter storage. As of 6405, we store final counter values in the commitlog. From the commitlog on, counters are idempotent.

The final counter fix will not get done until 3.x CASSANDRA-6506

Counter time bomb CASSANDRA-6405

Reasons for bad counter accuracy:

  1. In a distributed system, non-idempotent operations are not reliable
  2. The longer (in your write path) an operation is not idempotent, the more unreliable it will be. This means possibility of under or over counting.
  3. By making counters idempotent at the commitlog, counters become a lot more reliable than they used to be.
  4. Be careful making use case selections for counters. No money etc.
  5. You can err toward overcounting by always retrying failed increments, or toward undercounting by never retrying.

DTCS Improvement CASSANDRA-8243

One of the big hitters of DTCS is that if you have TTL, you can expire entire SSTables for free (as in beer) thanks to CASSANDRA-5228

Unfortunately, the conditions for expiry in 2.0 were unnecessarily strict, limiting this significantly.

8243 loosens up the condition to maximize free sstable expiry! (backported to 2.0)

Note: For details on DTCS see Shooky’s Guide

Not sure why this is labeled as a bug! https://issues.apache.org/jira/browse/CASSANDRA-8243

10k column families CASSANDRA-6977

TL;DR:

¡¡¡It’s still a bad idea to have thousands of tables in C*!!!

Too many tables results in performance degradation for the table-creation operation, regardless of whether the tables are split across separate keyspaces.

With 6977 performance goes from:

2.0

  • Linear wrt the total number of tables in the schema

to 2.1

  • Linear wrt the number of tables in a keyspace

    Q: So is it safe to create a ton of tables in C* for Multi-Tenant?

    A: No! The memory allocation per table, the schema_columns issues / CASSANDRA-9291, the compaction and repair pains still apply. Don’t do it!

Tuple type CASSANDRA-7248

Sylvain added this for the WHERE ... IN Thrift-parity ticket CASSANDRA-6875

You can use Tuples in tables now:

CREATE TABLE tupletable (k int PRIMARY KEY, t tuple<int, text, double>);
INSERT INTO tupletable (k, t) VALUES (0, (3, 'foo', 3.4));

Not sure why this is labeled as a Bug! https://issues.apache.org/jira/browse/CASSANDRA-7248

Cassandra 2.1 Summary

Summary

  • There were a lot of changes in 2.1, many to be excited about
  • From early field testing (Al Tobey):

    • Big IO and CPU improvements for 2.1 especially for writes
    • For the first time, C* hits other bottlenecks (not IO or CPU) like hints or even network!
    • G1GC shows better and better results for C*, and lately I heard a great review for a Search workload. C* 3.0 is looking at Al’s PR

THANK YOU

  • To you for listening
  • To the devs for giving me squashed bugs to study
  • To Al, Kennedy, Hess, Rich Reffner, Avinash, and others for helping me prioritize this monster list.
