Lessons from >100 Startups
Contents
Overview
Perks
Join over 500 startups worldwide. Apply now at datastax.com/startups
Stats about the program
Starting Cassandra/DSE
Cannot access cqlsh
Connection error: ('Unable to connect to any servers', {'127.0.0.1': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
Is the process running?
ps -ef | grep dse
ps -ef | grep cassandra
If you restart the process with
sudo service dse restart
or
dse cassandra
and it still doesn't appear in your process list, it did not come up successfully, so check your system.log. Common causes include:
There is one scenario where the process won’t start and you won’t see anything in your logs!
Is your disk full?
$ df
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/xvda1     848321208 848321208         0 100% /
udev             3806788         8   3806780   1% /dev
tmpfs            1525892       212   1525680   1% /run
none                5120         0      5120   0% /run/lock
none             3814728       112   3814616   1% /run/shm
/dev/xvdb      433455904    203012 411234588   1% /mnt/ephemeral
cgroup           3814728         0   3814728   0% /sys/fs/cgroup
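A full root volume is easy to script a check for. Below is a small sketch (not from the original deck) that parses standard `df` output and flags any filesystem at or above a usage threshold:

```python
# Sketch: flag nearly-full filesystems from `df` output, since Cassandra
# often fails to start (sometimes silently) when the data disk is full.
# Field layout assumes standard GNU coreutils `df` columns.
def nearly_full(df_output, threshold=90):
    flagged = []
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue
        use_pct = int(fields[4].rstrip('%'))  # the "Use%" column
        if use_pct >= threshold:
            flagged.append((fields[5], use_pct))  # (mount point, use%)
    return flagged

sample = """Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/xvda1     848321208 848321208         0 100% /
/dev/xvdb      433455904    203012 411234588   1% /mnt/ephemeral"""
print(nearly_full(sample))  # [('/', 100)]
```

To use it on a live node, feed it the real output, e.g. `subprocess.check_output(['df']).decode()`, or wire it into whatever monitoring you already run.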
Why is my commitlog so big?
DSE is up but I still can’t connect!
1) Why all these addresses?
2) Firewall (OS and Cloud security groups)?
Telnet is your friend.
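If telnet isn't handy, the same reachability check is a few lines of script. A sketch (the throwaway local listener below just stands in for a node's native port, 9042):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Telnet-style check: can we open a TCP connection to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo against a local listener standing in for a Cassandra node:
server = socket.socket()
server.bind(('127.0.0.1', 0))      # OS assigns a free port
server.listen(1)
port = server.getsockname()[1]
reachable = port_open('127.0.0.1', port)   # True: something is listening
server.close()
refused = port_open('127.0.0.1', port)     # False: connection refused
print(reachable, refused)
```

Against a real cluster you would call `port_open(node_ip, 9042)` for the native protocol, and likewise 9160 (Thrift) or 7000/7001 (internode) depending on which path you are debugging.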
Pro tip- Check out the Listen and RPC interfaces
3) Is your auth keyspace replicated?
4) SSL setup? Cassandra and DSE security can be tricky to set up. Use this shortcut (currently a work in progress).
Avoid cryptic errors by setting up your OS/kernel system settings.
dstat -s 10
will show swap usage if swap is not off. Use the preflight check in DSE to check for these things.
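For reference, a sketch of the commonly published OS recommendations for Cassandra/DSE nodes (check the DataStax documentation for your version before applying; the values below are the usual published recommendations, not from this deck):

```
# Disable swap entirely
sudo swapoff --all          # and remove swap entries from /etc/fstab

# /etc/sysctl.conf
vm.max_map_count = 1048575

# /etc/security/limits.conf (for the user running DSE/Cassandra)
cassandra - memlock unlimited
cassandra - nofile  100000
cassandra - nproc   32768
cassandra - as      unlimited
```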
DSE has a lot of configurables, which ones should I use and when?
Compaction levers
Goal: Ensure that you don't fall behind on compactions (indicated by increasing pending compactions in nodetool compactionstats) while also minimizing the impact of compactions on reads and writes.
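One way to watch for falling behind from a script; a sketch that assumes the `pending tasks: N` line format printed by `nodetool compactionstats` in Cassandra 2.x (verify the format on your version):

```python
import re

def pending_compactions(output):
    """Parse the pending-task count out of `nodetool compactionstats` output.
    Assumes the 'pending tasks: N' line used by Cassandra 2.x."""
    m = re.search(r'pending tasks:\s*(\d+)', output)
    return int(m.group(1)) if m else None

sample = ("pending tasks: 12\n"
          "   compaction type   keyspace   table   completed   total   unit\n")
print(pending_compactions(sample))  # 12
```

Trend this number over time: a value that climbs steadily and never drains is the signal to raise compactor threads or throughput.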
Concurrent compactors:
export target_compactors=2
wget https://jmxsh.googlecode.com/files/jmxsh-R5.jar
wget https://jmxsh.googlecode.com/files/jmxsh
echo "jmx_set -m org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads $target_compactors" > changeCoreCompactors.sh
echo "jmx_set -m org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads $target_compactors" > changeMaxCompactors.sh
java -jar jmxsh-R5.jar -h localhost -p 7199 -q changeCoreCompactors.sh
java -jar jmxsh-R5.jar -h localhost -p 7199 -q changeMaxCompactors.sh
Compaction throttling:
nodetool getcompactionthroughput
- default: 16 MB/s
nodetool setcompactionthroughput
or compaction_throughput_mb_per_sec in cassandra.yaml
Compaction levers
Use the right compaction strategy for your use case:
Tombstone levers
Goal: Increase read performance and reclaim disk space while avoiding zombie data. Reminder: tombstones come from deletes, from updating collections, or from TTL expirations.
1) gc_grace_seconds - can be decreased in order to make tombstones available for deletion sooner. Remember to run repairs more often than gc_grace_seconds (configurable by table) or risk zombie data.
2) Confirm using nodetool cfstats
and sstablemetadata <sstable filenames>
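As an illustration, lowering gc_grace_seconds is a per-table ALTER (the keyspace and table names here are made up; 86400 = 1 day versus the 864000-second default):

```sql
-- Sketch: shrink the tombstone GC window on one table.
-- You must then complete repairs more often than once per 86400 s,
-- or deleted data can come back to life.
ALTER TABLE my_ks.events WITH gc_grace_seconds = 86400;
```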
Emergency tactic: Reading a row with millions of tombstones can hurt cluster performance. Drop the row if you can find it.
1) Find the most frequently opened file:
wget https://raw.githubusercontent.com/brendangregg/perf-tools/master/opensnoop
chmod +x opensnoop
echo 'ctrl-c after a while' && sudo ./opensnoop | grep db > files.txt
cat files.txt | awk '{ print $4 }' | sort | uniq -c | sort
2) Confirm that it’s the culprit:
sstablemetadata <filename>
3) Find the bad row (This is Russ Bradberry’s code):
from collections import Counter

# rows: the parsed JSON output of sstable2json for the suspect SSTable
keys = Counter()
for row in rows:
    for cell in row.get('cells', []):
        if len(cell) > 3 and cell[3] == 't':  # 't' flags a tombstone cell
            keys[row['key']] += 1
4) Drop the row and watch your cluster get better. Tweet about it
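Step 2 above can also be scripted; a sketch that pulls the "Estimated droppable tombstones" line out of `sstablemetadata` output (the sample text mimics the tool's format):

```python
import re

def droppable_tombstone_ratio(output):
    """Extract 'Estimated droppable tombstones' from `sstablemetadata` output."""
    m = re.search(r'Estimated droppable tombstones:\s*([\d.]+)', output)
    return float(m.group(1)) if m else None

sample = """SSTable: /var/lib/cassandra/data/ks/events/ks-events-ka-42
Estimated droppable tombstones: 0.87
Minimum timestamp: 1408928234
"""
print(droppable_tombstone_ratio(sample))  # 0.87
```

A ratio near 1.0 on the most frequently opened file is strong evidence you have found the tombstone-heavy culprit.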
Levers for bootstraps
nodetool getstreamthroughput
nodetool setstreamthroughput
Contingency strategy:
Reboot with auto_bootstrap set to false, then repair your way to the end.
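The contingency above boils down to one cassandra.yaml setting on the joining node (a sketch; run repairs afterwards to pull the node's data in):

```yaml
# cassandra.yaml on the new node: join the ring without streaming,
# then run `nodetool repair` to backfill the data it owns.
auto_bootstrap: false
```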
Levers for bootstraps
"Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true"
Consistent range movements edge case CASSANDRA-2434 and CASSANDRA-7069
I need to grow my cluster fast, what do I do?
JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"
Levers for DSE Search
1) Search monitoring
2) Indexing performance
3) Query time performance
This is also a sizing conversation, indexing performance scales with CPU cores and query time performance scales with RAM.
Levers for DSE Search
What to monitor?
Levers for DSE Search
What to tune for indexing perf?
Levers for DSE Search
What to tune for query perf?
1) Turn off Term Vector information if you’re not using highlighting or other functionality that relies on it:
• termVectors="false"
• termPositions="false"
• termOffsets="false"
2) Turn on omit norms if you’re not using Boosts:
• omitNorms="true"
3) Only index fields you intend to search. As mentioned above, you don’t have to index all your fields.
From what I’ve seen, term vector and norms data can be a substantial percentage of your index (~50%).
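Put together, a schema.xml field tuned this way might look like the sketch below (the field name and type are placeholders, not from this deck):

```xml
<!-- Sketch: a searchable field with term vectors off and norms omitted -->
<field name="body" type="TextField" indexed="true" stored="true"
       termVectors="false" termPositions="false" termOffsets="false"
       omitNorms="true"/>
```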
Levers for DSE Search
What about the JVM?
In DSE Search, solr and cassandra run in the same JVM. If you’re running RT, you should expect significant heap pressure from search indexing.
Use 20 GB heaps with G1GC configured; G1 is almost as good as a perfectly tuned CMS, and you don’t have to know the black magic of CMS tuning.
G1GC levers-
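As a starting point, the G1 setup described above might look like this in cassandra-env.sh (a sketch; the 500 ms pause target is a common starting value, not a recommendation from this deck):

```
MAX_HEAP_SIZE="20G"
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
```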
Levers for DSE Search
Common reason for OOM - Don’t abuse dynamic fields
Want to find out if you are abusing dynamic fields? check luke
Consider a solr side join for this scenario
Levers for DSE Analytics
Read Russ’s blog posts.
Levers for DSE Analytics
Levers for stability and for running tasks consistently:
Levers for DSE Analytics
For reads:
Node locality:
Introspect partitions and preferred locations:
val rdd = sc.cassandraTable("test_ks","test_table") rdd.partitions.foreach(part => println (rdd.getPreferredLocations(part)))
Levers for DSE Analytics
For writes:
Levers for DSE Analytics
Where Operations should be in your Chain of RDD Operations
Placement: earliest (left) to latest (right)

Type of operation                    Examples
Cassandra RDD specific               where, select
Filters on the Spark side            filter, sample
Independent transforms               map, mapByPartition, keyBy
Per-partition combinable transforms  reduceByKey, aggregateByKey
Full shuffle operations              groupByKey, join, sort, shuffle
Levers for DSE Analytics
Other performance wins:
Levers for DSE Analytics
Spark 1.4 (which ships with DSE 4.8) gives us:
Levers for DSE Analytics
Read these:
Historical monitoring
Configure opscenter
Import / Export feature blog post: www.datastax.com/dev/blog/opscenter-5-2-dashboard-importexport-labs-feature
Real-time monitoring
dstat -rvn 10
- 10 second intervals of OS statistics
wget https://bintray.com/artifact/download/aragozin/generic/sjk-plus-0.3.6.jar
java -jar sjk-plus-0.3.6.jar ttop -s localhost:7199 -n 30 -o CPU
java -jar sjk-plus-0.3.6.jar gc -s localhost:7199
nodetool cfhistograms
nodetool proxyhistograms
nodetool cfstats
and sstablemetadata
sstable2json utilities
Benchmark your datamodel
Get your main table to scale and perform
Baseline and predictability for when to add nodes
Take your app out of the equation
Leverage your resources
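To take your app out of the equation, cassandra-stress with a user profile can replay a workload against your own schema. A sketch (the profile file and node address are placeholders; check the cassandra-stress documentation for the profile format in your version):

```
# stress.yaml describes your keyspace, table, and column distributions
cassandra-stress user profile=stress.yaml ops\(insert=1,read1=3\) -node 10.0.0.1
```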
THANK YOU