<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Distributed Musings]]></title><description><![CDATA[Distributed Musings]]></description><link>https://www.sestevez.com/</link><image><url>http://www.sestevez.com/favicon.png</url><title>Distributed Musings</title><link>https://www.sestevez.com/</link></image><generator>Ghost 2.38</generator><lastBuildDate>Wed, 15 Apr 2026 10:51:41 GMT</lastBuildDate><atom:link href="https://www.sestevez.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Agency Swarm with Third Party and Open Source Models]]></title><description><![CDATA[I have recently been answering a bunch of questions from agency swarm users looking to leverage Astra Assistants. Agency swarm is built on top of OpenAI's Assistants API, and folks have been voicing a desire to use it with other model providers and with open source models.]]></description><link>https://www.sestevez.com/agency-swarm-other-models/</link><guid isPermaLink="false">668f42c785031e0ab48728de</guid><category><![CDATA[assistants api]]></category><category><![CDATA[agents]]></category><category><![CDATA[agency swarm]]></category><category><![CDATA[genai]]></category><category><![CDATA[ollama]]></category><category><![CDATA[anthropic]]></category><category><![CDATA[groq]]></category><category><![CDATA[gemini]]></category><category><![CDATA[mistral]]></category><category><![CDATA[deepseek]]></category><dc:creator><![CDATA[Sebastian Estevez]]></dc:creator><pubDate>Thu, 11 Jul 2024 04:38:40 GMT</pubDate><media:content url="http://www.sestevez.com/content/images/2024/07/Screenshot-from-2024-07-11-00-23-02.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.sestevez.com/content/images/2024/07/Screenshot-from-2024-07-11-00-23-02.png" 
alt="Agency Swarm with Third Party and Open Source Models"><p>I have recently been answering a bunch of questions from <a href="https://github.com/VRSEN/agency-swarm">agency swarm</a> users looking to leverage <a href="https://github.com/datastax/astra-assistants-api">Astra Assistants</a>. </p><p>They fall into the following two categories:</p><!--kg-card-begin: markdown--><ul>
<li>I want to run agency swarm with other model providers via API (Anthropic, Google, Groq, etc.)</li>
<li>I want to run agency swarm with open source models on my local machine / infrastructure</li>
</ul>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h2 id="agencywhat">Agency what?</h2>
<p>Agency swarm is a popular multi-agent framework built on top of OpenAI's Assistants API.</p>
<p>Folks have been voicing a desire to use it with other model providers and with open source models. They even recommend <a href="https://vrsen.github.io/agency-swarm/advanced-usage/open-source-models/">Astra Assistants in their docs</a>.</p>
<p>I watched some of <a href="https://www.youtube.com/@vrsen">VRSEN's</a> videos on YouTube, and some of them remind me of the phrase &quot;the future is here, it's just not evenly distributed&quot;.</p>
<p>He talks about automating business processes not just with individual AI agents but with groups of them (which he calls agencies), and yet everything he builds stays grounded in reality.</p>
<p>There are a couple of things about his framework that I deeply agree with:</p>
<ol>
<li>No hard coded framework prompts</li>
<li>Pydantic / Instructor powered type checking for tool creation</li>
<li>Commitment to OpenAI's Assistants API as the right level of abstraction both for setting up and scaling agents</li>
</ol>
<p>If you haven't checked out agency-swarm, I recommend having a look at the GitHub repo and some of VRSEN's videos.</p>
<p>If you're here to find out how to set up Agency Swarm to work with Astra Assistants, read on:</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h2 id="otherprovidersviaapi">Other providers via API</h2>
<!--kg-card-end: markdown--><p>If all you want to do is use agency-swarm with another provider via API (for example, Anthropic), simply set up your .env file with that provider's API key and wrap your OpenAI client as in the following <a href="https://github.com/datastax/astra-assistants-api/blob/main/examples/python/agency-swarm/third_party_provider_apis.py">sample code</a>:</p><!--kg-card-begin: markdown--><pre><code>from openai import OpenAI
from astra_assistants import patch
from agency_swarm import Agent, Agency, set_openai_client
from dotenv import load_dotenv

load_dotenv(&quot;./.env&quot;)
load_dotenv(&quot;../../../.env&quot;)

client = patch(OpenAI())

set_openai_client(client)

ceo = Agent(name=&quot;CEO&quot;,
            description=&quot;Responsible for client communication, task planning, and management.&quot;,
            instructions=&quot;Please communicate with users and other agents.&quot;,
            model=&quot;anthropic/claude-3-haiku-20240307&quot;,
            # model=&quot;gpt-3.5-turbo&quot;,
            files_folder=&quot;./examples/python/agency-swarm/files&quot;,
            tools=[])

agency = Agency([ceo])

assistant = client.beta.assistants.retrieve(ceo.id)
print(assistant)

completion = agency.get_completion(&quot;What's something interesting about language models?&quot;)
print(completion)
</code></pre>
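The model string above follows LiteLLM's `provider/model` naming convention, which Astra Assistants uses to route completions; a bare name like `gpt-3.5-turbo` is treated as an OpenAI model. As a rough illustration of how such identifiers decompose (hypothetical helper, not part of agency-swarm or astra-assistants):

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    # "anthropic/claude-3-haiku-20240307" -> ("anthropic", "claude-3-haiku-20240307")
    provider, sep, name = model_id.partition("/")
    if not sep:
        # bare names like "gpt-3.5-turbo" default to OpenAI
        return ("openai", provider)
    return (provider, name)
```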
<!--kg-card-end: markdown--><p>Your <a href="https://github.com/datastax/astra-assistants-api/blob/main/client/.env.bkp">.env file</a> may look like this:</p><!--kg-card-begin: markdown--><pre><code>#!/bin/bash

# AstraDB -&gt; https://astra.datastax.com/ --&gt; tokens --&gt; administrator user --&gt; generate
export ASTRA_DB_APPLICATION_TOKEN=&quot;&quot;

# OpenAI Models - https://platform.openai.com/api-keys --&gt; create new secret key
export OPENAI_API_KEY=&quot;fake&quot;

# Anthropic claude models - https://console.anthropic.com/settings/keys
export ANTHROPIC_API_KEY=&quot;&lt;insert key here&gt;&quot;
</code></pre>
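The patched client reads these keys from the process environment (python-dotenv loads the file) and forwards them to the service as request headers. Which providers you can call depends on which keys are set; conceptually (simplified, hypothetical helper):

```python
import os

# a couple of the provider keys from the .env above, for illustration
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}

def available_providers(env=None):
    # a provider is usable only if its API key is set and non-empty
    env = os.environ if env is None else env
    return sorted(p for p, var in PROVIDER_KEYS.items() if env.get(var))
```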
<!--kg-card-end: markdown--><p>A note on architecture: you do not have to run the Astra Assistants backend yourself; by default, the client library points you at the astra-assistants API hosted by DataStax. However, the code is open source (Apache 2 licensed), so you can self-host if you prefer.</p><!--kg-card-begin: markdown--><h2 id="localmodels">Local Models</h2>
<!--kg-card-end: markdown--><p>If you're running inference locally or in your own private infrastructure, you will have to run the Astra Assistants backend yourself so that it can point to your inference server for completions.</p><p>The simplest approach is to use ollama and leverage the <a href="https://github.com/datastax/astra-assistants-api/tree/main/examples/ollama">docker-compose yamls in the Astra Assistants repo</a>.</p><p>There are two versions, with and without GPU support. We'll look at GPU support since it's more performant and slightly more complex. See the docker-compose.yaml below:</p><!--kg-card-begin: markdown--><pre><code>version: '3.8'

services:
  ollama:
    image: ollama/ollama
    ports:
      - &quot;11434:11434&quot;
    networks:
      - my_network
    volumes:
      - ~/.ollama:/root/.ollama  #map to local volume to keep models
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [ gpu ]
    environment:
      NVIDIA_VISIBLE_DEVICES: &quot;all&quot;  # or specify the GPU IDs
    runtime: nvidia  # Specify the runtime for NVIDIA GPUs

  assistants:
    image: datastax/astra-assistants
    ports:
      - &quot;8080:8000&quot;
    networks:
      - my_network
    depends_on:
      - ollama


networks:
  my_network:
    driver: bridge
</code></pre>
<!--kg-card-end: markdown--><p>Notice the networks section, which ensures that your containers can talk to each other. You can verify this is working by exec'ing into the assistants container and running:</p><!--kg-card-begin: markdown--><pre><code>curl http://ollama:11434
</code></pre>
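The same connectivity check can be scripted from Python with just the standard library (a sketch; the hostname assumes the docker-compose service name above, and ollama's root endpoint answers 200 when it is up):

```python
import urllib.request

def ollama_reachable(base_url="http://ollama:11434", timeout=2.0):
    # returns True if the ollama endpoint answers with HTTP 200;
    # a connection error suggests the container network is misconfigured
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```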
<!--kg-card-end: markdown--><p>Note: in this setup you need to point to ollama in your application code using the LLM-PARAM-base-url header as per this <a href="https://github.com/datastax/astra-assistants-api/blob/main/examples/python/agency-swarm/local_open_source_models.py">example</a> when you wrap the client:</p><!--kg-card-begin: markdown--><pre><code>from openai import OpenAI
from astra_assistants import patch
from agency_swarm import Agent, Agency, set_openai_client
from dotenv import load_dotenv

load_dotenv(&quot;./.env&quot;)
load_dotenv(&quot;../../../.env&quot;)

# client = patch(OpenAI(default_headers={&quot;LLM-PARAM-base-url&quot;: &quot;http://localhost:11434&quot;}))
# if using docker-compose, pass custom header to point to the ollama container instead of localhost
client = patch(OpenAI(default_headers={&quot;LLM-PARAM-base-url&quot;: &quot;http://ollama:11434&quot;}))

set_openai_client(client)

ceo = Agent(name=&quot;CEO&quot;,
            description=&quot;Responsible for client communication, task planning, and management.&quot;,
            instructions=&quot;Please communicate with users and other agents.&quot;,
            model=&quot;ollama_chat/deepseek-coder-v2&quot;, # ensure that the model has been pulled in ollama
            files_folder=&quot;./examples/python/agency-swarm/files&quot;,
            tools=[])

agency = Agency([ceo])

assistant = client.beta.assistants.retrieve(ceo.id)
print(assistant)

completion = agency.get_completion(&quot;What's something interesting about language models?&quot;)
print(completion)
</code></pre>
<!--kg-card-end: markdown--><p>If you were running ollama and astra-assistants directly on your host (or with docker using host networking) you would point to localhost:</p><!--kg-card-begin: markdown--><pre><code>default_headers={&quot;LLM-PARAM-base-url&quot;: &quot;http://localhost:11434&quot;}
</code></pre>
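Rather than hard-coding the address, you can resolve it from the environment and fall back to localhost. A small sketch (hypothetical helper for illustration; `OLLAMA_API_BASE_URL` is assumed as the override variable):

```python
import os

def ollama_base_url(default="http://localhost:11434"):
    # hypothetical helper: prefer an environment override, else localhost
    return os.environ.get("OLLAMA_API_BASE_URL", default)

def llm_param_headers():
    # headers to pass as default_headers when constructing the OpenAI client
    return {"LLM-PARAM-base-url": ollama_base_url()}
```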
<!--kg-card-end: markdown--><p>UPDATE: In astra-assistants 2.0.13 I added support for an <code>OLLAMA_API_BASE_URL</code> environment variable, which replaces the LLM-PARAM-base-url setting. Not only is the env var more convenient, it also works with complex agencies that mix ollama models and API-provider models.</p><!--kg-card-begin: markdown--><h2 id="noteonlitellm">Note on LiteLLM</h2>
<p>Adding this note because I have been asked this question multiple times. LiteLLM proxy with Astra Assistants will be supported when <a href="https://github.com/BerriAI/litellm/pull/4118/">this PR</a> gets merged.</p>
<p>That said, Astra Assistants uses litellm as a library to route LLM completions, so the proxy is not strictly necessary to get agency swarm working with other models.</p>
<p>Note that other features of LiteLLM Proxy, such as cost tracking, will not be available with this method.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[What is Astra Assistants]]></title><description><![CDATA[Astra Assistants is a drop-in replacement for OpenAI's Assistants API that supports third-party LLMs and embedding models and uses AstraDB / Apache Cassandra for persistence and ANN. You can use our managed service on Astra or you can host it yourself since it's open source.]]></description><link>https://www.sestevez.com/what-is-astra-assistants/</link><guid isPermaLink="false">668eae9585031e0ab487277e</guid><category><![CDATA[openai]]></category><category><![CDATA[genai]]></category><category><![CDATA[assistants api]]></category><category><![CDATA[astra assistants]]></category><dc:creator><![CDATA[Sebastian Estevez]]></dc:creator><pubDate>Thu, 11 Jul 2024 02:22:40 GMT</pubDate><media:content url="http://www.sestevez.com/content/images/2024/07/create_assistant-2.gif" medium="image"/><content:encoded><![CDATA[<img src="http://www.sestevez.com/content/images/2024/07/create_assistant-2.gif" alt="What is Astra Assistants"><p>Last November, the week before my daughter was born, OpenAI released the Assistants API. I tried building an app with it and was impressed with the simplicity and power of the abstraction, so I decided to start sleep deprivation early and built v0 of <code><a href="https://github.com/datastax/astra-assistants-api">astra-assistants</a></code>.</p><p>Astra Assistants is a drop-in replacement for OpenAI's Assistants API that supports third-party LLMs and embedding models and uses AstraDB / Apache Cassandra for persistence and ANN. You can use our managed service on <a href="https://docs.datastax.com/en/astra-db-serverless/tutorials/astra-assistants-api.html">Astra</a> or you can host it yourself since it's open source.</p><!--kg-card-begin: markdown--><p><strong>Note:</strong> If you like the project please give us a GitHub star! 
<a href="https://github.com/datastax/astra-assistants-api/stargazers"><img src="https://img.shields.io/github/stars/datastax/astra-assistants-api?style=social" alt="What is Astra Assistants"></a> and join us on Discord! <a href="https://discord.gg/MEFVXUvsuy"><img src="https://img.shields.io/static/v1?label=Chat%20on&amp;message=Discord&amp;color=blue&amp;logo=Discord&amp;style=flat-square" alt="What is Astra Assistants"></a></p>
<!--kg-card-end: markdown--><p>As you can see below, you can simply patch your OpenAI client with the assistants client library and pick your model. This will point your app at our managed astra-assistants service instead of at OpenAI.</p><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://www.sestevez.com/content/images/2024/07/create_assistant.gif" class="kg-image" alt="What is Astra Assistants"><figcaption>astra assistants works with third-party models like claude-3-5-sonnet, gemini 1.5, command-r-plus, and even local ollama models with a single line of code</figcaption></figure><!--kg-card-end: image--><p>Astra Assistants will automatically route LLM and embedding calls to your model provider of choice using <a href="https://github.com/BerriAI/litellm">LiteLLM</a>, and it will persist your threads, messages, assistants, vector_stores, files, etc. to <a href="https://astra.datastax.com/">AstraDB</a>. File search leverages AstraDB's vector functionality powered by <a href="https://github.com/jbellis/jvector">jvector</a>.</p><p>For authentication, you must provide the corresponding API keys for your model provider(s). We recommend using environment variables in a <a href="https://github.com/datastax/astra-assistants-api/blob/main/client/.env.bkp">.env</a> file, which automatically get picked up and sent to the astra-assistants service as HTTP request headers by the astra-assistants client library when you patch the OpenAI SDK.</p><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://www.sestevez.com/content/images/2024/07/environment_vars.png" class="kg-image" alt="What is Astra Assistants"><figcaption>sample .env file with some API keys</figcaption></figure><!--kg-card-end: image--><!--kg-card-begin: markdown--><h2 id="architecture">Architecture</h2>
<!--kg-card-end: markdown--><p>Astra Assistants is a python project built on fastapi that implements the backend for Assistants API using the Cassandra python driver and <a href="https://github.com/BerriAI/litellm">LiteLLM</a>.</p><!--kg-card-begin: image--><figure class="kg-card kg-image-card"><img src="http://www.sestevez.com/content/images/2024/07/Astra-Assistants-Architecture---Page-1--1-.png" class="kg-image" alt="What is Astra Assistants"></figure><!--kg-card-end: image--><p>If you run astra-assistants yourself you can even point to your<a href="https://github.com/datastax/astra-assistants-api/tree/main/examples/ollama"> local ollama setup </a>for use with open source models.</p><!--kg-card-begin: markdown--><h2 id="releaseandimprovements">Release and improvements</h2>
<!--kg-card-end: markdown--><p>We launched the service on November 15th 2023:</p><!--kg-card-begin: bookmark--><figure class="kg-card kg-bookmark-card kg-card-hascaption"><a class="kg-bookmark-container" href="https://www.datastax.com/blog/introducing-the-astra-assistants-api"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Introducing the Astra Assistants API | DataStax</div><div class="kg-bookmark-description">Learn about the new Astra Assistants API</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://www.datastax.com/favicon-32x32.png" alt="What is Astra Assistants"><span class="kg-bookmark-author">Sebastian Estevez</span><span class="kg-bookmark-publisher">DataStax</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://cdn.sanity.io/images/bbnkhnhl/production/cb796cf1be66d5579c52d97344c5bfea727cef5b-1460x968.jpg" alt="What is Astra Assistants"></div></a><figcaption>astra assistants hosted service was announced on November 16th</figcaption></figure><!--kg-card-end: bookmark--><p>We added streaming support in February 2024 (before OpenAI):</p><!--kg-card-begin: bookmark--><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://www.datastax.com/blog/astra-assistants-api-now-supports-streaming-because-who-wants-to-wait"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Astra Assistants API Now Supports Streaming: Because Who Wants to Wait? 
| DataStax</div><div class="kg-bookmark-description">DataStax announces support for OpenAI style streaming runs in Astra Assistants--it is available both in the managed service and the open source codebase.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://www.datastax.com/favicon-32x32.png" alt="What is Astra Assistants"><span class="kg-bookmark-author">Sebastian Estevez</span><span class="kg-bookmark-publisher">DataStax</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://cdn.sanity.io/images/bbnkhnhl/production/90fba45a7ed351ca2571036922af73ad5401a72a-1460x968.jpg" alt="What is Astra Assistants"></div></a></figure><!--kg-card-end: bookmark--><p>We open sourced the server side code in March of 2024</p><!--kg-card-begin: bookmark--><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://www.datastax.com/blog/astra-assistants-api-is-open-source"><div class="kg-bookmark-content"><div class="kg-bookmark-title">The Astra Assistants API Is Now Open Source | DataStax</div><div class="kg-bookmark-description">We’re excited to announce that the Astra Assistants API server, our drop-in replacement for the OpenAI Assistants API, is now open source.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://www.datastax.com/favicon-32x32.png" alt="What is Astra Assistants"><span class="kg-bookmark-author">Sebastian Estevez</span><span class="kg-bookmark-publisher">DataStax</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://cdn.sanity.io/images/bbnkhnhl/production/70d15ed1c6bb9e9ad85690782b00d948c116b94c-1460x968.jpg" alt="What is Astra Assistants"></div></a></figure><!--kg-card-end: bookmark--><p></p><p>And we added support for assistants v2 (including vector_stores) in <a href="https://www.linkedin.com/posts/sestevez_astra-assistants-v2-update-its-been-a-activity-7202537987807592448-CFHR?utm_source=share&amp;utm_medium=member_desktop">June of 
2024</a>.</p><!--kg-card-begin: markdown--><h2 id="conclusion">Conclusion</h2>
<!--kg-card-end: markdown--><p>It's been a ton of fun working on Astra Assistants, and I'll continue to post updates here, so stay tuned!</p><!--kg-card-begin: markdown--><p>If you like the project please give us a GitHub star! <a href="https://github.com/datastax/astra-assistants-api/stargazers"><img src="https://img.shields.io/github/stars/datastax/astra-assistants-api?style=social" alt="What is Astra Assistants"></a> and join us on Discord! <a href="https://discord.gg/MEFVXUvsuy"><img src="https://img.shields.io/static/v1?label=Chat%20on&amp;message=Discord&amp;color=blue&amp;logo=Discord&amp;style=flat-square" alt="What is Astra Assistants"></a></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Connecting to Astra from DataGrip via JDBC]]></title><description><![CDATA[<p>A few users have been asking me lately about connecting to DataStax Astra from different developer tools. As a result, I am planning to do a series of quick posts around these, starting with IntelliJ DataGrip. A big thank you to Donnie Roberson and Nick Panahi for their help getting the</p>]]></description><link>https://www.sestevez.com/astra-datagrip/</link><guid isPermaLink="false">5f244131c478d72871fb8958</guid><category><![CDATA[astra]]></category><category><![CDATA[cassandra]]></category><category><![CDATA[datagrip]]></category><category><![CDATA[intellij]]></category><category><![CDATA[developer tools]]></category><category><![CDATA[jdbc]]></category><category><![CDATA[datastax-java-driver]]></category><category><![CDATA[dbaas]]></category><dc:creator><![CDATA[Sebastian Estevez]]></dc:creator><pubDate>Fri, 31 Jul 2020 18:16:35 GMT</pubDate><media:content url="http://www.sestevez.com/content/images/2020/07/dg-astra.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.sestevez.com/content/images/2020/07/dg-astra.png" alt="Connecting to Astra from DataGrip via JDBC"><p>A few users have been asking me lately about connecting to DataStax Astra from different developer tools. As a result, I am planning to do a series of quick posts around these, starting with IntelliJ DataGrip. A big thank you to Donnie Roberson and Nick Panahi for their help getting the material ready for these posts.</p><p>Although I'm a fan of IDEA, GoLand, and CLion, I confess I had not played with DataGrip until recently. I have found it to be a solid product. The kind of thing you'd expect from our friends at IntelliJ.</p><h3 id="i-thought-datagrip-supported-cassandra-out-of-the-box">I thought DataGrip supported Cassandra out of the box</h3><p>Yes, this is true: an Apache Cassandra driver ships with DataGrip. 
However, Astra is secure by default, and the easiest way to connect to Astra is to use the <a href="https://docs.astra.datastax.com/docs/obtaining-database-credentials">secure connect bundle</a>, which effortlessly gives us mTLS. For this we need a JDBC driver that is built on top of a modern version of the DataStax Java driver. Fortunately, DataGrip allows us to add custom JDBC drivers with relative ease.</p><p>It might be possible to unpack the secure connect bundle and pick out all the pieces needed to configure SSL with the DataGrip cassandra driver, but that will be a topic for another day.</p><h3 id="grab-and-unzip-the-driver">Grab and unzip the driver</h3><p>Download the DataStax JDBC Driver <a href="https://downloads.datastax.com/jdbc/cql/2.0.4.1004/SimbaCassandraJDBC42-2.0.4.1004.zip" rel="nofollow noopener noreferrer">here</a> or from <a href="https://downloads.datastax.com/#odbc-jdbc-drivers">https://downloads.datastax.com/#odbc-jdbc-drivers</a>. I'm using version 2.0.4.</p><p>Next, unzip that file in your Downloads directory:</p><p><code>unzip SimbaCassandraJDBC42-2.0.4.1004.zip</code></p><h3 id="import-ide-settings">Import IDE Settings</h3><p>Download and import these <a href="https://datastax-21b7c7df5342.intercom-attachments-7.com/i/o/232268459/929cbfa881f4423cceb8b3b2/settings.zip" rel="noopener noreferrer">settings.zip</a> into DataGrip.</p><p><strong>Note</strong>: If you are already a heavy DataGrip user, make sure to back up your existing settings and proceed with caution!</p><p>File –&gt; Manage IDE Settings –&gt; Import Settings, or simply triple-shift –&gt; Import Settings.</p><p>At this point you should be able to see a new database connection type called Astra:</p><figure class="kg-card kg-image-card"><img src="https://downloads.intercomcdn.com/i/o/232269523/634681a503a86fc35eebd711/image.png" class="kg-image" alt="Connecting to Astra from DataGrip via JDBC"><figcaption>Astra DB connection</figcaption></figure><p>If you would rather keep your JDBC jar 
elsewhere, just remember to change the path under driver files to match where you unzipped your jar. By default it will use your user home <code>/Downloads</code> directory.</p><h3 id="establish-the-connection">Establish the connection</h3><p>When you create your connection, the URL should look like this: <code>jdbc:cassandra://;AuthMech=2;UID=&lt;YOUR USER ID&gt;;PWD=&lt;YOUR PASSWORD&gt;;SecureConnectionBundlePath=&lt;PATH TO YOUR SECURE CONNECT BUNDLE&gt;;TunableConsistency=6</code></p><h3 id="it-works-">It works!</h3><p>At this point you can do things like create tables, introspect your keyspaces, view your data in the DataGrip table explorer, and more.</p><figure class="kg-card kg-image-card"><img src="http://www.sestevez.com/content/images/2020/07/image.png" class="kg-image" alt="Connecting to Astra from DataGrip via JDBC"><figcaption>Table explorer</figcaption></figure><figure class="kg-card kg-image-card"><img src="http://www.sestevez.com/content/images/2020/07/image-1.png" class="kg-image" alt="Connecting to Astra from DataGrip via JDBC"><figcaption>Execute queries</figcaption></figure>]]></content:encoded></item><item><title><![CDATA[Astra and Akka-Persistence]]></title><description><![CDATA[<p>I've been chatting with a few folks that run akka-persistence backed by cassandra using either the <a href="https://github.com/akka/akka-persistence-cassandra">akka-persistence-cassandra</a> project directly or via the <a href="https://github.com/akka/akka-persistence-cassandra">Lagom microservices framework</a>. 
These folks are often big fans of Cassandra's peer-to-peer distributed architecture, its availability and performance capabilities, its active-active geo-redundancy, and its</p>]]></description><link>https://www.sestevez.com/astra-akka-lagom/</link><guid isPermaLink="false">5f062657c478d72871fb88e0</guid><category><![CDATA[akka-persistence-cassandra]]></category><category><![CDATA[cassandra]]></category><category><![CDATA[astra]]></category><category><![CDATA[lagom]]></category><category><![CDATA[scala]]></category><category><![CDATA[secure-connect-bundle]]></category><category><![CDATA[java-driver]]></category><category><![CDATA[datastax-java-driver]]></category><dc:creator><![CDATA[Sebastian Estevez]]></dc:creator><pubDate>Wed, 08 Jul 2020 21:18:52 GMT</pubDate><media:content url="http://www.sestevez.com/content/images/2020/07/logom.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.sestevez.com/content/images/2020/07/logom.png" alt="Astra and Akka-Persistence"><p>I've been chatting with a few folks that run akka-persistence backed by cassandra using either the <a href="https://github.com/akka/akka-persistence-cassandra">akka-persistence-cassandra</a> project directly or via the <a href="https://github.com/akka/akka-persistence-cassandra">Lagom microservices framework</a>. These folks are often big fans of Cassandra's peer-to-peer distributed architecture, its availability and performance capabilities, its active-active geo-redundancy, and its scalability characteristics.</p><p>For a lot of these folks, running cassandra clusters is not their main passion. 
They'd rather spend time thinking about CQRS, event sourcing, actors, fancy Scala things like <a href="http://fdahms.com/2015/10/14/scala-and-the-transient-lazy-val-pattern/">@transient</a>, and of course their business logic.</p><h2 id="zero-ops">Zero Ops?</h2><p>Today, you don't have to manage your own Cassandra clusters; just use <a href="https://astra.datastax.com/">Astra</a>!</p><h2 id="dependency-fun">Dependency fun</h2><p>If you are a Lagom or akka-persistence user trying to switch over to a DBaaS, you may have landed at the Astra Java driver docs and seen that to use the secure-connect-bundle you need a recent version of the Cassandra Java Driver.</p><p>This is both true and not true, but first: what even is this secure-connect-bundle thing?</p><h2 id="what-even-is-a-secure-connect-bundle">What even is a secure-connect-bundle?</h2><p>Connections to Astra are secure by default and support two-way TLS over the wire. In order to achieve this, there are multiple settings in the driver that need to be configured when instantiating your CQL session.</p><p>To make things easier for users, Astra replaces this complex configuration with a single line that points to a compressed file (the secure-connect-bundle).</p><p>These are the Java driver versions that support the SCB (naturally, newer dot versions will also support it):</p><p>DataStax Java driver for Apache Cassandra 4.x</p><p><code>&lt;dependency&gt;   &lt;groupId&gt;com.datastax.oss&lt;/groupId&gt;   &lt;artifactId&gt;java-driver-core&lt;/artifactId&gt;   &lt;version&gt;4.6.0&lt;/version&gt; &lt;/dependency&gt; </code></p><p>DataStax Java driver for Apache Cassandra 3.x</p><p><code>&lt;dependency&gt;   &lt;groupId&gt;com.datastax.cassandra&lt;/groupId&gt;   &lt;artifactId&gt;cassandra-driver-core&lt;/artifactId&gt;   &lt;version&gt;3.8.0&lt;/version&gt; &lt;/dependency&gt; </code></p><p>DSE Java 2.x</p><p><code>&lt;dependency&gt;   &lt;groupId&gt;com.datastax.dse&lt;/groupId&gt;   
&lt;artifactId&gt;dse-java-driver-core&lt;/artifactId&gt;   &lt;version&gt;2.3.0&lt;/version&gt; &lt;/dependency&gt; </code></p><p>DSE Java 1.x</p><p><code>&lt;dependency&gt;   &lt;groupId&gt;com.datastax.dse&lt;/groupId&gt;   &lt;artifactId&gt;dse-java-driver-core&lt;/artifactId&gt;   &lt;version&gt;1.9.0&lt;/version&gt; &lt;/dependency&gt;</code></p><p><a href="https://docs.datastax.com/en/astra/aws/doc/dscloud/astra/dscloudMigrateJavaDriver.html">Source: DataStax Drivers Docs</a></p><h2 id="what-if-i-can-t-upgrade">What if I can't upgrade?</h2><p>Driver upgrades are often just a simple dependency change, but with frameworks like <code>lagom</code> and <code>akka-persistence-cassandra</code> that take care of session management and querying for you, you need the library to upgrade <em>its</em> dependencies.</p><p><code>lagom</code> depends on <code>akka-persistence-cassandra</code>, which inherits the Java driver version from <code>alpakka</code>.</p><p>As of the time of this writing (July 8, 2020), <code>alpakka</code> is updated to Java driver 4.6.1 (which supports the SCB); unfortunately, this is only on the master branch today (Chris Batey pushed the dependency in this PR: <a href="https://github.com/akka/alpakka/pull/2320">https://github.com/akka/alpakka/pull/2320</a>).</p><p>Until this makes it into a release, and that release gets pulled by an <code>akka-persistence-cassandra</code> release, and that gets pulled into a <code>lagom</code> release, we'll need a workaround.</p><h2 id="workaround">Workaround</h2><p>I spent some time and figured out how to get current versions of <code>akka-persistence-cassandra</code> to connect to Astra.</p><h2 id="application-conf">application.conf</h2><pre><code># Configuration for akka-persistence-cassandra
akka.persistence.cassandra {
  events-by-tag {
    bucket-size = "Day"
    # for reduced latency
    eventual-consistency-delay = 200ms
    flush-interval = 50ms
    pubsub-notification = on
    first-time-bucket = "20200115T00:00"
  }

  query {
    refresh-interval = 2s
  }

  # don't use autocreate in production
  journal.keyspace-autocreate = off
  journal.tables-autocreate = on
  snapshot.keyspace-autocreate = off
  snapshot.tables-autocreate = on
  journal.keyspace = "&lt;ASTRA_KEYSPACE&gt;"
  snapshot.keyspace = "&lt;ASTRA_KEYSPACE&gt;"
}

datastax-java-driver {
  advanced.reconnect-on-init = on
  basic.contact-points = [ "&lt;ASTRA_DNS&gt;:&lt;ASTRA_CQL_PORT&gt;" ]
  basic.load-balancing-policy.local-datacenter = caas-dc
  local-datacenter = caas-dc
  advanced.ssl-engine-factory {
    class = DefaultSslEngineFactory


    hostname-validation = false

    truststore-path = ./&lt;PATH_TO_UNZIPPED_SCB&gt;/trustStore.jks
    truststore-password = &lt;TRUSTSTORE_PASSWORD&gt;
    keystore-path = ./&lt;PATH_TO_UNZIPPED_SCB&gt;/identity.jks
    keystore-password = &lt;KEYSTORE_PASSWORD&gt;
  }

  advanced.auth-provider {
    class = PlainTextAuthProvider
    username = &lt;ASTRA_C*_DB_USERNAME&gt;
    password = &lt;ASTRA_C*DB_PASSWORD&gt;
  }

}

akka.projection.cassandra.offset-store.keyspace = "&lt;ASTRA_KEYSPACE&gt;"
</code></pre><p>First, download your secure-connect-bundle from the Astra UI and unzip it into a directory on the host where you are running <code>akka-persistence-cassandra</code>.</p><p>Next, take a look at <code>config.json</code> and <code>cqlshrc</code>. You can pull the values for the &lt;&gt; placeholders in the config above from these two files:</p><pre><code>$ cat scb/config.json 
{
  "host": "&lt;ASTRA_DNS&gt;",
  "port": &lt;ASTRA_METADATA_PORT_THIS_IS_NOT_THE_CQL_PORT&gt;,
  "keyspace": "&lt;ASTRA_KEYSPACE&gt;",
  "localDC": "caas-dc",
  "caCertLocation": "./ca.crt",
  "keyLocation": "./key",
  "certLocation": "./cert",
  "keyStoreLocation": "./identity.jks",
  "keyStorePassword": "&lt;ASTRA_KEYSTORE_PASSWORD&gt;",
  "trustStoreLocation": "./trustStore.jks",
  "trustStorePassword": "&lt;ASTRA_TRUSTSTORE_PASSWORD&gt;",
  "csvLocation": "./data",
  "pfxCertPassword": "&lt;PFX_CERT_PASSWORD&gt;"
}
</code></pre><pre><code>$ cat scb/cqlshrc 
[connection]
hostname = &lt;ASTRA_DNS&gt;
port = &lt;ASTRA_CQL_PORT&gt;
ssl = true

[ssl]
validate = true
certfile = ./ca.crt
userkey = ./key
usercert = ./cert
</code></pre><p>The config is kept in <a href="https://gist.github.com/phact/72d555dc248494faa8d91312a9cb753f">this gist</a> for your viewing and sharing pleasure and will be updated if things change.</p><h2 id="future">Future</h2><p>The DataStax team is working with the community to help get the driver version updated and documented. Stay tuned!</p>]]></content:encoded></item><item><title><![CDATA[DataStax Proxy for DynamoDB™ and Apache Cassandra™ - Preview]]></title><description><![CDATA[<p>Yesterday at ApacheCon, our very own Patrick McFadin announced the public preview of an open source tool that enables developers to run their AWS DynamoDB™ workloads on Apache Cassandra. With the DataStax Proxy for DynamoDB and Cassandra, developers can run DynamoDB workloads on premises, taking advantage of the hybrid, multi-model,</p>]]></description><link>https://www.sestevez.com/untitled/</link><guid isPermaLink="false">5d84fdaaa925bf188ff71384</guid><dc:creator><![CDATA[Sebastian Estevez]]></dc:creator><pubDate>Fri, 20 Sep 2019 16:32:33 GMT</pubDate><content:encoded><![CDATA[<p>Yesterday at ApacheCon, our very own Patrick McFadin announced the public preview of an open source tool that enables developers to run their AWS DynamoDB™ workloads on Apache Cassandra. With the DataStax Proxy for DynamoDB and Cassandra, developers can run DynamoDB workloads on premises, taking advantage of the hybrid, multi-model, and scalability benefits of Cassandra.</p><p>This post is cross posted on the <a href="https://www.datastax.com/blog/2019/09/datastax-proxy-dynamodb-and-apache-cassandra-preview">DataStax</a> <a href="https://www.datastax.com/blog/2019/09/datastax-proxy-dynamodb-and-apache-cassandra-preview">blog</a>.</p><h2 id="the-big-picture">The Big Picture</h2><p>Amazon DynamoDB is a key-value and document database which offers developers elasticity and a zero-ops cloud experience. 
However, the tight AWS integration that makes DynamoDB great for cloud is a barrier for customers that want to use it on premises.</p><p>Cassandra has always supported key-value and tabular data sets so supporting DynamoDB workloads just meant that DataStax customers needed a translation layer to their existing storage engine.</p><p>Today we are previewing a proxy that provides compatibility with the DynamoDB SDK, allowing existing applications to read/write data to DataStax Enterprise (DSE) or Cassandra without any code changes. It also provides the hybrid + multi-model + scalability benefits of Cassandra to DynamoDB users.</p><p><em>If you’re just here for the code you can find it in GitHub and DataStax Labs: <a href="https://github.com/datastax/dynamo-cassandra-proxy/">https://github.com/datastax/dynamo-cassandra-proxy/</a></em></p><h2 id="possible-scenarios">Possible Scenarios</h2><p><strong>Application Lifecycle Management:</strong> Many customers develop on premises and then deploy to the cloud for production. The proxy enables customers to run their existing DynamoDB applications using Cassandra clusters on-prem.</p><figure class="kg-card kg-image-card"><img src="https://www.datastax.com/sites/default/files/inline-images/DataStaxProxyOverview.JPG" class="kg-image" alt="DataStax Proxy Overview"></figure><p><strong>Hybrid Deployments:</strong> DynamoDB Streams can be used to enable hybrid workload management and transfers from DynamoDB cloud deployments to on-prem Cassandra-proxied deployments. This is supported in the current implementation and, like DynamoDB Global Tables, it uses DynamoDB Streams to move the data. 
For hybrid transfer to DynamoDB, check out the <a href="https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configCDCLogging.html">Cassandra CDC improvements</a>, which could be leveraged, and stay tuned to the DataStax blog for updates on our Change Data Capture (CDC) capabilities.</p><figure class="kg-card kg-image-card"><img src="https://www.datastax.com/sites/default/files/inline-images/DataStaxProxySolArch.JPG" class="kg-image" alt="DataStax Proxy Architecture"></figure><h2 id="what-s-in-the-proxy"><strong>What’s in the Proxy?</strong></h2><p>The proxy is designed to enable users to back their DynamoDB applications with Cassandra. We determined that the best way to help users leverage this new tool and to help it flourish was to make it an open source Apache 2 licensed project. The code consists of a scalable proxy layer that sits between your app and the database. It provides compatibility with the DynamoDB SDK, which allows existing DynamoDB applications to read and write data to Cassandra without application changes.</p><p><strong>How It Works</strong></p><p>A few key decisions were made when designing the proxy. 
As always, these are in line with the design principles that we use to guide development for both Cassandra and our DataStax Enterprise product.</p><p><strong>Why A Separate Process?</strong></p><p>We could have built this as a Cassandra plugin that would execute as part of the core process, but we decided to build it as a separate process for the following reasons:</p><p>1) Ability to scale the proxy independently of Cassandra</p><p>2) Ability to leverage k8s / cloud-native tooling</p><p>3) Developer agility and the ability to attract contributors: developers can work on the proxy with limited knowledge of Cassandra internals</p><p>4) Independent release cadence, not tied to the Apache Cassandra project</p><p>5) Better AWS integration story for stateless apps (e.g., CloudWatch alarms, autoscaling, etc.)</p><p><strong>Why Pluggable Persistence?</strong></p><p>On quick inspection, DynamoDB’s data model is quite simple. It consists of a hash key, a sort key, and a JSON structure which is referred to as an item. Depending on your goals, the DynamoDB data model can be persisted in Cassandra Query Language (CQL) in different ways. To allow for experimentation and pluggability, we have built the translation layer in a pluggable way that allows for different translators. We continue to build on this scaffolding to test out multiple data models and determine which are best suited for:</p><p>1) Different workloads</p><p>2) Different support for consistency / linearization requirements</p><p>3) Different performance tradeoffs based on SLAs</p><p><strong>Conclusion</strong></p><p>If you have any interest in running DynamoDB workloads on Cassandra, take a look at the project. Getting started is easy and spelled out in the readme and DynamoDB sections. 
Features supported by the proxy are quickly increasing and collaborators are welcome.</p><p><em><a href="https://github.com/datastax/dynamo-cassandra-proxy/">https://github.com/datastax/dynamo-cassandra-proxy/</a></em></p><p><em>All product and company names are trademarks or registered trademarks of their respective owner. Use of these trademarks does not imply any affiliation with or endorsement by the trademark owner.</em></p><hr><p><sup>1</sup>Often in the DynamoDB documentation, this key is referred to as a partition key, but since these are not one-to-one with DynamoDB partitions we will use the term hash key instead.</p>]]></content:encoded></item><item><title><![CDATA[Integrate Spark Metrics using DSE Insights Metrics Collector]]></title><description><![CDATA[<p>Metrics and visibility are critical when dealing with distributed systems.</p><p>In the case of DSE Analytics we are interested in monitoring the state of the various Spark processes (master, worker, driver, executor) in the cluster, the status of the work the cluster is doing (applications, jobs, stages, and tasks), and</p>]]></description><link>https://www.sestevez.com/grabbing-dse-analytics-spark-metrics-with-dse-insights-metrics-reporter-and-graphana-prometheus/</link><guid isPermaLink="false">5c7fd22aa925bf188ff7129c</guid><dc:creator><![CDATA[Sebastian Estevez]]></dc:creator><pubDate>Wed, 20 Mar 2019 18:28:42 GMT</pubDate><media:content url="http://www.sestevez.com/content/images/2019/03/prom.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.sestevez.com/content/images/2019/03/prom.png" alt="Integrate Spark Metrics using DSE Insights Metrics Collector"><p>Metrics and visibility are critical when dealing with distributed systems.</p><p>In the case of DSE Analytics we are interested in monitoring the state of the various Spark processes (master, worker, driver, executor) in the cluster, the status of the work the cluster is doing (applications, jobs, stages, and tasks), and 
finally we are also interested in the detailed <a href="https://github.com/datastax/spark-cassandra-connector/blob/master/doc/11_metrics.md">metrics provided by the spark cassandra connector</a>. This article focuses on the first two; we leave the integration of the spark cassandra connector monitoring for a second post. </p><p>With the DataStax Enterprise (DSE) Metrics <a href="https://www.datastax.com/2018/12/improved-performance-diagnostics-with-datastax-metrics-collector">Collector</a> (new as of DSE 6.7 and backported to 6.0.5) DataStax makes exporting metrics to your monitoring solution of choice simple and easy. Donnie Robertson wrote an excellent DataStax Academy <a href="https://academy.datastax.com/content/dse-metrics-collector-tutorial-using-dse-docker-images">blog</a> on how to run DSE with the insights collector providing metrics for Prometheus and Grafana in a completely dockerized setup.</p><p>At its core, the DSE Metrics Collector is a managed collectd sidecar bundled with the DSE binaries. DSE server manages the lifecycle of the collectd process and allows users to manage collectd configuration via <code>dsetool</code>. Customers can ship the DSE metrics events (generated by the database) to the endpoint of their choice*.</p><p>To monitor DSE Analytics (Spark jobs), we can leverage a collectd plugin*.</p><h2 id="just-show-me-the-code-">Just show me the code!</h2><p>This <a href="https://github.com/phact/spark-insights/blob/master/.startup/spark-insights#L4">bash script</a> stands up Prometheus &amp; Grafana and hooks up spark metrics assuming DSE is installed and running on localhost via a package install. I'll break down the steps in the rest of the article.</p><h2 id="collectd-spark">collectd-spark</h2><p>The lovely folks at Signalfx wrote a spark <a href="https://github.com/signalfx/collectd-spark">plugin</a> for collectd that gathers metrics via HTTP from the spark master and worker. 
To use it, simply clone the plugin and move it to the dse collectd directory (in the case of a package install /usr/share/dse/collectd) as follows:</p><pre><code>git clone https://github.com/signalfx/collectd-spark

mkdir /usr/share/dse/collectd/collectd-spark
cp collectd-spark/spark_plugin.py /usr/share/dse/collectd/collectd-spark/
</code></pre>
<p>Since collectd-spark is written in Python, we need to tell the bundled collectd where to find the Python libraries with the following symlink:</p><pre><code># symlink python libs for the collectd spark plugin
ln -s /usr/lib/python2.7/ /usr/share/dse/collectd/usr/lib/python2.7</code></pre><p>We enable and configure both the collectd-spark plugin and the write_prometheus plugin by adding a config file to the DSE collectd directory. Notice that I dynamically pulled the spark master URL by hitting the Spark REST API running on localhost.</p><pre><code>MASTER_URL=$(curl localhost:7080 -LIs | grep Location | awk -F' ' '{print $2}' | awk -F':' '{print $1 &quot;:&quot; $2}')

mkdir /etc/dse/collectd/
cat &lt;&lt; EOF &gt; /etc/dse/collectd/10-spark.conf
LoadPlugin python
&lt;Plugin python&gt;
  ModulePath &quot;/usr/share/dse/collectd/collectd-spark&quot;

  Import spark_plugin

  &lt;Module spark_plugin&gt;
  MetricsURL &quot;$MASTER_URL&quot;
  MasterPort 7080
  WorkerPorts 7081
  Applications &quot;True&quot;
  Master &quot;$MASTER_URL:7080&quot;
  Cluster &quot;Standalone&quot;
  &lt;/Module&gt;
&lt;/Plugin&gt;

LoadPlugin write_prometheus
&lt;Plugin write_prometheus&gt;
 Port &quot;9103&quot;
&lt;/Plugin&gt;
EOF

</code></pre>
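<p>For readers who want to see what that <code>MASTER_URL</code> pipeline does without decoding the <code>awk</code> incantation, here is a Python sketch of the same header parsing. This is an illustration only: the header value below is made up, and it assumes (as the curl pipeline above does) that the master UI on port 7080 answers with a redirect whose <code>Location</code> header carries the master's address.</p>

```python
def master_url_from_location(location_header: str) -> str:
    """Extract scheme://host from a redirect Location header, mirroring
    the grep/awk pipeline above (keep the first two colon-separated
    fields of the header value: the scheme and the //host part)."""
    value = location_header.split(None, 1)[1].strip()  # drop the "Location:" field
    scheme, host = value.split(":")[:2]                # e.g. "http", "//10.10.1.5"
    return f"{scheme}:{host}"

# Illustrative header, shaped like the Spark master UI redirect
header = "Location: http://10.10.1.5:7080/home"
print(master_url_from_location(header))  # http://10.10.1.5
```

<p>The two <code>awk</code> stages in the bash version keep the first two colon-separated fields, which is exactly the scheme-plus-host split performed here.</p>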
<h3 id="insights-collector">Insights Collector</h3><p>We are now ready to bring up collectd. If Metrics Collector is enabled and running, disable and enable it again or kill the collectd process. Killing the collectd process will trigger DSE to bring it back up with the new config.</p><pre><code># turn on collectd
# if insights has already been enabled, either DISABLE and then enable again or kill the collectd process. DSE will bring it back up with the new config.
#dsetool insights_config --mode DISABLE
dsetool insights_config --mode ENABLED_WITH_LOCAL_STORAGE</code></pre><h2 id="grafana-and-prometheus">Grafana and Prometheus</h2><p>Finally, we bring up Grafana and Prometheus using docker compose. If you already have Grafana and Prometheus running elsewhere, you can add the Prometheus targets to point to 9103 on each of your DSE nodes. Notice that we also clone the dse-metric-reporter dashboards from the DataStax repo that comes with pre-built Grafana dashboards for DSE.</p><p>The new Spark metrics will appear under collectd spark in Prometheus and Grafana allowing you to create custom dashboards for them. </p><pre><code>export PROMETHEUS_DATA_DIR=/mnt/ephemeral/prometheus
export GRAFANA_DATA_DIR=/mnt/ephemeral/grafana

mkdir $PROMETHEUS_DATA_DIR
mkdir $GRAFANA_DATA_DIR

chmod 777 $PROMETHEUS_DATA_DIR
chmod 777 $GRAFANA_DATA_DIR

git clone https://github.com/datastax/dse-metric-reporter-dashboards.git
cd dse-metric-reporter-dashboards

cat /etc/hosts | grep node | grep -v ext| grep -v allnodes | awk -F' ' '{print $1 &quot;:9103&quot;}'  | jq -R . | jq -s &quot;.| [{targets:[.[]], labels:{cluster: \&quot;test_cluster\&quot; }}]&quot; &gt; prometheus/tg_dse.json

pip install docker-compose

docker-compose up &amp;
</code></pre>
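<p>The <code>jq</code> one-liner above just builds Prometheus' file-based service discovery structure. As an illustration only (the node IPs below are made up), the same target-group structure can be produced in Python:</p>

```python
import json

def build_prometheus_targets(node_ips, cluster="test_cluster", port=9103):
    """Build the Prometheus file-SD structure the jq pipeline writes to
    prometheus/tg_dse.json: one target group with every DSE node's
    write_prometheus endpoint and a cluster label."""
    return [{
        "targets": [f"{ip}:{port}" for ip in node_ips],
        "labels": {"cluster": cluster},
    }]

# Illustrative node IPs; the script above pulls them from /etc/hosts
print(json.dumps(build_prometheus_targets(["10.0.0.1", "10.0.0.2"]), indent=2))
```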
<p>The screenshot below shows Prometheus picking up data from three targets, only the first of which has been configured with the DSE Metrics Collector.</p><figure class="kg-card kg-image-card"><img src="http://www.sestevez.com/content/images/2019/03/image.png" class="kg-image" alt="Integrate Spark Metrics using DSE Insights Metrics Collector"></figure><p>The spark data can be visualized in the graph screen in Prometheus for Prometheus query troubleshooting:</p><figure class="kg-card kg-image-card"><img src="http://www.sestevez.com/content/images/2019/03/image-1.png" class="kg-image" alt="Integrate Spark Metrics using DSE Insights Metrics Collector"></figure><p>And in Grafana as well:</p><figure class="kg-card kg-image-card"><img src="http://www.sestevez.com/content/images/2019/03/image-2.png" class="kg-image" alt="Integrate Spark Metrics using DSE Insights Metrics Collector"></figure><p>Here's a sample dashboard I hope to contribute to the DataStax <a href="https://github.com/datastax/dse-metric-reporter-dashboards">Metrics Collector Github repo</a>. </p><figure class="kg-card kg-image-card"><img src="http://www.sestevez.com/content/images/2019/03/34.220.235.3_3000_d_XOWiQAqiz_new-dashboard-copy_panelId-47-orgId-1-refresh-5s-from-now-1h-to-now-tab-metrics.png" class="kg-image" alt="Integrate Spark Metrics using DSE Insights Metrics Collector"><figcaption>Spark Dashboard</figcaption></figure><p>Hope you have found this article useful. 
Happy monitoring!</p><h1 id="resources">Resources</h1><p><a href="https://github.com/datastax/dse-metric-reporter-dashboards">https://github.com/datastax/dse-metric-reporter-dashboards</a></p><p><a href="https://github.com/signalfx/integrations/tree/master/collectd-spark#configuration">https://github.com/signalfx/integrations/tree/master/collectd-spark#configuration</a></p><p><a href="https://github.com/signalfx/collectd-spark/tree/v1.0.2/integration-test">https://github.com/signalfx/collectd-spark/tree/v1.0.2/integration-test</a></p><p><a href="https://academy.datastax.com/content/dse-metrics-collector-tutorial-using-dse-docker-images">https://academy.datastax.com/content/dse-metrics-collector-tutorial-using-dse-docker-images</a></p><p>* Collectd supports most monitoring systems via<a href="https://collectd.org/wiki/index.php/Category:Callback_write"> collectd write plugins</a>.</p>]]></content:encoded></item><item><title><![CDATA[DSE Gremlin Queries: Good, Better, Best]]></title><description><![CDATA[<h2 id="whysomegremlinqueriesrunfasterthanothers">Why Some Gremlin Queries Run Faster than Others</h2>
<h3 id="intro">Intro</h3>
<blockquote>
<p>With great power comes great responsibility.</p>
</blockquote>
<blockquote>
<p><em>--Spiderman's Uncle</em></p>
</blockquote>
<p>The Gremlin language gives users great power in the form of traversal expressivity. With the <a href="http://tinkerpop.apache.org/docs/current/reference/#graph-traversal-steps">dozens of steps</a> available in the Gremlin language, developers can make their little Gremlin(s) jump around the</p>]]></description><link>https://www.sestevez.com/dse-gremlin-queries-good-better-best/</link><guid isPermaLink="false">5a56690547d72f4798d9e493</guid><dc:creator><![CDATA[Sebastian Estevez]]></dc:creator><pubDate>Wed, 04 Oct 2017 21:00:59 GMT</pubDate><content:encoded><![CDATA[<h2 id="whysomegremlinqueriesrunfasterthanothers">Why Some Gremlin Queries Run Faster than Others</h2>
<h3 id="intro">Intro</h3>
<blockquote>
<p>With great power comes great responsibility.</p>
</blockquote>
<blockquote>
<p><em>--Spiderman's Uncle</em></p>
</blockquote>
<p>The Gremlin language gives users great power in the form of traversal expressivity. With the <a href="http://tinkerpop.apache.org/docs/current/reference/#graph-traversal-steps">dozens of steps</a> available in the Gremlin language, developers can make their little Gremlin(s) jump around the graph every which way they like. However, not all paths that reach the same result are equal.</p>
<p>This article discusses Gremlin query optimization for DSE Graph. We will walk through examples of queries that just do the job, and improve upon them to get better performance while focusing on the reasoning and introspection tools required to identify bottlenecks and measure improvements. With this guidance, readers can reproduce comparable query tuning results in their own environments and achieve performant real-time transactional Gremlin queries.</p>
<p><strong>Note:</strong> For the purposes of this article, a good query is one that gets the expected result. These &quot;good&quot; queries will be inefficient but we use them to highlight common mistakes and the techniques used to avoid them.</p>
<h3 id="oneindexatatime">One Index at a time</h3>
<p>DSE Graph queries use one index at a time. When designing a graph query, be conscious of the fact that DSE Graph’s current implementation allows it to take advantage of one index per query. Once that index is used up, the remaining execution will happen in memory at the DSE Graph server in the DSE JVM.</p>
<h4 id="grabtherightindex">Grab the right index</h4>
<p>Take the following <strong>&quot;good&quot;</strong> query:</p>
<pre><code>g.V().has('state','name', 'california').
  in('lives_in').
  hasLabel('person').
  has('first_name', regex('Seb.+')).
  has('last_name', regex('Est.+'))
</code></pre>
<p>In the query above, we are performing regex matches on <code>first_name</code> and <code>last_name</code> (properties on the <code>person</code> vertex) and an exact match on the <code>state</code> <code>name</code> california. Since we are able to leverage one search index (the index on the <code>state</code> vertex or the index on the <code>person</code> vertex) we should consider which filter will return fewer values and perform that filter first.</p>
<p>For example, if we are expecting just a few individuals with names starting with Seb / Est in the dataset, it would make sense to write a <strong>better</strong> query as follows:</p>
<pre><code>g.V().hasLabel('person').  
  has('first_name', regex('Seb.+')).
  has('last_name', regex('Est.+')).as('people').
  out('lives_in').
  has('state','name', 'california').
  select('people')
</code></pre>
<p>Notice the use of the <code>as</code> / <code>select</code> steps to ensure that our result set still contains <code>person</code> vertices and not <code>state</code> vertices. Using <code>as</code> / <code>select</code> enables path tracking, which can be computationally expensive, so an even better (<strong>best</strong>) way to write this query is as follows:</p>
<pre><code>g.V().hasLabel('person').  
  has('first_name', regex('Seb.+')).
  has('last_name', regex('Est.+')).
  where(
    out('lives_in').
    has('state','name', 'california')
)
</code></pre>
<p>The where step (subfilter) allows us to return the relevant person vertices without enabling path tracking.</p>
<p>Another way to tackle this 'order of traversal problem' is to leverage the Gremlin <a href="http://tinkerpop.apache.org/docs/current/reference/#match-step">match step</a>.</p>
<h4 id="splittingqueries">Splitting queries</h4>
<p><strong>Note:</strong> The following is one of the cases where, as the DSE Graph optimizer improves, the difference between good, better, and best queries vanishes. In the latest DSE, the <code>.or</code> condition will combine the constraints into a single search query, so there is no need to optimize manually!</p>
<p>Sometimes which filter is more selective is not obvious and it makes sense to perform two queries instead of one.</p>
<p>In versions of DSE older than 5.1.3, this is true when an .or condition is present.</p>
<p>In the following <strong>&quot;good&quot;</strong> query, we have multiple filters (based on different vertices), but DSE Graph can only execute the query in scan mode (there is a ticket to fix <code>.or</code> handling, so this is only a near-term problem).</p>
<pre><code>g.V().  
  or(
    out('lives_in').has('name', 'california'),
    has('person','name', regex('Seb.+')),
    has('person', 'company', 'DataStax')
  ).
  values('first_name')
</code></pre>
<p>Another (<strong>better</strong>) way to obtain this result set would be to run two separate queries and glue them together app side:</p>
<pre><code>g.V().has('state','name','california').
  in('lives_in').values('first_name')

g.V().has('person','first_name', regex('Seb.+')).
  has('company', 'DataStax').
  values('first_name')
</code></pre>
<p>Notice only two queries (not three) are necessary, because the <code>first_name</code> and the <code>company</code> filters can leverage the same search index on the person vertex.</p>
<p>We can also run the following Groovy statements to obtain the result set in a single call to the database (<strong>best</strong>).</p>
<pre><code>def names = g.V().has('state','name','california').
  in('lives_in').values('first_name').toSet()


names += g.V().has('person','first_name', regex('Seb.+')).
  has('company', 'DataStax').
  values('first_name').toSet()
</code></pre>
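<p>The same glue-the-results-together pattern works from any client language, not just Groovy. Here is a minimal Python sketch, where the two lists are stand-ins for the result sets of the two narrower traversals (the names are illustrative):</p>

```python
def union_of_queries(*result_lists):
    """Merge and deduplicate the result sets of several narrower
    queries, mimicking the Groovy toSet()/+= pattern above."""
    names = set()
    for results in result_lists:
        names.update(results)
    return names

# Stand-ins for the two traversals' result sets
californians = ["Seb", "Ana", "Liu"]
datastax_sebs = ["Seb", "Sebastian"]
print(sorted(union_of_queries(californians, datastax_sebs)))
# ['Ana', 'Liu', 'Seb', 'Sebastian']
```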
<h3 id="pagingingremlin">Paging in Gremlin</h3>
<p>For queries that expect large result sets, paging is required. We can accomplish paging by using the <code>.range</code> step. However, the <code>.range</code> step’s latency will degrade with the max value in the return parameters (not the number of values being returned). This is a <strong>&quot;good&quot;</strong> query in that it will eventually get you the results you need (as long as it doesn't time out!):</p>
<pre><code>g.V().has('person', 'company', 'DataStax').
  where(out('lives_in').has('name','california')).
  has('first_name', regex('S.+')).
  range(1000,1010).values('first_name')
</code></pre>
<p>In the query above, we read 1010 values from DSE and throw 1000 of them away in-memory at the Graph server.<br>
As a (<strong>best</strong>) alternative to <code>.range</code>, we can use a search index on the <code>person_id</code> field for paging with DSE Search.</p>
<p><strong>Note:</strong> We chose <code>person_id</code> because we needed a sortable and unique field we could use to page against. <code>person_id</code> is an integer and integers in DSE Search are indexed as Lucene TrieIntFields. <a href="https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/search/solrTypeMapping.html">Trie</a> fields in Lucene leverage the <a href="https://en.wikipedia.org/wiki/Trie">trie data structure</a> to allow range filtering and sorting.</p>
<p>In your app, hang on to the max value from the previous page:</p>
<pre><code>def max = 1000

def nameSet = g.V().has('person', 'company', 'DataStax').
  where(out('lives_in').has('name', 'california')).
  has('first_name', regex('S.+')).values('person_id').toSet()


g.V().has('person','person_id', within(nameSet)).
  has('person_id', gt(max)).
  order().by('person_id').range(0,10)
</code></pre>
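<p>To make the cost difference concrete, here is a small Python sketch of the idea (not DSE internals): an in-memory sorted list stands in for the indexed <code>person_id</code> field. Offset-style paging still materializes and discards everything before the requested page, while keyset-style paging filters on <code>person_id &gt; max</code> and reads only one page's worth of values.</p>

```python
def offset_page(ids, offset, size):
    """range()-style paging: everything before `offset` is still read
    and then thrown away, so cost grows with the page number."""
    return sorted(ids)[offset:offset + size]

def keyset_page(ids, last_seen, size):
    """Keyset paging: filter ids greater than the last value seen, then
    take one page; cost is proportional to the page size."""
    return sorted(i for i in ids if i > last_seen)[:size]

ids = list(range(1, 2001))
print(offset_page(ids, 1000, 10))  # ids 1001..1010, after discarding 1000 rows
print(keyset_page(ids, 1000, 10))  # same page, reached by seeking past id 1000
```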
<p><strong>Note:</strong> The Gremlin <a href="http://tinkerpop.apache.org/docs/current/reference/#timelimit-step">timeLimit</a> step allows you to set a limit on the time the Gremlin server will spend evaluating a request. When troubleshooting queries where the result set may be large, the <code>timeLimit</code> step is a great way to speed up your iterations.</p>
<h3 id="conclusion">Conclusion</h3>
<p>With Gremlin, DSE Graph provides a powerful interaction mechanism capable of turning graph data into insights. Having come up with queries that work, developers can spend time optimizing them to ensure performance that falls within their SLAs.</p>
<p>Remember that as the optimizer in DSE Graph matures, some of these optimizations will start to happen on their own and the difference between a good, a better, and a best Gremlin query will shrink and may eventually disappear.</p>
]]></content:encoded></item><item><title><![CDATA[Large Graph Loading Best Practices: Tactics (Part 2)]]></title><description><![CDATA[<p>The <a href="http://www.sestevez.com/large-graph-strategies-part-1/">previous post</a> introduced DSE Graph and summarized some key considerations related to dealing with large graphs. This post aims to:</p>
<ol>
<li>describe the tooling available to load large data sets to DSE Graph</li>
<li>point out tips, tricks, and key learnings for bulk loading with DSE Graph Loader</li>
<li>provide code samples</li></ol>]]></description><link>https://www.sestevez.com/large-graph-tactics-part-2/</link><guid isPermaLink="false">5a56690547d72f4798d9e492</guid><dc:creator><![CDATA[Sebastian Estevez]]></dc:creator><pubDate>Tue, 14 Mar 2017 13:55:51 GMT</pubDate><content:encoded><![CDATA[<p>The <a href="http://www.sestevez.com/large-graph-strategies-part-1/">previous post</a> introduced DSE Graph and summarized some key considerations related to dealing with large graphs. This post aims to:</p>
<ol>
<li>describe the tooling available to load large data sets to DSE Graph</li>
<li>point out tips, tricks, and key learnings for bulk loading with DSE Graph Loader</li>
<li>provide code samples to simplify your bulk loading process into DSE Graph</li>
</ol>
<h3 id="tooling">Tooling</h3>
<p>DSE Graph Loader (DGL) is a powerful tool for loading graph data into DSE Graph. As shown in the marchitecture diagram below, the DGL supports multiple data input sources for bulk loading and provides high flexibility for manipulating data on ingest via custom groovy data mapping scripts that map source data to graph objects. See the <a href="http://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/dgl/graphloaderTOC.html">DataStax docs</a>, which cover the DGL, its API, and DGL mapping scripts in detail.</p>
<p><img src="http://www.sestevez.com/sestevez/img/dgl.png" alt="landscape"></p>
<p>This article breaks down the tactics for efficient data loading with DGL into the following areas:</p>
<ul>
<li>file processing best practices</li>
<li>mapping script best practices</li>
<li>DGL configuration</li>
</ul>
<h3 id="codeandtactics">Code and Tactics</h3>
<p>Code to accompany this section can be found at:</p>
<p><a href="https://github.com/phact/rundgl">https://github.com/phact/rundgl</a></p>
<p><em>Shout out to Daniel Kuppitz, Pierre Laporte, Caroline George, Ulisses Beresi, Bryn Cooke among others who helped create and refine this framework. Any bugs / mistakes are mine.</em></p>
<p>The code repository consists of:</p>
<ul>
<li>a wrapper bash script that is used for bookkeeping and calling DGL</li>
<li>a mapping script with some helpful utilities</li>
<li>a generic structure that can be used as a starting point for your own custom mapping scripts</li>
<li>a set of analysis scripts that can be used to monitor your load</li>
</ul>
<p>The rest of this article will be focused on DGL loading best practices linking to specific code in the repo for clarity.</p>
<h4 id="filebucketing">File Bucketing</h4>
<p>The main consideration when loading large data volumes is that DGL performance will suffer if fed too much data in a single run. At the time of this article (2/28/2017), the DGL has features designed for non-idempotent graphs (including deduplication of vertices via an internal vertex cache) that limit its performance with large idempotent graphs.</p>
<p>Splitting your loading files into chunks of about 120 million vertices or fewer will ensure that the DGL does not lock up when the vertex cache is saturated.</p>
<p><strong>UPDATE:</strong> In modern versions of the graph loader, custom vertices are no longer cached. This means there should not be a performance impact from large files for this kind of load. Chunking may still be useful for percentage tracking / restarting / etc.</p>
<p>The rundgl script found in the repo is designed to traverse a directory and bucket files, feeding them to DGL a bucket at a time to maximize performance.</p>
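<p>The bucketing idea itself is simple. Here is a hedged Python sketch (not the actual rundgl implementation; it assumes you know each input file's approximate vertex count up front, which the real script does not require):</p>

```python
def bucket_files(file_vertex_counts, max_vertices=120_000_000):
    """Group input files into buckets whose total vertex count stays at
    or under max_vertices, so a single DGL run never saturates the
    internal vertex cache."""
    buckets, current, total = [], [], 0
    for name, count in file_vertex_counts:
        if current and total + count > max_vertices:
            buckets.append(current)
            current, total = [], 0
        current.append(name)
        total += count
    if current:
        buckets.append(current)
    return buckets

# Hypothetical per-file vertex counts
files = [("a.csv", 70_000_000), ("b.csv", 60_000_000), ("c.csv", 50_000_000)]
print(bucket_files(files))  # [['a.csv'], ['b.csv', 'c.csv']]
```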
<h4 id="trackyourprogress">Track your progress</h4>
<p>The analysis directory in the rundgl repo contains a script called chart. This script will aggregate statistics from the DGL loader.log and generate throughput statistics and charts for the different loading phases that have occurred (Vertices, Edges, and Properties).</p>
<p><em><strong>Note</strong></em> <em>- these scripts have only been tested with DGL &lt; 5.0.6</em></p>
<p>Navigate to the analysis directory and run <code>./charts</code> to get a dump of the throughput for your job:</p>
<p><img src="http://www.sestevez.com/sestevez/img/dgl_results.png" alt="analysis"></p>
<p>It will also start a simple http server in the directory for easy access to the png charts it generates, here is an example chart output:</p>
<p><img src="http://www.sestevez.com/sestevez/img/total-throughput.png" alt="total throughput"></p>
<p><em>Thank you Pierre for building and sharing the analysis scripts</em></p>
<h4 id="monitordglerrors">Monitor DGL errors</h4>
<p>When the DGL gets timeouts from the server, it does not log them to STDOUT; they can only be seen in logger.log. In a busy system, it is normal to see timeouts. These timeouts will be handled by retry policies that are baked into the loader. Too many timeouts may be a sign that you are overwhelming your cluster and need to either slow down (reduce threads in the loader) or scale out (add nodes in DSE). You will know if timeouts are affecting your cluster if your overall throughput starts trending down or if you are seeing backed-up thread pools or dropped mutations in OpsCenter.</p>
<p>Aside from timeouts, you may also see errors in the DGL log caused by bad data or bugs in your mapping script. If enough of these errors happen, the job will stop. To avoid having to restart from the beginning on data-related issues, take a look at the bookkeeping section below.</p>
<h4 id="dontuses3">Don't use S3</h4>
<p>If you are looking to load significant amounts of data, do not use S3 as a source, for performance reasons. It will take less time to parallel rsync your data from S3 to a local SSD and then load it than to load directly from S3.</p>
<p>The DGL does have S3 hooks, and from a functionality perspective they work quite well (including AWS authentication), so if you are not in a hurry, the repo also includes an example for pulling data from S3. Just be aware of the performance overhead.</p>
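<p>A quick sketch of that parallel pull-down, using plain <code>cp</code> as a stand-in (the paths are hypothetical; in practice you would substitute <code>aws s3 cp</code> or rsync for the copy command):</p>

```shell
# Fan the copies out across 4 workers with xargs -P; the same pattern
# applies when pulling objects down from S3 to a local SSD before loading.
rm -rf /tmp/src /tmp/dst
mkdir -p /tmp/src /tmp/dst
touch /tmp/src/part-0 /tmp/src/part-1 /tmp/src/part-2   # stand-ins for data files
ls /tmp/src | xargs -P 4 -I{} cp /tmp/src/{} /tmp/dst/{}
ls /tmp/dst | wc -l   # 3 files landed locally
```

<p>Once the files are local, point the DGL at the local directory rather than at S3.</p>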
<h4 id="groovymapperbestpractices">Groovy mapper best practices</h4>
<p>Custom groovy mapping scripts can be error-prone, and the DGL's error messages sometimes leave a bit to be desired. The framework provided aims to simplify the loading process by giving all your mapping scripts a standard structure and minimizing the amount of logic that goes into them.</p>
<h4 id="usethelogger">Use the logger</h4>
<p>DGL mapping scripts can use the log4j logger to log messages. This can be very useful when troubleshooting issues in a load, especially issues that only show up at runtime with a particular file.</p>
<p>It also allows you to track when particular loading events occur during execution.</p>
<p>INFO, WARN, and ERROR messages will be logged to logger.log and will include a timestamp.</p>
<h4 id="dgltakesarguments">DGL takes arguments</h4>
<p>If you need to pass an argument to the groovy mapper, just pass it with <code>-&lt;argname&gt;</code> on the command line.<br>
The variable <code>argname</code> will then be available in your mapping script.</p>
<p>For example, ./rundgl passes <code>-inputfilename</code> to DGL <a href="https://github.com/phact/rundgl/commit/6d008c86e98403c001c4164fcd8772ee9c2fa9bd#diff-ee4a3dcdb94cf3b56f6b0f191b4d4fd9R51">here</a> leveraging this feature. You can see the mapping script use it <a href="https://github.com/phact/rundgl/commit/6d008c86e98403c001c4164fcd8772ee9c2fa9bd#diff-46c8d2f865ad43987bc511d246a4b159R101">here</a>.</p>
<p>The <code>inputfilename</code> argument gives the DGL the list of files to process. This helps us avoid having to do complex directory traversals in the mapping script itself.</p>
<p>By traversing files in the wrapper script, we are also able to do some bookkeeping.</p>
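<p>A minimal sketch of that hand-off (the loader invocation is commented out and its flags are illustrative; rundgl's real logic lives in the repo linked above):</p>

```shell
# Build a comma-separated file list for one bucket and pass it to the
# mapping script through the -inputfilename argument.
rm -rf /tmp/bucket_0
mkdir -p /tmp/bucket_0
touch /tmp/bucket_0/a.csv /tmp/bucket_0/b.csv   # stand-ins for input files
FILES=$(printf '%s,' /tmp/bucket_0/*.csv | sed 's/,$//')
echo "$FILES"   # /tmp/bucket_0/a.csv,/tmp/bucket_0/b.csv
# graphloader my_mapper.groovy -graph mygraph -inputfilename "$FILES"
```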
<h4 id="bookeeping">Bookkeeping</h4>
<p><code>loadedfiles.txt</code> tracks the start and end of your job, the list of files that were loaded, and when each particular load completed. This also enables us to &quot;resume&quot; progress by modifying <code>STARTBUCKET</code>.</p>
<p><code>STARTBUCKET</code> is the first bucket that <code>./rundgl</code> will process. If you stopped your job and want to continue where you left off, count the number of files in <code>loadedfiles.txt</code> and divide by <code>BUCKETSIZE</code>; this gives you the bucket you were on. Starting from that bucket ensures you don't miss any files, and since we are working with idempotent graphs, we don't have to worry about duplicates.</p>
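<p>In shell, that resume calculation looks like this (the literal values are only for illustration):</p>

```shell
# Integer division floors, so a partially loaded bucket is re-run in full;
# idempotent vertices and edges make the re-run harmless.
BUCKETSIZE=100
LOADED=250   # in practice: count the lines in loadedfiles.txt
STARTBUCKET=$((LOADED / BUCKETSIZE))
echo "$STARTBUCKET"   # 2, so resume from bucket 2
```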
<h4 id="dglvertexcachetempdirectory">DGL Vertex Cache Temp Directory</h4>
<p>Especially when using EBS, make sure to move the DGL vertex cache temp directory off its default location.</p>
<p>IO contention against the mapdb files the DGL uses for its internal vertex cache will overwhelm the root EBS partitions of Amazon instances:</p>
<p><code>-Djava.io.tmpdir=/data/tmp</code></p>
<h4 id="leavesearchforlast">Leave search for last</h4>
<p>If you are working on a graph schema that uses DSE Search indexes, you can optimize for overall load + indexing time by loading the data without the search indexes first and then creating the indexes once the data is in the system.</p>
<h3 id="somescreenshots">Some screenshots</h3>
<p>Here are some screenshots from a system using this method to load billions of vertices. The spikes are the runs of the DGL kicked off in sequence by the rundgl wrapper script. You can see the load and performance are steady throughout, giving us dependable throughput.<br>
Like other DSE workloads, if you need more speed, scale out!</p>
<h4 id="throughput">Throughput:</h4>
<p><img src="http://www.sestevez.com/sestevez/img/long_throughput.png" alt="throughput"></p>
<h4 id="osload">OS Load:</h4>
<p><img src="http://www.sestevez.com/sestevez/img/os-load.png" alt="osload"></p>
<h3 id="conclusion">Conclusion</h3>
<p>With the tooling, code, and tactics in this article you should be ready to load billions of V's, E's, and P's into DSE Graph. The <code>rundgl</code> repo is there to help with error handling, logging, bookkeeping, and file bucketing so that your loading experience is smooth. Enjoy!</p>
]]></content:encoded></item><item><title><![CDATA[Large Graph Loading Best Practices: Strategies (Part 1)]]></title><description><![CDATA[<p>This post is an intro to DSE Graph with a focus on the strategies that should be used to load large graphs with billions of vertices and edges.  For those familiar with DSE Graph and large graph strategies or those who want to dive directly into loading data, proceed to</p>]]></description><link>https://www.sestevez.com/large-graph-strategies-part-1/</link><guid isPermaLink="false">5a56690547d72f4798d9e491</guid><dc:creator><![CDATA[Sebastian Estevez]]></dc:creator><pubDate>Tue, 14 Mar 2017 13:55:09 GMT</pubDate><content:encoded><![CDATA[<p>This post is an intro to DSE Graph with a focus on the strategies that should be used to load large graphs with billions of vertices and edges.  For those familiar with DSE Graph and large graph strategies or those who want to dive directly into loading data, proceed to the next post in this two-part series entitled <a href="http://www.sestevez.com/large-graph-tactics-part-2/">Large Graph Loading Best Practices: Tactics</a>.</p>
<h3 id="introtodsegraph">Intro to DSE Graph</h3>
<p>DSE Graph is differentiated from other graph databases by building on DataStax Enterprise's scalability, replication, and fault tolerance.</p>
<p><em><strong>Note</strong> - To understand how DSE Graph data is stored in DSE's Apache Cassandra(TM) storage engine, check out <a href="https://www.datastax.com/dev/blog/a-letter-regarding-native-graph-databases">Matthias</a> and <a href="https://www.datastax.com/dev/blog/scalable-graph-computing-der-gekrummte-graph">Marko</a>'s posts on the matter.</em></p>
<p>When folks ask me where DSE Graph falls in the greater database / graph database landscape, I use this image to communicate the combination of scalability and value in relationships that make DSE Graph such a unique product:</p>
<p><img src="http://www.sestevez.com/sestevez/img/graph_landscape.png" alt="landscape"></p>
<p>DSE Graph is positioned on the right side of the chart where relationships are most valuable, and toward the top of the chart due to the scalability it inherits from DSE and Cassandra. The third key aspect that differentiates DSE graph is the velocity of the data it can support. Unlike analytical graph engines which load a static graph to memory and then crunch that static graph for insights, an operational graph is constantly changing as the real world concepts whose relationships and vertices it represents are created, updated, and deleted.</p>
<p><em><strong>Key Takeaway</strong> - DSE Graph is designed as a real-time, operational, distributed graph database.</em></p>
<h3 id="motivationandgoalsplayingwithscalablegraphs">Motivation and Goals: Playing with Scalable Graphs</h3>
<p>If you have a distributed graph problem, you may want to bulk load your data into DSE Graph and start querying. However, loading significant amounts of data (&gt;1 billion V's or E's) into graph DBs is a time-consuming, nontrivial task. The purpose of this article is to summarize some key design considerations for dealing with large graphs.</p>
<h3 id="largegraphsidempotenceandscalability">Large graphs, idempotence, and scalability</h3>
<p>Idempotence is a common concept in distributed systems design. If an operation is idempotent, it can be repeated over and over and still yield the same result. We use idempotence to help us solve problems like the fact that <a href="http://bravenewgeek.com/you-cannot-have-exactly-once-delivery/">exactly once delivery does not exist</a>; it also greatly simplifies the design of our systems, minimizing bugs and promoting maintainability. For the purposes of this two-part article series, we are going to focus on building scalable, distributed, idempotent graphs. This is one of the design choices supported in DSE Graph, but note that not all graphs built on DSE Graph will have idempotent vertices and idempotent edges.</p>
<h4 id="idempotentvertices">Idempotent Vertices</h4>
<p>DSE Graph allows two types of vertices: 1) those with system-generated keys and 2) those with custom ids. For the purposes of this two-part series we are going to concentrate on custom ids. Custom ids are useful for large graph problems in that they allow developers to take graph partitioning into their own hands. Custom ids will feel familiar if you have used DSE or Cassandra and understand data modeling.</p>
<p>You configure the partition key of your Vertex label with a DDL operation:</p>
<pre><code>schema().vertexLabel('MachineSensor').partitionKey('manufacturing_plant_id').clusteringKey('sensor_id').create()
</code></pre>
<p>If you are using custom ids, the partition key is required and the clustering key is optional. For more on Cassandra data modeling and clustering keys vs. partition keys, see my post on <a href="https://www.sestevez.com/data-modeler/">data modeling for DSE</a>.</p>
<p><em><strong>Note</strong></em> <em>- your partition key, clustering key combination should provide uniqueness for the vertex. With this configuration, reinserting will not generate duplicates</em></p>
<h4 id="idempotentedges">Idempotent Edges</h4>
<p>DSE Graph edges support different cardinality options. For multiple cardinality edges (where there can be more than one edge between the same two vertices of the same edge label type) edge creation is <strong>not</strong> idempotent.</p>
<p>For the purposes of this two-part article series, we will focus on single-cardinality (and thereby idempotent) edges. You can create single-cardinality edge labels in DSE Graph using the <code>single()</code> keyword:</p>
<pre><code>schema.edgeLabel('has_sensor').single().create()
</code></pre>
<h3 id="letsload">Let's load!</h3>
<p>Having considered the strategies above, let's proceed to the <a href="http://www.sestevez.com/large-graph-tactics-part-2/">second part</a>, which addresses the tactical aspects of loading large graphs.</p>
]]></content:encoded></item><item><title><![CDATA[Cluster Migration - Keeping simple things simple]]></title><description><![CDATA[<p>I often get asked about the different ways to move data across DSE clusters (prod to qa, old cluster to new cluster, multi-cluster ETL). There are different options for these ranging from custom apps, cassandra-loader / unloader (which I've talked about in another post), and Apache Spark (TM).</p>
<p>Out of these</p>]]></description><link>https://www.sestevez.com/cluster-migration-keeping-simple-things-simple/</link><guid isPermaLink="false">5a56690547d72f4798d9e490</guid><dc:creator><![CDATA[Sebastian Estevez]]></dc:creator><pubDate>Mon, 07 Nov 2016 18:48:27 GMT</pubDate><content:encoded><![CDATA[<p>I often get asked about the different ways to move data across DSE clusters (prod to qa, old cluster to new cluster, multi-cluster ETL). There are different options for these ranging from custom apps, cassandra-loader / unloader (which I've talked about in another post), and Apache Spark (TM).</p>
<p>Out of these options, Spark is the most scalable and performant, but it is also the most intimidating for a new user. Fortunately, DataStax Enterprise (DSE) makes Apache Spark (TM) integration with Apache Cassandra (TM) <a href="http://docs.datastax.com/en/latest-dse/datastax_enterprise/spark/startingSpark.html?hl=spark_enabled">trivial</a>. Additionally, the <a href="https://github.com/datastax/spark-cassandra-connector">spark-cassandra-connector</a> is mature and user friendly, so writing the code required for a cluster migration in scala is also trivial.</p>
<p>Unfortunately, the mention of learning how to program a new compute framework and setting up SBT to build a scala job can be daunting and sometimes keeps folks from exploring this avenue.</p>
<p><a href="https://github.com/phact/dse-cluster-migration">Here</a> is a pre-built spark job that you can simply run against your clusters to perform a spark-powered migration.</p>
<p>Spark has to be <a href="http://docs.datastax.com/en/latest-dse/datastax_enterprise/spark/startingSpark.html?hl=spark_enabled">enabled</a> on one of the clusters. SSH into that cluster and run:</p>
<pre><code>wget https://github.com/phact/dse-cluster-migration/releases/download/v0.01/dse-cluster-migration_2.10-0.1.jar

dse [-u &lt;user&gt; -p &lt;password&gt;] spark-submit --class phact.MigrateTable --conf spark.dse.cluster.migration.fromClusterHost='&lt;from host&gt;' --conf spark.dse.cluster.migration.toClusterHost='&lt;to host&gt;' --conf spark.dse.cluster.migration.keyspace='&lt;keyspace&gt;' --conf spark.dse.cluster.migration.table='&lt;table&gt;' --conf spark.dse.cluster.migration.newtableflag='&lt;true | false&gt;' --conf spark.dse.cluster.migration.fromuser='&lt;username&gt;' --conf spark.dse.cluster.migration.frompassword='&lt;password&gt;' --conf spark.dse.cluster.migration.touser='&lt;username&gt;' --conf spark.dse.cluster.migration.topassword='&lt;password&gt;' ./dse-cluster-migration_2.10-0.1.jar
</code></pre>
<p><strong>Update:</strong> I added username / password and table creation to the migration app.</p>
<p>Shout out to <a href="https://twitter.com/russspitzer">Russ</a> and <a href="https://twitter.com/brianmhess">Brian</a> for their help.</p>
<p>For the deeper dive on how this code works check out <a href="http://www.russellspitzer.com/2016/01/14/Multiple-Clusters-Spark-Cassandra/">Russ's post</a>.</p>
]]></content:encoded></item><item><title><![CDATA[C* schema changes and compatible types]]></title><description><![CDATA[<p>All the schema operations that can be done in c* are done without downtime. You should limit these actions as a best practice to 1 client (not multiple concurrent clients) to avoid schema disagreement problems.</p>
<p>The schema changes that are allowed are as follows (and documented here):</p>
<pre><code>cqlsh&gt; help</code></pre>]]></description><link>https://www.sestevez.com/c-compatible-types/</link><guid isPermaLink="false">5a56690547d72f4798d9e48f</guid><dc:creator><![CDATA[Sebastian Estevez]]></dc:creator><pubDate>Tue, 31 May 2016 16:47:14 GMT</pubDate><content:encoded><![CDATA[<p>All the schema operations that can be done in c* are done without downtime. You should limit these actions as a best practice to 1 client (not multiple concurrent clients) to avoid schema disagreement problems.</p>
<p>The schema changes that are allowed are as follows (and documented here):</p>
<pre><code>cqlsh&gt; help 
CQL help topics:
================
ALTER                        CREATE_TABLE_OPTIONS  SELECT
ALTER_ADD                    CREATE_TABLE_TYPES    SELECT_COLUMNFAMILY
ALTER_ALTER                  CREATE_USER           SELECT_EXPR
ALTER_DROP                   DELETE                SELECT_LIMIT
ALTER_RENAME                 DELETE_COLUMNS        SELECT_TABLE
ALTER_USER                   DELETE_USING          SELECT_WHERE
ALTER_WITH                   DELETE_WHERE          TEXT_OUTPUT
APPLY                        DROP                  TIMESTAMP_INPUT
ASCII_OUTPUT                 DROP_COLUMNFAMILY     TIMESTAMP_OUTPUT
BEGIN                        DROP_INDEX            TRUNCATE
BLOB_INPUT                   DROP_KEYSPACE         TYPES
BOOLEAN_INPUT                DROP_TABLE            UPDATE
COMPOUND_PRIMARY_KEYS        DROP_USER             UPDATE_COUNTERS
CREATE                       GRANT                 UPDATE_SET
CREATE_COLUMNFAMILY          INSERT                UPDATE_USING
CREATE_COLUMNFAMILY_OPTIONS  LIST                  UPDATE_WHERE
CREATE_COLUMNFAMILY_TYPES    LIST_PERMISSIONS      USE
CREATE_INDEX                 LIST_USERS            UUID_INPUT
CREATE_KEYSPACE              PERMISSIONS
CREATE_TABLE                 REVOKE
</code></pre>
<p>For regular columns and partition keys:</p>
<p>Compatible data types are as follows (&lt;--&gt; denotes two-way compatibility; --&gt; denotes one-way compatibility):<br>
int --&gt; varint<br>
varchar &lt;--&gt; text<br>
ascii --&gt; blob<br>
bigint --&gt; blob<br>
boolean --&gt; blob<br>
decimal --&gt; blob<br>
double --&gt; blob<br>
float --&gt; blob<br>
inet --&gt; blob<br>
int --&gt; blob<br>
text --&gt; blob<br>
timestamp --&gt; blob<br>
timeuuid --&gt; blob<br>
uuid --&gt; blob<br>
varchar --&gt; blob<br>
varint --&gt; blob<br>
timeuuid --&gt; uuid</p>
<p>For clustering columns:<br>
int --&gt; varint<br>
varchar &lt;--&gt; text</p>
<p>The reason clustering columns are different is because they must also be order-compatible (clustering columns mandate the order in which we lay out data on disk, hence the stricter requirement).</p>
<p>As you can see, for the most part, CQL3 is relatively strict when it comes to type changes, and you want to make sure you pick the right types at design time. You can always create and delete columns if you need to change things due to some unforeseen circumstances. We do have tools in DSE Analytics to help with these operational changes when needed.</p>
]]></content:encoded></item><item><title><![CDATA[On Cassandra Collections, Updates, and Tombstones]]></title><description><![CDATA[<h3 id="update">update</h3>
<p>I was chatting with a user today who referenced this old post. Most of it is still relevant but <code>sstable2json</code> is no longer supported in modern c*. The new tool is <code>sstabledump</code>. The two tools are pretty much equivalent so you can just replace <code>sstabe2json</code> with <code>sstabledump</code> everywhere you</p>]]></description><link>https://www.sestevez.com/on-cassandra-collections-updates-and-tombstones/</link><guid isPermaLink="false">5a56690547d72f4798d9e48d</guid><dc:creator><![CDATA[Sebastian Estevez]]></dc:creator><pubDate>Thu, 26 May 2016 00:54:52 GMT</pubDate><content:encoded><![CDATA[<h3 id="update">update</h3>
<p>I was chatting with a user today who referenced this old post. Most of it is still relevant, but <code>sstable2json</code> is no longer supported in modern c*. The new tool is <code>sstabledump</code>. The two tools are pretty much equivalent, so you can just replace <code>sstable2json</code> with <code>sstabledump</code> everywhere you see it here. The outputs may have slightly different formatting, but it should not matter in substance.</p>
<h3 id="cassandracollectionscreatetombstones">Cassandra collections create tombstones?</h3>
<p>Many new cassandra users learn this the hard way: they choose cassandra collections for the wrong reasons, for the wrong use cases, and then experience what is known as death by tombstones.</p>
<p>Update - To hear Luke Tillman, Patrick McFadin, and Eric Stevens talking about this post check out this video on Planet Cassandra! <a href="https://t.co/n9a6RFP5mP">https://t.co/n9a6RFP5mP</a></p>
<h3 id="tldr">TL;DR</h3>
<p>When folks ask me if they should use collections, here are my recommendations.</p>
<h4 id="whydocassandradeveloperschoosecollections">Why do cassandra developers choose collections?</h4>
<h5 id="relationalmindset">Relational mindset:</h5>
<ol>
<li>It feels more natural--warm and fuzzy--to model one to many relationships if you don’t have to de-normalize tables (this is a very common reason, but not a great reason).</li>
</ol>
<h5 id="convenientreads">Convenient reads:</h5>
<ol start="2">
<li>
<p>Need to get a nested java structure directly out of the query</p>
<p>SELECT entitlements from entitlements_by_user WHERE … ;</p>
</li>
<li>
<p>Access whole collection or parts of the collection based on query patterns:</p>
</li>
</ol>
<pre><code>SELECT * FROM entitlements_by_user WHERE entitlements CONTAINS ‘App ABC';
</code></pre>
<h5 id="convenientwrites">Convenient writes:</h5>
<p>Ability to do incremental updates or deletes :</p>
<pre><code>UPDATE entitlements_by_user ... entitlements= entitlements + ‘App ABC’
</code></pre>
<p>This convenience does not come free:</p>
<ul>
<li>Serialization &amp; deserialization take time with maps due to the complex java objects involved</li>
<li>(Non-incremental) inserts/updates on maps generate tombstones. Insert/update-heavy workloads are not collection friendly, and excessive tombstones significantly affect compaction performance.</li>
<li>Collections are not designed to hold more than tens of elements. Compactions and repairs will be slow if you abuse collections.</li>
</ul>
<p><strong>Therefore -- ensure you have a good use case for collections and that you understand their limitations.</strong></p>
<h3 id="details">Details:</h3>
<p>Here are some code examples and results that summarize what kinds of collections generate tombstones and which don't.</p>
<p>Let's create a table with a map and a frozen map.</p>
<pre><code>cqlsh&gt; CREATE TABLE test.map_test (
    a text PRIMARY KEY,
    b map&lt;text, text&gt;,
    c frozen&lt;map&lt;text, text&gt;&gt;
)
</code></pre>
<p>and add some data to each:</p>
<pre><code>cqlsh&gt; insert into map_test (a, b, c) VALUES ('a', { '1':'a' }, { '2': 'b' }) ;

cqlsh&gt; select * from test.map_test ;

 a | b          | c
---+------------+------------
 a | {'1': 'a'} | {'2': 'b'}
</code></pre>
<p>Let's see what happened under the hood using <code>sstable2json</code> after flushing:</p>
<pre><code>$ sstable2json test-map_test-ka-1-Data.db
[
{&quot;key&quot;: &quot;a&quot;,
 &quot;cells&quot;: [[&quot;&quot;,&quot;&quot;,1458266095727275],
           [&quot;b:_&quot;,&quot;b:!&quot;,1458266095727274,&quot;t&quot;,1458266095],
           [&quot;b:31&quot;,&quot;61&quot;,1458266095727275],
           [&quot;c&quot;,&quot;0000000100000001320000000162&quot;,1458266095727275]]}
]
</code></pre>
<p>Notice the t (tombstone) in b. There is no tombstone in c. This is because frozen collections are stored all together in a single cassandra cell. No tombstone necessary for inserts.</p>
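<p>You can check this by decoding the <code>c</code> cell's hex value from the output above by hand, assuming the standard length-prefixed collection layout (a 4-byte entry count, then a 4-byte length before each key and each value):</p>

```shell
# 00000001 | 00000001 32 | 00000001 62
#  1 entry | key '2'     | value 'b'
HEX=0000000100000001320000000162
COUNT=$((0x$(echo "$HEX" | cut -c1-8)))   # entry count: 1
KEYHEX=$(echo "$HEX" | cut -c17-18)       # 32 hex = ASCII '2'
VALHEX=$(echo "$HEX" | cut -c27-28)       # 62 hex = ASCII 'b'
echo "entries=$COUNT key=0x$KEYHEX value=0x$VALHEX"
```

<p>The single cell holds the whole map, which is why an insert can simply overwrite it without a tombstone.</p>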
<p>Now let's try an update</p>
<pre><code>cqlsh&gt; update test.map_test SET b = { '3': 'c'}, c = {'3':'c'} where a='a' ;

cqlsh&gt; select * from test.map_test ;

 a | b          | c
---+------------+------------
 a | {'3': 'c'} | {'3': 'c'}

(1 rows)
</code></pre>
<p>After flushing we get a new sstable, also with a tombstone in b:</p>
<pre><code>$ sstable2json test-map_test-ka-2-Data.db 
[
{&quot;key&quot;: &quot;a&quot;,
 &quot;cells&quot;: [[&quot;b:_&quot;,&quot;b:!&quot;,1458266473158221,&quot;t&quot;,1458266473],
           [&quot;b:33&quot;,&quot;63&quot;,1458266473158222],
           [&quot;c&quot;,&quot;0000000100000001330000000163&quot;,1458266473158222]]}
]
</code></pre>
<p>Does a compaction get rid of the tombstone?</p>
<pre><code>$ nodetool compact

$ sstable2json test-map_test-ka-3-Data.db

[
{&quot;key&quot;: &quot;a&quot;,
 &quot;cells&quot;: [[&quot;&quot;,&quot;&quot;,1458266095727275],
           [&quot;b:_&quot;,&quot;b:!&quot;,1458266473158221,&quot;t&quot;,1458266473],
           [&quot;b:33&quot;,&quot;63&quot;,1458266473158222],
           [&quot;c&quot;,&quot;0000000100000001330000000163&quot;,1458266473158222]]}
]
</code></pre>
<p>No! Remember, tombstones must live longer than gc_grace AND meet the criteria in your tombstone compaction subproperties to get deleted. This helps avoid zombie data.</p>
<p>Now let's try incremental update:</p>
<pre><code>cqlsh&gt; update test.map_test SET b = b + { '4': 'd'}, c = c + {'4':'d'} where a='a' ;

InvalidRequest: code=2200 [Invalid query] message=&quot;Invalid operation (c = c + {'4':'d'}) for frozen collection column c&quot;

cqlsh&gt; update test.map_test SET b = b + { '4': 'd'} where a='a' ;

$ sstable2json test-map_test-ka-4-Data.db
[
{&quot;key&quot;: &quot;a&quot;,
 &quot;cells&quot;: [[&quot;b:34&quot;,&quot;64&quot;,1458266948817380]]}
]

</code></pre>
<p>Only the non-frozen collection supports this kind of incremental update. Notice that it did not produce a tombstone: tombstones only happen for inserts and non-incremental updates on non-frozen collections.</p>
]]></content:encoded></item><item><title><![CDATA[Tuning DSE Search - Indexing latency and query latency]]></title><description><![CDATA[<h2 id="introduction">Introduction</h2>
<p>DSE offers out of the box search indexing for your Cassandra data. The days of double writes or ETL's between separate DBMS and Search clusters are gone.</p>
<p>I have my cql table, I execute the following API call, and (boom) my cassandra data is available for:</p>
<ol>
<li>full text/fuzzy</li></ol>]]></description><link>https://www.sestevez.com/tuning-dse-search/</link><guid isPermaLink="false">5a56690447d72f4798d9e47e</guid><dc:creator><![CDATA[Sebastian Estevez]]></dc:creator><pubDate>Thu, 17 Mar 2016 22:02:39 GMT</pubDate><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>DSE offers out of the box search indexing for your Cassandra data. The days of double writes or ETL's between separate DBMS and Search clusters are gone.</p>
<p>I have my cql table, I execute the following API call, and (boom) my cassandra data is available for:</p>
<ol>
<li>full text/fuzzy search</li>
<li>ad hoc lucene secondary index powered filtering, and</li>
<li>geospatial search</li>
</ol>
<p>Here is my API call:</p>
<pre><code>$ bin/dsetool create_core &lt;keyspace&gt;.&lt;table&gt; generateResources=true reindex=true
</code></pre>
<p>or if you prefer curl (or are using basic auth) use the following:</p>
<pre><code>$ curl &quot;http://localhost:8983/solr/admin/cores?action=CREATE&amp;name=&lt;keyspace&gt;.&lt;table&gt;&amp;generateResources=true&quot;
</code></pre>
<p>Rejoice! We are in inverted index, single cluster, operational simplicity bliss!</p>
<p>The remainder of this post will be focused on <strong>advanced tuning</strong> for DSE Search both for <strong>a)</strong> search indexing latency (the time it takes for data to be searchable after it has been inserted through cql), and <strong>b)</strong> search query latency (timings for your search requests).</p>
<h2 id="indexinglatency">Indexing latency</h2>
<p>In this section I'll talk about the kinds of things we can do in order to</p>
<ol>
<li>instrument and monitor DSE Search indexing and</li>
<li>tune indexing for lower latencies and increased performance</li>
</ol>
<p><strong>Note</strong>: DSE Search ships with Real Time (RT) indexing which will give you faster indexing with 4.7.3, especially when it comes to the tails of your latency distribution. Here's one of our performance tests. It shows you real time vs near-real time indexing as of 4.7.0:</p>
<p><img src="https://s3.amazonaws.com/uploads.hipchat.com/6528/1116934/P10Ckn4e4cijTf0/upload.png" alt="indexing chart"></p>
<p>Perhaps more importantly, as you get machines with more cores, you can continue to increase your indexing performance linearly:<br>
<img src="https://s3.amazonaws.com/uploads.hipchat.com/6528/1116934/5OFvz6SgZsl68b1/Screen%20Shot%202016-03-15%20at%2011.03.22%20PM.png" alt="rt vs nrt"></p>
<p>Be aware, however, that you should only run one RT search core per cluster since it is significantly more resource hungry than near-real time (NRT).</p>
<p><strong>Side note on GC</strong>: Because solr and cassandra run on the same JVM in DSE Search and the indexing process generates a lot of java objects, running Search requires a larger JVM heap. When running traditional <strong>CMS</strong>, we recommend a 14gb heap with about 2gb new gen. Consider Stump's <a href="https://issues.apache.org/jira/browse/CASSANDRA-8150">CASSANDRA-8150</a> settings when running search with CMS. <strong>G1GC</strong> has been found to perform quite well with search workloads; I personally run with a 25gb heap (do not set new gen with G1 -- the whole point of G1 is that it sizes it itself based on your workload!) and <code>gc_pause_ms</code> at about 1000 (go higher for higher throughput or lower to minimize latencies / p99's; don't go below 500). Update (thanks mc): you configure this setting in cassandra-env.sh.</p>
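<p>As a concrete (hypothetical) sketch, a <code>cassandra-env.sh</code> fragment matching those recommendations might look like the following; the flags are standard HotSpot options, and the values are the ones suggested above:</p>

```shell
# G1 settings for a DSE Search node (sketch; tune for your own hardware)
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"               # G1 instead of CMS
JVM_OPTS="$JVM_OPTS -Xms25G -Xmx25G"            # large heap for search workloads
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=1000"  # pause target; stay above 500
# Deliberately no -Xmn / new gen sizing: G1 sizes the young gen itself
```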
<h3 id="1instrumentation">1) Instrumentation</h3>
<p><strong>Index Pool Stats:</strong></p>
<p>DSE Search parallelizes the indexing process and allocates work to a thread pool for indexing of your data.</p>
<p>Using JMX, you can see statistics on your indexing threadpool depth, completion, timings, and whether backpressure is active.</p>
<p>This is important because if your indexing queues get too deep, we risk having too much heap pressure =&gt; OOM's. Backpressure will throttle commits and eventually load shed if search can't keep up with an indexing workload. Backpressure gets triggered when the queues get too large.</p>
<p>The mbean is called:</p>
<pre><code>com.datastax.bdp.search.&lt;keyspace&gt;.&lt;table&gt;.IndexPool
</code></pre>
<p><img src="https://s3.amazonaws.com/uploads.hipchat.com/6528/1116934/ObHJEFKOEuQnLm8/upload.png" alt="Indexing queues"></p>
<p><strong>Commit/Update Stats:</strong></p>
<p>You can also see statistics on indexing performance (in microseconds) based on the particular stage of the indexing process for both <code>commit</code>s and <code>update</code>s.</p>
<p><strong>Commit:</strong></p>
<blockquote>
<p>The stages are:</p>
</blockquote>
<blockquote>
<p><code>FLUSH</code> - Comprising the time spent by flushing the async indexing queue.</p>
</blockquote>
<blockquote>
<p><code>EXECUTE</code> - Comprising the time spent by actually executing the commit on the index.</p>
</blockquote>
<blockquote>
<p>The mbean is called:</p>
</blockquote>
<blockquote>
<p><code>com.datastax.bdp.search.&lt;keyspace&gt;.&lt;table&gt;.CommitMetrics</code></p>
</blockquote>
<p><strong>Update:</strong></p>
<blockquote>
<p>The stages are:</p>
</blockquote>
<blockquote>
<p><code>WRITE</code> - Comprising the time spent to convert the Solr document and write it into Cassandra (only available when indexing via the Solrj HTTP APIs). If you're using cql this will be 0.<br>
<code>QUEUE</code> - Comprising the time spent by the index update task into the index pool.<br>
<code>PREPARE</code>- Comprising the time spent preparing the actual index update.<br>
<code>EXECUTE</code> - Comprising the time spent to actually executing the index update on Lucene.</p>
</blockquote>
<blockquote>
<p>The mbean is:</p>
</blockquote>
<blockquote>
<pre><code>com.datastax.bdp.search.&lt;keyspace&gt;.&lt;table&gt;.UpdateMetrics
</code></pre>
</blockquote>
<p><img src="https://s3.amazonaws.com/uploads.hipchat.com/6528/1116934/CLrZrQNbRatrmck/upload.png" alt="indexing stats"></p>
<p>Here, the average latency for the QUEUE stage of the <code>update</code> is 767 micros. See our docs for more details on the <a href="http://docs.datastax.com/en/latest-dse/datastax_enterprise/srch/srchCmtQryMbeans.html">metrics mbeans</a> and their stages.</p>
<h3 id="2tuning">2) Tuning</h3>
<p>Almost everything in c* and DSE is configurable. Here are the key levers to get better search indexing performance. Based on what you see in your instrumentation, you can tune accordingly.</p>
<p>The main lever is <code>soft autocommit</code>: the maximum amount of time that will go by before newly written data becomes available for search. With RT we can set it to 250ms or even as low as 100ms--given the right hardware. Tune this based on your SLA's.</p>
<p>The next most important lever is concurrency per core (or <code>max_solr_concurrency_per_core</code>). You can usually set this to number of CPU cores available to maximize indexing throughput.</p>
<p>Backpressure threshold will become more important as your load increases. Larger boxes can handle higher bp thresholds.</p>
<p>Don't forget to set up the ramBuffer to 2gb per the docs when you turn on RT indexing.</p>
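<p>As a sketch, the levers above map to settings like the following. The values here are illustrative, not recommendations; soft autocommit and the ram buffer live in <code>solrconfig.xml</code>, while concurrency and backpressure live in <code>dse.yaml</code>:</p>
<pre><code>&lt;!-- solrconfig.xml (illustrative values) --&gt;
&lt;autoSoftCommit&gt;
  &lt;maxTime&gt;250&lt;/maxTime&gt;            &lt;!-- ms before writes become searchable --&gt;
&lt;/autoSoftCommit&gt;
&lt;ramBufferSizeMB&gt;2048&lt;/ramBufferSizeMB&gt;  &lt;!-- 2 GB for RT indexing --&gt;

# dse.yaml (illustrative values)
max_solr_concurrency_per_core: 8          # roughly the CPU cores available
back_pressure_threshold_per_core: 2000    # raise on larger boxes
</code></pre>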
<h2 id="querylatency">Query Latency</h2>
<p>Now, I'll go over how we can monitor query performance in DSE Search, identify issues, and some of the tips and tricks we can use to improve search query performance. I will cover how to:</p>
<ol>
<li>instrument and monitor DSE Search queries and</li>
<li>tune queries for lower latencies and increased performance.</li>
</ol>
<p>Similar to how search indexing performance scales with CPUs, search query performance scales with RAM. Keeping your search indexes in the OS page cache is the biggest thing you can do to minimize latencies, so scale deliberately!</p>
<h3 id="1instrumentation">1) Instrumentation</h3>
<p>There are multiple tools available for monitoring search performance.</p>
<h4 id="opscenter">OpsCenter:</h4>
<p>OpsCenter supports a few search metrics that can be configured per node, datacenter, and solr core:</p>
<ol>
<li>search latencies</li>
<li>search requests</li>
<li>index size</li>
<li>search timeouts</li>
<li>search errors</li>
</ol>
<p><img src="https://s3.amazonaws.com/uploads.hipchat.com/6528/1116934/uoq1hLRQ58AZBhn/Screen%20Shot%202016-03-15%20at%2011.47.12%20PM.png" alt="opscenter"></p>
<h4 id="metricsmbeans">Metrics mbeans:</h4>
<p>In the same way that indexing has performance metrics, DSE Search <a href="https://docs.datastax.com/en/datastax_enterprise/4.0/datastax_enterprise/srch/srchQryMbean.html">query performance metrics</a> are available through JMX and can be useful for troubleshooting performance issues. We can use the <code>query.name</code> parameter in our DSE Search queries to capture metrics for specifically tagged queries.</p>
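<p>As a sketch of tagging a query for these metrics via CQL (the keyspace, table, and field names here are hypothetical; check the linked docs for the exact parameter handling in your DSE version):</p>
<pre><code>SELECT * FROM ks.tbl
WHERE solr_query = '{&quot;q&quot;:&quot;title:foo&quot;,&quot;query.name&quot;:&quot;title_search&quot;}';
</code></pre>
<p>Latencies for queries tagged <code>title_search</code> then show up under that name in the query metrics mbean.</p>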
<p><strong>Query:</strong></p>
<p>The stages are:</p>
<blockquote>
<p><code>COORDINATE</code> - Comprises the total amount of time spent by the coordinator node to distribute the query and gather/process results from shards. This value is computed only on query coordinator nodes.</p>
</blockquote>
<blockquote>
<p><code>EXECUTE</code> - Comprises the time spent by a single shard to execute the actual index query. This value is computed on the local node executing the shard query.</p>
</blockquote>
<blockquote>
<p><code>RETRIEVE</code> - Comprises the time spent by a single shard to retrieve the actual data from Cassandra. This value will be computed on the local node hosting the requested data.</p>
</blockquote>
<p>The mbean is:</p>
<blockquote>
<p><code>com.datastax.bdp.search.&lt;keyspace&gt;.&lt;table&gt;.QueryMetrics</code></p>
</blockquote>
<h4 id="querytracing">Query Tracing:</h4>
<p>When using <code>solr_query</code> via CQL, query tracing can provide useful information as to where a particular query spent its time in the cluster.</p>
<p>Query tracing is available in cqlsh (<code>tracing on</code>), in DevCenter (in the tab at the bottom of the screen), and via probabilistic tracing, which is configurable via <a href="https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsSetTraceProbability.html">nodetool</a>.</p>
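<p>For example (table name hypothetical), tracing a single search query in cqlsh, or sampling a small fraction of all requests cluster-wide:</p>
<pre><code>cqlsh&gt; TRACING ON;
cqlsh&gt; SELECT * FROM ks.tbl WHERE solr_query = '{&quot;q&quot;:&quot;*:*&quot;}' LIMIT 10;
-- trace output follows the rows, showing elapsed time per node

$ nodetool settraceprobability 0.01   # trace ~1% of requests on this node
</code></pre>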
<h4 id="dsesearchslowquerylog">DSE Search slow query log:</h4>
<p>When users complain about a slow query and you need to find out what it is, the DSE Search slow query log is a good starting point.</p>
<pre><code>dsetool perf solrslowlog enable
</code></pre>
<p>Results are stored in Cassandra, in the <code>dse_perf.solr_slow_sub_query_log</code> table.</p>
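<p>Once enabled, you can inspect recent slow queries with plain CQL (the column layout may vary by DSE version):</p>
<pre><code>SELECT * FROM dse_perf.solr_slow_sub_query_log LIMIT 10;
</code></pre>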
<h3 id="2tuning">2) Tuning</h3>
<p>Now let's focus on some tips for how you can improve search query performance.</p>
<h3 id="indexsize">Index size</h3>
<p>Index size is so important that I wrote <a href="http://www.sestevez.com/solr-space-saving-profile/">a separate post</a> just on that subject.</p>
<h4 id="qvsfq">Q vs. FQ</h4>
<p>To take advantage of the Solr filter cache, build your queries using <code>fq</code>, not <code>q</code>. The filter cache is the only Solr cache that persists across commits, so don't spend time or valuable RAM trying to leverage the other caches.</p>
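<p>For example, keeping the relevance query in <code>q</code> and the repeatable, cacheable predicate in <code>fq</code> (table and field names are hypothetical):</p>
<pre><code>-- q carries the relevance query; fq predicates hit the filter cache
SELECT * FROM ks.products
WHERE solr_query = '{&quot;q&quot;:&quot;name:phone&quot;,&quot;fq&quot;:&quot;in_stock:true&quot;}';
</code></pre>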
<h4 id="solrqueryrouting">Solr query routing</h4>
<p>Partition routing is a great multi-tenancy feature in DSE Search that lets you limit the amount of fan-out a search query incurs under the hood. Essentially, you specify the Cassandra partition you want to limit your search to, which limits the number of nodes DSE Search needs to fulfill your request.</p>
<h4 id="usedocvaluesforfacetingandsorting">Use docvalues for Faceting and Sorting.</h4>
<p>To get improved performance and to avoid OOMs from the field cache, always remember to turn on docValues for fields you will be sorting and faceting on. This may become mandatory in DSE at some point, so plan ahead.</p>
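<p>In the Solr schema that looks something like this (the field name and type are hypothetical placeholders):</p>
<pre><code>&lt;!-- docValues stores column-oriented values on disk for sorting/faceting,
     keeping them out of the heap-based field cache --&gt;
&lt;field name=&quot;price&quot; type=&quot;double&quot; indexed=&quot;true&quot; docValues=&quot;true&quot;/&gt;
</code></pre>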
<h3 id="otherdsedifferentiators">Other DSE Differentiators</h3>
<p>If you're comparing DSE Search against other search offerings / technologies, the following two differentiators are unique to DSE Search.</p>
<h4 id="faulttolerantdistributedqueries">Fault tolerant distributed queries</h4>
<p>If a node dies during a query, we retry the query on another node.</p>
<h4 id="nodehealth">Node health</h4>
<p>Node health and shard router behavior.<br>
DSE Search monitors node health and makes distributed query routing decisions based on the following:</p>
<ol>
<li>Uptime: a node that just started may well be lacking the most up-to-date data (to be repaired via HH or AE).</li>
<li>Number of dropped mutations.</li>
<li>Number of hints the node is a target for.</li>
<li>&quot;failed reindex&quot; status.</li>
</ol>
<p>All you need to do to take advantage of this is be on a modern DSE version.</p>
]]></content:encoded></item><item><title><![CDATA[Things you didn't think you could do with DSE Search and CQL]]></title><description><![CDATA[<h4 id="intro">Intro</h4>
<p>CQL and DSE Search promise to make access to a Lucene-backed index scalable, highly available, operationally simple, and user friendly.</p>
<p>There have been a couple of developments in DSE 4.8 point releases that may have gone unnoticed by the community of DSE Search users.</p>
<p>One of the</p>]]></description><link>https://www.sestevez.com/advanced-solr-cql/</link><guid isPermaLink="false">5a56690547d72f4798d9e48c</guid><dc:creator><![CDATA[Sebastian Estevez]]></dc:creator><pubDate>Fri, 11 Mar 2016 19:55:48 GMT</pubDate><content:encoded><![CDATA[<h4 id="intro">Intro</h4>
<p>CQL and DSE Search promise to make access to a Lucene-backed index scalable, highly available, operationally simple, and user friendly.</p>
<p>There have been a couple of developments in DSE 4.8 point releases that may have gone unnoticed by the community of DSE Search users.</p>
<p>One of the main benefits of using DSE Search is that you can query the search indexes through CQL, directly from your favorite DataStax driver. Avoiding the Solr HTTP API altogether means that you:</p>
<ol>
<li>
<p>Don't need two sets of DAOs in your app, or application logic around which to use for what purpose</p>
</li>
<li>
<p>You don't need a load balancer in front of Solr/Tomcat because the DataStax drivers are cluster aware and will load balance for you</p>
</li>
<li>
<p>You don't need to worry about a fraction of your queries failing when a node behind your load balancer goes down</p>
</li>
<li>
<p>When security is enabled, requests through the HTTP API are significantly slower, to quote the DSE docs:</p>
</li>
</ol>
<blockquote>
<p>&quot;Due to the stateless nature of HTTP Basic Authentication, this can have a significant performance impact as the authentication process must be executed on each HTTP request.&quot;</p>
</blockquote>
<h4 id="whyeverusethehttpapi">Why ever use the HTTP API?</h4>
<p>The CQL interface is designed to return rows and columns, so features like Solr's numFound and faceting were not built into the first few releases.</p>
<p>These features have snuck in via patches in point releases and users that aren't studiously reading the <a href="http://docs.datastax.com/en/latest-dse/datastax_enterprise/RNdse.html?scroll=relnotes48__481Chgs">release notes</a> may not have noticed the changes.</p>
<p>How would I go about getting numFound and performing facet queries in the latest (DSE 4.8.1+) version of DSE?</p>
<h4 id="showmehow">Show me how</h4>
<p>If you know you just need the count (and not the data that comes along with it) then you can just specify count(*) and keep the solr_query where clause. DSE intercepts the query and brings back numDocs from DSE Search instead of actually performing the count in cassandra:</p>
<pre><code>SELECT count(*) FROM test.pymt WHERE solr_query = '{&quot;q&quot;:&quot;countryoftravel:\&quot;United States\&quot;&quot;}' ;

 count
-------
 39709 
</code></pre>
<p>Here it is with tracing enabled; notice that even my wide-open count(*) query completes in about 11 ms (<code>source_elapsed</code> is in microseconds):</p>
<pre><code>cqlsh&gt; SELECT count(*) FROM test.pymt WHERE solr_query = '{&quot;q&quot;:&quot;*:*&quot;}' ;

 count
--------
 817000

(1 rows)

Tracing session: 7020df80-e7a9-11e5-9c31-37116dd067c6

 activity                                                                                        | timestamp                  | source    | source_elapsed
-------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------
                                                                              Execute CQL3 query | 2016-03-11 11:51:02.136000 | 127.0.0.1 |              0
 Parsing SELECT count(*) FROM test.pymt WHERE solr_query = '{&quot;q&quot;:&quot;*:*&quot;}' ; [SharedPool-Worker-1] | 2016-03-11 11:51:02.136000 | 127.0.0.1 |             34
                                                       Preparing statement [SharedPool-Worker-1] | 2016-03-11 11:51:02.136000 | 127.0.0.1 |             84
                                                                                Request complete | 2016-03-11 11:51:02.146918 | 127.0.0.1 |          10918
</code></pre>
<p>The same goes for facet queries. Note that because of the way the cql protocol is designed (around rows and columns), DSE returns the facet results inside a single cell in JSON format. Pretty slick:</p>
<pre><code>select * FROM test.pymt WHERE solr_query='{&quot;facet&quot;:{&quot;pivot&quot;:&quot;physicianprimarytype&quot;},&quot;q&quot;:&quot;*:*&quot;}' ;
 facet_pivot
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 {&quot;physicianprimarytype&quot;:[{&quot;field&quot;:&quot;physicianprimarytype&quot;,&quot;value&quot;:&quot;doctor&quot;,&quot;count&quot;:813638},{&quot;field&quot;:&quot;physicianprimarytype&quot;,&quot;value&quot;:&quot;medical&quot;,&quot;count&quot;:720967},{&quot;field&quot;:&quot;physicianprimarytype&quot;,&quot;value&quot;:&quot;of&quot;,&quot;count&quot;:92671},{&quot;field&quot;:&quot;physicianprimarytype&quot;,&quot;value&quot;:&quot;osteopathy&quot;,&quot;count&quot;:60123},{&quot;field&quot;:&quot;physicianprimarytype&quot;,&quot;value&quot;:&quot;dentistry&quot;,&quot;count&quot;:17132},{&quot;field&quot;:&quot;physicianprimarytype&quot;,&quot;value&quot;:&quot;optometry&quot;,&quot;count&quot;:11447},{&quot;field&quot;:&quot;physicianprimarytype&quot;,&quot;value&quot;:&quot;medicine&quot;,&quot;count&quot;:3969},{&quot;field&quot;:&quot;physicianprimarytype&quot;,&quot;value&quot;:&quot;podiatric&quot;,&quot;count&quot;:3969},{&quot;field&quot;:&quot;physicianprimarytype&quot;,&quot;value&quot;:&quot;chiropractor&quot;,&quot;count&quot;:192}]}
</code></pre>
<h4 id="tldr">TL;DR</h4>
<p>You don't have to use the HTTP API for search queries, even if you need numFound and faceting. They are now supported via CQL and solr_query.</p>
<h4 id="futures">Futures</h4>
<p>Remember I mentioned that the CQL protocol is designed around rows and columns? Well, check out this ticket resolved in C* 2.2.0 beta 1: <a href="https://issues.apache.org/jira/browse/CASSANDRA-8553">CASSANDRA-8553</a>. If you use your imagination, there are some improvements that can be made once DSE gets C* 3.0 under the hood to make Search functionality even more slick.</p>
<p>Stay tuned!</p>
<h4 id="morefeatures">More Features!</h4>
<p>I meant to stop here, but when I asked folks to review this post, they pointed out some additional DSE Search features that get overlooked. I'll briefly describe them and link to documentation. If you're new to DSE Search, definitely read on:</p>
<h5 id="partitonrouting">Partition routing:</h5>
<p><a href="http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchRouteQuery.html#srchRouteQuery__srchRouteQuery_unique_1">Partition routing</a> is a great multi-tenancy feature that lets you limit the amount of fan-out a search query incurs under the hood.<br>
Essentially, you specify the Cassandra partition you want to limit your search to, which limits the number of nodes DSE Search needs to fulfill your request.</p>
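<p>The exact syntax is in the linked docs; as a rough sketch (the <code>route.partition</code> parameter, the table, and the key value here are all assumptions to check against the docs for your DSE version), a query scoped to one tenant's partition might look like:</p>
<pre><code>SELECT * FROM ks.tbl
WHERE solr_query = '{&quot;q&quot;:&quot;body:hello&quot;,&quot;route.partition&quot;:[&quot;tenant1&quot;]}';
</code></pre>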
<h5 id="jsonqueries">JSON queries</h5>
<p>If you're looking to do advanced queries through CQL (beyond just a simple search), check out the <a href="http://docs.datastax.com/en/latest-dse/datastax_enterprise/srch/srchJSON.html?scroll=srchJSON__srchJSONsinglePh">DataStax documentation for JSON queries</a>.</p>
<h5 id="timeallowed">timeAllowed</h5>
<p>Many search use cases don't actually require the backend to scan the entire dataset. If you're just trying to fill a page with search results, and latency matters more than a complete result set (i.e., you don't care about numFound), the <a href="https://wiki.apache.org/solr/CommonQueryParameters">timeAllowed</a> parameter lets you set a maximum latency; DSE Search will return the results it has found so far.</p>
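<p>For example, capping a query at 500 ms (the table name is hypothetical; <code>timeAllowed</code> is the standard Solr common query parameter, in milliseconds):</p>
<pre><code>SELECT * FROM ks.tbl
WHERE solr_query = '{&quot;q&quot;:&quot;*:*&quot;,&quot;timeAllowed&quot;:500}';
</code></pre>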
<p>Please comment if you have any additional Search DSE Features that you think are overlooked!</p>
]]></content:encoded></item><item><title><![CDATA[Minimizing DSE Search (solr) Indexes]]></title><description><![CDATA[<h4 id="introwhy">Intro / why?</h4>
<p>Search query performance depends on our ability to utilize the OS page cache effectively to keep search indexes hot. The smaller the size of your indexes, the easier it will be for the OS to maintain them in memory.</p>
<p>This article shows 6 tactics that can be used</p>]]></description><link>https://www.sestevez.com/solr-space-saving-profile/</link><guid isPermaLink="false">5a56690547d72f4798d9e48a</guid><dc:creator><![CDATA[Sebastian Estevez]]></dc:creator><pubDate>Tue, 19 Jan 2016 15:52:00 GMT</pubDate><content:encoded><![CDATA[<h4 id="introwhy">Intro / why?</h4>
<p>Search query performance depends on our ability to utilize the OS page cache effectively to keep search indexes hot. The smaller the size of your indexes, the easier it will be for the OS to maintain them in memory.</p>
<p>This article shows 6 tactics that can be used to minimize the size of your DSE Search index.</p>
<h4 id="tactics">Tactics</h4>
<p>Here are the tactics you can employ to minimize your DSE Search index size:</p>
<ol>
<li>Turn off <a href="http://wiki.apache.org/solr/TermVectorComponent">Term Vector</a> information if you're not using highlighting or other functionality that relies on it:</li>
</ol>
<p>• <code>termVectors=&quot;false&quot;</code></p>
<p>• <code>termPositions=&quot;false&quot;</code></p>
<p>• <code>termOffsets=&quot;false&quot;</code></p>
<ol start="2">
<li>Turn on <a href="https://wiki.apache.org/solr/SchemaXml">omit norms</a> if you're not using Boosts:</li>
</ol>
<p>• <code>omitNorms=&quot;true&quot;</code></p>
<p><em><strong>Note:</strong></em> From what I've seen, term vectors and norms can make up a substantial percentage of your index, on the order of 50%.</p>
<ol start="3">
<li>
<p>Only index the fields you intend to search. Most use cases don't require every field to be indexed for search.</p>
</li>
<li>
<p>Make sure you're not indexing your <code>_partitionKey</code> field (modern DSE versions may skip it by default):</p>
</li>
</ol>
<p><code>&lt;field name=&quot;_partitionKey&quot; type=&quot;uuid&quot; indexed=&quot;false&quot;/&gt;</code></p>
<ol start="5">
<li>
<p>Use StrField rather than TextField (no tokenizers) when you don't need full-text analysis</p>
</li>
<li>
<p>TrieField <a href="https://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/search/NumericRangeQuery.html#precisionStepDesc">precisionStep</a> - A higher precision step will increase query latency but it will decrease the index size.</p>
</li>
</ol>
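<p>Putting tactics 1, 2, and 5 together, a lean searchable field might be declared like this in your schema (the field name and the <code>StrField</code> type name are placeholders; match them to the fieldTypes your schema actually defines):</p>
<pre><code>&lt;!-- no term vectors, no norms, untokenized string field --&gt;
&lt;field name=&quot;category&quot; type=&quot;StrField&quot; indexed=&quot;true&quot;
       termVectors=&quot;false&quot; termPositions=&quot;false&quot; termOffsets=&quot;false&quot;
       omitNorms=&quot;true&quot;/&gt;
</code></pre>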
<h4 id="learnmoreaboutyourindexes">Learn more about your indexes</h4>
<p>You can also introspect your indexes using Luke. Luke is bundled in DSE so you can access it from a browser by hitting:</p>
<p>http://:8983/solr/./admin/luke?&amp;numTerms=0</p>
]]></content:encoded></item></channel></rss>