I often get asked about the different ways to move data across DSE clusters (prod to QA, old cluster to new cluster, multi-cluster ETL). The options range from custom apps and cassandra-loader / unloader (which I've talked about in another post) to Apache Spark (TM).

Of these options, Spark is the most scalable and performant, but it is also the most intimidating for a new user. Fortunately, DataStax Enterprise (DSE) makes integrating Apache Spark (TM) with Apache Cassandra (TM) trivial. Additionally, the spark-cassandra-connector is mature and user friendly, so writing the Scala code required for a cluster migration is also trivial.

Unfortunately, the prospect of learning to program a new compute framework and setting up SBT to build a Scala job can be daunting, and it sometimes keeps folks from exploring this avenue.

Here is a pre-built Spark job that you can simply run against your clusters to perform a Spark-powered migration.

Spark has to be enabled on one of the clusters. SSH into that cluster and run:

wget https://github.com/phact/dse-cluster-migration/releases/download/v0.01/dse-cluster-migration_2.10-0.1.jar

dse [-u <user> -p <password>] spark-submit --class phact.MigrateTable --conf spark.dse.cluster.migration.fromClusterHost='<from host>' --conf spark.dse.cluster.migration.toClusterHost='<to host>' --conf spark.dse.cluster.migration.keyspace='<keyspace>' --conf spark.dse.cluster.migration.table='<table>' --conf spark.dse.cluster.migration.newtableflag='<true | false>' --conf spark.dse.cluster.migration.fromuser='<username>' --conf spark.dse.cluster.migration.frompassword='<password>' --conf spark.dse.cluster.migration.touser='<username>' --conf spark.dse.cluster.migration.topassword='<password>' ./dse-cluster-migration_2.10-0.1.jar
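
If you have several tables to move, the command above can be wrapped in a small shell loop. The following is a minimal sketch and not part of the migration app: the variable names (FROM_HOST, TO_HOST, KEYSPACE, TABLES, NEW_TABLE, DRY_RUN), the function name migrate_table, and the placeholder hosts and table names are all assumptions made for illustration. It uses only the --conf options shown above and leaves out the optional credentials, which can be appended the same way when authentication is enabled. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: run the pre-built migration job once per table.
# All variable names and default values below are hypothetical placeholders.
set -eu

FROM_HOST="${FROM_HOST:-from-host.example}"   # node in the source cluster
TO_HOST="${TO_HOST:-to-host.example}"         # node in the destination cluster
KEYSPACE="${KEYSPACE:-my_keyspace}"
TABLES="${TABLES:-table_one table_two}"       # space-separated table names
NEW_TABLE="${NEW_TABLE:-false}"               # passed through as newtableflag
DRY_RUN="${DRY_RUN:-true}"                    # true: print only, false: submit
JAR="./dse-cluster-migration_2.10-0.1.jar"

# Build the spark-submit arguments for one table, then print or run them.
migrate_table() {
  set -- spark-submit --class phact.MigrateTable \
    --conf "spark.dse.cluster.migration.fromClusterHost=${FROM_HOST}" \
    --conf "spark.dse.cluster.migration.toClusterHost=${TO_HOST}" \
    --conf "spark.dse.cluster.migration.keyspace=${KEYSPACE}" \
    --conf "spark.dse.cluster.migration.table=$1" \
    --conf "spark.dse.cluster.migration.newtableflag=${NEW_TABLE}" \
    "${JAR}"
  if [ "${DRY_RUN}" = "true" ]; then
    echo "dse $*"
  else
    dse "$@"
  fi
}

# Because of set -e, the loop stops at the first table whose job fails.
for table in ${TABLES}; do
  migrate_table "${table}"
done
```

Run it once with the default DRY_RUN=true to check the generated commands, then rerun with DRY_RUN=false on the Spark-enabled node once they look right.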

Update: I added username / password support and table creation to the migration app.

Shout out to Russ and Brian for their help.

For a deeper dive into how this code works, check out Russ's post.