• Spark SQL Query Explain Plan

    DSE 5.0.6, Spark 1.6.2 Example 1 CREATE TABLE avm.transactions ( txn_status text, txn_date date, user_name text, merchant_name text, txn_amt int, txn_id uuid, txn_time timestamp, PRIMARY KEY ((txn_status, txn_date), txn_time, txn_id) WITH CLUSTERING ORDER BY (txn_time DESC) ) From cqlsh, below query works cassandra@cqlsh:avm> select * from transactions where txn_status IN ('SUCCESS','FAILED') and txn_date IN ('2017-07-16', '2017-07-17') limit 10; Below query does a Full Table Scan(FTS) spark-sql>select * from transactions where txn_date IN (cast('2017-07-16' as date),cast('2017-07-17' as date)) and event_name in ('SUCCESS','FAILED') AND MERCHANT_NAME='merchant1' limit 10 Below query does not do FTS, because leading partition key column is txn_status
  • Accessing Datastax Spark - Basic Examples

    DSE 5.0.7, Spark 1.6.3 Accessing Spark Accessing spark from outside the spark cluster dse spark --master spark://MASTER_ADDRESS:7077 Spark SQL $ dse -u avm_analytics -p avm1111 spark-sql spark-sql> select count(*) from avm.transactions where txn_date='2017_05_08'; 446313 Time taken: 523.556 seconds, Fetched 1 row(s) Pyspark $ dse -u avm_analytics -p avm1111 pyspark >>> sqlContext.sql("select count(*) from avm.test") DataFrame[_c0: bigint] >>> sqlContext.sql("select count(*) from avm.test").collect() [Row(_c0=4)] >>> sqlContext.sql("select * from avm.test").collect() [Row(event_id=30, event_name=u'cat'), Row(event_id=1, event_name=u'aaa'), Row(event_id=2, event_name=u'bbb'), Row(event_id=40, event_name=u'fox')] Scala $ dse -u avm_analytics -p avm1111 spark Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .