• Spark and Qlik Integration

    Steps: Start Spark Thrift Server on Datastax Cluster $ dse -u cassandra -p <password> spark-sql-thriftserver start --conf spark.cores.max=4 --conf spark.executor.memory=2G --conf spark.driver.maxResultSize=1G --conf spark.kryoserializer.buffer.max=512M --conf spark.sql.thriftServer.incrementalCollect=true Enable Qlik Server’s Security Group on AWS to access port 10000 (basically from qlik, need to connect to thrift server port 10000) Install Simba ODBC Driver for Spark on the Qilk Windows EC2 Instance Create System DSN as follows:
  • Spark SQL Query Explain Plan

    DSE 5.0.6, Spark 1.6.2 Example 1 CREATE TABLE avm.transactions ( txn_status text, txn_date date, user_name text, merchant_name text, txn_amt int, txn_id uuid, txn_time timestamp, PRIMARY KEY ((txn_status, txn_date), txn_time, txn_id) WITH CLUSTERING ORDER BY (txn_time DESC) ) From cqlsh, below query works cassandra@cqlsh:avm> select * from transactions where txn_status IN ('SUCCESS','FAILED') and txn_date IN ('2017-07-16', '2017-07-17') limit 10; Below query does a Full Table Scan(FTS) spark-sql>select * from transactions where txn_date IN (cast('2017-07-16' as date),cast('2017-07-17' as date)) and event_name in ('SUCCESS','FAILED') AND MERCHANT_NAME='merchant1' limit 10 Below query does not do FTS, because leading partition key column is txn_status
  • Accessing Datastax Spark - Basic Examples

    DSE 5.0.7, Spark 1.6.3 Accessing Spark Accessing spark from outside the spark cluster dse spark --master spark://MASTER_ADDRESS:7077 Spark SQL $ dse -u avm_analytics -p avm1111 spark-sql spark-sql> select count(*) from avm.transactions where txn_date='2017_05_08'; 446313 Time taken: 523.556 seconds, Fetched 1 row(s) Pyspark $ dse -u avm_analytics -p avm1111 pyspark >>> sqlContext.sql("select count(*) from avm.test") DataFrame[_c0: bigint] >>> sqlContext.sql("select count(*) from avm.test").collect() [Row(_c0=4)] >>> sqlContext.sql("select * from avm.test").collect() [Row(event_id=30, event_name=u'cat'), Row(event_id=1, event_name=u'aaa'), Row(event_id=2, event_name=u'bbb'), Row(event_id=40, event_name=u'fox')] Scala $ dse -u avm_analytics -p avm1111 spark Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .