• How to create a simple Cassandra Cluster on AWS

    What is Cassandra? Apache Cassandra is a free and open-source, distributed, wide-column-store NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It is a high-performance, extremely scalable, fault-tolerant, post-relational database that combines the benefits of Google Bigtable and Amazon Dynamo to handle workloads that traditional RDBMS vendors cannot support.
  • Datastax Cassandra benchmark on OCI

    In this blog, we show how we created a Datastax Cassandra cluster on Oracle Cloud Infrastructure (OCI) using Terraform and benchmarked Oracle Cloud bare-metal machines with cassandra-stress.

    1. Steps to Create the Cluster
       - Go here to download the Terraform project.
       - Follow the guide to set up your environment.
       - Edit the file env-vars and fill in all the relevant info for your OCI account.
       - Edit variables.tf and select the shape of your OCI instance; we used DenseIO1.
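The steps above follow the standard Terraform workflow. A minimal sketch, assuming env-vars exports the TF_VAR_* credentials for your OCI account (file and variable names come from the downloaded project, not from this post):

```shell
# Load OCI credentials into the environment (env-vars ships with the project).
source env-vars

# Standard Terraform workflow: fetch the OCI provider, preview, then create
# the Cassandra cluster resources defined in the project.
terraform init
terraform plan
terraform apply
```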
  • Spark and Qlik Integration

    Steps:
    1. Start the Spark SQL Thrift Server on the Datastax cluster:
       $ dse -u cassandra -p <password> spark-sql-thriftserver start \
           --conf spark.cores.max=4 \
           --conf spark.executor.memory=2G \
           --conf spark.driver.maxResultSize=1G \
           --conf spark.kryoserializer.buffer.max=512M \
           --conf spark.sql.thriftServer.incrementalCollect=true
    2. Update the Qlik server's security group on AWS to allow access to port 10000 (Qlik needs to connect to the Thrift server on port 10000).
    3. Install the Simba ODBC Driver for Spark on the Qlik Windows EC2 instance.
    4. Create a System DSN as follows:
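Before configuring the ODBC DSN, it can help to verify that the Thrift Server is reachable on port 10000. A hypothetical check using beeline, the JDBC CLI that ships with Spark (the host placeholder and credentials below are illustrative):

```shell
# Connect to the Spark SQL Thrift Server over JDBC and list tables.
# <node1_ip> and <password> are placeholders for your cluster's values.
dse beeline -u jdbc:hive2://<node1_ip>:10000 \
    -n cassandra -p <password> \
    -e 'SHOW TABLES;'
```

If this succeeds from the Qlik host, the security group rule and the Thrift Server are both working, and any remaining issues are on the ODBC/DSN side.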
  • Datastax Spark on AWS

    Configuration: DSE 5.0.6 (see Datastax Cassandra on AWS for installation details)

    /etc/dse/spark/spark-env.sh
    export SPARK_PUBLIC_DNS=<node1_public_ip>
    export SPARK_DRIVER_MEMORY="2048M"
    export SPARK_WORKER_CORES=2
    export SPARK_WORKER_MEMORY="4G"

    /etc/dse/spark/spark-defaults.conf
    spark.scheduler.mode FAIR
    spark.cores.max 2
    spark.executor.memory 1g
    spark.cassandra.auth.username analytics
    spark.cassandra.auth.password *****
    spark.scheduler.allocation.file /etc/dse/spark/fairscheduler.xml
    spark.eventLog.enabled true
    # spark.default.parallelism: 3 cores * 4 nodes = 12
    spark.default.parallelism 12

    /etc/dse/spark/fairscheduler.xml
    <allocations>
      <pool name="default">
        <schedulingMode>FAIR</schedulingMode>
        <weight>1</weight>
        <minShare>4</minShare>
      </pool>
      <pool name="admin">
        <schedulingMode>FAIR</schedulingMode>
        <weight>1</weight>
        <minShare>4</minShare>
      </pool>
    </allocations>

    $ grep initial_spark_worker_resources /etc/dse/dse.yaml
    initial_spark_worker_resources: 0.7

    With this setting, when you start dse spark or dse spark-sql, the Spark UI shows 3 out of 4 cores allocated.
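The "3 out of 4 cores" follows from initial_spark_worker_resources: DSE gives the Spark worker that fraction of the machine's cores. A sketch of the arithmetic (the rounding behavior here is an assumption inferred from the observed result, not from DSE documentation):

```shell
# initial_spark_worker_resources: 0.7 on a 4-core node
total_cores=4
fraction=0.7

# 4 * 0.7 = 2.8; adding 0.5 before truncating rounds to the nearest integer.
worker_cores=$(awk -v c="$total_cores" -v f="$fraction" \
    'BEGIN { printf "%d", (c * f) + 0.5 }')

echo "$worker_cores"   # 3 cores available to the Spark worker
```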
  • Datastax Cassandra on AWS

    Setting up AWS
    EC2 Instance Type: m4.xlarge, 4-node cluster, 2 nodes in each AZ
    Storage: two EBS volumes (400 GB data, 150 GB log) plus a 150 GB root volume (General Purpose SSD)
    OS: Amazon Linux AMI 2016.09.0 (HVM), SSD Volume Type - ami-b953f2da

    The Amazon Linux AMI is an EBS-backed, AWS-supported image. The default image includes AWS command line tools, Python, Ruby, Perl, and Java. The repositories include Docker, PHP, MySQL, PostgreSQL, and other packages.
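The node layout above can be sketched as a single AWS CLI call per node. The key name, security group, subnet, and device names below are placeholders for your environment; only the AMI ID, instance type, and volume sizes come from the post:

```shell
# Launch one cluster node: m4.xlarge, Amazon Linux AMI, gp2 (General Purpose
# SSD) volumes for root (150 GB), data (400 GB), and logs (150 GB).
# Repeat per node, alternating --subnet-id across the two AZs.
aws ec2 run-instances \
    --image-id ami-b953f2da \
    --instance-type m4.xlarge \
    --key-name my-key \
    --security-group-ids sg-xxxxxxxx \
    --subnet-id subnet-xxxxxxxx \
    --block-device-mappings '[
      {"DeviceName":"/dev/xvda","Ebs":{"VolumeSize":150,"VolumeType":"gp2"}},
      {"DeviceName":"/dev/sdf","Ebs":{"VolumeSize":400,"VolumeType":"gp2"}},
      {"DeviceName":"/dev/sdg","Ebs":{"VolumeSize":150,"VolumeType":"gp2"}}
    ]'
```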