Big Data

Big Data Integration with Cassandra Using Spark

  • Why Cassandra:
    • Before we discuss Cassandra, we need to discuss the CAP theorem. CAP stands for Consistency, Availability and Partition tolerance, and the theorem says that a distributed system can guarantee at most 2 of the 3 at the same time.
      • Consistency – means that once you write some data, any later read returns that data. Consistency can be ‘read after write’ or ‘eventual’:
        • read after write: you can read the data as soon as you write it.
        • eventual: there is a lag; the data is guaranteed to become readable, but possibly only a second or two after the write.
      • Availability – means the system keeps responding to requests no matter what, using master-slave, live-backup or primary-secondary mechanisms. Even if one or a few nodes in a fleet go down, the system is still available through the other nodes.
      • Partition tolerance – means the system keeps working even when the network between its nodes is split, which is what allows it to be highly distributed and scalable.
      • Cassandra is a NoSQL (non-relational) database with a SQL-like syntax.
      • Let's draw a CAP model and see how different data access technologies fare:
        • Cassandra – has tunable consistency (it leans towards eventual consistency, but can be tuned), is highly distributed and highly available.
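To make "tunable consistency" a little more concrete: Cassandra lets each read and each write choose how many replicas must respond. A common rule of thumb is that a read is guaranteed to see the latest write when the number of replicas read plus the number of replicas written is greater than the replication factor. The sketch below only illustrates that arithmetic. The function names and the replication factor of 3 are assumptions made for illustration; they are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority for the given replication factor."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


# Hypothetical cluster with replication factor 3 (an assumption for illustration).
rf = 3
print(quorum(rf))                                          # 2
print(reads_see_latest_write(rf, quorum(rf), quorum(rf)))  # True: quorum writes + quorum reads
print(reads_see_latest_write(rf, 1, 1))                    # False: eventual consistency only
```

Reading and writing with a single replica is the fast, eventually consistent end of the dial; quorum reads and writes trade some latency for read-after-write behaviour.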
           
  • Cassandra Architecture:
    • It is a distributed NoSQL database with SQL-like commands, called CQL.
    • CQL is similar in syntax to SQL but has limitations and differences, such as:
      • No joins are supported, so all data must be denormalized
      • Each table must have a primary key
      • Databases in Cassandra are called keyspaces
      • The command-line interface is called CQLSH
    • DataStax provides a connector for Cassandra + Spark
      • It allows Spark to use DataFrames to write data to and read data from Cassandra tables.
      • It can be used in the following use cases:
        • Data transformed in Spark is saved in Cassandra to be viewed by various presentation tools.
        • Data stored in Cassandra can be pulled by Spark for analysis.
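The two use cases above can be sketched in PySpark roughly as follows. This is a minimal sketch and not the author's script: the pipe-delimited input format (as in the MovieLens u.user file), the helper names, and the lowercase table and column names are assumptions. It also assumes the SparkSession was started with the spark-cassandra-connector package and with spark.cassandra.connection.host pointing at the Cassandra node.

```python
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"  # data source name used by the connector


def parse_user_line(line, sep="|"):
    """Turn one 'id|age|gender|occupation|zip' line into a dict (assumed input format)."""
    fields = line.strip().split(sep)
    return {
        "userid": int(fields[0]),
        "age": int(fields[1]),
        "gender": fields[2],
        "occupation": fields[3],
        "zip": fields[4],
    }


def write_users(spark, lines, keyspace="movielens", table="usersdata"):
    """Use case 1: save data prepared in Spark into a Cassandra table."""
    from pyspark.sql import Row  # imported lazily so the parser above works without Spark

    rows = [Row(**parse_user_line(line)) for line in lines]
    users = spark.createDataFrame(rows)
    (users.write
          .format(CASSANDRA_FORMAT)
          .mode("append")
          .options(table=table, keyspace=keyspace)
          .save())


def read_users(spark, keyspace="movielens", table="usersdata"):
    """Use case 2: pull a Cassandra table into a Spark DataFrame for analysis."""
    return (spark.read
                 .format(CASSANDRA_FORMAT)
                 .options(table=table, keyspace=keyspace)
                 .load())
```

With a running session, something like write_users(spark, open("u.user")) followed by read_users(spark).show() would exercise both directions; the file name here is again only an assumption.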
    • Figure: Cassandra Simplified Architecture
  • How to install Cassandra:
    • It's not part of Hortonworks Ambari or Cloudera clusters.
    • It needs to be installed either with Docker or manually.
    • Log in to the Ambari Sandbox using the ‘maria_dev’ credentials and elevate to the root user.
    • First update the sandbox – yum -y update
    • In case you get update issues – move the sandbox.repo file from the /etc/yum.repos.d folder to the /tmp folder like this:
      • mv /etc/yum.repos.d/sandbox.repo /tmp
    • Install the tools needed to install multiple versions of Python and use them without breaking each other –
      • yum -y install scl-utils – to install the scl utilities
      • yum -y install centos-release-scl-rh – to install the CentOS-related scl utilities
      • yum -y install python27 – to install Python 2.7
      • scl enable python27 bash – to enable Python 2.7
    • Create a repository for DataStax: nano /etc/yum.repos.d/datastax.repo
      • Copy the content below and save the file.
      • [datastax]
        name = Datastax Apache Cassandra Repository
        baseurl = http://rpm.datastax.com/community
        enabled = 1
        gpgcheck = 0
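A typo in the repo file is an easy way to make the following yum step fail, so it can help to sanity-check the content before saving it. The sketch below is one possible check using Python's standard configparser; the variable names are assumptions for illustration, and it only parses the text rather than writing to /etc/yum.repos.d.

```python
import configparser

# The repo definition from the step above, held as a string for checking.
REPO_TEXT = """[datastax]
name = Datastax Apache Cassandra Repository
baseurl = http://rpm.datastax.com/community
enabled = 1
gpgcheck = 0
"""

parser = configparser.ConfigParser()
parser.read_string(REPO_TEXT)

repo = parser["datastax"]
print(repo["baseurl"])              # the URL yum will fetch packages from
print(repo.getboolean("enabled"))   # True: the repository is switched on
print(repo.getboolean("gpgcheck"))  # False: package signatures are not verified
```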
    • Now you are ready to install Cassandra: yum -y install dsc30
    • Now let's install some dependencies for its CQLSH: yum -y install sqlsh and pip install cqlsh
    • Now let's start the Cassandra service – service cassandra start
    • Run the command nodetool enablethrift
    • Now let's start the CQLSH command line – cqlsh --cqlversion="3.4.0"
    • Once you are in the CQLSH shell, you can create a keyspace and then tables.
    • First create a keyspace called movielens – CREATE KEYSPACE movielens WITH replication = {'class' : 'SimpleStrategy', 'replication_factor' : '1'} AND durable_writes = true;
    • Now use the movielens keyspace: USE movielens;
    • Now create table : CREATE TABLE usersData (userID int, age int, gender text, occupation text, zip text, PRIMARY KEY(userID));
    • View Empty table – SELECT * FROM usersData;
    • View table structure – DESCRIBE usersData;
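One thing to keep in mind about the table just created: CQL treats unquoted identifiers as case-insensitive and stores them in lowercase, so usersData is stored as usersdata and userID as userid. Only double-quoted identifiers keep their case. The helper below is a hypothetical illustration of that rule, not a Cassandra or driver function.

```python
def fold_cql_identifier(name: str) -> str:
    """Return the name Cassandra stores for an identifier written in CQL.

    Unquoted identifiers are folded to lowercase; double-quoted ones keep
    their case (a doubled quote inside stands for one literal quote).
    """
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        return name[1:-1].replace('""', '"')
    return name.lower()


print(fold_cql_identifier("usersData"))    # usersdata
print(fold_cql_identifier("userID"))       # userid
print(fold_cql_identifier('"usersData"'))  # usersData
```

This matters when another tool refers to the table by name: it should use the lowercase form unless the table was created with quoted identifiers.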
    • We have now created a table too. Next we will use Spark to push data into this table.
    • Please download the script https://s3.amazonaws.com/testbucket786786/SparkCassandraIntegration.py
    • Let's execute the script – spark-submit --packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11 --conf spark.cassandra.connection.host="127.0.0.1" SparkCassandraIntegration.py (spark-submit options such as --conf must come before the script name; anything after the script name is passed to the script as an argument)
    • Let's verify that the data was written to the table: log back in to cqlsh, use the movielens keyspace and run a SELECT.
    • To stop the Cassandra service – service cassandra stop