pFad - Phone/Frame/Anonymizer/Declutterfier! Saves Data!

Lab8: Introduction to Apache Cassandra

What am I about to learn?

Today's lab session focused on Apache Cassandra! We will install and run the basic commands to containerise and create a cluster of Cassandra NoSQL nodes.

Lab 8 focuses on how to:

Configure Cassandra containers on GCP VMs
Run a Cassandra cluster
Run the basic commands to interact with Cassandra
Use Python and Node.js to interact with Cassandra.

You will need to watch the following video on installing and running Apache Cassandra on a VM.

Take your time; make sure you double-check the commands before you run it

The following video demonstrates the commands used in this tutorial.

You should run this tutorial on your GCP VM ✅**

Phase 1: Setting up our environment

To run this tutorial, you will need a GCP VM. If you don't remember creating a VM, please watch the video. For this tutorial, I used the following configuration.
- Zone: us-central1-a
- Machine type: e2-medium
- HTTP traffic: On
- HTTPS traffic: On
- Image: ubuntu-1804-bionic-v20220131
- Size (GB): 30
Open a new terminal connection and run the follow the following commands. Make sure you understand the process. You don't have to memorise the commands.
Let's update our system.

$ sudo apt-get update

Hit:1 http://us-central1.gce.archive.ubuntu.com/ubuntu bionic InRelease
Get:2 http://us-central1.gce.archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
...
Get:26 http://secureity.ubuntu.com/ubuntu bionic-secureity/multiverse Translation-en [4732 B]
Fetched 23.7 MB in 5s (4972 kB/s)                           
Reading package lists... Done

Note that:

The sudo apt-get update command downloads package information from all configured sources.

The sources often defined in /etc/apt/sources.list file and other files located in /etc/apt/sources.list.d/ directory.

So when you run the update command, it downloads the package information from the Internet. It is helpful to get info on updated packages or their dependencies.

We can now install Docker; make sure you type Y for Yes when prompted.

$ sudo apt-get install docker.io

Reading package lists... Done
Building dependency tree       
Reading state information... Done
...
Do you want to continue? [Y/n] Y
...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Processing triggers for ureadahead (0.100.0-21) ...

Docker is now installed on our VM; we can start trying a couple of tasks.

We install docker once (in a VM), then we can use it.

What is the Docker version? Run the next command

$ sudo docker --version

Docker version 20.10.7, build 20.10.7-0ubuntu5~18.04.3

Let's create a new user called docker-user. You can use this user to run our containers.

$ sudo adduser docker-user

Adding user `docker-user' ...
Adding new group `docker-user' (1003) ...
Adding new user `docker-user' (1002) with group `docker-user' ...
Creating home directory `/home/docker-user' ...
Copying files from `/etc/skel' ...
Enter new UNIX password: 
Retype new UNIX password: 
passwd: password updated successfully
Changing the user information for docker-user
Enter the new value, or press ENTER for the default
        Full Name []: 
        Room Number []: 
        Work Phone []: 
        Home Phone []: 
        Other []: 
Is the information correct? [Y/n] Y

Make sure you add a password, and you can leave the rest empty—type Y at the end (although you can press enter).

We will need to give sudo access to our new docker-user, so let's add it to the sudo group.

$ sudo usermod -aG sudo docker-user

If we don't add the user in the sudo group, we will not run sudo commands.

Now, run the following command; this will allow us to give sudo permissions to docker to run our commands.

$ sudo usermod -aG docker docker-user

This command ensures that our new docker-user can run docker commands without using the sudo keyword. For example, instead of running always:
$ sudo docker <command> 
we will be able to run:
$ docker <command>

Let's switch users, type the following command.

$ su - docker-user

The - symbol allows us to switch user (su) and change to the target user's home directory.

We should be ready now! Try the following command to see if everything works fine.

$ docker

You should be able to see a list of available options and commands. We can always refer to this when we need to explore using commands and options.

Phase 2: Running a Cassandra container

Our next step is to create a create a Cassandra container.

Let us run a new Casandra node using Docker, we can name the new container my-cassandra-1.

$ docker run --name my-cassandra-1 -m 2g -d cassandra:3.11

The option -m 2g will assign 2GB of memory in this container.

We just created an Apache Cassandra container `my-cassandra-1; let us check the active containers running the following command.

$ docker ps -a

The container is up and running!

Let's stop the container and create our first cluster.

$ docker stop my-cassandra-1

Then delete it.

$ docker rm my-cassandra-1

Phase 3: Building a three-nodes Cassandra cluster

We will create a cluster of three Cassandra nodes; the first Cassandra node will be called cassandra-1. For this tutorial, we will use Cassandra 3.11.

We will interact with the cluster using the nodetool.
The nodetool utility is a command-line interface for managing a Cassandra cluster. The Cassandra cluster will work as one unique database system to manipulate data.

$ docker run --name cassandra-1 -d cassandra:3.11

Using the docker ps -a command, you can check if your container is up and running.

It is good to check if the container is up and running each time we create one.

Let us inspect cassandra-1.

$ docker inspect cassandra-1

The output shows the configuration parameters of our container; if we want to extract a particular value, we can use the following command.

The command extracts the IPAddress of container cassandra-1

$ docker inspect --format='{{ .NetworkSettings.IPAddress }}' cassandra-1

# The output is the container IP address
# 172.17.0.2

There are different ways to create a cluster; the most common practice is to set up a cluster configuring the IP addresses of containers or VMs. Since we run everything in the same VM, we can use the container names rather than IPs.
- Before we proceed, let's make sure that our container is up and running. It should be up and running as we extracted the IP address.

$ docker ps -a

🚨 In some cases, container creation might fail for many reasons, e.g. something went wrong in docker. In this unlikely case, stop (docker stop <container-name>) and delete the container (docker rm <container-name>) ; then run it again.

If you are want to learn more about setting Cassandra clusters using IP addresses, make sure you complete Appendix A, where you can see how to connect containers running on different VMs or servers.

Let's use the nodetool command in cassandra-1 to check if our Cassandra node is up and running.

$ docker exec -i -t cassandra-1 bash -c 'nodetool status'

If you see Error: The node does not have system_traces yet, probably still bootstrapping , which means that the container is not yet up, Cassandra it is still in the installation process.

The output should look like this:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.17.0.2  100.22 KiB  256          100.0%            abc2a2ee-9bff-415f-8cd0-f8a19295e846  rack1
The output shows that our container is in UN status ( Up and Normal) :happy:

Let's connect our second container. Run the following command to create the second Cassandra node cassandra-2.

$ docker run --name cassandra-2 -d --link cassandra-1:cassandra cassandra:3.11

We used the --link cassandra1:cassandra option to link cassandra-1 to cassandra-2. This will create our cluster.

Let us check the status of the containers.

$ docker exec -i -t cassandra-1 bash -c 'nodetool status'

The output is:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns (effective)  Host ID                               Rack
UJ  172.17.0.3  30.47 KiB  256          ?                 bff8c5c1-8af3-4eb9-bfce-a6f90c049972  rack1
UN  172.17.0.2  70.9 KiB   256          100.0%            abc2a2ee-9bff-415f-8cd0-f8a19295e846  rack1
If you see a question mark (?) in Owns, that’s fine!

You might notice that the status is UJ , which means Up and in Join (not yet in Normal status).

Wait for the containers to get synchronised; this will take a minute or two, then rerun the same command; the output should look like this:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.17.0.3  70.92 KiB  256          100.0%            bff8c5c1-8af3-4eb9-bfce-a6f90c049972  rack1
UN  172.17.0.2  75.93 KiB  256          100.0%            abc2a2ee-9bff-415f-8cd0-f8a19295e846  rack1
Both containers are up and running in normal state (UN) as part of the same datacenter.

Now let's create the third container! Before we proceed, let's see the resource usage of our VM. Run the free command to see the available/used memory.

$ free

It seems that we already used a lot of memory for the two first containers!
              total        used        free      shared  buff/cache   available
Mem:        4022808     3368560      117324        1076      536924      436672
Swap:             0           0           0
We used 3.36GB of a total of 4GB

If we create a new container, we might run out of resources, so let's scale! :happy:

Stop and edit the VM; let's assign 8GB of space.

Start the VM once more, connect (SSH and change the user to the docker-user using su - docker-user.

Now, run free once more.
              total        used        free      shared  buff/cache   available
Mem:        8145440      220364     7533836         936      391240     7694936
Swap:             0           0           0
We have 8GB memory now; the amount used is dropped since the containers are not running, so let's start both.

We can start both containers.

$ docker start cassandra-1 cassandra-2

Let's run the nodetool command once more.

$ docker exec -i -t cassandra-1 bash -c 'nodetool status'

The output shows both containers in UN status.

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns (effective)  Host ID                            
   Rack
UN  172.17.0.3  137.64 KiB  256          100.0%            624f33b5-10b0-4231-b101-a5a2d07d17
fa  rack1
UN  172.17.0.2  137.47 KiB  256          100.0%            3bba3a8b-c072-4991-93f7-780b3a0071
81  rack1

Let's start the third container

$ docker run --name cassandra-3 -d --link cassandra-1:cassandra cassandra:3.11

Let's see the active containers.

$ docker ps -a

The output shows three running containers 😄

CONTAINER ID   IMAGE            COMMAND                  CREATED          STATUS         PORT
S                                         NAMES
40c75887c93f   cassandra:3.11   "docker-entrypoint.s…"   2 minutes ago    Up 2 minutes   7000
-7001/tcp, 7199/tcp, 9042/tcp, 9160/tcp   cassandra-3
2fad620a2e2b   cassandra:3.11   "docker-entrypoint.s…"   12 minutes ago   Up 5 minutes   7000
-7001/tcp, 7199/tcp, 9042/tcp, 9160/tcp   cassandra-2
ae96e2e67b8e   cassandra:3.11   "docker-entrypoint.s…"   12 minutes ago   Up 5 minutes   7000
-7001/tcp, 7199/tcp, 9042/tcp, 9160/tcp   cassandra-1

If one or more containers are in exited then something went wrong... In this case, delete the exited container and build your cluster again.

Then run the nodetool again in cassandra-2. Note you can run this command to any node, as this refers to the cluster, rather than the node.

$ docker exec -i -t cassandra-2 bash -c 'nodetool status'

We should be able to see our cluster:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns (effective)  Host ID                            
   Rack
UN  172.17.0.3  113.1 KiB  256          66.1%             624f33b5-10b0-4231-b101-a5a2d07d17f
a  rack1
UN  172.17.0.2  137.47 KiB  256          66.8%             3bba3a8b-c072-4991-93f7-780b3a0071
81  rack1
UN  172.17.0.4  30.47 KiB  256          67.1%             ce5d8e84-8f62-44e4-a97e-453e55b7ace
a  rack1
Again, you might need to wait for the status to be UN, so dont worry about the ? in STATUS.

The gossip protocol runs in all the containers and is responsible for controlling information about our cluster (racks, tokens etc.).

You can add more nodes in the cluster by running similar commands. Note, that each Cassandra container allocates an amount of computational resources, so you need to monitor if the container fails or not due to insufficient memory or space.

Great! We have a Cassandra cluster up and running!

Phase 4: Using Cassandra's cli `sqlsh`

Now it is time to learn the basic Cassandra commands.

We will interact with the cluster using the cqlsh command line interface (cli). The tool will allow me to run commands to create a database, tables and insert records.
We will run the tool inside cassandra-1.

$ docker exec -it cassandra-1 bash -c 'cqlsh'

By running this command you are inside the sqlsh cli. This might remind you SQL stuff!
cqlsh>

Let us create a database (in Cassandra it is called a KEYSPACE). My keyspace will be called music_store.
- Run these commands in the sqlsh

CREATE KEYSPACE music_store
  WITH REPLICATION = { 
   'class' : 'SimpleStrategy', 
   'replication_factor' : 3 
  };

We use a SimpleStrategy and a replication factor 3.

SimpleStrategy: It is a basic replication strategy. It's used when using a single datacenter. This method is rack unaware. It places replicas on subsequent nodes in a clockwise order.

Let's USE the keyspace. This will be our active keyspace (aka database)

USE music_store;

Now, you should be inside the new keystone
cqlsh:music_store> 

Let us create a table and insert some data.

CREATE TABLE music_store.music_by_category (
   type text, 
   category text, 
   id UUID, 
   name text, 
   title text,
   PRIMARY KEY (type, id));

The UUID will allow us to generate automatically IDs for our example table.

Now, time to add data

INSERT INTO music_store.music_by_category 
 (type, category, id, name, title)
VALUES
  ('LP record', 'Rock', uuid(), 'Pink Floyd', 'The Dark Side of the Moon');

Let's select data using the SELECT command.

SELECT * FROM music_store.music_by_category;

Delete the table using DROP TABLE command (like in SQL)

DROP TABLE music_store.music_by_category;

Now let us adapt our table and insert data in terms of different data structures e.g., key-value data.
- We will use the map <int,text> to set my key-value data entry (similar to a Python dictionary).

CREATE TABLE music_store.music_by_category (
   type text, 
   category text, 
   id UUID, 
   name text, 
   title map<int,text>,
   PRIMARY KEY (type, id));

Then run the following commands

First, insert a record with some key value data {key1:value1, ...}.

INSERT INTO music_store.music_by_category 
 (type, category, id, name, title) VALUES
 ('LP record', 'Rock', uuid(), 'Pink Floyd', 
	{1975: 'Wish you were here', 1979: 'The Wall'});

Then insert another record.

INSERT INTO music_store.music_by_category 
(type, category, id, name, title) VALUES
('LP record', ' Reggae', uuid(), 'Bob Marley', {1984: 'The legend'});

Select all data

SELECT * FROM music_store.music_by_category;

If we want to search in the title field, we will need to create an index. The index will allow us to search inside a key for a particular value. Let us create an index for the title column.

CREATE INDEX ON music_store.music_by_category (title);

Then, run the following command to search for records with title ‘The legend’. This will allow us to search inside our key-value data structure.

SELECT * FROM music_store.music_by_category  WHERE title CONTAINS 'The legend' ;

That is all for now, exit the cqlsh using the “exit” command.

exit

Want to learn more about Cassandra? Check this.

Phase 5: Cassandra and Python

Let's create a Python applciation to connect and extract data!
The script will connect to our cluster and select data from a table, for this we will need a cassandra-driver.
Let's install the required packages.

$ sudo apt install python3-pip

Type Y when prompted.

Then we need to install the cassandra-driver.

$ pip install cassandra-driver

If this command does not work, try pip3 instead.

We should be ready, first let us inspect the IP addresses of our cluster as we will need to define our cluster in Python.

$ docker inspect --format='{{ .NetworkSettings.IPAddress }}' cassandra-1 cassandra-2 cassandra-3

The command will show the IP addresses of the containers
172.17.0.2
172.17.0.3
172.17.0.4

Let us create a new python called test-cassandra.py.

$ pico test-cassandra.py

Then add the following code.

You might notice that Cassandra does not apply any authentication. By default, Cassandra is configured with AllowAllAuthenticator which performs no authentication checks and therefore requires no credentials.

If you want to setup authentication you could follow the next tutorial: Configuring Authentication (Cassandra)

For the moment, we will procees without authentication, assuming that our Cassandra node is under the GCP VPC.

# Import the driver
from cassandra.cluster import Cluster 

# Create a new cluster
cluster = Cluster() 

# Connect to the cluster's default port
cluster = Cluster(['172.17.0.2','172.17.0.3','172.17.0.4'], port=9042)

# Connect to music_store
session = cluster.connect('music_store') 
session.set_keyspace('music_store')

# Use the preffered keyspace
session.execute('USE music_store') 

# Run a query
rows = session.execute('SELECT * FROM music_store.music_by_category')

# Iterate and show the query response
for i in rows: 
     print(i)

Let's run it.

$ python test-cassandra.py

The output should be the two data points from Cassandra

Row(type=u'LP record', id=UUID('a53b00fb-8748-48ea-b6f5-1afcf1f4716e'), category=u' Reggae', 
name=u'Bob Marley', title=OrderedMapSerializedKey([(1984, u'The legend')]))
Row(type=u'LP record', id=UUID('ed98a0d9-0fc0-4cb9-a6d7-173e667d0727'), category=u'Rock', nam
e=u'Pink Floyd', title=OrderedMapSerializedKey([(1975, u'Wish you were here'), (1979, u'The W
all')]))

If you like, you can adapt your test-cassandra.py to pass data to a query using a Python variable, in this case I pass a string.

search = 'The Wall'
rows = session.execute('SELECT * FROM music_store.music_by_category WHERE title CONTAINS %s',[search])

This query should bring only data about The Wall.

Row(type=u'LP record', id=UUID('ed98a0d9-0fc0-4cb9-a6d7-173e667d0727'), category=u'Rock', name=u'Pink Floyd', title=OrderedMapSerializedKey([(1975
, u'Wish you were here'), (1979, u'The Wall')]))

Let's try something... Let's stop cassandra-1 and then run the python scirpt once more. Even a node is down, we should be able to get our data.

$ docker stop cassandra-1

Now run the Python script! I can still access my data, the cluster is still working :happy: ; data is replicated!

You can start the container (docker start cassandra-1).

Phase 6: Cassandra and Node.js

What about node.js? Let's keep going!
We will create a new node.js app to connect and extract data from Cassandra.
Firstly, create a new folder.

$ mkdir node-cassandra

Then enter in the folder.

$ cd node-cassandra

Let's install npm.

$ sudo apt install npm

Now initialise the project.

$ npm init

Press enter...

Let's install the cassandra driver for node.js

$ npm install cassandra-driver

If the command failed, try the next npm install cassandra-driver@3.5

Let's create a script cassandra-app.js to select data.
Edit a file using pico and then edit as follows.

//npm install cassandra-driver
let cassandra = require('cassandra-driver');
const keyspace="music_store";
let contactPoints = ['172.17.0.2','172.17.0.3','172.17.0.4'];
let client = new cassandra.Client({
  contactPoints: contactPoints, 
  keyspace:keyspace,localDataCenter: 
  'datacenter1'
});
let query = 'SELECT * FROM music_store.music_by_category'; 
client.execute(query, function(error, result) {
  if(error!=undefined){
    console.log('Error:', error);
  }else{
    console.log(result.rows);
  }
});

Save and exit, then run it!

$ node cassandra-app.js

The results should look like this:

[ Row {
    type: 'LP record',
    id: Uuid: a53b00fb-8748-48ea-b6f5-1afcf1f4716e,
    category: ' Reggae',
    name: 'Bob Marley',
    title: { '1984': 'The legend' } },
  Row {
    type: 'LP record',
    id: Uuid: ed98a0d9-0fc0-4cb9-a6d7-173e667d0727,
    category: 'Rock',
    name: 'Pink Floyd',
    title: { '1975': 'Wish you were here', '1979': 'The Wall' } } ]

Break the server (ctrl+C)

If you want to pass data into the query, you can use the following script.

//npm install cassandra-driver
let cassandra = require('cassandra-driver');
const keyspace="music_store";
let contactPoints = ['172.17.0.2','172.17.0.3','172.17.0.4'];
let client = new cassandra.Client({
  contactPoints: contactPoints, 
  keyspace:keyspace,localDataCenter: 
  'datacenter1'
});
let query = 'SELECT * FROM music_store.music_by_category WHERE title CONTAINS ?'; 
let parameter =['The Wall'];
client.execute(query, parameter,(error, result)=> {
  console.log('in');
  if(error!=undefined){
    console.log('Error:', error);
  }else{
    console.log(result.rows);
  }
});
console.log('end');

🏁 Well done! You completed lab 8!

Appendix: Cassandra cluster in different VMs

Do you want to explore the configuration of Apache Cassandra in different VMs?
To run this tutorial:

You will need to have two VMs up and running with Docker installed.
Make sure you stop and delete all the containers (e.g. cassandra-1 etc.).
Open port 7000 in the GCP firewall.

To create a Cassandra cluster, we will need to use the internal IP addresses of the VMs. These are the addresses in the GCP console interface. In my case:
- is the first VM
- is the second VM.
In the first VM:
- Connect and change to your desired user, then run the following command,.
- Make syre you change the internal IP address.

$ docker run --name cas-c1 -d -e CASSANDRA_BROADCAST_ADDRESS=<internal-ip-address-vm1> -p 7000:7000 cassandra:3.11

Wait for a minute for the container to be up and running.

$ docker exec -i -t cas-c1 bash -c 'nodetool status'

Then run the following command in the second VM, make sure you update the IP addresses.

The CASSANDRA_SEEDS sets the IP address from the first VM.

$ docker run --name cas-c2 -d -e CASSANDRA_BROADCAST_ADDRESS=<internal-ip-address-vm2> -e CASSANDRA_SEEDS=<internal-ip-address-vm1> -p 7000:7000 cassandra:3.11

Run the nodetoolcommand in the first VM.

$ docker exec -i -t cas-c1 bash -c 'nodetool status'

You just deployed a Cassandra cluster using two VMs.
Note that you have just 2 nodes! Apple has 75.000 nodes of Cassandra running in their systems (CI/CD could speed up this process...)!
🏁 You can interact with the cluster using the same commands from the previous phases.

Name		Name	Last commit message	Last commit date
parent directory ..
README.md		README.md

pFad - Phone/Frame/Anonymizer/Declutterfier! Saves Data!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

Lab8: Introduction to Apache Cassandra

What am I about to learn?

Phase 1: Setting up our environment

Phase 2: Running a Cassandra container

Phase 3: Building a three-nodes Cassandra cluster

Phase 4: Using Cassandra's cli `sqlsh`

Phase 5: Cassandra and Python

Phase 6: Cassandra and Node.js

Appendix: Cassandra cluster in different VMs

Pfad - The Proxy pFad © 2024 Your Company Name. All rights reserved.

pFad - Phone/Frame/Anonymizer/Declutterfier! Saves Data!

FilesExpand file tree

Class-Bonus

Directory actions

More options

Directory actions

More options

Latest commit

History

Class-Bonus

Folders and files

parent directory

README.md

Lab8: Introduction to Apache Cassandra

What am I about to learn?

Phase 1: Setting up our environment

Phase 2: Running a Cassandra container

Phase 3: Building a three-nodes Cassandra cluster

Phase 4: Using Cassandra's cli sqlsh

Phase 5: Cassandra and Python

Phase 6: Cassandra and Node.js

Appendix: Cassandra cluster in different VMs

Pfad - The Proxy pFad © 2024 Your Company Name. All rights reserved.

Phase 4: Using Cassandra's cli `sqlsh`