Apache Cassandra

Introduction

Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle large amounts of data across multiple commodity servers, providing high availability with no single point of failure.

Key Features:

Distributed architecture
Linear scalability
High availability
Tunable consistency
Multi-datacenter replication
Fault tolerance

Architecture

Distributed Design

Components:

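Node: A single Cassandra instance; the basic unit of storage
Ring: The logical arrangement of nodes, each owning a range of tokens
Datacenter: Group of related nodes, configured together for replication
Cluster: One or more datacenters managed as a single database
Partition: Rows that share a partition key; the unit of data distribution and replication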
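
The node, datacenter, and cluster levels of this hierarchy can be inspected from cqlsh. This is a minimal sketch, assuming a running cluster whose version exposes the system.local and system.peers tables; the available columns may differ slightly between versions.

```cql
-- Cluster, datacenter, and rack of the node this session is connected to
SELECT cluster_name, data_center, rack, host_id
FROM system.local;

-- The other nodes this node knows about, with their datacenter and rack
SELECT peer, data_center, rack, host_id
FROM system.peers;
```

nodetool status reports the same topology for all nodes at once.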

Data Distribution

-- Replication strategy example
CREATE KEYSPACE example_keyspace
WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'DC1': 3,
    'DC2': 2
};
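
Replication settings can be read back and changed after a keyspace exists. The following is a sketch, assuming Cassandra 3.0 or later (where schema metadata lives in system_schema) and the example_keyspace defined above; dev_keyspace is a hypothetical name used only for illustration.

```cql
-- Single-datacenter keyspace for local development (hypothetical name).
-- SimpleStrategy places replicas without regard to datacenter or rack.
CREATE KEYSPACE IF NOT EXISTS dev_keyspace
WITH replication = {
    'class': 'SimpleStrategy',
    'replication_factor': 1
};

-- Read back the replication settings of every keyspace
SELECT keyspace_name, replication
FROM system_schema.keyspaces;

-- Raise DC2 from two replicas to three
ALTER KEYSPACE example_keyspace
WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'DC1': 3,
    'DC2': 3
};
```

Changing the replication factor does not move existing data by itself; running nodetool repair on the affected nodes afterwards lets the new replicas receive it.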

Data Model

Table Design

-- Create table with partition and clustering keys
CREATE TABLE users (
    organization_id uuid,
    email text,
    username text,
    created_at timestamp,
    last_login timestamp,
    PRIMARY KEY ((organization_id), email)
) WITH CLUSTERING ORDER BY (email ASC);

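-- Materialized view (experimental; disabled by default in Cassandra 4.0 and later)
-- The view's primary key must include every primary key column of the base table
CREATE MATERIALIZED VIEW users_by_username AS
SELECT * FROM users
WHERE username IS NOT NULL AND organization_id IS NOT NULL AND email IS NOT NULL
PRIMARY KEY ((username), organization_id, email);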

Data Types

Available Types:

Basic: text, int, boolean, timestamp
Collections: set, list, map
Custom: user-defined types (UDT)
Counter: distributed counter
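
The type families listed above can be combined in table definitions. This is an illustrative sketch, not a recommended schema: the address type and the user_profiles and page_views tables are hypothetical names, and the statements assume a keyspace is already selected in cqlsh.

```cql
-- User-defined type (UDT)
CREATE TYPE IF NOT EXISTS address (
    street text,
    city text,
    postal_code text
);

-- Basic types, collections, and a frozen UDT in one table
CREATE TABLE IF NOT EXISTS user_profiles (
    user_id uuid PRIMARY KEY,
    display_name text,
    is_active boolean,
    emails set<text>,
    recent_logins list<timestamp>,
    preferences map<text, text>,
    home_address frozen<address>
);

-- Counter columns live in dedicated tables: every non-key column must be a counter
CREATE TABLE IF NOT EXISTS page_views (
    page_id text PRIMARY KEY,
    views counter
);

-- Counters are changed with UPDATE only, never INSERT
UPDATE page_views SET views = views + 1 WHERE page_id = 'home';

-- Add one element to a set without rewriting the whole collection
UPDATE user_profiles
SET emails = emails + {'alt@example.com'}
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;
```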

CQL Basics

Basic Operations

-- Insert data (uuid() would generate a random organization_id instead)
INSERT INTO users (organization_id, email, username)
VALUES (123e4567-e89b-12d3-a456-426614174000, 'user@example.com', 'johndoe');

-- Query data
SELECT * FROM users
WHERE organization_id = 123e4567-e89b-12d3-a456-426614174000
AND email = 'user@example.com';

-- Update data
UPDATE users
SET last_login = toTimestamp(now())
WHERE organization_id = 123e4567-e89b-12d3-a456-426614174000
AND email = 'user@example.com';

-- Delete data
DELETE FROM users
WHERE organization_id = 123e4567-e89b-12d3-a456-426614174000
AND email = 'user@example.com';
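
The statements above address a single row by its full primary key. Reads can also target a whole partition or a slice of it, which is what the partition and clustering keys of the users table are designed for. A short sketch against that table:

```cql
-- All users in one organization (a single partition),
-- returned in clustering order (email ASC)
SELECT email, username
FROM users
WHERE organization_id = 123e4567-e89b-12d3-a456-426614174000;

-- Slice on the clustering column within that partition
SELECT email, username
FROM users
WHERE organization_id = 123e4567-e89b-12d3-a456-426614174000
AND email >= 'a' AND email < 'n';

-- Rejected unless ALLOW FILTERING is appended, because username is
-- neither a key column nor indexed:
-- SELECT * FROM users WHERE username = 'johndoe';
```

Queries that need ALLOW FILTERING scan far more data than they return, which is usually a sign that a separate table keyed for that query would serve better.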

Operations

Node Operations

# Start Cassandra in the foreground
cassandra -f

# Check cluster status
nodetool status

# Run anti-entropy repair to synchronize this node's replicas
nodetool repair

# Remove data this node no longer owns (run on existing nodes after adding a node)
nodetool cleanup

Backup and Recovery

# Create snapshot
nodetool snapshot keyspace_name

# Clear all snapshots for a keyspace (recent versions require --all or -t snapshot_name)
nodetool clearsnapshot --all -- keyspace_name

# Incremental backup: enable it, then flush so new SSTables are hard-linked into backups/
nodetool enablebackup
nodetool flush keyspace_name table_name

Best Practices

Data Modeling

Guidelines:

Model around query patterns
Denormalize data
Keep partitions bounded; avoid very wide or unbounded partitions
Choose appropriate partition keys
Consider data distribution
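
One way these guidelines can play out is a table built for exactly one query, with a time bucket in the partition key so that no partition grows without bound. This is a sketch under assumptions: the login_events_by_user table, its columns, and the one-day bucket size are hypothetical choices, not part of the schema above.

```cql
-- Query to serve: "latest logins for one user on a given day"
CREATE TABLE IF NOT EXISTS login_events_by_user (
    organization_id uuid,
    email text,
    day date,               -- time bucket: caps partition growth at one day
    event_time timestamp,
    ip_address inet,
    PRIMARY KEY ((organization_id, email, day), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- The query the table was designed for: one partition, newest rows first
SELECT event_time, ip_address
FROM login_events_by_user
WHERE organization_id = 123e4567-e89b-12d3-a456-426614174000
AND email = 'user@example.com'
AND day = '2024-01-15'
LIMIT 10;
```

The composite partition key spreads one user's history across many partitions and nodes, at the cost of issuing one query per day when a longer range is needed.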

Performance Optimization

Tips:

Use appropriate consistency levels
Keep batches small and within a single partition; do not use them for bulk loading
Monitor partition sizes
Use proper compaction strategy
Configure garbage collection
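
Several of these tips map directly onto statements that can be run from cqlsh. A minimal sketch against the users table, assuming a multi-datacenter cluster; the choice of LOCAL_QUORUM and LeveledCompactionStrategy is illustrative and depends on the workload.

```cql
-- cqlsh command (not part of CQL itself): set the consistency level
-- for subsequent requests in this session
CONSISTENCY LOCAL_QUORUM;

-- Single-partition batch: both rows share the same organization_id,
-- so the batch is applied to one partition on one set of replicas
BEGIN BATCH
    INSERT INTO users (organization_id, email, username)
    VALUES (123e4567-e89b-12d3-a456-426614174000, 'alice@example.com', 'alice');
    INSERT INTO users (organization_id, email, username)
    VALUES (123e4567-e89b-12d3-a456-426614174000, 'bob@example.com', 'bob');
APPLY BATCH;

-- Switch the table to a compaction strategy often chosen for read-heavy workloads
ALTER TABLE users
WITH compaction = {'class': 'LeveledCompactionStrategy'};
```

With the replication settings shown earlier (three replicas in DC1), LOCAL_QUORUM needs two replicas in the coordinator's datacenter to acknowledge each request.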

Tools and Resources

Management Tools

nodetool - Node management
cqlsh - CQL shell
JMX monitoring tools
DataStax DevCenter - CQL IDE (no longer maintained)
Cassandra Reaper - Repair service

Monitoring

# Show cluster name, snitch, partitioner, and schema versions
nodetool describecluster

# View thread pools
nodetool tpstats

# Check compaction stats
nodetool compactionstats

# View table statistics
nodetool tablestats keyspace_name
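
On Cassandra 4.0 and later, some of the same information is also exposed through virtual tables in the system_views keyspace. This is a sketch, assuming a 4.0+ node; the set of virtual tables and their columns varies by version, and each query reflects only the node the session is connected to.

```cql
-- Thread pool activity, comparable to nodetool tpstats
SELECT * FROM system_views.thread_pools;

-- Running compactions and other SSTable tasks, comparable to nodetool compactionstats
SELECT * FROM system_views.sstable_tasks;

-- Currently connected clients
SELECT * FROM system_views.clients;
```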