Apache Cassandra

Introduction

Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle large amounts of data across multiple commodity servers, providing high availability with no single point of failure.

Key Features:

  • Distributed architecture
  • Linear scalability
  • High availability
  • Tunable consistency
  • Multi-datacenter replication
  • Fault tolerance

Architecture

Distributed Design

Components:

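  • Node: A single Cassandra instance; the basic unit of storage
  • Ring: The token space, divided among nodes so that each node owns a range of tokens
  • Datacenter: A group of nodes, usually co-located, that replication is configured for as a unit
  • Cluster: One or more datacenters that together hold the full dataset
  • Partition: All rows sharing a partition key; the unit of data distribution and replication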
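
The relationship between nodes, the ring, and partitions can be sketched in a few lines of Python. This is an illustrative model only: the Ring class, the node names, and the token values are hypothetical, MD5 stands in for Cassandra's default Murmur3 partitioner, and replica placement ignores racks and datacenters.

```python
import bisect
import hashlib
from typing import Dict, List


def token(partition_key: str) -> int:
    """Hash a partition key to a 64-bit ring position (MD5 as a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Hypothetical single-datacenter ring: each node owns the range ending at its token."""

    def __init__(self, node_tokens: Dict[str, int]):
        # Keep (token, node) pairs sorted so the ring can be walked clockwise.
        self._entries = sorted((t, n) for n, t in node_tokens.items())
        self._tokens = [t for t, _ in self._entries]

    def replicas(self, partition_key: str, rf: int) -> List[str]:
        """Return the owner of the key's token plus the next nodes clockwise."""
        # First node whose token is >= the key's token; wraps to index 0 past the end.
        start = bisect.bisect_left(self._tokens, token(partition_key))
        count = min(rf, len(self._entries))
        size = len(self._entries)
        return [self._entries[(start + i) % size][1] for i in range(count)]


# Four hypothetical nodes with evenly spaced tokens.
ring = Ring({"n1": 0, "n2": 2**62, "n3": 2**63, "n4": 3 * 2**62})
print(ring.replicas("user@example.com", 3))
```

Every row with the same partition key hashes to the same token, so the whole partition lands on the same set of replicas.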

Data Distribution

-- Replication strategy example
CREATE KEYSPACE example_keyspace
WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'DC1': 3,
    'DC2': 2
};
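
To see what these replication factors mean for quorum-based consistency levels, here is a small Python calculation. A quorum is a majority of replicas (floor(RF / 2) + 1); the helper name quorum is an assumption made for this sketch, and the datacenter names are copied from the keyspace above.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


# Replication settings from example_keyspace.
replication = {"DC1": 3, "DC2": 2}

# LOCAL_QUORUM counts only replicas in the coordinator's datacenter.
local_quorum = {dc: quorum(rf) for dc, rf in replication.items()}

# QUORUM counts replicas across all datacenters (total RF = 5).
cluster_quorum = quorum(sum(replication.values()))

print(local_quorum, cluster_quorum)
```

With a replication factor of 2, a local quorum needs both replicas, so DC2 cannot serve LOCAL_QUORUM requests while one of its replicas is down. A replication factor of 3 tolerates one unavailable replica.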

Data Model

Table Design

-- Create table with partition and clustering keys
CREATE TABLE users (
    organization_id uuid,
    email text,
    username text,
    created_at timestamp,
    last_login timestamp,
    PRIMARY KEY ((organization_id), email)
) WITH CLUSTERING ORDER BY (email ASC);

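-- Materialized view (its primary key must include every primary key column of the base table)
CREATE MATERIALIZED VIEW users_by_username AS
SELECT * FROM users
WHERE username IS NOT NULL AND organization_id IS NOT NULL AND email IS NOT NULL
PRIMARY KEY ((username), organization_id, email);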
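
As a rough mental model of how the users table is laid out, the Python sketch below groups rows by partition key (organization_id) and sorts each group by clustering key (email). The sample rows and the layout function are hypothetical, and real storage involves memtables and SSTables rather than dictionaries.

```python
from collections import defaultdict

# Hypothetical sample rows for the users table.
rows = [
    {"organization_id": "org-a", "email": "zoe@example.com", "username": "zoe"},
    {"organization_id": "org-b", "email": "bob@example.com", "username": "bob"},
    {"organization_id": "org-a", "email": "amy@example.com", "username": "amy"},
]


def layout(rows):
    """Group rows by partition key, then order each partition by clustering key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["organization_id"]].append(row)
    for partition in partitions.values():
        # CLUSTERING ORDER BY (email ASC)
        partition.sort(key=lambda r: r["email"])
    return dict(partitions)


stored = layout(rows)
for org, members in stored.items():
    print(org, [m["email"] for m in members])
```

A query that fixes organization_id reads a single partition, and rows within it come back already sorted by email.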

Data Types

Available Types:

  • Basic: text, int, boolean, timestamp
  • Collections: set, list, map
  • Custom: user-defined types (UDT)
  • Counter: distributed counter

CQL Basics

Basic Operations

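-- Insert data (a fixed organization_id so the statements below address the same row)
INSERT INTO users (organization_id, email, username)
VALUES (123e4567-e89b-12d3-a456-426614174000, 'user@example.com', 'johndoe');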

-- Query data
SELECT * FROM users
WHERE organization_id = 123e4567-e89b-12d3-a456-426614174000
AND email = 'user@example.com';

-- Update data
UPDATE users
SET last_login = toTimestamp(now())
WHERE organization_id = 123e4567-e89b-12d3-a456-426614174000
AND email = 'user@example.com';

-- Delete data
DELETE FROM users
WHERE organization_id = 123e4567-e89b-12d3-a456-426614174000
AND email = 'user@example.com';

Operations

Node Operations

# Start Cassandra
cassandra -f

# Check cluster status
nodetool status

# Repair node
nodetool repair

# Remove data this node no longer owns (run on existing nodes after adding a node)
nodetool cleanup

Backup and Recovery

# Create snapshot
nodetool snapshot keyspace_name

# Clear snapshots (Cassandra 4.0+ requires either -t <snapshot_name> or --all)
nodetool clearsnapshot --all -- keyspace_name

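# Enable incremental backups (hard links to each newly flushed SSTable)
nodetool enablebackup

# Flush memtables to SSTables so recent writes are on disk
nodetool flush keyspace_name table_name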

Best Practices

Data Modeling

Guidelines:

  • Model around query patterns
  • Denormalize data
  • Keep partitions bounded in size; avoid wide rows that grow without limit
  • Choose appropriate partition keys
  • Consider data distribution
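
One common way to apply these guidelines to time-series data is to add a time bucket to the partition key, so that no single partition grows without bound. The sketch below assumes a hypothetical sensor-readings table keyed by (sensor_id, day); the function name and the one-day bucket size are illustrative choices, not recommendations from the Cassandra project.

```python
from datetime import datetime, timezone


def bucketed_partition_key(sensor_id: str, event_time: datetime) -> tuple:
    """Build a composite partition key of (sensor_id, UTC day).

    event_time is expected to be timezone-aware; it is converted to UTC
    so that every client computes the same bucket for the same instant.
    """
    day = event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (sensor_id, day)


reading_time = datetime(2024, 1, 15, 23, 30, tzinfo=timezone.utc)
print(bucketed_partition_key("sensor-1", reading_time))
```

The trade-off is that a query spanning several days has to read several partitions, so the bucket size should follow the most frequent query window.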

Performance Optimization

Tips:

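  • Choose consistency levels per query, trading latency and availability against consistency
  • Use batches for atomicity, not throughput; keep them within a single partition where possible
  • Monitor partition sizes
  • Match the compaction strategy to the workload (for example, time-window compaction for time-series data)
  • Tune JVM garbage collection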
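
When choosing consistency levels, a useful rule of thumb is that a read is guaranteed to overlap the latest acknowledged write when the replicas contacted for the read (R) and for the write (W) satisfy R + W > RF. The Python sketch below checks that rule for a single datacenter; the function names are assumptions, and only a subset of consistency levels is modeled.

```python
def replicas_required(level: str, rf: int) -> int:
    """Number of replicas that must respond for a consistency level (single DC)."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        needed = fixed[level]
    elif level == "QUORUM":
        needed = rf // 2 + 1
    elif level == "ALL":
        needed = rf
    else:
        raise ValueError(f"unsupported consistency level: {level}")
    if needed > rf:
        raise ValueError(f"{level} needs {needed} replicas but RF is {rf}")
    return needed


def read_overlaps_write(read_level: str, write_level: str, rf: int) -> bool:
    """True when R + W > RF, so every read set intersects every write set."""
    return replicas_required(read_level, rf) + replicas_required(write_level, rf) > rf


for read_level, write_level in [("QUORUM", "QUORUM"), ("ONE", "ONE"), ("ONE", "ALL")]:
    print(read_level, write_level, read_overlaps_write(read_level, write_level, rf=3))
```

Reading and writing at QUORUM is the usual balance: it keeps the overlap guarantee while tolerating one unavailable replica at a replication factor of 3.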

Tools and Resources

Management Tools

  • nodetool - Node management
  • cqlsh - CQL shell
  • JMX monitoring tools (for example, JConsole) - Metrics inspection
  • DataStax DevCenter - Graphical CQL editor
  • Cassandra Reaper - Repair service

Monitoring

# Show cluster name, snitch, partitioner, and schema versions
nodetool describecluster

# View thread pools
nodetool tpstats

# Check compaction stats
nodetool compactionstats

# View table statistics
nodetool tablestats keyspace_name