Apache Cassandra
Introduction
Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle large amounts of data across multiple commodity servers, providing high availability with no single point of failure.
Key Features:
- Distributed architecture
- Linear scalability
- High availability
- Tunable consistency
- Multi-datacenter replication
- Fault tolerance
Architecture
Distributed Design
Components:
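- Node: A single Cassandra instance; the basic unit of storage
- Ring: Logical arrangement of nodes, each owning ranges of the token space
- Datacenter: Group of nodes that are replicated together, typically one site or region
- Cluster: One or more datacenters that together hold the full dataset
- Partition: Unit of data distribution and replication; all rows that share a partition key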
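How a partition lands on nodes in the ring can be sketched roughly as follows. This is a simplified model, not Cassandra's implementation: it assumes one token per node (no virtual nodes), ignores racks and datacenters, and uses MD5 as a stand-in for the Murmur3 partitioner. The node names and the `Ring` class are hypothetical.

```python
import bisect
import hashlib


def token(partition_key):
    """Map a partition key to an unsigned 64-bit token (MD5 stand-in, not Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Toy token ring: each node owns the range ending at its own token."""

    def __init__(self, node_tokens):
        # Sort nodes by the token each one owns.
        self._ring = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._ring]

    def replicas(self, partition_key, replication_factor):
        """Walk clockwise from the key's token and collect distinct nodes."""
        start = bisect.bisect_left(self._tokens, token(partition_key))
        count = min(replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


# Three hypothetical nodes, evenly spaced over the 64-bit token space.
ring = Ring({"node-a": 0, "node-b": 2**64 // 3, "node-c": 2 * (2**64 // 3)})
owners = ring.replicas("user@example.com", replication_factor=2)
```

The same key always maps to the same nodes, which is what lets any coordinator route a request without a central lookup.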
Data Distribution
-- Replication strategy example
CREATE KEYSPACE example_keyspace
WITH replication = {
'class': 'NetworkTopologyStrategy',
'DC1': 3,
'DC2': 2
};
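The replication factors above determine how many replicas a quorum needs. The arithmetic can be sketched as below; the function names are illustrative, and the overlap check is the usual rule of thumb (read replicas plus write replicas must exceed the replication factor) rather than a full model of consistency.

```python
def quorum(replication_factor):
    """Replicas needed for a majority: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


# Replication settings from example_keyspace.
replication = {"DC1": 3, "DC2": 2}

total_rf = sum(replication.values())        # 5 replicas across the cluster
cluster_quorum = quorum(total_rf)           # QUORUM needs 3 replicas
local_quorums = {dc: quorum(rf) for dc, rf in replication.items()}
# LOCAL_QUORUM needs 2 replicas in DC1 and 2 in DC2


def reads_see_latest_write(read_replicas, write_replicas, rf):
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > rf
```

With a replication factor of 2 in DC2, LOCAL_QUORUM there needs both replicas, so a single node outage in DC2 blocks LOCAL_QUORUM requests in that datacenter.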
Data Model
Table Design
-- Create table with partition and clustering keys
CREATE TABLE users (
organization_id uuid,
email text,
username text,
created_at timestamp,
last_login timestamp,
PRIMARY KEY ((organization_id), email)
) WITH CLUSTERING ORDER BY (email ASC);
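-- Materialized view (experimental; disabled by default since Cassandra 4.0)
-- Every primary key column of the base table must appear in the view's primary key
CREATE MATERIALIZED VIEW users_by_username AS
SELECT * FROM users
WHERE username IS NOT NULL AND organization_id IS NOT NULL AND email IS NOT NULL
PRIMARY KEY ((username), organization_id, email);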
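To make the roles of the two key parts concrete, here is a minimal in-memory sketch of the users table layout: the partition key (organization_id) groups rows, and the clustering key (email) orders them inside the partition. It is only an analogy for the on-disk layout; the helper names are hypothetical.

```python
from collections import defaultdict

# partition key (organization_id) -> {clustering key (email) -> row columns}
partitions = defaultdict(dict)


def write(organization_id, email, **columns):
    """Store columns for the row identified by (organization_id, email)."""
    partitions[organization_id].setdefault(email, {}).update(columns)


def read_partition(organization_id):
    """Return the rows of one partition in clustering order (email ASC)."""
    rows = partitions.get(organization_id, {})
    return [(email, rows[email]) for email in sorted(rows)]


write("org-1", "zoe@example.com", username="zoe")
write("org-1", "adam@example.com", username="adam")
write("org-2", "mia@example.com", username="mia")

org1_rows = read_partition("org-1")
```

Reading one organization touches a single partition and returns rows already sorted by email, which is why queries on this table are cheapest when they restrict organization_id.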
Data Types
Available Types:
- Basic: text, int, boolean, timestamp
- Collections: set, list, map
- Custom: user-defined types (UDT)
- Counter: distributed counter
CQL Basics
Basic Operations
-- Insert data (fixed UUID so the statements below address the same row)
INSERT INTO users (organization_id, email, username)
VALUES (123e4567-e89b-12d3-a456-426614174000, 'user@example.com', 'johndoe');
-- Query data
SELECT * FROM users
WHERE organization_id = 123e4567-e89b-12d3-a456-426614174000
AND email = 'user@example.com';
-- Update data
UPDATE users
SET last_login = toTimestamp(now())
WHERE organization_id = 123e4567-e89b-12d3-a456-426614174000
AND email = 'user@example.com';
-- Delete data
DELETE FROM users
WHERE organization_id = 123e4567-e89b-12d3-a456-426614174000
AND email = 'user@example.com';
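In CQL, INSERT and UPDATE are both upserts: each writes the named columns for a primary key and creates the row if it does not exist. The toy model below illustrates that behaviour under simplifying assumptions (no write timestamps, tombstones, or TTLs); the `upsert` and `delete` helpers are hypothetical.

```python
table = {}  # full primary key (organization_id, email) -> column values


def upsert(organization_id, email, **columns):
    """Model of INSERT and UPDATE: write the given columns, creating the row if needed."""
    table.setdefault((organization_id, email), {}).update(columns)


def delete(organization_id, email):
    """Model of DELETE: removing a missing row is not an error."""
    table.pop((organization_id, email), None)


ORG = "123e4567-e89b-12d3-a456-426614174000"

upsert(ORG, "user@example.com", username="johndoe")        # like the INSERT above
upsert(ORG, "user@example.com", last_login="2024-01-01")   # like the UPDATE; username is kept
```

Because neither statement checks for an existing row, a second INSERT with the same primary key silently overwrites the first.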
Operations
Node Operations
# Start Cassandra
cassandra -f
# Check cluster status
nodetool status
# Repair node
nodetool repair
# Cleanup after adding a node (drops data this node no longer owns)
nodetool cleanup
Backup and Recovery
# Create snapshot
nodetool snapshot keyspace_name
# Clear a named snapshot (snapshot_name is a placeholder)
nodetool clearsnapshot -t snapshot_name keyspace_name
# Incremental backup: enable it, then flush so new SSTables are hard-linked into backups/
nodetool enablebackup
nodetool flush keyspace_name table_name
Best Practices
Data Modeling
Guidelines:
- Model around query patterns
- Denormalize data
- Keep partitions bounded; use wide partitions sparingly
- Choose appropriate partition keys
- Consider data distribution
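One common way to apply the partition key and distribution guidelines is time bucketing: adding a coarse time component to the partition key so that no single partition grows without bound. The sketch below assumes a hypothetical time-series table keyed by an entity id and a UTC day; both the function name and the daily bucket size are illustrative choices.

```python
from datetime import datetime, timezone


def bucketed_partition_key(entity_id, event_time):
    """Build a composite partition key (entity_id, UTC day) for a timezone-aware timestamp."""
    day = event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (entity_id, day)


late = datetime(2024, 1, 1, 23, 59, tzinfo=timezone.utc)
early = datetime(2024, 1, 2, 0, 1, tzinfo=timezone.utc)

key_late = bucketed_partition_key("sensor-42", late)
key_early = bucketed_partition_key("sensor-42", early)
```

Events two minutes apart fall into different partitions here, so a query spanning midnight has to read both buckets. That is the trade-off for bounded partition sizes.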
Performance Optimization
Tips:
- Use appropriate consistency levels
- Batch operations wisely
- Monitor partition sizes
- Use proper compaction strategy
- Configure garbage collection
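For the batching tip, a frequent approach is to group writes by partition key so that each batch targets a single partition. Multi-partition batches put extra coordination work on the coordinator node. The helper below is a sketch of that grouping step only; it does not talk to Cassandra, and the row dictionaries are made-up sample data.

```python
from collections import defaultdict


def group_by_partition(rows, partition_key="organization_id"):
    """Group row dicts by partition key so each group can be sent as one batch."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_key]].append(row)
    return dict(groups)


pending_writes = [
    {"organization_id": "org-1", "email": "a@example.com"},
    {"organization_id": "org-2", "email": "b@example.com"},
    {"organization_id": "org-1", "email": "c@example.com"},
]

batches = group_by_partition(pending_writes)
```

Each value in `batches` would then become one batch statement, with unrelated partitions written as separate requests.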
Tools and Resources
Management Tools
- nodetool - Node management
- cqlsh - CQL shell
- JMX monitoring tools - Metrics exposed over JMX
- DataStax DevCenter - Graphical CQL editor
- Cassandra Reaper - Repair service
Monitoring
# Check cluster health
nodetool describecluster
# View thread pools
nodetool tpstats
# Check compaction stats
nodetool compactionstats
# View table statistics
nodetool tablestats keyspace_name