Deep Learning - Complete Guide

Master neural networks, CNNs, RNNs, transformers, and advanced deep learning architectures

What You'll Learn

This guide works through twelve topics: an introduction to deep learning, artificial neural networks, training deep networks, convolutional neural networks (CNNs), recurrent neural networks (RNNs), transformers and attention mechanisms, advanced architectures, optimization and regularization, transfer learning, model deployment, real-world applications, and best practices.

1. Introduction to Deep Learning

What is Deep Learning?

Deep Learning is a subset of machine learning that uses neural networks with multiple layers (deep neural networks) to learn hierarchical representations of data. Unlike traditional machine learning algorithms that require manual feature engineering, deep learning can automatically discover the representations needed for feature detection or classification from raw data.

Key Characteristics of Deep Learning:

  • Learns hierarchical representations directly from raw data
  • Requires little or no manual feature engineering
  • Scales with large datasets and GPU/TPU compute
  • Performance typically keeps improving as data and model size grow

Deep Learning vs Traditional ML

Aspect                  | Traditional ML                      | Deep Learning
------------------------|-------------------------------------|--------------------------------
Feature Engineering     | Manual feature extraction required  | Automatic feature learning
Data Requirements       | Works well with small datasets      | Requires large amounts of data
Computational Resources | Runs on CPUs                        | Benefits from GPUs/TPUs
Training Time           | Minutes to hours                    | Hours to days/weeks
Interpretability        | Often interpretable                 | Black box (less interpretable)
Performance             | Plateaus with more data             | Improves with more data

Why Deep Learning Now?

Big Data

The explosion of data from the internet, IoT sensors, and social media provides the fuel for deep learning models.

Example: ImageNet dataset with 14M images

Computational Power

Modern GPUs and specialized hardware (TPUs) enable training large neural networks efficiently.

Example: NVIDIA A100 GPU (up to 624 TFLOPS of tensor-core throughput)

Better Algorithms

Improvements in activation functions, optimization, regularization, and architectures.

Example: ReLU, Adam optimizer, Batch Normalization

2. Artificial Neural Networks

The Perceptron: Building Block

A perceptron is the simplest neural network unit, inspired by biological neurons.

Perceptron Formula:
output = activation(Σ(weights × inputs) + bias)
y = f(w₁x₁ + w₂x₂ + ... + wₙxₙ + b)
# Simple Perceptron in NumPy
import numpy as np

class Perceptron:
    def __init__(self, input_size):
        self.weights = np.random.randn(input_size)
        self.bias = np.random.randn()

    def forward(self, x):
        # Linear combination
        z = np.dot(self.weights, x) + self.bias
        # Step function activation
        return 1 if z > 0 else 0

    def train(self, X, y, learning_rate=0.1, epochs=100):
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                prediction = self.forward(xi)
                error = yi - prediction
                # Update weights
                self.weights += learning_rate * error * xi
                self.bias += learning_rate * error

# Example: Learning AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

perceptron = Perceptron(input_size=2)
perceptron.train(X, y)

# Test
for xi in X:
    print(f"Input: {xi}, Output: {perceptron.forward(xi)}")

Multi-Layer Perceptron (MLP)

Multiple layers of perceptrons can learn complex non-linear functions.

# MLP with TensorFlow/Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Build MLP model
model = keras.Sequential([
    # Input layer (implicit)
    layers.Dense(128, activation='relu', input_shape=(784,)),
    layers.Dropout(0.2),
    # Hidden layers
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(32, activation='relu'),
    # Output layer
    layers.Dense(10, activation='softmax')  # 10 classes
])

# Compile model
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Model summary
model.summary()

# Train model
history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=32,
    validation_split=0.2,
    verbose=1
)

# Evaluate
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_acc:.3f}")

Activation Functions

Activation functions introduce non-linearity, enabling networks to learn complex patterns.

Function   | Formula                   | Range         | Use Case                           | Pros/Cons
-----------|---------------------------|---------------|------------------------------------|-------------------------------------------------
ReLU       | f(x) = max(0, x)          | [0, ∞)        | Hidden layers (most common)        | ✓ Fast, simple; ✗ Dying ReLU problem
Leaky ReLU | f(x) = max(0.01x, x)      | (-∞, ∞)       | Hidden layers (fixes dying ReLU)   | ✓ No dead neurons; ✗ Inconsistent predictions
Sigmoid    | f(x) = 1/(1+e⁻ˣ)          | (0, 1)        | Binary classification output       | ✓ Smooth gradient; ✗ Vanishing gradient
Tanh       | f(x) = (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ)  | (-1, 1)       | Hidden layers (RNNs)               | ✓ Zero-centered; ✗ Vanishing gradient
Softmax    | f(xᵢ) = eˣⁱ/Σⱼeˣʲ         | (0, 1), Σ=1   | Multi-class classification output  | ✓ Probability distribution; ✗ Only for output
GELU       | f(x) = x·Φ(x)             | ≈[-0.17, ∞)   | Transformers, modern architectures | ✓ Smooth, non-monotonic; ✗ Slower computation
# Implementing activation functions
import numpy as np

# ReLU
def relu(x):
    return np.maximum(0, x)

# Leaky ReLU
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

# Sigmoid
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Tanh
def tanh(x):
    return np.tanh(x)

# Softmax
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # Numerical stability
    return exp_x / exp_x.sum(axis=0)

# Example usage
x = np.array([-2, -1, 0, 1, 2])
print(f"Input: {x}")
print(f"ReLU: {relu(x)}")
print(f"Sigmoid: {sigmoid(x)}")
print(f"Tanh: {tanh(x)}")

# Softmax example (class probabilities)
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(f"Logits: {logits}")
print(f"Softmax probabilities: {probs}")
print(f"Sum: {probs.sum()}")  # Should be 1.0

Backpropagation

The algorithm for training neural networks by computing gradients efficiently using the chain rule.

Backpropagation Steps:
  1. Forward Pass: Compute output predictions
  2. Compute Loss: Calculate error between prediction and actual
  3. Backward Pass: Compute gradients using chain rule (∂Loss/∂weights)
  4. Update Weights: Adjust weights in opposite direction of gradient
# Simple backpropagation implementation
import numpy as np

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01
        self.b2 = np.zeros((1, output_size))

    def forward(self, X):
        # Layer 1
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = np.maximum(0, self.z1)  # ReLU
        # Layer 2
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        # Sigmoid for binary classification
        self.a2 = 1 / (1 + np.exp(-self.z2))
        return self.a2

    def backward(self, X, y, output, learning_rate=0.01):
        m = X.shape[0]  # Number of samples

        # Output layer gradients
        dz2 = output - y
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m

        # Hidden layer gradients
        da1 = np.dot(dz2, self.W2.T)
        dz1 = da1 * (self.z1 > 0)  # ReLU derivative
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m

        # Update weights
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1

    def train(self, X, y, epochs=1000, learning_rate=0.01):
        for epoch in range(epochs):
            # Forward pass
            output = self.forward(X)
            # Backward pass
            self.backward(X, y, output, learning_rate)
            # Print loss every 100 epochs
            if epoch % 100 == 0:
                loss = -np.mean(y * np.log(output) + (1 - y) * np.log(1 - output))
                print(f"Epoch {epoch}, Loss: {loss:.4f}")

# Example: XOR problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
nn.train(X, y, epochs=5000, learning_rate=0.1)

# Test predictions
predictions = nn.forward(X)
print("\nPredictions:")
for i, (xi, pred) in enumerate(zip(X, predictions)):
    print(f"Input: {xi}, Predicted: {pred[0]:.4f}, Actual: {y[i][0]}")

3. Training Deep Networks

Loss Functions

Loss Function             | Use Case                         | Formula
--------------------------|----------------------------------|-----------------------------------
Binary Cross-Entropy      | Binary classification            | -[y·log(ŷ) + (1-y)·log(1-ŷ)]
Categorical Cross-Entropy | Multi-class classification       | -Σ yᵢ·log(ŷᵢ)
Mean Squared Error (MSE)  | Regression                       | Σ(y - ŷ)² / n
Mean Absolute Error (MAE) | Regression (robust to outliers)  | Σ |y - ŷ| / n
Huber Loss                | Regression (robust)              | Smooth combination of MSE and MAE
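
To make the table concrete, here is a small NumPy sketch of these losses written for this guide; the epsilon clipping and the Huber delta are illustrative choices:

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true is one-hot, y_pred holds class probabilities per row
    return -np.mean(np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)), axis=-1))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    err = y_true - y_pred
    quadratic = 0.5 * err ** 2                    # MSE-like near zero
    linear = delta * (np.abs(err) - 0.5 * delta)  # MAE-like in the tails
    return np.mean(np.where(np.abs(err) <= delta, quadratic, linear))

# Quick check
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
print(binary_cross_entropy(y_true, y_pred), mse(y_true, y_pred), huber(y_true, y_pred))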

Optimization Algorithms

Stochastic Gradient Descent (SGD)

Updates weights using gradient of single sample or mini-batch.

# SGD with momentum
optimizer = keras.optimizers.SGD(
    learning_rate=0.01,
    momentum=0.9,
    nesterov=True
)

Adam (Adaptive Moment Estimation)

Combines momentum and adaptive learning rates. Most popular optimizer.

# Adam optimizer
optimizer = keras.optimizers.Adam(
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999
)

RMSprop

Adapts learning rate for each parameter based on recent gradient magnitudes.

# RMSprop optimizer
optimizer = keras.optimizers.RMSprop(
    learning_rate=0.001,
    rho=0.9
)

AdaGrad

Adapts learning rate based on historical gradients. Good for sparse data.

# AdaGrad optimizer
optimizer = keras.optimizers.Adagrad(
    learning_rate=0.01
)

Learning Rate Scheduling

# Learning rate schedules in Keras

# 1. Step Decay
def step_decay(epoch):
    initial_lr = 0.1
    drop = 0.5
    epochs_drop = 10.0
    lr = initial_lr * (drop ** np.floor((1 + epoch) / epochs_drop))
    return lr

lr_scheduler = keras.callbacks.LearningRateScheduler(step_decay)

# 2. Exponential Decay
lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=1000,
    decay_rate=0.96
)

# 3. Cosine Decay
lr_schedule = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1,
    decay_steps=1000
)

# 4. Reduce on Plateau
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.2,
    patience=5,
    min_lr=0.00001
)

# Use in training
model.fit(
    X_train, y_train,
    epochs=50,
    callbacks=[reduce_lr]
)

Regularization Techniques

L1/L2 Regularization

Add penalty term to loss function to prevent large weights.

from tensorflow.keras import regularizers

model = keras.Sequential([
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(0.01),
                 bias_regularizer=regularizers.l1(0.01))
])

Dropout

Randomly drop neurons during training to prevent co-adaptation.

model = keras.Sequential([
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),  # Drop 50% of neurons
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax')
])

Batch Normalization

Normalize activations to reduce internal covariate shift.

model = keras.Sequential([
    layers.Dense(128),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dense(64),
    layers.BatchNormalization(),
    layers.Activation('relu')
])

Early Stopping

Stop training when validation performance stops improving.

early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          callbacks=[early_stop])

4. Convolutional Neural Networks (CNNs)

What are CNNs?

CNNs are specialized neural networks for processing grid-like data (images, video, time series). They use convolution operations to automatically learn spatial hierarchies of features.

Key CNN Layers

Convolutional Layer

Applies filters/kernels to extract features like edges, textures, patterns.

Parameters: filters, kernel_size, stride, padding

Pooling Layer

Downsamples feature maps to reduce dimensionality and computation.

Types: MaxPooling (most common), AveragePooling
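
To see how these two layer types transform tensor shapes, here is a minimal sketch; the filter count, kernel size, and 32x32 input are illustrative and not tied to any particular model:

import tensorflow as tf
from tensorflow.keras import layers

# One conv + pool stage applied to a batch containing a single 32x32 RGB image
x = tf.random.normal((1, 32, 32, 3))

conv = layers.Conv2D(filters=16, kernel_size=(3, 3), strides=1, padding='same', activation='relu')
pool = layers.MaxPooling2D(pool_size=(2, 2))

feature_maps = conv(x)       # (1, 32, 32, 16): 'same' padding keeps the spatial size
pooled = pool(feature_maps)  # (1, 16, 16, 16): pooling halves height and width

print(feature_maps.shape, pooled.shape)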

CNN Architecture Example

# CNN for Image Classification (CIFAR-10)
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    # First Convolutional Block
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3), padding='same'),
    layers.BatchNormalization(),
    layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.2),

    # Second Convolutional Block
    layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
    layers.BatchNormalization(),
    layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.3),

    # Third Convolutional Block
    layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
    layers.BatchNormalization(),
    layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.4),

    # Fully Connected Layers
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')  # 10 classes
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

model.summary()

# Data augmentation
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    zoom_range=0.1
)

# Train (early_stop and reduce_lr are the callbacks defined in Section 3)
history = model.fit(
    datagen.flow(X_train, y_train, batch_size=32),
    epochs=50,
    validation_data=(X_test, y_test),
    callbacks=[early_stop, reduce_lr]
)

Popular CNN Architectures

LeNet-5 (1998)

Pioneer: First successful CNN for digit recognition

Layers: 7 layers (Conv-Pool-Conv-Pool-FC-FC-Output)

Use: Digit recognition (MNIST)

AlexNet (2012)

Breakthrough: Won ImageNet 2012, popularized deep learning

Innovation: ReLU, Dropout, GPU training

Params: 60M parameters, 8 layers

VGG-16/19 (2014)

Key Idea: Deeper is better (16-19 layers)

Architecture: Small 3x3 filters throughout

Params: 138M (VGG-16)

ResNet (2015)

Innovation: Skip connections solve vanishing gradient

Depth: 50, 101, 152 layers possible

Impact: Enabled very deep networks

Inception (GoogLeNet, 2014)

Key Idea: Multiple filter sizes in parallel

Efficiency: Fewer params than VGG

Versions: v1, v2, v3, v4, Xception

EfficientNet (2019)

Innovation: Compound scaling (depth, width, resolution)

Efficiency: Best accuracy/parameters trade-off

Variants: B0-B7

Using Pre-trained CNNs

# Transfer Learning with ResNet50
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

# Load pre-trained model (without top classification layer)
base_model = ResNet50(
    weights='imagenet',
    include_top=False,
    input_shape=(224, 224, 3)
)

# Freeze base model layers
base_model.trainable = False

# Add custom classification head
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation='softmax')
])

model.compile(
    optimizer=keras.optimizers.Adam(1e-4),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Train only the new layers
model.fit(X_train, y_train, epochs=10)

# Fine-tuning: Unfreeze some base layers
base_model.trainable = True

# Freeze early layers, train later layers
for layer in base_model.layers[:100]:
    layer.trainable = False

# Recompile with lower learning rate
model.compile(
    optimizer=keras.optimizers.Adam(1e-5),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Continue training
model.fit(X_train, y_train, epochs=10)

CNN Applications

Image classification, object detection, semantic segmentation, facial recognition, and medical image analysis (see Section 11 for concrete examples).

5. Recurrent Neural Networks (RNNs)

What are RNNs?

RNNs are neural networks designed for sequential data (time series, text, audio). They have loops that allow information to persist, maintaining a "memory" of previous inputs.

RNN Characteristics:

  • A hidden state carries information ("memory") from earlier timesteps
  • The same weights are shared across all timesteps
  • Inputs and outputs can be variable-length sequences
  • Processing is inherently sequential, which limits parallelism

Simple RNN Architecture

# Simple RNN for sequence classification
model = keras.Sequential([
    layers.SimpleRNN(64, return_sequences=True, input_shape=(timesteps, features)),
    layers.SimpleRNN(32),
    layers.Dense(10, activation='softmax')
])

# Example: Text classification
# Input shape: (batch_size, sequence_length, embedding_dim)
model = keras.Sequential([
    layers.Embedding(vocab_size, 128, input_length=max_length),
    layers.SimpleRNN(64, return_sequences=False),
    layers.Dense(1, activation='sigmoid')  # Binary classification
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model.fit(X_train, y_train, epochs=10, validation_split=0.2)

Long Short-Term Memory (LSTM)

LSTMs solve the vanishing gradient problem in RNNs using gates to control information flow.

LSTM Gates:

  • Forget gate: decides which parts of the cell state to discard
  • Input gate: decides which new information to write to the cell state
  • Output gate: decides what to expose from the cell state as the hidden state

# LSTM for sentiment analysis
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence

# Load IMDB dataset
max_features = 10000
maxlen = 200
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)

# Pad sequences to same length
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)

# Build LSTM model
model = keras.Sequential([
    layers.Embedding(max_features, 128, input_length=maxlen),
    layers.LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
    layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2),
    layers.Dense(1, activation='sigmoid')
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Train
history = model.fit(
    X_train, y_train,
    batch_size=32,
    epochs=10,
    validation_data=(X_test, y_test)
)

# Evaluate
score = model.evaluate(X_test, y_test)
print(f"Test accuracy: {score[1]:.3f}")

Gated Recurrent Unit (GRU)

Simplified version of LSTM with fewer parameters. Often performs similarly to LSTM.

# GRU for time series prediction
model = keras.Sequential([
    layers.GRU(64, return_sequences=True, input_shape=(timesteps, features)),
    layers.Dropout(0.2),
    layers.GRU(32),
    layers.Dropout(0.2),
    layers.Dense(1)  # Regression output
])

model.compile(optimizer='adam', loss='mse')

# Example: Stock price prediction
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)

Bidirectional RNNs

Process sequences in both forward and backward directions for better context understanding.

# Bidirectional LSTM
model = keras.Sequential([
    layers.Embedding(vocab_size, 128),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(num_classes, activation='softmax')
])

# Useful for:
# - Named Entity Recognition
# - Part-of-speech tagging
# - Any task where future context helps understand current position

Sequence-to-Sequence Models

Encoder-Decoder architecture for tasks like machine translation.

# Seq2Seq model for machine translation

# Encoder
encoder_inputs = layers.Input(shape=(None, num_encoder_tokens))
encoder_lstm = layers.LSTM(256, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder
decoder_inputs = layers.Input(shape=(None, num_decoder_tokens))
decoder_lstm = layers.LSTM(256, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = layers.Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Full model
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

# Train
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=64,
          epochs=100,
          validation_split=0.2)

RNN Applications

Time-series forecasting, language modeling, machine translation, speech recognition, and sentiment analysis (see Section 11 for concrete examples).

6. Transformers and Attention Mechanisms

What are Transformers?

Transformers are the architecture behind modern LLMs (GPT, BERT, T5). They use self-attention mechanisms to process sequences in parallel, unlike RNNs that process sequentially.

Self-Attention Mechanism

Attention allows the model to focus on relevant parts of the input when processing each element.

Attention Formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Q = Query, K = Key, V = Value
# Simple self-attention implementation
import numpy as np

def self_attention(X):
    # X shape: (sequence_length, d_model)
    d_k = X.shape[1]

    # In practice, these are learned linear projections
    Q = X  # Query
    K = X  # Key
    V = X  # Value

    # Compute attention scores
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)

    # Apply softmax
    attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)

    # Compute weighted sum of values
    output = np.matmul(attention_weights, V)
    return output, attention_weights

# Example
sequence_length = 5
d_model = 8
X = np.random.randn(sequence_length, d_model)
output, weights = self_attention(X)
print("Attention weights shape:", weights.shape)
print("Output shape:", output.shape)

Transformer Architecture

# Multi-Head Self-Attention layer
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class MultiHeadAttention(layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        assert d_model % num_heads == 0
        self.depth = d_model // num_heads
        self.wq = layers.Dense(d_model)
        self.wk = layers.Dense(d_model)
        self.wv = layers.Dense(d_model)
        self.dense = layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask=None):
        batch_size = tf.shape(q)[0]

        # Linear projections
        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)

        # Split into multiple heads
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # Scaled dot-product attention
        matmul_qk = tf.matmul(q, k, transpose_b=True)
        dk = tf.cast(tf.shape(k)[-1], tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

        if mask is not None:
            scaled_attention_logits += (mask * -1e9)

        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
        output = tf.matmul(attention_weights, v)

        output = tf.transpose(output, perm=[0, 2, 1, 3])
        output = tf.reshape(output, (batch_size, -1, self.d_model))
        output = self.dense(output)
        return output, attention_weights


# Transformer Encoder Block
class TransformerBlock(layers.Layer):
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        super(TransformerBlock, self).__init__()
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = keras.Sequential([
            layers.Dense(dff, activation='relu'),
            layers.Dense(d_model)
        ])
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(dropout_rate)
        self.dropout2 = layers.Dropout(dropout_rate)

    def call(self, x, training=False, mask=None):
        # Multi-head attention with residual connection and layer norm
        attn_output, _ = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)

        # Feed-forward network with residual connection and layer norm
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)
        return out2


# Standard sinusoidal positional encoding (defined here so the model below is self-contained)
def positional_encoding(max_length, d_model):
    pos = np.arange(max_length)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angle_rads = pos / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    return tf.cast(angle_rads[np.newaxis, ...], dtype=tf.float32)


# Build Transformer model
def create_transformer_model(vocab_size, num_classes, d_model=128, num_heads=8,
                             dff=512, num_layers=4, max_length=100):
    inputs = layers.Input(shape=(max_length,))

    # Embedding + Positional Encoding
    x = layers.Embedding(vocab_size, d_model)(inputs)
    x *= tf.math.sqrt(tf.cast(d_model, tf.float32))
    pos_encoding = positional_encoding(max_length, d_model)
    x += pos_encoding[:, :max_length, :]

    # Transformer blocks
    for _ in range(num_layers):
        x = TransformerBlock(d_model, num_heads, dff)(x)

    # Global average pooling over the sequence dimension
    x = layers.GlobalAveragePooling1D()(x)

    # Classification head
    x = layers.Dropout(0.1)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return keras.Model(inputs=inputs, outputs=outputs)

Popular Transformer Models

BERT

Type: Bidirectional Encoder

Training: Masked Language Modeling

Use: Text classification, QA, NER

Variants: RoBERTa, ALBERT, DistilBERT

GPT

Type: Autoregressive Decoder

Training: Next token prediction

Use: Text generation, completion

Versions: GPT-2, GPT-3, GPT-4

T5

Type: Encoder-Decoder

Approach: Text-to-Text framework

Use: Translation, summarization, QA

Sizes: Small, Base, Large, XL, XXL

Using Pre-trained Transformers

# Using Hugging Face Transformers
from transformers import BertTokenizer, TFBertForSequenceClassification
import tensorflow as tf
import numpy as np

# Load pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)

# Prepare data
texts = ["This movie is great!", "This movie is terrible."]
labels = np.array([1, 0])  # positive, negative

# Tokenize
encodings = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors='tf'
)

# Train
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

model.fit(
    encodings['input_ids'], labels,
    epochs=3,
    batch_size=8
)

# Inference
new_texts = ["I loved this movie!"]
new_encodings = tokenizer(
    new_texts,
    padding=True,
    truncation=True,
    return_tensors='tf'
)
predictions = model(new_encodings['input_ids'])
predicted_class = tf.argmax(predictions.logits, axis=1)
print(f"Predicted sentiment: {predicted_class.numpy()}")

7. Advanced Architectures

Autoencoders

Unsupervised learning models that learn compressed representations of data.

# Autoencoder for dimensionality reduction

# Encoder
encoder_input = layers.Input(shape=(784,))
encoded = layers.Dense(128, activation='relu')(encoder_input)
encoded = layers.Dense(64, activation='relu')(encoded)
encoded = layers.Dense(32, activation='relu')(encoded)

# Decoder
decoded = layers.Dense(64, activation='relu')(encoded)
decoded = layers.Dense(128, activation='relu')(decoded)
decoded = layers.Dense(784, activation='sigmoid')(decoded)

# Full autoencoder
autoencoder = keras.Model(encoder_input, decoded)

# Encoder model (for extracting features)
encoder = keras.Model(encoder_input, encoded)

# Train
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.fit(X_train, X_train,  # Input = Output
                epochs=50,
                batch_size=256,
                validation_data=(X_test, X_test))

# Use encoder for dimensionality reduction
encoded_imgs = encoder.predict(X_test)

Variational Autoencoders (VAE)

Generative models that learn probability distributions in latent space.
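
As a rough illustration, here is a minimal VAE sketch in the same TF 2.x Keras style as the rest of this guide; the 784-dimensional input, hidden sizes, and 2-D latent space are illustrative choices, and the add_loss pattern assumes the TF 2.x functional API:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 2

# Encoder: outputs the mean and log-variance of the approximate posterior
enc_in = layers.Input(shape=(784,))
h = layers.Dense(128, activation='relu')(enc_in)
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)

# Reparameterization trick: z = mean + sigma * epsilon, with epsilon ~ N(0, I)
def sample_z(args):
    z_mean, z_log_var = args
    eps = tf.random.normal(shape=tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

z = layers.Lambda(sample_z)([z_mean, z_log_var])

# Decoder: maps a latent vector back to input space
h_dec = layers.Dense(128, activation='relu')(z)
dec_out = layers.Dense(784, activation='sigmoid')(h_dec)

vae = keras.Model(enc_in, dec_out)

# Loss = reconstruction error + KL divergence to a standard normal prior
recon_loss = 784 * tf.reduce_mean(keras.losses.binary_crossentropy(enc_in, dec_out))
kl_loss = -0.5 * tf.reduce_mean(
    tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1)
)
vae.add_loss(recon_loss + kl_loss)
vae.compile(optimizer='adam')

# Train on inputs only (the model reconstructs its own input)
# vae.fit(X_train, epochs=30, batch_size=128)  # X_train: flattened images scaled to [0, 1]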

Generative Adversarial Networks (GANs)

Two networks compete: Generator creates fake data, Discriminator tries to detect fakes.

# Simple GAN for generating images

# Generator
def build_generator(latent_dim):
    model = keras.Sequential([
        layers.Dense(256, input_dim=latent_dim),
        layers.LeakyReLU(alpha=0.2),
        layers.BatchNormalization(momentum=0.8),
        layers.Dense(512),
        layers.LeakyReLU(alpha=0.2),
        layers.BatchNormalization(momentum=0.8),
        layers.Dense(1024),
        layers.LeakyReLU(alpha=0.2),
        layers.BatchNormalization(momentum=0.8),
        layers.Dense(784, activation='tanh'),
        layers.Reshape((28, 28, 1))
    ])
    return model

# Discriminator
def build_discriminator():
    model = keras.Sequential([
        layers.Flatten(input_shape=(28, 28, 1)),
        layers.Dense(512),
        layers.LeakyReLU(alpha=0.2),
        layers.Dropout(0.3),
        layers.Dense(256),
        layers.LeakyReLU(alpha=0.2),
        layers.Dropout(0.3),
        layers.Dense(1, activation='sigmoid')
    ])
    return model

# Build and compile
latent_dim = 100
generator = build_generator(latent_dim)
discriminator = build_discriminator()

discriminator.compile(
    loss='binary_crossentropy',
    optimizer=keras.optimizers.Adam(0.0002, 0.5),
    metrics=['accuracy']
)

# Combined model (Generator + Discriminator)
discriminator.trainable = False
gan_input = layers.Input(shape=(latent_dim,))
generated_img = generator(gan_input)
gan_output = discriminator(generated_img)

gan = keras.Model(gan_input, gan_output)
gan.compile(
    loss='binary_crossentropy',
    optimizer=keras.optimizers.Adam(0.0002, 0.5)
)

# Training loop
def train_gan(epochs, batch_size=128):
    for epoch in range(epochs):
        # Train Discriminator
        idx = np.random.randint(0, X_train.shape[0], batch_size)
        real_imgs = X_train[idx]
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        fake_imgs = generator.predict(noise)

        d_loss_real = discriminator.train_on_batch(real_imgs, np.ones((batch_size, 1)))
        d_loss_fake = discriminator.train_on_batch(fake_imgs, np.zeros((batch_size, 1)))

        # Train Generator
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        g_loss = gan.train_on_batch(noise, np.ones((batch_size, 1)))

        if epoch % 100 == 0:
            print(f"Epoch {epoch}, D Loss: {d_loss_real[0]}, G Loss: {g_loss}")

Graph Neural Networks (GNNs)

Neural networks for graph-structured data (social networks, molecules, knowledge graphs).
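
As a rough sketch of the core idea, here is a single NumPy graph-convolution layer (GCN-style); the toy graph, node features, and weight matrix are illustrative stand-ins for learned parameters:

import numpy as np

def gcn_layer(A, H, W):
    # Add self-loops, then symmetrically normalize the adjacency matrix
    A_hat = A + np.eye(A.shape[0])
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
    # Aggregate neighbor features, project with W, apply ReLU
    return np.maximum(0, A_norm @ H @ W)

# Toy graph: 4 nodes, 3 features per node
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)  # adjacency matrix
H = np.random.randn(4, 3)                  # node feature matrix
W = np.random.randn(3, 8)                  # weight matrix (random here, learned in practice)

H_next = gcn_layer(A, H, W)
print(H_next.shape)  # (4, 8): each node now has an 8-dimensional representation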

Neural Architecture Search (NAS)

Automatically discover optimal neural network architectures.

8. Optimization and Regularization

Common Training Challenges

Vanishing Gradient Problem:

Gradients become extremely small in deep networks, preventing early layers from learning.

Solutions:

  • Use ReLU-family activations instead of sigmoid/tanh in hidden layers
  • Batch Normalization to keep activations well-scaled
  • Residual (skip) connections, as in ResNet
  • Careful weight initialization (e.g., He/Xavier)
  • Gated cells (LSTM/GRU) for recurrent networks
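
Two of these fixes in Keras, as a brief sketch (the layer sizes are arbitrary):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # He initialization is matched to ReLU and keeps gradient variance stable with depth
    layers.Dense(256, kernel_initializer='he_normal', input_shape=(784,)),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dense(128, kernel_initializer='he_normal'),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dense(10, activation='softmax')
])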

Exploding Gradient Problem:

Gradients become extremely large, causing unstable training.

Solutions:

  • Gradient clipping (by norm or by value)
  • Lower learning rates
  • L1/L2 weight regularization
  • Batch Normalization
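
Gradient clipping in particular is a one-argument change on any Keras optimizer; a short sketch with illustrative thresholds:

from tensorflow import keras

# Clip the gradient norm of each weight tensor to 1.0
optimizer = keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

# Or clip each gradient element to the range [-0.5, 0.5]
optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, clipvalue=0.5)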

Practical Training Tips

Training Best Practices:

  • Start from a simple baseline model and scale up only when needed
  • Monitor validation loss and use callbacks (checkpointing, early stopping, LR scheduling)
  • Augment training data where possible
  • Handle class imbalance (e.g., with class weights)
  • Log experiments (e.g., with TensorBoard) so runs are reproducible and comparable

# Advanced training setup
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard, ReduceLROnPlateau

# Callbacks
callbacks = [
    # Save best model
    ModelCheckpoint(
        'best_model.h5',
        monitor='val_loss',
        save_best_only=True
    ),
    # Reduce learning rate on plateau
    ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=5,
        min_lr=1e-7
    ),
    # Early stopping
    keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=10,
        restore_best_weights=True
    ),
    # TensorBoard logging
    TensorBoard(
        log_dir='./logs',
        histogram_freq=1
    )
]

# Data augmentation
datagen = keras.preprocessing.image.ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    zoom_range=0.2,
    fill_mode='nearest'
)

# Train with all optimizations
history = model.fit(
    datagen.flow(X_train, y_train, batch_size=32),
    steps_per_epoch=len(X_train) // 32,
    epochs=100,
    validation_data=(X_val, y_val),
    callbacks=callbacks,
    class_weight=class_weights  # Handle imbalanced data
)

9. Transfer Learning

What is Transfer Learning?

Leverage knowledge from pre-trained models to solve new tasks with less data and training time.

Transfer Learning Strategies

Feature Extraction

Freeze pre-trained model, add new classifier on top.

When: Small dataset, similar task

Fast: Only train new layers

Fine-Tuning

Unfreeze some layers and train with low learning rate.

When: Medium dataset, related task

Better performance: Adapt to new domain

# Transfer Learning workflow

# 1. Load pre-trained model
base_model = keras.applications.MobileNetV2(
    input_shape=(224, 224, 3),
    include_top=False,
    weights='imagenet'
)

# 2. Freeze base model
base_model.trainable = False

# 3. Add custom head
model = keras.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation='softmax')
])

# 4. Train top layers
model.compile(
    optimizer=keras.optimizers.Adam(1e-3),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
model.fit(train_dataset, epochs=5, validation_data=val_dataset)

# 5. Fine-tune: Unfreeze and train with low LR
base_model.trainable = True

# Freeze early layers
for layer in base_model.layers[:100]:
    layer.trainable = False

model.compile(
    optimizer=keras.optimizers.Adam(1e-5),  # Lower LR
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
model.fit(train_dataset, epochs=10, validation_data=val_dataset)

Domain Adaptation

Adapt models trained on one domain (e.g., synthetic images) to another (e.g., real images).

10. Model Deployment

Model Optimization for Production

Quantization

Reduce precision (float32 → int8) for faster inference.

Benefit: 4x smaller model, 2-4x faster

Pruning

Remove unnecessary weights/neurons.

Benefit: Smaller model, faster inference

Knowledge Distillation

Train small model to mimic large model.

Benefit: Maintain performance, reduce size (a loss sketch follows the TFLite example below)

Model Compression

Combine multiple techniques for maximum efficiency.

Benefit: Deploy on edge devices

# TensorFlow Lite conversion (mobile deployment)
import tensorflow as tf

# Convert to TFLite
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Apply optimizations
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Quantization
# representative_dataset: a generator yielding sample input batches for calibration
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

# Convert
tflite_model = converter.convert()

# Save
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)

# Deploy on mobile/edge devices
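
Quantization is covered by the TFLite example above; for knowledge distillation, here is a minimal sketch of the combined soft-target/hard-target loss (the helper name, temperature, and alpha weighting are illustrative):

import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, labels, temperature=3.0, alpha=0.1):
    # Soft targets: match the teacher's softened probability distribution
    soft_targets = tf.nn.softmax(teacher_logits / temperature)
    soft_preds = tf.nn.softmax(student_logits / temperature)
    soft_loss = tf.reduce_mean(
        tf.keras.losses.kullback_leibler_divergence(soft_targets, soft_preds)
    ) * (temperature ** 2)

    # Hard targets: ordinary cross-entropy against the true integer labels
    hard_loss = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(labels, student_logits, from_logits=True)
    )
    return alpha * hard_loss + (1 - alpha) * soft_loss

# Usage inside a custom training step: compute teacher_logits with the frozen large model,
# student_logits with the small model, and minimize distillation_loss w.r.t. the student only.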

Deployment Platforms

11. Real-World Applications

Computer Vision

  • Self-driving cars
  • Medical image diagnosis
  • Facial recognition
  • Object detection
  • Image generation (DALL-E)

Natural Language Processing

  • Chatbots (ChatGPT)
  • Machine translation
  • Sentiment analysis
  • Text summarization
  • Question answering

Speech and Audio

  • Speech recognition (Siri, Alexa)
  • Text-to-speech
  • Music generation
  • Audio classification
  • Voice cloning

Healthcare

  • Disease diagnosis
  • Drug discovery
  • Protein folding (AlphaFold)
  • Patient monitoring
  • Personalized treatment

Finance

  • Fraud detection
  • Algorithmic trading
  • Credit scoring
  • Risk assessment
  • Market prediction

Recommendation Systems

  • Netflix, YouTube recommendations
  • E-commerce product suggestions
  • Music recommendations (Spotify)
  • News feed personalization
  • Ad targeting

12. Best Practices and Tips

Getting Started:
Common Mistakes to Avoid:

Learning Resources