Deep Learning - Complete Guide

Master neural networks, CNNs, RNNs, transformers, and advanced deep learning architectures

What You'll Learn

This guide works through twelve topics: an introduction to deep learning, artificial neural networks, training deep networks, convolutional neural networks (CNNs), recurrent neural networks (RNNs), transformers and attention mechanisms, advanced architectures, optimization and regularization, transfer learning, model deployment, real-world applications, and best practices.

1. Introduction to Deep Learning

What is Deep Learning?

Deep Learning is a subset of machine learning that uses neural networks with multiple layers (deep neural networks) to learn hierarchical representations of data. Unlike traditional machine learning algorithms that require manual feature engineering, deep learning can automatically discover the representations needed for feature detection or classification from raw data.

Key Characteristics of Deep Learning:

  • Learns hierarchical representations directly from raw data
  • Requires little or no manual feature engineering
  • Scales with large datasets and GPU/TPU compute
  • Performance typically keeps improving as data and model size grow

Deep Learning vs Traditional ML

Aspect                  | Traditional ML                      | Deep Learning
------------------------|-------------------------------------|--------------------------------
Feature Engineering     | Manual feature extraction required  | Automatic feature learning
Data Requirements       | Works well with small datasets      | Requires large amounts of data
Computational Resources | Runs on CPUs                        | Benefits from GPUs/TPUs
Training Time           | Minutes to hours                    | Hours to days/weeks
Interpretability        | Often interpretable                 | Black box (less interpretable)
Performance             | Plateaus with more data             | Improves with more data

Why Deep Learning Now?

Big Data

The explosion of data from the internet, IoT sensors, and social media provides the fuel for deep learning models.

Example: ImageNet dataset with 14M images

Computational Power

Modern GPUs and specialized hardware (TPUs) enable training large neural networks efficiently.

Example: NVIDIA A100 GPU (up to 624 TFLOPS of tensor-core throughput)

Better Algorithms

Improvements in activation functions, optimization, regularization, and architectures.

Example: ReLU, Adam optimizer, Batch Normalization

2. Artificial Neural Networks

The Perceptron: Building Block

A perceptron is the simplest neural network unit, inspired by biological neurons.

Perceptron Formula:
output = activation(Σ(weights × inputs) + bias)
y = f(w₁x₁ + w₂x₂ + ... + wₙxₙ + b)
# Simple Perceptron in NumPy
import numpy as np

class Perceptron:
    def __init__(self, input_size):
        self.weights = np.random.randn(input_size)
        self.bias = np.random.randn()

    def forward(self, x):
        # Linear combination
        z = np.dot(self.weights, x) + self.bias
        # Step function activation
        return 1 if z > 0 else 0

    def train(self, X, y, learning_rate=0.1, epochs=100):
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                prediction = self.forward(xi)
                error = yi - prediction
                # Update weights
                self.weights += learning_rate * error * xi
                self.bias += learning_rate * error

# Example: Learning AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

perceptron = Perceptron(input_size=2)
perceptron.train(X, y)

# Test
for xi in X:
    print(f"Input: {xi}, Output: {perceptron.forward(xi)}")

Multi-Layer Perceptron (MLP)

Multiple layers of perceptrons can learn complex non-linear functions.

# MLP with TensorFlow/Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Build MLP model
model = keras.Sequential([
    # Input layer (implicit)
    layers.Dense(128, activation='relu', input_shape=(784,)),
    layers.Dropout(0.2),
    # Hidden layers
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(32, activation='relu'),
    # Output layer
    layers.Dense(10, activation='softmax')  # 10 classes
])

# Compile model
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Model summary
model.summary()

# Train model
history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=32,
    validation_split=0.2,
    verbose=1
)

# Evaluate
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_acc:.3f}")

Activation Functions

Activation functions introduce non-linearity, enabling networks to learn complex patterns.

Function   | Formula                   | Range         | Use Case                           | Pros/Cons
-----------|---------------------------|---------------|------------------------------------|-------------------------------------------------
ReLU       | f(x) = max(0, x)          | [0, ∞)        | Hidden layers (most common)        | ✓ Fast, simple; ✗ Dying ReLU problem
Leaky ReLU | f(x) = max(0.01x, x)      | (-∞, ∞)       | Hidden layers (fixes dying ReLU)   | ✓ No dead neurons; ✗ Inconsistent predictions
Sigmoid    | f(x) = 1/(1+e⁻ˣ)          | (0, 1)        | Binary classification output       | ✓ Smooth gradient; ✗ Vanishing gradient
Tanh       | f(x) = (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ)  | (-1, 1)       | Hidden layers (RNNs)               | ✓ Zero-centered; ✗ Vanishing gradient
Softmax    | f(xᵢ) = eˣⁱ/Σⱼeˣʲ         | (0, 1), Σ=1   | Multi-class classification output  | ✓ Probability distribution; ✗ Only for output
GELU       | f(x) = x·Φ(x)             | ≈[-0.17, ∞)   | Transformers, modern architectures | ✓ Smooth, non-monotonic; ✗ Slower computation
# Implementing activation functions
import numpy as np

# ReLU
def relu(x):
    return np.maximum(0, x)

# Leaky ReLU
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

# Sigmoid
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Tanh
def tanh(x):
    return np.tanh(x)

# Softmax
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # Numerical stability
    return exp_x / exp_x.sum(axis=0)

# Example usage
x = np.array([-2, -1, 0, 1, 2])
print(f"Input: {x}")
print(f"ReLU: {relu(x)}")
print(f"Sigmoid: {sigmoid(x)}")
print(f"Tanh: {tanh(x)}")

# Softmax example (class probabilities)
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(f"Logits: {logits}")
print(f"Softmax probabilities: {probs}")
print(f"Sum: {probs.sum()}")  # Should be 1.0

Backpropagation

The algorithm for training neural networks by computing gradients efficiently using the chain rule.

Backpropagation Steps:
  1. Forward Pass: Compute output predictions
  2. Compute Loss: Calculate error between prediction and actual
  3. Backward Pass: Compute gradients using chain rule (∂Loss/∂weights)
  4. Update Weights: Adjust weights in opposite direction of gradient
# Simple backpropagation implementation
import numpy as np

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01
        self.b2 = np.zeros((1, output_size))

    def forward(self, X):
        # Layer 1
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = np.maximum(0, self.z1)  # ReLU
        # Layer 2
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        # Sigmoid for binary classification
        self.a2 = 1 / (1 + np.exp(-self.z2))
        return self.a2

    def backward(self, X, y, output, learning_rate=0.01):
        m = X.shape[0]  # Number of samples

        # Output layer gradients
        dz2 = output - y
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m

        # Hidden layer gradients
        da1 = np.dot(dz2, self.W2.T)
        dz1 = da1 * (self.z1 > 0)  # ReLU derivative
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m

        # Update weights
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1

    def train(self, X, y, epochs=1000, learning_rate=0.01):
        for epoch in range(epochs):
            # Forward pass
            output = self.forward(X)
            # Backward pass
            self.backward(X, y, output, learning_rate)
            # Print loss every 100 epochs
            if epoch % 100 == 0:
                loss = -np.mean(y * np.log(output) + (1 - y) * np.log(1 - output))
                print(f"Epoch {epoch}, Loss: {loss:.4f}")

# Example: XOR problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
nn.train(X, y, epochs=5000, learning_rate=0.1)

# Test predictions
predictions = nn.forward(X)
print("\nPredictions:")
for i, (xi, pred) in enumerate(zip(X, predictions)):
    print(f"Input: {xi}, Predicted: {pred[0]:.4f}, Actual: {y[i][0]}")

3. Training Deep Networks

Loss Functions

Loss Function             | Use Case                         | Formula
--------------------------|----------------------------------|-----------------------------------
Binary Cross-Entropy      | Binary classification            | -[y·log(ŷ) + (1-y)·log(1-ŷ)]
Categorical Cross-Entropy | Multi-class classification       | -Σ yᵢ·log(ŷᵢ)
Mean Squared Error (MSE)  | Regression                       | Σ(y - ŷ)² / n
Mean Absolute Error (MAE) | Regression (robust to outliers)  | Σ |y - ŷ| / n
Huber Loss                | Regression (robust)              | Smooth combination of MSE and MAE
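
To make the table concrete, here is a small NumPy sketch of these losses written for this guide; the epsilon clipping and the Huber delta are illustrative choices:

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true is one-hot, y_pred holds class probabilities per row
    return -np.mean(np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)), axis=-1))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    err = y_true - y_pred
    quadratic = 0.5 * err ** 2                    # MSE-like near zero
    linear = delta * (np.abs(err) - 0.5 * delta)  # MAE-like in the tails
    return np.mean(np.where(np.abs(err) <= delta, quadratic, linear))

# Quick check
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
print(binary_cross_entropy(y_true, y_pred), mse(y_true, y_pred), huber(y_true, y_pred))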

Optimization Algorithms

Stochastic Gradient Descent (SGD)

Updates weights using gradient of single sample or mini-batch.

# SGD with momentum
optimizer = keras.optimizers.SGD(
    learning_rate=0.01,
    momentum=0.9,
    nesterov=True
)

Adam (Adaptive Moment Estimation)

Combines momentum and adaptive learning rates. Most popular optimizer.

# Adam optimizer
optimizer = keras.optimizers.Adam(
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999
)

RMSprop

Adapts learning rate for each parameter based on recent gradient magnitudes.

# RMSprop optimizer
optimizer = keras.optimizers.RMSprop(
    learning_rate=0.001,
    rho=0.9
)

AdaGrad

Adapts learning rate based on historical gradients. Good for sparse data.

# AdaGrad optimizer
optimizer = keras.optimizers.Adagrad(
    learning_rate=0.01
)

Learning Rate Scheduling

# Learning rate schedules in Keras

# 1. Step Decay
def step_decay(epoch):
    initial_lr = 0.1
    drop = 0.5
    epochs_drop = 10.0
    lr = initial_lr * (drop ** np.floor((1 + epoch) / epochs_drop))
    return lr

lr_scheduler = keras.callbacks.LearningRateScheduler(step_decay)

# 2. Exponential Decay
lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=1000,
    decay_rate=0.96
)

# 3. Cosine Decay
lr_schedule = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1,
    decay_steps=1000
)

# 4. Reduce on Plateau
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.2,
    patience=5,
    min_lr=0.00001
)

# Use in training
model.fit(
    X_train, y_train,
    epochs=50,
    callbacks=[reduce_lr]
)

Regularization Techniques

L1/L2 Regularization

Add penalty term to loss function to prevent large weights.

from tensorflow.keras import regularizers

model = keras.Sequential([
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(0.01),
                 bias_regularizer=regularizers.l1(0.01))
])

Dropout

Randomly drop neurons during training to prevent co-adaptation.

model = keras.Sequential([
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),  # Drop 50% of neurons
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax')
])

Batch Normalization

Normalize activations to reduce internal covariate shift.

model = keras.Sequential([
    layers.Dense(128),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dense(64),
    layers.BatchNormalization(),
    layers.Activation('relu')
])

Early Stopping

Stop training when validation performance stops improving.

early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          callbacks=[early_stop])

4. Convolutional Neural Networks (CNNs)

What are CNNs?

CNNs are specialized neural networks for processing grid-like data (images, video, time series). They use convolution operations to automatically learn spatial hierarchies of features.

Key CNN Layers

Convolutional Layer

Applies filters/kernels to extract features like edges, textures, patterns.

Parameters: filters, kernel_size, stride, padding

Pooling Layer

Downsamples feature maps to reduce dimensionality and computation.

Types: MaxPooling (most common), AveragePooling
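
To see how these two layer types transform tensor shapes, here is a minimal sketch; the filter count, kernel size, and 32x32 input are illustrative and not tied to any particular model:

import tensorflow as tf
from tensorflow.keras import layers

# One conv + pool stage applied to a batch containing a single 32x32 RGB image
x = tf.random.normal((1, 32, 32, 3))

conv = layers.Conv2D(filters=16, kernel_size=(3, 3), strides=1, padding='same', activation='relu')
pool = layers.MaxPooling2D(pool_size=(2, 2))

feature_maps = conv(x)       # (1, 32, 32, 16): 'same' padding keeps the spatial size
pooled = pool(feature_maps)  # (1, 16, 16, 16): pooling halves height and width

print(feature_maps.shape, pooled.shape)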

CNN Architecture Example

# CNN for Image Classification (CIFAR-10)
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    # First Convolutional Block
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3), padding='same'),
    layers.BatchNormalization(),
    layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.2),

    # Second Convolutional Block
    layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
    layers.BatchNormalization(),
    layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.3),

    # Third Convolutional Block
    layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
    layers.BatchNormalization(),
    layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.4),

    # Fully Connected Layers
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')  # 10 classes
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

model.summary()

# Data augmentation
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    zoom_range=0.1
)

# Train (early_stop and reduce_lr are the callbacks defined in Section 3)
history = model.fit(
    datagen.flow(X_train, y_train, batch_size=32),
    epochs=50,
    validation_data=(X_test, y_test),
    callbacks=[early_stop, reduce_lr]
)

Popular CNN Architectures

LeNet-5 (1998)

Pioneer: First successful CNN for digit recognition

Layers: 7 layers (Conv-Pool-Conv-Pool-FC-FC-Output)

Use: Digit recognition (MNIST)

AlexNet (2012)

Breakthrough: Won ImageNet 2012, popularized deep learning

Innovation: ReLU, Dropout, GPU training

Params: 60M parameters, 8 layers

VGG-16/19 (2014)

Key Idea: Deeper is better (16-19 layers)

Architecture: Small 3x3 filters throughout

Params: 138M (VGG-16)

ResNet (2015)

Innovation: Skip connections solve vanishing gradient

Depth: 50, 101, 152 layers possible

Impact: Enabled very deep networks

Inception (GoogLeNet, 2014)

Key Idea: Multiple filter sizes in parallel

Efficiency: Fewer params than VGG

Versions: v1, v2, v3, v4, Xception

EfficientNet (2019)

Innovation: Compound scaling (depth, width, resolution)

Efficiency: Best accuracy/parameters trade-off

Variants: B0-B7

Using Pre-trained CNNs

# Transfer Learning with ResNet50
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

# Load pre-trained model (without top classification layer)
base_model = ResNet50(
    weights='imagenet',
    include_top=False,
    input_shape=(224, 224, 3)
)

# Freeze base model layers
base_model.trainable = False

# Add custom classification head
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation='softmax')
])

model.compile(
    optimizer=keras.optimizers.Adam(1e-4),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Train only the new layers
model.fit(X_train, y_train, epochs=10)

# Fine-tuning: Unfreeze some base layers
base_model.trainable = True

# Freeze early layers, train later layers
for layer in base_model.layers[:100]:
    layer.trainable = False

# Recompile with lower learning rate
model.compile(
    optimizer=keras.optimizers.Adam(1e-5),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Continue training
model.fit(X_train, y_train, epochs=10)

CNN Applications

Image classification, object detection, semantic segmentation, facial recognition, and medical image analysis (see Section 11 for concrete examples).

5. Recurrent Neural Networks (RNNs)

What are RNNs?

RNNs are neural networks designed for sequential data (time series, text, audio). They have loops that allow information to persist, maintaining a "memory" of previous inputs.

RNN Characteristics:

  • A hidden state carries information ("memory") from earlier timesteps
  • The same weights are shared across all timesteps
  • Inputs and outputs can be variable-length sequences
  • Processing is inherently sequential, which limits parallelism

Simple RNN Architecture

# Simple RNN for sequence classification
model = keras.Sequential([
    layers.SimpleRNN(64, return_sequences=True, input_shape=(timesteps, features)),
    layers.SimpleRNN(32),
    layers.Dense(10, activation='softmax')
])

# Example: Text classification
# Input shape: (batch_size, sequence_length, embedding_dim)
model = keras.Sequential([
    layers.Embedding(vocab_size, 128, input_length=max_length),
    layers.SimpleRNN(64, return_sequences=False),
    layers.Dense(1, activation='sigmoid')  # Binary classification
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model.fit(X_train, y_train, epochs=10, validation_split=0.2)

Long Short-Term Memory (LSTM)

LSTMs solve the vanishing gradient problem in RNNs using gates to control information flow.

LSTM Gates:

  • Forget gate: decides which parts of the cell state to discard
  • Input gate: decides which new information to write to the cell state
  • Output gate: decides what to expose from the cell state as the hidden state

# LSTM for sentiment analysis
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence

# Load IMDB dataset
max_features = 10000
maxlen = 200
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)

# Pad sequences to same length
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)

# Build LSTM model
model = keras.Sequential([
    layers.Embedding(max_features, 128, input_length=maxlen),
    layers.LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
    layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2),
    layers.Dense(1, activation='sigmoid')
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Train
history = model.fit(
    X_train, y_train,
    batch_size=32,
    epochs=10,
    validation_data=(X_test, y_test)
)

# Evaluate
score = model.evaluate(X_test, y_test)
print(f"Test accuracy: {score[1]:.3f}")

Gated Recurrent Unit (GRU)

Simplified version of LSTM with fewer parameters. Often performs similarly to LSTM.

# GRU for time series prediction
model = keras.Sequential([
    layers.GRU(64, return_sequences=True, input_shape=(timesteps, features)),
    layers.Dropout(0.2),
    layers.GRU(32),
    layers.Dropout(0.2),
    layers.Dense(1)  # Regression output
])

model.compile(optimizer='adam', loss='mse')

# Example: Stock price prediction
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)

Bidirectional RNNs

Process sequences in both forward and backward directions for better context understanding.

# Bidirectional LSTM
model = keras.Sequential([
    layers.Embedding(vocab_size, 128),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(num_classes, activation='softmax')
])

# Useful for:
# - Named Entity Recognition
# - Part-of-speech tagging
# - Any task where future context helps understand current position

Sequence-to-Sequence Models

Encoder-Decoder architecture for tasks like machine translation.

# Seq2Seq model for machine translation

# Encoder
encoder_inputs = layers.Input(shape=(None, num_encoder_tokens))
encoder_lstm = layers.LSTM(256, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder
decoder_inputs = layers.Input(shape=(None, num_decoder_tokens))
decoder_lstm = layers.LSTM(256, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = layers.Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Full model
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

# Train
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=64,
          epochs=100,
          validation_split=0.2)

RNN Applications

Time-series forecasting, language modeling, machine translation, speech recognition, and sentiment analysis (see Section 11 for concrete examples).

6. Transformers and Attention Mechanisms

What are Transformers?

Transformers are the architecture behind modern LLMs (GPT, BERT, T5). They use self-attention mechanisms to process sequences in parallel, unlike RNNs that process sequentially.

Self-Attention Mechanism

Attention allows the model to focus on relevant parts of the input when processing each element.

Attention Formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Q = Query, K = Key, V = Value
# Simple self-attention implementation
import numpy as np

def self_attention(X):
    # X shape: (sequence_length, d_model)
    d_k = X.shape[1]

    # In practice, these are learned linear projections
    Q = X  # Query
    K = X  # Key
    V = X  # Value

    # Compute attention scores
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)

    # Apply softmax
    attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)

    # Compute weighted sum of values
    output = np.matmul(attention_weights, V)
    return output, attention_weights

# Example
sequence_length = 5
d_model = 8
X = np.random.randn(sequence_length, d_model)
output, weights = self_attention(X)
print("Attention weights shape:", weights.shape)
print("Output shape:", output.shape)

Transformer Architecture

# Multi-Head Self-Attention layer
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class MultiHeadAttention(layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        assert d_model % num_heads == 0
        self.depth = d_model // num_heads
        self.wq = layers.Dense(d_model)
        self.wk = layers.Dense(d_model)
        self.wv = layers.Dense(d_model)
        self.dense = layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask=None):
        batch_size = tf.shape(q)[0]

        # Linear projections
        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)

        # Split into multiple heads
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # Scaled dot-product attention
        matmul_qk = tf.matmul(q, k, transpose_b=True)
        dk = tf.cast(tf.shape(k)[-1], tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

        if mask is not None:
            scaled_attention_logits += (mask * -1e9)

        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
        output = tf.matmul(attention_weights, v)

        output = tf.transpose(output, perm=[0, 2, 1, 3])
        output = tf.reshape(output, (batch_size, -1, self.d_model))
        output = self.dense(output)
        return output, attention_weights


# Transformer Encoder Block
class TransformerBlock(layers.Layer):
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        super(TransformerBlock, self).__init__()
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = keras.Sequential([
            layers.Dense(dff, activation='relu'),
            layers.Dense(d_model)
        ])
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(dropout_rate)
        self.dropout2 = layers.Dropout(dropout_rate)

    def call(self, x, training=False, mask=None):
        # Multi-head attention with residual connection and layer norm
        attn_output, _ = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)

        # Feed-forward network with residual connection and layer norm
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)
        return out2


# Standard sinusoidal positional encoding (defined here so the model below is self-contained)
def positional_encoding(max_length, d_model):
    pos = np.arange(max_length)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angle_rads = pos / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    return tf.cast(angle_rads[np.newaxis, ...], dtype=tf.float32)


# Build Transformer model
def create_transformer_model(vocab_size, num_classes, d_model=128, num_heads=8,
                             dff=512, num_layers=4, max_length=100):
    inputs = layers.Input(shape=(max_length,))

    # Embedding + Positional Encoding
    x = layers.Embedding(vocab_size, d_model)(inputs)
    x *= tf.math.sqrt(tf.cast(d_model, tf.float32))
    pos_encoding = positional_encoding(max_length, d_model)
    x += pos_encoding[:, :max_length, :]

    # Transformer blocks
    for _ in range(num_layers):
        x = TransformerBlock(d_model, num_heads, dff)(x)

    # Global average pooling over the sequence dimension
    x = layers.GlobalAveragePooling1D()(x)

    # Classification head
    x = layers.Dropout(0.1)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return keras.Model(inputs=inputs, outputs=outputs)

Popular Transformer Models

BERT

Type: Bidirectional Encoder

Training: Masked Language Modeling

Use: Text classification, QA, NER

Variants: RoBERTa, ALBERT, DistilBERT

GPT

Type: Autoregressive Decoder

Training: Next token prediction

Use: Text generation, completion

Versions: GPT-2, GPT-3, GPT-4

T5

Type: Encoder-Decoder

Approach: Text-to-Text framework

Use: Translation, summarization, QA

Sizes: Small, Base, Large, XL, XXL

Using Pre-trained Transformers

# Using Hugging Face Transformers
from transformers import BertTokenizer, TFBertForSequenceClassification
import tensorflow as tf
import numpy as np

# Load pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)

# Prepare data
texts = ["This movie is great!", "This movie is terrible."]
labels = np.array([1, 0])  # positive, negative

# Tokenize
encodings = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors='tf'
)

# Train
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

model.fit(
    encodings['input_ids'], labels,
    epochs=3,
    batch_size=8
)

# Inference
new_texts = ["I loved this movie!"]
new_encodings = tokenizer(
    new_texts,
    padding=True,
    truncation=True,
    return_tensors='tf'
)
predictions = model(new_encodings['input_ids'])
predicted_class = tf.argmax(predictions.logits, axis=1)
print(f"Predicted sentiment: {predicted_class.numpy()}")

7. Advanced Architectures

Autoencoders

Unsupervised learning models that learn compressed representations of data.

# Autoencoder for dimensionality reduction

# Encoder
encoder_input = layers.Input(shape=(784,))
encoded = layers.Dense(128, activation='relu')(encoder_input)
encoded = layers.Dense(64, activation='relu')(encoded)
encoded = layers.Dense(32, activation='relu')(encoded)

# Decoder
decoded = layers.Dense(64, activation='relu')(encoded)
decoded = layers.Dense(128, activation='relu')(decoded)
decoded = layers.Dense(784, activation='sigmoid')(decoded)

# Full autoencoder
autoencoder = keras.Model(encoder_input, decoded)

# Encoder model (for extracting features)
encoder = keras.Model(encoder_input, encoded)

# Train
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.fit(X_train, X_train,  # Input = Output
                epochs=50,
                batch_size=256,
                validation_data=(X_test, X_test))

# Use encoder for dimensionality reduction
encoded_imgs = encoder.predict(X_test)

Variational Autoencoders (VAE)

Generative models that learn probability distributions in latent space.
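
As a rough illustration, here is a minimal VAE sketch in the same TF 2.x Keras style as the rest of this guide; the 784-dimensional input, hidden sizes, and 2-D latent space are illustrative choices, and the add_loss pattern assumes the TF 2.x functional API:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 2

# Encoder: outputs the mean and log-variance of the approximate posterior
enc_in = layers.Input(shape=(784,))
h = layers.Dense(128, activation='relu')(enc_in)
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)

# Reparameterization trick: z = mean + sigma * epsilon, with epsilon ~ N(0, I)
def sample_z(args):
    z_mean, z_log_var = args
    eps = tf.random.normal(shape=tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

z = layers.Lambda(sample_z)([z_mean, z_log_var])

# Decoder: maps a latent vector back to input space
h_dec = layers.Dense(128, activation='relu')(z)
dec_out = layers.Dense(784, activation='sigmoid')(h_dec)

vae = keras.Model(enc_in, dec_out)

# Loss = reconstruction error + KL divergence to a standard normal prior
recon_loss = 784 * tf.reduce_mean(keras.losses.binary_crossentropy(enc_in, dec_out))
kl_loss = -0.5 * tf.reduce_mean(
    tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1)
)
vae.add_loss(recon_loss + kl_loss)
vae.compile(optimizer='adam')

# Train on inputs only (the model reconstructs its own input)
# vae.fit(X_train, epochs=30, batch_size=128)  # X_train: flattened images scaled to [0, 1]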

Generative Adversarial Networks (GANs)

Two networks compete: Generator creates fake data, Discriminator tries to detect fakes.

# Simple GAN for generating images

# Generator
def build_generator(latent_dim):
    model = keras.Sequential([
        layers.Dense(256, input_dim=latent_dim),
        layers.LeakyReLU(alpha=0.2),
        layers.BatchNormalization(momentum=0.8),
        layers.Dense(512),
        layers.LeakyReLU(alpha=0.2),
        layers.BatchNormalization(momentum=0.8),
        layers.Dense(1024),
        layers.LeakyReLU(alpha=0.2),
        layers.BatchNormalization(momentum=0.8),
        layers.Dense(784, activation='tanh'),
        layers.Reshape((28, 28, 1))
    ])
    return model

# Discriminator
def build_discriminator():
    model = keras.Sequential([
        layers.Flatten(input_shape=(28, 28, 1)),
        layers.Dense(512),
        layers.LeakyReLU(alpha=0.2),
        layers.Dropout(0.3),
        layers.Dense(256),
        layers.LeakyReLU(alpha=0.2),
        layers.Dropout(0.3),
        layers.Dense(1, activation='sigmoid')
    ])
    return model

# Build and compile
latent_dim = 100
generator = build_generator(latent_dim)
discriminator = build_discriminator()

discriminator.compile(
    loss='binary_crossentropy',
    optimizer=keras.optimizers.Adam(0.0002, 0.5),
    metrics=['accuracy']
)

# Combined model (Generator + Discriminator)
discriminator.trainable = False
gan_input = layers.Input(shape=(latent_dim,))
generated_img = generator(gan_input)
gan_output = discriminator(generated_img)

gan = keras.Model(gan_input, gan_output)
gan.compile(
    loss='binary_crossentropy',
    optimizer=keras.optimizers.Adam(0.0002, 0.5)
)

# Training loop
def train_gan(epochs, batch_size=128):
    for epoch in range(epochs):
        # Train Discriminator
        idx = np.random.randint(0, X_train.shape[0], batch_size)
        real_imgs = X_train[idx]
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        fake_imgs = generator.predict(noise)

        d_loss_real = discriminator.train_on_batch(real_imgs, np.ones((batch_size, 1)))
        d_loss_fake = discriminator.train_on_batch(fake_imgs, np.zeros((batch_size, 1)))

        # Train Generator
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        g_loss = gan.train_on_batch(noise, np.ones((batch_size, 1)))

        if epoch % 100 == 0:
            print(f"Epoch {epoch}, D Loss: {d_loss_real[0]}, G Loss: {g_loss}")

Graph Neural Networks (GNNs)

Neural networks for graph-structured data (social networks, molecules, knowledge graphs).
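
As a rough sketch of the core idea, here is a single NumPy graph-convolution layer (GCN-style); the toy graph, node features, and weight matrix are illustrative stand-ins for learned parameters:

import numpy as np

def gcn_layer(A, H, W):
    # Add self-loops, then symmetrically normalize the adjacency matrix
    A_hat = A + np.eye(A.shape[0])
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
    # Aggregate neighbor features, project with W, apply ReLU
    return np.maximum(0, A_norm @ H @ W)

# Toy graph: 4 nodes, 3 features per node
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)  # adjacency matrix
H = np.random.randn(4, 3)                  # node feature matrix
W = np.random.randn(3, 8)                  # weight matrix (random here, learned in practice)

H_next = gcn_layer(A, H, W)
print(H_next.shape)  # (4, 8): each node now has an 8-dimensional representation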

Neural Architecture Search (NAS)

Automatically discover optimal neural network architectures.

8. Optimization and Regularization

Common Training Challenges

Vanishing Gradient Problem:

Gradients become extremely small in deep networks, preventing early layers from learning.

Solutions:

  • Use ReLU-family activations instead of sigmoid/tanh in hidden layers
  • Batch Normalization to keep activations well-scaled
  • Residual (skip) connections, as in ResNet
  • Careful weight initialization (e.g., He/Xavier)
  • Gated cells (LSTM/GRU) for recurrent networks
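
Two of these fixes in Keras, as a brief sketch (the layer sizes are arbitrary):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # He initialization is matched to ReLU and keeps gradient variance stable with depth
    layers.Dense(256, kernel_initializer='he_normal', input_shape=(784,)),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dense(128, kernel_initializer='he_normal'),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dense(10, activation='softmax')
])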

Exploding Gradient Problem:

Gradients become extremely large, causing unstable training.

Solutions:

  • Gradient clipping (by norm or by value)
  • Lower learning rates
  • L1/L2 weight regularization
  • Batch Normalization
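
Gradient clipping in particular is a one-argument change on any Keras optimizer; a short sketch with illustrative thresholds:

from tensorflow import keras

# Clip the gradient norm of each weight tensor to 1.0
optimizer = keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

# Or clip each gradient element to the range [-0.5, 0.5]
optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, clipvalue=0.5)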

Practical Training Tips

Training Best Practices:

  • Start from a simple baseline model and scale up only when needed
  • Monitor validation loss and use callbacks (checkpointing, early stopping, LR scheduling)
  • Augment training data where possible
  • Handle class imbalance (e.g., with class weights)
  • Log experiments (e.g., with TensorBoard) so runs are reproducible and comparable

# Advanced training setup
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard, ReduceLROnPlateau

# Callbacks
callbacks = [
    # Save best model
    ModelCheckpoint(
        'best_model.h5',
        monitor='val_loss',
        save_best_only=True
    ),
    # Reduce learning rate on plateau
    ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=5,
        min_lr=1e-7
    ),
    # Early stopping
    keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=10,
        restore_best_weights=True
    ),
    # TensorBoard logging
    TensorBoard(
        log_dir='./logs',
        histogram_freq=1
    )
]

# Data augmentation
datagen = keras.preprocessing.image.ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    zoom_range=0.2,
    fill_mode='nearest'
)

# Train with all optimizations
history = model.fit(
    datagen.flow(X_train, y_train, batch_size=32),
    steps_per_epoch=len(X_train) // 32,
    epochs=100,
    validation_data=(X_val, y_val),
    callbacks=callbacks,
    class_weight=class_weights  # Handle imbalanced data
)

9. Transfer Learning

What is Transfer Learning?

Leverage knowledge from pre-trained models to solve new tasks with less data and training time.

Transfer Learning Strategies

Feature Extraction

Freeze pre-trained model, add new classifier on top.

When: Small dataset, similar task

Fast: Only train new layers

Fine-Tuning

Unfreeze some layers and train with low learning rate.

When: Medium dataset, related task

Better performance: Adapt to new domain

# Transfer Learning workflow

# 1. Load pre-trained model
base_model = keras.applications.MobileNetV2(
    input_shape=(224, 224, 3),
    include_top=False,
    weights='imagenet'
)

# 2. Freeze base model
base_model.trainable = False

# 3. Add custom head
model = keras.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation='softmax')
])

# 4. Train top layers
model.compile(
    optimizer=keras.optimizers.Adam(1e-3),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
model.fit(train_dataset, epochs=5, validation_data=val_dataset)

# 5. Fine-tune: Unfreeze and train with low LR
base_model.trainable = True

# Freeze early layers
for layer in base_model.layers[:100]:
    layer.trainable = False

model.compile(
    optimizer=keras.optimizers.Adam(1e-5),  # Lower LR
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
model.fit(train_dataset, epochs=10, validation_data=val_dataset)

Domain Adaptation

Adapt models trained on one domain (e.g., synthetic images) to another (e.g., real images).

10. Model Deployment

Model Optimization for Production

Quantization

Reduce precision (float32 → int8) for faster inference.

Benefit: 4x smaller model, 2-4x faster

Pruning

Remove unnecessary weights/neurons.

Benefit: Smaller model, faster inference

Knowledge Distillation

Train small model to mimic large model.

Benefit: Maintain performance, reduce size (a loss sketch follows the TFLite example below)

Model Compression

Combine multiple techniques for maximum efficiency.

Benefit: Deploy on edge devices

# TensorFlow Lite conversion (mobile deployment)
import tensorflow as tf

# Convert to TFLite
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Apply optimizations
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Quantization
# representative_dataset: a generator yielding sample input batches for calibration
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

# Convert
tflite_model = converter.convert()

# Save
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)

# Deploy on mobile/edge devices
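
Quantization is covered by the TFLite example above; for knowledge distillation, here is a minimal sketch of the combined soft-target/hard-target loss (the helper name, temperature, and alpha weighting are illustrative):

import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, labels, temperature=3.0, alpha=0.1):
    # Soft targets: match the teacher's softened probability distribution
    soft_targets = tf.nn.softmax(teacher_logits / temperature)
    soft_preds = tf.nn.softmax(student_logits / temperature)
    soft_loss = tf.reduce_mean(
        tf.keras.losses.kullback_leibler_divergence(soft_targets, soft_preds)
    ) * (temperature ** 2)

    # Hard targets: ordinary cross-entropy against the true integer labels
    hard_loss = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(labels, student_logits, from_logits=True)
    )
    return alpha * hard_loss + (1 - alpha) * soft_loss

# Usage inside a custom training step: compute teacher_logits with the frozen large model,
# student_logits with the small model, and minimize distillation_loss w.r.t. the student only.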

Deployment Platforms

11. Real-World Applications

Computer Vision

  • Self-driving cars
  • Medical image diagnosis
  • Facial recognition
  • Object detection
  • Image generation (DALL-E)

Natural Language Processing

  • Chatbots (ChatGPT)
  • Machine translation
  • Sentiment analysis
  • Text summarization
  • Question answering

Speech and Audio

  • Speech recognition (Siri, Alexa)
  • Text-to-speech
  • Music generation
  • Audio classification
  • Voice cloning

Healthcare

  • Disease diagnosis
  • Drug discovery
  • Protein folding (AlphaFold)
  • Patient monitoring
  • Personalized treatment

Finance

  • Fraud detection
  • Algorithmic trading
  • Credit scoring
  • Risk assessment
  • Market prediction

Recommendation Systems

  • Netflix, YouTube recommendations
  • E-commerce product suggestions
  • Music recommendations (Spotify)
  • News feed personalization
  • Ad targeting

12. Best Practices and Tips

Getting Started:
Common Mistakes to Avoid:

Learning Resources