What You'll Learn
Understand fundamental concepts of neural networks and deep learning
Master different architectures: CNNs, RNNs, LSTMs, Transformers
Learn activation functions, optimization algorithms, and regularization techniques
Implement deep learning models using TensorFlow and PyTorch
Apply transfer learning and fine-tuning for practical applications
Handle common challenges like overfitting, vanishing gradients, and training instability
Deploy and optimize deep learning models in production
Guide Contents
1. Introduction to Deep Learning
What is Deep Learning?
Deep Learning is a subset of machine learning that uses neural networks with multiple layers (deep neural networks) to learn hierarchical representations of data. Unlike traditional machine learning algorithms that require manual feature engineering, deep learning can automatically discover the representations needed for feature detection or classification from raw data.
Key Characteristics of Deep Learning:
Hierarchical Learning: Lower layers learn simple features (edges, colors), higher layers learn complex concepts (faces, objects)
End-to-End Learning: Learn directly from raw data to output without manual feature engineering
Scalability: Performance improves with more data and computational resources
Representation Learning: Automatically discovers useful representations from data
Deep Learning vs Traditional ML
Aspect | Traditional ML | Deep Learning
Feature Engineering | Manual feature extraction required | Automatic feature learning
Data Requirements | Works well with small datasets | Requires large amounts of data
Computational Resources | Runs on CPUs | Benefits from GPUs/TPUs
Training Time | Minutes to hours | Hours to days/weeks
Interpretability | Often interpretable | Black box (less interpretable)
Performance | Plateaus with more data | Improves with more data
Why Deep Learning Now?
Big Data
Explosion of data from internet, IoT sensors, social media provides fuel for deep learning models.
Example: ImageNet dataset with 14M images
Computational Power
Modern GPUs and specialized hardware (TPUs) enable training large neural networks efficiently.
Example: NVIDIA A100 GPU: up to 624 TFLOPS (FP16 Tensor Core, with sparsity)
Better Algorithms
Improvements in activation functions, optimization, regularization, and architectures.
Example: ReLU, Adam optimizer, Batch Normalization
2. Artificial Neural Networks
The Perceptron: Building Block
A perceptron is the simplest neural network unit, inspired by biological neurons.
Perceptron Formula:
output = activation(Σ(weights × inputs) + bias)
y = f(w₁x₁ + w₂x₂ + ... + wₙxₙ + b)
# Simple Perceptron in NumPy
import numpy as np

class Perceptron:
    def __init__(self, input_size):
        self.weights = np.random.randn(input_size)
        self.bias = np.random.randn()

    def forward(self, x):
        # Linear combination
        z = np.dot(self.weights, x) + self.bias
        # Step function activation
        return 1 if z > 0 else 0

    def train(self, X, y, learning_rate=0.1, epochs=100):
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                prediction = self.forward(xi)
                error = yi - prediction
                # Perceptron learning rule: update weights and bias
                self.weights += learning_rate * error * xi
                self.bias += learning_rate * error

# Example: Learning AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

perceptron = Perceptron(input_size=2)
perceptron.train(X, y)

# Test
for xi in X:
    print(f"Input: {xi}, Output: {perceptron.forward(xi)}")
Multi-Layer Perceptron (MLP)
Multiple layers of perceptrons can learn complex non-linear functions.
# MLP with TensorFlow/Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Build MLP model
model = keras.Sequential([
    # Input layer (implicit)
    layers.Dense(128, activation='relu', input_shape=(784,)),
    layers.Dropout(0.2),
    # Hidden layers
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(32, activation='relu'),
    # Output layer
    layers.Dense(10, activation='softmax')  # 10 classes
])

# Compile model
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Model summary
model.summary()

# Train model (assumes X_train, y_train are defined, e.g. flattened 28x28 images)
history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=32,
    validation_split=0.2,
    verbose=1
)

# Evaluate
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_acc:.3f}")
Activation Functions
Activation functions introduce non-linearity, enabling networks to learn complex patterns.
Function | Formula | Range | Use Case | Pros/Cons
ReLU | f(x) = max(0, x) | [0, ∞) | Hidden layers (most common) | ✓ Fast, simple; ✗ dying ReLU problem
Leaky ReLU | f(x) = max(0.01x, x) | (-∞, ∞) | Hidden layers (fixes dying ReLU) | ✓ No dead neurons; ✗ negative slope is an extra hyperparameter
Sigmoid | f(x) = 1/(1+e⁻ˣ) | (0, 1) | Binary classification output | ✓ Smooth gradient; ✗ vanishing gradient
Tanh | f(x) = (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | (-1, 1) | Hidden layers (RNNs) | ✓ Zero-centered; ✗ vanishing gradient
Softmax | f(xᵢ) = eˣᵢ / Σⱼ eˣʲ | (0, 1), sums to 1 | Multi-class classification output | ✓ Probability distribution; ✗ output layer only
GELU | f(x) = x·Φ(x) | ≈[-0.17, ∞) | Transformers, modern architectures | ✓ Smooth, non-monotonic; ✗ slower to compute
# Implementing activation functions
import numpy as np

# ReLU
def relu(x):
    return np.maximum(0, x)

# Leaky ReLU
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

# Sigmoid
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Tanh
def tanh(x):
    return np.tanh(x)

# Softmax
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # Numerical stability
    return exp_x / exp_x.sum(axis=0)

# Example usage
x = np.array([-2, -1, 0, 1, 2])
print(f"Input: {x}")
print(f"ReLU: {relu(x)}")
print(f"Sigmoid: {sigmoid(x)}")
print(f"Tanh: {tanh(x)}")

# Softmax example (class probabilities)
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(f"Logits: {logits}")
print(f"Softmax probabilities: {probs}")
print(f"Sum: {probs.sum()}")  # Should be 1.0
Backpropagation
The algorithm for training neural networks by computing gradients efficiently using the chain rule.
Backpropagation Steps:
Forward Pass: Compute output predictions
Compute Loss: Calculate error between prediction and actual
Backward Pass: Compute gradients using chain rule (∂Loss/∂weights)
Update Weights: Adjust weights in opposite direction of gradient
# Simple backpropagation implementation
import numpy as np

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights with small random values
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01
        self.b2 = np.zeros((1, output_size))

    def forward(self, X):
        # Layer 1
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = np.maximum(0, self.z1)  # ReLU
        # Layer 2
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        # Sigmoid for binary classification
        self.a2 = 1 / (1 + np.exp(-self.z2))
        return self.a2

    def backward(self, X, y, output, learning_rate=0.01):
        m = X.shape[0]  # Number of samples
        # Output layer gradients
        dz2 = output - y
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        # Hidden layer gradients
        da1 = np.dot(dz2, self.W2.T)
        dz1 = da1 * (self.z1 > 0)  # ReLU derivative
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        # Update weights
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1

    def train(self, X, y, epochs=1000, learning_rate=0.01):
        for epoch in range(epochs):
            # Forward pass
            output = self.forward(X)
            # Backward pass
            self.backward(X, y, output, learning_rate)
            # Print loss every 100 epochs
            if epoch % 100 == 0:
                # Clip predictions to avoid log(0)
                p = np.clip(output, 1e-8, 1 - 1e-8)
                loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
                print(f"Epoch {epoch}, Loss: {loss:.4f}")

# Example: XOR problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
nn.train(X, y, epochs=5000, learning_rate=0.1)

# Test predictions
predictions = nn.forward(X)
print("\nPredictions:")
for i, (xi, pred) in enumerate(zip(X, predictions)):
    print(f"Input: {xi}, Predicted: {pred[0]:.4f}, Actual: {y[i][0]}")
3. Training Deep Networks
Loss Functions
Loss Function | Use Case | Formula
Binary Cross-Entropy | Binary classification | -[y·log(ŷ) + (1-y)·log(1-ŷ)]
Categorical Cross-Entropy | Multi-class classification | -Σᵢ yᵢ·log(ŷᵢ)
Mean Squared Error (MSE) | Regression | (1/n)·Σ(y - ŷ)²
Mean Absolute Error (MAE) | Regression (robust to outliers) | (1/n)·Σ|y - ŷ|
Huber Loss | Regression (robust) | Quadratic for small errors, linear for large (smooth blend of MSE and MAE)
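To make these formulas concrete, here is a minimal NumPy sketch of binary cross-entropy, MSE, and Huber loss (the function and variable names are illustrative, not a library API):
# Minimal NumPy versions of common loss functions
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-8):
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def huber(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    quadratic = 0.5 * error ** 2                    # MSE-like near zero
    linear = delta * (np.abs(error) - 0.5 * delta)  # MAE-like for large errors
    return np.mean(np.where(np.abs(error) <= delta, quadratic, linear))

# Example
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
print(binary_cross_entropy(y_true, y_pred))  # ~0.228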
Optimization Algorithms
Stochastic Gradient Descent (SGD)
Updates weights using the gradient of a single sample or mini-batch.
# SGD implementation
optimizer = keras.optimizers.SGD(
    learning_rate=0.01,
    momentum=0.9,
    nesterov=True
)
Adam (Adaptive Moment Estimation)
Combines momentum and adaptive learning rates. Most popular optimizer.
# Adam optimizer
optimizer = keras.optimizers.Adam(
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999
)
RMSprop
Adapts learning rate for each parameter based on recent gradient magnitudes.
# RMSprop optimizer
optimizer = keras.optimizers.RMSprop(
    learning_rate=0.001,
    rho=0.9
)
AdaGrad
Adapts learning rate based on historical gradients. Good for sparse data.
# AdaGrad optimizer
optimizer = keras.optimizers.Adagrad(
    learning_rate=0.01
)
Learning Rate Scheduling
# Learning rate schedules in Keras

# 1. Step Decay
def step_decay(epoch):
    initial_lr = 0.1
    drop = 0.5
    epochs_drop = 10.0
    lr = initial_lr * (drop ** np.floor((1 + epoch) / epochs_drop))
    return lr

lr_scheduler = keras.callbacks.LearningRateScheduler(step_decay)

# 2. Exponential Decay
lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=1000,
    decay_rate=0.96
)

# 3. Cosine Decay
lr_schedule = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1,
    decay_steps=1000
)

# Schedule objects are passed to an optimizer, e.g.:
# optimizer = keras.optimizers.SGD(learning_rate=lr_schedule)

# 4. Reduce on Plateau
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.2,
    patience=5,
    min_lr=0.00001
)

# Use callbacks in training
model.fit(
    X_train, y_train,
    epochs=50,
    callbacks=[reduce_lr]
)
Regularization Techniques
L1/L2 Regularization
Add penalty term to loss function to prevent large weights.
from tensorflow.keras import regularizers

model = keras.Sequential([
    layers.Dense(64,
                 activation='relu',
                 kernel_regularizer=regularizers.l2(0.01),
                 bias_regularizer=regularizers.l1(0.01))
])
Dropout
Randomly drop neurons during training to prevent co-adaptation.
model = keras.Sequential([
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),  # Randomly zero 50% of activations during training
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax')
])
Batch Normalization
Normalize activations to reduce internal covariate shift.
model = keras.Sequential([
    layers.Dense(128),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dense(64),
    layers.BatchNormalization(),
    layers.Activation('relu')
])
Early Stopping
Stop training when validation performance stops improving.
early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          callbacks=[early_stop])
4. Convolutional Neural Networks (CNNs)
What are CNNs?
CNNs are specialized neural networks for processing grid-like data (images, video, time series). They use convolution operations to automatically learn spatial hierarchies of features.
Key CNN Layers
Convolutional Layer
Applies filters/kernels to extract features like edges, textures, patterns.
Parameters: filters, kernel_size, stride, padding
Pooling Layer
Downsamples feature maps to reduce dimensionality and computation.
Types: MaxPooling (most common), AveragePooling
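To see what these parameters do to tensor shapes, here is a small sketch with made-up layer sizes; the printed shapes follow directly from the padding and pooling rules:
# How conv/pool parameters affect output shape (illustrative sizes)
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 32, 32, 3))                      # (batch, height, width, channels)

conv_same = layers.Conv2D(16, (3, 3), padding='same')     # 'same' keeps spatial size
print(conv_same(x).shape)                                 # (1, 32, 32, 16)

conv_valid = layers.Conv2D(16, (3, 3), padding='valid')   # 'valid' shrinks by kernel-1
print(conv_valid(x).shape)                                # (1, 30, 30, 16)

pool = layers.MaxPooling2D((2, 2))                        # halves height and width
print(pool(x).shape)                                      # (1, 16, 16, 3)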
CNN Architecture Example
# CNN for Image Classification (CIFAR-10)
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    # First Convolutional Block
    layers.Conv2D(32, (3, 3), activation='relu',
                  input_shape=(32, 32, 3), padding='same'),
    layers.BatchNormalization(),
    layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.2),
    # Second Convolutional Block
    layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
    layers.BatchNormalization(),
    layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.3),
    # Third Convolutional Block
    layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
    layers.BatchNormalization(),
    layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.4),
    # Fully Connected Layers
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')  # 10 classes
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

model.summary()

# Data augmentation
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    zoom_range=0.1
)

# Train (early_stop and reduce_lr are the callbacks defined in Section 3)
history = model.fit(
    datagen.flow(X_train, y_train, batch_size=32),
    epochs=50,
    validation_data=(X_test, y_test),
    callbacks=[early_stop, reduce_lr]
)
Popular CNN Architectures
LeNet-5 (1998)
Pioneer: First successful CNN for digit recognition
Layers: 7 layers (Conv-Pool-Conv-Pool-FC-FC-Output)
Use: Digit recognition (MNIST)
AlexNet (2012)
Breakthrough: Won ImageNet 2012, popularized deep learning
Innovation: ReLU, Dropout, GPU training
Params: 60M parameters, 8 layers
VGG-16/19 (2014)
Key Idea: Deeper is better (16-19 layers)
Architecture: Small 3x3 filters throughout
Params: 138M (VGG-16)
ResNet (2015)
Innovation: Skip connections solve vanishing gradient
Depth: 50, 101, 152 layers possible
Impact: Enabled very deep networks
Inception (GoogLeNet)
Key Idea: Multiple filter sizes in parallel
Efficiency: Fewer params than VGG
Versions: v1, v2, v3, v4, Xception
EfficientNet (2019)
Innovation: Compound scaling (depth, width, resolution)
Efficiency: Best accuracy/parameters trade-off
Variants: B0-B7
Using Pre-trained CNNs
# Transfer Learning with ResNet50
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

# Load pre-trained model (without top classification layer)
base_model = ResNet50(
    weights='imagenet',
    include_top=False,
    input_shape=(224, 224, 3)
)

# Freeze base model layers
base_model.trainable = False

# Add custom classification head (num_classes depends on your task)
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation='softmax')
])

model.compile(
    optimizer=keras.optimizers.Adam(1e-4),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Train only the new layers
model.fit(X_train, y_train, epochs=10)

# Fine-tuning: Unfreeze some base layers
base_model.trainable = True

# Freeze early layers, train later layers
for layer in base_model.layers[:100]:
    layer.trainable = False

# Recompile with lower learning rate
model.compile(
    optimizer=keras.optimizers.Adam(1e-5),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Continue training
model.fit(X_train, y_train, epochs=10)
CNN Applications
Image Classification: Categorize images into classes (cats, dogs, vehicles)
Object Detection: Locate and classify objects (YOLO, Faster R-CNN, SSD)
Semantic Segmentation: Classify each pixel (U-Net, DeepLab)
Face Recognition: Identify individuals from facial features
Medical Imaging: Detect diseases from X-rays, MRIs, CT scans
Style Transfer: Apply artistic styles to images
5. Recurrent Neural Networks (RNNs)
What are RNNs?
RNNs are neural networks designed for sequential data (time series, text, audio). They have loops that allow information to persist, maintaining a "memory" of previous inputs.
RNN Characteristics:
Sequential Processing: Process inputs one at a time, maintaining hidden state
Variable Length: Handle sequences of different lengths
Parameter Sharing: Same weights applied at each time step
Memory: Hidden state captures information from previous inputs
Simple RNN Architecture
# Simple RNN for sequence classification
# (timesteps and features describe the input sequences)
model = keras.Sequential([
    layers.SimpleRNN(64,
                     return_sequences=True,
                     input_shape=(timesteps, features)),
    layers.SimpleRNN(32),
    layers.Dense(10, activation='softmax')
])

# Example: Text classification
# Input shape: (batch_size, sequence_length, embedding_dim)
model = keras.Sequential([
    layers.Embedding(vocab_size, 128, input_length=max_length),
    layers.SimpleRNN(64, return_sequences=False),
    layers.Dense(1, activation='sigmoid')  # Binary classification
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model.fit(X_train, y_train, epochs=10, validation_split=0.2)
Long Short-Term Memory (LSTM)
LSTMs mitigate the vanishing gradient problem in RNNs by using gates to control information flow.
LSTM Gates:
Forget Gate: Decides what information to discard from cell state
Input Gate: Decides what new information to store in cell state
Output Gate: Decides what to output based on cell state
# LSTM for sentiment analysis
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence

# Load IMDB dataset
max_features = 10000
maxlen = 200

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)

# Pad sequences to same length
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)

# Build LSTM model
model = keras.Sequential([
    layers.Embedding(max_features, 128, input_length=maxlen),
    layers.LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
    layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2),
    layers.Dense(1, activation='sigmoid')
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Train
history = model.fit(
    X_train, y_train,
    batch_size=32,
    epochs=10,
    validation_data=(X_test, y_test)
)

# Evaluate
score = model.evaluate(X_test, y_test)
print(f"Test accuracy: {score[1]:.3f}")
Gated Recurrent Unit (GRU)
A simplified version of the LSTM with fewer parameters that often performs comparably.
# GRU for time series prediction
model = keras.Sequential([
    layers.GRU(64,
               return_sequences=True,
               input_shape=(timesteps, features)),
    layers.Dropout(0.2),
    layers.GRU(32),
    layers.Dropout(0.2),
    layers.Dense(1)  # Regression output
])

model.compile(optimizer='adam', loss='mse')

# Example: Stock price prediction
model.fit(X_train, y_train,
          epochs=50,
          batch_size=32,
          validation_split=0.2)
Bidirectional RNNs
Process sequences in both forward and backward directions for better context understanding.
# Bidirectional LSTM
model = keras.Sequential([
    layers.Embedding(vocab_size, 128),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(num_classes, activation='softmax')
])

# Useful for:
# - Named Entity Recognition
# - Part-of-speech tagging
# - Any task where future context helps understand current position
Sequence-to-Sequence Models
Encoder-Decoder architecture for tasks like machine translation.
# Seq2Seq model for machine translation

# Encoder
encoder_inputs = layers.Input(shape=(None, num_encoder_tokens))
encoder_lstm = layers.LSTM(256, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder
decoder_inputs = layers.Input(shape=(None, num_decoder_tokens))
decoder_lstm = layers.LSTM(256, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = layers.Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Full model
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train
model.fit([encoder_input_data, decoder_input_data],
          decoder_target_data,
          batch_size=64,
          epochs=100,
          validation_split=0.2)
RNN Applications
Natural Language Processing: Text classification, sentiment analysis, named entity recognition
Machine Translation: Translate text from one language to another
Speech Recognition: Convert audio to text
Time Series Forecasting: Stock prices, weather, energy consumption
Video Analysis: Action recognition, video captioning
Music Generation: Compose melodies and harmonies
6. Transformers and Attention Mechanisms
What are Transformers?
Transformers are the architecture behind modern LLMs (GPT, BERT, T5). They use self-attention mechanisms to process sequences in parallel, unlike RNNs that process sequentially.
Self-Attention Mechanism
Attention allows the model to focus on relevant parts of the input when processing each element.
Attention Formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Q = Query, K = Key, V = Value
# Simple self-attention implementation
import numpy as np

def self_attention(X):
    # X shape: (sequence_length, d_model)
    d_k = X.shape[1]
    # In practice, Q, K, V come from learned linear projections of X
    Q = X  # Query
    K = X  # Key
    V = X  # Value
    # Compute scaled attention scores
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    # Softmax over each row (subtract max for numerical stability)
    scores -= scores.max(axis=-1, keepdims=True)
    attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
    # Compute weighted sum of values
    output = np.matmul(attention_weights, V)
    return output, attention_weights

# Example
sequence_length = 5
d_model = 8
X = np.random.randn(sequence_length, d_model)
output, weights = self_attention(X)
print("Attention weights shape:", weights.shape)  # (5, 5)
print("Output shape:", output.shape)              # (5, 8)
Transformer Architecture
# Multi-Head Self-Attention layer
class MultiHeadAttention(layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        assert d_model % num_heads == 0
        self.depth = d_model // num_heads
        self.wq = layers.Dense(d_model)
        self.wk = layers.Dense(d_model)
        self.wv = layers.Dense(d_model)
        self.dense = layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask=None):
        batch_size = tf.shape(q)[0]
        # Linear projections
        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)
        # Split into multiple heads
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)
        # Scaled dot-product attention
        matmul_qk = tf.matmul(q, k, transpose_b=True)
        dk = tf.cast(tf.shape(k)[-1], tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
        if mask is not None:
            scaled_attention_logits += (mask * -1e9)
        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
        output = tf.matmul(attention_weights, v)
        # Concatenate heads and apply final projection
        output = tf.transpose(output, perm=[0, 2, 1, 3])
        output = tf.reshape(output, (batch_size, -1, self.d_model))
        output = self.dense(output)
        return output, attention_weights

# Transformer Encoder Block
class TransformerBlock(layers.Layer):
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        super(TransformerBlock, self).__init__()
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = keras.Sequential([
            layers.Dense(dff, activation='relu'),
            layers.Dense(d_model)
        ])
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(dropout_rate)
        self.dropout2 = layers.Dropout(dropout_rate)

    def call(self, x, training=False, mask=None):
        # Multi-head attention with residual connection + layer norm
        attn_output, _ = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)
        # Feed-forward network with residual connection + layer norm
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)
        return out2

# Build Transformer model
def create_transformer_model(vocab_size, d_model=128, num_heads=8,
                             dff=512, num_layers=4, max_length=100):
    inputs = layers.Input(shape=(max_length,))
    # Embedding, scaled as in the original paper
    x = layers.Embedding(vocab_size, d_model)(inputs)
    x *= tf.math.sqrt(tf.cast(d_model, tf.float32))
    # Add positional encoding (helper defined below)
    pos_encoding = positional_encoding(max_length, d_model)
    x += pos_encoding[:, :max_length, :]
    # Transformer blocks
    for _ in range(num_layers):
        x = TransformerBlock(d_model, num_heads, dff)(x)
    # Global average pooling
    x = layers.GlobalAveragePooling1D()(x)
    # Classification head (num_classes assumed defined)
    x = layers.Dropout(0.1)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return keras.Model(inputs=inputs, outputs=outputs)
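The model above relies on a positional_encoding helper that the snippet does not define. A standard sinusoidal implementation, following the original Transformer paper, could look like this:
# Sinusoidal positional encoding (as in "Attention Is All You Need")
import numpy as np
import tensorflow as tf

def positional_encoding(max_length, d_model):
    positions = np.arange(max_length)[:, np.newaxis]   # (max_length, 1)
    dims = np.arange(d_model)[np.newaxis, :]           # (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / np.float32(d_model))
    angle_rads = positions * angle_rates
    # Sine on even indices, cosine on odd indices
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    pos_encoding = angle_rads[np.newaxis, ...]         # (1, max_length, d_model)
    return tf.cast(pos_encoding, dtype=tf.float32)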
Popular Transformer Models
BERT
Type: Bidirectional Encoder
Training: Masked Language Modeling
Use: Text classification, QA, NER
Variants: RoBERTa, ALBERT, DistilBERT
GPT
Type: Autoregressive Decoder
Training: Next token prediction
Use: Text generation, completion
Versions: GPT-2, GPT-3, GPT-4
T5
Type: Encoder-Decoder
Approach: Text-to-Text framework
Use: Translation, summarization, QA
Sizes: Small, Base, Large, XL, XXL
Using Pre-trained Transformers
# Using Hugging Face Transformers
import numpy as np
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

# Load pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)

# Prepare data
texts = ["This movie is great!", "This movie is terrible."]
labels = np.array([1, 0])  # positive, negative

# Tokenize
encodings = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors='tf'
)

# Train (pass input_ids and attention_mask together as a dict)
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

model.fit(
    dict(encodings),
    labels,
    epochs=3,
    batch_size=8
)

# Inference
new_texts = ["I loved this movie!"]
new_encodings = tokenizer(
    new_texts,
    padding=True,
    truncation=True,
    return_tensors='tf'
)
predictions = model(dict(new_encodings))
predicted_class = tf.argmax(predictions.logits, axis=1)
print(f"Predicted sentiment: {predicted_class.numpy()}")
7. Advanced Architectures
Autoencoders
Unsupervised learning models that learn compressed representations of data.
# Autoencoder for dimensionality reduction

# Encoder
encoder_input = layers.Input(shape=(784,))
encoded = layers.Dense(128, activation='relu')(encoder_input)
encoded = layers.Dense(64, activation='relu')(encoded)
encoded = layers.Dense(32, activation='relu')(encoded)

# Decoder
decoded = layers.Dense(64, activation='relu')(encoded)
decoded = layers.Dense(128, activation='relu')(decoded)
decoded = layers.Dense(784, activation='sigmoid')(decoded)

# Full autoencoder
autoencoder = keras.Model(encoder_input, decoded)

# Encoder model (for extracting features)
encoder = keras.Model(encoder_input, encoded)

# Train
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.fit(X_train, X_train,  # Input = Output
                epochs=50,
                batch_size=256,
                validation_data=(X_test, X_test))

# Use encoder for dimensionality reduction
encoded_imgs = encoder.predict(X_test)
Variational Autoencoders (VAE)
Generative models that learn probability distributions in latent space.
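A minimal sketch makes the idea concrete: the encoder predicts a mean and log-variance, the reparameterization trick (z = μ + σ·ε) keeps sampling differentiable, and the loss adds a KL penalty to the reconstruction error. This sketch assumes the same 784-dimensional flattened images as the autoencoder example above:
# Minimal VAE sketch: encoder outputs a distribution, decoder reconstructs
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 2

class Sampling(layers.Layer):
    # Reparameterization trick: z = mean + exp(0.5 * log_var) * epsilon
    def call(self, inputs):
        z_mean, z_log_var = inputs
        epsilon = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon

# Encoder: image -> (z_mean, z_log_var, sampled z)
enc_in = layers.Input(shape=(784,))
h = layers.Dense(128, activation='relu')(enc_in)
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)
z = Sampling()([z_mean, z_log_var])

# Decoder: z -> reconstructed image
dec_in = layers.Input(shape=(latent_dim,))
h2 = layers.Dense(128, activation='relu')(dec_in)
dec_out = layers.Dense(784, activation='sigmoid')(h2)
decoder = keras.Model(dec_in, dec_out)

reconstruction = decoder(z)
vae = keras.Model(enc_in, reconstruction)

# Loss = reconstruction loss + KL divergence to a unit Gaussian prior
recon_loss = keras.losses.binary_crossentropy(enc_in, reconstruction) * 784
kl_loss = -0.5 * tf.reduce_sum(
    1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1)
vae.add_loss(tf.reduce_mean(recon_loss + kl_loss))
vae.compile(optimizer='adam')
# vae.fit(X_train, epochs=30, batch_size=128)  # X_train assumed as above
Because the latent space is regularized toward a unit Gaussian, sampling z ~ N(0, I) and calling decoder.predict on it yields new generated samples.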
Generative Adversarial Networks (GANs)
Two networks compete: Generator creates fake data, Discriminator tries to detect fakes.
# Simple GAN for generating images
import numpy as np

# Generator
def build_generator(latent_dim):
    model = keras.Sequential([
        layers.Dense(256, input_dim=latent_dim),
        layers.LeakyReLU(alpha=0.2),
        layers.BatchNormalization(momentum=0.8),
        layers.Dense(512),
        layers.LeakyReLU(alpha=0.2),
        layers.BatchNormalization(momentum=0.8),
        layers.Dense(1024),
        layers.LeakyReLU(alpha=0.2),
        layers.BatchNormalization(momentum=0.8),
        layers.Dense(784, activation='tanh'),
        layers.Reshape((28, 28, 1))
    ])
    return model

# Discriminator
def build_discriminator():
    model = keras.Sequential([
        layers.Flatten(input_shape=(28, 28, 1)),
        layers.Dense(512),
        layers.LeakyReLU(alpha=0.2),
        layers.Dropout(0.3),
        layers.Dense(256),
        layers.LeakyReLU(alpha=0.2),
        layers.Dropout(0.3),
        layers.Dense(1, activation='sigmoid')
    ])
    return model

# Build and compile
latent_dim = 100
generator = build_generator(latent_dim)
discriminator = build_discriminator()
discriminator.compile(
    loss='binary_crossentropy',
    optimizer=keras.optimizers.Adam(0.0002, 0.5),
    metrics=['accuracy']
)

# Combined model (Generator + Discriminator)
discriminator.trainable = False
gan_input = layers.Input(shape=(latent_dim,))
generated_img = generator(gan_input)
gan_output = discriminator(generated_img)
gan = keras.Model(gan_input, gan_output)
gan.compile(
    loss='binary_crossentropy',
    optimizer=keras.optimizers.Adam(0.0002, 0.5)
)

# Training loop (X_train assumed scaled to [-1, 1] to match the tanh output)
def train_gan(epochs, batch_size=128):
    for epoch in range(epochs):
        # Train Discriminator on a real batch and a fake batch
        idx = np.random.randint(0, X_train.shape[0], batch_size)
        real_imgs = X_train[idx]
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        fake_imgs = generator.predict(noise)
        d_loss_real = discriminator.train_on_batch(real_imgs, np.ones((batch_size, 1)))
        d_loss_fake = discriminator.train_on_batch(fake_imgs, np.zeros((batch_size, 1)))
        # Train Generator to fool the discriminator (labels flipped to "real")
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        g_loss = gan.train_on_batch(noise, np.ones((batch_size, 1)))
        if epoch % 100 == 0:
            print(f"Epoch {epoch}, D Loss: {d_loss_real[0]}, G Loss: {g_loss}")
Graph Neural Networks (GNNs)
Neural networks for graph-structured data (social networks, molecules, knowledge graphs).
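For intuition, a single graph-convolution layer, in the spirit of Kipf and Welling's GCN, can be sketched in a few lines of NumPy; the 4-node adjacency matrix below is made up for illustration:
# One graph-convolution step: H' = ReLU(D^-1/2 (A + I) D^-1/2 · H · W)
import numpy as np

A = np.array([[0, 1, 0, 0],    # made-up adjacency matrix (4 nodes)
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
H = np.random.randn(4, 8)      # node features (4 nodes, 8 features each)
W = np.random.randn(8, 16)     # learnable weights (8 -> 16 features)

A_hat = A + np.eye(4)                         # add self-loops
D_inv_sqrt = np.diag(A_hat.sum(axis=1) ** -0.5)
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt      # symmetric normalization

H_next = np.maximum(0, A_norm @ H @ W)        # aggregate neighbors, transform, ReLU
print(H_next.shape)                           # (4, 16)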
Neural Architecture Search (NAS)
Automatically discover optimal neural network architectures.
8. Optimization and Regularization
Common Training Challenges
Vanishing Gradient Problem:
Gradients become extremely small in deep networks, preventing early layers from learning.
Solutions:
Use ReLU instead of sigmoid/tanh
Batch Normalization
Residual connections (ResNet)
LSTM/GRU for RNNs
Gradient clipping
Exploding Gradient Problem:
Gradients become extremely large, causing unstable training.
Solutions:
Gradient clipping (see the snippet after these lists)
Lower learning rate
Batch Normalization
Weight regularization
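Gradient clipping, listed as a solution to both problems, is a one-argument change on any Keras optimizer:
# Gradient clipping via optimizer arguments
from tensorflow import keras

# Clip the global norm of all gradients to 1.0
optimizer = keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

# Or clip each gradient element to [-0.5, 0.5]
optimizer = keras.optimizers.SGD(learning_rate=0.01, clipvalue=0.5)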
Practical Training Tips
Training Best Practices:
Data Preprocessing: Normalize inputs (mean=0, std=1)
Weight Initialization: Use He initialization for ReLU, Xavier for tanh
Batch Size: Start with 32-128, larger for more stable gradients
Learning Rate: Start with 1e-3, use learning rate finder
Monitor Training: Plot loss curves, watch for overfitting
Validation: Always use validation set for hyperparameter tuning
# Advanced training setup
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard, ReduceLROnPlateau

# Callbacks
callbacks = [
    # Save best model
    ModelCheckpoint(
        'best_model.h5',
        monitor='val_loss',
        save_best_only=True
    ),
    # Reduce learning rate on plateau
    ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=5,
        min_lr=1e-7
    ),
    # Early stopping
    keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=10,
        restore_best_weights=True
    ),
    # TensorBoard logging
    TensorBoard(
        log_dir='./logs',
        histogram_freq=1
    )
]

# Data augmentation
datagen = keras.preprocessing.image.ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    zoom_range=0.2,
    fill_mode='nearest'
)

# Train with all optimizations
history = model.fit(
    datagen.flow(X_train, y_train, batch_size=32),
    steps_per_epoch=len(X_train) // 32,
    epochs=100,
    validation_data=(X_val, y_val),
    callbacks=callbacks,
    class_weight=class_weights  # Handle imbalanced data (class_weights computed beforehand)
)
9. Transfer Learning
What is Transfer Learning?
Leverage knowledge from pre-trained models to solve new tasks with less data and training time.
Transfer Learning Strategies
Feature Extraction
Freeze pre-trained model, add new classifier on top.
When: Small dataset, similar task
Fast: Only train new layers
Fine-Tuning
Unfreeze some layers and train with low learning rate.
When: Medium dataset, related task
Better performance: Adapt to new domain
# Transfer Learning workflow

# 1. Load pre-trained model
base_model = keras.applications.MobileNetV2(
    input_shape=(224, 224, 3),
    include_top=False,
    weights='imagenet'
)

# 2. Freeze base model
base_model.trainable = False

# 3. Add custom head
model = keras.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation='softmax')
])

# 4. Train top layers (train_dataset/val_dataset assumed prepared, e.g. tf.data pipelines)
model.compile(
    optimizer=keras.optimizers.Adam(1e-3),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
model.fit(train_dataset, epochs=5, validation_data=val_dataset)

# 5. Fine-tune: Unfreeze and train with low LR
base_model.trainable = True

# Freeze early layers
for layer in base_model.layers[:100]:
    layer.trainable = False

model.compile(
    optimizer=keras.optimizers.Adam(1e-5),  # Lower LR
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
model.fit(train_dataset, epochs=10, validation_data=val_dataset)
Domain Adaptation
Adapt models trained on one domain (e.g., synthetic images) to another (e.g., real images).
10. Model Deployment
Model Optimization for Production
Quantization
Reduce precision (float32 → int8) for faster inference.
Benefit: 4x smaller model, 2-4x faster
Pruning
Remove unnecessary weights/neurons.
Benefit: Smaller model, faster inference
Knowledge Distillation
Train small model to mimic large model.
Benefit: Maintain performance, reduce size
Model Compression
Combine multiple techniques for maximum efficiency.
Benefit: Deploy on edge devices
# TensorFlow Lite conversion (mobile deployment)
import tensorflow as tf

# Convert to TFLite
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Apply optimizations
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Full integer quantization
# (representative_dataset: a generator yielding sample inputs, assumed defined)
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

# Convert
tflite_model = converter.convert()

# Save
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)

# Deploy on mobile/edge devices
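Knowledge distillation, mentioned above, has no single Keras call; a minimal custom training step might look like the sketch below, where teacher and student are assumed to be existing classification models emitting logits:
# Minimal knowledge-distillation step (teacher and student assumed defined)
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(1e-4)
temperature = 3.0   # softens the teacher's probability distribution
alpha = 0.1         # weight of the hard-label loss vs. the distillation loss

@tf.function
def distill_step(x, y):
    teacher_logits = teacher(x, training=False)
    with tf.GradientTape() as tape:
        student_logits = student(x, training=True)
        # Hard-label loss against ground truth
        hard_loss = tf.keras.losses.sparse_categorical_crossentropy(
            y, student_logits, from_logits=True)
        # Soft-label loss: match the teacher's softened distribution
        soft_loss = tf.keras.losses.kl_divergence(
            tf.nn.softmax(teacher_logits / temperature),
            tf.nn.softmax(student_logits / temperature)) * temperature ** 2
        loss = alpha * tf.reduce_mean(hard_loss) + (1 - alpha) * tf.reduce_mean(soft_loss)
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss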
Deployment Platforms
TensorFlow Serving: Production-ready serving system
TorchServe: PyTorch model serving
ONNX Runtime: Cross-platform, high-performance inference
TensorFlow Lite: Mobile and embedded devices
CoreML: iOS deployment
TensorRT: NVIDIA GPU optimization
11. Real-World Applications
Computer Vision
Self-driving cars
Medical image diagnosis
Facial recognition
Object detection
Image generation (DALL-E)
Natural Language Processing
Chatbots (ChatGPT)
Machine translation
Sentiment analysis
Text summarization
Question answering
Speech and Audio
Speech recognition (Siri, Alexa)
Text-to-speech
Music generation
Audio classification
Voice cloning
Healthcare
Disease diagnosis
Drug discovery
Protein folding (AlphaFold)
Patient monitoring
Personalized treatment
Finance
Fraud detection
Algorithmic trading
Credit scoring
Risk assessment
Market prediction
Recommendation Systems
Netflix, YouTube recommendations
E-commerce product suggestions
Music recommendations (Spotify)
News feed personalization
Ad targeting
12. Best Practices and Tips
Getting Started:
Start with simple architectures, increase complexity as needed
Use transfer learning when possible
Always split data: train/validation/test
Visualize training curves to diagnose issues
Use pre-trained models and established architectures
Experiment with Colab/Kaggle for free GPU access
Common Mistakes to Avoid:
Not shuffling training data
Data leakage between train/test sets
Training on unnormalized data
Using too high learning rate
Not using validation set
Overfitting to training data
Learning Resources
Courses: Fast.ai, Stanford CS230, Coursera Deep Learning Specialization
Books: Deep Learning (Goodfellow), Deep Learning with Python (Chollet)
Frameworks: TensorFlow, PyTorch, Keras
Practice: Kaggle competitions, personal projects
Papers: arXiv.org for latest research