Master algorithms that learn from data to make predictions and decisions without explicit programming
What You'll Learn
Understand the three main types of machine learning and when to use each
Master essential ML algorithms from linear regression to neural networks
Learn the complete ML workflow from data preparation to model deployment
Implement ML models using Python and Scikit-learn
Evaluate and optimize model performance using appropriate metrics
Apply ML to real-world problems across different domains
1. Introduction to Machine Learning
What is Machine Learning?
Machine Learning (ML) is a subset of artificial intelligence that enables systems to automatically learn and improve from experience without being explicitly programmed. Instead of following pre-defined rules, ML systems identify patterns in data and make data-driven predictions or decisions.
Traditional Programming vs Machine Learning:
Traditional: Rules + Data → Answers (programmer defines rules)
Machine Learning: Data + Answers → Rules (system learns rules)
Why Machine Learning Matters
Automation at Scale
Process millions of data points in seconds, making decisions that would take humans years.
Example: Gmail processes billions of emails daily for spam detection
Pattern Recognition
Discover complex patterns humans might miss in large datasets.
Example: Netflix recommends shows based on viewing patterns
Continuous Improvement
Models improve automatically as they process more data.
Example: Self-driving cars get safer with more miles driven
Key Terminology
Features (X): Input variables used for predictions. Example: house size, bedrooms, location.
Target (y): The output variable we want to predict. Example: house price.
Model: A mathematical function mapping inputs to outputs. Example: y = w₁x₁ + w₂x₂ + b.
Training: The process of learning model parameters from data. Example: finding the optimal weights (w) and bias (b).
Inference: Using the trained model to make predictions. Example: predicting the price of a new house.
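A minimal sketch tying these terms to scikit-learn calls (the house data below is made up for illustration):
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([[1400, 3], [2000, 4], [2600, 4]])  # Features: size (sq ft), bedrooms
y = np.array([250000, 380000, 460000])           # Target: price
model = LinearRegression()                       # Model: y = w1*x1 + w2*x2 + b
model.fit(X, y)                                  # Training: learn the weights and bias
print(model.predict([[1800, 3]]))                # Inference: predict the price of a new house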
2. Types of Machine Learning
Supervised Learning
What: Learn from labeled data (input-output pairs)
Goal: Predict output for new inputs
Examples:
Email spam detection
House price prediction
Medical diagnosis
Customer churn prediction
Data: (X, y) pairs where y is known
Unsupervised Learning
What: Find patterns in unlabeled data
Goal: Discover hidden structure
Examples:
Customer segmentation
Anomaly detection
Topic modeling
Dimensionality reduction
Data: Only X, no labels
Reinforcement Learning
What: Learn by trial and error through rewards
Goal: Maximize cumulative reward
Examples:
Game playing (AlphaGo, Chess)
Robotics control
Autonomous driving
Resource optimization
Data: State, Action, Reward sequences
3. Supervised Learning
Supervised learning is the most common ML type. It learns a mapping function f: X → y from labeled training data.
Classification vs Regression
Classification
Goal: Predict discrete categories
Output: Class label (e.g., spam/not spam, cat/dog)
Algorithms:
Logistic Regression
Decision Trees
Random Forest
Support Vector Machines
Neural Networks
Regression
Goal: Predict continuous values
Output: Numerical value (e.g., price, temperature)
Algorithms:
Linear Regression
Polynomial Regression
Random Forest Regressor
Gradient Boosting
Neural Networks
Popular Supervised Learning Algorithms
1. Linear Regression
The simplest regression algorithm: it models the relationship between the features and the target as a straight line.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np
# Sample data: house size → price
X = np.array([[1000], [1500], [2000], [2500], [3000]]) # sq ft
y = np.array([200000, 300000, 400000, 500000, 600000]) # price
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
new_house = [[1800]]
predicted_price = model.predict(new_house)
print(f"Predicted price for 1800 sq ft: ${predicted_price[0]:,.0f}")
# Model parameters
print(f"Coefficient (slope): {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
# Equation: price = 200 * sq_ft + 0
2. Logistic Regression
Despite its name, logistic regression is used for classification: it predicts the probability that an input belongs to a given class.
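A minimal sketch with synthetic data (the dataset, features, and parameters below are illustrative, not taken from a real spam corpus):
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Synthetic binary classification data (e.g., spam vs. not spam)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.2%}")
print("P(class=1) for first test sample:", model.predict_proba(X_test[:1])[0, 1])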
3. Decision Trees
Tree-based model that makes decisions through a series of if-else questions.
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt
# Example: Should we play tennis?
# Features: [outlook, temperature, humidity, windy]
# Encoded: outlook(sunny=0,overcast=1,rainy=2), temp(hot=0,mild=1,cool=2)
X = [[0, 0, 1, 0], [0, 0, 1, 1], [1, 0, 1, 0], [2, 1, 1, 0]]
y = [0, 0, 1, 1] # 0=no, 1=yes
# Train
model = DecisionTreeClassifier(max_depth=3)
model.fit(X, y)
# Visualize tree structure
# tree.plot_tree(model, feature_names=['outlook', 'temp', 'humidity', 'windy'])
# Feature importance
print("Feature importances:", model.feature_importances_)
4. Random Forest
An ensemble of decision trees; usually more accurate and less prone to overfitting than a single tree.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# Generate sample data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train Random Forest with 100 trees
model = RandomForestClassifier(
    n_estimators=100,       # number of trees
    max_depth=10,           # max tree depth
    min_samples_split=5,    # min samples required to split a node
    random_state=42
)
model.fit(X_train, y_train)
# Evaluate
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2%}")
# Feature importance
importances = model.feature_importances_
top_features = sorted(enumerate(importances), key=lambda x: x[1], reverse=True)[:5]
print("Top 5 important features:", top_features)
5. Gradient Boosting (XGBoost)
Powerful ensemble method that builds trees sequentially, each correcting previous errors.
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
# Train XGBoost
model = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=42
)
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Detailed evaluation
print(classification_report(y_test, y_pred))
# Often wins Kaggle competitions!
Choosing K: The Elbow Method
from sklearn.cluster import KMeans
# Try different K values and record the inertia for each
inertias = []
K_range = range(1, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)  # sum of squared distances to the nearest centroid
# Plot elbow curve
plt.plot(K_range, inertias, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# Look for "elbow" where inertia decrease slows
Dimensionality Reduction
Principal Component Analysis (PCA)
Reduces high-dimensional data to fewer dimensions while preserving as much of the variance as possible.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# High-dimensional data (e.g., 20 features)
X, y = make_classification(n_samples=1000, n_features=20)
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Reduce to 2 components for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("Original shape:", X.shape) # (1000, 20)
print("Reduced shape:", X_pca.shape) # (1000, 2)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance explained:", sum(pca.explained_variance_ratio_))
# Visualize in 2D
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.5)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()
Deep RL: Modern RL uses deep neural networks (Deep Q-Networks, Policy Gradients, Actor-Critic) for complex environments with high-dimensional state spaces. Examples: AlphaGo, OpenAI Five (Dota 2), autonomous driving.
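At their core, these methods extend tabular Q-learning with neural-network function approximation. A minimal tabular sketch on a toy chain environment (the environment, rewards, and hyperparameters are made up for illustration):
import numpy as np
# Toy chain: states 0-4, action 0 = left, 1 = right; reward 1 for reaching state 4
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)
for episode in range(300):
    s = 0
    while s != 4:
        # Epsilon-greedy action selection
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == 4 else 0.0
        # Q-learning update: move Q[s, a] toward the bootstrapped target
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
print(Q)  # After training, action 1 (right) should have the higher value in every state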
Step 2: Train-Test Split
from sklearn.model_selection import train_test_split
# Separate features and target (df is the DataFrame prepared in the previous step)
X = df.drop('target', axis=1)
y = df['target']
# Split 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
Step 3: Feature Scaling
from sklearn.preprocessing import StandardScaler
# Standardize features (mean=0, std=1)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # use same scaler!
# Note: Some algorithms (tree-based) don't need scaling
Step 4: Model Training
from sklearn.ensemble import RandomForestClassifier
# Initialize model
model = RandomForestClassifier(n_estimators=100, random_state=42)
# Train
model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)
Cross-Validation
from sklearn.model_selection import cross_val_score, StratifiedKFold
# 5-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='f1_weighted')
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
# More reliable than single train-test split!
8. Hyperparameter Tuning
Grid Search
from sklearn.model_selection import GridSearchCV
# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10]
}
# Grid search with cross-validation
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1  # use all CPU cores
)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)
# Use best model
best_model = grid_search.best_estimator_
Random Search (Faster Alternative)
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
# Define distributions
param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(5, 50),
    'min_samples_split': randint(2, 20)
}
# Random search - tries random combinations
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=50,  # try 50 random combinations
    cv=5,
    random_state=42,
    n_jobs=-1
)
random_search.fit(X_train, y_train)
9. Model Deployment
Save and Load Models
import pickle
import joblib
# Method 1: Pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
# Method 2: Joblib (better for large models)
joblib.dump(model, 'model.joblib')
loaded_model = joblib.load('model.joblib')
Deploy as REST API
# app.py
from flask import Flask, request, jsonify
import joblib
import numpy as np
app = Flask(__name__)
model = joblib.load('model.joblib')
scaler = joblib.load('scaler.joblib')
@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    features = np.array(data['features']).reshape(1, -1)
    features_scaled = scaler.transform(features)
    prediction = model.predict(features_scaled)[0]
    probability = model.predict_proba(features_scaled)[0].tolist()
    return jsonify({
        'prediction': int(prediction),
        'probability': probability
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
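Once the server is running (locally on port 5000 in this sketch), the endpoint can be called from any HTTP client; the feature vector below is hypothetical and must match the features the model was trained on:
import requests
# Hypothetical feature vector; its length must match the model's training features
payload = {'features': [5.1, 3.5, 1.4, 0.2]}
response = requests.post('http://localhost:5000/predict', json=payload)
print(response.json())  # e.g., {'prediction': 0, 'probability': [...]}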
10. Common Challenges
Overfitting: Model memorizes training data instead of learning patterns.
Signs: High training accuracy, low test accuracy
Solutions: More data, simpler model, regularization, cross-validation, dropout
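One quick way to spot overfitting is to compare training and test scores; a minimal sketch on synthetic data (the dataset and depth values are illustrative):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
for depth in [None, 3]:  # unconstrained (overfitting-prone) tree vs. shallower (regularized) tree
    model = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    print(f"max_depth={depth}: train={model.score(X_train, y_train):.2f}, test={model.score(X_test, y_test):.2f}")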
Underfitting: Model too simple to capture patterns.
Signs: Low training and test accuracy
Solutions: More complex model, better features, less regularization
Imbalanced Classes: One class much more frequent than others.
Solutions: Resampling (SMOTE), class weights, different metrics (F1, ROC-AUC)
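A minimal sketch of the class-weight approach on synthetic imbalanced data (the 95/5 split and model choice are illustrative):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
# Roughly 95% negatives, 5% positives
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
for weights in [None, 'balanced']:
    model = LogisticRegression(max_iter=1000, class_weight=weights).fit(X_train, y_train)
    print(f"class_weight={weights}: F1 = {f1_score(y_test, model.predict(X_test)):.2f}")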
Data Leakage: Test information leaks into training.
Prevention: Split before preprocessing, careful feature engineering
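A common safeguard is to put preprocessing inside a Pipeline, so the scaler is fit only on the training portion of each fold; a minimal sketch with synthetic data (the steps and model are illustrative):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
# Scaling is fit inside each CV training fold, so no test-fold information leaks in
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000))
])
print(f"CV accuracy: {cross_val_score(pipeline, X, y, cv=5).mean():.3f}")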
11. Real-World Applications
Healthcare
Disease diagnosis from medical images
Drug discovery and development
Patient readmission prediction
Personalized treatment plans
Finance
Fraud detection
Credit scoring
Algorithmic trading
Risk assessment
E-Commerce
Product recommendations
Price optimization
Customer churn prediction
Demand forecasting
Transportation
Autonomous vehicles
Route optimization
Demand prediction (Uber, Lyft)
Traffic flow prediction
Manufacturing
Predictive maintenance
Quality control
Supply chain optimization
Defect detection
Marketing
Customer segmentation
Sentiment analysis
Ad targeting
Conversion prediction
12. Resources and Next Steps
Essential Libraries
Scikit-learn: General ML algorithms
TensorFlow/Keras: Deep learning
PyTorch: Deep learning research
XGBoost: Gradient boosting
LightGBM: Fast gradient boosting
Learning Path
Beginner to Advanced:
Learn Python and NumPy/Pandas basics
Understand linear regression and logistic regression