Python for Data Science
Transform Data into Insights
Welcome to Data Science with Python!
Python is the #1 language for data science and machine learning. With powerful libraries like NumPy, Pandas, and Scikit-learn, you can analyze data, create visualizations, and build predictive models.
If you know Python basics, you're ready to dive into data science!
Your Data Science Mastery Journey - Complete Learning Roadmap
Welcome to the most comprehensive Python Data Science learning experience available! This guide provides an exhaustive, hands-on journey through data science, designed to take you from beginner to professional-level competence. Each section is crafted with meticulous detail, featuring 50+ practical exercises, real-world projects, advanced techniques, and complete working examples.
Our approach goes beyond basic tutorials. We explore the entire data science pipeline, statistical foundations, machine learning algorithms, big data tools, and production deployment. Whether you're analyzing your first dataset or building enterprise ML systems, this guide will equip you with the expertise to excel in data science.
Comprehensive Learning Structure (18+ Major Sections)
- Introduction & Foundation (200+ learning concepts) - Data science philosophy, career paths, industry overview
- Python Data Science Environment - Anaconda, Jupyter, virtual environments
- NumPy Fundamentals - Arrays, operations, broadcasting, performance
- Pandas Data Manipulation - DataFrames, Series, cleaning, transformation
- Data Cleaning & Preprocessing - Missing values, outliers, feature engineering
- Exploratory Data Analysis - Statistics, visualization, insights discovery
- Statistical Analysis - Hypothesis testing, distributions, inference
- Data Visualization - Matplotlib, Seaborn, Plotly, dashboards
- Machine Learning Fundamentals - Supervised, unsupervised, evaluation
- Scikit-learn Mastery - Algorithms, pipelines, hyperparameter tuning
- Deep Learning Introduction - Neural networks, TensorFlow, PyTorch
- Natural Language Processing - Text analysis, sentiment, transformers
- Computer Vision - Image processing, CNNs, object detection
- Time Series Analysis - Forecasting, ARIMA, Prophet
- Big Data Tools - Spark, Dask, cloud computing
- Model Deployment - Flask APIs, Docker, cloud platforms
- MLOps & Production - Monitoring, A/B testing, ethics
- Real-World Projects - Complete applications with best practices
Learning Objectives
By the end of this comprehensive guide, you will be able to:
- Master NumPy for efficient numerical computing
- Manipulate and analyze data with Pandas
- Create stunning visualizations with Matplotlib & Seaborn
- Build and evaluate machine learning models
- Perform statistical analysis and hypothesis testing
- Clean and preprocess real-world datasets
- Deploy ML models to production environments
- Work with big data using Spark and cloud tools
- Apply data science to solve business problems
- Build a professional portfolio of projects
Learning Intensity Scale
Each major section contains extensive content including:
- Deep theoretical explanations with mathematical foundations
- 100+ hands-on exercises with complete solutions
- Real-world case studies from industry applications
- Performance optimization and best practices
- Statistical rigor and mathematical proofs
- Code reviews and debugging strategies
- Integration with modern tools and platforms
- Business applications and ROI analysis
Career Advancement
Data Science Career Path:
- Junior Data Analyst - $50,000 - $70,000/year
- Data Analyst - $65,000 - $90,000/year
- Data Scientist - $90,000 - $130,000/year
- Senior Data Scientist - $120,000 - $170,000/year
- Principal Data Scientist - $140,000 - $200,000/year
- Chief Data Officer - $180,000 - $300,000+/year
*Salaries based on US averages from major tech job boards (2024)
Course Progress Tracker
Track your data science learning journey! Complete exercises and sections to earn points and unlock achievements.
Course Completion Goals:
- 30 exercises - Interactive coding challenges
- 18 sections - Comprehensive learning modules
- 8 projects - Real-world application builds
- 150+ code examples - Practical implementations
Total Points Available: 350+
What is Data Science?
Data Science is the field of extracting knowledge and insights from data using scientific methods, algorithms, and systems. It combines statistics, mathematics, programming, and domain expertise to solve real-world problems.
Why Python for Data Science?
- Rich Ecosystem: Hundreds of specialized libraries
- Easy to Learn: Simple, readable syntax
- Scientific Computing: NumPy, SciPy for calculations
- Data Manipulation: Pandas for data analysis
- Visualization: Matplotlib, Seaborn for charts
- Machine Learning: Scikit-learn, TensorFlow, PyTorch
- Huge Community: Extensive support and resources
Data Science Workflow
The Data Science Process
1. Problem Definition: What question are we trying to answer?
2. Data Collection: Gather data from various sources (CSV files, databases, APIs, web scraping)
3. Data Cleaning: Handle missing values, remove duplicates, fix errors, standardize formats
4. Exploratory Data Analysis (EDA): Understand data patterns and distributions; create visualizations, find correlations
5. Feature Engineering: Create new features, transform data, select relevant features
6. Model Building: Choose an algorithm, train the model, tune hyperparameters
7. Model Evaluation: Test accuracy, validate results, compare different models
8. Deployment & Monitoring: Deploy the model, monitor performance, update as needed
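To make the workflow concrete, here is a minimal sketch of steps 2-7 with scikit-learn. It assumes a hypothetical data.csv containing only numeric feature columns plus a column named target; treat it as an outline, not a finished pipeline.
# A minimal sketch of the workflow (steps 2-7); 'data.csv' and 'target' are placeholders
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
df = pd.read_csv('data.csv')                        # 2. Data collection
df = df.dropna().drop_duplicates()                  # 3. Data cleaning
print(df.describe())                                # 4. Exploratory data analysis
X, y = df.drop('target', axis=1), df['target']      # 5. Feature selection (simplified)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)    # 6. Model building
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))  # 7. Model evaluation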
Essential Libraries
NumPy
Numerical computing with arrays
- Fast array operations
- Mathematical functions
- Linear algebra
Pandas
Data manipulation and analysis
- DataFrames for tabular data
- Data cleaning tools
- Time series analysis
Matplotlib
Basic plotting and visualization
- Line, bar, scatter plots
- Histograms, pie charts
- Customizable figures
Seaborn
Statistical data visualization
- Beautiful default styles
- Statistical plots
- Built on Matplotlib
Scikit-learn
Machine learning algorithms
- Classification, regression
- Clustering, dimensionality reduction
- Model evaluation tools
SciPy
Scientific computing
- Statistical functions
- Optimization algorithms
- Signal processing
Installation & Setup
Install Essential Libraries
# Install core data science libraries
pip install numpy pandas matplotlib seaborn scikit-learn scipy
# Or install Anaconda (includes everything)
# Download from: https://www.anaconda.com/download
# Verify installations
python -c "import numpy; print(numpy.__version__)"
python -c "import pandas; print(pandas.__version__)"
python -c "import sklearn; print(sklearn.__version__)"
Jupyter Notebook (Recommended)
# Install Jupyter Notebook
pip install jupyter
# Start Jupyter Notebook
jupyter notebook
# This opens a browser with interactive notebooks
# Perfect for data science work!
Why Use Jupyter Notebooks?
- Interactive: Run code cell by cell
- Visualizations: See plots inline
- Documentation: Mix code with markdown text
- Experimentation: Try different approaches easily
- Sharing: Share notebooks with others
NumPy - Numerical Computing
What is NumPy?
NumPy (Numerical Python) is like Excel on steroids for programmers! Instead of working with single numbers, NumPy lets you work with entire collections of numbers at once. Think of it as a powerful calculator that can handle thousands of calculations simultaneously.
Why do we need NumPy? Regular Python lists are slow for math operations. NumPy arrays are much faster and use less memory. It's the foundation for almost all data science and machine learning in Python.
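To see the speed difference for yourself, here is a small sketch comparing a plain Python loop with the equivalent vectorized NumPy operation; exact timings will vary by machine.
# Rough speed comparison: Python list vs. NumPy array (timings vary by machine)
import time
import numpy as np
numbers = list(range(1_000_000))
array = np.arange(1_000_000)
start = time.perf_counter()
squares_list = [n ** 2 for n in numbers]      # loop over every element in Python
print("Python list:", time.perf_counter() - start, "seconds")
start = time.perf_counter()
squares_array = array ** 2                    # one vectorized operation done in compiled code
print("NumPy array:", time.perf_counter() - start, "seconds")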
Creating Arrays
Building Your First Arrays
Arrays are like containers that hold multiple numbers. Think of them as:
- 1D Array: A single row of numbers (like a list)
- 2D Array: A table or spreadsheet (rows and columns)
- 3D+ Arrays: Like stacking multiple tables on top of each other
import numpy as np
# Create arrays from regular Python lists
# This is like converting a shopping list into a super-powered number container
arr1d = np.array([1, 2, 3, 4, 5]) # One row of numbers
arr2d = np.array([[1, 2, 3], # Two rows, three columns
[4, 5, 6]]) # Like a tiny spreadsheet!
print("1D Array (one row):", arr1d)
print("2D Array (table):\n", arr2d)
print("Shape (rows, columns):", arr2d.shape) # (2, 3) means 2 rows, 3 columns
print("Total elements:", arr2d.size) # 6 numbers total
print("Data type:", arr2d.dtype) # What kind of numbers are stored
Common Beginner Mistakes:
- Forgetting import numpy as np at the top
- Using regular Python lists when you need fast math
- Mixing different data types in one array (NumPy prefers one type)
Special Arrays (Ready-Made Templates)
# Pre-built arrays for common needs
# Array filled with zeros (like an empty scorecard)
zeros = np.zeros((3, 4)) # 3 rows, 4 columns, all zeros
print("Zeros array:\n", zeros)
# Array filled with ones (like a scorecard where everyone gets 1 point)
ones = np.ones((2, 3)) # 2 rows, 3 columns, all ones
print("Ones array:\n", ones)
# Identity matrix (special diagonal array used in math)
identity = np.eye(3) # 3x3 with 1s on diagonal, 0s elsewhere
print("Identity matrix:\n", identity)
# Random numbers (like rolling dice)
random = np.random.rand(2, 3) # 2x3 array with random numbers between 0-1
print("Random array:\n", random)
Creating Number Sequences
# Creating sequences of numbers
# Range with step size (like counting by 2s)
range_arr = np.arange(0, 10, 2) # Start at 0, stop before 10, step by 2
print("Range array:", range_arr) # [0, 2, 4, 6, 8]
# Evenly spaced numbers (like dividing a pizza into equal slices)
linspace = np.linspace(0, 1, 5) # Start at 0, end at 1, 5 equal pieces
print("Linspace array:", linspace) # [0.0, 0.25, 0.5, 0.75, 1.0]
# Real-world example: Temperature readings every 2 hours
hours = np.arange(0, 24, 2) # 0, 2, 4, ..., 22 hours
print("Measurement times:", hours)
Pro Tips for Beginners:
- arange() is like a for loop that creates the numbers
- linspace() guarantees you get exactly the number of points you ask for
- Always check your array's shape to understand its structure
- Use print() often to see what your arrays look like
Array Operations
Doing Math with Arrays
NumPy makes math operations super easy! Instead of looping through each number, you can do math on entire arrays at once. This is called "vectorized operations" and it's what makes NumPy fast.
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
# Mathematical operations (element-wise)
# This adds 10 to EACH number in the array automatically!
print("Add 10 to each number:", arr + 10) # [11, 12, 13, 14, 15]
print("Multiply each by 2:", arr * 2) # [2, 4, 6, 8, 10]
print("Square each number:", arr ** 2) # [1, 4, 9, 16, 25]
print("Square root of each:", np.sqrt(arr)) # [1., 1.41, 1.73, 2., 2.24]
# Statistical operations (summary of the whole array)
print("Total sum:", arr.sum()) # 15 (1+2+3+4+5)
print("Average (mean):", arr.mean()) # 3.0 (15รท5)
print("Standard deviation:", arr.std()) # How spread out the numbers are
print("Smallest number:", arr.min()) # 1
print("Largest number:", arr.max()) # 5
# Operations between arrays
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print("Add corresponding elements:", arr1 + arr2) # [5, 7, 9] (1+4, 2+5, 3+6)
print("Dot product (advanced math):", np.dot(arr1, arr2)) # 32 (1*4 + 2*5 + 3*6)
Why This Matters:
- Speed: Array operations are 10-100x faster than Python loops
- Simplicity: arr * 2 instead of writing a loop
- Memory efficient: Uses less computer memory
- Real-world use: Image processing, scientific calculations, data analysis
Important Rules:
- Arrays must have the same shape (or broadcast-compatible shapes) for element-wise operations
- Single numbers work with whole arrays (this is called broadcasting; see the sketch below)
- Operations create new arrays; they don't modify the original
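Here is a short sketch of those rules in action: same-shape arrays combine element by element, a single number is broadcast across the whole array, and incompatible shapes raise an error.
import numpy as np
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
print(a + b)          # [11 22 33] - same shape, element-wise
print(a * 5)          # [ 5 10 15] - the single number 5 is broadcast to every element
matrix = np.ones((2, 3))
print(matrix + a)     # a (shape (3,)) is broadcast across both rows of the (2, 3) matrix
try:
    a + np.array([1, 2])   # shapes (3,) and (2,) are incompatible
except ValueError as err:
    print("Shapes don't match:", err)
doubled = a * 2
print(a)              # [1 2 3] - the original array is unchanged; a new array was created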
Array Indexing & Slicing
Accessing Your Data Like a Pro
Indexing is like pointing to specific items in your array. Think of it like finding a book on a shelf - you need to know the row and column!
1D Array Indexing (Like a Shopping List)
import numpy as np
# Think of this as a shopping list
arr = np.array([10, 20, 30, 40, 50]) # 5 items on our list
# Get specific items by position (starting from 0)
print("First item (position 0):", arr[0]) # 10 - like "item 1"
print("Last item (position -1):", arr[-1]) # 50 - like "last item"
print("Third item (position 2):", arr[2]) # 30
# Get multiple items at once (slicing)
print("Items 2-4 (positions 1-3):", arr[1:4]) # [20, 30, 40] - like "items 2 through 4"
print("First 3 items:", arr[:3]) # [10, 20, 30] - like "first 3 items"
print("Last 2 items:", arr[-2:]) # [40, 50] - like "last 2 items"
2D Array Indexing (Like a Chess Board)
import numpy as np
# Think of this as a 3x3 chess board
arr2d = np.array([[1, 2, 3], # Row 0: [1, 2, 3]
[4, 5, 6], # Row 1: [4, 5, 6]
[7, 8, 9]]) # Row 2: [7, 8, 9]
# Get specific square: [row, column]
print("Top-left corner [0,0]:", arr2d[0, 0]) # 1 - row 0, column 0
print("Center square [1,1]:", arr2d[1, 1]) # 5 - row 1, column 1
print("Bottom-right [2,2]:", arr2d[2, 2]) # 9 - row 2, column 2
# Get entire rows or columns
print("Entire first row [0,:]:", arr2d[0, :]) # [1, 2, 3] - row 0, all columns
print("Entire second column [:,1]:", arr2d[:, 1]) # [2, 5, 8] - all rows, column 1
# Get parts of rows/columns
print("First 2 items of row 1:", arr2d[1, :2]) # [4, 5] - row 1, columns 0-1
print("Rows 1-2, column 2:", arr2d[1:3, 2]) # [6, 9] - rows 1-2, column 2
Smart Filtering (Boolean Indexing)
import numpy as np
# Like asking "which items meet my criteria?"
arr = np.array([1, 2, 3, 4, 5])
# Create a "mask" - true/false for each item
mask = arr > 3 # Which numbers are greater than 3?
print("Mask (True/False):", mask) # [False, False, False, True, True]
print("Values that pass test:", arr[mask]) # [4, 5] - only the True ones
# Real-world example: Filter test scores
scores = np.array([85, 92, 78, 96, 88])
passed = scores >= 90 # Must be 90 or higher to pass
print("Who passed?", passed) # [False, True, False, True, False]
print("Passing scores:", scores[passed]) # [92, 96]
# Multiple conditions (like "good AND cheap")
prices = np.array([10, 25, 5, 15, 8])
qualities = np.array([8, 9, 6, 7, 9]) # Quality rating 1-10
# Find items that are cheap (โค$15) AND high quality (โฅ8)
good_deals = (prices <= 15) & (qualities >= 8)
print("Good deals?", good_deals) # [True, False, False, False, True]
print("Good deal prices:", prices[good_deals]) # [10, 8]
Indexing Cheat Sheet:
- arr[0] = First item
- arr[-1] = Last item
- arr[1:4] = Items at indices 1-3 (stops before index 4)
- arr[:3] = First 3 items
- arr[2:] = From index 2 to the end
- arr2d[0, 1] = Row 0, Column 1
- arr2d[:, 0] = Entire first column
- arr[mask] = Only items where mask is True
Common Beginner Pitfalls:
- Arrays start at index 0, not 1
- Slicing [1:4] gives the elements at indices 1, 2, and 3 (it stops before 4)
- Forgetting the comma in 2D indexing: arr2d[0, 1], not arr2d[0 1]
- Boolean indexing returns only the matching values, not their positions (np.where gives positions; see the sketch below)
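If you need the positions of the matching items rather than the values, np.where is the usual tool; a quick sketch:
import numpy as np
scores = np.array([85, 92, 78, 96, 88])
mask = scores >= 90
print(scores[mask])        # [92 96] - the matching values
print(np.where(mask))      # (array([1, 3]),) - the positions (indices) of those values
print(np.where(mask)[0])   # [1 3] - just the index array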
Pandas - Data Manipulation
What is Pandas?
Pandas is like Excel for programmers! It gives you superpowers to work with data in tables (called DataFrames). Think of it as a Swiss Army knife for data - you can clean, filter, analyze, and transform data with just a few lines of code.
Why Pandas? Real-world data is messy. Pandas helps you handle missing values, combine datasets, and perform complex operations that would take hours in Excel. It's the #1 tool for data scientists worldwide.
Key Pandas Concepts:
- DataFrame: A table with rows and columns (like an Excel spreadsheet)
- Series: A single column of data (like one column from a spreadsheet)
- Index: Row labels (usually numbers starting from 0)
- Columns: Column names (like "Name", "Age", "Salary")
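A quick sketch of how these pieces relate: selecting one column from a DataFrame gives you a Series, and both share the same index.
import pandas as pd
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
ages = df['Age']              # selecting one column returns a Series
print(type(df))               # <class 'pandas.core.frame.DataFrame'>
print(type(ages))             # <class 'pandas.core.series.Series'>
print(df.index.tolist())      # [0, 1] - the row labels (index)
print(df.columns.tolist())    # ['Name', 'Age'] - the column names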
Creating DataFrames
Building Your First DataFrame
DataFrames are like spreadsheets in Python. You can create them from dictionaries, CSV files, databases, or even from scratch!
import pandas as pd
# Method 1: From a Python dictionary (easiest for beginners)
# Think of this like filling out a form with multiple people
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'], # Column 1: Names
'Age': [25, 30, 35, 28], # Column 2: Ages
'City': ['New York', 'London', 'Paris', 'Tokyo'], # Column 3: Cities
'Salary': [70000, 80000, 75000, 85000] # Column 4: Salaries
}
# Create the DataFrame (like creating the spreadsheet)
df = pd.DataFrame(data)
print("Our employee DataFrame:")
print(df)
print("\nDataFrame shape:", df.shape) # (rows, columns)
print("Column names:", list(df.columns))
# Method 2: Reading from files (most common in real life)
# df = pd.read_csv('employees.csv') # From CSV file
# df = pd.read_excel('employees.xlsx') # From Excel file
# df = pd.read_json('employees.json') # From JSON file
# Method 3: From a database (common in larger companies)
# import sqlite3
# conn = sqlite3.connect('company.db')
# df = pd.read_sql_query("SELECT * FROM employees", conn)
DataFrame Anatomy:
- Rows: Each row represents one record (like one employee)
- Columns: Each column represents one type of information
- Index: The row numbers on the left (0, 1, 2, 3...)
- Headers: The column names at the top
Common Beginner Mistakes:
- Forgetting import pandas as pd at the top
- Giving the lists in the dictionary different lengths (every column must have the same number of values; see the quick sketch below)
- Forgetting that strings need quotes (single ' or double " both work)
- Not checking file paths when reading CSV files
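A quick sketch of the "same length" rule in practice: lists of different lengths raise a ValueError when the DataFrame is built.
import pandas as pd
try:
    pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25]})   # lists have different lengths
except ValueError as err:
    print("DataFrame creation failed:", err)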
Exploring Data
Getting to Know Your Data
Before you start analyzing, you need to understand what you're working with! This is like meeting someone new - you want to know their background, personality, and any quirks they might have.
Basic Data Inspection (First Look)
import pandas as pd
# Load your data (like opening a file in Excel)
df = pd.read_csv('sales_data.csv')
# Quick peeks at your data
print("First 5 rows (getting a feel for the data):")
print(df.head()) # Like looking at the first page of a book
print("\nLast 5 rows (checking the end):")
print(df.tail()) # Like checking the last page
print(f"\nDataset size: {df.shape}") # (rows, columns) - like counting pages and chapters
print(f"Total rows: {df.shape[0]}") # Number of records
print(f"Total columns: {df.shape[1]}") # Number of data types
# Detailed information about each column
print("\nColumn details:")
df.info() # Like a nutrition label - tells you what's in each column (info() prints its report directly, no print() needed)
Statistical Summary (The Big Picture)
# Get a statistical overview (like a report card for your data)
print("Statistical summary of numeric columns:")
print(df.describe()) # Mean, min, max, etc. for all numeric columns
# This tells you:
# - count: How many non-empty values
# - mean: Average value
# - std: How spread out the values are
# - min/max: Smallest and largest values
# - 25%/50%/75%: Quartiles (where 25%, 50%, 75% of data falls)
Column-Level Exploration
# Exploring individual columns
# What columns do we have?
print("All column names:")
print(df.columns.tolist()) # List of all column names
# Working with specific columns
print(f"\nAverage age: {df['Age'].mean():.1f} years")
print(f"Youngest person: {df['Age'].min()} years old")
print(f"Oldest person: {df['Age'].max()} years old")
# Exploring categorical data (text columns)
print(f"\nCities in our data: {df['City'].unique()}")
print("\nHow many people from each city?")
print(df['City'].value_counts()) # Count of each unique value
# Real-world example: Salary analysis
print("\nSalary Statistics:")
print(f"Average salary: ${df['Salary'].mean():,.0f}")
print(f"Median salary: ${df['Salary'].median():,.0f}")
print(f"Highest salary: ${df['Salary'].max():,.0f}")
print(f"Lowest salary: ${df['Salary'].min():,.0f}")
Checking for Missing Data (Data Quality Check)
# Finding missing or incomplete data (very important!)
print("Missing values per column:")
print(df.isnull().sum()) # Count of missing values in each column
print("\nAre there any missing values at all?")
print(df.isna().any()) # True/False for each column
print(f"\nTotal missing values in entire dataset: {df.isnull().sum().sum()}")
# Percentage of missing data
missing_percent = (df.isnull().sum() / len(df)) * 100
print("\nPercentage of missing data per column:")
print(missing_percent.round(2).astype(str) + '%')
Why This Matters:
- Data Quality: Missing data can ruin your analysis
- Understanding Scale: Know if you're dealing with millions or dozens of records
- Data Types: Numbers, text, dates all behave differently
- Outliers: Unusual values that might be errors or important insights
- Distribution: Is your data normal, skewed, or something else?
Red Flags to Watch For:
- Too many missing values in important columns
- Unexpected data types (numbers stored as text)
- Impossible values (negative ages, salaries over $1 billion)
- Only one unique value in a column (useless for analysis)
- Dates that aren't recognized as dates
Pro Tip: Always explore before you analyze!
Think of data exploration like a doctor examining a patient:
- Vitals check: df.shape, df.info()
- Overall health: df.describe()
- Specific concerns: check for missing values and outliers
- Diagnosis: understand patterns and relationships
A small helper that bundles these checks is sketched below.
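If you find yourself repeating these checks, you can wrap them in a small helper function; this is a minimal sketch, and the name quick_checkup is just illustrative.
import pandas as pd
def quick_checkup(df: pd.DataFrame) -> None:
    """Print a quick 'health check' of a DataFrame: shape, dtypes, summary stats, missing values."""
    print("Shape (rows, columns):", df.shape)          # vitals check
    print("\nColumn types:")
    print(df.dtypes)
    print("\nStatistical summary:")
    print(df.describe())                                # overall health
    print("\nMissing values per column:")
    print(df.isnull().sum())                            # specific concerns
# Usage (assumes you already loaded a DataFrame):
# df = pd.read_csv('sales_data.csv')
# quick_checkup(df)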
Data Cleaning
import pandas as pd
import numpy as np
df = pd.read_csv('messy_data.csv')
# Handle missing values
df_cleaned = df.dropna() # Remove rows with any null values
df_filled = df.fillna(0) # Fill nulls with 0
df_filled = df.fillna(df.mean(numeric_only=True)) # Fill nulls with each numeric column's mean
# Remove duplicates
df_unique = df.drop_duplicates()
# Rename columns
df_renamed = df.rename(columns={'old_name': 'new_name'})
# Change data types
df['Age'] = df['Age'].astype(int)
df['Date'] = pd.to_datetime(df['Date'])
# Remove outliers (example: remove values > 3 std devs)
z_scores = np.abs((df['Salary'] - df['Salary'].mean()) / df['Salary'].std())
df_no_outliers = df[z_scores < 3]
# Replace values
df['City'] = df['City'].replace('NYC', 'New York')
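As an alternative to the z-score filter above, many analysts use the interquartile range (IQR) rule instead; here is a minimal sketch on the same Salary column from messy_data.csv, keeping values within 1.5 * IQR of the quartiles.
# Alternative outlier filter using the IQR rule
import pandas as pd
df = pd.read_csv('messy_data.csv')   # same file as the example above
q1 = df['Salary'].quantile(0.25)
q3 = df['Salary'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df_no_outliers_iqr = df[(df['Salary'] >= lower) & (df['Salary'] <= upper)]
print(f"Kept {len(df_no_outliers_iqr)} of {len(df)} rows")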
Data Manipulation
import pandas as pd
df = pd.read_csv('sales_data.csv')
# Filtering
young_people = df[df['Age'] < 30]
high_earners = df[df['Salary'] > 75000]
combined = df[(df['Age'] < 30) & (df['Salary'] > 70000)]
# Sorting
df_sorted = df.sort_values('Salary', ascending=False)
df_multi_sort = df.sort_values(['City', 'Age'])
# Grouping and aggregation
city_avg = df.groupby('City')['Salary'].mean()
city_stats = df.groupby('City').agg({
'Salary': ['mean', 'min', 'max'],
'Age': 'mean'
})
# Adding new columns
df['Salary_K'] = df['Salary'] / 1000
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 30, 40, 100],
labels=['Young', 'Middle', 'Senior'])
# Merging DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Score': [90, 85, 95]})
merged = pd.merge(df1, df2, on='ID', how='inner')
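The how= argument controls which rows survive the merge; a short sketch using the same df1 and df2 defined just above (ID 3 exists only in df1, ID 4 only in df2).
# how='inner' keeps only IDs present in both -> IDs 1, 2
# how='left' keeps every row of df1 -> IDs 1, 2, 3 (Score is NaN for ID 3)
# how='outer' keeps everything -> IDs 1, 2, 3, 4 (missing values filled with NaN)
left_join = pd.merge(df1, df2, on='ID', how='left')
outer_join = pd.merge(df1, df2, on='ID', how='outer')
print(left_join)
print(outer_join)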
Data Visualization
Matplotlib Basics
import matplotlib.pyplot as plt
import numpy as np
# Line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.figure(figsize=(10, 6))
plt.plot(x, y, label='sin(x)', color='blue', linewidth=2)
plt.plot(x, np.cos(x), label='cos(x)', color='red', linestyle='--')
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.title('Trigonometric Functions')
plt.legend()
plt.grid(True)
plt.show()
# Scatter plot
x = np.random.rand(50)
y = np.random.rand(50)
colors = np.random.rand(50)
sizes = 1000 * np.random.rand(50)
plt.figure(figsize=(8, 6))
plt.scatter(x, y, c=colors, s=sizes, alpha=0.5, cmap='viridis')
plt.colorbar()
plt.title('Scatter Plot Example')
plt.show()
# Bar chart
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 56, 78]
plt.figure(figsize=(8, 6))
plt.bar(categories, values, color='skyblue')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Chart Example')
plt.show()
# Histogram
data = np.random.randn(1000)
plt.figure(figsize=(8, 6))
plt.hist(data, bins=30, color='green', alpha=0.7, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram Example')
plt.show()
Seaborn Visualizations
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Set style
sns.set_style("whitegrid")
# Load sample dataset
tips = sns.load_dataset('tips')
# Scatter plot with regression line
plt.figure(figsize=(10, 6))
sns.regplot(x='total_bill', y='tip', data=tips)
plt.title('Total Bill vs Tip')
plt.show()
# Box plot
plt.figure(figsize=(10, 6))
sns.boxplot(x='day', y='total_bill', data=tips)
plt.title('Total Bill by Day')
plt.show()
# Violin plot
plt.figure(figsize=(10, 6))
sns.violinplot(x='day', y='total_bill', hue='sex', data=tips)
plt.title('Total Bill Distribution by Day and Gender')
plt.show()
# Heatmap (correlation matrix)
df = pd.read_csv('data.csv')
correlation = df.corr(numeric_only=True)  # correlations between numeric columns only
plt.figure(figsize=(10, 8))
sns.heatmap(correlation, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.show()
# Pair plot
iris = sns.load_dataset('iris')
sns.pairplot(iris, hue='species')
plt.show()
Machine Learning with Scikit-learn
Complete ML Workflow
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# 1. Load data
df = pd.read_csv('data.csv')
# 2. Prepare features and target
X = df.drop('target', axis=1) # Features
y = df['target'] # Target variable
# 3. Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
# 4. Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 5. Train model
model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)
# 6. Make predictions
y_pred = model.predict(X_test_scaled)
# 7. Evaluate model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# 8. Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
Different ML Algorithms
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
# Classification algorithms
models = {
'Logistic Regression': LogisticRegression(),
'Decision Tree': DecisionTreeClassifier(),
'Random Forest': RandomForestClassifier(),
'SVM': SVC(),
'KNN': KNeighborsClassifier(),
'Naive Bayes': GaussianNB()
}
# Train and evaluate each model
results = {}
for name, model in models.items():
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
results[name] = accuracy
print(f"{name}: {accuracy:.4f}")
# Find best model
best_model = max(results, key=results.get)
print(f"\nBest Model: {best_model} ({results[best_model]:.4f})")
Model Evaluation Techniques
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score, roc_auc_score, roc_curve
# Cross-validation
model = RandomForestClassifier()
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
print(f"CV Scores: {cv_scores}")
print(f"Mean CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
# Hyperparameter tuning with Grid Search
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(
RandomForestClassifier(),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1
)
grid_search.fit(X_train_scaled, y_train)
print(f"\nBest Parameters: {grid_search.best_params_}")
print(f"Best Score: {grid_search.best_score_:.4f}")
# Use best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test_scaled)
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.4f}")
# ROC Curve
y_pred_proba = best_model.predict_proba(X_test_scaled)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
auc = roc_auc_score(y_test, y_pred_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
Real-World Example: Customer Churn Prediction
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# 1. Load and explore data
df = pd.read_csv('customer_churn.csv')
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
print("\nData info:")
print(df.info())
print("\nMissing values:")
print(df.isnull().sum())
# 2. Data cleaning
# Handle missing values
df = df.dropna()
# Encode categorical variables
le = LabelEncoder()
categorical_cols = ['gender', 'contract_type', 'payment_method']
for col in categorical_cols:
df[col] = le.fit_transform(df[col])
# 3. Exploratory Data Analysis
# Churn distribution
plt.figure(figsize=(8, 6))
df['churn'].value_counts().plot(kind='bar')
plt.title('Churn Distribution')
plt.xlabel('Churn (0=No, 1=Yes)')
plt.ylabel('Count')
plt.show()
# Correlation heatmap
plt.figure(figsize=(12, 8))
correlation = df.corr()
sns.heatmap(correlation, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Heatmap')
plt.show()
# 4. Feature engineering
X = df.drop('churn', axis=1)
y = df['churn']
# 5. Split and scale data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 6. Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# 7. Evaluate
y_pred = model.predict(X_test_scaled)
print("\nModel Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Feature importance
feature_importance = pd.DataFrame({
'feature': X.columns,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_importance.head(10))
plt.title('Top 10 Most Important Features')
plt.show()
Best Practices
Data Science Best Practices:
- Understand Your Data: Always explore before modeling
- Clean Thoroughly: Handle missing values and outliers
- Feature Engineering: Create meaningful features
- Split Data Properly: Train/validation/test sets
- Scale Features: Normalize or standardize when needed
- Cross-Validate: Don't rely on single train/test split
- Avoid Overfitting: Use regularization, simpler models
- Document Everything: Keep notebooks well-commented
- Version Control: Use Git for code and data
- Reproducibility: Set random seeds, save models
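For the reproducibility point, here is a minimal sketch of setting random seeds and saving/loading a trained model with joblib (which ships with scikit-learn); the file name model.joblib and the toy data are illustrative.
import random
import numpy as np
import joblib
from sklearn.ensemble import RandomForestClassifier
# Set random seeds so results are repeatable
random.seed(42)
np.random.seed(42)
# Train a model with a fixed random_state, then save it to disk
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
joblib.dump(model, 'model.joblib')          # save the fitted model
loaded = joblib.load('model.joblib')        # reload it later for predictions
print(loaded.predict(X[:5]))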
Useful Tools & Libraries
Data Manipulation
- Pandas: DataFrames
- NumPy: Arrays
- Dask: Big data
Visualization
- Matplotlib: Basic plots
- Seaborn: Statistical
- Plotly: Interactive
Machine Learning
- Scikit-learn: Traditional ML
- XGBoost: Gradient boosting
- LightGBM: Fast boosting
Deep Learning
- TensorFlow: Google's framework
- PyTorch: Facebook's framework
- Keras: High-level API
Natural Language
- NLTK: Text processing
- spaCy: Industrial NLP
- Transformers: BERT, GPT
Computer Vision
- OpenCV: Image processing
- Pillow: Image manipulation
- scikit-image: Image algorithms
Learning Path
Beginner to Data Scientist (6-12 months)
Phase 1: Python Fundamentals (1-2 months)
- Python basics: variables, loops, functions
- Data structures: lists, dictionaries, sets
- Object-oriented programming
- File handling and modules
Phase 2: Data Analysis (2-3 months)
- NumPy for numerical computing
- Pandas for data manipulation
- Data cleaning and preprocessing
- Exploratory data analysis
Phase 3: Visualization (1 month)
- Matplotlib basics
- Seaborn for statistical plots
- Creating dashboards
- Storytelling with data
Phase 4: Statistics & Math (1-2 months)
- Descriptive statistics
- Probability distributions
- Hypothesis testing
- Linear algebra basics
Phase 5: Machine Learning (2-3 months)
- Supervised learning algorithms
- Unsupervised learning
- Model evaluation and tuning
- Feature engineering
Phase 6: Advanced Topics (Ongoing)
- Deep learning
- Natural language processing
- Computer vision
- Big data tools (Spark, Hadoop)
Learning Resources
Online Courses
- Coursera: Data Science Specialization
- DataCamp: Python for Data Science
- Kaggle Learn: Free micro-courses
- Fast.ai: Practical Deep Learning
Books
- Python for Data Analysis (McKinney)
- Hands-On Machine Learning (Géron)
- Introduction to Statistical Learning
- Deep Learning (Goodfellow)
Practice Platforms
- Kaggle: Competitions & datasets
- LeetCode: Coding practice
- HackerRank: Data science track
- DataQuest: Interactive learning
Communities
- r/datascience (Reddit)
- Kaggle Forums
- Stack Overflow
- Data Science Discord servers
Career Opportunities
Data Science Career Paths:
Entry Level:
- Data Analyst: $50,000 - $70,000/year
- Junior Data Scientist: $60,000 - $80,000/year
- Business Intelligence Analyst: $55,000 - $75,000/year
Mid Level:
- Data Scientist: $80,000 - $120,000/year
- Machine Learning Engineer: $90,000 - $130,000/year
- Data Engineer: $85,000 - $125,000/year
Senior Level:
- Senior Data Scientist: $120,000 - $160,000/year
- ML Architect: $130,000 - $180,000/year
- Chief Data Officer: $150,000 - $250,000+/year
Extremely high demand for data science professionals across all industries!
Common Data Science Projects
Beginner Projects
- Exploratory Data Analysis
- Sales Data Analysis
- Weather Data Visualization
- Simple Linear Regression
Intermediate Projects
- Customer Segmentation
- House Price Prediction
- Sentiment Analysis
- Recommendation System
Advanced Projects
- Image Classification
- Time Series Forecasting
- Fraud Detection
- Chatbot Development
Portfolio Projects
- End-to-end ML Pipeline
- Kaggle Competition Entry
- Real-world Problem Solution
- Deployed Web Application
Next Steps
Your Data Science Journey:
- Master Python Basics: Complete python-basics.html course
- Learn NumPy & Pandas: Practice data manipulation daily
- Create Visualizations: Make 10+ different chart types
- Study Statistics: Understand the math behind ML
- Build ML Models: Start with simple algorithms
- Work on Projects: Build a portfolio on GitHub
- Join Kaggle: Participate in competitions
- Network: Connect with data scientists online
- Keep Learning: Stay updated with latest techniques
- Apply for Jobs: Start with internships or junior roles
Start Your Data Science Journey!
Transform data into insights and build intelligent systems