Python for Data Science
Transform Data into Insights
Welcome to Data Science with Python!
Python is the #1 language for data science and machine learning. With powerful libraries like NumPy, Pandas, and Scikit-learn, you can analyze data, create visualizations, and build predictive models.
If you know Python basics, you're ready to dive into data science!
Your Data Science Mastery Journey - Complete Learning Roadmap
Welcome to the most comprehensive Python Data Science learning experience available! This guide provides an exhaustive, hands-on journey through data science, designed to take you from beginner to professional-level competence. Each section is crafted with meticulous detail, featuring 50+ practical exercises, real-world projects, advanced techniques, and complete working examples.
Our approach goes beyond basic tutorials. We explore the entire data science pipeline, statistical foundations, machine learning algorithms, big data tools, and production deployment. Whether you're analyzing your first dataset or building enterprise ML systems, this guide will equip you with the expertise to excel in data science.
Comprehensive Learning Structure (18+ Major Sections)
- Introduction & Foundation (200+ learning concepts) - Data science philosophy, career paths, industry overview
- Python Data Science Environment - Anaconda, Jupyter, virtual environments
- NumPy Fundamentals - Arrays, operations, broadcasting, performance
- Pandas Data Manipulation - DataFrames, Series, cleaning, transformation
- Data Cleaning & Preprocessing - Missing values, outliers, feature engineering
- Exploratory Data Analysis - Statistics, visualization, insights discovery
- Statistical Analysis - Hypothesis testing, distributions, inference
- Data Visualization - Matplotlib, Seaborn, Plotly, dashboards
- Machine Learning Fundamentals - Supervised, unsupervised, evaluation
- Scikit-learn Mastery - Algorithms, pipelines, hyperparameter tuning
- Deep Learning Introduction - Neural networks, TensorFlow, PyTorch
- Natural Language Processing - Text analysis, sentiment, transformers
- Computer Vision - Image processing, CNNs, object detection
- Time Series Analysis - Forecasting, ARIMA, Prophet
- Big Data Tools - Spark, Dask, cloud computing
- Model Deployment - Flask APIs, Docker, cloud platforms
- MLOps & Production - Monitoring, A/B testing, ethics
- Real-World Projects - Complete applications with best practices
Learning Objectives
By the end of this comprehensive guide, you will be able to:
- Master NumPy for efficient numerical computing
- Manipulate and analyze data with Pandas
- Create stunning visualizations with Matplotlib & Seaborn
- Build and evaluate machine learning models
- Perform statistical analysis and hypothesis testing
- Clean and preprocess real-world datasets
- Deploy ML models to production environments
- Work with big data using Spark and cloud tools
- Apply data science to solve business problems
- Build a professional portfolio of projects
Learning Intensity Scale
Each major section contains extensive content including:
- Deep theoretical explanations with mathematical foundations
- 100+ hands-on exercises with complete solutions
- Real-world case studies from industry applications
- Performance optimization and best practices
- Statistical rigor and mathematical proofs
- Code reviews and debugging strategies
- Integration with modern tools and platforms
- Business applications and ROI analysis
Career Advancement
Data Science Career Path:
- Junior Data Analyst - $50,000 - $70,000/year
- Data Analyst - $65,000 - $90,000/year
- Data Scientist - $90,000 - $130,000/year
- Senior Data Scientist - $120,000 - $170,000/year
- Principal Data Scientist - $140,000 - $200,000/year
- Chief Data Officer - $180,000 - $300,000+/year
*Salaries based on US averages from major tech job boards (2024)
Course Progress Tracker
Track your data science learning journey! Complete exercises and sections to earn points and unlock achievements.
Course Completion Goals:
- 30 exercises - Interactive coding challenges
- 18 sections - Comprehensive learning modules
- 8 projects - Real-world application builds
- 150+ code examples - Practical implementations
Total Points Available: 350+
What is Data Science?
Data Science is the field of extracting knowledge and insights from data using scientific methods, algorithms, and systems. It combines statistics, mathematics, programming, and domain expertise to solve real-world problems.
Why Python for Data Science?
- Rich Ecosystem: Hundreds of specialized libraries
- Easy to Learn: Simple, readable syntax
- Scientific Computing: NumPy, SciPy for calculations
- Data Manipulation: Pandas for data analysis
- Visualization: Matplotlib, Seaborn for charts
- Machine Learning: Scikit-learn, TensorFlow, PyTorch
- Huge Community: Extensive support and resources
Data Science Workflow
The Data Science Process
1. Problem Definition: What question are we trying to answer?
2. Data Collection: Gather data from various sources (CSV files, databases, APIs, web scraping)
3. Data Cleaning: Handle missing values, remove duplicates, fix errors, standardize formats
4. Exploratory Data Analysis (EDA): Understand data patterns and distributions; create visualizations, find correlations
5. Feature Engineering: Create new features, transform data, select relevant features
6. Model Building: Choose an algorithm, train the model, tune hyperparameters
7. Model Evaluation: Test accuracy, validate results, compare different models
8. Deployment & Monitoring: Deploy the model, monitor performance, update as needed
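To make the workflow concrete, here is a minimal sketch of steps 2-7 with scikit-learn. It assumes a hypothetical data.csv containing only numeric feature columns plus a column named target; treat it as an outline, not a finished pipeline.
# A minimal sketch of the workflow (steps 2-7); 'data.csv' and 'target' are placeholders
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
df = pd.read_csv('data.csv')                        # 2. Data collection
df = df.dropna().drop_duplicates()                  # 3. Data cleaning
print(df.describe())                                # 4. Exploratory data analysis
X, y = df.drop('target', axis=1), df['target']      # 5. Feature selection (simplified)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)    # 6. Model building
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))  # 7. Model evaluation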
Essential Libraries
NumPy
Numerical computing with arrays
- Fast array operations
- Mathematical functions
- Linear algebra
Pandas
Data manipulation and analysis
- DataFrames for tabular data
- Data cleaning tools
- Time series analysis
Matplotlib
Basic plotting and visualization
- Line, bar, scatter plots
- Histograms, pie charts
- Customizable figures
Seaborn
Statistical data visualization
- Beautiful default styles
- Statistical plots
- Built on Matplotlib
Scikit-learn
Machine learning algorithms
- Classification, regression
- Clustering, dimensionality reduction
- Model evaluation tools
SciPy
Scientific computing
- Statistical functions
- Optimization algorithms
- Signal processing
Installation & Setup
Install Essential Libraries
# Install core data science libraries
pip install numpy pandas matplotlib seaborn scikit-learn scipy
# Or install Anaconda (includes everything)
# Download from: https://www.anaconda.com/download
# Verify installations
python -c "import numpy; print(numpy.__version__)"
python -c "import pandas; print(pandas.__version__)"
python -c "import sklearn; print(sklearn.__version__)"
Jupyter Notebook (Recommended)
# Install Jupyter Notebook
pip install jupyter
# Start Jupyter Notebook
jupyter notebook
# This opens a browser with interactive notebooks
# Perfect for data science work!
Why Use Jupyter Notebooks?
- Interactive: Run code cell by cell
- Visualizations: See plots inline
- Documentation: Mix code with markdown text
- Experimentation: Try different approaches easily
- Sharing: Share notebooks with others
NumPy - Numerical Computing
What is NumPy?
NumPy (Numerical Python) is like Excel on steroids for programmers! Instead of working with single numbers, NumPy lets you work with entire collections of numbers at once. Think of it as a powerful calculator that can handle thousands of calculations simultaneously.
Why do we need NumPy? Regular Python lists are slow for math operations. NumPy arrays are much faster and use less memory. It's the foundation for almost all data science and machine learning in Python.
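To see the speed difference for yourself, here is a small sketch comparing a plain Python loop with the equivalent vectorized NumPy operation; exact timings will vary by machine.
# Rough speed comparison: Python list vs. NumPy array (timings vary by machine)
import time
import numpy as np
numbers = list(range(1_000_000))
array = np.arange(1_000_000)
start = time.perf_counter()
squares_list = [n ** 2 for n in numbers]      # loop over every element in Python
print("Python list:", time.perf_counter() - start, "seconds")
start = time.perf_counter()
squares_array = array ** 2                    # one vectorized operation done in compiled code
print("NumPy array:", time.perf_counter() - start, "seconds")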
Creating Arrays
Building Your First Arrays
Arrays are like containers that hold multiple numbers. Think of them as:
- 1D Array: A single row of numbers (like a list)
- 2D Array: A table or spreadsheet (rows and columns)
- 3D+ Arrays: Like stacking multiple tables on top of each other
import numpy as np
# Create arrays from regular Python lists
# This is like converting a shopping list into a super-powered number container
arr1d = np.array([1, 2, 3, 4, 5]) # One row of numbers
arr2d = np.array([[1, 2, 3], # Two rows, three columns
[4, 5, 6]]) # Like a tiny spreadsheet!
print("1D Array (one row):", arr1d)
print("2D Array (table):\n", arr2d)
print("Shape (rows, columns):", arr2d.shape) # (2, 3) means 2 rows, 3 columns
print("Total elements:", arr2d.size) # 6 numbers total
print("Data type:", arr2d.dtype) # What kind of numbers are stored
Common Beginner Mistakes:
- Forgetting import numpy as np at the top
- Using regular Python lists when you need fast math
- Mixing different data types in one array (NumPy prefers one type)
Special Arrays (Ready-Made Templates)
# Pre-built arrays for common needs
# Array filled with zeros (like an empty scorecard)
zeros = np.zeros((3, 4)) # 3 rows, 4 columns, all zeros
print("Zeros array:\n", zeros)
# Array filled with ones (like a scorecard where everyone gets 1 point)
ones = np.ones((2, 3)) # 2 rows, 3 columns, all ones
print("Ones array:\n", ones)
# Identity matrix (special diagonal array used in math)
identity = np.eye(3) # 3x3 with 1s on diagonal, 0s elsewhere
print("Identity matrix:\n", identity)
# Random numbers (like rolling dice)
random = np.random.rand(2, 3) # 2x3 array with random numbers between 0-1
print("Random array:\n", random)
Creating Number Sequences
# Creating sequences of numbers
# Range with step size (like counting by 2s)
range_arr = np.arange(0, 10, 2) # Start at 0, stop before 10, step by 2
print("Range array:", range_arr) # [0, 2, 4, 6, 8]
# Evenly spaced numbers (like dividing a pizza into equal slices)
linspace = np.linspace(0, 1, 5) # Start at 0, end at 1, 5 equal pieces
print("Linspace array:", linspace) # [0.0, 0.25, 0.5, 0.75, 1.0]
# Real-world example: Temperature readings every 2 hours
hours = np.arange(0, 24, 2) # 0, 2, 4, ..., 22 hours
print("Measurement times:", hours)
Pro Tips for Beginners:
- arange() is like a for loop that creates the numbers
- linspace() guarantees you get exactly the number of points you ask for
- Always check your array's shape to understand its structure
- Use print() often to see what your arrays look like
Array Operations
Doing Math with Arrays
NumPy makes math operations super easy! Instead of looping through each number, you can do math on entire arrays at once. This is called "vectorized operations" and it's what makes NumPy fast.
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
# Mathematical operations (element-wise)
# This adds 10 to EACH number in the array automatically!
print("Add 10 to each number:", arr + 10) # [11, 12, 13, 14, 15]
print("Multiply each by 2:", arr * 2) # [2, 4, 6, 8, 10]
print("Square each number:", arr ** 2) # [1, 4, 9, 16, 25]
print("Square root of each:", np.sqrt(arr)) # [1., 1.41, 1.73, 2., 2.24]
# Statistical operations (summary of the whole array)
print("Total sum:", arr.sum()) # 15 (1+2+3+4+5)
print("Average (mean):", arr.mean()) # 3.0 (15รท5)
print("Standard deviation:", arr.std()) # How spread out the numbers are
print("Smallest number:", arr.min()) # 1
print("Largest number:", arr.max()) # 5
# Operations between arrays
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print("Add corresponding elements:", arr1 + arr2) # [5, 7, 9] (1+4, 2+5, 3+6)
print("Dot product (advanced math):", np.dot(arr1, arr2)) # 32 (1*4 + 2*5 + 3*6)
Why This Matters:
- Speed: Array operations are 10-100x faster than Python loops
- Simplicity: arr * 2 instead of writing a loop
- Memory efficient: Uses less computer memory
- Real-world use: Image processing, scientific calculations, data analysis
Important Rules:
- Arrays must have the same shape (or broadcast-compatible shapes) for element-wise operations
- Single numbers work with whole arrays (this is called broadcasting; see the sketch below)
- Operations create new arrays; they don't modify the original
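Here is a short sketch of those rules in action: same-shape arrays combine element by element, a single number is broadcast across the whole array, and incompatible shapes raise an error.
import numpy as np
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
print(a + b)          # [11 22 33] - same shape, element-wise
print(a * 5)          # [ 5 10 15] - the single number 5 is broadcast to every element
matrix = np.ones((2, 3))
print(matrix + a)     # a (shape (3,)) is broadcast across both rows of the (2, 3) matrix
try:
    a + np.array([1, 2])   # shapes (3,) and (2,) are incompatible
except ValueError as err:
    print("Shapes don't match:", err)
doubled = a * 2
print(a)              # [1 2 3] - the original array is unchanged; a new array was created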
Array Indexing & Slicing
Accessing Your Data Like a Pro
Indexing is like pointing to specific items in your array. Think of it like finding a book on a shelf - you need to know the row and column!
1D Array Indexing (Like a Shopping List)
import numpy as np
# Think of this as a shopping list
arr = np.array([10, 20, 30, 40, 50]) # 5 items on our list
# Get specific items by position (starting from 0)
print("First item (position 0):", arr[0]) # 10 - like "item 1"
print("Last item (position -1):", arr[-1]) # 50 - like "last item"
print("Third item (position 2):", arr[2]) # 30
# Get multiple items at once (slicing)
print("Items 2-4 (positions 1-3):", arr[1:4]) # [20, 30, 40] - like "items 2 through 4"
print("First 3 items:", arr[:3]) # [10, 20, 30] - like "first 3 items"
print("Last 2 items:", arr[-2:]) # [40, 50] - like "last 2 items"
2D Array Indexing (Like a Chess Board)
import numpy as np
# Think of this as a 3x3 chess board
arr2d = np.array([[1, 2, 3], # Row 0: [1, 2, 3]
[4, 5, 6], # Row 1: [4, 5, 6]
[7, 8, 9]]) # Row 2: [7, 8, 9]
# Get specific square: [row, column]
print("Top-left corner [0,0]:", arr2d[0, 0]) # 1 - row 0, column 0
print("Center square [1,1]:", arr2d[1, 1]) # 5 - row 1, column 1
print("Bottom-right [2,2]:", arr2d[2, 2]) # 9 - row 2, column 2
# Get entire rows or columns
print("Entire first row [0,:]:", arr2d[0, :]) # [1, 2, 3] - row 0, all columns
print("Entire second column [:,1]:", arr2d[:, 1]) # [2, 5, 8] - all rows, column 1
# Get parts of rows/columns
print("First 2 items of row 1:", arr2d[1, :2]) # [4, 5] - row 1, columns 0-1
print("Rows 1-2, column 2:", arr2d[1:3, 2]) # [6, 9] - rows 1-2, column 2
Smart Filtering (Boolean Indexing)
import numpy as np
# Like asking "which items meet my criteria?"
arr = np.array([1, 2, 3, 4, 5])
# Create a "mask" - true/false for each item
mask = arr > 3 # Which numbers are greater than 3?
print("Mask (True/False):", mask) # [False, False, False, True, True]
print("Values that pass test:", arr[mask]) # [4, 5] - only the True ones
# Real-world example: Filter test scores
scores = np.array([85, 92, 78, 96, 88])
passed = scores >= 90 # Must be 90 or higher to pass
print("Who passed?", passed) # [False, True, False, True, False]
print("Passing scores:", scores[passed]) # [92, 96]
# Multiple conditions (like "good AND cheap")
prices = np.array([10, 25, 5, 15, 8])
qualities = np.array([8, 9, 6, 7, 9]) # Quality rating 1-10
# Find items that are cheap (โค$15) AND high quality (โฅ8)
good_deals = (prices <= 15) & (qualities >= 8)
print("Good deals?", good_deals) # [True, False, False, False, True]
print("Good deal prices:", prices[good_deals]) # [10, 8]
Indexing Cheat Sheet:
- arr[0] = First item
- arr[-1] = Last item
- arr[1:4] = Items at indices 1-3 (stops before index 4)
- arr[:3] = First 3 items
- arr[2:] = From index 2 to the end
- arr2d[0, 1] = Row 0, Column 1
- arr2d[:, 0] = Entire first column
- arr[mask] = Only items where mask is True
Common Beginner Pitfalls:
- Arrays start at index 0, not 1
- Slicing [1:4] gives the elements at indices 1, 2, and 3 (it stops before 4)
- Forgetting the comma in 2D indexing: arr2d[0, 1], not arr2d[0 1]
- Boolean indexing returns only the matching values, not their positions (np.where gives positions; see the sketch below)
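If you need the positions of the matching items rather than the values, np.where is the usual tool; a quick sketch:
import numpy as np
scores = np.array([85, 92, 78, 96, 88])
mask = scores >= 90
print(scores[mask])        # [92 96] - the matching values
print(np.where(mask))      # (array([1, 3]),) - the positions (indices) of those values
print(np.where(mask)[0])   # [1 3] - just the index array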
Pandas - Data Manipulation
What is Pandas?
Pandas is like Excel for programmers! It gives you superpowers to work with data in tables (called DataFrames). Think of it as a Swiss Army knife for data - you can clean, filter, analyze, and transform data with just a few lines of code.
Why Pandas? Real-world data is messy. Pandas helps you handle missing values, combine datasets, and perform complex operations that would take hours in Excel. It's the #1 tool for data scientists worldwide.
Key Pandas Concepts:
- DataFrame: A table with rows and columns (like an Excel spreadsheet)
- Series: A single column of data (like one column from a spreadsheet)
- Index: Row labels (usually numbers starting from 0)
- Columns: Column names (like "Name", "Age", "Salary")
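A quick sketch of how these pieces relate: selecting one column from a DataFrame gives you a Series, and both share the same index.
import pandas as pd
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
ages = df['Age']              # selecting one column returns a Series
print(type(df))               # <class 'pandas.core.frame.DataFrame'>
print(type(ages))             # <class 'pandas.core.series.Series'>
print(df.index.tolist())      # [0, 1] - the row labels (index)
print(df.columns.tolist())    # ['Name', 'Age'] - the column names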
Creating DataFrames
Building Your First DataFrame
DataFrames are like spreadsheets in Python. You can create them from dictionaries, CSV files, databases, or even from scratch!
import pandas as pd
# Method 1: From a Python dictionary (easiest for beginners)
# Think of this like filling out a form with multiple people
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'], # Column 1: Names
'Age': [25, 30, 35, 28], # Column 2: Ages
'City': ['New York', 'London', 'Paris', 'Tokyo'], # Column 3: Cities
'Salary': [70000, 80000, 75000, 85000] # Column 4: Salaries
}
# Create the DataFrame (like creating the spreadsheet)
df = pd.DataFrame(data)
print("Our employee DataFrame:")
print(df)
print("\nDataFrame shape:", df.shape) # (rows, columns)
print("Column names:", list(df.columns))
# Method 2: Reading from files (most common in real life)
# df = pd.read_csv('employees.csv') # From CSV file
# df = pd.read_excel('employees.xlsx') # From Excel file
# df = pd.read_json('employees.json') # From JSON file
# Method 3: From a database (common in larger companies)
# import sqlite3
# conn = sqlite3.connect('company.db')
# df = pd.read_sql_query("SELECT * FROM employees", conn)
DataFrame Anatomy:
- Rows: Each row represents one record (like one employee)
- Columns: Each column represents one type of information
- Index: The row numbers on the left (0, 1, 2, 3...)
- Headers: The column names at the top
Common Beginner Mistakes:
- Forgetting import pandas as pd at the top
- Giving the lists in the dictionary different lengths (every column must have the same number of values; see the quick sketch below)
- Forgetting that strings need quotes (single ' or double " both work)
- Not checking file paths when reading CSV files
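A quick sketch of the "same length" rule in practice: lists of different lengths raise a ValueError when the DataFrame is built.
import pandas as pd
try:
    pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25]})   # lists have different lengths
except ValueError as err:
    print("DataFrame creation failed:", err)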
Exploring Data
Getting to Know Your Data
Before you start analyzing, you need to understand what you're working with! This is like meeting someone new - you want to know their background, personality, and any quirks they might have.
Basic Data Inspection (First Look)
import pandas as pd
# Load your data (like opening a file in Excel)
df = pd.read_csv('sales_data.csv')
# Quick peeks at your data
print("First 5 rows (getting a feel for the data):")
print(df.head()) # Like looking at the first page of a book
print("\nLast 5 rows (checking the end):")
print(df.tail()) # Like checking the last page
print(f"\nDataset size: {df.shape}") # (rows, columns) - like counting pages and chapters
print(f"Total rows: {df.shape[0]}") # Number of records
print(f"Total columns: {df.shape[1]}") # Number of data types
# Detailed information about each column
print("\nColumn details:")
df.info() # Like a nutrition label - tells you what's in each column (info() prints its report directly, no print() needed)
Statistical Summary (The Big Picture)
# Get a statistical overview (like a report card for your data)
print("Statistical summary of numeric columns:")
print(df.describe()) # Mean, min, max, etc. for all numeric columns
# This tells you:
# - count: How many non-empty values
# - mean: Average value
# - std: How spread out the values are
# - min/max: Smallest and largest values
# - 25%/50%/75%: Quartiles (where 25%, 50%, 75% of data falls)
Column-Level Exploration
# Exploring individual columns
# What columns do we have?
print("All column names:")
print(df.columns.tolist()) # List of all column names
# Working with specific columns
print(f"\nAverage age: {df['Age'].mean():.1f} years")
print(f"Youngest person: {df['Age'].min()} years old")
print(f"Oldest person: {df['Age'].max()} years old")
# Exploring categorical data (text columns)
print(f"\nCities in our data: {df['City'].unique()}")
print("\nHow many people from each city?")
print(df['City'].value_counts()) # Count of each unique value
# Real-world example: Salary analysis
print("\nSalary Statistics:")
print(f"Average salary: ${df['Salary'].mean():,.0f}")
print(f"Median salary: ${df['Salary'].median():,.0f}")
print(f"Highest salary: ${df['Salary'].max():,.0f}")
print(f"Lowest salary: ${df['Salary'].min():,.0f}")
Checking for Missing Data (Data Quality Check)
# Finding missing or incomplete data (very important!)
print("Missing values per column:")
print(df.isnull().sum()) # Count of missing values in each column
print("\nAre there any missing values at all?")
print(df.isna().any()) # True/False for each column
print(f"\nTotal missing values in entire dataset: {df.isnull().sum().sum()}")
# Percentage of missing data
missing_percent = (df.isnull().sum() / len(df)) * 100
print("\nPercentage of missing data per column:")
print(missing_percent.round(2).astype(str) + '%')
Why This Matters:
- Data Quality: Missing data can ruin your analysis
- Understanding Scale: Know if you're dealing with millions or dozens of records
- Data Types: Numbers, text, dates all behave differently
- Outliers: Unusual values that might be errors or important insights
- Distribution: Is your data normal, skewed, or something else?
Red Flags to Watch For:
- Too many missing values in important columns
- Unexpected data types (numbers stored as text)
- Impossible values (negative ages, salaries over $1 billion)
- Only one unique value in a column (useless for analysis)
- Dates that aren't recognized as dates
Pro Tip: Always explore before you analyze!
Think of data exploration like a doctor examining a patient:
- Vitals check: df.shape, df.info()
- Overall health: df.describe()
- Specific concerns: check for missing values and outliers
- Diagnosis: understand patterns and relationships
A small helper that bundles these checks is sketched below.
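If you find yourself repeating these checks, you can wrap them in a small helper function; this is a minimal sketch, and the name quick_checkup is just illustrative.
import pandas as pd
def quick_checkup(df: pd.DataFrame) -> None:
    """Print a quick 'health check' of a DataFrame: shape, dtypes, summary stats, missing values."""
    print("Shape (rows, columns):", df.shape)          # vitals check
    print("\nColumn types:")
    print(df.dtypes)
    print("\nStatistical summary:")
    print(df.describe())                                # overall health
    print("\nMissing values per column:")
    print(df.isnull().sum())                            # specific concerns
# Usage (assumes you already loaded a DataFrame):
# df = pd.read_csv('sales_data.csv')
# quick_checkup(df)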
Data Cleaning
import pandas as pd
import numpy as np
df = pd.read_csv('messy_data.csv')
# Handle missing values
df_cleaned = df.dropna() # Remove rows with any null values
df_filled = df.fillna(0) # Fill nulls with 0
df_filled = df.fillna(df.mean(numeric_only=True)) # Fill nulls with each numeric column's mean
# Remove duplicates
df_unique = df.drop_duplicates()
# Rename columns
df_renamed = df.rename(columns={'old_name': 'new_name'})
# Change data types
df['Age'] = df['Age'].astype(int)
df['Date'] = pd.to_datetime(df['Date'])
# Remove outliers (example: remove values > 3 std devs)
z_scores = np.abs((df['Salary'] - df['Salary'].mean()) / df['Salary'].std())
df_no_outliers = df[z_scores < 3]
# Replace values
df['City'] = df['City'].replace('NYC', 'New York')
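As an alternative to the z-score filter above, many analysts use the interquartile range (IQR) rule instead; here is a minimal sketch on the same Salary column from messy_data.csv, keeping values within 1.5 * IQR of the quartiles.
# Alternative outlier filter using the IQR rule
import pandas as pd
df = pd.read_csv('messy_data.csv')   # same file as the example above
q1 = df['Salary'].quantile(0.25)
q3 = df['Salary'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df_no_outliers_iqr = df[(df['Salary'] >= lower) & (df['Salary'] <= upper)]
print(f"Kept {len(df_no_outliers_iqr)} of {len(df)} rows")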
Data Manipulation
import pandas as pd
df = pd.read_csv('sales_data.csv')
# Filtering
young_people = df[df['Age'] < 30]
high_earners = df[df['Salary'] > 75000]
combined = df[(df['Age'] < 30) & (df['Salary'] > 70000)]
# Sorting
df_sorted = df.sort_values('Salary', ascending=False)
df_multi_sort = df.sort_values(['City', 'Age'])
# Grouping and aggregation
city_avg = df.groupby('City')['Salary'].mean()
city_stats = df.groupby('City').agg({
'Salary': ['mean', 'min', 'max'],
'Age': 'mean'
})
# Adding new columns
df['Salary_K'] = df['Salary'] / 1000
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 30, 40, 100],
labels=['Young', 'Middle', 'Senior'])
# Merging DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Score': [90, 85, 95]})
merged = pd.merge(df1, df2, on='ID', how='inner')
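The how= argument controls which rows survive the merge; a short sketch using the same df1 and df2 defined just above (ID 3 exists only in df1, ID 4 only in df2).
# how='inner' keeps only IDs present in both -> IDs 1, 2
# how='left' keeps every row of df1 -> IDs 1, 2, 3 (Score is NaN for ID 3)
# how='outer' keeps everything -> IDs 1, 2, 3, 4 (missing values filled with NaN)
left_join = pd.merge(df1, df2, on='ID', how='left')
outer_join = pd.merge(df1, df2, on='ID', how='outer')
print(left_join)
print(outer_join)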
Data Visualization
Matplotlib Basics
import matplotlib.pyplot as plt
import numpy as np
# Line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.figure(figsize=(10, 6))
plt.plot(x, y, label='sin(x)', color='blue', linewidth=2)
plt.plot(x, np.cos(x), label='cos(x)', color='red', linestyle='--')
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.title('Trigonometric Functions')
plt.legend()
plt.grid(True)
plt.show()
# Scatter plot
x = np.random.rand(50)
y = np.random.rand(50)
colors = np.random.rand(50)
sizes = 1000 * np.random.rand(50)
plt.figure(figsize=(8, 6))
plt.scatter(x, y, c=colors, s=sizes, alpha=0.5, cmap='viridis')
plt.colorbar()
plt.title('Scatter Plot Example')
plt.show()
# Bar chart
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 56, 78]
plt.figure(figsize=(8, 6))
plt.bar(categories, values, color='skyblue')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Chart Example')
plt.show()
# Histogram
data = np.random.randn(1000)
plt.figure(figsize=(8, 6))
plt.hist(data, bins=30, color='green', alpha=0.7, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram Example')
plt.show()
Seaborn Visualizations
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Set style
sns.set_style("whitegrid")
# Load sample dataset
tips = sns.load_dataset('tips')
# Scatter plot with regression line
plt.figure(figsize=(10, 6))
sns.regplot(x='total_bill', y='tip', data=tips)
plt.title('Total Bill vs Tip')
plt.show()
# Box plot
plt.figure(figsize=(10, 6))
sns.boxplot(x='day', y='total_bill', data=tips)
plt.title('Total Bill by Day')
plt.show()
# Violin plot
plt.figure(figsize=(10, 6))
sns.violinplot(x='day', y='total_bill', hue='sex', data=tips)
plt.title('Total Bill Distribution by Day and Gender')
plt.show()
# Heatmap (correlation matrix)
df = pd.read_csv('data.csv')
correlation = df.corr(numeric_only=True)  # correlations between numeric columns only
plt.figure(figsize=(10, 8))
sns.heatmap(correlation, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.show()
# Pair plot
iris = sns.load_dataset('iris')
sns.pairplot(iris, hue='species')
plt.show()
Machine Learning with Scikit-learn
Complete ML Workflow
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# 1. Load data
df = pd.read_csv('data.csv')
# 2. Prepare features and target
X = df.drop('target', axis=1) # Features
y = df['target'] # Target variable
# 3. Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
# 4. Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 5. Train model
model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)
# 6. Make predictions
y_pred = model.predict(X_test_scaled)
# 7. Evaluate model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# 8. Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
Different ML Algorithms
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
# Classification algorithms
models = {
'Logistic Regression': LogisticRegression(),
'Decision Tree': DecisionTreeClassifier(),
'Random Forest': RandomForestClassifier(),
'SVM': SVC(),
'KNN': KNeighborsClassifier(),
'Naive Bayes': GaussianNB()
}
# Train and evaluate each model
results = {}
for name, model in models.items():
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
results[name] = accuracy
print(f"{name}: {accuracy:.4f}")
# Find best model
best_model = max(results, key=results.get)
print(f"\nBest Model: {best_model} ({results[best_model]:.4f})")
Model Evaluation Techniques
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score, roc_auc_score, roc_curve
# Cross-validation
model = RandomForestClassifier()
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
print(f"CV Scores: {cv_scores}")
print(f"Mean CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
# Hyperparameter tuning with Grid Search
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(
RandomForestClassifier(),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1
)
grid_search.fit(X_train_scaled, y_train)
print(f"\nBest Parameters: {grid_search.best_params_}")
print(f"Best Score: {grid_search.best_score_:.4f}")
# Use best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test_scaled)
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.4f}")
# ROC Curve
y_pred_proba = best_model.predict_proba(X_test_scaled)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
auc = roc_auc_score(y_test, y_pred_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
Real-World Example: Customer Churn Prediction
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# 1. Load and explore data
df = pd.read_csv('customer_churn.csv')
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
print("\nData info:")
print(df.info())
print("\nMissing values:")
print(df.isnull().sum())
# 2. Data cleaning
# Handle missing values
df = df.dropna()
# Encode categorical variables
le = LabelEncoder()
categorical_cols = ['gender', 'contract_type', 'payment_method']
for col in categorical_cols:
df[col] = le.fit_transform(df[col])
# 3. Exploratory Data Analysis
# Churn distribution
plt.figure(figsize=(8, 6))
df['churn'].value_counts().plot(kind='bar')
plt.title('Churn Distribution')
plt.xlabel('Churn (0=No, 1=Yes)')
plt.ylabel('Count')
plt.show()
# Correlation heatmap
plt.figure(figsize=(12, 8))
correlation = df.corr()
sns.heatmap(correlation, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Heatmap')
plt.show()
# 4. Feature engineering
X = df.drop('churn', axis=1)
y = df['churn']
# 5. Split and scale data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 6. Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# 7. Evaluate
y_pred = model.predict(X_test_scaled)
print("\nModel Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Feature importance
feature_importance = pd.DataFrame({
'feature': X.columns,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_importance.head(10))
plt.title('Top 10 Most Important Features')
plt.show()
Best Practices
Data Science Best Practices:
- Understand Your Data: Always explore before modeling
- Clean Thoroughly: Handle missing values and outliers
- Feature Engineering: Create meaningful features
- Split Data Properly: Train/validation/test sets
- Scale Features: Normalize or standardize when needed
- Cross-Validate: Don't rely on single train/test split
- Avoid Overfitting: Use regularization, simpler models
- Document Everything: Keep notebooks well-commented
- Version Control: Use Git for code and data
- Reproducibility: Set random seeds, save models
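For the reproducibility point, here is a minimal sketch of setting random seeds and saving/loading a trained model with joblib (which ships with scikit-learn); the file name model.joblib and the toy data are illustrative.
import random
import numpy as np
import joblib
from sklearn.ensemble import RandomForestClassifier
# Set random seeds so results are repeatable
random.seed(42)
np.random.seed(42)
# Train a model with a fixed random_state, then save it to disk
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
joblib.dump(model, 'model.joblib')          # save the fitted model
loaded = joblib.load('model.joblib')        # reload it later for predictions
print(loaded.predict(X[:5]))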
Useful Tools & Libraries
Data Manipulation
- Pandas: DataFrames
- NumPy: Arrays
- Dask: Big data
Visualization
- Matplotlib: Basic plots
- Seaborn: Statistical
- Plotly: Interactive
Machine Learning
- Scikit-learn: Traditional ML
- XGBoost: Gradient boosting
- LightGBM: Fast boosting
Deep Learning
- TensorFlow: Google's framework
- PyTorch: Facebook's framework
- Keras: High-level API
Natural Language
- NLTK: Text processing
- spaCy: Industrial NLP
- Transformers: BERT, GPT
Computer Vision
- OpenCV: Image processing
- Pillow: Image manipulation
- scikit-image: Image algorithms
Learning Path
Beginner to Data Scientist (6-12 months)
Phase 1: Python Fundamentals (1-2 months)
- Python basics: variables, loops, functions
- Data structures: lists, dictionaries, sets
- Object-oriented programming
- File handling and modules
Phase 2: Data Analysis (2-3 months)
- NumPy for numerical computing
- Pandas for data manipulation
- Data cleaning and preprocessing
- Exploratory data analysis
Phase 3: Visualization (1 month)
- Matplotlib basics
- Seaborn for statistical plots
- Creating dashboards
- Storytelling with data
Phase 4: Statistics & Math (1-2 months)
- Descriptive statistics
- Probability distributions
- Hypothesis testing
- Linear algebra basics
Phase 5: Machine Learning (2-3 months)
- Supervised learning algorithms
- Unsupervised learning
- Model evaluation and tuning
- Feature engineering
Phase 6: Advanced Topics (Ongoing)
- Deep learning
- Natural language processing
- Computer vision
- Big data tools (Spark, Hadoop)
Learning Resources
Online Courses
- Coursera: Data Science Specialization
- DataCamp: Python for Data Science
- Kaggle Learn: Free micro-courses
- Fast.ai: Practical Deep Learning
Books
- Python for Data Analysis (McKinney)
- Hands-On Machine Learning (Géron)
- Introduction to Statistical Learning
- Deep Learning (Goodfellow)
Practice Platforms
- Kaggle: Competitions & datasets
- LeetCode: Coding practice
- HackerRank: Data science track
- DataQuest: Interactive learning
Communities
- r/datascience (Reddit)
- Kaggle Forums
- Stack Overflow
- Data Science Discord servers
Career Opportunities
Data Science Career Paths:
Entry Level:
- Data Analyst: $50,000 - $70,000/year
- Junior Data Scientist: $60,000 - $80,000/year
- Business Intelligence Analyst: $55,000 - $75,000/year
Mid Level:
- Data Scientist: $80,000 - $120,000/year
- Machine Learning Engineer: $90,000 - $130,000/year
- Data Engineer: $85,000 - $125,000/year
Senior Level:
- Senior Data Scientist: $120,000 - $160,000/year
- ML Architect: $130,000 - $180,000/year
- Chief Data Officer: $150,000 - $250,000+/year
Extremely high demand for data science professionals across all industries!
Common Data Science Projects
Beginner Projects
- Exploratory Data Analysis
- Sales Data Analysis
- Weather Data Visualization
- Simple Linear Regression
Intermediate Projects
- Customer Segmentation
- House Price Prediction
- Sentiment Analysis
- Recommendation System
Advanced Projects
- Image Classification
- Time Series Forecasting
- Fraud Detection
- Chatbot Development
Portfolio Projects
- End-to-end ML Pipeline
- Kaggle Competition Entry
- Real-world Problem Solution
- Deployed Web Application
Next Steps
Your Data Science Journey:
- Master Python Basics: Complete python-basics.html course
- Learn NumPy & Pandas: Practice data manipulation daily
- Create Visualizations: Make 10+ different chart types
- Study Statistics: Understand the math behind ML
- Build ML Models: Start with simple algorithms
- Work on Projects: Build a portfolio on GitHub
- Join Kaggle: Participate in competitions
- Network: Connect with data scientists online
- Keep Learning: Stay updated with latest techniques
- Apply for Jobs: Start with internships or junior roles
Start Your Data Science Journey!
Transform data into insights and build intelligent systems