Objectives
- Select and download suitable datasets
- Set up PostgreSQL environment
- Practice basic queries as foundation
- Prepare for CTEs, window functions, and query optimization
Part 1: Dataset Selection
Recommended Datasets
Dataset 1: E-commerce Sales Data (Primary)
Source: Kaggle – “Superstore Sales Dataset”
- URL: https://www.kaggle.com/datasets/vivek468/superstore-dataset-final
Why it’s ideal:
- Multiple tables (orders, customers, products)
- Time-series data (perfect for window functions)
- Sales metrics (great for aggregations and CTEs)
- ~10,000 rows (manageable size)
Key columns: Order ID, Customer ID, Product, Category, Sales, Profit, Quantity, Order Date, Ship Date
Advanced SQL use cases:
- Running totals and moving averages (window functions)
- Customer cohort analysis (CTEs)
- Sales ranking by region/category (RANK, DENSE_RANK)
- Year-over-year comparisons
Dataset 2: Employee/HR Data (Secondary)
Source: Kaggle – “HR Analytics Dataset”
- URL: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset
Why it’s useful:
- Employee hierarchy (good for recursive CTEs)
- Salary and performance data
- Attrition tracking
Advanced SQL use cases:
- Employee tenure analysis
- Salary percentiles by department
- Retention rate calculations
- Manager-employee relationships
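The manager-employee use case is exactly what recursive CTEs are for. As a taste of the pattern before PostgreSQL is even installed, here is a runnable sketch using Python's built-in sqlite3 module with a made-up four-person hierarchy (the `employees`/`manager_id` schema is illustrative, not the HR dataset's actual columns):

```python
import sqlite3

# Hypothetical mini org chart; names and columns are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Ada', NULL),   -- top of the hierarchy
        (2, 'Ben', 1),
        (3, 'Cal', 1),
        (4, 'Dee', 2);
""")

# Walk the manager -> report chain, tracking each employee's depth
rows = conn.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, c.depth + 1
        FROM employees e
        JOIN chain c ON e.manager_id = c.id
    )
    SELECT name, depth FROM chain ORDER BY depth, name
""").fetchall()
print(rows)  # [('Ada', 0), ('Ben', 1), ('Cal', 1), ('Dee', 2)]
```

The anchor member selects the root (no manager); the recursive member repeatedly joins direct reports onto rows already found. The same `WITH RECURSIVE` syntax works in PostgreSQL.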
Alternative Dataset Options:
Option A: COVID-19 Data
- Source: Our World in Data
- URL: https://github.com/owid/covid-19-data/tree/master/public/data
- Time-series perfect for window functions
- Country comparisons
Option B: Airbnb Listings
- Source: Inside Airbnb
- URL: http://insideairbnb.com/get-the-data/
- Price analysis, availability patterns
- Geographic data for spatial queries
Part 2: PostgreSQL Setup
Installation
On macOS:
# Using Homebrew
brew install postgresql@15
# Start PostgreSQL service
brew services start postgresql@15
# Verify installation
psql --version
On Ubuntu/Debian:
# Update package list
sudo apt update
# Install PostgreSQL
sudo apt install postgresql postgresql-contrib
# Start service
sudo systemctl start postgresql
sudo systemctl enable postgresql
# Switch to postgres user
sudo -u postgres psql
On Windows:
- Download installer from: https://www.postgresql.org/download/windows/
- Run the installer (includes pgAdmin 4)
- Remember your superuser password
- Add PostgreSQL bin to PATH
Initial Configuration
-- Connect to PostgreSQL
psql -U postgres
-- Create a database for practice
CREATE DATABASE data_engineering_practice;
-- Connect to the new database
\c data_engineering_practice
-- Create a user (optional)
CREATE USER data_engineer WITH PASSWORD 'your_password';
GRANT ALL PRIVILEGES ON DATABASE data_engineering_practice TO data_engineer;
Useful PostgreSQL Commands
-- List all databases
\l
-- List all tables in current database
\dt
-- Describe table structure
\d table_name
-- List all schemas
\dn
-- Quit psql
\q
-- Execute SQL from file
\i /path/to/file.sql
-- Show query execution time
\timing on
Part 3: Loading Your Data
Method 1: Using COPY (Fastest for CSV)
-- Create table structure first
CREATE TABLE superstore_sales (
row_id SERIAL PRIMARY KEY,
order_id VARCHAR(50),
order_date DATE,
ship_date DATE,
ship_mode VARCHAR(50),
customer_id VARCHAR(50),
customer_name VARCHAR(100),
segment VARCHAR(50),
country VARCHAR(50),
city VARCHAR(100),
state VARCHAR(50),
postal_code VARCHAR(20),
region VARCHAR(50),
product_id VARCHAR(50),
category VARCHAR(50),
sub_category VARCHAR(50),
product_name VARCHAR(255),
sales DECIMAL(10,2),
quantity INTEGER,
discount DECIMAL(4,2),
profit DECIMAL(10,2)
);
-- Load CSV data (server-side COPY reads the file as the database server
-- process; from psql, use \copy with the same syntax to read a file on
-- your own machine instead)
COPY superstore_sales(order_id, order_date, ship_date, ship_mode,
customer_id, customer_name, segment, country,
city, state, postal_code, region, product_id,
category, sub_category, product_name, sales,
quantity, discount, profit)
FROM '/path/to/superstore.csv'
DELIMITER ','
CSV HEADER;
Method 2: Using pgAdmin GUI
- Right-click on table → Import/Export
- Select file and format
- Map columns
- Import
Method 3: Using Python (for data cleaning)
import pandas as pd
from sqlalchemy import create_engine

# Read the CSV and do light cleaning (the Kaggle file may need encoding='latin-1')
df = pd.read_csv('superstore.csv')
df.columns = [c.strip().lower().replace(' ', '_').replace('-', '_') for c in df.columns]
df['order_date'] = pd.to_datetime(df['order_date'])
df['ship_date'] = pd.to_datetime(df['ship_date'])

# Connect to PostgreSQL (substitute your own username and password)
engine = create_engine('postgresql://username:password@localhost:5432/data_engineering_practice')

# Load to database (if_exists='replace' drops and recreates the table)
df.to_sql('superstore_sales', engine, if_exists='replace', index=False)
Part 4: Basic SQL Practice Queries
Query Set 1: Basic SELECT and Filtering
-- 1. View first 10 rows
SELECT * FROM superstore_sales LIMIT 10;
-- 2. Count total orders
SELECT COUNT(DISTINCT order_id) as total_orders
FROM superstore_sales;
-- 3. Filter by category
SELECT * FROM superstore_sales
WHERE category = 'Technology'
LIMIT 20;
-- 4. Multiple conditions
SELECT customer_name, sales, profit
FROM superstore_sales
WHERE region = 'West'
AND sales > 1000
AND order_date >= '2017-01-01';
-- 5. Pattern matching
SELECT DISTINCT product_name
FROM superstore_sales
WHERE product_name LIKE '%Chair%';
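If you want to try the filtering patterns before the data is loaded, the queries above can be exercised against a throwaway sqlite3 table (the product names below are made up). One caveat: SQLite's LIKE is case-insensitive for ASCII by default, whereas PostgreSQL's LIKE is case-sensitive and offers ILIKE for case-insensitive matching.

```python
import sqlite3

# Illustrative stand-in data, not the real Superstore rows
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_name TEXT);
    INSERT INTO products VALUES
        ('Office Chair'), ('Desk Lamp'), ('Gaming Chair');
""")

# Pattern matching with % wildcards, as in query 5
rows = conn.execute("""
    SELECT DISTINCT product_name FROM products
    WHERE product_name LIKE '%Chair%'
    ORDER BY product_name
""").fetchall()
print(rows)  # [('Gaming Chair',), ('Office Chair',)]
```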
Query Set 2: Aggregations
-- 6. Total sales by category
SELECT
category,
SUM(sales) as total_sales,
SUM(profit) as total_profit,
COUNT(*) as order_count
FROM superstore_sales
GROUP BY category
ORDER BY total_sales DESC;
-- 7. Average discount by segment
SELECT
segment,
AVG(discount) as avg_discount,
MIN(discount) as min_discount,
MAX(discount) as max_discount
FROM superstore_sales
GROUP BY segment;
-- 8. Monthly sales trend
SELECT
DATE_TRUNC('month', order_date) as month,
SUM(sales) as monthly_sales,
COUNT(DISTINCT order_id) as order_count
FROM superstore_sales
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;
-- 9. HAVING clause
SELECT
customer_name,
SUM(sales) as total_spent
FROM superstore_sales
GROUP BY customer_name
HAVING SUM(sales) > 5000
ORDER BY total_spent DESC;
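Queries 6-9 all share one shape: group, aggregate, filter groups with HAVING, order by the aggregate. A minimal runnable version of that shape, with made-up numbers in a sqlite3 stand-in table:

```python
import sqlite3

# Tiny stand-in table; amounts are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (category TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('Technology', 500.0), ('Technology', 300.0),
        ('Furniture', 200.0), ('Office Supplies', 50.0);
""")

# GROUP BY aggregates per category; HAVING filters whole groups
# (WHERE filters rows *before* grouping, HAVING filters *after*)
rows = conn.execute("""
    SELECT category, SUM(amount) AS total, COUNT(*) AS n
    FROM sales
    GROUP BY category
    HAVING SUM(amount) > 100
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('Technology', 800.0, 2), ('Furniture', 200.0, 1)]
```

Note that 'Office Supplies' is aggregated but then dropped by HAVING, which is the distinction queries 6 and 9 are exercising.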
Query Set 3: JOIN Preparation
-- Create dimension tables for practice
-- Customer dimension (note: SELECT DISTINCT can still yield repeated
-- customer_ids if a customer appears in more than one city or state)
CREATE TABLE customers AS
SELECT DISTINCT
customer_id,
customer_name,
segment,
country,
city,
state,
region
FROM superstore_sales;
-- Product dimension
CREATE TABLE products AS
SELECT DISTINCT
product_id,
product_name,
category,
sub_category
FROM superstore_sales;
-- Fact table (orders)
CREATE TABLE orders AS
SELECT
row_id,
order_id,
order_date,
ship_date,
ship_mode,
customer_id,
product_id,
sales,
quantity,
discount,
profit
FROM superstore_sales;
-- 10. INNER JOIN
SELECT
o.order_id,
c.customer_name,
p.product_name,
o.sales
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN products p ON o.product_id = p.product_id
LIMIT 100;
-- 11. LEFT JOIN with NULL check
SELECT
c.customer_name,
COUNT(o.order_id) as order_count,
COALESCE(SUM(o.sales), 0) as total_sales
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_name
ORDER BY order_count DESC;
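The key behavior in query 11 is that LEFT JOIN keeps customers with no orders, and COALESCE turns their NULL sums into 0. A self-contained sqlite3 sketch with two invented customers, one of whom has never ordered:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id TEXT, customer_name TEXT);
    CREATE TABLE orders (order_id TEXT, customer_id TEXT, sales REAL);
    INSERT INTO customers VALUES ('C1', 'Ada'), ('C2', 'Ben');
    INSERT INTO orders VALUES ('O1', 'C1', 100.0), ('O2', 'C1', 50.0);
""")

# Ben has no orders: LEFT JOIN keeps him, COALESCE turns NULL into 0
rows = conn.execute("""
    SELECT c.customer_name,
           COUNT(o.order_id)          AS order_count,
           COALESCE(SUM(o.sales), 0) AS total_sales
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_name
    ORDER BY order_count DESC
""").fetchall()
print(rows)  # [('Ada', 2, 150.0), ('Ben', 0, 0)]
```

With an INNER JOIN instead, Ben would vanish from the result entirely, which is why LEFT JOIN matters for "customers with zero orders" questions.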
Query Set 4: Subqueries (Foundation for CTEs)
-- 12. Subquery in WHERE
SELECT * FROM superstore_sales
WHERE sales > (SELECT AVG(sales) FROM superstore_sales);
-- 13. Subquery in FROM
SELECT
category,
avg_sales,
CASE
WHEN avg_sales > overall_avg THEN 'Above Average'
ELSE 'Below Average'
END as performance
FROM (
SELECT
category,
AVG(sales) as avg_sales
FROM superstore_sales
GROUP BY category
) cat_avg
CROSS JOIN (
SELECT AVG(sales) as overall_avg
FROM superstore_sales
) overall;
-- 14. Correlated subquery
SELECT
s1.category,
s1.product_name,
s1.sales
FROM superstore_sales s1
WHERE s1.sales = (
SELECT MAX(s2.sales)
FROM superstore_sales s2
WHERE s2.category = s1.category
);
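What makes query 14 "correlated" is that the inner query references `s1.category` from the outer row, so conceptually it re-runs once per outer row. A compact runnable version with invented products and amounts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (category TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('Tech', 'Laptop', 900.0), ('Tech', 'Mouse', 20.0),
        ('Furniture', 'Desk', 400.0), ('Furniture', 'Chair', 150.0);
""")

# For each row, the subquery re-evaluates with that row's category (s1.category)
rows = conn.execute("""
    SELECT s1.category, s1.product, s1.amount
    FROM sales s1
    WHERE s1.amount = (
        SELECT MAX(s2.amount) FROM sales s2
        WHERE s2.category = s1.category
    )
    ORDER BY s1.category
""").fetchall()
print(rows)  # [('Furniture', 'Desk', 400.0), ('Tech', 'Laptop', 900.0)]
```

The same "top row per group" question is often answered more efficiently with a window function (RANK or ROW_NUMBER partitioned by category), which is exactly where Part 5 picks up.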
Part 5: Preview of Advanced Concepts
Window Functions Teaser
-- Running total of sales over time
SELECT
order_date,
sales,
SUM(sales) OVER (ORDER BY order_date) as running_total
FROM superstore_sales
ORDER BY order_date
LIMIT 50;
-- Rank products by sales within each category
SELECT
category,
product_name,
SUM(sales) as total_sales,
RANK() OVER (PARTITION BY category ORDER BY SUM(sales) DESC) as rank_in_category
FROM superstore_sales
GROUP BY category, product_name
ORDER BY category, rank_in_category
LIMIT 30;
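The running-total query above can be tried immediately in sqlite3 (window functions need SQLite 3.25+), here with three invented daily amounts. The point to notice is that SUM() OVER an ordered window accumulates row by row instead of collapsing the rows like a plain GROUP BY:

```python
import sqlite3  # window functions require SQLite 3.25+

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily (d TEXT, amount REAL);
    INSERT INTO daily VALUES
        ('2017-01-01', 10.0), ('2017-01-02', 20.0), ('2017-01-03', 5.0);
""")

# Running total: every output row keeps its own amount plus the cumulative sum
rows = conn.execute("""
    SELECT d, amount, SUM(amount) OVER (ORDER BY d) AS running_total
    FROM daily ORDER BY d
""").fetchall()
print(rows)
# [('2017-01-01', 10.0, 10.0), ('2017-01-02', 20.0, 30.0), ('2017-01-03', 5.0, 35.0)]
```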
CTEs (Common Table Expressions) Teaser
-- Using WITH clause
WITH monthly_sales AS (
SELECT
DATE_TRUNC('month', order_date) as month,
SUM(sales) as total_sales
FROM superstore_sales
GROUP BY DATE_TRUNC('month', order_date)
)
SELECT
month,
total_sales,
LAG(total_sales) OVER (ORDER BY month) as previous_month,
total_sales - LAG(total_sales) OVER (ORDER BY month) as month_over_month_change
FROM monthly_sales
ORDER BY month;
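The CTE-plus-LAG pattern above, shrunk to three invented months in a sqlite3 table. LAG has no previous row for the first month, so both `previous_month` and the change come back NULL there:

```python
import sqlite3  # LAG() requires SQLite 3.25+

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly (month TEXT, total_sales REAL);
    INSERT INTO monthly VALUES
        ('2017-01', 100.0), ('2017-02', 150.0), ('2017-03', 120.0);
""")

# The CTE names an intermediate result; LAG then compares each month
# to the one before it in the ordered window
rows = conn.execute("""
    WITH m AS (SELECT month, total_sales FROM monthly)
    SELECT month,
           total_sales,
           LAG(total_sales) OVER (ORDER BY month) AS previous_month,
           total_sales - LAG(total_sales) OVER (ORDER BY month) AS change
    FROM m ORDER BY month
""").fetchall()
print(rows)
# [('2017-01', 100.0, None, None),
#  ('2017-02', 150.0, 100.0, 50.0),
#  ('2017-03', 120.0, 150.0, -30.0)]
```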
Part 6: Practice Exercises
Exercise 1: Data Exploration
- How many unique customers are there?
- What date range does the dataset cover?
- Which region has the highest total sales?
- What’s the average order value?
Exercise 2: Business Questions
- Find the top 10 customers by total profit
- Which product category has the highest profit margin?
- What’s the average delivery time (ship_date - order_date)?
- Which month had the highest sales in 2017?
Exercise 3: Complex Queries
- Find customers who made purchases in all four regions
- Calculate the percentage of sales each category contributes
- Find products that were never discounted
- Identify customers whose average order value is above the overall average
Exercise 4: Data Quality Checks
-- Check for NULL values
SELECT
COUNT(*) as total_rows,
COUNT(order_id) as non_null_orders,
COUNT(customer_id) as non_null_customers,
COUNT(sales) as non_null_sales
FROM superstore_sales;
-- Check for duplicates
SELECT
order_id,
product_id,
COUNT(*)
FROM superstore_sales
GROUP BY order_id, product_id
HAVING COUNT(*) > 1;
-- Check data ranges
SELECT
MIN(order_date) as earliest_order,
MAX(order_date) as latest_order,
MIN(sales) as min_sale,
MAX(sales) as max_sale,
AVG(sales) as avg_sale
FROM superstore_sales;
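The NULL and duplicate checks above hinge on one detail: COUNT(column) skips NULLs while COUNT(*) does not, so comparing the two surfaces missing values. A runnable sqlite3 sketch with an invented three-row table containing one duplicate and one NULL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (order_id TEXT, product_id TEXT, sales REAL);
    INSERT INTO t VALUES
        ('O1', 'P1', 10.0),
        ('O1', 'P1', 10.0),   -- exact duplicate row
        ('O2', 'P2', NULL);   -- missing sales value
""")

# COUNT(col) ignores NULLs, so total - non_null = number of missing values
total, non_null = conn.execute(
    "SELECT COUNT(*), COUNT(sales) FROM t").fetchone()

# Same duplicate check as above: group on the natural key, keep groups > 1
dups = conn.execute("""
    SELECT order_id, product_id, COUNT(*) FROM t
    GROUP BY order_id, product_id HAVING COUNT(*) > 1
""").fetchall()
print(total, non_null, dups)  # 3 2 [('O1', 'P1', 2)]
```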
Part 7: Performance Tips
Creating Indexes
-- Create indexes on frequently queried columns
CREATE INDEX idx_order_date ON superstore_sales(order_date);
CREATE INDEX idx_customer_id ON superstore_sales(customer_id);
CREATE INDEX idx_category ON superstore_sales(category);
CREATE INDEX idx_region ON superstore_sales(region);
-- Composite index for common query patterns
CREATE INDEX idx_customer_date ON superstore_sales(customer_id, order_date);
EXPLAIN and Query Analysis
-- See query execution plan
EXPLAIN SELECT * FROM superstore_sales WHERE order_date > '2017-01-01';
-- See actual execution stats
EXPLAIN ANALYZE
SELECT category, SUM(sales)
FROM superstore_sales
GROUP BY category;
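To see the index/plan interaction without a Postgres server, sqlite3's EXPLAIN QUERY PLAN plays a similar role to Postgres's EXPLAIN (the output format differs, but the scan-vs-index distinction is the same). A sketch with an invented table and the composite index from above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE s (customer_id TEXT, order_date TEXT, sales REAL)")
conn.execute("CREATE INDEX idx_customer_date ON s (customer_id, order_date)")

# Ask the planner how it would execute an equality filter on customer_id
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM s WHERE customer_id = 'C1'"
).fetchall()
detail = " ".join(row[-1] for row in plan)
print(detail)  # expect a SEARCH ... USING INDEX rather than a full SCAN
```

Because `customer_id` is the leading column of the composite index, the planner can use the index for this filter; a filter on `order_date` alone generally could not use it.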
Part 8: Next Steps Checklist
- [ ] Install PostgreSQL
- [ ] Download 1-2 datasets from recommended sources
- [ ] Create database and load data
- [ ] Complete all basic query exercises
- [ ] Create indexes on key columns
- [ ] Try the window function and CTE teasers
- [ ] Document any questions for the advanced session
Resources
Documentation
- PostgreSQL Official Docs: https://www.postgresql.org/docs/
- PostgreSQL Tutorial: https://www.postgresqltutorial.com/
- Mode Analytics SQL Tutorial: https://mode.com/sql-tutorial/
Practice Platforms
- SQLZoo: https://sqlzoo.net/
- LeetCode SQL: https://leetcode.com/problemset/database/
- HackerRank SQL: https://www.hackerrank.com/domains/sql
Tools
- pgAdmin: PostgreSQL GUI client
- DBeaver: Universal database tool
- DataGrip: JetBrains SQL IDE
Quick Reference Card
-- CONNECTION
psql -U username -d database_name
-- COMMON AGGREGATES
COUNT(), SUM(), AVG(), MIN(), MAX()
-- DATE FUNCTIONS
DATE_TRUNC('month', date_column)
EXTRACT(YEAR FROM date_column)
AGE(date1, date2)
-- STRING FUNCTIONS
CONCAT(), UPPER(), LOWER(), SUBSTRING()
LIKE, ILIKE (case-insensitive)
-- CONDITIONAL
CASE WHEN ... THEN ... ELSE ... END
COALESCE(value1, value2, default)
NULLIF(value1, value2)
-- GROUPING
GROUP BY, HAVING
GROUP BY ROLLUP, CUBE, GROUPING SETS
Good luck with your Day 2 preparation! 🚀
