Use the table of contents below to navigate.
Table of Contents
Foundation & Architecture
- Introduction
- What Makes Snowflake Different?
- Snowflake Architecture
- Core Components
- Data Storage in Snowflake
- Compute Layer: Virtual Warehouses
Getting Started
- Getting Started with Snowflake
- Working with Databases and Schemas
- Loading Your First Data
Best Practices & Optimization
- Schema Design Best Practices
- Performance Optimization
- Warehouse Sizing and Management
- Query Optimization Techniques
- Cost Management Strategies
Security & Data Modeling
- Security and Access Control
- Data Modeling Patterns
- Advanced Features
Integrations & Monitoring
- AWS Integration Deep Dive
- S3 Integration and External Stages
- VPC and Networking
- BI Tool Integrations
- Snowflake vs AWS Redshift
- System Tables and Metadata
- Monitoring and Troubleshooting
Advanced Topics
- Data Sharing and Marketplace
- Migration Strategies
- Real-World Architecture Patterns
- Final Thoughts & Resources
Introduction
Hey there! If you’re reading this, you’ve probably heard about Snowflake and are wondering what all the buzz is about. Maybe you’re tired of managing database infrastructure, or perhaps you’re looking for a modern alternative to traditional data warehouses like Redshift. Whatever brought you here, you’re in the right place.
Snowflake isn’t just another database—it’s a complete data platform built from the ground up for the cloud. No hardware to manage, no indexes to tune (yes, really!), and it scales like a dream. This comprehensive guide will walk you through everything you need to know about Snowflake, from architecture to advanced integrations, performance optimization, and real-world implementation patterns.
Let’s dive in!
What Makes Snowflake Different?
Before we get into the technical details, let’s talk about what sets Snowflake apart:
True Multi-Cloud Platform: Snowflake runs natively on AWS, Azure, and Google Cloud. You can even replicate data across clouds seamlessly.
Separation of Storage and Compute: Unlike traditional databases where storage and compute are tightly coupled, Snowflake separates them completely. This means you can:
- Scale compute without moving data
- Pause compute when not in use (and stop paying for it!)
- Run multiple compute clusters against the same data simultaneously
Zero Management: No infrastructure to provision, no vacuum operations, no index management, no partitioning decisions (well, mostly).
Time Travel and Cloning: Want to query data as it existed 3 days ago? Or create an instant copy of a 10TB database for testing? Snowflake makes it trivial.
Snowflake Architecture
Snowflake uses a unique hybrid architecture that combines the best of shared-disk and shared-nothing architectures. Let’s break it down:
┌─────────────────────────────────────────────────────────┐
│ CLOUD SERVICES LAYER │
│ (Authentication, Metadata, Query Optimization, etc.) │
└─────────────────────────────────────────────────────────┘
↕
┌─────────────────────────────────────────────────────────┐
│ COMPUTE LAYER │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Virtual │ │ Virtual │ │ Virtual │ │
│ │Warehouse │ │Warehouse │ │Warehouse │ ... │
│ │ (VW1) │ │ (VW2) │ │ (VW3) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────┘
↕
┌─────────────────────────────────────────────────────────┐
│ STORAGE LAYER │
│ (Centralized, Optimized Columnar) │
│ All data stored in S3/Blob │
└─────────────────────────────────────────────────────────┘
The Three Layers
1. Cloud Services Layer
This is the brain of Snowflake. It handles:
- Authentication and access control
- Query parsing and optimization
- Metadata management
- Infrastructure management
2. Compute Layer (Virtual Warehouses)
These are independent compute clusters that execute your queries. Each warehouse:
- Operates independently (no resource contention)
- Can be sized from X-Small to 6X-Large
- Caches data locally for performance
- Can be started, stopped, or suspended automatically
3. Storage Layer
All your data lives here in a compressed, columnar format. The beauty? It’s completely abstracted from you. Snowflake automatically:
- Compresses your data
- Organizes it into micro-partitions
- Maintains statistics
- Handles encryption
Core Components
Let’s understand the key objects you’ll work with:
| Component | Description | Example |
|---|---|---|
| Database | Top-level container for schemas | SALES_DB, ANALYTICS_DB |
| Schema | Namespace within a database | PUBLIC, STAGING, PROD |
| Table | Where your data lives | CUSTOMERS, ORDERS |
| View | Saved query definition | VW_ACTIVE_CUSTOMERS |
| Virtual Warehouse | Compute resource | LOADING_WH, REPORTING_WH |
| Stage | Location for data files | @my_s3_stage, @%customers |
| File Format | Defines how to parse files | CSV_FORMAT, JSON_FORMAT |
| Pipe | Continuous data ingestion | PIPE_CUSTOMER_UPDATES |
| Stream | CDC (Change Data Capture) | CUSTOMER_CHANGES_STREAM |
| Task | Scheduled SQL execution | DAILY_REFRESH_TASK |
Data Storage in Snowflake
One of Snowflake’s superpowers is how it stores data. Let’s demystify this:
Micro-Partitions
Snowflake automatically divides your tables into micro-partitions:
- Each micro-partition contains 50-500 MB of uncompressed data
- They’re immutable (never modified, only replaced)
- Metadata includes min/max values, null counts, etc.
- Queries automatically prune irrelevant partitions
-- You don't need to define partitioning!
-- Snowflake handles it automatically based on ingestion order
CREATE TABLE orders (
order_id NUMBER,
customer_id NUMBER,
order_date DATE,
amount DECIMAL(10,2)
);
-- As data loads, Snowflake creates micro-partitions
-- and maintains metadata automatically
Clustering
While Snowflake auto-clusters data by insertion order, you can define clustering keys for large tables:
-- For a 1TB+ table with frequent date filters
ALTER TABLE orders
CLUSTER BY (order_date);
-- Check clustering quality
SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(order_date)');
When to use clustering:
- Tables > 1TB
- Queries consistently filter on specific columns
- Query performance is critical
- Cost justifies automatic re-clustering
Compute Layer: Virtual Warehouses
Virtual Warehouses (VW) are where the magic happens. Think of them as independent compute clusters.
Warehouse Sizes
| Size | Credits/Hour | Servers | Use Case |
|---|---|---|---|
| X-Small | 1 | 1 | Development, small queries |
| Small | 2 | 2 | Light production workloads |
| Medium | 4 | 4 | Standard workloads |
| Large | 8 | 8 | Heavy queries, large data |
| X-Large | 16 | 16 | Very large analytical queries |
| 2X-Large | 32 | 32 | Massive parallel processing |
| 3X-Large | 64 | 64 | Extreme workloads |
| 4X-Large | 128 | 128 | Rarely needed |
Creating and Configuring Warehouses
-- Create a warehouse for loading data
CREATE WAREHOUSE loading_wh
WAREHOUSE_SIZE = 'LARGE'
AUTO_SUSPEND = 60 -- Suspend after 1 min of inactivity
AUTO_RESUME = TRUE -- Auto-start when query submitted
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 3 -- Scale out to handle concurrency
SCALING_POLICY = 'STANDARD' -- STANDARD or ECONOMY
COMMENT = 'Warehouse for data loading operations';
-- Create a warehouse for BI queries
CREATE WAREHOUSE reporting_wh
WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 300 -- 5 minutes for BI tools
AUTO_RESUME = TRUE
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 5;
-- Modify warehouse size on the fly
ALTER WAREHOUSE loading_wh SET WAREHOUSE_SIZE = 'X-LARGE';
-- Suspend/resume manually
ALTER WAREHOUSE loading_wh SUSPEND;
ALTER WAREHOUSE loading_wh RESUME;
Multi-Cluster Warehouses
For handling concurrent users, Snowflake can automatically add clusters:
CREATE WAREHOUSE concurrent_wh
WAREHOUSE_SIZE = 'MEDIUM'
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 10 -- Scale up to 10 clusters
SCALING_POLICY = 'STANDARD'; -- Add cluster when queue builds
-- STANDARD: Starts clusters aggressively (favors performance)
-- ECONOMY: Conservative, waits before scaling (favors cost)
Getting Started with Snowflake
Let’s get practical. Here’s how to start using Snowflake:
1. Initial Setup
-- Set your context
USE ROLE ACCOUNTADMIN;
-- Create a database
CREATE DATABASE my_analytics_db;
-- Create a warehouse
CREATE WAREHOUSE my_wh
WITH WAREHOUSE_SIZE = 'SMALL'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;
-- Create a user
CREATE USER john_doe
PASSWORD = 'SecureP@ssw0rd!'
DEFAULT_ROLE = 'ANALYST'
DEFAULT_WAREHOUSE = 'my_wh'
DEFAULT_NAMESPACE = 'my_analytics_db.public';
-- Grant privileges
GRANT USAGE ON WAREHOUSE my_wh TO ROLE analyst;
GRANT USAGE ON DATABASE my_analytics_db TO ROLE analyst;
GRANT USAGE ON SCHEMA my_analytics_db.public TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA my_analytics_db.public TO ROLE analyst;
2. Understanding Context
In Snowflake, you always work within a context:
-- Show current context
SELECT CURRENT_ROLE(), CURRENT_WAREHOUSE(),
CURRENT_DATABASE(), CURRENT_SCHEMA();
-- Set context
USE ROLE ANALYST;
USE WAREHOUSE my_wh;
USE DATABASE my_analytics_db;
USE SCHEMA public;
-- Or set all at once
USE ROLE ANALYST;
USE WAREHOUSE my_wh;
USE SCHEMA my_analytics_db.public;
Working with Databases and Schemas
Snowflake organizes data hierarchically:
-- Create a database with additional options
CREATE DATABASE sales_db
DATA_RETENTION_TIME_IN_DAYS = 7 -- Time Travel days
COMMENT = 'Sales department data';
-- Create schemas for different purposes
CREATE SCHEMA sales_db.staging
COMMENT = 'Staging area for raw data';
CREATE SCHEMA sales_db.production
COMMENT = 'Production-ready tables';
CREATE SCHEMA sales_db.archive
COMMENT = 'Historical data archive';
-- Create a transient database (no fail-safe, lower cost)
CREATE TRANSIENT DATABASE temp_analytics_db;
-- Temporary objects are session-scoped; Snowflake supports temporary tables, not databases
CREATE TEMPORARY TABLE scratch_work (id NUMBER, note STRING);
Database and Table Types:
| Type | Time Travel | Fail-Safe | Use Case |
|---|---|---|---|
| Permanent | Up to 90 days | 7 days | Production data |
| Transient | Up to 1 day | No | Staging, temp data |
| Temporary (tables only) | Up to 1 day | No | Session-specific work |
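To check or change retention after the fact, the parameter can be inspected and set directly; a quick sketch against the sales_db objects above:
-- Check the current Time Travel setting for a database
SHOW PARAMETERS LIKE 'DATA_RETENTION_TIME_IN_DAYS' IN DATABASE sales_db;
-- Trim retention on the staging schema to cut storage costs
ALTER SCHEMA sales_db.staging SET DATA_RETENTION_TIME_IN_DAYS = 1;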
Loading Your First Data
Let’s load some data! Snowflake offers multiple methods:
Method 1: Load from Local File
-- Create a table
CREATE TABLE customers (
customer_id NUMBER,
first_name VARCHAR(50),
last_name VARCHAR(50),
email VARCHAR(100),
signup_date DATE
);
-- Create a file format
CREATE FILE FORMAT my_csv_format
TYPE = 'CSV'
FIELD_DELIMITER = ','
SKIP_HEADER = 1
NULL_IF = ('NULL', 'null', '')
EMPTY_FIELD_AS_NULL = TRUE
COMPRESSION = 'AUTO';
-- Create an internal stage (within Snowflake)
CREATE STAGE my_internal_stage
FILE_FORMAT = my_csv_format;
-- Upload file via UI or SnowSQL, then load
PUT file://customers.csv @my_internal_stage;
COPY INTO customers
FROM @my_internal_stage/customers.csv
FILE_FORMAT = my_csv_format
ON_ERROR = 'CONTINUE';
-- Check load results
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
Method 2: Load from S3
-- Create external stage pointing to S3
CREATE STAGE my_s3_stage
URL = 's3://my-bucket/data/'
CREDENTIALS = (AWS_KEY_ID = 'your_key'
AWS_SECRET_KEY = 'your_secret')
FILE_FORMAT = my_csv_format;
-- Or use IAM role (recommended)
CREATE STAGE my_s3_stage_with_role
URL = 's3://my-bucket/data/'
STORAGE_INTEGRATION = my_s3_integration
FILE_FORMAT = my_csv_format;
-- List files in stage
LIST @my_s3_stage;
-- Load data
COPY INTO customers
FROM @my_s3_stage
PATTERN = '.*customers.*csv'
FILE_FORMAT = my_csv_format;
Method 3: Direct Load (Small Data)
-- For quick tests
INSERT INTO customers VALUES
(1, 'John', 'Doe', 'john@email.com', '2024-01-15'),
(2, 'Jane', 'Smith', 'jane@email.com', '2024-01-16');
-- Or use SELECT
INSERT INTO customers
SELECT * FROM other_table WHERE condition = true;
Method 4: Continuous Loading with Snowpipe
-- Create pipe for automated loading
CREATE PIPE customer_pipe
AUTO_INGEST = TRUE
AS
COPY INTO customers
FROM @my_s3_stage
FILE_FORMAT = my_csv_format;
-- Show pipe status
SHOW PIPES;
-- Check pipe history
SELECT *
FROM TABLE(INFORMATION_SCHEMA.PIPE_USAGE_HISTORY(
DATE_RANGE_START=>DATEADD('day', -7, CURRENT_DATE()),
PIPE_NAME=>'customer_pipe'
));
Key Concepts to Remember
1. Time Travel
Every table automatically maintains history:
-- Query data as it was 1 hour ago
SELECT * FROM customers
AT(OFFSET => -3600);
-- Query data at a specific timestamp
SELECT * FROM customers
AT(TIMESTAMP => '2024-01-15 14:30:00'::TIMESTAMP);
-- Restore a dropped table
UNDROP TABLE customers;
-- Clone at a point in time
CREATE TABLE customers_yesterday
CLONE customers
AT(OFFSET => -86400);
2. Zero-Copy Cloning
Create instant, no-cost copies of databases, schemas, or tables:
-- Clone entire database
CREATE DATABASE dev_db CLONE production_db;
-- Clone table for testing
CREATE TABLE customers_test CLONE customers;
-- Clone is instant and shares storage until modified
3. Result Caching
Snowflake caches query results for 24 hours:
-- First run: Executes query
SELECT COUNT(*) FROM large_table; -- Takes 30 seconds
-- Second run: Instant from cache
SELECT COUNT(*) FROM large_table; -- Takes 0.1 seconds
-- A different query text (e.g., adding WHERE 1=1) means the cached result isn't reused
SELECT COUNT(*) FROM large_table
WHERE 1=1; -- Re-executes because the text differs from the cached query
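If you want a guaranteed re-execution rather than relying on changed query text, the session parameter is the explicit switch; a small sketch:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
SELECT COUNT(*) FROM large_table; -- Re-executes even if a cached result exists
ALTER SESSION SET USE_CACHED_RESULT = TRUE;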
Schema Design Best Practices
1. Choosing the Right Data Types
Snowflake is forgiving, but using optimal data types matters:
-- ❌ DON'T: Oversize your columns
CREATE TABLE customers_bad (
customer_id VARCHAR(1000), -- Way too big!
email VARCHAR(1000), -- 100 would suffice
age VARCHAR(50) -- Should be NUMBER
);
-- ✅ DO: Right-size your columns
CREATE TABLE customers_good (
customer_id NUMBER(38,0), -- Or use AUTOINCREMENT
email VARCHAR(100), -- Reasonable size
age NUMBER(3,0), -- Appropriate numeric type
signup_date DATE, -- Not TIMESTAMP if time not needed
last_login TIMESTAMP_LTZ -- With timezone if needed
);
Data Type Recommendations:
| Data Type | When to Use | Example |
|---|---|---|
| NUMBER(38,0) | Integer IDs, counts | Customer ID, Order count |
| NUMBER(10,2) | Money, precise decimals | Price, Amount |
| FLOAT | Measurements, scientific | Temperature, Distance |
| VARCHAR(n) | Text with known max length | Email, Phone |
| STRING | Text with unknown length | Comments, Descriptions |
| DATE | Dates without time | Birth date, Order date |
| TIMESTAMP_LTZ | Timestamps with timezone | Login time, Event time |
| VARIANT | JSON, semi-structured | API responses, logs |
| BOOLEAN | True/false flags | is_active, is_premium |
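Since VARIANT comes up constantly with API payloads, here is a minimal sketch of loading and querying semi-structured data (the api_events table and field names are illustrative):
CREATE TABLE api_events (payload VARIANT);
-- PARSE_JSON must go through a SELECT (it can't be used in a VALUES list)
INSERT INTO api_events (payload)
SELECT PARSE_JSON('{"user_id": 42, "action": "login", "tags": ["web", "mobile"]}');
-- Path access with explicit casts
SELECT
    payload:user_id::NUMBER AS user_id,
    payload:action::STRING AS action,
    payload:tags[0]::STRING AS first_tag
FROM api_events;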
2. Table Types Strategy
-- Production tables: Permanent (default)
CREATE TABLE prod_orders (
order_id NUMBER,
order_date DATE,
amount NUMBER(10,2)
);
-- Has: Time Travel (up to 90 days), Fail-safe (7 days)
-- Staging tables: Transient (cheaper)
CREATE TRANSIENT TABLE staging_orders (
order_id NUMBER,
order_date DATE,
amount NUMBER(10,2)
);
-- Has: Time Travel (up to 1 day), No Fail-safe
-- Saves ~50% on storage costs
-- Session work: Temporary
CREATE TEMPORARY TABLE work_orders (
order_id NUMBER,
amount NUMBER(10,2)
);
-- Exists only in session, automatic cleanup
When to Use What:
Permanent Tables
└── Production data
└── Data requiring recovery beyond 1 day
└── Regulatory/compliance requirements
Transient Tables
└── Staging/ETL intermediate tables
└── Data easily recreatable
└── Development/testing
└── Landing zones for raw data
Temporary Tables
└── Session-specific calculations
└── Query intermediate results
└── Data that doesn't need persistence
3. Schema Organization
-- Organize by data lifecycle and purpose
CREATE DATABASE analytics_db;
-- Raw/landing zone
CREATE SCHEMA analytics_db.raw
DATA_RETENTION_TIME_IN_DAYS = 1; -- Short retention
-- Staging/transformation
CREATE TRANSIENT SCHEMA analytics_db.staging
DATA_RETENTION_TIME_IN_DAYS = 1;
-- Production-ready data
CREATE SCHEMA analytics_db.curated
DATA_RETENTION_TIME_IN_DAYS = 7;
-- Data marts for specific teams
CREATE SCHEMA analytics_db.sales_mart;
CREATE SCHEMA analytics_db.marketing_mart;
-- Utility objects
CREATE SCHEMA analytics_db.utility; -- File formats, stages, etc.
Performance Optimization
1. Clustering Keys (For Large Tables)
Clustering helps with very large tables (1TB+):
-- Check if clustering would help
SELECT
table_name,
row_count,
bytes / POWER(1024, 3) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'PUBLIC'
AND bytes > 1099511627776 -- > 1TB
ORDER BY bytes DESC;
-- Add clustering key
ALTER TABLE large_fact_table
CLUSTER BY (date_column, category_id);
-- Monitor clustering health (returns JSON: average_depth, average_overlaps, etc.; lower depth is better)
SELECT SYSTEM$CLUSTERING_INFORMATION('large_fact_table',
'(date_column, category_id)');
-- Check clustering depth (fewer is better)
SELECT SYSTEM$CLUSTERING_DEPTH('large_fact_table',
'(date_column, category_id)');
Clustering Key Best Practices:
-- ✅ DO: Cluster by frequently filtered columns
ALTER TABLE orders CLUSTER BY (order_date); -- Good for date range queries
-- ✅ DO: Use multi-column clustering for multiple filters
ALTER TABLE orders CLUSTER BY (order_date, region_id);
-- ❌ DON'T: Cluster high-cardinality columns alone
-- Bad: customer_id (millions of unique values)
-- ❌ DON'T: Cluster frequently updated tables
-- Clustering maintenance is expensive
-- ✅ DO: Consider clustering on expressions (e.g., a truncated timestamp)
ALTER TABLE events CLUSTER BY (DATE_TRUNC('DAY', event_timestamp));
2. Materialized Views
Pre-compute expensive queries:
-- Create materialized view
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT
DATE_TRUNC('DAY', order_date) AS sale_date,
region_id,
SUM(amount) AS total_sales,
COUNT(*) AS order_count,
AVG(amount) AS avg_order_value
FROM orders
GROUP BY 1, 2;
-- Snowflake automatically maintains it!
-- Queries automatically use it when beneficial
-- Refresh manually if needed
ALTER MATERIALIZED VIEW daily_sales_summary REFRESH;
-- Check refresh status
SHOW MATERIALIZED VIEWS;
When to Use Materialized Views:
✅ Good candidates:
- Expensive aggregations queried frequently
- Dashboard queries that don’t need real-time data
- Complex joins used repeatedly
- Queries with GROUP BY on large datasets
❌ Avoid for:
- Rapidly changing base tables
- Queries needing real-time data
- Simple queries (overhead not worth it)
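Refresh activity and its serverless credit cost can be reviewed as well; a hedged sketch, assuming the MATERIALIZED_VIEW_REFRESH_HISTORY view in ACCOUNT_USAGE and its START_TIME/CREDITS_USED columns:
SELECT *
FROM snowflake.account_usage.materialized_view_refresh_history
WHERE start_time >= DATEADD('month', -1, CURRENT_TIMESTAMP())
ORDER BY credits_used DESC;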
3. Search Optimization
For substring searches and lookups:
-- Enable search optimization
ALTER TABLE customers
ADD SEARCH OPTIMIZATION ON EQUALITY(customer_id, email);
-- Or on all columns
ALTER TABLE customers
ADD SEARCH OPTIMIZATION;
-- EQUALITY optimization speeds up point lookups:
SELECT * FROM customers WHERE email = 'john@email.com';
-- For LIKE/substring searches, enable the SUBSTRING method instead:
-- ALTER TABLE customers ADD SEARCH OPTIMIZATION ON SUBSTRING(email);
SELECT * FROM customers WHERE email LIKE '%@gmail.com';
-- Check optimization status
SELECT SYSTEM$GET_SEARCH_OPTIMIZATION_STATUS('customers');
Warehouse Sizing and Management
Choosing the Right Size
Here’s a practical approach:
-- Start small and monitor
CREATE WAREHOUSE analytics_wh
WAREHOUSE_SIZE = 'SMALL'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;
-- Run your workload and check query performance
SELECT
query_id,
query_text,
warehouse_size,
execution_time / 1000 AS execution_seconds,
bytes_scanned / POWER(1024, 3) AS gb_scanned
FROM snowflake.account_usage.query_history
WHERE warehouse_name = 'ANALYTICS_WH'
AND start_time >= DATEADD('hour', -1, CURRENT_TIMESTAMP())
ORDER BY execution_time DESC
LIMIT 10;
Sizing Strategy:
| Scenario | Recommended Approach |
|---|---|
| Data loading | Start with LARGE, scale up if slow |
| BI queries (many users) | MEDIUM with multi-cluster (1-5 clusters) |
| ETL/batch processing | LARGE to X-LARGE, single cluster |
| Ad-hoc analysis | SMALL to MEDIUM |
| Development | X-SMALL |
| Complex analytics | X-LARGE for duration of job |
Warehouse Design Patterns
-- Pattern 1: Separate workloads
CREATE WAREHOUSE loading_wh
WAREHOUSE_SIZE = 'LARGE'
AUTO_SUSPEND = 60; -- Suspend quickly
CREATE WAREHOUSE bi_wh
WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 600 -- 10 min for BI tools
MAX_CLUSTER_COUNT = 5; -- Handle concurrent users
CREATE WAREHOUSE etl_wh
WAREHOUSE_SIZE = 'X-LARGE'
AUTO_SUSPEND = 60;
-- Pattern 2: Time-based sizing
-- Start day with larger warehouse
ALTER WAREHOUSE bi_wh SET WAREHOUSE_SIZE = 'LARGE'; -- 8 AM
-- Scale down during off-hours
ALTER WAREHOUSE bi_wh SET WAREHOUSE_SIZE = 'SMALL'; -- 6 PM
-- Pattern 3: Dynamic sizing in stored procedure
CREATE PROCEDURE scale_warehouse(wh_name STRING, size STRING)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
var sql_command = `ALTER WAREHOUSE ${WH_NAME} SET WAREHOUSE_SIZE = '${SIZE}'`;
snowflake.execute({sqlText: sql_command});
return `Warehouse ${WH_NAME} scaled to ${SIZE}`;
$$;
-- Use in your ETL
CALL scale_warehouse('ETL_WH', 'X-LARGE');
-- ... run heavy processing ...
CALL scale_warehouse('ETL_WH', 'SMALL');
Multi-Cluster Warehouses
CREATE WAREHOUSE concurrent_bi_wh
WAREHOUSE_SIZE = 'MEDIUM'
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 10
SCALING_POLICY = 'STANDARD' -- or ECONOMY
AUTO_SUSPEND = 300;
-- STANDARD: Aggressive scaling (performance priority)
-- - Starts a new cluster as soon as queries begin to queue
-- ECONOMY: Conservative scaling (cost priority)
-- - Only adds a cluster if there is an estimated 6 minutes of queued work
-- - May queue queries briefly
Cost Comparison:
Example: 10 concurrent users running 2-minute queries
STANDARD policy:
- May spin up 5-6 clusters
- Minimal queuing
- Higher cost, better UX
ECONOMY policy:
- May spin up 3-4 clusters
- Some queries queue 10-30 seconds
- Lower cost, acceptable for most BI use
Query Optimization Techniques
1. Understanding the Query Profile
-- Run a query and get its ID
SET query_id = LAST_QUERY_ID();
-- Analyze via UI: Query Profile tab
-- Or programmatically:
SELECT *
FROM TABLE(GET_QUERY_OPERATOR_STATS($query_id));
Key metrics to watch:
- Partitions scanned vs. total: Should be low %
- Bytes spilled to disk: Should be zero or minimal
- Rows produced: Check for unexpectedly high numbers
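The spilling metric above can also be checked in bulk from query history; a sketch using the spill columns in ACCOUNT_USAGE.QUERY_HISTORY:
SELECT
    query_id,
    warehouse_name,
    bytes_spilled_to_local_storage / POWER(1024, 3) AS local_spill_gb,
    bytes_spilled_to_remote_storage / POWER(1024, 3) AS remote_spill_gb
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
  AND (bytes_spilled_to_local_storage > 0 OR bytes_spilled_to_remote_storage > 0)
ORDER BY remote_spill_gb DESC, local_spill_gb DESC
LIMIT 20;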
2. Common Optimization Patterns
-- ❌ DON'T: SELECT *
SELECT * FROM large_table; -- Scans all columns
-- ✅ DO: Select only needed columns
SELECT customer_id, order_date, amount
FROM large_table;
-- ❌ DON'T: Functions on filtered columns (prevents pruning)
SELECT * FROM orders
WHERE YEAR(order_date) = 2024;
-- ✅ DO: Direct column comparisons
SELECT * FROM orders
WHERE order_date >= '2024-01-01'
AND order_date < '2025-01-01';
-- ❌ DON'T: Implicit type conversions
SELECT * FROM orders
WHERE customer_id = '12345'; -- customer_id is NUMBER
-- ✅ DO: Use correct types
SELECT * FROM orders
WHERE customer_id = 12345;
-- ❌ DON'T: Correlated subqueries
SELECT o.*,
(SELECT SUM(amount)
FROM order_items
WHERE order_id = o.order_id) AS total
FROM orders o;
-- ✅ DO: Use JOINs
SELECT o.*, oi.total
FROM orders o
JOIN (
SELECT order_id, SUM(amount) AS total
FROM order_items
GROUP BY order_id
) oi ON o.order_id = oi.order_id;
3. Join Optimization
-- Join order matters for performance
-- Put larger table first, smaller table second
-- ❌ Less optimal
SELECT *
FROM small_dimension d
JOIN large_fact f ON d.id = f.dimension_id;
-- ✅ Better
SELECT *
FROM large_fact f
JOIN small_dimension d ON f.dimension_id = d.id;
-- Use appropriate join types
-- INNER JOIN when possible (faster than OUTER)
-- Consider clustering on join keys for very large tables
ALTER TABLE large_fact CLUSTER BY (dimension_id);
4. Leveraging Cache
-- Three types of caching in Snowflake:
-- 1. Result Cache (24 hours, automatic)
SELECT COUNT(*) FROM orders; -- Cached
-- 2. Local Disk Cache (per warehouse)
SELECT * FROM orders
WHERE order_date = '2024-01-15'; -- Cached on warehouse nodes
-- 3. Remote Disk Cache (metadata layer)
-- Automatically used for small queries
-- Pro tip: Run analytics on same warehouse for cache benefit
USE WAREHOUSE analytics_wh; -- Keep using same warehouse
Cost Management Strategies
1. Monitor Credit Usage
-- View credit usage by warehouse
SELECT
warehouse_name,
SUM(credits_used) AS total_credits,
SUM(credits_used) * 4 AS approximate_cost_usd -- ~$4/credit
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('month', -1, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY total_credits DESC;
-- Identify expensive queries
SELECT
query_id,
user_name,
warehouse_name,
execution_time / 1000 AS seconds,
bytes_scanned / POWER(1024, 3) AS gb_scanned,
credits_used_cloud_services
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
AND execution_time > 60000 -- > 1 minute
ORDER BY credits_used_cloud_services DESC
LIMIT 20;
2. Cost Optimization Checklist
| Strategy | Implementation | Savings |
|---|---|---|
| Auto-suspend | Set to 60-300 seconds | 30-50% |
| Right-size warehouses | Monitor and adjust | 20-40% |
| Use transient tables | For staging/temp data | 50% on storage |
| Limit time travel | Set to minimum needed | 10-30% on storage |
| Clustering discipline | Only for 1TB+ tables | Avoid unnecessary costs |
| Query optimization | Reduce scan volumes | 20-50% |
| Dedicated warehouses | Isolate workloads | Better visibility |
| Resource monitors | Set credit limits | Prevent overruns |
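Several of these are one-statement changes; a quick sketch reusing warehouse and table names from earlier sections:
-- Tighten auto-suspend and right-size an existing warehouse
ALTER WAREHOUSE reporting_wh SET AUTO_SUSPEND = 120;
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'SMALL';
-- Trim Time Travel on a staging table
ALTER TABLE staging_orders SET DATA_RETENTION_TIME_IN_DAYS = 1;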
3. Resource Monitors
-- Create account-level monitor
USE ROLE ACCOUNTADMIN;
CREATE RESOURCE MONITOR monthly_limit
WITH CREDIT_QUOTA = 1000 -- 1000 credits per month
FREQUENCY = MONTHLY
START_TIMESTAMP = IMMEDIATELY
TRIGGERS
ON 75 PERCENT DO NOTIFY -- Alert at 75%
ON 100 PERCENT DO SUSPEND -- Stop at 100%
ON 110 PERCENT DO SUSPEND_IMMEDIATE; -- Force stop
-- Apply to account
ALTER ACCOUNT SET RESOURCE_MONITOR = monthly_limit;
-- Create warehouse-specific monitor
CREATE RESOURCE MONITOR bi_warehouse_limit
WITH CREDIT_QUOTA = 100
FREQUENCY = DAILY
START_TIMESTAMP = IMMEDIATELY
TRIGGERS
ON 90 PERCENT DO SUSPEND_IMMEDIATE;
-- Apply to specific warehouse
ALTER WAREHOUSE bi_wh
SET RESOURCE_MONITOR = bi_warehouse_limit;
4. Storage Cost Management
-- Identify large tables
SELECT
table_catalog || '.' || table_schema || '.' || table_name AS full_table_name,
row_count,
bytes / POWER(1024, 3) AS size_gb,
CASE
WHEN is_transient = 'YES' THEN 'Transient'
ELSE 'Permanent'
END AS table_category -- avoid shadowing the table_type column used in the WHERE clause
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
AND bytes > 1073741824 -- > 1GB
ORDER BY bytes DESC
LIMIT 20;
-- Check time travel storage
SELECT
table_name,
active_bytes / POWER(1024, 3) AS active_gb,
time_travel_bytes / POWER(1024, 3) AS time_travel_gb,
failsafe_bytes / POWER(1024, 3) AS failsafe_gb
FROM snowflake.account_usage.table_storage_metrics
WHERE time_travel_bytes > 0
ORDER BY time_travel_bytes DESC;
-- Reduce time travel if not needed
ALTER TABLE staging_data
SET DATA_RETENTION_TIME_IN_DAYS = 1; -- From default 7 or 90
-- Convert to transient if appropriate
CREATE TRANSIENT TABLE staging_data_new AS
SELECT * FROM staging_data;
DROP TABLE staging_data;
ALTER TABLE staging_data_new RENAME TO staging_data;
Security and Access Control
Role-Based Access Control (RBAC)
Snowflake uses a hierarchy of roles:
-- Standard role hierarchy
-- ACCOUNTADMIN
-- └── SECURITYADMIN
-- └── USERADMIN
-- └── SYSADMIN
-- └── Custom roles
-- └── PUBLIC (granted to all users)
-- Create custom roles
USE ROLE SECURITYADMIN;
CREATE ROLE data_engineer;
CREATE ROLE data_analyst;
CREATE ROLE data_scientist;
-- Grant roles to users
GRANT ROLE data_analyst TO USER john_doe;
GRANT ROLE data_engineer TO USER jane_smith;
-- Create role hierarchy
GRANT ROLE data_analyst TO ROLE data_scientist; -- data_scientist inherits analyst privileges
GRANT ROLE data_engineer TO ROLE sysadmin; -- SYSADMIN inherits the custom role (recommended practice)
Granting Privileges
-- Grant warehouse access
GRANT USAGE ON WAREHOUSE analytics_wh TO ROLE data_analyst;
GRANT OPERATE ON WAREHOUSE analytics_wh TO ROLE data_engineer;
-- Grant database/schema access
GRANT USAGE ON DATABASE analytics_db TO ROLE data_analyst;
GRANT USAGE ON SCHEMA analytics_db.curated TO ROLE data_analyst;
-- Grant table access
GRANT SELECT ON ALL TABLES IN SCHEMA analytics_db.curated TO ROLE data_analyst;
GRANT SELECT ON FUTURE TABLES IN SCHEMA analytics_db.curated TO ROLE data_analyst;
-- Grant write access to engineers
GRANT INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA analytics_db.staging
TO ROLE data_engineer;
GRANT CREATE TABLE ON SCHEMA analytics_db.staging TO ROLE data_engineer;
-- View privileges
SHOW GRANTS TO ROLE data_analyst;
SHOW GRANTS ON TABLE customers;
Row-Level Security with Row Access Policies
-- Create mapping table
CREATE TABLE user_region_access (
user_name VARCHAR,
allowed_region VARCHAR
);
INSERT INTO user_region_access VALUES
('john_doe', 'US-WEST'),
('jane_smith', 'US-EAST'),
('admin_user', 'ALL');
-- Create row access policy
CREATE ROW ACCESS POLICY region_access_policy
AS (region_code VARCHAR) RETURNS BOOLEAN ->
'admin_user' = CURRENT_USER()
OR EXISTS (
SELECT 1 FROM user_region_access
WHERE user_name = CURRENT_USER()
AND (allowed_region = region_code OR allowed_region = 'ALL')
);
-- Apply policy to table
ALTER TABLE orders
ADD ROW ACCESS POLICY region_access_policy ON (region);
-- Now users automatically see only their regions!
Column-Level Security with Masking Policies
-- Create masking policy for PII
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
CASE
WHEN CURRENT_ROLE() IN ('ADMIN', 'SECURITY') THEN val
WHEN CURRENT_ROLE() = 'ANALYST' THEN REGEXP_REPLACE(val, '.+@', '****@')
ELSE '****'
END;
-- Apply to column
ALTER TABLE customers
MODIFY COLUMN email SET MASKING POLICY email_mask;
-- Create masking for SSN/credit cards
CREATE MASKING POLICY ssn_mask AS (val STRING) RETURNS STRING ->
CASE
WHEN CURRENT_ROLE() IN ('ADMIN', 'COMPLIANCE') THEN val
ELSE '***-**-' || RIGHT(val, 4)
END;
ALTER TABLE customers
MODIFY COLUMN ssn SET MASKING POLICY ssn_mask;
Data Modeling Patterns
1. Star Schema (Most Common)
-- Dimension tables (descriptive attributes)
CREATE TABLE dim_customer (
customer_key NUMBER AUTOINCREMENT, -- Surrogate key
customer_id VARCHAR(50), -- Natural key
first_name VARCHAR(100),
last_name VARCHAR(100),
email VARCHAR(200),
-- SCD Type 2 columns
effective_date DATE,
end_date DATE,
is_current BOOLEAN,
PRIMARY KEY (customer_key)
);
CREATE TABLE dim_product (
product_key NUMBER AUTOINCREMENT,
product_id VARCHAR(50),
product_name VARCHAR(200),
category VARCHAR(100),
unit_price NUMBER(10,2),
effective_date DATE,
is_current BOOLEAN,
PRIMARY KEY (product_key)
);
-- Fact table (measures/metrics)
CREATE TABLE fact_sales (
sales_key NUMBER AUTOINCREMENT,
sale_date DATE,
customer_key NUMBER, -- FK to dim_customer
product_key NUMBER, -- FK to dim_product
store_key NUMBER, -- FK to dim_store
quantity NUMBER,
unit_price NUMBER(10,2),
discount_amount NUMBER(10,2),
sales_amount NUMBER(10,2),
cost_amount NUMBER(10,2),
PRIMARY KEY (sales_key),
-- Snowflake doesn't enforce FKs, but good for documentation
FOREIGN KEY (customer_key) REFERENCES dim_customer(customer_key),
FOREIGN KEY (product_key) REFERENCES dim_product(product_key)
)
CLUSTER BY (sale_date); -- If large table
2. Slowly Changing Dimensions (SCD)
-- Type 1: Overwrite (no history)
UPDATE dim_customer
SET email = 'newemail@example.com'
WHERE customer_id = '12345';
-- Type 2: Track history (recommended)
-- Close current record
UPDATE dim_customer
SET end_date = CURRENT_DATE(),
is_current = FALSE
WHERE customer_id = '12345' AND is_current = TRUE;
-- Insert new record
INSERT INTO dim_customer
(customer_id, first_name, last_name, email, effective_date, end_date, is_current)
VALUES
('12345', 'John', 'Doe', 'newemail@example.com', CURRENT_DATE(), '9999-12-31', TRUE);
-- Query current records
SELECT * FROM dim_customer WHERE is_current = TRUE;
-- Query at specific point in time
SELECT * FROM dim_customer
WHERE customer_id = '12345'
AND effective_date <= '2024-01-15'
AND end_date >= '2024-01-15';
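The close-then-insert steps above are often wrapped into a single load; a minimal sketch, assuming a staging table stg_customer that holds the latest snapshot and comparing only email for brevity:
-- Close current rows whose attributes changed
MERGE INTO dim_customer d
USING stg_customer s
ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHEN MATCHED AND d.email <> s.email THEN
UPDATE SET end_date = CURRENT_DATE(), is_current = FALSE;
-- Insert new versions for changed or brand-new customers
INSERT INTO dim_customer
(customer_id, first_name, last_name, email, effective_date, end_date, is_current)
SELECT s.customer_id, s.first_name, s.last_name, s.email, CURRENT_DATE(), '9999-12-31', TRUE
FROM stg_customer s
LEFT JOIN dim_customer d
ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL;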
3. Data Vault 2.0 (For Complex Enterprises)
-- Hubs: Unique business keys
CREATE TABLE hub_customer (
customer_hashkey BINARY(16), -- MD5 of business key
customer_id VARCHAR(50), -- Business key
load_date TIMESTAMP_NTZ,
record_source VARCHAR(50),
PRIMARY KEY (customer_hashkey)
);
-- Links: Relationships between hubs
CREATE TABLE link_order (
order_hashkey BINARY(16),
customer_hashkey BINARY(16),
product_hashkey BINARY(16),
load_date TIMESTAMP_NTZ,
record_source VARCHAR(50),
PRIMARY KEY (order_hashkey)
);
-- Satellites: Descriptive attributes
CREATE TABLE sat_customer (
customer_hashkey BINARY(16),
load_date TIMESTAMP_NTZ,
first_name VARCHAR(100),
last_name VARCHAR(100),
email VARCHAR(200),
hashdiff BINARY(16), -- Hash of all attributes
record_source VARCHAR(50),
PRIMARY KEY (customer_hashkey, load_date)
);
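Hash keys are computed at load time; a minimal sketch for populating the hub, assuming a raw staging table staging.customers_raw:
INSERT INTO hub_customer (customer_hashkey, customer_id, load_date, record_source)
SELECT
    MD5_BINARY(src.customer_id), -- 16-byte hash of the business key
    src.customer_id,
    CURRENT_TIMESTAMP(),
    'CRM'
FROM staging.customers_raw src
WHERE NOT EXISTS (
    SELECT 1 FROM hub_customer h
    WHERE h.customer_hashkey = MD5_BINARY(src.customer_id)
);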
Advanced Features
1. Streams for CDC
-- Create a stream on a table
CREATE STREAM customer_changes ON TABLE customers;
-- The stream tracks: INSERT, UPDATE, DELETE
-- Insert some data
INSERT INTO customers (customer_id, first_name, last_name) VALUES (1, 'John', 'Doe');
UPDATE customers SET first_name = 'Jane' WHERE customer_id = 1;
-- Query the stream
SELECT * FROM customer_changes;
-- Shows: METADATA$ACTION (INSERT/DELETE), METADATA$ISUPDATE
-- Process changes
MERGE INTO customer_summary s
USING customer_changes c
ON s.customer_id = c.customer_id
WHEN MATCHED AND c.METADATA$ACTION = 'DELETE' AND c.METADATA$ISUPDATE = FALSE
THEN DELETE
WHEN MATCHED AND c.METADATA$ACTION = 'INSERT' AND c.METADATA$ISUPDATE = TRUE
THEN UPDATE SET s.first_name = c.first_name, s.last_name = c.last_name
WHEN NOT MATCHED AND c.METADATA$ACTION = 'INSERT'
THEN INSERT (customer_id, first_name, last_name)
VALUES (c.customer_id, c.first_name, c.last_name);
-- Stream is automatically consumed after successful MERGE
2. Tasks for Automation
-- Create a task to run hourly
CREATE TASK refresh_summary_hourly
WAREHOUSE = etl_wh
SCHEDULE = '60 MINUTE'
AS
INSERT INTO sales_summary
SELECT
DATE_TRUNC('HOUR', sale_timestamp) AS sale_hour,
SUM(amount) AS total_sales,
COUNT(*) AS transaction_count
FROM sales
WHERE sale_timestamp >= DATEADD('HOUR', -1, CURRENT_TIMESTAMP())
GROUP BY 1;
-- Enable the task
ALTER TASK refresh_summary_hourly RESUME;
-- Create task triggered by stream
CREATE TASK process_customer_changes
WAREHOUSE = etl_wh
WHEN SYSTEM$STREAM_HAS_DATA('customer_changes')
AS
MERGE INTO customer_summary s
USING customer_changes c
ON s.customer_id = c.customer_id
-- ... (merge logic from above)
-- Chain tasks (DAG)
CREATE TASK step1
SCHEDULE = '60 MINUTE'
AS INSERT INTO stage1 SELECT * FROM source;
CREATE TASK step2
AFTER step1 -- Runs after step1 completes
AS INSERT INTO stage2 SELECT * FROM stage1;
CREATE TASK step3
AFTER step2
AS INSERT INTO final SELECT * FROM stage2;
-- Resume in reverse order
ALTER TASK step3 RESUME;
ALTER TASK step2 RESUME;
ALTER TASK step1 RESUME;
-- Monitor tasks
SHOW TASKS;
SELECT * FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY())
ORDER BY SCHEDULED_TIME DESC;
3. Stored Procedures
-- JavaScript stored procedure
CREATE OR REPLACE PROCEDURE load_sales_data(source_table STRING, target_table STRING)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
try {
// Execute SQL statements
var sql_truncate = `TRUNCATE TABLE ${TARGET_TABLE}`;
snowflake.execute({sqlText: sql_truncate});
var sql_insert = `INSERT INTO ${TARGET_TABLE} SELECT * FROM ${SOURCE_TABLE}`;
var stmt = snowflake.execute({sqlText: sql_insert});
var rows_inserted = stmt.getNumRowsAffected();
return `Successfully loaded ${rows_inserted} rows into ${TARGET_TABLE}`;
} catch (err) {
return `Error: ${err.message}`;
}
$$;
-- Call the procedure
CALL load_sales_data('staging.sales', 'prod.sales');
-- SQL stored procedure (Snowflake Scripting)
CREATE OR REPLACE PROCEDURE calculate_aggregates()
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
row_count NUMBER;
result_message STRING;
BEGIN
-- Truncate target
TRUNCATE TABLE daily_aggregates;
-- Insert aggregated data
INSERT INTO daily_aggregates
SELECT
sale_date,
SUM(amount) AS total_amount,
COUNT(*) AS transaction_count
FROM sales
WHERE sale_date >= DATEADD('day', -7, CURRENT_DATE())
GROUP BY sale_date;
-- Get affected rows
row_count := SQLROWCOUNT;
result_message := 'Processed ' || row_count || ' days';
RETURN result_message;
END;
$$;
CALL calculate_aggregates();
AWS Integration Deep Dive
1. Storage Integration with S3
The recommended way to connect Snowflake to S3 uses IAM roles (not access keys):
Step 1: Create IAM Policy in AWS
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:GetObjectVersion",
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource": [
"arn:aws:s3:::my-data-bucket/*",
"arn:aws:s3:::my-data-bucket"
]
},
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::my-data-bucket/unload/*"
]
}
]
}
Step 2: Create Storage Integration in Snowflake
-- Create storage integration
USE ROLE ACCOUNTADMIN;
CREATE STORAGE INTEGRATION s3_integration
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = S3
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-s3-role'
STORAGE_ALLOWED_LOCATIONS = ('s3://my-data-bucket/data/', 's3://my-data-bucket/raw/');
-- Get the Snowflake IAM user ARN and External ID
DESC STORAGE INTEGRATION s3_integration;
-- Note: STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID
Step 3: Update AWS IAM Role Trust Policy
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:user/abc12345-s"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "ABC123_SFCRole=1_abcdefg="
}
}
}
]
}
Step 4: Create Stage Using Integration
-- Create external stage
CREATE STAGE my_s3_stage
URL = 's3://my-data-bucket/data/'
STORAGE_INTEGRATION = s3_integration
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1);
-- Test it
LIST @my_s3_stage;
-- Load data
COPY INTO customers
FROM @my_s3_stage/customers.csv
FILE_FORMAT = (TYPE = 'CSV');
2. Snowpipe with S3 Event Notifications
Automatically load data as it arrives in S3:
Step 1: Create Pipe in Snowflake
CREATE PIPE customer_pipe
AUTO_INGEST = TRUE
AS
COPY INTO customers
FROM @my_s3_stage
FILE_FORMAT = (TYPE = 'CSV')
ON_ERROR = 'CONTINUE';
-- Get the SQS queue ARN
SHOW PIPES;
-- Note the "notification_channel" value
Step 2: Configure S3 Event Notification
{
"QueueConfigurations": [
{
"Id": "SnowflakeAutoIngest",
"QueueArn": "arn:aws:sqs:us-east-1:123456789:sf-snowpipe-AIDAIDAIDAID-abcdefg",
"Events": ["s3:ObjectCreated:*"],
"Filter": {
"Key": {
"FilterRules": [
{
"Name": "prefix",
"Value": "data/customers/"
},
{
"Name": "suffix",
"Value": ".csv"
}
]
}
}
}
]
}
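Applying that notification configuration can be done in the S3 console or, assuming the JSON above is saved locally as notification.json, via the AWS CLI:
aws s3api put-bucket-notification-configuration \
  --bucket my-data-bucket \
  --notification-configuration file://notification.json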
Step 3: Monitor Pipe
-- Check pipe status
SELECT SYSTEM$PIPE_STATUS('customer_pipe');
-- View load history
SELECT * FROM TABLE(
INFORMATION_SCHEMA.COPY_HISTORY(
TABLE_NAME => 'CUSTOMERS',
START_TIME => DATEADD(hours, -24, CURRENT_TIMESTAMP())
)
);
-- Manually refresh if needed
ALTER PIPE customer_pipe REFRESH;
3. Unloading Data to S3
-- Unload query results to S3
COPY INTO @my_s3_stage/exports/customers_
FROM (
SELECT * FROM customers
WHERE created_date >= '2024-01-01'
)
FILE_FORMAT = (TYPE = 'PARQUET')
PARTITION BY (TO_VARCHAR(DATE_TRUNC('MONTH', created_date), 'YYYY-MM'))
HEADER = TRUE
OVERWRITE = TRUE
MAX_FILE_SIZE = 104857600; -- 100 MB files
-- Unload with compression
COPY INTO @my_s3_stage/exports/customers.gz
FROM customers
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP')
SINGLE = TRUE; -- Single file
-- Unload as JSON
COPY INTO @my_s3_stage/exports/customers.json
FROM customers
FILE_FORMAT = (TYPE = 'JSON')
OVERWRITE = TRUE;
S3 Integration and External Stages
External Tables (Query S3 Without Loading)
-- Create external table pointing to S3 Parquet files
-- (define virtual columns so you can query named fields; names must match the Parquet schema)
CREATE EXTERNAL TABLE ext_customer_parquet (
customer_id NUMBER AS (value:customer_id::NUMBER),
region VARCHAR AS (value:region::VARCHAR),
total_purchases NUMBER AS (value:total_purchases::NUMBER),
last_purchase_date DATE AS (value:last_purchase_date::DATE)
)
WITH LOCATION = @my_s3_stage/parquet_data/
FILE_FORMAT = (TYPE = PARQUET)
AUTO_REFRESH = TRUE
PATTERN = '.*\.parquet';
-- Query external table
SELECT * FROM ext_customer_parquet
WHERE region = 'US-WEST'
LIMIT 100;
-- Create materialized view for performance
CREATE MATERIALIZED VIEW mv_ext_customers AS
SELECT
customer_id,
region,
total_purchases,
DATE_TRUNC('MONTH', last_purchase_date) AS month
FROM ext_customer_parquet;
-- Manually refresh if AUTO_REFRESH is false
ALTER EXTERNAL TABLE ext_customer_parquet REFRESH;
Directory Tables (Catalog S3 Files)
-- Enable directory table
ALTER STAGE my_s3_stage
SET DIRECTORY = (ENABLE = TRUE AUTO_REFRESH = TRUE);
-- Query directory metadata
SELECT *
FROM DIRECTORY(@my_s3_stage)
WHERE RELATIVE_PATH LIKE '%2024-01%';
-- Use directory table to process files selectively
COPY INTO customers
FROM (
SELECT
$1::VARCHAR AS customer_id,
$2::VARCHAR AS name,
$3::VARCHAR AS email,
METADATA$FILENAME AS source_file
FROM @my_s3_stage
)
WHERE METADATA$FILENAME IN (
SELECT RELATIVE_PATH
FROM DIRECTORY(@my_s3_stage)
WHERE LAST_MODIFIED > DATEADD('hour', -1, CURRENT_TIMESTAMP())
);
VPC and Networking
AWS PrivateLink Setup
For secure, private connectivity between AWS VPC and Snowflake:
Step 1: Enable PrivateLink in Snowflake
USE ROLE ACCOUNTADMIN;
-- For AWS regions that support PrivateLink
SELECT SYSTEM$GET_PRIVATELINK_CONFIG();
-- Returns the VPC endpoint service name
-- Example: com.amazonaws.vpce.us-east-1.vpce-svc-1234567890abcdef0
Step 2: Create VPC Endpoint in AWS
# Using AWS CLI
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-12345678 \
  --service-name com.amazonaws.vpce.us-east-1.vpce-svc-1234567890abcdef0 \
  --vpc-endpoint-type Interface \
  --subnet-ids subnet-12345678 subnet-87654321 \
  --security-group-ids sg-12345678
Step 3: Configure Network Policies
-- Create network policy to allow only PrivateLink
CREATE NETWORK POLICY privatelink_only
ALLOWED_IP_LIST = ('0.0.0.0/0') -- Or specific IPs
BLOCKED_IP_LIST = ()
COMMENT = 'Allow access via PrivateLink';
-- Apply to account
ALTER ACCOUNT SET NETWORK_POLICY = privatelink_only;
-- Or apply to specific users
ALTER USER john_doe SET NETWORK_POLICY = privatelink_only;
Network Policies
-- Allow only corporate IPs
CREATE NETWORK POLICY corporate_access
ALLOWED_IP_LIST = ('203.0.113.0/24', '198.51.100.0/24')
BLOCKED_IP_LIST = ()
COMMENT = 'Corporate network access only';
-- Allow all except blocked IPs
CREATE NETWORK POLICY block_suspicious
ALLOWED_IP_LIST = ('0.0.0.0/0')
BLOCKED_IP_LIST = ('192.0.2.0/24') -- Block suspicious range
COMMENT = 'Block specific IPs';
-- View network policies
SHOW NETWORK POLICIES;
-- Apply globally
ALTER ACCOUNT SET NETWORK_POLICY = corporate_access;
BI Tool Integrations
Tableau Integration
Connection Setup:
- In Tableau Desktop: Connect → Snowflake
- Enter details:
  - Server: account.region.snowflakecomputing.com
  - Authentication: Username/Password or OAuth
  - Warehouse, Database, Schema
Optimization Tips:
-- Create aggregated tables for Tableau
CREATE TABLE tableau_daily_sales AS
SELECT
DATE_TRUNC('DAY', sale_date) AS day,
region,
product_category,
SUM(amount) AS total_sales,
COUNT(DISTINCT customer_id) AS unique_customers
FROM sales
GROUP BY 1, 2, 3;
-- Create dedicated warehouse
CREATE WAREHOUSE tableau_wh
WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 600 -- 10 minutes (Tableau keeps connection)
AUTO_RESUME = TRUE
MAX_CLUSTER_COUNT = 5 -- Handle concurrent users
SCALING_POLICY = 'STANDARD';
-- Grant usage to Tableau service account
GRANT USAGE ON WAREHOUSE tableau_wh TO ROLE tableau_role;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.curated TO ROLE tableau_role;
Performance best practices:
- Use extracts for large datasets (published to Tableau Server)
- Use live connections for real-time dashboards
- Leverage Snowflake’s result caching
- Create materialized views for complex calculations
Power BI Integration
Connection String:
Server: account.region.snowflakecomputing.com
Warehouse: POWERBI_WH
Database: ANALYTICS_DB
Schema: CURATED
DirectQuery vs Import:
| Mode | When to Use | Pros | Cons |
|---|---|---|---|
| DirectQuery | Real-time dashboards, large datasets | Always current, no size limits | Query performance dependent on Snowflake |
| Import | Static reports, complex DAX | Fast in Power BI, offline access | Stale data, size limits |
Optimization:
-- Pre-aggregate for Power BI
CREATE VIEW powerbi_sales_summary AS
SELECT
DATE_TRUNC('DAY', sale_timestamp) AS sale_date,
store_id,
product_id,
SUM(quantity) AS total_quantity,
SUM(amount) AS total_amount,
COUNT(*) AS transaction_count
FROM sales
GROUP BY 1, 2, 3;
-- Use appropriate data types
-- Power BI works best with DATE (not TIMESTAMP) for date tables
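A tiny sketch of exposing a DATE-typed column for a Power BI date table (the view name is illustrative):
CREATE VIEW powerbi_dim_date AS
SELECT DISTINCT CAST(sale_timestamp AS DATE) AS sale_date
FROM sales;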
Looker Integration
Connection Setup in Looker:
connection: "snowflake_prod" {
url: "account.region.snowflakecomputing.com"
database: "ANALYTICS_DB"
warehouse: "LOOKER_WH"
schema: "CURATED"
max_connections: 20
connection_pooling: yes
ssl: yes
}
view: customers {
sql_table_name: PUBLIC.CUSTOMERS ;;
dimension: customer_id {
primary_key: yes
type: number
sql: ${TABLE}.CUSTOMER_ID ;;
}
dimension: email {
type: string
sql: ${TABLE}.EMAIL ;;
}
dimension_group: signup {
type: time
timeframes: [date, week, month, year]
sql: ${TABLE}.SIGNUP_DATE ;;
}
measure: count {
type: count
drill_fields: [customer_id, email]
}
}
Performance Tips:
-- Create persistent derived tables (PDTs)
-- Looker will create and refresh these tables
-- Enable query results caching
ALTER SESSION SET USE_CACHED_RESULT = TRUE;
-- Create aggregate awareness tables
CREATE TABLE daily_rollup AS
SELECT
DATE_TRUNC('DAY', event_time) AS event_day,
user_id,
COUNT(*) AS event_count
FROM events
GROUP BY 1, 2;
Python/Pandas Integration
import snowflake.connector
import pandas as pd
from snowflake.connector.pandas_tools import write_pandas
# Connect to Snowflake
conn = snowflake.connector.connect(
user='your_user',
password='your_password',
account='account.region',
warehouse='PYTHON_WH',
database='ANALYTICS_DB',
schema='PUBLIC'
)
# Query to DataFrame
df = pd.read_sql("""
SELECT * FROM customers
WHERE signup_date >= '2024-01-01'
""", conn)
# Write DataFrame to Snowflake (fast!)
write_pandas(
conn=conn,
df=df,
table_name='CUSTOMERS_UPLOAD',
database='ANALYTICS_DB',
schema='STAGING',
auto_create_table=True
)
# Using cursor for row-by-row processing
cursor = conn.cursor()
cursor.execute("SELECT * FROM large_table")
for row in cursor:
process_row(row)
# Close connection
conn.close()
# Using SQLAlchemy (for compatibility with other tools)
from sqlalchemy import create_engine
engine = create_engine(
'snowflake://user:password@account.region/database/schema?warehouse=PYTHON_WH'
)
df = pd.read_sql_table('customers', engine)
df.to_sql('customers_new', engine, if_exists='replace')
Snowflake vs AWS Redshift
Let’s compare these two popular cloud data warehouses:
Architecture Differences
| Aspect | Snowflake | Redshift |
|---|---|---|
| Architecture | Shared-data, separated compute/storage | Shared-nothing, coupled compute/storage |
| Scaling | Elastic, instant | Requires cluster resize |
| Concurrency | Multi-cluster, no resource contention | Limited by cluster size, queuing common |
| Management | Fully managed, zero-admin | Some management required |
| Pause/Resume | Instant, per-warehouse | Whole cluster, slower |
| Cloning | Zero-copy, instant | Not available |
| Time Travel | Built-in, up to 90 days | Manual (snapshots) |
Performance Comparison
-- SNOWFLAKE: No distribution keys needed
CREATE TABLE orders (
order_id NUMBER,
customer_id NUMBER,
order_date DATE,
amount NUMBER(10,2)
);
-- Automatically micro-partitioned and distributed
-- REDSHIFT: Manual distribution key required
CREATE TABLE orders (
order_id INTEGER,
customer_id INTEGER,
order_date DATE,
amount DECIMAL(10,2)
)
DISTKEY(customer_id) -- Must choose distribution
SORTKEY(order_date); -- Must choose sort key
-- SNOWFLAKE: No VACUUM needed
-- Updates handled automatically
-- REDSHIFT: Regular VACUUM required
VACUUM orders;
ANALYZE orders;
Feature Comparison Table
| Feature | Snowflake | Redshift | Winner |
|---|---|---|---|
| Setup Complexity | Minutes | Hours | ❄️ Snowflake |
| Concurrency | Excellent (multi-cluster) | Good (WLM) | ❄️ Snowflake |
| Query Performance | Excellent | Excellent | 🤝 Tie |
| Data Sharing | Native, instant | Manual export/import | ❄️ Snowflake |
| Semi-structured Data | Native VARIANT type | JSON strings | ❄️ Snowflake |
| Cost Model | Compute (credits) + Storage | Instance hours + Storage | 🤷 Depends |
| AWS Integration | Good | Excellent | 🔴 Redshift |
| Multi-cloud | AWS, Azure, GCP | AWS only | ❄️ Snowflake |
| Tool Ecosystem | Growing | Mature | 🔴 Redshift |
| Maintenance | Zero | Moderate | ❄️ Snowflake |
Cost Comparison Example
Scenario: 5TB data, 100 concurrent users, 24/7 availability
Snowflake:
Storage: 5TB × $40/TB/month = $200/month
Compute:
- Medium WH for BI: 4 credits/hr × 730 hrs × $4/credit = $11,680/month at 100% utilization
- With auto-suspend and multi-cluster (1-5), ~20% utilization: ~$2,336/month
Total: ~$2,536/month
Redshift:
Cluster: ra3.4xlarge × 2 nodes
= $3.26/hr × 2 × 730 hrs = $4,760/month
Storage: RA3 managed storage billed separately (roughly $0.024/GB-month, ~$125 for 5TB)
Total: ~$4,885/month if the cluster runs 24/7
Snowflake wins on cost if:
- Workload is bursty (can auto-suspend)
- Need multiple isolated workloads
- Development/testing environments
Redshift wins on cost if:
- Truly 24/7 workload
- Deeply integrated with AWS services
- Committed to reserved instances
Migration: Redshift → Snowflake
-- 1. Export from Redshift to S3
-- In Redshift:
UNLOAD ('SELECT * FROM customers')
TO 's3://migration-bucket/customers/'
IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftUnloadRole'
FORMAT AS PARQUET
PARALLEL ON;
-- 2. Create storage integration in Snowflake
CREATE STORAGE INTEGRATION redshift_migration
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = S3
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789:role/SnowflakeLoadRole'
STORAGE_ALLOWED_LOCATIONS = ('s3://migration-bucket/');
-- 3. Create stage and load
CREATE STAGE migration_stage
URL = 's3://migration-bucket/'
STORAGE_INTEGRATION = redshift_migration
FILE_FORMAT = (TYPE = 'PARQUET');
-- Staged Parquet is exposed as a single VARIANT column ($1), so cast the fields you need
CREATE TABLE customers AS
SELECT $1:customer_id::NUMBER AS customer_id,
       $1:first_name::VARCHAR AS first_name,
       $1:last_name::VARCHAR AS last_name,
       $1:email::VARCHAR AS email
FROM @migration_stage/customers/;
-- 4. Schema conversion notes:
-- Redshift → Snowflake
-- DISTKEY/SORTKEY → Remove (automatic)
-- VARCHAR(MAX) → VARCHAR or STRING
-- ENCODE → Remove (automatic compression)
-- IDENTITY → AUTOINCREMENT
-- ANALYZE/VACUUM → Remove (automatic)
System Tables and Metadata
Snowflake provides extensive metadata through system tables and views:
Account Usage Schema (Historical Data)
-- 1. Query History (up to 1 year)
SELECT
query_id,
query_text,
user_name,
warehouse_name,
execution_status,
start_time,
end_time,
DATEDIFF('second', start_time, end_time) AS duration_sec,
total_elapsed_time / 1000 AS total_sec,
bytes_scanned / POWER(1024, 3) AS gb_scanned,
rows_produced
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
AND warehouse_name = 'PROD_WH'
ORDER BY start_time DESC
LIMIT 100;
-- 2. Storage Usage Over Time
SELECT
usage_date,
storage_bytes / POWER(1024, 4) AS storage_tb,
stage_bytes / POWER(1024, 4) AS stage_tb,
failsafe_bytes / POWER(1024, 4) AS failsafe_tb
FROM snowflake.account_usage.storage_usage
WHERE usage_date >= DATEADD('month', -3, CURRENT_DATE())
ORDER BY usage_date DESC;
-- 3. Warehouse Metering (Credit Consumption)
SELECT
warehouse_name,
DATE_TRUNC('DAY', start_time) AS day,
SUM(credits_used) AS daily_credits,
SUM(credits_used) * 4 AS approx_cost_usd
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('month', -1, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1, 2 DESC;
-- 4. Login History (Security Audit)
SELECT
event_timestamp,
user_name,
client_ip,
reported_client_type,
first_authentication_factor,
is_success,
error_code,
error_message
FROM snowflake.account_usage.login_history
WHERE event_timestamp >= DATEADD('day', -7, CURRENT_TIMESTAMP())
AND is_success = 'NO' -- Failed logins
ORDER BY event_timestamp DESC;
-- 5. Data Transfer History (Egress Costs)
SELECT
DATE_TRUNC('DAY', start_time) AS day,
source_cloud,
source_region,
target_cloud,
target_region,
bytes_transferred / POWER(1024, 3) AS gb_transferred
FROM snowflake.account_usage.data_transfer_history
WHERE start_time >= DATEADD('month', -1, CURRENT_TIMESTAMP())
ORDER BY day DESC;
-- 6. Table Storage Details
SELECT
table_catalog || '.' || table_schema || '.' || table_name AS full_name,
active_bytes / POWER(1024, 3) AS active_gb,
time_travel_bytes / POWER(1024, 3) AS time_travel_gb,
failsafe_bytes / POWER(1024, 3) AS failsafe_gb,
(active_bytes + time_travel_bytes + failsafe_bytes) / POWER(1024, 3) AS total_gb
FROM snowflake.account_usage.table_storage_metrics
WHERE deleted = FALSE -- Exclude dropped tables
ORDER BY total_gb DESC
LIMIT 50;
-- 7. Copy History (Data Loading)
SELECT
table_name,
stage_location,
file_name,
status,
row_count,
row_parsed,
first_error_message,
execution_time / 1000 AS load_seconds
FROM snowflake.account_usage.copy_history
WHERE table_name = 'CUSTOMERS'
AND last_load_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY last_load_time DESC;
-- 8. Access History (Column-level Auditing)
SELECT
query_id,
query_start_time,
user_name,
direct_objects_accessed,
base_objects_accessed,
objects_modified
FROM snowflake.account_usage.access_history
WHERE query_start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
AND ARRAY_SIZE(objects_modified) > 0 -- Only modifications
ORDER BY query_start_time DESC;
Information Schema (Current State)
-- 1. List all tables with row counts
SELECT
table_catalog,
table_schema,
table_name,
table_type,
row_count,
bytes / POWER(1024, 3) AS size_gb,
created,
last_altered
FROM information_schema.tables
WHERE table_schema NOT IN ('INFORMATION_SCHEMA')
ORDER BY bytes DESC;
-- 2. Get table DDL
SELECT GET_DDL('TABLE', 'CUSTOMERS');
SELECT GET_DDL('VIEW', 'ACTIVE_CUSTOMERS');
SELECT GET_DDL('DATABASE', 'ANALYTICS_DB');
-- 3. Column details
SELECT
table_schema,
table_name,
column_name,
ordinal_position,
data_type,
is_nullable,
column_default
FROM information_schema.columns
WHERE table_name = 'CUSTOMERS'
ORDER BY ordinal_position;
-- 4. View dependencies
SELECT
*
FROM information_schema.view_dependencies
WHERE referenced_object_name = 'CUSTOMERS';
Monitoring and Troubleshooting
Essential Monitoring Queries
-- 1. Currently running queries (ACCOUNT_USAGE lags by up to ~45 minutes, so use the INFORMATION_SCHEMA table function for live queries)
SELECT
query_id,
user_name,
warehouse_name,
execution_status,
DATEDIFF('second', start_time, CURRENT_TIMESTAMP()) AS running_seconds,
query_text
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE execution_status = 'RUNNING'
ORDER BY start_time;
-- 2. Long-running queries (last 24 hours)
SELECT
query_id,
user_name,
warehouse_name,
execution_time / 1000 AS duration_seconds,
bytes_scanned / POWER(1024, 3) AS gb_scanned,
LEFT(query_text, 100) AS query_snippet
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('hour', -24, CURRENT_TIMESTAMP())
AND execution_time > 300000 -- > 5 minutes
ORDER BY execution_time DESC;
-- 3. Warehouse utilization
SELECT
warehouse_name,
DATE_TRUNC('HOUR', start_time) AS hour,
AVG(avg_running) AS avg_queries_running,
MAX(avg_queued_load) AS max_queued
FROM snowflake.account_usage.warehouse_load_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1, 2 DESC;
-- 4. Failed queries
SELECT
query_id,
user_name,
error_code,
error_message,
LEFT(query_text, 100) AS query_snippet
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('hour', -24, CURRENT_TIMESTAMP())
AND execution_status = 'FAIL'
ORDER BY start_time DESC;
-- 5. Storage usage trends
SELECT
DATE_TRUNC('DAY', usage_date) AS day,
AVG(storage_bytes) / POWER(1024, 3) AS avg_storage_gb,
AVG(stage_bytes) / POWER(1024, 3) AS avg_stage_gb,
AVG(failsafe_bytes) / POWER(1024, 3) AS avg_failsafe_gb
FROM snowflake.account_usage.storage_usage
WHERE usage_date >= DATEADD('month', -1, CURRENT_DATE())
GROUP BY 1
ORDER BY 1;
Performance Troubleshooting Guide
-- Problem: Query running slow
-- Step 1: Check if warehouse is overwhelmed
SELECT
warehouse_name,
avg_running,
avg_queued_load,
avg_blocked
FROM snowflake.account_usage.warehouse_load_history
WHERE start_time >= DATEADD('hour', -1, CURRENT_TIMESTAMP())
AND warehouse_name = 'YOUR_WAREHOUSE'
ORDER BY start_time DESC
LIMIT 10;
-- Step 2: Check query profile for the slow query
-- Use UI: Query -> Profile tab
-- Look for:
-- - High "Bytes spilled to local/remote storage"
-- - Many partitions scanned
-- - Inefficient joins
-- Step 3: Check if data is properly clustered
-- (SYSTEM$CLUSTERING_INFORMATION expects the table name as a string literal)
SELECT SYSTEM$CLUSTERING_INFORMATION('YOUR_SCHEMA.YOUR_TABLE');
Data Sharing and Marketplace
One of Snowflake’s killer features: share data without copying!
Creating a Data Share
-- As provider:
USE ROLE ACCOUNTADMIN;
-- Create share
CREATE SHARE customer_data_share;
-- Add database to share
GRANT USAGE ON DATABASE analytics_db TO SHARE customer_data_share;
GRANT USAGE ON SCHEMA analytics_db.public TO SHARE customer_data_share;
-- Add specific objects
GRANT SELECT ON TABLE analytics_db.public.customers TO SHARE customer_data_share;
GRANT SELECT ON VIEW analytics_db.public.customer_summary TO SHARE customer_data_share;
-- Add consumer account
ALTER SHARE customer_data_share
ADD ACCOUNTS = abc12345, xyz67890;
-- Show what's shared
SHOW GRANTS TO SHARE customer_data_share;
-- Add secure view with row filtering
CREATE SECURE VIEW shared_customer_view AS
SELECT
    customer_id,
    first_name,
    last_name,
    region
FROM customers
WHERE region IN (
    SELECT region
    FROM consumer_allowed_regions
    WHERE account = CURRENT_ACCOUNT()  -- evaluates to the consumer's account when they query
);
GRANT SELECT ON VIEW shared_customer_view TO SHARE customer_data_share;
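The consumer_allowed_regions table referenced in the view is a mapping you maintain on the provider side. A minimal sketch (table name, columns, and values are illustrative):
-- Map consumer account locators to the regions they may see
CREATE TABLE consumer_allowed_regions (
    account VARCHAR,   -- consumer account locator, matched via CURRENT_ACCOUNT()
    region  VARCHAR
);
INSERT INTO consumer_allowed_regions VALUES ('ABC12345', 'US-WEST');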
Consuming a Data Share
-- As consumer:
-- List available shares
SHOW SHARES;
-- Create database from share
CREATE DATABASE shared_customer_data
FROM SHARE provider_account.customer_data_share;
-- Query shared data (no copy!)
SELECT * FROM shared_customer_data.public.customers
WHERE region = 'US-WEST';
-- Shared data usage doesn't affect provider's compute
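One practical note on the consumer side: only the role that created the database from the share (typically ACCOUNTADMIN) can query it until you grant IMPORTED PRIVILEGES. The role name below is just a placeholder:
GRANT IMPORTED PRIVILEGES ON DATABASE shared_customer_data TO ROLE analyst_role;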
Snowflake Marketplace
-- Browse available datasets
-- Go to: Snowsight > Data Products > Marketplace
-- Get free weather data (example)
-- 1. Find "Weather Source" in Marketplace
-- 2. Click "Get"
-- 3. Creates database automatically
-- Query marketplace data
SELECT *
FROM weather_source.standard_tile.forecast_day
WHERE postal_code = '10001'
AND forecast_date = CURRENT_DATE();
-- Combine with your data
SELECT
o.order_date,
o.total_sales,
w.avg_temperature_air_2m_f,
w.tot_precipitation_in
FROM analytics_db.public.daily_sales o
JOIN weather_source.standard_tile.history_day w
ON o.store_zip = w.postal_code
AND o.order_date = w.date_valid_std
WHERE o.order_date >= DATEADD('month', -1, CURRENT_DATE());
Migration Strategies
From Traditional Databases (Oracle, SQL Server, MySQL)
Step 1: Assessment
-- Inventory source system
-- Document:
-- - Table sizes
-- - Data types
-- - Stored procedures
-- - ETL processes
-- - BI tool connections
-- - Query patterns
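For the table-size inventory, the source system's catalog views are usually enough. A minimal sketch for SQL Server (one option among many; Oracle and MySQL have equivalent catalogs):
-- Rough row counts per table on SQL Server
SELECT
    t.name AS table_name,
    SUM(p.rows) AS row_count
FROM sys.tables t
JOIN sys.partitions p
    ON p.object_id = t.object_id
    AND p.index_id IN (0, 1)   -- heap or clustered index only
GROUP BY t.name
ORDER BY row_count DESC;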
Step 2: Schema Conversion
-- Oracle → Snowflake conversions:
-- NUMBER(n) → NUMBER(n, 0)
-- NUMBER(n,m) → NUMBER(n, m)
-- VARCHAR2(n) → VARCHAR(n)
-- DATE → DATE or TIMESTAMP
-- CLOB → STRING or VARCHAR (max 16 MB)
-- RAW/BLOB → BINARY
-- Sequence → AUTOINCREMENT or sequence
-- SQL Server → Snowflake:
-- INT → NUMBER(38, 0)
-- BIGINT → NUMBER(38, 0)
-- VARCHAR(MAX) → STRING
-- DATETIME → TIMESTAMP_NTZ
-- UNIQUEIDENTIFIER → VARCHAR(36)
-- IDENTITY → AUTOINCREMENT
-- Create tables in Snowflake
CREATE TABLE customers (
customer_id NUMBER AUTOINCREMENT,
created_date TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
-- ...
);
Step 3: Data Migration
-- Option 1: Export to CSV, upload to S3, COPY into Snowflake
-- (Covered earlier)
-- Option 2: Use third-party tools
-- - Fivetran
-- - Matillion
-- - AWS DMS (Database Migration Service)
-- - Talend
-- - Informatica
-- Option 3: Snowflake connector for Spark
-- For large-scale migrations
Step 4: Procedure/Function Migration
-- Convert PL/SQL or T-SQL to Snowflake SQL or JavaScript
-- Oracle PL/SQL:
/*
CREATE OR REPLACE PROCEDURE update_customer_tier IS
BEGIN
UPDATE customers
SET tier = 'GOLD'
WHERE total_purchases > 10000;
COMMIT;
END;
*/
-- Snowflake equivalent:
CREATE OR REPLACE PROCEDURE update_customer_tier()
RETURNS STRING
LANGUAGE SQL
AS
$$
BEGIN
UPDATE customers
SET tier = 'GOLD'
WHERE total_purchases > 10000;
RETURN 'Updated ' || SQLROWCOUNT || ' customers';
END;
$$;
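Once converted, the procedure is invoked with CALL, which also returns the status string:
CALL update_customer_tier();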
Parallel Running Strategy
-- Phase 1: Shadow mode (dual-write)
-- Write to both old system and Snowflake
-- Compare results
-- Phase 2: Read from Snowflake for non-critical workloads
-- Keep old system as backup
-- Phase 3: Full cutover
-- Decommission old system
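During the shadow phase, a lightweight reconciliation check helps confirm the two systems agree. A minimal sketch, assuming you land the legacy system's row counts into a (hypothetical) legacy_row_counts table in Snowflake:
-- Flag tables whose row counts diverge between legacy and Snowflake
SELECT
    l.table_name,
    l.row_count AS legacy_rows,
    t.row_count AS snowflake_rows,
    t.row_count - l.row_count AS diff
FROM legacy_row_counts l
JOIN information_schema.tables t
    ON t.table_name = l.table_name
    AND t.table_schema = 'PUBLIC'   -- adjust to your target schema
WHERE t.row_count <> l.row_count;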
Real-World Architecture Patterns
Pattern 1: Modern Data Stack
Data Sources (SaaS, Databases, APIs)
↓
Fivetran/Airbyte (Ingestion)
↓
Snowflake (Storage & Transformation)
↓
dbt (Transformation Layer)
↓
BI Tools (Tableau, Looker, Mode)
↓
Reverse ETL (Census, Hightouch)
↓
Operational Systems
Implementation:
-- 1. Raw data layer (from Fivetran)
CREATE SCHEMA raw_data;
-- 2. Staging layer (dbt)
CREATE SCHEMA staging;
-- 3. Marts (dbt)
CREATE SCHEMA analytics;
CREATE SCHEMA sales_mart;
CREATE SCHEMA marketing_mart;
-- Sample dbt model
-- models/staging/stg_customers.sql
{{
config(
materialized='view'
)
}}
SELECT
customer_id,
LOWER(TRIM(email)) AS email,
created_at::TIMESTAMP_NTZ AS created_timestamp
FROM {{ source('raw_data', 'customers') }}
WHERE customer_id IS NOT NULL
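Downstream marts then build on the staging layer with ref(). A hypothetical mart model (models/analytics/dim_customers.sql) might look like:
{{
  config(
    materialized='table'
  )
}}

SELECT
    customer_id,
    email,
    created_timestamp,
    DATEDIFF('day', created_timestamp, CURRENT_TIMESTAMP()) AS days_since_signup
FROM {{ ref('stg_customers') }}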
Pattern 2: Data Lake + Data Warehouse
S3 Data Lake (Raw Files)
↓
External Tables / Snowpipe → Snowflake
↓
Transformation
↓
Curated Layer
Implementation:
-- Query data lake directly via an external table
-- (partition columns must be defined from the file path metadata;
--  adjust the SPLIT_PART index to match your S3 key layout)
CREATE EXTERNAL TABLE raw_events (
    event_date DATE AS TO_DATE(SPLIT_PART(metadata$filename, '/', 2))
)
PARTITION BY (event_date)
WITH LOCATION = @s3_stage/events/
FILE_FORMAT = (TYPE = 'PARQUET');
-- Stream to internal tables
CREATE PIPE load_events
AUTO_INGEST = TRUE
AS
COPY INTO events_internal
FROM @s3_stage/events/;
-- Transform
CREATE TABLE events_processed AS
SELECT
event_id,
user_id,
event_type,
DATE_TRUNC('HOUR', event_timestamp) AS event_hour,
properties
FROM events_internal
WHERE event_type IN ('purchase', 'signup', 'cancellation');
Pattern 3: Multi-Tenant SaaS
-- Separate databases per customer
CREATE DATABASE customer_abc_prod;
CREATE DATABASE customer_xyz_prod;
-- Or single database with row-level security
CREATE TABLE saas_data (
tenant_id VARCHAR,
record_id NUMBER,
data VARCHAR
);
CREATE ROW ACCESS POLICY tenant_isolation
AS (tenant_id VARCHAR) RETURNS BOOLEAN ->
tenant_id = CURRENT_USER()
OR 'ADMIN' = CURRENT_ROLE();
ALTER TABLE saas_data
ADD ROW ACCESS POLICY tenant_isolation ON (tenant_id);
-- Users automatically see only their data
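In practice, most teams drive tenant isolation from a mapping table rather than matching tenant_id to CURRENT_USER() directly. A sketch, where tenant_role_map is a hypothetical (role_name, tenant_id) lookup table:
-- Allow a role to see only the tenants mapped to it
CREATE ROW ACCESS POLICY tenant_isolation_mapped
AS (row_tenant_id VARCHAR) RETURNS BOOLEAN ->
    'ADMIN' = CURRENT_ROLE()
    OR EXISTS (
        SELECT 1
        FROM tenant_role_map
        WHERE role_name = CURRENT_ROLE()
          AND tenant_id = row_tenant_id
    );

-- Swap it in (a column can be governed by only one row access policy at a time)
ALTER TABLE saas_data DROP ROW ACCESS POLICY tenant_isolation;
ALTER TABLE saas_data ADD ROW ACCESS POLICY tenant_isolation_mapped ON (tenant_id);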
Pattern 4: Real-Time Analytics
Application Databases
↓
CDC (Debezium/Kafka)
↓
S3 / Event Hub
↓
Snowpipe
↓
Stream
↓
Task
↓
Materialized View
↓
BI Dashboard
Implementation:
-- 1. Snowpipe loads data continuously
CREATE PIPE real_time_events
AUTO_INGEST = TRUE
AS COPY INTO events FROM @s3_stage/events/;
-- 2. Stream captures changes
CREATE STREAM event_stream ON TABLE events;
-- 3. Task processes stream
CREATE TASK process_events
WAREHOUSE = real_time_wh
SCHEDULE = '1 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('event_stream')
AS
INSERT INTO events_aggregated
SELECT
DATE_TRUNC('MINUTE', event_time) AS minute,
event_type,
COUNT(*) AS event_count
FROM event_stream
GROUP BY 1, 2;
-- Tasks are created suspended; resume it so the schedule kicks in
ALTER TASK process_events RESUME;
-- 4. Materialized view for dashboards
-- (materialized views can't use non-deterministic functions like
--  CURRENT_TIMESTAMP(), so pre-aggregate by hour and filter at query time)
CREATE MATERIALIZED VIEW events_by_hour AS
SELECT
    DATE_TRUNC('HOUR', event_time) AS event_hour,
    event_type,
    COUNT(*) AS event_count,
    MAX(event_time) AS last_seen
FROM events
GROUP BY 1, 2;

-- Dashboards then query the most recent hour:
SELECT * FROM events_by_hour
WHERE event_hour >= DATEADD('hour', -1, CURRENT_TIMESTAMP());
Final Thoughts & Resources
Congratulations! You’ve completed this comprehensive Snowflake guide. Let’s recap what you’ve learned:
Key Topics Covered
Foundation:
- ✅ Snowflake’s unique three-layer architecture
- ✅ Virtual warehouses and micro-partitions
- ✅ Core components and data organization
- ✅ Time Travel and Zero-Copy Cloning
Getting Started:
- ✅ Initial setup and context management
- ✅ Loading data via multiple methods
- ✅ Working with databases and schemas
Best Practices:
- ✅ Schema design and data type selection
- ✅ Performance optimization techniques
- ✅ Warehouse sizing and management
- ✅ Query optimization strategies
- ✅ Cost management and monitoring
Security & Modeling:
- ✅ RBAC and access control
- ✅ Row and column-level security
- ✅ Data modeling patterns (Star, SCD, Data Vault)
- ✅ Advanced features (Streams, Tasks, Procedures)
Integrations:
- ✅ AWS integration (S3, IAM, PrivateLink)
- ✅ BI tool connections (Tableau, Power BI, Looker)
- ✅ System tables and monitoring
- ✅ Python/Pandas integration
Advanced Topics:
- ✅ Snowflake vs Redshift comparison
- ✅ Data sharing and marketplace
- ✅ Migration strategies
- ✅ Real-world architecture patterns
Key Takeaways
1. Start Simple, Scale as Needed
- Begin with X-Small warehouses
- Enable auto-suspend and auto-resume
- Monitor and optimize based on actual usage
2. Embrace Snowflake’s Philosophy
- Let Snowflake handle optimization
- Don’t over-engineer (no indexes, no partitioning decisions)
- Use modern data stack tools (dbt, Fivetran, etc.)
3. Cost Control is Crucial
- Set resource monitors
- Use transient tables for staging
- Optimize warehouse sizes
- Monitor query performance
4. Security First
- Use IAM roles, not access keys
- Implement network policies
- Use row-level and column-level security
- Audit access via system tables
5. Leverage Unique Features
- Data sharing for collaboration
- Time Travel for recovery
- Zero-copy cloning for dev/test
- Streams and Tasks for automation
Next Steps
Beginners:
- Sign up for Snowflake trial (30 days free, $400 credits)
- Complete Snowflake’s hands-on labs
- Load sample data and run queries
- Connect a BI tool (Tableau/Power BI)
Intermediate:
- Implement dbt for transformations
- Set up automated pipelines with Tasks
- Create materialized views for dashboards
- Optimize costs using monitoring queries
Advanced:
- Implement Data Vault or complex data models
- Set up cross-cloud replication
- Build real-time pipelines with Streams/Tasks
- Share data with partners via Data Sharing
Resources
Official Documentation:
- docs.snowflake.com
- Snowflake Community (community.snowflake.com)
- Snowflake University (training courses)
Community:
- r/snowflake on Reddit
- Snowflake LinkedIn group
- Medium articles on Snowflake
Tools Ecosystem:
- dbt (getdbt.com)
- Fivetran (fivetran.com)
- Census/Hightouch (reverse ETL)
- Great Expectations (data quality)
Quick Reference Commands
-- Context management
SHOW DATABASES;
SHOW SCHEMAS;
SHOW TABLES;
SHOW WAREHOUSES;
-- Resource monitoring
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE START_TIME >= DATEADD('day', -7, CURRENT_TIMESTAMP());
-- Session info
SELECT CURRENT_ACCOUNT(), CURRENT_REGION(), CURRENT_VERSION();
-- Query history
SELECT * FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
ORDER BY START_TIME DESC LIMIT 10;
-- Table info
SELECT GET_DDL('TABLE', 'table_name');
DESCRIBE TABLE table_name;
-- Warehouse operations
ALTER WAREHOUSE my_wh SUSPEND;
ALTER WAREHOUSE my_wh RESUME;
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';
-- Time Travel
SELECT * FROM table_name AT(OFFSET => -3600);  -- 1 hour ago (offset is in seconds)
SELECT * FROM table_name BEFORE(STATEMENT => 'query_id');
-- Cloning
CREATE TABLE new_table CLONE existing_table;
CREATE DATABASE new_db CLONE existing_db;
Thank You!
Thanks for reading this comprehensive guide! Snowflake is a powerful platform that continues to evolve. Keep learning, experimenting, and building awesome data solutions.
Happy Snowflaking! ❄️☁️
Questions or feedback? Connect with me on LinkedIn or leave a comment below. I’d love to hear about your Snowflake journey!
Found this helpful? Share it with your team and fellow data enthusiasts! ⭐
