Use the table of contents below to navigate.
Table of Contents
Foundation & Architecture
- Introduction
- What Makes Snowflake Different?
- Snowflake Architecture
- Core Components
- Data Storage in Snowflake
- Compute Layer: Virtual Warehouses
Getting Started
- Getting Started with Snowflake
- Working with Databases and Schemas
- Loading Your First Data
Best Practices & Optimization
- Schema Design Best Practices
- Performance Optimization
- Warehouse Sizing and Management
- Query Optimization Techniques
- Cost Management Strategies
Security & Data Modeling
- Security and Access Control
- Data Modeling Patterns
- Advanced Features
Integrations & Monitoring
- AWS Integration Deep Dive
- S3 Integration and External Stages
- VPC and Networking
- BI Tool Integrations
- Snowflake vs AWS Redshift
- System Tables and Metadata
- Monitoring and Troubleshooting
Advanced Topics
- Data Sharing and Marketplace
- Migration Strategies
- Real-World Architecture Patterns
- Final Thoughts & Resources
Introduction
Hey there! If you’re reading this, you’ve probably heard about Snowflake and are wondering what all the buzz is about. Maybe you’re tired of managing database infrastructure, or perhaps you’re looking for a modern alternative to traditional data warehouses like Redshift. Whatever brought you here, you’re in the right place.
Snowflake isn’t just another database—it’s a complete data platform built from the ground up for the cloud. No hardware to manage, no indexes to tune (yes, really!), and it scales like a dream. This comprehensive guide will walk you through everything you need to know about Snowflake, from architecture to advanced integrations, performance optimization, and real-world implementation patterns.
Let’s dive in!
What Makes Snowflake Different?
Before we get into the technical details, let’s talk about what sets Snowflake apart:
True Multi-Cloud Platform: Snowflake runs natively on AWS, Azure, and Google Cloud. You can even replicate data across clouds seamlessly.
Separation of Storage and Compute: Unlike traditional databases where storage and compute are tightly coupled, Snowflake separates them completely. This means you can:
- Scale compute without moving data
- Pause compute when not in use (and stop paying for it!)
- Run multiple compute clusters against the same data simultaneously
Zero Management: No infrastructure to provision, no vacuum operations, no index management, no partitioning decisions (well, mostly).
Time Travel and Cloning: Want to query data as it existed 3 days ago? Or create an instant copy of a 10TB database for testing? Snowflake makes it trivial.
Snowflake Architecture
Snowflake uses a unique hybrid architecture that combines the best of shared-disk and shared-nothing architectures. Let’s break it down:
┌─────────────────────────────────────────────────────────┐
│ CLOUD SERVICES LAYER │
│ (Authentication, Metadata, Query Optimization, etc.) │
└─────────────────────────────────────────────────────────┘
↕
┌─────────────────────────────────────────────────────────┐
│ COMPUTE LAYER │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Virtual │ │ Virtual │ │ Virtual │ │
│ │Warehouse │ │Warehouse │ │Warehouse │ ... │
│ │ (VW1) │ │ (VW2) │ │ (VW3) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────┘
↕
┌─────────────────────────────────────────────────────────┐
│ STORAGE LAYER │
│ (Centralized, Optimized Columnar) │
│ All data stored in S3/Blob │
└─────────────────────────────────────────────────────────┘
The Three Layers
1. Cloud Services Layer
This is the brain of Snowflake. It handles:
- Authentication and access control
- Query parsing and optimization
- Metadata management
- Infrastructure management
2. Compute Layer (Virtual Warehouses)
These are independent compute clusters that execute your queries. Each warehouse:
- Operates independently (no resource contention)
- Can be sized from X-Small to 6X-Large
- Caches data locally for performance
- Can be started, stopped, or suspended automatically
3. Storage Layer
All your data lives here in a compressed, columnar format. The beauty? It’s completely abstracted from you. Snowflake automatically:
- Compresses your data
- Organizes it into micro-partitions
- Maintains statistics
- Handles encryption
Core Components
Let’s understand the key objects you’ll work with:
| Component | Description | Example |
|---|---|---|
| Database | Top-level container for schemas | SALES_DB, ANALYTICS_DB |
| Schema | Namespace within a database | PUBLIC, STAGING, PROD |
| Table | Where your data lives | CUSTOMERS, ORDERS |
| View | Saved query definition | VW_ACTIVE_CUSTOMERS |
| Virtual Warehouse | Compute resource | LOADING_WH, REPORTING_WH |
| Stage | Location for data files | @my_s3_stage, @%customers |
| File Format | Defines how to parse files | CSV_FORMAT, JSON_FORMAT |
| Pipe | Continuous data ingestion | PIPE_CUSTOMER_UPDATES |
| Stream | CDC (Change Data Capture) | CUSTOMER_CHANGES_STREAM |
| Task | Scheduled SQL execution | DAILY_REFRESH_TASK |
Data Storage in Snowflake
One of Snowflake’s superpowers is how it stores data. Let’s demystify this:
Micro-Partitions
Snowflake automatically divides your tables into micro-partitions:
- Each micro-partition contains 50-500 MB of uncompressed data
- They’re immutable (never modified, only replaced)
- Metadata includes min/max values, null counts, etc.
- Queries automatically prune irrelevant partitions
-- You don't need to define partitioning!
-- Snowflake handles it automatically based on ingestion order
CREATE TABLE orders (
order_id NUMBER,
customer_id NUMBER,
order_date DATE,
amount DECIMAL(10,2)
);
-- As data loads, Snowflake creates micro-partitions
-- and maintains metadata automatically
Clustering
While Snowflake auto-clusters data by insertion order, you can define clustering keys for large tables:
-- For a 1TB+ table with frequent date filters
ALTER TABLE orders
CLUSTER BY (order_date);
-- Check clustering quality
SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(order_date)');
When to use clustering:
- Tables > 1TB
- Queries consistently filter on specific columns
- Query performance is critical
- Cost justifies automatic re-clustering
Compute Layer: Virtual Warehouses
Virtual Warehouses (VW) are where the magic happens. Think of them as independent compute clusters.
Warehouse Sizes
| Size | Credits/Hour | Servers | Use Case |
|---|---|---|---|
| X-Small | 1 | 1 | Development, small queries |
| Small | 2 | 2 | Light production workloads |
| Medium | 4 | 4 | Standard workloads |
| Large | 8 | 8 | Heavy queries, large data |
| X-Large | 16 | 16 | Very large analytical queries |
| 2X-Large | 32 | 32 | Massive parallel processing |
| 3X-Large | 64 | 64 | Extreme workloads |
| 4X-Large | 128 | 128 | Rarely needed |
Creating and Configuring Warehouses
-- Create a warehouse for loading data
CREATE WAREHOUSE loading_wh
WAREHOUSE_SIZE = 'LARGE'
AUTO_SUSPEND = 60 -- Suspend after 1 min of inactivity
AUTO_RESUME = TRUE -- Auto-start when query submitted
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 3 -- Scale out to handle concurrency
SCALING_POLICY = 'STANDARD' -- STANDARD or ECONOMY
COMMENT = 'Warehouse for data loading operations';
-- Create a warehouse for BI queries
CREATE WAREHOUSE reporting_wh
WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 300 -- 5 minutes for BI tools
AUTO_RESUME = TRUE
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 5;
-- Modify warehouse size on the fly
ALTER WAREHOUSE loading_wh SET WAREHOUSE_SIZE = 'X-LARGE';
-- Suspend/resume manually
ALTER WAREHOUSE loading_wh SUSPEND;
ALTER WAREHOUSE loading_wh RESUME;
Multi-Cluster Warehouses
For handling concurrent users, Snowflake can automatically add clusters:
CREATE WAREHOUSE concurrent_wh
WAREHOUSE_SIZE = 'MEDIUM'
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 10 -- Scale up to 10 clusters
SCALING_POLICY = 'STANDARD'; -- Add cluster when queue builds
-- STANDARD: Starts clusters aggressively (favors performance)
-- ECONOMY: Conservative, waits before scaling (favors cost)
Getting Started with Snowflake
Let’s get practical. Here’s how to start using Snowflake:
1. Initial Setup
-- Set your context
USE ROLE ACCOUNTADMIN;
-- Create a database
CREATE DATABASE my_analytics_db;
-- Create a warehouse
CREATE WAREHOUSE my_wh
WITH WAREHOUSE_SIZE = 'SMALL'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;
-- Create a user
CREATE USER john_doe
PASSWORD = 'SecureP@ssw0rd!'
DEFAULT_ROLE = 'ANALYST'
DEFAULT_WAREHOUSE = 'my_wh'
DEFAULT_NAMESPACE = 'my_analytics_db.public';
-- Grant privileges
GRANT USAGE ON WAREHOUSE my_wh TO ROLE analyst;
GRANT USAGE ON DATABASE my_analytics_db TO ROLE analyst;
GRANT USAGE ON SCHEMA my_analytics_db.public TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA my_analytics_db.public TO ROLE analyst;
2. Understanding Context
In Snowflake, you always work within a context:
-- Show current context
SELECT CURRENT_ROLE(), CURRENT_WAREHOUSE(),
CURRENT_DATABASE(), CURRENT_SCHEMA();
-- Set context
USE ROLE ANALYST;
USE WAREHOUSE my_wh;
USE DATABASE my_analytics_db;
USE SCHEMA public;
-- Or set all at once
USE ROLE ANALYST;
USE WAREHOUSE my_wh;
USE SCHEMA my_analytics_db.public;
Working with Databases and Schemas
Snowflake organizes data hierarchically:
-- Create a database with additional options
CREATE DATABASE sales_db
DATA_RETENTION_TIME_IN_DAYS = 7 -- Time Travel days
COMMENT = 'Sales department data';
-- Create schemas for different purposes
CREATE SCHEMA sales_db.staging
COMMENT = 'Staging area for raw data';
CREATE SCHEMA sales_db.production
COMMENT = 'Production-ready tables';
CREATE SCHEMA sales_db.archive
COMMENT = 'Historical data archive';
-- Create a transient database (no fail-safe, lower cost)
CREATE TRANSIENT DATABASE temp_analytics_db;
-- Temporary objects are session-scoped; Snowflake supports temporary tables, not databases
CREATE TEMPORARY TABLE scratch_work (id NUMBER, note STRING);
Database and Table Types:
| Type | Time Travel | Fail-Safe | Use Case |
|---|---|---|---|
| Permanent | Up to 90 days | 7 days | Production data |
| Transient | Up to 1 day | No | Staging, temp data |
| Temporary (tables only) | Up to 1 day | No | Session-specific work |
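To check or change retention after the fact, the parameter can be inspected and set directly; a quick sketch against the sales_db objects above:
-- Check the current Time Travel setting for a database
SHOW PARAMETERS LIKE 'DATA_RETENTION_TIME_IN_DAYS' IN DATABASE sales_db;
-- Trim retention on the staging schema to cut storage costs
ALTER SCHEMA sales_db.staging SET DATA_RETENTION_TIME_IN_DAYS = 1;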
Loading Your First Data
Let’s load some data! Snowflake offers multiple methods:
Method 1: Load from Local File
-- Create a table
CREATE TABLE customers (
customer_id NUMBER,
first_name VARCHAR(50),
last_name VARCHAR(50),
email VARCHAR(100),
signup_date DATE
);
-- Create a file format
CREATE FILE FORMAT my_csv_format
TYPE = 'CSV'
FIELD_DELIMITER = ','
SKIP_HEADER = 1
NULL_IF = ('NULL', 'null', '')
EMPTY_FIELD_AS_NULL = TRUE
COMPRESSION = 'AUTO';
-- Create an internal stage (within Snowflake)
CREATE STAGE my_internal_stage
FILE_FORMAT = my_csv_format;
-- Upload file via UI or SnowSQL, then load
PUT file://customers.csv @my_internal_stage;
COPY INTO customers
FROM @my_internal_stage/customers.csv
FILE_FORMAT = my_csv_format
ON_ERROR = 'CONTINUE';
-- Check load results
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
Method 2: Load from S3
-- Create external stage pointing to S3
CREATE STAGE my_s3_stage
URL = 's3://my-bucket/data/'
CREDENTIALS = (AWS_KEY_ID = 'your_key'
AWS_SECRET_KEY = 'your_secret')
FILE_FORMAT = my_csv_format;
-- Or use IAM role (recommended)
CREATE STAGE my_s3_stage_with_role
URL = 's3://my-bucket/data/'
STORAGE_INTEGRATION = my_s3_integration
FILE_FORMAT = my_csv_format;
-- List files in stage
LIST @my_s3_stage;
-- Load data
COPY INTO customers
FROM @my_s3_stage
PATTERN = '.*customers.*csv'
FILE_FORMAT = my_csv_format;
Method 3: Direct Load (Small Data)
-- For quick tests
INSERT INTO customers VALUES
(1, 'John', 'Doe', 'john@email.com', '2024-01-15'),
(2, 'Jane', 'Smith', 'jane@email.com', '2024-01-16');
-- Or use SELECT
INSERT INTO customers
SELECT * FROM other_table WHERE condition = true;
Method 4: Continuous Loading with Snowpipe
-- Create pipe for automated loading
CREATE PIPE customer_pipe
AUTO_INGEST = TRUE
AS
COPY INTO customers
FROM @my_s3_stage
FILE_FORMAT = my_csv_format;
-- Show pipe status
SHOW PIPES;
-- Check pipe history
SELECT *
FROM TABLE(INFORMATION_SCHEMA.PIPE_USAGE_HISTORY(
DATE_RANGE_START=>DATEADD('day', -7, CURRENT_DATE()),
PIPE_NAME=>'customer_pipe'
));
Key Concepts to Remember
1. Time Travel
Every table automatically maintains history:
-- Query data as it was 1 hour ago
SELECT * FROM customers
AT(OFFSET => -3600);
-- Query data at a specific timestamp
SELECT * FROM customers
AT(TIMESTAMP => '2024-01-15 14:30:00'::TIMESTAMP);
-- Restore a dropped table
UNDROP TABLE customers;
-- Clone at a point in time
CREATE TABLE customers_yesterday
CLONE customers
AT(OFFSET => -86400);
2. Zero-Copy Cloning
Create instant, no-cost copies of databases, schemas, or tables:
-- Clone entire database
CREATE DATABASE dev_db CLONE production_db;
-- Clone table for testing
CREATE TABLE customers_test CLONE customers;
-- Clone is instant and shares storage until modified
3. Result Caching
Snowflake caches query results for 24 hours:
-- First run: Executes query
SELECT COUNT(*) FROM large_table; -- Takes 30 seconds
-- Second run: Instant from cache
SELECT COUNT(*) FROM large_table; -- Takes 0.1 seconds
-- A different query text (e.g., adding WHERE 1=1) means the cached result isn't reused
SELECT COUNT(*) FROM large_table
WHERE 1=1; -- Re-executes because the text differs from the cached query
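If you want a guaranteed re-execution rather than relying on changed query text, the session parameter is the explicit switch; a small sketch:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
SELECT COUNT(*) FROM large_table; -- Re-executes even if a cached result exists
ALTER SESSION SET USE_CACHED_RESULT = TRUE;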
Schema Design Best Practices
1. Choosing the Right Data Types
Snowflake is forgiving, but using optimal data types matters:
-- ❌ DON'T: Oversize your columns
CREATE TABLE customers_bad (
customer_id VARCHAR(1000), -- Way too big!
email VARCHAR(1000), -- 100 would suffice
age VARCHAR(50) -- Should be NUMBER
);
-- ✅ DO: Right-size your columns
CREATE TABLE customers_good (
customer_id NUMBER(38,0), -- Or use AUTOINCREMENT
email VARCHAR(100), -- Reasonable size
age NUMBER(3,0), -- Appropriate numeric type
signup_date DATE, -- Not TIMESTAMP if time not needed
last_login TIMESTAMP_LTZ -- With timezone if needed
);
Data Type Recommendations:
| Data Type | When to Use | Example |
|---|---|---|
| NUMBER(38,0) | Integer IDs, counts | Customer ID, Order count |
| NUMBER(10,2) | Money, precise decimals | Price, Amount |
| FLOAT | Measurements, scientific | Temperature, Distance |
| VARCHAR(n) | Text with known max length | Email, Phone |
| STRING | Text with unknown length | Comments, Descriptions |
| DATE | Dates without time | Birth date, Order date |
| TIMESTAMP_LTZ | Timestamps with timezone | Login time, Event time |
| VARIANT | JSON, semi-structured | API responses, logs |
| BOOLEAN | True/false flags | is_active, is_premium |
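Since VARIANT comes up constantly with API payloads, here is a minimal sketch of loading and querying semi-structured data (the api_events table and field names are illustrative):
CREATE TABLE api_events (payload VARIANT);
-- PARSE_JSON must go through a SELECT (it can't be used in a VALUES list)
INSERT INTO api_events (payload)
SELECT PARSE_JSON('{"user_id": 42, "action": "login", "tags": ["web", "mobile"]}');
-- Path access with explicit casts
SELECT
    payload:user_id::NUMBER AS user_id,
    payload:action::STRING AS action,
    payload:tags[0]::STRING AS first_tag
FROM api_events;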
2. Table Types Strategy
-- Production tables: Permanent (default)
CREATE TABLE prod_orders (
order_id NUMBER,
order_date DATE,
amount NUMBER(10,2)
);
-- Has: Time Travel (up to 90 days), Fail-safe (7 days)
-- Staging tables: Transient (cheaper)
CREATE TRANSIENT TABLE staging_orders (
order_id NUMBER,
order_date DATE,
amount NUMBER(10,2)
);
-- Has: Time Travel (up to 1 day), No Fail-safe
-- Saves ~50% on storage costs
-- Session work: Temporary
CREATE TEMPORARY TABLE work_orders (
order_id NUMBER,
amount NUMBER(10,2)
);
-- Exists only in session, automatic cleanup
When to Use What:
Permanent Tables
└── Production data
└── Data requiring recovery beyond 1 day
└── Regulatory/compliance requirements
Transient Tables
└── Staging/ETL intermediate tables
└── Data easily recreatable
└── Development/testing
└── Landing zones for raw data
Temporary Tables
└── Session-specific calculations
└── Query intermediate results
└── Data that doesn't need persistence
3. Schema Organization
-- Organize by data lifecycle and purpose
CREATE DATABASE analytics_db;
-- Raw/landing zone
CREATE SCHEMA analytics_db.raw
DATA_RETENTION_TIME_IN_DAYS = 1; -- Short retention
-- Staging/transformation
CREATE TRANSIENT SCHEMA analytics_db.staging
DATA_RETENTION_TIME_IN_DAYS = 1;
-- Production-ready data
CREATE SCHEMA analytics_db.curated
DATA_RETENTION_TIME_IN_DAYS = 7;
-- Data marts for specific teams
CREATE SCHEMA analytics_db.sales_mart;
CREATE SCHEMA analytics_db.marketing_mart;
-- Utility objects
CREATE SCHEMA analytics_db.utility; -- File formats, stages, etc.
Performance Optimization
1. Clustering Keys (For Large Tables)
Clustering helps with very large tables (1TB+):
-- Check if clustering would help
SELECT
table_name,
row_count,
bytes / POWER(1024, 3) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'PUBLIC'
AND bytes > 1099511627776 -- > 1TB
ORDER BY bytes DESC;
-- Add clustering key
ALTER TABLE large_fact_table
CLUSTER BY (date_column, category_id);
-- Monitor clustering health (returns JSON: average_depth, average_overlaps, etc.; lower depth is better)
SELECT SYSTEM$CLUSTERING_INFORMATION('large_fact_table',
'(date_column, category_id)');
-- Check clustering depth (fewer is better)
SELECT SYSTEM$CLUSTERING_DEPTH('large_fact_table',
'(date_column, category_id)');
Clustering Key Best Practices:
-- ✅ DO: Cluster by frequently filtered columns
ALTER TABLE orders CLUSTER BY (order_date); -- Good for date range queries
-- ✅ DO: Use multi-column clustering for multiple filters
ALTER TABLE orders CLUSTER BY (order_date, region_id);
-- ❌ DON'T: Cluster high-cardinality columns alone
-- Bad: customer_id (millions of unique values)
-- ❌ DON'T: Cluster frequently updated tables
-- Clustering maintenance is expensive
-- ✅ DO: Consider clustering on expressions (e.g., a truncated timestamp)
ALTER TABLE events CLUSTER BY (DATE_TRUNC('DAY', event_timestamp));
2. Materialized Views
Pre-compute expensive queries:
-- Create materialized view
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT
DATE_TRUNC('DAY', order_date) AS sale_date,
region_id,
SUM(amount) AS total_sales,
COUNT(*) AS order_count,
AVG(amount) AS avg_order_value
FROM orders
GROUP BY 1, 2;
-- Snowflake automatically maintains it!
-- Queries automatically use it when beneficial
-- Refresh manually if needed
ALTER MATERIALIZED VIEW daily_sales_summary REFRESH;
-- Check refresh status
SHOW MATERIALIZED VIEWS;
When to Use Materialized Views:
✅ Good candidates:
- Expensive aggregations queried frequently
- Dashboard queries that don’t need real-time data
- Complex joins used repeatedly
- Queries with GROUP BY on large datasets
❌ Avoid for:
- Rapidly changing base tables
- Queries needing real-time data
- Simple queries (overhead not worth it)
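Refresh activity and its serverless credit cost can be reviewed as well; a hedged sketch, assuming the MATERIALIZED_VIEW_REFRESH_HISTORY view in ACCOUNT_USAGE and its START_TIME/CREDITS_USED columns:
SELECT *
FROM snowflake.account_usage.materialized_view_refresh_history
WHERE start_time >= DATEADD('month', -1, CURRENT_TIMESTAMP())
ORDER BY credits_used DESC;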
3. Search Optimization
For substring searches and lookups:
-- Enable search optimization
ALTER TABLE customers
ADD SEARCH OPTIMIZATION ON EQUALITY(customer_id, email);
-- Or on all columns
ALTER TABLE customers
ADD SEARCH OPTIMIZATION;
-- EQUALITY optimization speeds up point lookups:
SELECT * FROM customers WHERE email = 'john@email.com';
-- For LIKE/substring searches, enable the SUBSTRING method instead:
-- ALTER TABLE customers ADD SEARCH OPTIMIZATION ON SUBSTRING(email);
SELECT * FROM customers WHERE email LIKE '%@gmail.com';
-- Check optimization status
SELECT SYSTEM$GET_SEARCH_OPTIMIZATION_STATUS('customers');
Warehouse Sizing and Management
Choosing the Right Size
Here’s a practical approach:
-- Start small and monitor
CREATE WAREHOUSE analytics_wh
WAREHOUSE_SIZE = 'SMALL'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;
-- Run your workload and check query performance
SELECT
query_id,
query_text,
warehouse_size,
execution_time / 1000 AS execution_seconds,
bytes_scanned / POWER(1024, 3) AS gb_scanned
FROM snowflake.account_usage.query_history
WHERE warehouse_name = 'ANALYTICS_WH'
AND start_time >= DATEADD('hour', -1, CURRENT_TIMESTAMP())
ORDER BY execution_time DESC
LIMIT 10;
Sizing Strategy:
| Scenario | Recommended Approach |
|---|---|
| Data loading | Start with LARGE, scale up if slow |
| BI queries (many users) | MEDIUM with multi-cluster (1-5 clusters) |
| ETL/batch processing | LARGE to X-LARGE, single cluster |
| Ad-hoc analysis | SMALL to MEDIUM |
| Development | X-SMALL |
| Complex analytics | X-LARGE for duration of job |
Warehouse Design Patterns
-- Pattern 1: Separate workloads
CREATE WAREHOUSE loading_wh
WAREHOUSE_SIZE = 'LARGE'
AUTO_SUSPEND = 60; -- Suspend quickly
CREATE WAREHOUSE bi_wh
WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 600 -- 10 min for BI tools
MAX_CLUSTER_COUNT = 5; -- Handle concurrent users
CREATE WAREHOUSE etl_wh
WAREHOUSE_SIZE = 'X-LARGE'
AUTO_SUSPEND = 60;
-- Pattern 2: Time-based sizing
-- Start day with larger warehouse
ALTER WAREHOUSE bi_wh SET WAREHOUSE_SIZE = 'LARGE'; -- 8 AM
-- Scale down during off-hours
ALTER WAREHOUSE bi_wh SET WAREHOUSE_SIZE = 'SMALL'; -- 6 PM
-- Pattern 3: Dynamic sizing in stored procedure
CREATE PROCEDURE scale_warehouse(wh_name STRING, size STRING)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
var sql_command = `ALTER WAREHOUSE ${WH_NAME} SET WAREHOUSE_SIZE = '${SIZE}'`;
snowflake.execute({sqlText: sql_command});
return `Warehouse ${WH_NAME} scaled to ${SIZE}`;
$$;
-- Use in your ETL
CALL scale_warehouse('ETL_WH', 'X-LARGE');
-- ... run heavy processing ...
CALL scale_warehouse('ETL_WH', 'SMALL');
Multi-Cluster Warehouses
CREATE WAREHOUSE concurrent_bi_wh
WAREHOUSE_SIZE = 'MEDIUM'
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 10
SCALING_POLICY = 'STANDARD' -- or ECONOMY
AUTO_SUSPEND = 300;
-- STANDARD: Aggressive scaling (performance priority)
-- - Starts a new cluster as soon as queries begin to queue
-- ECONOMY: Conservative scaling (cost priority)
-- - Only adds a cluster if there is an estimated 6 minutes of queued work
-- - May queue queries briefly
Cost Comparison:
Example: 10 concurrent users running 2-minute queries
STANDARD policy:
- May spin up 5-6 clusters
- Minimal queuing
- Higher cost, better UX
ECONOMY policy:
- May spin up 3-4 clusters
- Some queries queue 10-30 seconds
- Lower cost, acceptable for most BI use
Query Optimization Techniques
1. Understanding the Query Profile
-- Run a query and get its ID
SET query_id = LAST_QUERY_ID();
-- Analyze via UI: Query Profile tab
-- Or programmatically:
SELECT *
FROM TABLE(GET_QUERY_OPERATOR_STATS($query_id));
Key metrics to watch:
- Partitions scanned vs. total: Should be low %
- Bytes spilled to disk: Should be zero or minimal
- Rows produced: Check for unexpectedly high numbers
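The spilling metric above can also be checked in bulk from query history; a sketch using the spill columns in ACCOUNT_USAGE.QUERY_HISTORY:
SELECT
    query_id,
    warehouse_name,
    bytes_spilled_to_local_storage / POWER(1024, 3) AS local_spill_gb,
    bytes_spilled_to_remote_storage / POWER(1024, 3) AS remote_spill_gb
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
  AND (bytes_spilled_to_local_storage > 0 OR bytes_spilled_to_remote_storage > 0)
ORDER BY remote_spill_gb DESC, local_spill_gb DESC
LIMIT 20;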
2. Common Optimization Patterns
-- ❌ DON'T: SELECT *
SELECT * FROM large_table; -- Scans all columns
-- ✅ DO: Select only needed columns
SELECT customer_id, order_date, amount
FROM large_table;
-- ❌ DON'T: Functions on filtered columns (prevents pruning)
SELECT * FROM orders
WHERE YEAR(order_date) = 2024;
-- ✅ DO: Direct column comparisons
SELECT * FROM orders
WHERE order_date >= '2024-01-01'
AND order_date < '2025-01-01';
-- ❌ DON'T: Implicit type conversions
SELECT * FROM orders
WHERE customer_id = '12345'; -- customer_id is NUMBER
-- ✅ DO: Use correct types
SELECT * FROM orders
WHERE customer_id = 12345;
-- ❌ DON'T: Correlated subqueries
SELECT o.*,
(SELECT SUM(amount)
FROM order_items
WHERE order_id = o.order_id) AS total
FROM orders o;
-- ✅ DO: Use JOINs
SELECT o.*, oi.total
FROM orders o
JOIN (
SELECT order_id, SUM(amount) AS total
FROM order_items
GROUP BY order_id
) oi ON o.order_id = oi.order_id;
3. Join Optimization
-- Join order matters for performance
-- Put larger table first, smaller table second
-- ❌ Less optimal
SELECT *
FROM small_dimension d
JOIN large_fact f ON d.id = f.dimension_id;
-- ✅ Better
SELECT *
FROM large_fact f
JOIN small_dimension d ON f.dimension_id = d.id;
-- Use appropriate join types
-- INNER JOIN when possible (faster than OUTER)
-- Consider clustering on join keys for very large tables
ALTER TABLE large_fact CLUSTER BY (dimension_id);
4. Leveraging Cache
-- Three types of caching in Snowflake:
-- 1. Result Cache (24 hours, automatic)
SELECT COUNT(*) FROM orders; -- Cached
-- 2. Local Disk Cache (per warehouse)
SELECT * FROM orders
WHERE order_date = '2024-01-15'; -- Cached on warehouse nodes
-- 3. Remote Disk Cache (metadata layer)
-- Automatically used for small queries
-- Pro tip: Run analytics on same warehouse for cache benefit
USE WAREHOUSE analytics_wh; -- Keep using same warehouse
Cost Management Strategies
1. Monitor Credit Usage
-- View credit usage by warehouse
SELECT
warehouse_name,
SUM(credits_used) AS total_credits,
SUM(credits_used) * 4 AS approximate_cost_usd -- ~$4/credit
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('month', -1, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY total_credits DESC;
-- Identify expensive queries
SELECT
query_id,
user_name,
warehouse_name,
execution_time / 1000 AS seconds,
bytes_scanned / POWER(1024, 3) AS gb_scanned,
credits_used_cloud_services
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
AND execution_time > 60000 -- > 1 minute
ORDER BY credits_used_cloud_services DESC
LIMIT 20;
2. Cost Optimization Checklist
| Strategy | Implementation | Savings |
|---|---|---|
| Auto-suspend | Set to 60-300 seconds | 30-50% |
| Right-size warehouses | Monitor and adjust | 20-40% |
| Use transient tables | For staging/temp data | 50% on storage |
| Limit time travel | Set to minimum needed | 10-30% on storage |
| Clustering discipline | Only for 1TB+ tables | Avoid unnecessary costs |
| Query optimization | Reduce scan volumes | 20-50% |
| Dedicated warehouses | Isolate workloads | Better visibility |
| Resource monitors | Set credit limits | Prevent overruns |
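Several of these are one-statement changes; a quick sketch reusing warehouse and table names from earlier sections:
-- Tighten auto-suspend and right-size an existing warehouse
ALTER WAREHOUSE reporting_wh SET AUTO_SUSPEND = 120;
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'SMALL';
-- Trim Time Travel on a staging table
ALTER TABLE staging_orders SET DATA_RETENTION_TIME_IN_DAYS = 1;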
3. Resource Monitors
-- Create account-level monitor
USE ROLE ACCOUNTADMIN;
CREATE RESOURCE MONITOR monthly_limit
WITH CREDIT_QUOTA = 1000 -- 1000 credits per month
FREQUENCY = MONTHLY
START_TIMESTAMP = IMMEDIATELY
TRIGGERS
ON 75 PERCENT DO NOTIFY -- Alert at 75%
ON 100 PERCENT DO SUSPEND -- Stop at 100%
ON 110 PERCENT DO SUSPEND_IMMEDIATE; -- Force stop
-- Apply to account
ALTER ACCOUNT SET RESOURCE_MONITOR = monthly_limit;
-- Create warehouse-specific monitor
CREATE RESOURCE MONITOR bi_warehouse_limit
WITH CREDIT_QUOTA = 100
FREQUENCY = DAILY
START_TIMESTAMP = IMMEDIATELY
TRIGGERS
ON 90 PERCENT DO SUSPEND_IMMEDIATE;
-- Apply to specific warehouse
ALTER WAREHOUSE bi_wh
SET RESOURCE_MONITOR = bi_warehouse_limit;
4. Storage Cost Management
-- Identify large tables
SELECT
table_catalog || '.' || table_schema || '.' || table_name AS full_table_name,
row_count,
bytes / POWER(1024, 3) AS size_gb,
CASE
WHEN is_transient = 'YES' THEN 'Transient'
ELSE 'Permanent'
END AS table_category -- avoid shadowing the table_type column used in the WHERE clause
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
AND bytes > 1073741824 -- > 1GB
ORDER BY bytes DESC
LIMIT 20;
-- Check time travel storage
SELECT
table_name,
active_bytes / POWER(1024, 3) AS active_gb,
time_travel_bytes / POWER(1024, 3) AS time_travel_gb,
failsafe_bytes / POWER(1024, 3) AS failsafe_gb
FROM snowflake.account_usage.table_storage_metrics
WHERE time_travel_bytes > 0
ORDER BY time_travel_bytes DESC;
-- Reduce time travel if not needed
ALTER TABLE staging_data
SET DATA_RETENTION_TIME_IN_DAYS = 1; -- From default 7 or 90
-- Convert to transient if appropriate
CREATE TRANSIENT TABLE staging_data_new AS
SELECT * FROM staging_data;
DROP TABLE staging_data;
ALTER TABLE staging_data_new RENAME TO staging_data;
Security and Access Control
Role-Based Access Control (RBAC)
Snowflake uses a hierarchy of roles:
-- Standard role hierarchy
-- ACCOUNTADMIN
-- └── SECURITYADMIN
-- └── USERADMIN
-- └── SYSADMIN
-- └── Custom roles
-- └── PUBLIC (granted to all users)
-- Create custom roles
USE ROLE SECURITYADMIN;
CREATE ROLE data_engineer;
CREATE ROLE data_analyst;
CREATE ROLE data_scientist;
-- Grant roles to users
GRANT ROLE data_analyst TO USER john_doe;
GRANT ROLE data_engineer TO USER jane_smith;
-- Create role hierarchy
GRANT ROLE data_analyst TO ROLE data_scientist; -- data_scientist inherits analyst privileges
GRANT ROLE data_engineer TO ROLE sysadmin; -- SYSADMIN inherits the custom role (recommended practice)
Granting Privileges
-- Grant warehouse access
GRANT USAGE ON WAREHOUSE analytics_wh TO ROLE data_analyst;
GRANT OPERATE ON WAREHOUSE analytics_wh TO ROLE data_engineer;
-- Grant database/schema access
GRANT USAGE ON DATABASE analytics_db TO ROLE data_analyst;
GRANT USAGE ON SCHEMA analytics_db.curated TO ROLE data_analyst;
-- Grant table access
GRANT SELECT ON ALL TABLES IN SCHEMA analytics_db.curated TO ROLE data_analyst;
GRANT SELECT ON FUTURE TABLES IN SCHEMA analytics_db.curated TO ROLE data_analyst;
-- Grant write access to engineers
GRANT INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA analytics_db.staging
TO ROLE data_engineer;
GRANT CREATE TABLE ON SCHEMA analytics_db.staging TO ROLE data_engineer;
-- View privileges
SHOW GRANTS TO ROLE data_analyst;
SHOW GRANTS ON TABLE customers;
Row-Level Security with Row Access Policies
-- Create mapping table
CREATE TABLE user_region_access (
user_name VARCHAR,
allowed_region VARCHAR
);
INSERT INTO user_region_access VALUES
('john_doe', 'US-WEST'),
('jane_smith', 'US-EAST'),
('admin_user', 'ALL');
-- Create row access policy
CREATE ROW ACCESS POLICY region_access_policy
AS (region_code VARCHAR) RETURNS BOOLEAN ->
'admin_user' = CURRENT_USER()
OR EXISTS (
SELECT 1 FROM user_region_access
WHERE user_name = CURRENT_USER()
AND (allowed_region = region_code OR allowed_region = 'ALL')
);
-- Apply policy to table
ALTER TABLE orders
ADD ROW ACCESS POLICY region_access_policy ON (region);
-- Now users automatically see only their regions!
Column-Level Security with Masking Policies
-- Create masking policy for PII
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
CASE
WHEN CURRENT_ROLE() IN ('ADMIN', 'SECURITY') THEN val
WHEN CURRENT_ROLE() = 'ANALYST' THEN REGEXP_REPLACE(val, '.+@', '****@')
ELSE '****'
END;
-- Apply to column
ALTER TABLE customers
MODIFY COLUMN email SET MASKING POLICY email_mask;
-- Create masking for SSN/credit cards
CREATE MASKING POLICY ssn_mask AS (val STRING) RETURNS STRING ->
CASE
WHEN CURRENT_ROLE() IN ('ADMIN', 'COMPLIANCE') THEN val
ELSE '***-**-' || RIGHT(val, 4)
END;
ALTER TABLE customers
MODIFY COLUMN ssn SET MASKING POLICY ssn_mask;
Data Modeling Patterns
1. Star Schema (Most Common)
-- Dimension tables (descriptive attributes)
CREATE TABLE dim_customer (
customer_key NUMBER AUTOINCREMENT, -- Surrogate key
customer_id VARCHAR(50), -- Natural key
first_name VARCHAR(100),
last_name VARCHAR(100),
email VARCHAR(200),
-- SCD Type 2 columns
effective_date DATE,
end_date DATE,
is_current BOOLEAN,
PRIMARY KEY (customer_key)
);
CREATE TABLE dim_product (
product_key NUMBER AUTOINCREMENT,
product_id VARCHAR(50),
product_name VARCHAR(200),
category VARCHAR(100),
unit_price NUMBER(10,2),
effective_date DATE,
is_current BOOLEAN,
PRIMARY KEY (product_key)
);
-- Fact table (measures/metrics)
CREATE TABLE fact_sales (
sales_key NUMBER AUTOINCREMENT,
sale_date DATE,
customer_key NUMBER, -- FK to dim_customer
product_key NUMBER, -- FK to dim_product
store_key NUMBER, -- FK to dim_store
quantity NUMBER,
unit_price NUMBER(10,2),
discount_amount NUMBER(10,2),
sales_amount NUMBER(10,2),
cost_amount NUMBER(10,2),
PRIMARY KEY (sales_key),
-- Snowflake doesn't enforce FKs, but good for documentation
FOREIGN KEY (customer_key) REFERENCES dim_customer(customer_key),
FOREIGN KEY (product_key) REFERENCES dim_product(product_key)
)
CLUSTER BY (sale_date); -- If large table
2. Slowly Changing Dimensions (SCD)
-- Type 1: Overwrite (no history)
UPDATE dim_customer
SET email = 'newemail@example.com'
WHERE customer_id = '12345';
-- Type 2: Track history (recommended)
-- Close current record
UPDATE dim_customer
SET end_date = CURRENT_DATE(),
is_current = FALSE
WHERE customer_id = '12345' AND is_current = TRUE;
-- Insert new record
INSERT INTO dim_customer
(customer_id, first_name, last_name, email, effective_date, end_date, is_current)
VALUES
('12345', 'John', 'Doe', 'newemail@example.com', CURRENT_DATE(), '9999-12-31', TRUE);
-- Query current records
SELECT * FROM dim_customer WHERE is_current = TRUE;
-- Query at specific point in time
SELECT * FROM dim_customer
WHERE customer_id = '12345'
AND effective_date <= '2024-01-15'
AND end_date >= '2024-01-15';
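The close-then-insert steps above are often wrapped into a single load; a minimal sketch, assuming a staging table stg_customer that holds the latest snapshot and comparing only email for brevity:
-- Close current rows whose attributes changed
MERGE INTO dim_customer d
USING stg_customer s
ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHEN MATCHED AND d.email <> s.email THEN
UPDATE SET end_date = CURRENT_DATE(), is_current = FALSE;
-- Insert new versions for changed or brand-new customers
INSERT INTO dim_customer
(customer_id, first_name, last_name, email, effective_date, end_date, is_current)
SELECT s.customer_id, s.first_name, s.last_name, s.email, CURRENT_DATE(), '9999-12-31', TRUE
FROM stg_customer s
LEFT JOIN dim_customer d
ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL;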
3. Data Vault 2.0 (For Complex Enterprises)
-- Hubs: Unique business keys
CREATE TABLE hub_customer (
customer_hashkey BINARY(16), -- MD5 of business key
customer_id VARCHAR(50), -- Business key
load_date TIMESTAMP_NTZ,
record_source VARCHAR(50),
PRIMARY KEY (customer_hashkey)
);
-- Links: Relationships between hubs
CREATE TABLE link_order (
order_hashkey BINARY(16),
customer_hashkey BINARY(16),
product_hashkey BINARY(16),
load_date TIMESTAMP_NTZ,
record_source VARCHAR(50),
PRIMARY KEY (order_hashkey)
);
-- Satellites: Descriptive attributes
CREATE TABLE sat_customer (
customer_hashkey BINARY(16),
load_date TIMESTAMP_NTZ,
first_name VARCHAR(100),
last_name VARCHAR(100),
email VARCHAR(200),
hashdiff BINARY(16), -- Hash of all attributes
record_source VARCHAR(50),
PRIMARY KEY (customer_hashkey, load_date)
);
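Hash keys are computed at load time; a minimal sketch for populating the hub, assuming a raw staging table staging.customers_raw:
INSERT INTO hub_customer (customer_hashkey, customer_id, load_date, record_source)
SELECT
    MD5_BINARY(src.customer_id), -- 16-byte hash of the business key
    src.customer_id,
    CURRENT_TIMESTAMP(),
    'CRM'
FROM staging.customers_raw src
WHERE NOT EXISTS (
    SELECT 1 FROM hub_customer h
    WHERE h.customer_hashkey = MD5_BINARY(src.customer_id)
);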
Advanced Features
1. Streams for CDC
-- Create a stream on a table
CREATE STREAM customer_changes ON TABLE customers;
-- The stream tracks: INSERT, UPDATE, DELETE
-- Insert some data
INSERT INTO customers (customer_id, first_name, last_name) VALUES (1, 'John', 'Doe');
UPDATE customers SET first_name = 'Jane' WHERE customer_id = 1;
-- Query the stream
SELECT * FROM customer_changes;
-- Shows: METADATA$ACTION (INSERT/DELETE), METADATA$ISUPDATE
-- Process changes
MERGE INTO customer_summary s
USING customer_changes c
ON s.customer_id = c.customer_id
WHEN MATCHED AND c.METADATA$ACTION = 'DELETE' AND c.METADATA$ISUPDATE = FALSE
THEN DELETE
WHEN MATCHED AND c.METADATA$ACTION = 'INSERT' AND c.METADATA$ISUPDATE = TRUE
THEN UPDATE SET s.first_name = c.first_name, s.last_name = c.last_name
WHEN NOT MATCHED AND c.METADATA$ACTION = 'INSERT'
THEN INSERT (customer_id, first_name, last_name)
VALUES (c.customer_id, c.first_name, c.last_name);
-- Stream is automatically consumed after successful MERGE
2. Tasks for Automation
-- Create a task to run hourly
CREATE TASK refresh_summary_hourly
WAREHOUSE = etl_wh
SCHEDULE = '60 MINUTE'
AS
INSERT INTO sales_summary
SELECT
DATE_TRUNC('HOUR', sale_timestamp) AS sale_hour,
SUM(amount) AS total_sales,
COUNT(*) AS transaction_count
FROM sales
WHERE sale_timestamp >= DATEADD('HOUR', -1, CURRENT_TIMESTAMP())
GROUP BY 1;
-- Enable the task
ALTER TASK refresh_summary_hourly RESUME;
-- Create task triggered by stream
CREATE TASK process_customer_changes
WAREHOUSE = etl_wh
WHEN SYSTEM$STREAM_HAS_DATA('customer_changes')
AS
MERGE INTO customer_summary s
USING customer_changes c
ON s.customer_id = c.customer_id
-- ... (merge logic from above)
-- Chain tasks (DAG)
CREATE TASK step1
SCHEDULE = '60 MINUTE'
AS INSERT INTO stage1 SELECT * FROM source;
CREATE TASK step2
AFTER step1 -- Runs after step1 completes
AS INSERT INTO stage2 SELECT * FROM stage1;
CREATE TASK step3
AFTER step2
AS INSERT INTO final SELECT * FROM stage2;
-- Resume in reverse order
ALTER TASK step3 RESUME;
ALTER TASK step2 RESUME;
ALTER TASK step1 RESUME;
-- Monitor tasks
SHOW TASKS;
SELECT * FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY())
ORDER BY SCHEDULED_TIME DESC;
3. Stored Procedures
-- JavaScript stored procedure
CREATE OR REPLACE PROCEDURE load_sales_data(source_table STRING, target_table STRING)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
try {
// Execute SQL statements
var sql_truncate = `TRUNCATE TABLE ${TARGET_TABLE}`;
snowflake.execute({sqlText: sql_truncate});
var sql_insert = `INSERT INTO ${TARGET_TABLE} SELECT * FROM ${SOURCE_TABLE}`;
var stmt = snowflake.execute({sqlText: sql_insert});
var rows_inserted = stmt.getNumRowsAffected();
return `Successfully loaded ${rows_inserted} rows into ${TARGET_TABLE}`;
} catch (err) {
return `Error: ${err.message}`;
}
$$;
-- Call the procedure
CALL load_sales_data('staging.sales', 'prod.sales');
-- SQL stored procedure (Snowflake Scripting)
CREATE OR REPLACE PROCEDURE calculate_aggregates()
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
row_count NUMBER;
result_message STRING;
BEGIN
-- Truncate target
TRUNCATE TABLE daily_aggregates;
-- Insert aggregated data
INSERT INTO daily_aggregates
SELECT
sale_date,
SUM(amount) AS total_amount,
COUNT(*) AS transaction_count
FROM sales
WHERE sale_date >= DATEADD('day', -7, CURRENT_DATE())
GROUP BY sale_date;
-- Get affected rows
row_count := SQLROWCOUNT;
result_message := 'Processed ' || row_count || ' days';
RETURN result_message;
END;
$$;
CALL calculate_aggregates();
AWS Integration Deep Dive
1. Storage Integration with S3
The recommended way to connect Snowflake to S3 uses IAM roles (not access keys):
Step 1: Create IAM Policy in AWS
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:GetObjectVersion",
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource": [
"arn:aws:s3:::my-data-bucket/*",
"arn:aws:s3:::my-data-bucket"
]
},
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::my-data-bucket/unload/*"
]
}
]
}
Step 2: Create Storage Integration in Snowflake
-- Create storage integration
USE ROLE ACCOUNTADMIN;
CREATE STORAGE INTEGRATION s3_integration
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = S3
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-s3-role'
STORAGE_ALLOWED_LOCATIONS = ('s3://my-data-bucket/data/', 's3://my-data-bucket/raw/');
-- Get the Snowflake IAM user ARN and External ID
DESC STORAGE INTEGRATION s3_integration;
-- Note: STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID
Step 3: Update AWS IAM Role Trust Policy
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:user/abc12345-s"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "ABC123_SFCRole=1_abcdefg="
}
}
}
]
}
Step 4: Create Stage Using Integration
-- Create external stage
CREATE STAGE my_s3_stage
URL = 's3://my-data-bucket/data/'
STORAGE_INTEGRATION = s3_integration
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1);
-- Test it
LIST @my_s3_stage;
-- Load data
COPY INTO customers
FROM @my_s3_stage/customers.csv
FILE_FORMAT = (TYPE = 'CSV');
2. Snowpipe with S3 Event Notifications
Automatically load data as it arrives in S3:
Step 1: Create Pipe in Snowflake
CREATE PIPE customer_pipe
AUTO_INGEST = TRUE
AS
COPY INTO customers
FROM @my_s3_stage
FILE_FORMAT = (TYPE = 'CSV')
ON_ERROR = 'CONTINUE';
-- Get the SQS queue ARN
SHOW PIPES;
-- Note the "notification_channel" value
Step 2: Configure S3 Event Notification
{
"QueueConfigurations": [
{
"Id": "SnowflakeAutoIngest",
"QueueArn": "arn:aws:sqs:us-east-1:123456789:sf-snowpipe-AIDAIDAIDAID-abcdefg",
"Events": ["s3:ObjectCreated:*"],
"Filter": {
"Key": {
"FilterRules": [
{
"Name": "prefix",
"Value": "data/customers/"
},
{
"Name": "suffix",
"Value": ".csv"
}
]
}
}
}
]
}
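Applying that notification configuration can be done in the S3 console or, assuming the JSON above is saved locally as notification.json, via the AWS CLI:
aws s3api put-bucket-notification-configuration \
  --bucket my-data-bucket \
  --notification-configuration file://notification.json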
Step 3: Monitor Pipe
-- Check pipe status
SELECT SYSTEM$PIPE_STATUS('customer_pipe');
-- View load history
SELECT * FROM TABLE(
INFORMATION_SCHEMA.COPY_HISTORY(
TABLE_NAME => 'CUSTOMERS',
START_TIME => DATEADD(hours, -24, CURRENT_TIMESTAMP())
)
);
-- Manually refresh if needed
ALTER PIPE customer_pipe REFRESH;
3. Unloading Data to S3
-- Unload query results to S3
COPY INTO @my_s3_stage/exports/customers_
FROM (
SELECT * FROM customers
WHERE created_date >= '2024-01-01'
)
FILE_FORMAT = (TYPE = 'PARQUET')
PARTITION BY (TO_VARCHAR(DATE_TRUNC('MONTH', created_date), 'YYYY-MM'))
HEADER = TRUE
OVERWRITE = TRUE
MAX_FILE_SIZE = 104857600; -- 100 MB files
-- Unload with compression
COPY INTO @my_s3_stage/exports/customers.gz
FROM customers
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP')
SINGLE = TRUE; -- Single file
-- Unload as JSON
COPY INTO @my_s3_stage/exports/customers.json
FROM customers
FILE_FORMAT = (TYPE = 'JSON')
OVERWRITE = TRUE;
S3 Integration and External Stages
External Tables (Query S3 Without Loading)
-- Create external table pointing to S3 Parquet files
-- (define virtual columns so you can query named fields; names must match the Parquet schema)
CREATE EXTERNAL TABLE ext_customer_parquet (
customer_id NUMBER AS (value:customer_id::NUMBER),
region VARCHAR AS (value:region::VARCHAR),
total_purchases NUMBER AS (value:total_purchases::NUMBER),
last_purchase_date DATE AS (value:last_purchase_date::DATE)
)
WITH LOCATION = @my_s3_stage/parquet_data/
FILE_FORMAT = (TYPE = PARQUET)
AUTO_REFRESH = TRUE
PATTERN = '.*\.parquet';
-- Query external table
SELECT * FROM ext_customer_parquet
WHERE region = 'US-WEST'
LIMIT 100;
-- Create materialized view for performance
CREATE MATERIALIZED VIEW mv_ext_customers AS
SELECT
customer_id,
region,
total_purchases,
DATE_TRUNC('MONTH', last_purchase_date) AS month
FROM ext_customer_parquet;
-- Manually refresh if AUTO_REFRESH is false
ALTER EXTERNAL TABLE ext_customer_parquet REFRESH;
Directory Tables (Catalog S3 Files)
-- Enable directory table
ALTER STAGE my_s3_stage
SET DIRECTORY = (ENABLE = TRUE AUTO_REFRESH = TRUE);
-- Query directory metadata
SELECT *
FROM DIRECTORY(@my_s3_stage)
WHERE RELATIVE_PATH LIKE '%2024-01%';
-- Use directory table to process files selectively
COPY INTO customers
FROM (
SELECT
$1::VARCHAR AS customer_id,
$2::VARCHAR AS name,
$3::VARCHAR AS email,
METADATA$FILENAME AS source_file
FROM @my_s3_stage
)
WHERE METADATA$FILENAME IN (
SELECT RELATIVE_PATH
FROM DIRECTORY(@my_s3_stage)
WHERE LAST_MODIFIED > DATEADD('hour', -1, CURRENT_TIMESTAMP())
);
VPC and Networking
AWS PrivateLink Setup
For secure, private connectivity between AWS VPC and Snowflake:
Step 1: Enable PrivateLink in Snowflake
USE ROLE ACCOUNTADMIN;
-- For AWS regions that support PrivateLink
SELECT SYSTEM$GET_PRIVATELINK_CONFIG();
-- Returns the VPC endpoint service name
-- Example: com.amazonaws.vpce.us-east-1.vpce-svc-1234567890abcdef0
Step 2: Create VPC Endpoint in AWS
# Using AWS CLI
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-12345678 \
  --service-name com.amazonaws.vpce.us-east-1.vpce-svc-1234567890abcdef0 \
  --vpc-endpoint-type Interface \
  --subnet-ids subnet-12345678 subnet-87654321 \
  --security-group-ids sg-12345678
Step 3: Configure Network Policies
-- Create network policy to allow only PrivateLink
CREATE NETWORK POLICY privatelink_only
ALLOWED_IP_LIST = ('0.0.0.0/0') -- Or specific IPs
BLOCKED_IP_LIST = ()
COMMENT = 'Allow access via PrivateLink';
-- Apply to account
ALTER ACCOUNT SET NETWORK_POLICY = privatelink_only;
-- Or apply to specific users
ALTER USER john_doe SET NETWORK_POLICY = privatelink_only;
Network Policies
-- Allow only corporate IPs
CREATE NETWORK POLICY corporate_access
ALLOWED_IP_LIST = ('203.0.113.0/24', '198.51.100.0/24')
BLOCKED_IP_LIST = ()
COMMENT = 'Corporate network access only';
-- Allow all except blocked IPs
CREATE NETWORK POLICY block_suspicious
ALLOWED_IP_LIST = ('0.0.0.0/0')
BLOCKED_IP_LIST = ('192.0.2.0/24') -- Block suspicious range
COMMENT = 'Block specific IPs';
-- View network policies
SHOW NETWORK POLICIES;
-- Apply globally
ALTER ACCOUNT SET NETWORK_POLICY = corporate_access;
BI Tool Integrations
Tableau Integration
Connection Setup:
- In Tableau Desktop: Connect → Snowflake
- Enter details:
  - Server: account.region.snowflakecomputing.com
  - Authentication: Username/Password or OAuth
  - Warehouse, Database, Schema
Optimization Tips:
-- Create aggregated tables for Tableau
CREATE TABLE tableau_daily_sales AS
SELECT
DATE_TRUNC('DAY', sale_date) AS day,
region,
product_category,
SUM(amount) AS total_sales,
COUNT(DISTINCT customer_id) AS unique_customers
FROM sales
GROUP BY 1, 2, 3;
-- Create dedicated warehouse
CREATE WAREHOUSE tableau_wh
WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 600 -- 10 minutes (Tableau keeps connection)
AUTO_RESUME = TRUE
MAX_CLUSTER_COUNT = 5 -- Handle concurrent users
SCALING_POLICY = 'STANDARD';
-- Grant usage to Tableau service account
GRANT USAGE ON WAREHOUSE tableau_wh TO ROLE tableau_role;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.curated TO ROLE tableau_role;
Performance best practices:
- Use extracts for large datasets (published to Tableau Server)
- Use live connections for real-time dashboards
- Leverage Snowflake’s result caching
- Create materialized views for complex calculations
Power BI Integration
Connection String:
Server: account.region.snowflakecomputing.com
Warehouse: POWERBI_WH
Database: ANALYTICS_DB
Schema: CURATED
DirectQuery vs Import:
| Mode | When to Use | Pros | Cons |
|---|---|---|---|
| DirectQuery | Real-time dashboards, large datasets | Always current, no size limits | Query performance dependent on Snowflake |
| Import | Static reports, complex DAX | Fast in Power BI, offline access | Stale data, size limits |
Optimization:
-- Pre-aggregate for Power BI
CREATE VIEW powerbi_sales_summary AS
SELECT
DATE_TRUNC('DAY', sale_timestamp) AS sale_date,
store_id,
product_id,
SUM(quantity) AS total_quantity,
SUM(amount) AS total_amount,
COUNT(*) AS transaction_count
FROM sales
GROUP BY 1, 2, 3;
-- Use appropriate data types
-- Power BI works best with DATE (not TIMESTAMP) for date tables
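A tiny sketch of exposing a DATE-typed column for a Power BI date table (the view name is illustrative):
CREATE VIEW powerbi_dim_date AS
SELECT DISTINCT CAST(sale_timestamp AS DATE) AS sale_date
FROM sales;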
Looker Integration
Connection Setup in Looker:
connection: "snowflake_prod" {
url: "account.region.snowflakecomputing.com"
database: "ANALYTICS_DB"
warehouse: "LOOKER_WH"
schema: "CURATED"
max_connections: 20
connection_pooling: yes
ssl: yes
}
view: customers {
sql_table_name: PUBLIC.CUSTOMERS ;;
dimension: customer_id {
primary_key: yes
type: number
sql: ${TABLE}.CUSTOMER_ID ;;
}
dimension: email {
type: string
sql: ${TABLE}.EMAIL ;;
}
dimension_group: signup {
type: time
timeframes: [date, week, month, year]
sql: ${TABLE}.SIGNUP_DATE ;;
}
measure: count {
type: count
drill_fields: [customer_id, email]
}
}
Performance Tips:
-- Create persistent derived tables (PDTs)
-- Looker will create and refresh these tables
-- Enable query results caching
ALTER SESSION SET USE_CACHED_RESULT = TRUE;
-- Create aggregate awareness tables
CREATE TABLE daily_rollup AS
SELECT
DATE_TRUNC('DAY', event_time) AS event_day,
user_id,
COUNT(*) AS event_count
FROM events
GROUP BY 1, 2;
Python/Pandas Integration
import snowflake.connector
import pandas as pd
from snowflake.connector.pandas_tools import write_pandas
# Connect to Snowflake
conn = snowflake.connector.connect(
user='your_user',
password='your_password',
account='account.region',
warehouse='PYTHON_WH',
database='ANALYTICS_DB',
schema='PUBLIC'
)
# Query to DataFrame
df = pd.read_sql("""
SELECT * FROM customers
WHERE signup_date >= '2024-01-01'
""", conn)
# Write DataFrame to Snowflake (fast!)
write_pandas(
conn=conn,
df=df,
table_name='CUSTOMERS_UPLOAD',
database='ANALYTICS_DB',
schema='STAGING',
auto_create_table=True
)
# Using cursor for row-by-row processing
cursor = conn.cursor()
cursor.execute("SELECT * FROM large_table")
for row in cursor:
process_row(row)
# Close connection
conn.close()
# Using SQLAlchemy (for compatibility with other tools)
from sqlalchemy import create_engine
engine = create_engine(
'snowflake://user:password@account.region/database/schema?warehouse=PYTHON_WH'
)
df = pd.read_sql_table('customers', engine)
df.to_sql('customers_new', engine, if_exists='replace')
Snowflake vs AWS Redshift
Let’s compare these two popular cloud data warehouses:
Architecture Differences
| Aspect | Snowflake | Redshift |
|---|---|---|
| Architecture | Shared-data, separated compute/storage | Shared-nothing, coupled compute/storage |
| Scaling | Elastic, instant | Requires cluster resize |
| Concurrency | Multi-cluster, no resource contention | Limited by cluster size, queuing common |
| Management | Fully managed, zero-admin | Some management required |
| Pause/Resume | Instant, per-warehouse | Whole cluster, slower |
| Cloning | Zero-copy, instant | Not available |
| Time Travel | Built-in, up to 90 days | Manual (snapshots) |
Performance Comparison
-- SNOWFLAKE: No distribution keys needed
CREATE TABLE orders (
order_id NUMBER,
customer_id NUMBER,
order_date DATE,
amount NUMBER(10,2)
);
-- Automatically micro-partitioned and distributed
-- REDSHIFT: Manual distribution key required
CREATE TABLE orders (
order_id INTEGER,
customer_id INTEGER,
order_date DATE,
amount DECIMAL(10,2)
)
DISTKEY(customer_id) -- Must choose distribution
SORTKEY(order_date); -- Must choose sort key
-- SNOWFLAKE: No VACUUM needed
-- Updates handled automatically
-- REDSHIFT: Regular VACUUM required
VACUUM orders;
ANALYZE orders;
Feature Comparison Table
| Feature | Snowflake | Redshift | Winner |
|---|---|---|---|
| Setup Complexity | Minutes | Hours | ❄️ Snowflake |
| Concurrency | Excellent (multi-cluster) | Good (WLM) | ❄️ Snowflake |
| Query Performance | Excellent | Excellent | 🤝 Tie |
| Data Sharing | Native, instant | Manual export/import | ❄️ Snowflake |
| Semi-structured Data | Native VARIANT type | JSON strings | ❄️ Snowflake |
| Cost Model | Compute (credits) + Storage | Instance hours + Storage | 🤷 Depends |
| AWS Integration | Good | Excellent | 🔴 Redshift |
| Multi-cloud | AWS, Azure, GCP | AWS only | ❄️ Snowflake |
| Tool Ecosystem | Growing | Mature | 🔴 Redshift |
| Maintenance | Zero | Moderate | ❄️ Snowflake |
Cost Comparison Example
Scenario: 5TB data, 100 concurrent users, 24/7 availability
Snowflake:
Storage: 5TB × $40/TB/month = $200/month
Compute:
- Medium WH for BI: 4 credits/hr × 730 hrs × $4/credit = $11,680/month at 100% utilization
- With auto-suspend and multi-cluster (1-5), ~20% utilization: ~$2,336/month
Total: ~$2,536/month
Redshift:
Cluster: ra3.4xlarge × 2 nodes
= $3.26/hr × 2 × 730 hrs = $4,760/month
Storage: RA3 managed storage billed separately (roughly $0.024/GB-month, ~$125 for 5TB)
Total: ~$4,885/month if the cluster runs 24/7
Snowflake wins on cost if:
- Workload is bursty (can auto-suspend)
- Need multiple isolated workloads
- Development/testing environments
Redshift wins on cost if:
- Truly 24/7 workload
- Deeply integrated with AWS services
- Committed to reserved instances
Migration: Redshift → Snowflake
-- 1. Export from Redshift to S3
-- In Redshift:
UNLOAD ('SELECT * FROM customers')
TO 's3://migration-bucket/customers/'
IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftUnloadRole'
FORMAT AS PARQUET
PARALLEL ON;
-- 2. Create storage integration in Snowflake
CREATE STORAGE INTEGRATION redshift_migration
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = S3
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789:role/SnowflakeLoadRole'
STORAGE_ALLOWED_LOCATIONS = ('s3://migration-bucket/');
-- 3. Create stage and load
CREATE STAGE migration_stage
URL = 's3://migration-bucket/'
STORAGE_INTEGRATION = redshift_migration
FILE_FORMAT = (TYPE = 'PARQUET');
-- Staged Parquet is exposed as a single VARIANT column ($1), so cast the fields you need
CREATE TABLE customers AS
SELECT $1:customer_id::NUMBER AS customer_id,
       $1:first_name::VARCHAR AS first_name,
       $1:last_name::VARCHAR AS last_name,
       $1:email::VARCHAR AS email
FROM @migration_stage/customers/;
-- 4. Schema conversion notes:
-- Redshift → Snowflake
-- DISTKEY/SORTKEY → Remove (automatic)
-- VARCHAR(MAX) → VARCHAR or STRING
-- ENCODE → Remove (automatic compression)
-- IDENTITY → AUTOINCREMENT
-- ANALYZE/VACUUM → Remove (automatic)
System Tables and Metadata
Snowflake provides extensive metadata through system tables and views:
Account Usage Schema (Historical Data)
-- 1. Query History (up to 1 year)
SELECT
query_id,
query_text,
user_name,
warehouse_name,
execution_status,
start_time,
end_time,
DATEDIFF('second', start_time, end_time) AS duration_sec,
total_elapsed_time / 1000 AS total_sec,
bytes_scanned / POWER(1024, 3) AS gb_scanned,
rows_produced
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
AND warehouse_name = 'PROD_WH'
ORDER BY start_time DESC
LIMIT 100;
-- 2. Storage Usage Over Time
SELECT
usage_date,
storage_bytes / POWER(1024, 4) AS storage_tb,
stage_bytes / POWER(1024, 4) AS stage_tb,
failsafe_bytes / POWER(1024, 4) AS failsafe_tb
FROM snowflake.account_usage.storage_usage
WHERE usage_date >= DATEADD('month', -3, CURRENT_DATE())
ORDER BY usage_date DESC;
-- 3. Warehouse Metering (Credit Consumption)
SELECT
warehouse_name,
DATE_TRUNC('DAY', start_time) AS day,
SUM(credits_used) AS daily_credits,
SUM(credits_used) * 4 AS approx_cost_usd
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('month', -1, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1, 2 DESC;
-- 4. Login History (Security Audit)
SELECT
event_timestamp,
user_name,
client_ip,
reported_client_type,
first_authentication_factor,
is_success,
error_code,
error_message
FROM snowflake.account_usage.login_history
WHERE event_timestamp >= DATEADD('day', -7, CURRENT_TIMESTAMP())
AND is_success = 'NO' -- Failed logins
ORDER BY event_timestamp DESC;
-- 5. Data Transfer History (Egress Costs)
SELECT
DATE_TRUNC('DAY', start_time) AS day,
source_cloud,
source_region,
target_cloud,
target_region,
bytes_transferred / POWER(1024, 3) AS gb_transferred
FROM snowflake.account_usage.data_transfer_history
WHERE start_time >= DATEADD('month', -1, CURRENT_TIMESTAMP())
ORDER BY day DESC;
-- 6. Table Storage Details
SELECT
table_catalog || '.' || table_schema || '.' || table_name AS full_name,
active_bytes / POWER(1024, 3) AS active_gb,
time_travel_bytes / POWER(1024, 3) AS time_travel_gb,
failsafe_bytes / POWER(1024, 3) AS failsafe_gb,
(active_bytes + time_travel_bytes + failsafe_bytes) / POWER(1024, 3) AS total_gb
FROM snowflake.account_usage.table_storage_metrics
WHERE deleted = FALSE -- Exclude dropped tables
ORDER BY total_gb DESC
LIMIT 50;
-- 7. Copy History (Data Loading)
SELECT
table_name,
stage_location,
file_name,
status,
row_count,
row_parsed,
first_error_message,
execution_time / 1000 AS load_seconds
FROM snowflake.account_usage.copy_history
WHERE table_name = 'CUSTOMERS'
AND last_load_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY last_load_time DESC;
-- 8. Access History (Column-level Auditing)
SELECT
query_id,
query_start_time,
user_name,
direct_objects_accessed,
base_objects_accessed,
objects_modified
FROM snowflake.account_usage.access_history
WHERE query_start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
AND ARRAY_SIZE(objects_modified) > 0 -- Only modifications
ORDER BY query_start_time DESC;
Information Schema (Current State)
-- 1. List all tables with row counts
SELECT
table_catalog,
table_schema,
table_name,
table_type,
row_count,
bytes / POWER(1024, 3) AS size_gb,
created,
last_altered
FROM information_schema.tables
WHERE table_schema NOT IN ('INFORMATION_SCHEMA')
ORDER BY bytes DESC;
-- 2. Get table DDL
SELECT GET_DDL('TABLE', 'CUSTOMERS');
SELECT GET_DDL('VIEW', 'ACTIVE_CUSTOMERS');
SELECT GET_DDL('DATABASE', 'ANALYTICS_DB');
-- 3. Column details
SELECT
table_schema,
table_name,
column_name,
ordinal_position,
data_type,
is_nullable,
column_default
FROM information_schema.columns
WHERE table_name = 'CUSTOMERS'
ORDER BY ordinal_position;
-- 4. View dependencies
SELECT
*
FROM information_schema.view_dependencies
WHERE referenced_object_name = 'CUSTOMERS';
Monitoring and Troubleshooting
Essential Monitoring Queries
-- 1. Currently running queries (ACCOUNT_USAGE lags by up to ~45 minutes, so use the INFORMATION_SCHEMA table function for live queries)
SELECT
query_id,
user_name,
warehouse_name,
execution_status,
DATEDIFF('second', start_time, CURRENT_TIMESTAMP()) AS running_seconds,
query_text
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE execution_status = 'RUNNING'
ORDER BY start_time;
-- 2. Long-running queries (last 24 hours)
SELECT
query_id,
user_name,
warehouse_name,
execution_time / 1000 AS duration_seconds,
bytes_scanned / POWER(1024, 3) AS gb_scanned,
LEFT(query_text, 100) AS query_snippet
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('hour', -24, CURRENT_TIMESTAMP())
AND execution_time > 300000 -- > 5 minutes
ORDER BY execution_time DESC;
-- 3. Warehouse utilization
SELECT
warehouse_name,
DATE_TRUNC('HOUR', start_time) AS hour,
AVG(avg_running) AS avg_queries_running,
MAX(avg_queued_load) AS max_queued
FROM snowflake.account_usage.warehouse_load_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1, 2 DESC;
-- 4. Failed queries
SELECT
query_id,
user_name,
error_code,
error_message,
LEFT(query_text, 100) AS query_snippet
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('hour', -24, CURRENT_TIMESTAMP())
AND execution_status = 'FAIL'
ORDER BY start_time DESC;
-- 5. Storage usage trends
SELECT
DATE_TRUNC('DAY', usage_date) AS day,
AVG(storage_bytes) / POWER(1024, 3) AS avg_storage_gb,
AVG(stage_bytes) / POWER(1024, 3) AS avg_stage_gb,
AVG(failsafe_bytes) / POWER(1024, 3) AS avg_failsafe_gb
FROM snowflake.account_usage.storage_usage
WHERE usage_date >= DATEADD('month', -1, CURRENT_DATE())
GROUP BY 1
ORDER BY 1;
Performance Troubleshooting Guide
-- Problem: Query running slow
-- Step 1: Check if warehouse is overwhelmed
SELECT
warehouse_name,
avg_running,
avg_queued_load,
avg_blocked
FROM snowflake.account_usage.warehouse_load_history
WHERE start_time >= DATEADD('hour', -1, CURRENT_TIMESTAMP())
AND warehouse_name = 'YOUR_WAREHOUSE'
ORDER BY start_time DESC
LIMIT 10;
-- Step 2: Check query profile for the slow query
-- Use UI: Query -> Profile tab
-- Look for:
-- - High "Bytes spilled to local/remote storage"
-- - Many partitions scanned
-- - Inefficient joins
-- Step 3: Check if data is properly clustered
-- (SYSTEM$CLUSTERING_INFORMATION expects the table name as a string literal)
SELECT SYSTEM$CLUSTERING_INFORMATION('YOUR_SCHEMA.YOUR_TABLE');
Data Sharing and Marketplace
One of Snowflake’s killer features: share data without copying!
Creating a Data Share
-- As provider:
USE ROLE ACCOUNTADMIN;
-- Create share
CREATE SHARE customer_data_share;
-- Add database to share
GRANT USAGE ON DATABASE analytics_db TO SHARE customer_data_share;
GRANT USAGE ON SCHEMA analytics_db.public TO SHARE customer_data_share;
-- Add specific objects
GRANT SELECT ON TABLE analytics_db.public.customers TO SHARE customer_data_share;
GRANT SELECT ON VIEW analytics_db.public.customer_summary TO SHARE customer_data_share;
-- Add consumer account
ALTER SHARE customer_data_share
ADD ACCOUNTS = abc12345, xyz67890;
-- Show what's shared
SHOW GRANTS TO SHARE customer_data_share;
-- Add secure view with row filtering
CREATE SECURE VIEW shared_customer_view AS
SELECT
    customer_id,
    first_name,
    last_name,
    region
FROM customers
WHERE region IN (
    SELECT region
    FROM consumer_allowed_regions
    WHERE account = CURRENT_ACCOUNT()  -- evaluates to the consumer's account when they query
);
GRANT SELECT ON VIEW shared_customer_view TO SHARE customer_data_share;
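The consumer_allowed_regions table referenced in the view is a mapping you maintain on the provider side. A minimal sketch (table name, columns, and values are illustrative):
-- Map consumer account locators to the regions they may see
CREATE TABLE consumer_allowed_regions (
    account VARCHAR,   -- consumer account locator, matched via CURRENT_ACCOUNT()
    region  VARCHAR
);
INSERT INTO consumer_allowed_regions VALUES ('ABC12345', 'US-WEST');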
Consuming a Data Share
-- As consumer:
-- List available shares
SHOW SHARES;
-- Create database from share
CREATE DATABASE shared_customer_data
FROM SHARE provider_account.customer_data_share;
-- Query shared data (no copy!)
SELECT * FROM shared_customer_data.public.customers
WHERE region = 'US-WEST';
-- Shared data usage doesn't affect provider's compute
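One practical note on the consumer side: only the role that created the database from the share (typically ACCOUNTADMIN) can query it until you grant IMPORTED PRIVILEGES. The role name below is just a placeholder:
GRANT IMPORTED PRIVILEGES ON DATABASE shared_customer_data TO ROLE analyst_role;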
Snowflake Marketplace
-- Browse available datasets
-- Go to: Snowsight > Data Products > Marketplace
-- Get free weather data (example)
-- 1. Find "Weather Source" in Marketplace
-- 2. Click "Get"
-- 3. Creates database automatically
-- Query marketplace data
SELECT *
FROM weather_source.standard_tile.forecast_day
WHERE postal_code = '10001'
AND forecast_date = CURRENT_DATE();
-- Combine with your data
SELECT
o.order_date,
o.total_sales,
w.avg_temperature_air_2m_f,
w.tot_precipitation_in
FROM analytics_db.public.daily_sales o
JOIN weather_source.standard_tile.history_day w
ON o.store_zip = w.postal_code
AND o.order_date = w.date_valid_std
WHERE o.order_date >= DATEADD('month', -1, CURRENT_DATE());
Migration Strategies
From Traditional Databases (Oracle, SQL Server, MySQL)
Step 1: Assessment
-- Inventory source system
-- Document:
-- - Table sizes
-- - Data types
-- - Stored procedures
-- - ETL processes
-- - BI tool connections
-- - Query patterns
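For the table-size inventory, the source system's catalog views are usually enough. A minimal sketch for SQL Server (one option among many; Oracle and MySQL have equivalent catalogs):
-- Rough row counts per table on SQL Server
SELECT
    t.name AS table_name,
    SUM(p.rows) AS row_count
FROM sys.tables t
JOIN sys.partitions p
    ON p.object_id = t.object_id
    AND p.index_id IN (0, 1)   -- heap or clustered index only
GROUP BY t.name
ORDER BY row_count DESC;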
Step 2: Schema Conversion
-- Oracle → Snowflake conversions:
-- NUMBER(n) → NUMBER(n, 0)
-- NUMBER(n,m) → NUMBER(n, m)
-- VARCHAR2(n) → VARCHAR(n)
-- DATE → DATE or TIMESTAMP
-- CLOB → STRING or VARCHAR (max 16 MB)
-- RAW/BLOB → BINARY
-- Sequence → AUTOINCREMENT or sequence
-- SQL Server → Snowflake:
-- INT → NUMBER(38, 0)
-- BIGINT → NUMBER(38, 0)
-- VARCHAR(MAX) → STRING
-- DATETIME → TIMESTAMP_NTZ
-- UNIQUEIDENTIFIER → VARCHAR(36)
-- IDENTITY → AUTOINCREMENT
-- Create tables in Snowflake
CREATE TABLE customers (
customer_id NUMBER AUTOINCREMENT,
created_date TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
-- ...
);
Step 3: Data Migration
-- Option 1: Export to CSV, upload to S3, COPY into Snowflake
-- (Covered earlier)
-- Option 2: Use third-party tools
-- - Fivetran
-- - Matillion
-- - AWS DMS (Database Migration Service)
-- - Talend
-- - Informatica
-- Option 3: Snowflake connector for Spark
-- For large-scale migrations
Step 4: Procedure/Function Migration
-- Convert PL/SQL or T-SQL to Snowflake SQL or JavaScript
-- Oracle PL/SQL:
/*
CREATE OR REPLACE PROCEDURE update_customer_tier IS
BEGIN
UPDATE customers
SET tier = 'GOLD'
WHERE total_purchases > 10000;
COMMIT;
END;
*/
-- Snowflake equivalent:
CREATE OR REPLACE PROCEDURE update_customer_tier()
RETURNS STRING
LANGUAGE SQL
AS
$$
BEGIN
UPDATE customers
SET tier = 'GOLD'
WHERE total_purchases > 10000;
RETURN 'Updated ' || SQLROWCOUNT || ' customers';
END;
$$;
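Once converted, the procedure is invoked with CALL, which also returns the status string:
CALL update_customer_tier();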
Parallel Running Strategy
-- Phase 1: Shadow mode (dual-write)
-- Write to both old system and Snowflake
-- Compare results
-- Phase 2: Read from Snowflake for non-critical workloads
-- Keep old system as backup
-- Phase 3: Full cutover
-- Decommission old system
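During the shadow phase, a lightweight reconciliation check helps confirm the two systems agree. A minimal sketch, assuming you land the legacy system's row counts into a (hypothetical) legacy_row_counts table in Snowflake:
-- Flag tables whose row counts diverge between legacy and Snowflake
SELECT
    l.table_name,
    l.row_count AS legacy_rows,
    t.row_count AS snowflake_rows,
    t.row_count - l.row_count AS diff
FROM legacy_row_counts l
JOIN information_schema.tables t
    ON t.table_name = l.table_name
    AND t.table_schema = 'PUBLIC'   -- adjust to your target schema
WHERE t.row_count <> l.row_count;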
Real-World Architecture Patterns
Pattern 1: Modern Data Stack
Data Sources (SaaS, Databases, APIs)
↓
Fivetran/Airbyte (Ingestion)
↓
Snowflake (Storage & Transformation)
↓
dbt (Transformation Layer)
↓
BI Tools (Tableau, Looker, Mode)
↓
Reverse ETL (Census, Hightouch)
↓
Operational Systems
Implementation:
-- 1. Raw data layer (from Fivetran)
CREATE SCHEMA raw_data;
-- 2. Staging layer (dbt)
CREATE SCHEMA staging;
-- 3. Marts (dbt)
CREATE SCHEMA analytics;
CREATE SCHEMA sales_mart;
CREATE SCHEMA marketing_mart;
-- Sample dbt model
-- models/staging/stg_customers.sql
{{
config(
materialized='view'
)
}}
SELECT
customer_id,
LOWER(TRIM(email)) AS email,
created_at::TIMESTAMP_NTZ AS created_timestamp
FROM {{ source('raw_data', 'customers') }}
WHERE customer_id IS NOT NULL
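Downstream marts then build on the staging layer with ref(). A hypothetical mart model (models/analytics/dim_customers.sql) might look like:
{{
  config(
    materialized='table'
  )
}}

SELECT
    customer_id,
    email,
    created_timestamp,
    DATEDIFF('day', created_timestamp, CURRENT_TIMESTAMP()) AS days_since_signup
FROM {{ ref('stg_customers') }}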
Pattern 2: Data Lake + Data Warehouse
S3 Data Lake (Raw Files)
↓
External Tables / Snowpipe → Snowflake
↓
Transformation
↓
Curated Layer
Implementation:
-- Query data lake directly via an external table
-- (partition columns must be defined from the file path metadata;
--  adjust the SPLIT_PART index to match your S3 key layout)
CREATE EXTERNAL TABLE raw_events (
    event_date DATE AS TO_DATE(SPLIT_PART(metadata$filename, '/', 2))
)
PARTITION BY (event_date)
WITH LOCATION = @s3_stage/events/
FILE_FORMAT = (TYPE = 'PARQUET');
-- Stream to internal tables
CREATE PIPE load_events
AUTO_INGEST = TRUE
AS
COPY INTO events_internal
FROM @s3_stage/events/;
-- Transform
CREATE TABLE events_processed AS
SELECT
event_id,
user_id,
event_type,
DATE_TRUNC('HOUR', event_timestamp) AS event_hour,
properties
FROM events_internal
WHERE event_type IN ('purchase', 'signup', 'cancellation');
Pattern 3: Multi-Tenant SaaS
-- Separate databases per customer
CREATE DATABASE customer_abc_prod;
CREATE DATABASE customer_xyz_prod;
-- Or single database with row-level security
CREATE TABLE saas_data (
tenant_id VARCHAR,
record_id NUMBER,
data VARCHAR
);
CREATE ROW ACCESS POLICY tenant_isolation
AS (tenant_id VARCHAR) RETURNS BOOLEAN ->
tenant_id = CURRENT_USER()
OR 'ADMIN' = CURRENT_ROLE();
ALTER TABLE saas_data
ADD ROW ACCESS POLICY tenant_isolation ON (tenant_id);
-- Users automatically see only their data
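In practice, most teams drive tenant isolation from a mapping table rather than matching tenant_id to CURRENT_USER() directly. A sketch, where tenant_role_map is a hypothetical (role_name, tenant_id) lookup table:
-- Allow a role to see only the tenants mapped to it
CREATE ROW ACCESS POLICY tenant_isolation_mapped
AS (row_tenant_id VARCHAR) RETURNS BOOLEAN ->
    'ADMIN' = CURRENT_ROLE()
    OR EXISTS (
        SELECT 1
        FROM tenant_role_map
        WHERE role_name = CURRENT_ROLE()
          AND tenant_id = row_tenant_id
    );

-- Swap it in (a column can be governed by only one row access policy at a time)
ALTER TABLE saas_data DROP ROW ACCESS POLICY tenant_isolation;
ALTER TABLE saas_data ADD ROW ACCESS POLICY tenant_isolation_mapped ON (tenant_id);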
Pattern 4: Real-Time Analytics
Application Databases
↓
CDC (Debezium/Kafka)
↓
S3 / Event Hub
↓
Snowpipe
↓
Stream
↓
Task
↓
Materialized View
↓
BI Dashboard
Implementation:
-- 1. Snowpipe loads data continuously
CREATE PIPE real_time_events
AUTO_INGEST = TRUE
AS COPY INTO events FROM @s3_stage/events/;
-- 2. Stream captures changes
CREATE STREAM event_stream ON TABLE events;
-- 3. Task processes stream
CREATE TASK process_events
WAREHOUSE = real_time_wh
SCHEDULE = '1 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('event_stream')
AS
INSERT INTO events_aggregated
SELECT
DATE_TRUNC('MINUTE', event_time) AS minute,
event_type,
COUNT(*) AS event_count
FROM event_stream
GROUP BY 1, 2;
-- Tasks are created suspended; resume it so the schedule kicks in
ALTER TASK process_events RESUME;
-- 4. Materialized view for dashboards
-- (materialized views can't use non-deterministic functions like
--  CURRENT_TIMESTAMP(), so pre-aggregate by hour and filter at query time)
CREATE MATERIALIZED VIEW events_by_hour AS
SELECT
    DATE_TRUNC('HOUR', event_time) AS event_hour,
    event_type,
    COUNT(*) AS event_count,
    MAX(event_time) AS last_seen
FROM events
GROUP BY 1, 2;

-- Dashboards then query the most recent hour:
SELECT * FROM events_by_hour
WHERE event_hour >= DATEADD('hour', -1, CURRENT_TIMESTAMP());
Final Thoughts & Resources
Congratulations! You’ve completed this comprehensive Snowflake guide. Let’s recap what you’ve learned:
Key Topics Covered
Foundation:
- ✅ Snowflake’s unique three-layer architecture
- ✅ Virtual warehouses and micro-partitions
- ✅ Core components and data organization
- ✅ Time Travel and Zero-Copy Cloning
Getting Started:
- ✅ Initial setup and context management
- ✅ Loading data via multiple methods
- ✅ Working with databases and schemas
Best Practices:
- ✅ Schema design and data type selection
- ✅ Performance optimization techniques
- ✅ Warehouse sizing and management
- ✅ Query optimization strategies
- ✅ Cost management and monitoring
Security & Modeling:
- ✅ RBAC and access control
- ✅ Row and column-level security
- ✅ Data modeling patterns (Star, SCD, Data Vault)
- ✅ Advanced features (Streams, Tasks, Procedures)
Integrations:
- ✅ AWS integration (S3, IAM, PrivateLink)
- ✅ BI tool connections (Tableau, Power BI, Looker)
- ✅ System tables and monitoring
- ✅ Python/Pandas integration
Advanced Topics:
- ✅ Snowflake vs Redshift comparison
- ✅ Data sharing and marketplace
- ✅ Migration strategies
- ✅ Real-world architecture patterns
Key Takeaways
1. Start Simple, Scale as Needed
- Begin with X-Small warehouses
- Enable auto-suspend and auto-resume
- Monitor and optimize based on actual usage
2. Embrace Snowflake’s Philosophy
- Let Snowflake handle optimization
- Don’t over-engineer (no indexes, no partitioning decisions)
- Use modern data stack tools (dbt, Fivetran, etc.)
3. Cost Control is Crucial
- Set resource monitors
- Use transient tables for staging
- Optimize warehouse sizes
- Monitor query performance
4. Security First
- Use IAM roles, not access keys
- Implement network policies
- Use row-level and column-level security
- Audit access via system tables
5. Leverage Unique Features
- Data sharing for collaboration
- Time Travel for recovery
- Zero-copy cloning for dev/test
- Streams and Tasks for automation
Next Steps
Beginners:
- Sign up for Snowflake trial (30 days free, $400 credits)
- Complete Snowflake’s hands-on labs
- Load sample data and run queries
- Connect a BI tool (Tableau/Power BI)
Intermediate:
- Implement dbt for transformations
- Set up automated pipelines with Tasks
- Create materialized views for dashboards
- Optimize costs using monitoring queries
Advanced:
- Implement Data Vault or complex data models
- Set up cross-cloud replication
- Build real-time pipelines with Streams/Tasks
- Share data with partners via Data Sharing
Resources
Official Documentation:
- docs.snowflake.com
- Snowflake Community (community.snowflake.com)
- Snowflake University (training courses)
Community:
- r/snowflake on Reddit
- Snowflake LinkedIn group
- Medium articles on Snowflake
Tools Ecosystem:
- dbt (getdbt.com)
- Fivetran (fivetran.com)
- Census/Hightouch (reverse ETL)
- Great Expectations (data quality)
Quick Reference Commands
-- Context management
SHOW DATABASES;
SHOW SCHEMAS;
SHOW TABLES;
SHOW WAREHOUSES;
-- Resource monitoring
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE START_TIME >= DATEADD('day', -7, CURRENT_TIMESTAMP());
-- Session info
SELECT CURRENT_ACCOUNT(), CURRENT_REGION(), CURRENT_VERSION();
-- Query history
SELECT * FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
ORDER BY START_TIME DESC LIMIT 10;
-- Table info
SELECT GET_DDL('TABLE', 'table_name');
DESCRIBE TABLE table_name;
-- Warehouse operations
ALTER WAREHOUSE my_wh SUSPEND;
ALTER WAREHOUSE my_wh RESUME;
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';
-- Time Travel
SELECT * FROM table_name AT(OFFSET => -3600);  -- 1 hour ago (offset is in seconds)
SELECT * FROM table_name BEFORE(STATEMENT => 'query_id');
-- Cloning
CREATE TABLE new_table CLONE existing_table;
CREATE DATABASE new_db CLONE existing_db;
Thank You!
Thanks for reading this comprehensive guide! Snowflake is a powerful platform that continues to evolve. Keep learning, experimenting, and building awesome data solutions.
Happy Snowflaking! ❄️☁️
Questions or feedback? Connect with me on LinkedIn or leave a comment below. I’d love to hear about your Snowflake journey!
Found this helpful? Share it with your team and fellow data enthusiasts! ⭐
