To start with, I must say SageMaker Unified Studio (SUS for short from here on) can be confusing if you come from the traditional, individual AWS analytics services, because it wraps all the services you've already worked with:
- S3: Storage
- Lake Formation: Data Governance with fine-grained permissions
- Glue: For Spark workloads and data catalog management
- Redshift: Data warehouse
- Athena: Ad-hoc SQL queries
- SageMaker Notebook: Running Python scripts or connecting to Glue Interactive Sessions
- Bedrock: For Generative and Agentic AI components
- Amazon Q: For AI-assisted code generation (SQL and Python)
- DataZone: For business catalog, project management, and cross-domain data sharing
- EMR: For big data processing with Spark
Used separately, these components add operational and governance overhead for data, compute, and security, and each one lives in its own silo.
This is exactly the gap that lakehouse architecture sets out to fill.
Now with Unified Studio, you have all storage, compute (Athena, EMR, Redshift, and Glue), and governance (DataZone and Lake Formation) wrapped under one managed umbrella.
But here’s the interesting part—SUS provides a unified platform to implement lakehouse architecture seamlessly. And before you ask “why lakehouse?”—let me explain the problem this solves.
The Problem: Why We Need Lakehouse Architecture
You might be wondering—why are we even talking about lakehouse architecture? Because it solves a massive pain point you’ve probably experienced.
The Traditional Mess We’ve All Dealt With
Scenario 1: The Data Lake Approach
- You dump all your data into S3 (cheap storage ✅)
- But now you need to run analytics…
- Performance is terrible ❌
- No ACID transactions ❌
- Data quality enforcement? Good luck! ❌
- Result: You end up copying data to Redshift for actual analytics
Scenario 2: The Data Warehouse Approach
- You load everything into Redshift (great performance ✅)
- But storage costs skyrocket 💸
- Can’t handle semi-structured data well ❌
- ML teams want raw data in S3 anyway ❌
- Result: You end up maintaining both S3 and Redshift with duplicate data
The Real Problem:
- Data duplication everywhere—paying for the same data multiple times
- Complex ETL pipelines just to move data between systems
- Multiple permission models to manage across different services
- Inconsistent data across systems causing trust issues
- Slow time-to-insight because of all this overhead
- High operational costs for maintaining duplicate infrastructure
Sound familiar? This is exactly why SageMaker Unified Studio provides a unified platform to implement lakehouse architecture.
The Solution: What is Lakehouse Architecture?
Lakehouse architecture is an approach that combines the best of both worlds:
Data Lake + Data Warehouse = Lakehouse
According to AWS documentation:
“A data lakehouse is an architecture that unifies the scalability and cost-effectiveness of data lakes with the performance and reliability characteristics of data warehouses. This approach eliminates the traditional trade-offs between storing diverse data types and maintaining query performance for analytical workloads.”
Key Benefits of Lakehouse Architecture:
✅ Transactional consistency – ACID compliance ensures reliable concurrent operations
✅ Schema management – Flexible schema evolution without breaking existing queries
✅ Compute-storage separation – Independent scaling of processing and storage resources
✅ Open standards – Built on the Apache Iceberg open table format
✅ Single source of truth – Eliminates data silos and redundant storage costs
✅ Real-time and batch processing – Supports both streaming and historical analytics
✅ Direct file access – Enables both SQL queries and programmatic data access
✅ Unified governance – Consistent security and compliance across all data types
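To make a few of these benefits tangible (ACID writes, schema evolution, snapshot history), here's a minimal PySpark sketch against an Iceberg table. Treat it as illustrative only: the catalog name `lakehouse`, the `sales.orders` table, and the S3 bucket are all placeholders, and it assumes a Spark environment with the Iceberg and AWS Glue catalog dependencies available.

```python
# Minimal sketch: Iceberg table on S3 through the Glue Data Catalog.
# Catalog name, namespace, table, and bucket are all placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.lakehouse.warehouse", "s3://my-demo-bucket/warehouse/")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS lakehouse.sales")

# ACID write: readers never observe a half-finished commit
spark.sql("CREATE TABLE IF NOT EXISTS lakehouse.sales.orders (id BIGINT, amount DOUBLE) USING iceberg")
spark.sql("INSERT INTO lakehouse.sales.orders VALUES (1, 19.99), (2, 5.49)")

# Schema evolution: add a column without rewriting files or breaking old queries
spark.sql("ALTER TABLE lakehouse.sales.orders ADD COLUMN region STRING")

# Snapshot history: every committed change is queryable metadata
spark.sql("SELECT snapshot_id, committed_at FROM lakehouse.sales.orders.snapshots").show()
```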
This architectural approach is what SageMaker Unified Studio helps you implement without the complexity. Let’s see how.
How SageMaker Unified Studio Implements Lakehouse Architecture
SageMaker Unified Studio provides a unified platform that implements lakehouse architecture for you automatically. According to AWS documentation:
“The lakehouse architecture of Amazon SageMaker unifies data across Amazon S3 data lakes and Amazon Redshift data warehouses so you can work with your data in one place.”
Here’s how SUS implements this architecture:
1. Unified Data Access Through Single Catalog
Instead of managing separate connections to S3, Redshift, Aurora, DynamoDB, and other sources—you get one unified interface:
- AWS Glue Data Catalog serves as the single catalog where you discover and query all your data
- Apache Iceberg open table format provides interoperability across different analytics engines
- Multiple query engines (Athena, Redshift, Spark on EMR) can all access the same data without duplication
How it works:
“When you run a query, AWS Lake Formation checks your permissions while the query engine processes data directly from its original storage location, whether that’s Amazon S3 or Amazon Redshift.”
This means data stays where it is—no unnecessary movement or duplication.
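To sketch what this looks like from code, here's a hypothetical query against a Glue Data Catalog table using the awswrangler library. The `sales` database and `orders` table are placeholders; Lake Formation permissions apply transparently when the query runs.

```python
# Sketch: query a Glue Data Catalog table in place with Athena.
# Database and table names are placeholders; no data is copied anywhere.
import awswrangler as wr

df = wr.athena.read_sql_query(
    sql="SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region",
    database="sales",  # hypothetical Glue Data Catalog database
)
print(df.head())
```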
2. Two Types of Data Access
Managed Data Sources:
- Amazon S3 data lakes – Including Amazon S3 Tables with built-in Apache Iceberg support
- Amazon Redshift warehouse tables – Accessible as Iceberg tables through Redshift Spectrum
- Zero-ETL destinations – Near real-time data replication from:
  - SaaS sources (Salesforce, SAP, Zendesk)
  - Operational databases (Amazon Aurora, Amazon RDS for MySQL)
  - NoSQL databases (Amazon DynamoDB)
Federated Data Sources (Query in-place without moving data):
- Operational databases (PostgreSQL, MySQL, Microsoft SQL Server)
- AWS managed databases (Amazon Aurora, Amazon RDS, Amazon DynamoDB, Amazon DocumentDB)
- Third-party data warehouses (Snowflake, Google BigQuery)
When you connect federated sources in SUS, AWS automatically provisions the required infrastructure components (AWS Glue connection, Lambda functions) that act as bridges between the query engines and the federated data source.
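As a rough sketch of what querying a federated source can look like once SUS has provisioned the connector, here's a hedged boto3 example using the Athena API. The catalog name `postgres_catalog`, the workgroup, and the results bucket are all assumptions about your setup:

```python
import boto3

athena = boto3.client("athena")

# Sketch: query a federated PostgreSQL source in place through Athena.
# "postgres_catalog" stands in for a data catalog registered for the source
# (backed by the auto-provisioned Glue connection and Lambda connector).
response = athena.start_query_execution(
    QueryString='SELECT * FROM "postgres_catalog"."public"."customers" LIMIT 10',
    WorkGroup="primary",  # placeholder workgroup
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
)
print(response["QueryExecutionId"])
```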
3. Centralized Governance with Lake Formation
One permission model (AWS Lake Formation) that enforces access control consistently across:
- S3 data lakes
- Redshift data warehouses
- Federated sources
- All query engines (Athena, Redshift Query Editor v2, EMR, Glue)
Fine-grained control at table, column, row, and cell levels—defined once, enforced everywhere.
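As a minimal sketch of what "defined once" can look like with the Lake Formation API (the account ID, role ARN, database, and table names below are placeholders):

```python
import boto3

lf = boto3.client("lakeformation")

# Sketch: grant column-level SELECT on a catalog table to an analyst role.
# Every engine (Athena, Redshift, EMR, Glue) then enforces this same grant.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"},  # placeholder ARN
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales",          # hypothetical database
            "Name": "orders",                 # hypothetical table
            "ColumnNames": ["id", "amount"],  # analysts see only these columns
        }
    },
    Permissions=["SELECT"],
)
```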
4. Project-Based Organization with DataZone
Amazon DataZone powers the project and domain management in SUS:
- Business catalog for data discovery and data product publishing
- Project boundaries for collaboration and permissions
- Cross-domain data sharing for governed data access across teams
- Domain management for organizing multiple projects
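For a flavor of what this looks like programmatically, here's a hedged sketch using the DataZone API to search the business catalog; the domain identifier is a placeholder for your own SUS domain:

```python
import boto3

dz = boto3.client("datazone")

# Sketch: discover published data products in the business catalog.
response = dz.search_listings(
    domainIdentifier="dzd_example123",  # placeholder for your SUS domain ID
    searchText="orders",
    maxResults=10,
)
for listing in response.get("items", []):
    print(listing)
```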
The Architecture Components:
Let me clarify what each component actually is:
- Lakehouse Architecture = The architectural pattern/approach (not a product)
- SageMaker Unified Studio = The unified platform that implements lakehouse architecture
- Athena, Redshift, Spark (EMR) = The query/compute engines that process your queries
- Glue Data Catalog = The unified metadata catalog (single source of truth for metadata)
- Lake Formation = The governance layer providing fine-grained permissions
- DataZone = The business catalog and project management layer
- Apache Iceberg = The open table format enabling cross-engine interoperability
Now that you understand how SUS implements lakehouse architecture, let’s see how it organizes your workflow.
The Three Core Sections of SUS
With lakehouse architecture providing unified data access, SUS organizes your workflow into three intuitive sections:
A. Discover
Your starting point for data exploration:
- Data Catalog (powered by AWS Glue Data Catalog) – Discover all available data across lakes, warehouses, and federated sources
- Business Catalog (powered by DataZone) – Find published data products and datasets shared across domains
- Bedrock Playground – Experiment with Generative AI models and prompts
This is where you explore what data is available across all your sources—all unified through the single catalog.
B. Build
This is where the action happens. SUS provides access to multiple analytical and ML tools:
- Query Editors:
  - Amazon Athena Query Editor – For serverless SQL queries across S3 and federated sources
  - Amazon Redshift Query Editor v2 – For high-performance queries on warehouse data
- Notebooks and Development:
  - JupyterLab Notebooks – For data science, ML development, and programmatic data access
  - SageMaker Training – For building and training ML models
  - SageMaker Inference – For deploying models
- Data Processing:
  - AWS Glue Visual ETL – For no-code/low-code data transformations
  - Amazon EMR – For big data processing with Apache Spark
- Orchestration:
  - Amazon MWAA (Airflow) – For workflow orchestration and scheduling
- AI Assistance:
  - Amazon Q Developer – For AI-assisted SQL and Python code generation
All these tools access your unified data seamlessly through different query engines, without requiring data movement.
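To underline the "same data, different engines" point, here's a hedged sketch hitting the same catalog-backed table through the Redshift Data API this time; the workgroup and database names are assumptions about your project environment:

```python
import boto3

rsd = boto3.client("redshift-data")

# Sketch: the same lakehouse table, queried through the Redshift engine.
response = rsd.execute_statement(
    WorkgroupName="sus-project-workgroup",  # hypothetical Redshift Serverless workgroup
    Database="dev",                         # placeholder database
    Sql="SELECT region, COUNT(*) AS orders FROM sales.orders GROUP BY region",
)
# Poll rsd.get_statement_result(Id=response["Id"]) for rows once the query finishes.
print(response["Id"])
```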
C. Govern
Making your curated, valuable data available to downstream consumers:
- Publish data products to the business catalog
- Share datasets across projects and domains
- Enforce permissions consistently through Lake Formation
- Track data lineage and usage
- Manage data quality and compliance
Lake Formation ensures consistent permissions across all access patterns and query engines, while DataZone manages the business metadata and sharing workflows.
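As a closing sketch of the governance flow, here's a hedged DataZone call that publishes a curated asset to the business catalog; every identifier below is a placeholder for your own domain and asset:

```python
import boto3

dz = boto3.client("datazone")

# Sketch: publish a curated asset so other projects and domains can discover it.
dz.create_listing_change_set(
    domainIdentifier="dzd_example123",  # placeholder domain ID
    entityIdentifier="asset-abc123",    # placeholder asset ID
    entityType="ASSET",
    action="PUBLISH",
)
```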
Understanding the Architecture
In the architecture diagram, the green boxes represent the core concepts of SageMaker Unified Studio and how they interconnect:
| From | To | Relationship | Cardinality | Meaning |
|---|---|---|---|---|
| Domain | Domain units | contains | 1:M | Organizational structure |
| Domain | Projects | consists of | 1:M | Projects belong to a domain |
| Projects | Users | include members | M:M | Users work in projects |
| Projects | Data | encapsulate | 1:M | Projects access data sources |
| Users | Assets | govern data via | M:M | Users create/manage assets |
| Data | Assets | is published into | M:M | Data curated into assets |
The green boxes together represent the DataZone-powered organizational framework that provides:
- ✅ Structure (Domain, Domain units)
- ✅ Collaboration (Projects, Users)
- ✅ Data management (Data, Assets)
- ✅ Governance (permissions flow through all six)
This is the foundation that enables lakehouse architecture implementation in SageMaker Unified Studio!
Notice how SUS provides the platform layer that implements lakehouse architecture, with the unified catalog at the center and multiple query engines accessing data from its original storage location.
🚀 Try It Yourself
Want hands-on experience? AWS has a practical workshop covering everything in this post:
👉 SageMaker Unified Studio Workshop
This workshop simulates a real-world scenario through the lens of different data professionals addressing actual business challenges. You'll experience the end-to-end process, from initial data analysis to deploying a GenAI-powered, tailored student engagement solution.
Read through the workshop and follow the screenshots to understand SUS and the workshop implementation.
💬 Feedback Welcome
What did you think?
- ✅ Helpful sections?
- 🤔 Confusing parts?
- 💡 Topics for next post?
Have you tried SageMaker Unified Studio? Share your experience in the comments!


