🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 – Customize models for agentic AI at scale with SageMaker AI and Bedrock (AIM381)
In this video, Amit Modi and Shelbee demonstrate AWS SageMaker’s new capabilities for building agentic AI applications at scale. They introduce serverless model customization with broad foundation model choices and fine-tuning techniques including reinforcement learning, serverless MLflow for unified observability across models and agents, and serverless model evaluation with industry benchmarks and AI-as-a-judge metrics. The demo showcases an end-to-end workflow: customizing Qwen 2.5 for a medical triage agent using supervised fine-tuning, tracking experiments and datasets as versioned assets, evaluating against MMLU clinical knowledge benchmarks, deploying to SageMaker endpoints, and integrating with AgentCore runtime using the Strands SDK. Key features include automatic lineage tracking, SageMaker Pipelines integration with new deployment steps for Bedrock, multi-model endpoints with adapter-based inference for 50% cost savings, and speculative decoding for 2.5x latency reduction. The session addresses four critical production challenges: lack of standardized customization tools, fragmented observability, evolving ML asset tracking needs, and complex inference optimization.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
The Growing Opportunity and Key Challenges in Deploying Agentic AI Applications
Welcome everyone. This session is for data scientists and AI developers who want to customize models and deploy at scale to build high-quality and cost-effective agentic applications. In today’s session, we’ll also cover some of the new launches that were just announced in the AI keynote. So let’s dive in.
We are seeing two trends emerging in the market. First, there's rapid adoption of agentic AI in enterprise software applications, and this adoption is expected to go from 1% in 2024 to 33% in 2028; that's a 33x increase in just four years. Organizations also expect that 15% of decisions will be made autonomously by agents by 2028, which will require a lot of compute and models that deliver high-quality, cost-effective, and fast inference.
Customers are increasingly relying on open source models to build out these applications. However, despite the large opportunity and a clear line of sight on how to build these applications, we see the majority of these applications never make it to production. Let’s take a look at four key challenges that are blocking these applications from getting deployed to production.
First, customers lack standardized tools to customize models, so they end up spending time building out these workflows themselves. When the time comes to take those workflows to production, because they've been put together with glue code, they have to rewrite them with production-grade tools to build pipelines that support repeatable and scalable workloads. This often delays projects and leads to a lot of manual effort.
Second, customers lack tools that provide a unified view of model and agent observability. With fragmented tools, it becomes much harder to debug the root cause when something fails or when the behavior of the agent or the model deviates. Third, with model customization, the need for tracking ML assets has evolved. Previously, it was enough to monitor, track, and version models, make sure they were cataloged in the right place, and meet governance and compliance requirements. Now customers also need to think about the reward functions used in reinforcement learning, prompts, and so on, which often leads to building additional tools or integrations and delays timelines.
Lastly, customers need cost-effective, high-quality inference. Building out an inference stack can be very complex: you have to find the right instance and benchmark against different instance types to get the right cost-to-performance ratio, then do the same with the container or the different frameworks you're using. This often leads to a lot of manual work and delays in getting these applications to production, and sometimes the ROI doesn't look right because inference is too expensive.
Introducing Serverless Model Customization: A Managed Experience for Foundation Models
Let’s take a look at some of the key capabilities that SageMaker offers to address these challenges. I’m Amit Modi, Senior Manager for Model Operations and Inference, and with me I have Shelbee, who is the Worldwide Specialist Senior Manager for Gen AI. Today we are going to cover the key SageMaker capabilities that will help you address some of these challenges, and then Shelbee will demo all these capabilities and bring them to life.
Today we are announcing the launch of serverless model customization. Serverless model customization offers you the broadest choice of foundation models that you can use to customize based on your domain-specific or proprietary data of your organization. Along with this broad choice of models, you also have access to a broad choice of different fine-tuning techniques that you can use to customize these models, which includes reinforcement learning. This experience is completely serverless, so you don’t need to worry about reserving capacity or finding out where the GPUs are. You just kick off the job and we take care of all the infrastructure for you.
You can now navigate to SageMaker Studio, where you will find Models. Under Models, you will see the list of all the public foundation models along with three different experiences: you can customize these models through the UI, the SDK, or the agent experience. In today's talk we'll primarily focus on the UI, but we'll see a bit more in the demos. Once you click into the UI, you are taken to a managed experience where you can select the base model as well as the fine-tuning technique that you want to use for customizing the model.
Once you select that, you can upload a dataset here, or if you already have a dataset that is tracked with SageMaker, you can simply select that dataset. Then you can choose the reward function that you want to use for customization.
You can choose from a variety of reward functions, and you can also bring your own code: either type the code in directly or point to a Lambda function that implements the reward function. You can then simply select one of the Lambda functions you have already registered and get started with fine-tuning. Behind the scenes, SageMaker checkpoints your jobs regularly. If a node fails in the cluster where the job runs, we replace it with a healthy node and resume the job from the last checkpoint, so you never consume more compute than necessary. And if the job fails for any reason, you always get a final checkpoint that you can use to resume your training jobs.
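For illustration, here is a minimal sketch of what a Lambda-backed reward function could look like. The event and response schema, the field names, and the scoring heuristic are all assumptions for illustration, not the documented SageMaker contract.

```python
import json

def lambda_handler(event, context):
    """Hypothetical reward function for RL-based customization.
    The event/response schema shown here is an assumption for illustration."""
    completion = event.get("completion", "")

    # Toy heuristic: reward concise answers that stay on the triage task.
    reward = 0.0
    if "triage" in completion.lower():
        reward += 0.5
    if len(completion.split()) <= 200:
        reward += 0.5

    return {"statusCode": 200, "body": json.dumps({"reward": reward})}
```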
Unified Observability and Evaluation with Serverless MLflow and Model Evaluation
Once you have built out these workflows, you can use SageMaker Pipelines to continue customization and deploy these applications into production. Today we're also announcing new pipeline steps that are purpose-built for model customization and deployment, not only to SageMaker endpoints but also to Bedrock if that is where you run inference. With these new purpose-built pipeline steps, you can accelerate your development without writing any glue code to integrate with EMR Serverless for data processing, with Bedrock for deployment, or with training jobs within SageMaker. If you already have existing code that you wrote for your experiments in a notebook, you can simply annotate that code with the @step decorator, or use our UI to upload the code and convert it into a fully functional pipeline. Pipelines are serverless, so you don't need to worry about managing any infrastructure, and all the metrics are logged to CloudWatch where you can go and debug any issues.
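As a rough sketch of the @step pattern, the snippet below turns two plain Python functions into pipeline steps with the SageMaker Python SDK. The function bodies, names, S3 paths, and IAM role are placeholders, and a real workflow would use the new purpose-built customization and deployment steps mentioned in the session.

```python
# Minimal sketch of the @step decorator pattern; the preprocessing and
# fine-tuning logic here is placeholder code, not the session's actual workflow.
from sagemaker.workflow.function_step import step
from sagemaker.workflow.pipeline import Pipeline

@step(name="preprocess")
def preprocess(dataset_s3_uri: str) -> str:
    # ... transform raw data and return the processed S3 prefix ...
    return dataset_s3_uri + "processed/"

@step(name="fine-tune")
def fine_tune(processed_uri: str) -> str:
    # ... kick off the customization job and return the model artifact URI ...
    return processed_uri + "model/"

# Chaining the decorated functions defines the step dependency graph.
model_artifacts = fine_tune(preprocess("s3://my-bucket/medical-triage/"))

pipeline = Pipeline(name="medical-triage-customization", steps=[model_artifacts])
pipeline.upsert(role_arn="arn:aws:iam::123456789012:role/SageMakerExecutionRole")  # placeholder role
pipeline.start()
```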
We also announced the launch of serverless MLflow yesterday. Serverless MLflow solves the problem of fragmented observability that we just talked about. When you are customizing your models, you can log experiments, and you can also log evaluation metrics as well as the agent traces once the agents are deployed. Serverless MLflow is fully managed: you don't need to spin up any servers. We spin up the compute for you as inference or training traffic grows, scale serverless MLflow up, and scale it back down when the infrastructure is no longer needed. There's also no additional charge for MLflow; you can simply use SageMaker and log all your metrics in serverless MLflow without worrying about pricing implications.
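To make the logging flow concrete, here is a minimal sketch of logging a fine-tuning run to a SageMaker managed MLflow tracking server. The tracking server ARN, experiment name, and metric values are placeholders; in the managed customization experience this logging happens automatically.

```python
import mlflow

# Point the client at the managed MLflow tracking server.
# The ARN below is a placeholder for your own tracking server.
mlflow.set_tracking_uri("arn:aws:sagemaker:us-east-1:123456789012:mlflow-tracking-server/demo")
mlflow.set_experiment("medical-triage-finetuning")

with mlflow.start_run(run_name="qwen2.5-sft-v1"):
    mlflow.log_params({"base_model": "qwen2.5", "epochs": 3, "learning_rate": 1e-5})
    mlflow.log_metric("train_loss", 0.42, step=1)
    mlflow.log_metric("validation_loss", 0.47, step=1)
```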
Serverless MLflow is also deeply integrated into your experience. Going back to the fine-tuning experience where we kicked off a job: once the job has started executing or completed, the model details page starts to show performance metrics. MLflow is deeply embedded here, so you can navigate straight into MLflow and see much richer metrics if you want to take a deeper look. It's also easily accessible under Applications in case you want to leverage it for other workflows. Under a particular run you can now see all the experiments you ran across different fine-tuning jobs, compare and contrast those metrics, take a deeper look, and then choose the right fine-tuned model. After you've chosen the fine-tuned model, you can start to evaluate it.
Today, we are also announcing the launch of serverless model evaluation inside Studio. As part of this experience, we offer popular industry benchmarks that you can use to evaluate your models. This experience is also fully serverless, so you don't need to worry about managing any instance types; you simply kick off an evaluation job and the execution is handled on your behalf. Let's take a look at how this experience works. In the evaluation experience, you simply choose one of the techniques; in this case, we'll choose AI as a judge. It also allows you to define which metrics you want to use, so you can choose the right quality metrics. These are the most commonly used industry benchmarks.
Because responsible AI has become such a critical aspect of shipping any model, we also offer some of these metrics out of the box that allow you to measure and put guardrails on your content. When you spin up an evaluation job, you can provide a prompt template as well, so that the model you chose knows exactly how to evaluate these requests. Once you kick off the evaluation, you will see the results not only for the fine-tuned models but also for the base model, both for the quality metrics and for the responsible AI metrics. This makes it much easier to decide whether your model is actually performing better on both dimensions.
Agent Observability Through Managed MLflow and Partner AI Applications
Once you've deployed these models into production, if you're building only an agentic application on top, agent observability is already integrated into CloudWatch, so you have dashboards there to monitor the traces. But if you're also leveraging model customization, you can use AgentCore observability to emit metrics in OpenTelemetry format into managed MLflow or partner AI apps.
We'll take a quick look at how you can use those metrics in managed MLflow and in partner AI apps. Here's a screenshot of what the experience looks like with managed MLflow. On the left you see a complete trace tree, starting from invoking the agent at the top and drilling down through the workflow, capturing each LangChain operation and showing the tool calls and the calls to multiple assistants (assistant 1, assistant 2). This hierarchical view gives you complete visibility into each step of the agent.
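As a rough illustration of how such a trace tree can be produced during development, the sketch below uses MLflow's tracing APIs. Note that this is generic MLflow tracing, not the AgentCore OpenTelemetry export path described above, and the tracking server ARN and the triage function are placeholders.

```python
import mlflow

# Placeholder tracking server ARN; point this at your managed MLflow server.
mlflow.set_tracking_uri("arn:aws:sagemaker:us-east-1:123456789012:mlflow-tracking-server/demo")
mlflow.set_experiment("agent-traces")

# Automatically capture LangChain operations as hierarchical traces.
mlflow.langchain.autolog()

# Manual spans for custom steps outside LangChain.
@mlflow.trace(name="triage_decision")
def triage(symptoms: str) -> str:
    # Toy logic: decide between paging the on-call and booking an appointment.
    return "page_on_call" if "chest pain" in symptoms.lower() else "book_appointment"

triage("mild headache for two days")
```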
We also offer partner applications as managed capabilities on SageMaker, so you can simply launch them and get started. Customers that choose Comet ML for tracking their experiments and evaluating models can log these metrics into Comet ML and then use that information across both agents and models to debug root causes. Comet also offers key capabilities such as prompt optimization, where you provide the end goal and it keeps running evaluations and optimizing the prompt until you have the best one, as well as agent optimization.
We also enable Deepchecks, which allows you to test, evaluate, and monitor LLM apps as well as agents. It provides a much more comprehensive view: you can manually and automatically annotate all the interactions with the LLM and get a comprehensive report, which makes it much easier to root-cause and debug issues. Lastly, we also have Fiddler, which provides agentic observability and lets you trace issues back down to every single model customization.
Cost-Effective Inference: Multi-Model Endpoints and Speculative Decoding
SageMaker also offers capabilities to deploy your models, and today we're announcing an easy integration with Bedrock: you can kick off a job to deploy your models to Bedrock directly from SageMaker Studio. Along with that, you can continue to leverage SageMaker endpoints and deploy multiple adapters for your model onto the same endpoint; we'll take a deeper dive in a moment into how this capability can help you save on costs. You can also use different techniques to further optimize the performance of the model. There are various techniques on offer, and we'll dive deeper shortly into speculative decoding, which is one of the key techniques for minimizing latency for your applications.
Let's take a look at the first key benefit of SageMaker endpoints. You can deploy multiple foundation models on the same endpoint, which means that as your use cases continue to grow, you keep using a single endpoint. This can significantly reduce your cost. As traffic grows, SageMaker emits metrics for each foundation model, so you can build autoscaling policies for every single model; that particular foundation model then scales out to additional instances, keeping your costs minimized. This can help you save up to 50% on cost. SageMaker also recently launched the ability to cache model weights, which makes autoscaling much faster.
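To make the per-model autoscaling idea concrete, here is a hedged sketch that uses Application Auto Scaling against a single inference component. The component name is a placeholder, and the resource ID, scalable dimension, and predefined metric strings are written from memory of the SageMaker inference component documentation, so verify them before use.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Scale one foundation model (deployed as an inference component) independently
# of the other models on the same endpoint. The component name is a placeholder.
resource_id = "inference-component/medical-triage-qwen25"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="per-model-invocations-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
        },
    },
)
```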
Next, let's take a look at one of the key techniques we recently launched to help you optimize latency for your applications. Customers typically expect inference to be fast; however, foundation models generate one token at a time, which makes inference slower. Speculative decoding works with a foundation model and a draft model. The draft model is typically a smaller model: it generates some tokens for the foundation model to review. The foundation model then reviews those tokens, assigns probabilities, and accepts some tokens while rejecting others.
Let's take an example of how that works. Given a prompt, the draft model generates a candidate response, and the foundation model evaluates those tokens and accepts some of them. This is a simplistic view: you can use the draft model to generate multiple candidate continuations, and the capability we launched recently allows you not only to generate these draft tokens but also to fine-tune the draft model on your own traffic, which gives you higher-quality proposals from the draft model.
This leads to a latency reduction of up to 2.5x without losing any accuracy, because you're able to fine-tune the draft model on your traffic. Here's how it works: you bring your own dataset or use one of the SageMaker curated datasets and kick off the fine-tuning job on the draft model. It's an async job that runs in the background. Once the job is complete, it publishes the evaluation metrics; you can review them and choose to deploy the draft model on the same endpoint, so there's no impact on your cost.
The draft model gets deployed on the same endpoint, where it generates tokens that the foundation model evaluates and accepts, so you don't lose any accuracy and you continue to see the latency improvement.
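Purely for intuition, here is a toy, self-contained sketch of the accept/reject loop behind speculative decoding. The "models" are random stand-ins and the per-token acceptance check is a simplification of the real single-pass verification; SageMaker runs the actual logic inside the serving container for you.

```python
import random

# Toy illustration of the speculative-decoding accept/reject loop.
# draft_propose and target_accept_prob are stand-ins, not real SageMaker APIs.
VOCAB = list(range(100))

def draft_propose(context, k):
    return [random.choice(VOCAB) for _ in range(k)]

def target_accept_prob(context, token):
    return 0.7  # pretend the foundation model accepts ~70% of draft tokens

def speculative_decode(prompt, k=4, max_new=16):
    output = list(prompt)
    while len(output) - len(prompt) < max_new:
        draft_tokens = draft_propose(output, k)       # cheap: small draft model
        for tok in draft_tokens:                      # verified by the large model
            if random.random() < target_accept_prob(output, tok):
                output.append(tok)                    # accepted draft token
            else:
                output.append(random.choice(VOCAB))   # resample from the large model
                break                                 # stop at first rejection
    return output

print(speculative_decode([1, 2, 3]))
```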
Demo Part 1: End-to-End Model Customization for a Medical Triage Agent
Lastly, we also launched new capabilities that allow you to track not only models but also generative AI assets like datasets and reward functions. Similar to models, you can now track all the datasets as well as the validators that Shelbee will walk through in her demo. You can not only track these assets but also track lineage, so you can trace each version and the lineage of all of these assets.
With that, I'll hand it over to Shelbee to walk us through all these capabilities. All right, thank you all for joining. I think we're probably standing between you and lunch, or maybe you've already had lunch, but thank you for joining anyway.
For this demo, we’re going to go through a couple of things. It’s going to be an end-to-end demo where we look at model customization and then integrate that model into an agent. We’re going to use a really simplistic use case of a medical triage agent. We’re going to take an open source model, customize it or adapt it to our specific task. In this case, it’s going to be an agent that is able to triage medical symptoms prior to allowing a patient to book an appointment or paging an on-call physician.
The demo is going to be in two parts. We’re going to start first with model customization, focusing on the newer model customization experience, and then in the second part, we’ll look at how to integrate that with agents, specifically AgentCore. In this specific demo, we’re going to focus on the end-to-end capabilities. Keep in mind that over the next few days there’s going to be a lot more sessions that dive deeper into different specifics of the new model customization experience, including the different techniques as well as the different evaluation techniques. So let’s get started.
Assets is a new capability for tracking and managing the different assets that are part of your experiments. Datasets are a key asset. If we click on Datasets, it shows all the datasets you've uploaded that are available for use in different fine-tuning workloads. You can also see that multiple versions are maintained across different experiments; you may have several versions of the processed data that you'll use as input to your fine-tuning jobs.
To upload a dataset, you just click Upload dataset, enter a descriptive name, and then upload either directly from your local computer or from an S3 location. If you already have your data in S3, you can upload it from there; in this case, we'll just do a local upload from my computer. Then you hit save and the dataset is uploaded. You can upload new versions as you iterate through your experiments, and this dataset will be used as input into the model customization job that we will kick off.
The other thing I want to point out is the new serverless MLflow. If you're used to the interface, you'll now see a new App Servers tab inside MLflow, and this is where all of your serverless MLflow app servers reside. I have a default one here, but I also have a custom one that I made for this demo, so we'll use that one specifically. As you'll see through the model customization demo, all of the model performance metrics are automatically logged into MLflow, as well as all the evaluation metrics that you run against your fine-tuned models. This makes it really easy to compare and visualize across experiments and understand which model you eventually want to deploy, test out with your agent integrations, or deploy into production environments.
So let’s go ahead into the new customization experience. You’ll see inside here there’s a range of models that you can customize, including the previous existing models.
The same applies to all machine learning and deep learning models; they're all in one spot. All the models that you have customized end up over here in My Models, where you can see them and get details about them, and we'll come back to this later. Let's go ahead and kick off a model customization job.
As you can imagine, these jobs take a little bit of time to run, so I do have some pre-baked models and pre-baked versions that we’ll use in cooking show fashion. To start with, you basically just hit customize. You can customize through the UI, which is what we’re going to do in the demo, but you can also customize through AI Agent, which was announced this morning. It is in preview, but it provides the ability to use natural language to develop a guided workflow for fine-tuning customization. You can also customize through code as well. All of this has an SDK available for those that prefer a programmatic experience. That said, let’s go to customize through UI.
Basically, you’ll just enter a descriptive name for your model. I just call it medical triage. I have a couple in here. Here is where you can see the different customization techniques. As I mentioned, through the different sessions over today and tomorrow, they’re going to go into more detail on many of these different customization techniques. You can see the ones available out of the box today are Supervised Fine-Tuning, DPO, Reinforcement Learning with Verifiable Rewards, and Reinforcement Learning with AI Feedback. In this case, we’re just going to do Supervised Fine-Tuning for today. You click the customization technique you want to use.
You can upload your dataset here, or in this case, we’re just going to point to the dataset that we just uploaded before. Then you select the version that you want to use. We only have one version in this case. Here is where you can modify some of the hyperparameters and configurations across your different training experiments. In this case, I’m going to just go with the default parameters out of the box, but you can modify the different hyperparameters here.
Then we’ll also point to the MLflow App that you want to use. You can see here the default one that’s created, and then here is the one that I created specific for this session today. You can of course adjust the experiment name to have more meaningful titles inside there so it’s easier to find in MLflow. Here is also where you can adjust the security settings. Although it’s serverless training under the covers, you can also specify to run inside your VPC and specify the type of encryption that you want to use on the volumes. Then just submit.
What that does basically is kick off a serverless training job. You can see one in progress here. As it kicks off and starts going, you have the logs here. They’re not available yet, but the logs are there to watch, and if you want to do things like early stopping and that sort of thing, you can. That said, let’s look at some pre-baked versions. I’m going to go over here and look at this one. This one was trained on a full dataset. In this case, we’re just using an open dataset, which is trained on medical symptoms and then medical diagnosis.
Let’s view the latest version. Here is where you can see all your different versions. You can move between versions and different tabs. There’s performance, and you can see in this case it went off the rails a little bit, probably overfitting a smidge. If I’d been watching it, I could have done early stopping, but I can also tweak the hyperparameters a bit to avoid some of that. There’s also evaluations. You can see here this is where the evaluation jobs are, and there are a lot of different evaluations available. In this case, we ran some out of the box benchmarks as well as another full evaluation.
If you want to run an evaluation, you just click Evaluate, set up a descriptive name again, and there are three evaluation types. There's LLM as a Judge, where you specify which model you want as your judge model along with the metrics you want to evaluate against, and you can also bring your own custom metrics. There's Custom Score, where you can bring your own custom scoring code; there are also a couple of built-in metrics, specifically around code execution and math answers. In this case we're going to use some out-of-the-box benchmarks to demo. Specifically, we'll use MMLU (Massive Multitask Language Understanding), narrowed down to the task that we're trying to fine-tune for, inside the medical domain.
The other thing I'm going to click is compare against the base model. This is important when you want to check whether fine-tuning actually makes sense: you want to compare against that base model. In this case, we're using Qwen 2.5, so we want to compare all our evaluation metrics against that base model to make sure we are actually making progress with the fine-tuned model. In this case, the benchmark tests across 10 subjects, which isn't super useful for our use case, so we'll narrow it down to clinical knowledge. You can also specify advanced configuration parameters within your evaluation, such as top P and top K, as well as security settings. These evaluation jobs run in the background, but you still have the ability to run them inside your VPC and specify the type of encryption. So I'm just going to hit submit.
Once you submit, it automatically creates an evaluation pipeline for you behind the scenes. You don’t have to deal with the pipeline that gets created, but it takes care of the steps involved with passing data in, passing it out, and publishing those metrics into MLflow. We’ve done our training, and let’s assume we’ve done some evaluation. Let’s look at some of the metrics that flow over into MLflow automatically.
First, we’ll look at the model performance metrics. These are the pre-baked versions that I have from the actual training performance themselves. Let’s compare against the four versions that we have. You’ll see a table format of the different versions and some of the metrics that are captured during your training cycles, such as how many epochs you ran with, the loss, validation loss, test loss, and all of those different metrics. You can also go into those same models and do more visual comparisons, which are helpful in evaluating against each other.
Inside here is the visualization where you can visualize across the different versions to compare the loss that you’re seeing across all the different iterations of fine-tuning. This is the model performance itself, and then there’s the evaluation metrics that we just saw where we use the MMLU clinical knowledge benchmark to benchmark against. You can also compare those metrics inside here as well. I’m going to hide the model performance metrics to highlight the evaluation results.
Here are the evaluation results, and you'll notice I'm also highlighting the base model, because we want to understand whether this model is actually performing better than the base Qwen model for the particular task we're trying to solve. In this case, we're using the clinical custom evaluation. Inside here you'll see LLM as a judge as well as MMLU college knowledge. In reality, you're going to look at the model across a bunch of different performance metrics; this is just one example where we're comparing across one of these evaluation metrics. In this particular case, the first version is performing best against that particular benchmark, the clinical knowledge benchmark. You can do the same across different metrics and different evaluations to ultimately decide which model you want to test a bit more within your agentic workflows or move into production.
Demo Part 2: Deploying and Integrating the Fine-Tuned Model with AgentCore
Now that we have done evaluation, let's go back to our model and move into deployment. Assuming you want to deploy this model and integrate it with agents, you would go into Deployment and click Deploy. Here you can deploy to either SageMaker or Bedrock, which is a very nice feature; deployment to Bedrock goes through custom model import. For SageMaker, let's assume we're going to create a new endpoint. Here you just enter a name and choose the instance type, or simply accept the default recommendation.
The advanced options are available to you in terms of the max instance count, the security settings, and all of those different items. Then you just hit deploy. What that will do is deploy a SageMaker endpoint behind the scenes that is now available and ready for use and integration into any agent workflows or direct application integration.
The other thing to point out is the lineage tab, which Amit talked about. Because there are a lot of moving pieces across all of these tasks and steps, from the training to all the evaluations that have to happen, the nice part is that you are able to track and maintain complete lineage. For example, if we click on the first item, model artifacts, in this case it is the base Qwen model, and you can see exactly which version of the Qwen model was used for this fine-tuning. As you progress through, you can see the training job that was used, and all the metadata is stored and collected to capture the complete lineage.
You will see it tracks all the way back through pipeline execution and deployment. In this case, the deployment is not actually finished yet, but it does track all the way through to deployment, which is really nice for traceability and maintaining that end-to-end lineage. That being said, let us assume we have it deployed and now we are going to integrate it with our agent workflow. These are the steps we just took: doing all the experimentation and ultimately deploying to SageMaker AI. Again, you could deploy to Bedrock; it depends on what you are looking for. Adapter-based inference is a really nice, cost-compelling feature, especially for fine-tuned models, but you can also import into Bedrock.
In this case, what we are going to do is take that SageMaker endpoint that is now hosting our fine-tuned model that is specifically adapted for the task inside the medical domain that we are looking for. We are going to assume this is experimentation at this point and use the Strands SDK. Keep in mind when we are going through this, you could just as easily use any of the other frameworks that are supported by AgentCore, whether it is LangGraph or Crew AI. I am just showcasing it with Strands. Here what we are going to do is create an agent and create a couple of dummy tools on the back end and see if the agent is able to make smarter decisions about whether those symptoms are urgent and potentially need to page the on-call or go to urgent care, or if they are going to let them book an appointment that is a less urgent appointment. We are also going to look at MLflow on the back end for agent traces.
One thing to keep in mind is that MLflow does agent traces and also agent evaluations. Bedrock AgentCore also has great built-in observability features, as well as evaluations that were announced yesterday, so it depends on your setup: MLflow may be a good option if you are dealing with both models and agents, whereas if you are only dealing with agents, using AgentCore with those native built-in capabilities may make more sense for your use case. That being said, there are a bunch of dependencies and installations, but one thing I wanted to show is the model configuration with the Strands SDK. Just like a Bedrock API endpoint, you can create a model that is backed by a SageMaker endpoint as well.
You will see there is a specific integration with Strands called SageMakerAIModel where you specify the endpoint name as well as the inference component name in the case of where you are taking advantage of that adapter-based inference. Then we are just going to create a super simple agent with no tools behind the scenes, just to see how our model performs against the task. This is the fine-tuned model without tools. Here you will just say I need to book an appointment. It does a reasonable job in responding, but it does not necessarily have the smarts to actually go and do anything about it, to actually look for availability or page an on-call doctor, that sort of thing.
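Here is a hedged sketch of that configuration. The SageMakerAIModel class is named in the session, but the exact keyword arguments, endpoint name, and inference component name below are assumptions for illustration; check them against the strands-agents SDK version you install.

```python
from strands import Agent
from strands.models.sagemaker import SageMakerAIModel

# The argument names below are assumptions based on common Strands configuration
# patterns; the endpoint and inference component names are placeholders.
model = SageMakerAIModel(
    endpoint_config={
        "endpoint_name": "medical-triage-qwen25-endpoint",
        "inference_component_name": "medical-triage-adapter-v1",  # adapter-based inference
        "region_name": "us-east-1",
    },
    payload_config={"max_tokens": 512, "temperature": 0.2},
)

# A bare agent with no tools, just to sanity-check the fine-tuned model.
agent = Agent(model=model, system_prompt="You are a medical triage assistant.")
agent("I need to book an appointment.")
```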
So what we will do is implement some conversation management. This is just part of the Strands SDK: we will add conversation management and async operators for streaming, so we maintain some of that conversation state. Then we will hit our endpoint on the back end, and you will see you get a response along the lines of "go to the nearest emergency room." But once again, there are no actual tools available to take any action.
So what we will do to fix that is add some tools. These are dummy implementations of tools: we have a booking availability tool, a booking tool, and a page on-call tool. What we really want to see is whether our agent is smart enough to select the right tool based on the input that is provided. You will see that with the addition of tools, there is some additional prompt engineering around that.
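The dummy tools might look roughly like the following sketch, using the Strands @tool decorator. The function names and return values are placeholders rather than the session's actual code.

```python
from strands import tool

# Dummy tool implementations along the lines of the demo; the bodies are
# placeholders rather than the session's actual code.
@tool
def check_booking_availability(date: str) -> str:
    """Return available appointment slots for a given date."""
    return f"Slots available on {date}: 09:00, 11:30, 15:00"

@tool
def book_appointment(date: str, time: str) -> str:
    """Book a non-urgent appointment."""
    return f"Appointment confirmed for {date} at {time}"

@tool
def page_on_call(symptoms: str) -> str:
    """Page the on-call physician for urgent symptoms."""
    return f"On-call physician paged with summary: {symptoms}"
```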
We'll then actually build out the agent, but before that, with our more advanced prompt, we'll register the system prompt. One of the common challenges we hear from customers is the ability to share and manage the prompts that are used as part of your applications. MLflow has prompt management capabilities built in, so once we've settled on a reasonable prompt that we're going to use for some testing cycles, we can configure that system prompt inside MLflow. This code is just registering that prompt into MLflow; you can see the prompt name, and if we look at it inside MLflow, this one's the test prompt. There's only one version now, but it will maintain multiple versions of that prompt. What's nice about that is that sometimes when we get in a room and build something, everyone just shares prompts through Slack or whatever the case is; this lets you version the prompt and share it more easily across your team.
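A minimal sketch of that registration step is shown below, assuming an MLflow version that ships the prompt registry (the call is mlflow.register_prompt in recent 2.x releases and lives under mlflow.genai in MLflow 3). The tracking server ARN, prompt name, and prompt text are placeholders.

```python
import mlflow

# Placeholder tracking server ARN; point this at your managed MLflow server.
mlflow.set_tracking_uri("arn:aws:sagemaker:us-east-1:123456789012:mlflow-tracking-server/demo")

SYSTEM_PROMPT = (
    "You are a medical triage assistant. Gather the patient's symptoms before "
    "booking an appointment, and page the on-call physician for urgent cases."
)

# Register the prompt so it is versioned and shareable across the team.
mlflow.register_prompt(
    name="medical-triage-system-prompt-test",
    template=SYSTEM_PROMPT,
)
```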
In the production case, it's pretty critical for tracing lineage. In this case, we're just saying it's a test prompt and doing some versioning. Now that we've created our prompt, we'll create the agent using our tools. We're going to take those dummy tools we created, booking availability, booking, and paging the on-call, and make them available to our agent, which is backed by that fine-tuned model on the back end. Then we'll invoke our agent, using tools this time.
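Putting the earlier sketches together, building and invoking the tool-equipped agent might look like this; it assumes the model, tools, and SYSTEM_PROMPT defined in the previous snippets.

```python
from strands import Agent

# Assumes `model`, the dummy tools, and SYSTEM_PROMPT from the earlier sketches.
agent = Agent(
    model=model,
    system_prompt=SYSTEM_PROMPT,
    tools=[check_booking_availability, book_appointment, page_on_call],
)

# The agent should now gather symptoms first, then decide whether to book a
# routine appointment or page the on-call physician.
agent("I need to book an appointment.")
```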
In this case, I say I need to book an appointment, and here's the agent response. It follows the direction in our prompt: we've told it to make sure it gathers the symptoms before actually letting a patient go ahead and book. So now you can see we're getting good responses. One of the things you can do in MLflow is trace agents to verify that those tools are being called, which is really important as you're building agents today. Being able to trace the thought process, when it's calling a tool, when it's not, and when it's only pretending to, is really important.
If we look at some of the agent tracing inside MLflow, you can see the conversation where the patient has been experiencing symptoms for days and the agent treats it as a medical issue, along with the output for that turn. In another case you can see booking availability being called, and here you can see the page on-call tool executing: there is a medical emergency, so the tool was actually called. You can also see where the agent is interacting with your model behind the scenes and where each tool is actually called on the back end. The tracing is very valuable for debugging, to know when a tool is being called, when it's not, and what the actual logic flow of your application is.
And again, as I mentioned, you can use AgentCore observability as well now that it has been released. MLflow is just nice when you need to merge your model experimentation metrics with your agent metrics in one place. That being said, we developed the agent on Strands first, which is pretty common for development, but now we want to deploy it so it's scalable, using AgentCore Runtime to host the agent. What we're going to do looks very similar to the last part, but with AgentCore we're going to use the runtime to host the agent and AgentCore Identity to manage the user interactions.
To do that, we just write out our agent code as a Python file. Again, we're using the SageMaker endpoint name as you saw before, so we're pointing to a SageMaker model on the back end, making our dummy tools available, and setting up Cognito user pools and authentication. Here is where we'll register the system prompt for production, which is something I definitely recommend for reliability and for knowing which prompt versions are attached to any given agent. So we'll register the prompt again, but this time as a production prompt. You'll see it's only one version, but it can maintain multiple versions; this is the agent prompt that is tied to that particular agent.
Now we're going to deploy our agent to the AgentCore runtime. In this case, I'm using the helper tool to launch the agent. Inside here, we're pointing to the entry point in our Python code, setting the agent name for the production medical triage agent, and using the authorization config. Here is where we actually launch it.
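A rough sketch of that launch step is below, assuming the bedrock-agentcore starter toolkit's Runtime helper. The import path, argument names, Cognito values, and the fields on the launch result are all assumptions for illustration, so verify them against the toolkit's documentation before relying on them.

```python
# Sketch of launching the agent with the AgentCore starter toolkit helper.
# All names, URLs, and argument shapes below are placeholders/assumptions.
from bedrock_agentcore_starter_toolkit import Runtime

runtime = Runtime()
runtime.configure(
    entrypoint="medical_triage_agent.py",        # our Python agent code
    agent_name="medical-triage-agent-prod",      # placeholder agent name
    authorizer_configuration={                   # Cognito-backed inbound auth (placeholder values)
        "customJWTAuthorizer": {
            "discoveryUrl": "https://cognito-idp.us-east-1.amazonaws.com/POOL_ID/.well-known/openid-configuration",
            "allowedClients": ["COGNITO_APP_CLIENT_ID"],
        }
    },
)
launch_result = runtime.launch()  # builds, pushes, and deploys the agent
# The returned object is assumed to expose the agent ARN/ID used for tagging below.
print(launch_result)
```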
Once it's launched, we can test out our endpoint. One thing I would recommend for traceability: in reality, this is usually deployed from a CD pipeline on the back end rather than out of a notebook when you're ready to go to production. I would recommend tagging the SageMaker endpoint with the agent ARN and the agent ID that come back, so that for any given endpoint you know which agents are using or depending on it; those dependencies are tough to maintain if you don't have that level of tracking. So here we're just tagging the endpoint with the agent ARN and the agent ID.
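Tagging could be done with a call like the following; the endpoint ARN and agent ARN/ID values are placeholders you would substitute with the ones returned from the launch step.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Placeholder ARNs; use the real endpoint ARN and the agent ARN/ID from the launch.
endpoint_arn = "arn:aws:sagemaker:us-east-1:123456789012:endpoint/medical-triage-qwen25-endpoint"

# Tag the endpoint with the agent that depends on it, so the dependency is
# discoverable later when you audit or update the endpoint.
sagemaker.add_tags(
    ResourceArn=endpoint_arn,
    Tags=[
        {"Key": "agent-arn", "Value": "arn:aws:bedrock-agentcore:us-east-1:123456789012:runtime/medical-triage-agent-prod"},
        {"Key": "agent-id", "Value": "medical-triage-agent-prod-id"},
    ],
)
```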
You can see the list of tags showing where we've tagged it. Then we're just going to use the agent: we'll do the same thing as before, invoking the agent and the endpoint and sending prompts in, but this time it's running on scalable AgentCore runtime compute. And because we set up MLflow for agent tracing, you can go back into MLflow and see the production-level traces as well.
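Invoking the hosted agent might look roughly like this. The bedrock-agentcore data-plane client and invoke_agent_runtime operation are used here as I understand them, but the payload shape and response handling are assumptions, and with Cognito JWT inbound auth you would attach a bearer token rather than rely on IAM credentials alone.

```python
import json
import uuid
import boto3

# Data-plane client for AgentCore Runtime; the region and ARN are placeholders.
client = boto3.client("bedrock-agentcore", region_name="us-east-1")

response = client.invoke_agent_runtime(
    agentRuntimeArn="arn:aws:bedrock-agentcore:us-east-1:123456789012:runtime/medical-triage-agent-prod",
    runtimeSessionId=str(uuid.uuid4()),
    payload=json.dumps({"prompt": "I have a stuffy nose, can I see someone today?"}),
)
# The response shape depends on your agent's contract; inspect it before parsing.
print(response)
```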
You can see when it's calling different tools and when it's not, and all of that is collected automatically. This is just going through some iterations, making sure that the tools are called when they should be and not when they shouldn't. For example, a stuffy nose probably shouldn't trigger an urgent call; in this case, the agent basically says an appointment isn't available today but can be booked for another time, because it isn't a medical emergency.
Maintaining Lineage in Production and Session Recap
That's basically an end-to-end example of using a fine-tuned model, adapted to a specific task within your domain for better performance, and integrating it with an agent workflow. One thing to keep in mind is lineage and traceability: what you saw in model customization has nice lineage that's automatically captured. Some additional things to keep in mind with your agent workflows: I showed this in a notebook today, which is great for experimentation, but in reality, like I said, you're probably going to include it in a continuous delivery and continuous deployment pipeline when you go to production deployment for agents.
There are a lot of different versions to consider. The nice part is that the left side of the picture is all taken care of by model customization now, with that full end-to-end lineage. On the right-hand side, something to keep in mind with your CD deployment pipelines is capturing all of those versions across agents too; we did a very simple agent, let alone multi-agent orchestration workflows. Capture all those versions and dependencies so that you know, as we did by tagging the endpoint as one example, that a given endpoint has a number of agents depending on it.
We went through a lot in a short amount of time, but thank you all so much. I'm going to turn it back over to Amit. Actually, one more thing: if you want the code examples, I could not push them before this session because I had another meeting, and there are newly launched features in them that I couldn't push earlier. My apologies; if you just grab a screenshot of that, I'll post the code by the end of today. All right, Amit.
Thank you, Shelbee. We'll do a quick recap and then we'll open the floor for any questions you may have. SageMaker now offers a fully managed and serverless experience for customizing your models, from training to evaluation to deployment. With SageMaker AI inference capabilities, you can host multiple models on the same endpoint, auto-scale easily, and use optimization techniques to make inference cost effective. You can get end-to-end observability for your model customization as well as your agents with managed MLflow, serverless MLflow, or partner AI apps.
You can build scalable and repeatable workflows with SageMaker AI Pipelines using the new steps that we launched today. Lastly, you can now track and audit not only the versions of your models but also the AI assets used in model customization. Here's the QR code for the storyboard; it's not the exact demo, but it walks you through the example step by step. Feel free to check it out. If you have any questions, there are two mics in the center of the room; feel free to bring up your questions.
This article is entirely auto-generated using Amazon Bedrock.