AWS re:Invent 2025 – AWS Trn3 UltraServers: Power next-generation enterprise AI performance (AIM3335)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 – AWS Trn3 UltraServers: Power next-generation enterprise AI performance (AIM3335)

In this video, AWS introduces Trainium3, its next-generation AI chip designed for agentic workloads and reasoning models. Joe Senerchia and Ron Diamant of AWS, joined by Jonathan Gray from Anthropic, detail how Trainium3 delivers 360 petaflops of microscaled FP8 compute in 144-chip UltraServers connected via a NeuronSwitch topology. Ron explains architectural innovations like hardware-accelerated microscaling quantization and optimized softmax instructions that maximize sustained performance. The team demonstrates 5x better tokens-per-megawatt efficiency versus Trainium2 and highlights Project Rainier’s deployment of 1 million Trainium2 chips serving Claude models in production. Jonathan Gray showcases real kernel optimizations achieving 90% tensor engine utilization on Trainium3, including FP8 matrix multiplications, attention operation tuning, and SRAM-to-SRAM collectives. The presentation emphasizes PyTorch native integration, the open-sourced NKI compiler, and Neuron Explorer profiling tools providing nanosecond-level observability for performance engineering.

; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: AWS AI Infrastructure and the Transformative Power of AI

Welcome everyone. My name is Joe Senerchia. I’m the EC2 product manager for our Inferentia and Trainium chips, and I’m super excited to have everyone here. Just a quick show of hands, how many are familiar with Inferentia and Trainium? Okay, what about Anthropic Claude models? Okay, a few more. Well, today I’m super excited because we have two experts on both of those things. We have the chief architect of Trainium, Ron Diamant, and we have Jonathan Gray, who’s the Trainium inference lead for Anthropic, thinking about optimizing Claude models on Trainium.

Thumbnail 40

So quickly, what we have in store today. I’ll first walk through how AWS builds and thinks about building AWS AI infrastructure. Then I’ll have Ron walk through Trainium and how he built it for performance, scale, and ease of use. And then Jonathan Gray will come up and look at how he actually optimizes different kernels to run on Trainium effectively. Okay, great, so let’s get started.

Thumbnail 70

So first, why is there so much news around AI? Why is there so much excitement? And I think it comes down to this: AI is a tectonic shift in how we build, deploy, and interact with the world. So this isn’t just incremental change. We’re seeing new capabilities pop up because AI has enabled them. And Andy Jassy captured this recently in a quote: we are at the beginning of the biggest technological transformation of our lifetime.

Thumbnail 100

Thumbnail 110

Thumbnail 120

AI’s Impact on Scientific Domains and the Rise of Autonomous Software Development

And so one of the areas I want to step back and look at, where AI is really reshaping things, is scientific domains. We see this in areas like protein biology, where models can now predict and design new proteins in minutes, something that traditionally took hours, or in mathematics, where models like AlphaGeometry are competing at an Olympiad level and solving formal proofs. Another area that has become a scientific domain in its own right is software engineering, where AI is now a breakthrough force of its own: it can develop and deploy its own code, resolve bugs within that code, and even reason across large code bases.

And this is just the beginning. But together, these innovations are no longer just supporting scientific discovery. They’re actually becoming the engine driving that scientific discovery. And so let’s take a look at how this is happening in practice with, as I mentioned, software engineering.

Thumbnail 150

Thumbnail 160

So over time we saw traditional programming, and over the past few years we’ve seen things like code completions, chat-based programming, and even collaboration through vibe coding start to take off. In the same time frame, we’ve also seen benchmarks exceeded on things like SWE-Bench Lite, where models are completing up to 80% of real GitHub issues, or even on harder benchmarks like SWE-Bench Verified, where some models can complete up to 50% with full correctness.

Thumbnail 180

Thumbnail 200

And so what that enables is the next phase of AI in software engineering: deploying agents, agent fleets, or agent clusters so that they can autonomously operate and solve software problems. We don’t know the exact shape this future will take, but we can envision a world where software developers collaborate closely with an entire fleet of agents. And that opens up a new level of speed and scale for delivering software features and functionality.

Thumbnail 220

AWS’s Comprehensive AI Stack: A Decade of Silicon Innovation

So take a step back, why does this all matter? And the important part is really that none of this happens without the infrastructure underneath that’s powering all of this AI. And at AWS, we’ve spent more than a decade building the most comprehensive, deeply integrated AI stack, starting at the top with compute, where we offer a broad portfolio of accelerated GPU instances or our latest Inferentia and Trainium, which offer cost efficiency for AI workloads.

Thumbnail 240

Thumbnail 250

Thumbnail 260

At the network layer, we’re deploying UltraClusters that are capable of scaling up to tens or even hundreds of thousands of chips, all connected with low latency, low jitter Elastic Fabric Adapter. Then you have storage, where we’ve increased high throughput storage options so that you can keep those GPUs fed with FSx for Lustre or even S3 Express One Zone. You now have access to your data at ten times faster speeds than before.

Thumbnail 270

And importantly, security, which is important for all of our infrastructure here at AWS, but also important for AI, where we have the Nitro system that allows isolation of your workloads to protect customer data. And at the very bottom, we offer management services and observability tools like CloudWatch, where you can monitor and watch your nodes to make sure that they’re healthy and operating efficiently.

Thumbnail 290

Thumbnail 300

And all of this comes together as a full stack platform for training and inference frontier models. And the reason that we can do this is behind the scenes, we’ve been developing silicon for over a decade, right? Whether it’s in our Nitro system at the top here, we have over six generations available now for offloading that virtualization

Thumbnail 310

to dedicated hardware for higher performance and stronger isolation of your workloads. We built Graviton, which is now supporting a multitude of workloads and tens of thousands of customers. And we also anticipated the growth of AI: we started building our Inferentia and Trainium chips early, with the release of Inferentia in 2019, and we continue to innovate across this full stack, which is really important for driving the next phase of AI infrastructure.

Thumbnail 350

Thumbnail 360

Thumbnail 380

Trainium2: End-to-End Innovation from Chip to Cluster

Starting with last year when we announced Trainium2, we talked a lot about the chip and the specs on the chip, but we also showed that it wasn’t just about the chip. It’s also about the innovation that we’re bringing at the server level and the network level. So at the chip level, you have innovations that are pushing compute and flops like 1,300 FP8 dense teraflops. Or you have at the server level where we released our first Trainium2 UltraServer capable of scaling up to 64 chips across a NeuronLink, which has one terabyte per second connectivity. Or at the network level where we deployed tens of thousands of these chips all connected with our Elastic Fabric Adapter.

Thumbnail 390

Thumbnail 410

Thumbnail 420

And really, when you look at all that engineering and end-to-end design coming together, it enables us to do things we had never done before. Some of that is, as you see here, shrinking the time from when we receive chips from our manufacturer to when we can put them in customers’ hands. Over the course of Trainium2’s life, we shrank that by 70%. The result is that it allowed us to ramp Trainium2 four times faster than any prior AWS AI instance and to a footprint that is 33 times larger in capacity than any other instance, and all of that capacity is fully subscribed.

Thumbnail 440

Emerging Trends and System Requirements for Next-Generation AI Workloads

Really, that is the end-to-end innovation required to build something like Project Rainier, which is the world’s largest publicly announced compute cluster. But as we build more scale, we want to keep our eye on what scales next. So we continually look at trends with customers and across the industry to see where they’re going. Here are some of the trends we saw over 2025. First, more emphasis on post-training: reinforcement learning is becoming more important as model developers look to put their models in real environments and get feedback, whether those environments are virtually generated or actual physical environments like robotics.

Thumbnail 470

Thumbnail 480

Thumbnail 500

Then you have reasoning models, which take a little bit more time, trading some latency to reason over multiple steps so they can generate a more accurate response to a deep question. And last, coming back to what we talked about with software, there are agentic workloads, where we see multiple agents collaborating autonomously and making tool calls to drive independent solutions for a wide variety of problems. So digging a bit deeper, what is the impact? How does this shape what we’re building next for AWS AI infrastructure?

And I think it really comes down to a few new system requirements. And I say system, not just chip, because this is about the bigger picture. As reasoning models work over longer contexts, we see context lengths reaching over a million tokens, and the systems need to support that capability. We need support for mixture-of-experts models, which are communication heavy, with sparsely activated experts communicating across the scale-up domain. We need infrastructure that can be used for pre-training, post-training, and inference, so customers can optimize the compute available to them as they scale each of these independently. And the last one is support for high batch size, high throughput systems that can serve lots of concurrent agents operating autonomously.

Thumbnail 560

Thumbnail 580

And so the key theme here is that the next wave of AI infrastructure isn’t just about compute flops. It’s about having balanced compute, which means more memory, more memory bandwidth, and a larger scale-up domain so that you can support a wide range of expert-parallel designs as those models scale. And that’s really why we’re happy to introduce Trainium3, the chip built for the next-gen agentic workloads, reasoning workloads, and video generation workloads that are going to drive the compute demand of these next AI systems.

Thumbnail 600

Introducing Trainium3: Built for Agentic and Reasoning Workloads

And as I mentioned before, it’s not just about the chip, it’s also about the system. And so here, if you caught Matt’s keynote, we recently announced our Trainium3 UltraServers, which scale up to 144 chips.

I won’t walk through all these stats, so I’ll leave that to Ron, who’s our chief architect here, but the key thing to remember is that there is innovation at each one of these that drives the capabilities of our next AI systems. So with that, I’ll pass it off to Ron to walk through Trainium3.

Thumbnail 640

Thumbnail 650

Building Trainium3 for Performance: Microscaling, Accelerated Instructions, and Sustained Efficiency

Alright folks, so Joe, thanks a lot, and folks, thanks for being with us today. For the next part of the talk, I’d like to go a little deeper into how we built Trainium3 and specifically how we built it to be performant, ready for scale, and easy to use. Let’s start with performance. As Joe kind of hinted, performance is actually not a single metric. It’s actually a combination of metrics. Of course there’s compute floating point operations per second, but you also care about memory bandwidth and memory capacity, and the interconnect that connects between these chips, and all of these need to be balanced in order to achieve maximum performance.

Thumbnail 700

We actually touched on that in detail in last year’s talk, and you have a QR link at the top right. By the way, throughout this talk every time there’s an opportunity for offline self-learning, there will be a QR link at the top right. Trainium3 UltraServers made significant leaps across each one of these performance dimensions. We got 360 petaflops of microscaled FP8 compute. I’ll explain exactly what that means in a second. That’s 4.4 times more than what we had with the Trainium2 UltraServers. We have 20 terabytes of HBM capacity, 3.4 times more, and 700 terabytes per second of HBM memory bandwidth, 3.9 times more than the Trainium2 UltraServers.

Thumbnail 730

We also have a 2 times faster interconnect, and I’d like to draw your attention to these switches in the middle of the rack. These are new components we call NeuronSwitches, and they connect the Trainium3 compute sleds in a full mesh topology. Each sled is connected to every other sled within a single hop. These are the sort of system optimizations that don’t come through in the top-level specs, but they absolutely impact real-life workload performance. That’s because they give us more flexibility to deploy different topologies, they cut down the latency between each pair of Trainium3 devices, and they give us really high performance for all-to-all communications.

Thumbnail 770

The reason we care about performance in all-to-all communications, or at least one of the reasons we care, is what we call mixture of expert models, MOE for short. In such models, MOE models, we tend to place different experts on different chips and then route a token to the relevant expert in real time in order to do the compute just in that specific chip, and that requires blazing fast all-to-all communication, which is exactly what the NeuronSwitches provide.

Thumbnail 800

That brings me to my next point, which is peak performance versus sustained performance. The numbers I quoted a slide ago were spec, or peak, performance numbers, but in real life that’s where the performance story only begins; it’s not where it ends. One of the nice analogies I could think of for that is: who would you bet on to win a marathon, a sprinter or a marathoner? Obviously a marathoner, right? But if you think about it, the sprinter has a higher spec speed, or peak speed. They just can’t sustain it over the entire marathon. So there are at least some situations where we actually care more about sustained performance than about short peak performance. And in AI chips it’s actually the same: we care a lot about achievable, sustained performance, more than about some spec number.

Thumbnail 860

Thumbnail 870

Thumbnail 880

So when we started developing the Trainium3 chip, our software team posed a challenge to us: what would it take to build a chip where the sustained performance is as close as possible to the peak performance, where you get every single floating point operation that you paid for? That led us to a list of micro-architectural improvements aimed at giving you every last percentage of performance, and I’d like to walk you through a couple of those just to give you a sense of what this looks like and how we’re really optimizing these workloads end to end.

Thumbnail 900

Let’s start with microscaling. The motivation for low precision training and inference is very clear, right? It’s pure physics.

If you use a smaller data type or a lower precision data type, you can run the compute on smaller circuits, and you can move smaller data around the chip, which leads to higher performance and better energy efficiency. But like many good things in life, it comes with a little bit of fine print.

Thumbnail 930

For example, if you just naively cast from a high precision data type, for example, BFloat16, into a lower precision data type, for example FP8, then it turns out that you completely destroy your model. The reason for that is that BFloat16 has a much higher dynamic range, a range of numbers that it can represent compared to FP8, and that means that large numbers overflow to infinity and small numbers tend to be squashed to zero.

Thumbnail 960

Thumbnail 990

We can fix that by a technique called quantization, and here I’m showing you a quantization technique called ABSmax where we calculate the maximum absolute value in a tensor, and then we scale the entire tensor such that it exactly captures the entire dynamic range of FP8. This actually works quite well until we reach an interesting case of distribution outliers in the tensor. Imagine a case where one of the elements in the tensor is 100 times larger than the other values. That’s the green element right there.

Thumbnail 1020

So we would scale the tensor such that the green element will map to the maximum representable value in FP8, but then all other elements will be squashed to zero or near zero, so we completely lost the representation capability after casting or quantizing to FP8. We can solve that as well via a technique called microscaling. With microscaling, we do ABSmax quantization one more time, but this time we do it in small groups of elements.

Here we have one group with the green and yellow elements, and another group with the orange, pink and blue elements. You can see that the green is an outlier. The green element is much larger than any other element. What that causes is that with the first microscaling group, after we quantize, green goes to maximum representable and yellow gets squashed to zero. But in the second microscaling group, we quantize from scratch with a new distribution, and you see that the blue, pink, and orange elements are quantized quite well without any impact from the green element, from the outlier.
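
To make the effect concrete, here is a small NumPy sketch of the idea (the values, the two-element grouping from the slide, and the integer-grid rounding are all illustrative stand-ins; real FP8 rounding behaves differently, and on Trainium3 this quantization runs in dedicated hardware rather than in user code):

```python
import numpy as np

FP8_MAX = 448.0  # largest magnitude representable in FP8 E4M3

def absmax_quantize(x):
    # Scale so the largest |value| maps to FP8_MAX, snap to an integer grid as a
    # crude stand-in for low-precision rounding, then dequantize so the error is
    # easy to read off.
    scale = np.abs(x).max() / FP8_MAX
    return np.round(x / scale) * scale

x = np.array([500.0, 0.4, -0.5, 0.3, 0.2])  # first element is a large outlier

per_tensor = absmax_quantize(x)  # one scale for the whole tensor: only the outlier survives

# Microscaling: quantize small groups independently, so only the outlier's own
# group loses resolution and the second group is preserved almost exactly.
micro = np.concatenate([absmax_quantize(g) for g in (x[:2], x[2:])])

print("original  ", x)
print("per-tensor", per_tensor)
print("microscale", micro)
```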

Thumbnail 1090

Thumbnail 1100

That’s exactly what microscaling does, and it has been shown to be very effective at preserving model accuracy in low precision training and inference. But microscaling is hard to do, because you need to take a tensor, break it into groups, calculate the scale in each group, apply the scale, and then do all of this in reverse order when you dequantize. So what we did in Trainium3 is build hardware circuits to completely offload microscaling quantization and dequantization.

Thumbnail 1130

You can basically get all the accuracy benefits of microscaled quantization without any overhead on your compute engines, and even though it doesn’t appear in the peak numbers I showed you, that absolutely improves your end-to-end workload performance. Let’s do another example. In this case, it’s accelerated softmax instructions. To give you some background, at the core of most modern AI models there is an operator called self-attention.

It was one of the breakthroughs in the transformer architecture that makes models like Claude and others work as well as they do today. At the core of the self-attention computation, we multiply two matrices, Q and K here, and then compute softmax on this result, and finally multiply that by another matrix, V. So if I show you a timeline, we do a matrix multiplication followed by a softmax operation, followed by another matrix multiplication.
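
For reference, the computation being described is only a few lines of PyTorch in its plain, unfused form (toy shapes, no tiling, no causal mask; the fused production kernels pipeline these steps over tiles rather than materializing the full score matrix):

```python
import torch

# Toy sizes for illustration: 128 query/key positions, head dimension 64.
q = torch.randn(128, 64)
k = torch.randn(128, 64)
v = torch.randn(128, 64)

scores = (q @ k.T) / (64 ** 0.5)       # first matmul: Q x K^T, with the usual scaling
probs = torch.softmax(scores, dim=-1)  # softmax, typically kept at higher precision
out = probs @ v                        # second matmul: attention weights x V

print(out.shape)  # torch.Size([128, 64])
```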

Thumbnail 1170

Thumbnail 1190

If I pipeline that over multiple tiles of computation, you can see that we can get a very clean pipeline where the tensor engine, the engine that is doing matrix multiplication, the most precious resource in the system, is constantly busy, 100% utilized. We love that. Now let’s apply the previous optimization that I told you about, microscaled FP8. So all the matrix multiplications now run way faster, but the overall self-attention computation didn’t accelerate by nearly as much, and that’s because softmax doesn’t leverage FP8. It actually runs at a higher precision. We need to do that in order to keep the accuracy of the model.

Thumbnail 1220

That’s a well-known secret of the trade in the ML space. So if we weren’t paying attention to that, we would have had a couple of problems here, right? First of all, despite the nice optimization that I showed you before, the end-to-end speedup is not as much as we wanted it to be, and the tensor engine, the most precious resource in the system, is now underutilized.

Thumbnail 1260

Luckily, our team saw it a mile away, and as we worked on the microscaled FP8 optimization, we also introduced another list of optimizations to make sure that we always keep the tensor engine running. In this case, it was an accelerated softmax instruction that is able to run softmax four times faster at the same precision, with zero loss of accuracy. So that’s what it looks like with the accelerated softmax instructions. Now we get the end-to-end speedup that we wanted, and the tensor engine is constantly running at 100% utilization again. Achieved performance is as close as possible to peak performance.

Thumbnail 1280

All right, now we have a huge list of these optimizations. We actually document them and you can do self-learning online, again, a link at the top right, and all of these optimizations build on top of one another in order to make sure that you get to use every single floating point operation per second that the Trainium 3 device offers.

Thumbnail 1300

Thumbnail 1310

Let’s put it all together. Here we benchmark a model called GPT-OSS with 120 billion parameters. This is an open-weight model released by OpenAI. On the x-axis, we measure what we call interactivity. That’s the per-user experience, how quickly we can generate output tokens. And on the y-axis, we measure overall throughput: if the server is serving multiple requesters at a time, what’s the overall number of tokens that it can generate per second?

And to make it a really fair, apples-to-apples comparison, we normalize the y-axis to tokens per megawatt. So now we’re comparing Trainium 2 and Trainium 3 on even ground: which one is more efficient? And I think the results are beyond impressive. We can generate five times more tokens per megawatt with Trainium 3 compared to Trainium 2, and at the same time we’re also improving interactivity. We’re really proud of these results, and we think they will generate real value for you guys.

Thumbnail 1360

Thumbnail 1370

Thumbnail 1390

Designing Trainium3 for Scale: Modular Architecture and Rapid Deployment

Let’s move to scale. Last year I showed you this graph, and it demonstrates that the adoption curves in the ML space are very different from the adoption curves we’re used to with other technologies. This is a typical adoption curve: we have the early adopter phase, then we start ramping up, and eventually we get to mass volume. With ML, when we introduce a new technology, just like we’re doing today with Trainium 3, we immediately get customer demand to build giant clusters with this new generation.

Thumbnail 1400

And this required us to build Trainium 3 to be ready for scale from the very first day. If you think about it, that’s exactly where Annapurna and AWS meet and complement each other. We at Annapurna built Trainium 3 for scale from day one, and I’ll show you exactly why and how. Then we marry that with AWS, which has decades-long expertise in deploying massive compute clusters faster than anyone in the world, and together we build projects like Project Rainier and what we’re going to build with Trainium 3.

So what you see here is a Trainium 3 compute sled. It’s a very modular design, and that’s not just an elegant design choice, that’s important. It means that we can test every component independently and then plug it into the system. And every single component is top-accessible and replaceable, and this is critical because it allows us to automate the production line and make the assembly completely robotic, and that means that we can scale much, much faster versus manual or complicated assembly. It also means that when we need to service these cards in production, we can do it very quickly and efficiently and keep your infrastructure running, which is what we all want to do.

Thumbnail 1480

Thumbnail 1490

Thumbnail 1500

Let’s break this down. At the back of this compute sled, you see four Trainium 3 devices. Then in the front, you can see two Nitro devices for scale-out networking with the EFA. And there in the middle, you see the Graviton CPU that is responsible for input, output, and management as a whole. Now all these chips were built in-house with deep expertise. We know how to optimize them, we know how to debug them, we know how to service them.

And they were built with deep co-optimization between them, which again is critical for giving you maximum performance. Achievable performance needs to be as close as possible to peak performance, and we need to optimize across the entire stack to do it.

Thumbnail 1520

Thumbnail 1530

Joe showed you these graphs. With Trainium 2, we deployed 4x faster and 3x larger capacity than any other AI chip in AWS. For example, he mentioned Project Rainier. So let’s talk about Project Rainier. Last year, actually, when we were on this stage, we announced Project Rainier. We said that we were going to build a giant AI cluster for Anthropic, and now we’re 12 months later, and we have 1 million chips running, training and serving state-of-the-art Claude models in production. I’m not talking about some future announcement. This is running today, and this happened in 12 months.

Thumbnail 1570

Thumbnail 1580

Thumbnail 1590

Ease of Use Across Customer Personas: From ML Developers to Performance Engineers

With what I just showed you with Trainium 3, we expect to scale faster than we scaled with Trainium 2, actually much faster and to much larger quantities. Lastly, let’s talk ease of use. We’re building a very sophisticated infrastructure here, and we need to make sure that our customers can easily use it and get the maximum value from it. And we knew that if we want to optimize for ease of use, we needed to deeply know our customers, so we talked to them a lot, and what emerged is that we actually have three customer personas and they have different needs.

At the top here we have the ML developers that are building AI applications based on existing models, and what they value the most is very strong and robust third-party library integrations and ready-to-use pre-optimized models. Then we have researchers, and researchers are inventing new models and new operators. They want to iterate quite quickly, and they care about a robust, frictionless experience much more than they care about performance, actually. They care about developer cycles. The experimentation needs to happen very quickly.

And finally we have our performance engineers. These are folks like Jonathan Gray who live and breathe hardware optimizations. You’ll hear from Jonathan Gray in a second. He’s one of the best in the field, by the way, so I think he’ll explain very nicely how he’s optimizing for Trainium 2 today, with Trainium 3 coming. What they value the most is tools that give them full control over the hardware. We’ll talk about that as well.

Thumbnail 1670

So let’s go one by one, starting with ML developers. We’ve deeply integrated Neuron with third-party libraries like PyTorch Lightning, vLLM, and Hugging Face, so you can take models from these libraries and seamlessly, frictionlessly run them on Trainium. We’re also engaging the community via university courses and hackathons. You can see one example there, where folks take a Hugging Face model and fine-tune it to do a certain task. There’s a QR link for a hackathon we’re running on fine-tuning models to play chess, and eventually we serve those models on Trainium as well. The feedback we’re getting so far is overwhelmingly positive.

Thumbnail 1710

Thumbnail 1730

For researchers, we have deep integration with PyTorch and JAX, and what I’m really excited to share is that Trainium is also becoming PyTorch native. So let’s talk about that. With recent advances from the PyTorch team, who introduced a mechanism called PrivateUse1 that allows you to integrate a custom AI backend into PyTorch, we made Trainium natively supported by PyTorch. That means code you write in PyTorch that can run on a CPU or a GPU can seamlessly run on Trainium, the same exact code, and I’ll show you that in a second.

Thumbnail 1790

That means that you get the eager execution experience that you know and love from PyTorch on Trainium devices, and it also means that you get the automatic code optimization that PyTorch introduced via torch.compile, also running seamlessly on Trainium. A nice side effect is that all the tools and libraries that you know and love that run on top of PyTorch come along for the ride. So if you’re using FSDP or DTensor to distribute your workload, that will run seamlessly on Trainium as well. And if you’re using libraries like TorchTitan to do large-scale training, that will run seamlessly on Trainium too.

Here’s what it looks like in code. On the left we have PyTorch code that runs on GPU, and on the right we have the corresponding PyTorch code that runs on Trainium. It should be hard to spot the differences, because there aren’t many. It’s literally one word.

Thumbnail 1820

Instead of saying “to CUDA,” you write “to Neuron,” and we take care of the rest. It just works. Again, we wanted to give a lot of credit to the PyTorch team here. The way that they extended the PyTorch framework allowed us to do what we’re showing you here. We’re already piloting this capability with a select set of customers. We’re getting very good feedback, and we plan to make it generally available in Q1.
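
Concretely, the one-word change looks roughly like this (a sketch of the capability as described on stage; the backend is still in pilot, so treat the exact device string and what is supported as illustrative):

```python
import torch

model = torch.nn.Linear(4096, 4096)
x = torch.randn(8, 4096)

# On a GPU you would write:
#   model, x = model.to("cuda"), x.to("cuda")
# With the PyTorch-native Trainium backend, per the talk, the device name is the only change:
model, x = model.to("neuron"), x.to("neuron")

y = model(x)                     # eager execution, otherwise unchanged PyTorch code
compiled = torch.compile(model)  # torch.compile is also stated to work on this backend
y = compiled(x)
print(y.shape)
```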

Thumbnail 1840

Last but not least, let’s talk about performance engineers. For this customer persona, we introduced two new capabilities. The first is the Neuron Kernel Interface, which we call NKI for short. That’s a low-level programming interface for directly programming the Trainium devices. This existed last year, and we’ve evolved it quite a bit; I’ll tell you more about it in a second. The second is the Neuron Explorer. That’s a toolkit for doing performance optimization on top of the Trainium devices, built on top of the Neuron Profiler, and it gives you really deep insight and observability into your workload running on Trainium. With both of these together, you get full control over optimizing your workload on Trainium.

Thumbnail 1900

Let’s go through them one by one. NKI is a Python embedded DSL, but it has something quite unique: it combines two levels of abstraction in a single programming environment. You can implement your code in a tile-based language, just doing computation between submatrices, which is very easy to ramp up on, especially if you’re coming from NumPy or Triton. But then if you identify an area where you really want to optimize, you can go all the way down to the assembly level with semantics very similar to the tile-based ones. That combination gives you the ability to ramp up very quickly and to optimize very deeply.
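
For a feel of the tile-based level, here is a minimal element-wise kernel written in the style of the published NKI examples (a sketch only; module paths, buffer names, and signatures follow the public NKI samples and may differ slightly in the current release):

```python
from neuronxcc import nki
import neuronxcc.nki.language as nl

@nki.jit
def add_kernel(a, b):
    # Output tensor lives in HBM; the tiles below are staged through on-chip SBUF.
    out = nl.ndarray(a.shape, dtype=a.dtype, buffer=nl.shared_hbm)
    ix = nl.arange(128)[:, None]  # partition dimension of the tile
    iy = nl.arange(512)[None, :]  # free dimension of the tile
    a_tile = nl.load(a[ix, iy])   # HBM -> SBUF
    b_tile = nl.load(b[ix, iy])
    nl.store(out[ix, iy], value=a_tile + b_tile)  # compute on-chip, write back to HBM
    return out
```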

Thumbnail 1940

This year we’re introducing a couple of new capabilities in NKI, including a scheduling and allocation API that gives you fine-grained control over the scheduling of the different instructions running on the machine, as well as where the different tensors are allocated. That allows you to build the very structured pipeline that I showed you before in the self-attention example. This was actually a feature request from some customers. We listened. It’s already available, and you can start using it.

In addition, we also introduced a new front end with much improved error messaging that lets you self-serve and iterate much more quickly on the NKI programming experience, shortening the time to an optimized kernel on Trainium. And last but not least, and I’m actually pretty excited about this one, we decided to open source the NKI compiler. It’s coming in the next couple of months, and the reason we decided to do it is that NKI is all about giving you control and observability. So now we give you full transparency into how your code compiles to the Trainium hardware, and we also welcome industry and community contribution across the entire Trainium stack.

Thumbnail 2020

Here’s one nice example. So this is a company called Descartes. They have a cool application where they do real-time generative AI video generation. They can ingest the video, edit it, and generate the video back to you. You can see examples here. They decided to build their entire model based on NKI, and they achieved phenomenal utilization numbers, actually beyond what I expected the team could do in three to four months.

Thumbnail 2050

Thumbnail 2060

Next, let’s talk about the Neuron Explorer. If you ever wrote highly optimized code, you know that your best friend is a strong profiler or tracer that tells you what’s running on the hardware and where the bottlenecks are. With Trainium, we have the industry’s leading Neuron Profiler that allows you to get instruction-level trace of what’s running on the hardware with zero performance hit on the actual workload running. This is nanosecond-level observability without slowing down your workload.

So we extended the Neuron Profiler a lot and built a suite of tools that we call the Neuron Explorer on top of it. First of all, it’s four times more interactive, and that means that you can just debug much faster and get a better overall debugging experience.

But on top of that, we made it available via web applications for easy sharing between developers, and we also deeply integrated it with IDEs like VS Code. That’s actually quite important. What you see on the screen here is that I highlighted one of the lines in my Neuron Kernel code, and the Neuron Explorer automatically highlighted the relevant instructions in the profiler. This gives you a much tighter connection between the code that you’re writing and what’s actually running on the hardware, and gives you a sense of what’s worth optimizing.

We’re also introducing system-level profiling. This is not ready yet; it will come in a month. It lets you see a full run across multiple devices, so you can tell whether they’re tightly synchronized or whether there’s one slow machine. It really helps when you’re debugging highly distributed code, like a big training run.

Thumbnail 2160

We did a couple of more things. We introduced hierarchical view, so when the Neuron Explorer is brought up, it shows you framework level operators, stuff like self-attention or fully connected layer, and then you can click and drill down all the way to the instructions. That makes your debug experience much more incremental. You can start at high level and try to understand where the bottlenecks may lie, and then when you really want to zoom into something, you can just drill down through it. It makes the debug experience much nicer in my view.

Thumbnail 2190

We’ll also give you a summary page that shows you how the different engines are utilized. Here you can see the tensor engine at the left. The tensor engine is utilized very well here, 60 or 70% MFU, and the other engines are kind of lightly utilized, so that shows you how the workload is running. At the top right you can see how we’re utilizing our memory bandwidth, what portion is used for reads, what portion is used for writes, and what portion of the time the memory is actually sitting there idle. When you look at this top level view, you can really get a sense of how well your workload is running on the hardware.

Thumbnail 2230

We also give you stats and visualizations. The one on the bottom right is the one that I particularly like. Here we’re showing collective communication throughout the execution of the inference run in this case, and we’re showing you a scatter plot of them. What you see here is good. Most of the communications are happening almost exactly for the same duration, which means the performance is very consistent and predictable. But if you ever see a large spread here, you kind of get a sense that there’s an outlier and you need to go debug and try to understand what happened.

Thumbnail 2260

And lastly, this is a cool one. We introduced something that we called Performance Insights. On the summary page you’ll see a bunch of boxes that show you where we think the performance bottlenecks are and what you can actually do to solve them. We do it via a combination of AI-based techniques and just human-based techniques. If we debug something a couple of times, we’ll introduce a rule here and try to give you a hint that this might help improve performance.

Thumbnail 2290

All right, we showed this to the folks at Anthropic. There’s a brilliant performance engineer there named Tristan, and when he saw that, he said this is the dream of every performance engineer. That’s one of the quotes that I love the most in recent years, especially from someone like Tristan, by the way.

Thumbnail 2320

All right, wrapping up, we provide this ease of use across different customer personas, and most of what I showed you today is going to get open sourced. I talked about the Neuron Kernel compiler. In addition, the Torch native training backend is going to be open sourced, and we’re also open sourcing a Neuron Kernel library, which is a suite of pre-optimized kernels that we built for our use cases that we want to make available to the world.

Thumbnail 2340

Just before I pass it to Jay Gray: as you can imagine, we’re already deep into implementing the next generation of Trainium. It’s a little early to share detailed specs, and we’re only accelerating over time, but what we’re shooting for is to exceed a 6X performance uplift in FP8, a 4X memory bandwidth uplift, and a 2X memory capacity uplift. The energy efficiency uplift is going to be tremendous, but I’m not ready to share that just yet. Jay Gray, why don’t we talk about how we’re actually using these chips? Thank you guys.

Thumbnail 2390

Anthropic’s Claude on Trainium2: Scaling Inference at Unprecedented Rates

Awesome, thanks Ron, thanks Joe. It’s a pleasure to share the stage with you guys. I’m super stoked to be here. Hi everyone, my name is Jay Gray and I’m the Trainium Inference Lead at Anthropic. Anthropic is the fastest growing business at its scale in history. Our Claude 4 and Claude 4.5 models are the most trusted AI by enterprises all over the world, and especially with our release just last week of Claude Opus 4.5, Claude is the best coding model in the world and the best model for agentic workflows.

Thumbnail 2440

And the key to all of this is that across all of our product surfaces, across our first-party API and AWS Bedrock, every usage of Claude Code, our web apps, and our mobile apps, the majority of our traffic today is served on Trainium 2. So what enables us to scale model inference like this? My team’s job is to provide the core inference technology on Trainium that enables us to scale at such an unprecedented rate. And today we’re going to take a deep dive into the kind of performance engineering work we do that enables that scale.

Thumbnail 2460

So, what is it that we actually do? Our work is fundamentally about running our models as fast as possible while serving an exponentially growing set of customers as efficiently as we can. Simple job. Every time we shave 10% off the pre-fill time of our models, it opens up new product use cases. More ergonomic uses of longer context so you can put your entire code base in the context, faster response times to enable more ergonomic interactive use cases, and every time we increase the token generation speed, it enables Claude to think a little longer, your code to get written a little faster, or perhaps it enables us in the back end to increase the sampling batch size and silently serve your traffic a little more efficiently.

Thumbnail 2500

Thumbnail 2510

Deep Dive into Kernel Optimization: Flash Attention Performance Engineering on Trainium

At Anthropic, every operation and every kernel of our model inference is designed to get the best performance out of Trainium chips. So today I thought it would be fun to take you on a deep dive into the kind of performance work we do on a day-to-day basis. To start, let’s have an overview of Anthropic’s custom model architectures and custom kernels. Okay, this is a bit of a joke. We still do have some trade secrets, and I’m not going to literally run you through our model architectures, but I am going to take you through some real optimizations that we’ve done on a realistic large-scale LLM inference kernel.

Thumbnail 2520

Thumbnail 2530

Thumbnail 2540

So this is going to be our playground for the next 5 or 10 minutes. This is a real fused flash attention kernel in three parts. It starts with a large-scale matrix multiplication that generates the queries, keys, and values that are the inputs to the self-attention operation that Ron described earlier. There’s the actual self-attention operation, and then it ends with another big matrix multiplication that projects the outputs of attention back into the residual stream space.

Thumbnail 2550

Before I really get into it, I’m just going to give a very quick overview of the Neuron architecture. If you’re already programming in Trainium, this is a review for you. If you’re more familiar with programming other architectures, then this is hopefully just a quick and interesting overview of the NeuronCore architecture. So at the core of every Trainium chip is a set of NeuronCores, and in each core are a number of different engines which specialize in different linear algebra operations.

So at the heart of this is the Tensor Engine which does small tiles of matrix multiplication. And if you take just one takeaway from an ML performance or a kernel optimization talk, it should be this: the goal of a kernel and the goal of a kernel engineer is to make sure the Tensor Engine is always doing matrix multiplications. Everything else is essentially auxiliary data movement and extra operations to ensure that when the Tensor Engine is done with one matrix multiplication, the data needed for the next one is ready to go in, and we densely pack our matrix multiplications.

The Vector Engine is an engine which specializes in doing reductions and processing over streams of data like a summation on a vector, and the Scalar Engine specializes in doing element-wise operations like activation functions or the exponent part of a softmax. The last engine here is a fun innovation on the Trainium architecture called the GPSIMD or the General Purpose SIMD Engine, which basically lets us write arbitrary C code to operate on our data and basically fit in whatever weird operation into your custom architecture that doesn’t fit into the other engines.

All of these engines read and write to a set of fast SRAM memory banks near the engines called SBUF and PSUM, and I won’t get into the difference between the two of them here. And there are a set of DMA engines which shuttle data back and forth between the fast SRAM memory close to the engines and the larger HBM on chip.

Thumbnail 2670

Thumbnail 2680

So back to our Flash Attention kernel, what you’re seeing here is the actual profiler view of a real kernel, and every row here corresponds to one of the engines that I just described, and every line is an actual operation happening on one of those engines. And what you can see here without even diving into the numbers is that we’re doing pretty well here. Visually you can just tell the Tensor Engine is densely packed with these blue matrix multiplication operations. This is looking pretty good, but how did we get there?

Thumbnail 2700

So for the first optimization, we’re going to dive into the first of the big matrix multiplications, which is the QKV projection. Let’s start by looking at a single operation happening on the Tensor Engine, and I’m going to pause here for a moment and really belabor this

Thumbnail 2710

point because I think this is maybe my favorite thing about programming on Trainium: what you’re seeing here is the actual ISA readout of a single 128 by 128 matrix multiplication operation, one of many that happen within a kernel, within a full forward pass. What you’re seeing is the full readout down to the nanosecond, the individual bytes of memory that are being read from and written to. This is exactly what is happening, and if you’re used to programming on other accelerator architectures, you understand immediately how cool this is. This is a level of visibility into the performance of your kernels that you really just don’t get anywhere else.

Thumbnail 2760

Thumbnail 2780

Every flop, every nanosecond, every byte of memory in every operation of every kernel can be traced to this level of detail, and this is what enables us to get the maximum performance out of Trainium chips. So here we’re starting with a densely packed matrix multiplication in the standard BFloat16 format, but a lot of modern LLM inference, especially in decode, is about using smaller, more efficient data formats, which Ron alluded to. Trainium2 is designed to get twice the speed out of the smaller FP8 formats compared to full-width BFloat16. So by moving these operations from the slower BFloat16 into a faster format, in this case FP8 E4M3, we immediately get a 2x speedup on this matrix multiplication.
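
To make the data-format point concrete, here is a small PyTorch sketch of casting a BFloat16 tensor into FP8 E4M3 with a simple per-tensor scale (shapes and the ABSmax scale choice are illustrative; the real kernels do this conversion inside the fused kernel rather than in eager-mode PyTorch):

```python
import torch

x = torch.randn(1024, 4096, dtype=torch.bfloat16)  # a hypothetical activation tile

# E4M3 tops out at a magnitude of 448 (versus ~3.4e38 for BFloat16), so values are
# scaled into range before the cast and the scale is kept around for dequantization.
scale = x.abs().max().float() / 448.0
x_fp8 = (x.float() / scale).to(torch.float8_e4m3fn)  # 1 byte per element vs 2 for BF16

x_back = x_fp8.to(torch.bfloat16) * scale  # dequantize to inspect the rounding error
print(x.element_size(), x_fp8.element_size())         # 2 1
print((x.float() - x_back.float()).abs().max().item())
```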

Thumbnail 2800

Thumbnail 2810

Next, let’s dive into the actual self-attention operation. As you can already tell, this is a bit more of a complex kernel, and optimizing attention is one of the most interesting problems in modern LLM inference. It’s a much more complex optimization than working with a single matrix multiplication because there are just a lot more operations in there and a lot more opportunities for bottlenecks that stop your kernel from spending all of its time in matrix multiplications. What you can see here visually is that, unlike the matrix multiplication we were just looking at, if you look at the Tensor Engine row in green, the third row from the top, we are not densely packed with matrix multiplications the entire time like we want to be.

Thumbnail 2860

Thumbnail 2880

What we see are bursts of matrix multiplications interspersed with gaps where we’re actually doing a large number of small Vector Engine operations. When we dive in using the profiler view I was just showing you and read the ISA view, what we can see is that the bottleneck is not in doing matrix multiplications like we want. The bottleneck is actually in shuttling the results of the matrix multiplications between one memory bank and another using an inefficiently large number of these small Vector Engine operations. When we realized that, we rewrote the tiling so that we move data from one bank to another using a smaller number of larger Vector Engine operations, amortizing the instruction launch overhead and making better use of each instruction, so we spend more of our time in the matrix multiplications. And just by touching this, we get a 13% speedup in attention.
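
The intuition behind that rewrite fits in a toy cost model (all numbers here are invented for illustration and are not measured Trainium figures):

```python
# Moving the same bytes with fewer, larger operations amortizes the fixed
# per-instruction launch overhead, leaving more of the time purely bandwidth-bound.
LAUNCH_OVERHEAD_NS = 100.0       # hypothetical fixed cost per vector-engine instruction
BANDWIDTH_BYTES_PER_NS = 100.0   # hypothetical sustained bank-to-bank copy bandwidth

def copy_time_ns(total_bytes, num_ops):
    per_op_bytes = total_bytes / num_ops
    return num_ops * (LAUNCH_OVERHEAD_NS + per_op_bytes / BANDWIDTH_BYTES_PER_NS)

total = 4 * 1024 * 1024  # 4 MiB of intermediate results to shuttle between banks
print(copy_time_ns(total, num_ops=1024))  # many small copies: launch overhead dominates
print(copy_time_ns(total, num_ops=64))    # fewer, larger copies: roughly 3x faster here
```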

Thumbnail 2900

Thumbnail 2910

Maybe it doesn’t sound like a lot, but at the scale at which we operate, this is a huge number of chips saved and a huge amount of extra traffic we can serve. Let’s talk about communications. It’s been many years now since modern LLMs grew large enough that they don’t fit on a single chip, and a lot of the interesting design space we have as performance engineers is how to split up and shard the data and the computation of a full LLM forward pass across multiple chips and then communicate between them using collectives to arrive at the correct results.

So Trainium, like most chip architectures, operates, as I said earlier, with a small amount of fast SRAM that communicates with a much larger amount of HBM. By default, in order to do a collective operation, you take the result of one of your operations, shuttle it from the fast SRAM down to HBM, do a collective from HBM to HBM across different chips, and then shuttle the result of that back up to SRAM. Especially in token decode, when you’re trying to stream tokens as fast as possible, this three-step memory movement is terrible for latency, and if you’re unable to overlap your communications with other computation, spending time in communications like this is just the death of a low-latency kernel.

Thumbnail 3000

So what Trainium allows us to do in this optimization is take advantage of a hardware feature that lets us do direct collectives from SRAM to SRAM across different chips, saving the extra hops of memory between SRAM and HBM. What you can see here, and it’s not super obvious, is that I’ve annotated with the red circles the GPSIMD operation, which on the left is spending all of its time writing descriptors for the DMA memory movements between SRAM and HBM.

This goes away on the right, and with the faster SRAM-to-SRAM collectives, the amount of time that we spend in communications is lower and the latency of our decode is better.

Thumbnail 3020

Thumbnail 3070

The last optimization, of course, is to run this kernel on Trainium3. Every operation that I’ve described today gets faster on Trainium3. The double speed FP8 matrix multiplications that we looked at in the first optimization are four times faster on Trainium3 and make use of the microscaling architecture that Ron described to do more efficient blockwise quantization and dequantization. The vector and scalar operations that can so easily become the bottleneck of a complex real workload like attention are made faster in Trainium3. The communications are made faster. The amount of HBM capacity per ICI domain is larger, which lets us serve larger models on a single ICI domain. I could go on and on. In this case, the kernel that we’ve been working with for the last five or ten minutes, which achieves after the optimizations about 60% tensor engine utilization on Trainium2, gets to over 90% on Trainium3.

Thumbnail 3080

And so I’ll leave us there. A year ago we announced Project Rainier and Anthropic announced its initial use of Trainium chips. A year later, we’re serving Anthropic models on nearly a million Trainium2 chips, and we’re so excited to see in 2026 and beyond where we can get with Trainium3. Back to you, Joe.

Thumbnail 3120

Conclusion: Trainium3 Availability and Getting Started Resources

Great, just another round of thanks for both Jay Gray and Ron. They did a great job deep diving into the details of how they think about optimizing Trainium broadly across the stack. A few takeaways before we leave. Trainium3 is generally available; the chip was announced yesterday at Matt’s keynote. Think about Trainium3 not just as the chip, but also as the system. You saw how important the system is for the optimizations Jay Gray talked through, not just the flops, but also the ICI bandwidth, or what we call NeuronLink. We’re building systems that really scale, with 144 chips in the Trainium3 UltraServer.

Thumbnail 3150

And then the other part: it’s easy to get started. We’re making it really easy, and we have a lot of information available about the Neuron SDK, because we want to make sure that you can get out there and learn. So if you have time tomorrow and you’re interested in learning more, there are a few workshops available on Thursday, so I definitely recommend checking them out. You can scan this to see the full list and learn more. And if you don’t have time on Thursday, you can always scan one of these, get started, and quickly ramp up on your own. We have a lot of tutorials available, and we’re really excited about having folks develop on Trainium across the board.

Thumbnail 3170

So with that, thank you again to everyone for being here. We really appreciate it, and if you have some time, please complete the survey.

; This article is entirely auto-generated using Amazon Bedrock.
