🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 – Building scalable applications with text and multimodal understanding (AIM375)
In this video, Amazon AGI introduces Amazon Nova 2.0 multimodal foundation models that process text, images, videos, audio, and speech natively. The session covers three key areas: document intelligence with optimized OCR and key information extraction, image and video understanding with temporal awareness and reasoning capabilities, and Amazon Nova Multimodal Embeddings for cross-modal search across all content types. Box’s Tyan Hynes demonstrates real-world applications, including automated materials testing report analysis for engineering firms and continuity checks for production studios, showcasing how the 1 million token context window and native multimodal processing eliminate the need for separate models and manual annotation workflows.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Enterprise Challenges with Multimodal Data: The Untapped Potential
Good morning, everyone. First of all, thank you for being here. As a quick introduction, I’m Dinesh Rajput, a Principal Product Manager with Amazon AGI. AGI stands for Artificial General Intelligence, an organization within Amazon that builds first-party foundation models called Amazon Nova. Today’s session will discuss how you can utilize data beyond just text, such as images, documents, videos, audio, and call recordings to build accurate, context-aware applications using Amazon Nova Foundation models. I’m also joined by my colleague Brandon Nair, who will discuss image and video understanding, and we have one of our customers, Tyan Hynes, who represents Box and will share how they use Amazon Nova models to improve their AI workflows.
This is the broad agenda for today. First, we’ll discuss the enterprise needs when it comes to multimodal data and the different challenges that customers face. Then we’ll provide a quick overview of the Amazon Nova 2 models that we introduced yesterday. We’ll do a deep dive on how we’ve optimized these models for document intelligence use cases as well as visual reasoning use cases. Finally, we’ll discuss multimodal embeddings and how they can help you search and retrieve multimodal data across your enterprise, and we’ll conclude with a customer success story of how Box is using Amazon Nova models to power its AI workflows.
Let’s talk about data first. Today, organizations have immense amounts of data: text, structured data, contracts in shared drives, videos, and call recordings of customers. However, if we’re honest, we use a very small portion of that data today. It’s mostly either text or structured data that you have in your database tables. This is the only thing that we practically use in most of our AI workflows. A lot of this other multimodal data goes unused and doesn’t really contribute to our AI applications. Multimodal foundation models are really changing that by letting you see what’s inside an image, what’s happening within the different frames of a video sequence, and what a customer is saying or feeling within a support call. You can use all of this data together to reason over it and deliver customer insights and improve your AI workflows.
We’ve been working with customers, and when they try to use multimodal data, there are three key challenges that they face. First, there are many separate models and tools. You need one model to process text, another model to process structured data, a third model to process images, and then you might need a fourth model to process videos. This leads to multiple problems. First, you’re forced to stitch together these different tools, which makes the entire process very costly and complicated. The second key problem is that because you have these different models across different modalities, it’s very difficult to put together all of that context and reason over all of these modalities together to actually deliver customer insights. Think about what you read in a document and a customer call and putting that together to solve the real customer problem. Because of these different models, that’s not really feasible today. The third thing is that a lot of these models are not super accurate, which forces you to have a human in the loop. This manual checking doesn’t really scale, which leads to cost issues and efficiency issues when you’re trying to deploy these AI workflows.
Introducing Amazon Nova 2.0: Natively Multimodal Foundation Models
We launched Amazon Nova 1.0 models at re:Invent last year. We’ve been working with many of our customers over the last year and have had an amazing response with tens of thousands of customers using us. We’ve heard similar feedback from all of our customers, and to solve these customer challenges, we’ve introduced Amazon Nova 2.0 models that were just launched yesterday. One of the key fundamental things that we designed Amazon Nova 2.0 models for is to treat all the modalities as first-class citizens.
When we designed Amazon Nova 2.0 models, we made them natively multimodal, and they are able to process text, images, videos, audio, and speech, as well as generate text and images. We have a variety of models to cater to your different cost, latency, and accuracy profiles.
We have Amazon Nova 2 Lite, which is our fast, cost-effective reasoning model for most of your AI workloads. We have Nova 2 Pro, which is our most intelligent model for your highly complex tasks. We have Nova 2 Omni, which is our unified model that can not only understand text, images, videos, and audio, but can also generate text as well as images. We have Nova 2 Sonic, which is a conversational speech-to-speech low-latency model. Finally, we have Nova multimodal embeddings, which is a model that can create embeddings across all of your modalities so that you are able to implement any sort of search or retrieval on all of your enterprise data.
Let me quickly go over some of the salient aspects of these models. First, we designed these models to have 1 million tokens of context window. To put that in perspective, it means the model can process 90 minutes of video, hours of audio, and hundreds of pages of documents all in one go. We have made these models multilingual in nature so they can process 200+ languages, and on the speech side we have optimized them for 10+ languages so that your solution truly scales globally.
One final aspect is that we have also included reasoning within these models. Nova Lite, Pro, and Omni all come with reasoning enabled so that you can reason over all of the data, whether it is text or multimodal data together. We have also optimized these models for tool calling so that you can implement agentic workflows using all of your multimodal data.
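To make the tool-calling point concrete, here is a minimal sketch of how an agentic call to a Nova 2 model might look through the Amazon Bedrock Converse API in Python. The model ID and the lookup_order tool are illustrative placeholders rather than values from the session; check the Bedrock documentation for the identifiers available in your account and region.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

MODEL_ID = "amazon.nova-2-lite-v1:0"  # illustrative model ID, not confirmed by the session

# A single hypothetical tool the model may decide to call.
tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "lookup_order",
            "description": "Look up an order by its ID and return its current status.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"order_id": {"type": "string", "description": "The order identifier"}},
                "required": ["order_id"],
            }},
        }
    }]
}

response = client.converse(
    modelId=MODEL_ID,
    messages=[{"role": "user", "content": [{"text": "What is the status of order 12345?"}]}],
    toolConfig=tool_config,
)

# If the model decides to use the tool, the response includes a toolUse block with
# the arguments it chose; the application runs the tool and replies with the result.
for block in response["output"]["message"]["content"]:
    if "toolUse" in block:
        print(block["toolUse"]["name"], block["toolUse"]["input"])
```

If the model chooses to call the tool, your application runs it and returns the result in a follow-up message; that loop is the basis of the agentic workflows mentioned above.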
When launching these models, we made sure that they achieve frontier intelligence on multimodal tasks. Here are some benchmarks that we have presented. These are mostly multimodal benchmarks. MMMU Pro is a complex visual reasoning task on images. We have document OCR, whether you are doing optical character recognition on printed text or extracting handwritten text, slides, and so on. We have Real-KIE, which is key information extraction for tables where you want to extract values. We have QVHighlights, which is our video understanding benchmark, and then we have ScreenSpot, which covers essentially any kind of UI or browser task.
As you can see, Nova Lite and Omni perform at the frontier compared to their peers in this category. Nova 2 Pro is also an extremely competitive model that we have introduced in preview. We will potentially GA this, and it also shows extremely competitive performance on multimodal tasks compared to its peers.
Document Intelligence Deep Dive: OCR and Key Information Extraction
Now let us get into a deep dive around document intelligence. When it comes to document intelligence, we have heard from a lot of our customers that the two primitives that they care about are optical character recognition and key information extraction. We have made sure that Amazon Nova multimodal models deliver absolutely the best performance on these two key primitives so that developers as well as enterprises can build on top of them to enable their different AI workflows.
Here is a quick example of a Nova output from an OCR task. As you can see, we have optimized these models around three key parameters. One is robust real-world OCR. What I mean by robust real-world OCR is that a lot of documents you get in enterprises may be handwritten. The scan quality is not always great. Sometimes the documents are a little tilted, so we have made sure that we have optimized around all of these real-world scenarios so that the information you are able to extract from documents is super accurate and requires minimal additional work.
Second is mixed context understanding. It is very simple for models to just extract a chart or a table or text separately, but most of our real-world documents have all of these things put together. We have made sure that the model also excels in understanding these different sets of context, whether it is charts, tables, or any sort of text, so that you are able to deliver this performance together. Finally, we have structured output.
We’ve also made sure that our model is able to produce the right structured output whether it’s in JSON, HTML, or XML format, so that it’s machine readable. You can then extract all of this and put it in your databases so that you can further process it. These are the three key things that we made sure our models really excel at.
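As a rough illustration of the structured-output point, the sketch below sends a scanned page to a Nova 2 model through the Bedrock Converse API and asks for JSON back. The model ID and file name are placeholders for illustration only.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("scanned_page.png", "rb") as f:  # hypothetical scanned document page
    page_bytes = f.read()

response = client.converse(
    modelId="amazon.nova-2-lite-v1:0",  # illustrative model ID, not confirmed by the session
    messages=[{
        "role": "user",
        "content": [
            {"image": {"format": "png", "source": {"bytes": page_bytes}}},
            {"text": "Transcribe everything on this page, including handwriting and tables, "
                     "and return machine-readable JSON with keys 'paragraphs' and 'tables'."},
        ],
    }],
)

# The text block should contain the requested JSON, ready to store or post-process.
print(response["output"]["message"]["content"][0]["text"])
```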
The second task is key information extraction. Here we have tried to optimize the models around three key things. First is schema-driven extraction. Any sort of schema that you specify to the model—maybe you want to extract only five rows, or maybe you want to extract all the rows, or maybe you want to label certain rows in a very specific manner—we have made sure that the model does instruction following around what schema you want to define, what sort of indentation you want, and what output you want.
Second is layout-aware text extraction. We have seen in a lot of use cases that the layouts, especially in complex layouts, require a lot of interpretation that the model needs to do. We have made sure that we are covering all of these long tail use cases of really complex real-world tables which require a lot of human interpretation to actually extract the data and make sense of it.
Finally, all of our models also have reasoning capabilities within them, so you could also extract all of this information from these documents and reason over these documents to see if the model has extracted it correctly or even fundamentally if the data is correct or not. As a quick example, you might extract a zip code from a document, but you need some model intelligence to know that a zip code should be five digits and not four digits, and that might be an error in the form. So you don’t just want extraction, but you want intelligent extraction within those documents.
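Here is a hedged sketch of what schema-driven extraction with validation might look like in practice. The schema, field names, and model ID are hypothetical; the idea is simply that the prompt states the expected shape and constraints (such as a five-digit zip code) so the model's reasoning can flag values that do not fit.

```python
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Hypothetical target schema; the constraints give the model something to check
# its own extraction against (for example, a five-digit ZIP code).
schema = {
    "invoice_number": "string",
    "invoice_date": "YYYY-MM-DD",
    "billing_zip_code": "string, exactly five digits",
    "line_items": [{"description": "string", "quantity": "integer", "total": "number"}],
}

with open("invoice.pdf", "rb") as f:  # hypothetical sample document
    doc_bytes = f.read()

response = client.converse(
    modelId="amazon.nova-2-lite-v1:0",  # illustrative model ID, not confirmed by the session
    messages=[{
        "role": "user",
        "content": [
            {"document": {"format": "pdf", "name": "invoice", "source": {"bytes": doc_bytes}}},
            {"text": (
                "Extract the fields below from this document and return only JSON that "
                "matches the schema. Flag any value that violates its constraint.\n"
                + json.dumps(schema, indent=2)
            )},
        ],
    }],
)

print(response["output"]["message"]["content"][0]["text"])  # expected to be JSON per the prompt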
Image and Video Understanding: Visual Perception, Reasoning, and Temporal Analysis
Thank you so much. I’m going to hand it over to my colleague Brandon, who will talk a little bit more about image as well as video understanding. Hey folks, how’s it going? My name is Brandon Nair, and I am a Senior Product Manager in Amazon AGI, the same organization that Dinesh is from.
Dinesh has given us a pretty great deep dive on how Nova 2 models can be applied for understanding use cases, but I’d like to expand that a little bit further and showcase how those capabilities can also solve for image and video understanding use cases. Vision understanding is a foundational capability that enables language models to really understand the world not just in the form of pixels, but similar to the way that humans do—really understanding the context and the meaning behind those images and videos.
Nova 2 models process all images as well as video content in their unstructured format, so it processes it natively and converts it into structured output that could be text written in JSON format. That text can be used by customers to easily search, categorize, and generate rich business insights from that visual content.
Some examples of what these insights could look like include identifying key objects in an image or a video, identifying and extracting text overlays in a video through optical character recognition, understanding temporal relationships or causal relationships within a video, or generating rich captions that can be used by customers to really understand what content is contained within that particular video asset.
We are super excited about the Nova 2 models that have come out. From a vision understanding perspective, the improvements we are bringing with the Nova 2 models can be broken down into three capabilities that we are trying to solve for based on our customer needs. The first is visual perception. The second is reasoning and scene semantics. The third is temporal understanding, or how time influences the understanding of a video. Let’s dive through those one by one.
Visual perception can be seen as the model’s eyesight. It represents how adept a model is at taking in a particular image or video and understanding all the elements within those videos or images. This includes understanding attributes associated with each element, such as color, shape, and count, as well as understanding geospatial relationships between the different elements. Vision perception forms the basis of common computer vision tasks such as object detection.
Object detection is a way that you prompt a model to identify a particular object and potentially extract the coordinates of that object that can be passed downstream into a system for further processing. An example of this could be logo detection, where you identify a logo and pass it downstream to do an ad attribution use case. In the example on screen, we have a standard living room, and I’ve prompted Nova 2 to identify and detect plants, cushions, a table, and a TV. In the next slide, I’ve overlaid the bounding box coordinates corresponding to each of those particular objects.
As you can see, Nova has identified all the objects on the screen, even the little plants that are in the bookshelf in the back, and even the black TV, which is literally represented as a black rectangle. The model has an understanding of what to expect in a living room space and is able to detect that as a TV. The second finding from this particular image is how tight these bounding boxes are. This gives you a sense of how well the model is able to focus on the particular object that you are trying to detect, and it is a measure of how accurately the model is able to detect those objects.
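A minimal sketch of the object-detection flow described here, assuming an illustrative model ID: the prompt asks for labeled pixel-coordinate boxes as JSON, and the returned boxes are overlaid on the image with Pillow to check how tight they are. The coordinate format is requested explicitly in the prompt rather than being a documented default.

```python
import json
import boto3
from PIL import Image, ImageDraw

client = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("living_room.jpg", "rb") as f:  # hypothetical sample image
    img_bytes = f.read()

prompt = (
    "Detect every plant, cushion, table, and TV in this image. "
    "Return JSON: a list of {\"label\": str, \"box\": [x1, y1, x2, y2]} in pixel coordinates."
)

response = client.converse(
    modelId="amazon.nova-2-lite-v1:0",  # illustrative model ID, not confirmed by the session
    messages=[{"role": "user", "content": [
        {"image": {"format": "jpeg", "source": {"bytes": img_bytes}}},
        {"text": prompt},
    ]}],
)

# Assumes the model returns bare JSON as requested; production code would validate this.
detections = json.loads(response["output"]["message"]["content"][0]["text"])

# Overlay the returned boxes for a quick visual check of how tight they are.
image = Image.open("living_room.jpg")
draw = ImageDraw.Draw(image)
for det in detections:
    draw.rectangle(det["box"], outline="red", width=3)
    draw.text((det["box"][0], det["box"][1]), det["label"], fill="red")
image.save("living_room_annotated.jpg")
```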
Visual perception is a mechanism to test how well the model can visualize, and this capability can be extended further to having the model generate captions of what it’s actually seeing. In this example, it could be something like “modern living room” or “bright open spaces with natural sunlight coming through.” The second capability I’d like to discuss is that each of the Nova 2 models has reasoning capabilities. This allows the model to go beyond visual perception and identifying what it can see, and actually combine different elements to make a logical deduction or inference about what is happening.
The model consumes reasoning tokens in order to generate this kind of thinking, and within the API we provide parameters to developers that allow them to control the depth or budget of the degree of thinking in order to solve for their particular use case. It’s another way to prioritize what makes the most sense for your use case. The third big upgrade that we have is around video and temporal understanding. Video understanding is a critical capability for a bunch of use cases, from media asset management to semantic search to personalization and recommendations to contextual ad placements to video question answering and detection. The list goes on.
The challenge, however, is that videos are a really complex asset or modality to deal with. As you can see, videos have frames, shots, scenes, and chapters. You may have audio transcripts in there, and you have textual screen overlays. It’s a really complex multimodal asset. But this is compounded by the fact that we also need to consider the time dimension. Time is super important to understanding the context of a particular video, what’s actually transpiring in a scene, and what’s actually taking place. Current solutions have basically two options that they can utilize.
The first option is to have manual annotation of the videos. This involves a team of people watching a video and literally noting down metadata that is annotated to the video. Due to its manual nature, this approach is really unscalable. It depends on the depth at which someone takes down these annotations, and because there’s a human in the loop, you get a degree of variability that can differ from teammate to teammate.
The second option that customers can pursue is to extract frames from the video and send those frames to a vision language model to generate metadata. However, this option too is flawed. Firstly, it requires customers to build out complex pipelines to preprocess their images and videos into frames or images, and then integrate them into a vision language model. Secondly, because you are extracting frames and processing them separately, when you consider the sheer volume of video archives you might want to go through, this could become cost inefficient.
The third aspect of processing frames is that you don’t have that element of temporal understanding, which is really important for understanding what’s actually happening in a video. Nova 2 models support video natively. We’ve trained the model to understand the temporal aspects of the video, so it follows what’s happening across the sequence of frames and builds a deeper understanding of the video as a whole.
Temporal understanding is super important because with it, you also get the ability to ask the model to generate chapters where it could provide a description from time A to time B, indicating what has transpired within those two timeframes. Or you could ask it to process a video and extract the timestamps that correspond to a particular event that might be interesting to you. I have a demo here which showcases this. This is a video, a sped-up version of a documentary that features Werner Vogels, the CTO of AWS. In this video, we have a number of occurrences where someone is standing on a boat. There are actually four occurrences in this particular video.
I provided that video to the Nova 2 model and prompted it to extract each of the timestamps that correspond to when someone is standing on a boat. Nova 2 was able to identify all four of the particular events when someone was standing on a boat, and not only that, it was also able to localize each timestamp to within one to two seconds of the actual start and end times, which is a pretty powerful capability for identifying different events happening within your video.
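For reference, here is a sketch of how a timestamp-extraction prompt like the one in the demo might be sent. The Converse API accepts video content either as inline bytes or by S3 location; the model ID, bucket, and object key below are placeholders.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="amazon.nova-2-lite-v1:0",  # illustrative model ID, not confirmed by the session
    messages=[{
        "role": "user",
        "content": [
            {"video": {
                "format": "mp4",
                # Placeholder S3 location; inline bytes also work for smaller clips.
                "source": {"s3Location": {"uri": "s3://my-bucket/documentary.mp4"}},
            }},
            {"text": "List every segment where a person is standing on a boat. "
                     "Return the start and end timestamps of each segment as MM:SS pairs."},
        ],
    }],
)

print(response["output"]["message"]["content"][0]["text"])
```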
Amazon Nova Multimodal Embeddings: Unified Representation Across All Content Types
Now I’m going to switch gears and talk about Amazon Nova Multimodal Embeddings. Amazon Nova Multimodal Embeddings is a separate model from the Nova 2 models. It takes in text, images, documents, video, and audio as inputs, and it outputs an embedding that represents any of the components that are passed in as an input. Before we go any further, let’s define what an embedding is. An embedding, simply put, is a representation of the input that you provided to the model.
You can think about this lovely picture of a Labrador. The Labrador is sitting on a beach with an ocean in the background and a blue handkerchief tied around its neck. All of those elements within that image are important to understand the overall context. When you convert this image into an embedding, you’re really trying to represent it in a machine-readable format, which we call an embedding, that captures all of those intricate details and all of that information represented within the image. This is super important because it helps enable semantic search applications where you don’t have to rely on metadata.
You can rely just on the embedding itself to retrieve the correct image. It also helps for RAG applications. As you’re thinking about building out deep AI workflows that manage to retrieve important information that might be proprietary to your business, with an embedding model, you don’t have to have the metadata. You can actually retrieve it just based on the embedding itself.
Amazon Nova multimodal embeddings is a state-of-the-art embedding model that takes in text, images, documents, video, and audio and outputs an embedding. It is the first model in the industry to process all these different content types and to process them within the same embedding space, so it has the same level of understanding. In other words, if you have a text string of the word “dog,” if you have an image of a dog, or if you have a video of a dog, they are all represented in the exact same way. This allows you to expand applications to cross-modal capabilities such as doing text plus image to image search or video to video search, or trying to retrieve visual documents that contain both text and images.
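The shared embedding space is what makes cross-modal search a simple nearest-neighbor problem. The sketch below assumes you have already generated vectors with Amazon Nova Multimodal Embeddings (the invocation itself is omitted, and the vectors shown are random placeholders, as is the dimension size); a text query vector is compared directly against image and video vectors with cosine similarity.

```python
import numpy as np

DIM = 3072  # assumed embedding size for this sketch

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random placeholders standing in for real Nova Multimodal Embeddings output.
query_vec = np.random.rand(DIM)                # embedding of the text query "dog"
corpus = {
    "dog_photo.jpg": np.random.rand(DIM),      # embedding of an image
    "dog_clip.mp4": np.random.rand(DIM),       # embedding of a video segment
    "cat_photo.jpg": np.random.rand(DIM),
}

# Because every modality lands in the same space, ranking by cosine similarity
# against the text query works unchanged for images and videos.
ranked = sorted(corpus, key=lambda name: cosine_similarity(query_vec, corpus[name]), reverse=True)
for name in ranked:
    print(name, round(cosine_similarity(query_vec, corpus[name]), 3))
```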
Nova Embeddings offers great price performance at approximately 10 to 20 percent lower cost than other alternative embedding models. Some of the key features of the embedding model include unmatched modality coverage. It also provides a very long context length of 8,000 tokens for an embedding model, which is a pretty high amount. This refers to how much context you can bake within a single embedding and still have a meaningful representation of your input. It also provides segmentation capabilities within the API, so if you have longer text, video, or audio, you can first split those into smaller, manageable pieces and then generate embeddings for each of those pieces.
The model comes equipped with both synchronous and asynchronous APIs. The synchronous API handles your latency-sensitive workflows where it might impact your user experience. You can think about something like someone searching for a document where you want that to be retrieved pretty quickly. It also supports asynchronous capabilities, so if you have a very large video file that you want to process and it’s not latency-sensitive, you’re able to pass it through the asynchronous API and get a response at a later time once the job has been completed.
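For the asynchronous path, a heavily hedged sketch: start_async_invoke and get_async_invoke are existing Bedrock Runtime operations, but the model ID and the modelInput payload shown here are placeholders, since the exact request schema for Nova Multimodal Embeddings is not covered in the session.

```python
import time
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

job = client.start_async_invoke(
    modelId="amazon.nova-multimodal-embeddings-v1:0",  # placeholder model ID
    modelInput={"video": {"s3Uri": "s3://my-bucket/long-recording.mp4"}},  # placeholder schema
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://my-bucket/embedding-output/"}},
)

# Poll until the job finishes; the embeddings are then read from the S3 output location.
while True:
    status = client.get_async_invoke(invocationArn=job["invocationArn"])["status"]
    if status in ("Completed", "Failed"):
        print("Job finished with status:", status)
        break
    time.sleep(30)
```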
Lastly, the model comes with a choice of four embedding dimensions, which really gives you the option to trade off the level of compression within an embedding against your storage cost implications. The model is trained with Matryoshka representation learning, which essentially means that it bakes the most important context into the earlier embedding dimensions. So if you truncate the embedding from the native 3,000 dimensions, you can still maintain a very high degree of accuracy. In our benchmarks, we see a pretty minimal accuracy drop when we move from a 3,000-dimension embedding all the way down to a 256-dimension embedding.
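In practice, taking advantage of Matryoshka-style training is as simple as keeping the leading dimensions and re-normalizing, as in this small sketch (the full dimension size below is a placeholder standing in for the model's native output).

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int = 256) -> np.ndarray:
    """Keep the leading dimensions of a Matryoshka-trained embedding and re-normalize."""
    shortened = vec[:dims]
    return shortened / np.linalg.norm(shortened)  # re-normalize so cosine search still behaves

full = np.random.rand(3072)        # placeholder for a real full-size Nova embedding
small = truncate_embedding(full)   # 256 floats instead of thousands, at much lower storage cost
print(full.shape, "->", small.shape)
```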
In the slide, I also talked through some of the benchmarks that we’ve developed to compare Amazon Nova multimodal embeddings against alternative models. The first thing you should notice is that because Amazon Nova multimodal embeddings is so comprehensive, we have to pick select models to compare it to because other models tend to be more specialized toward a particular modality or particular domain such as images, documents, or maybe just videos. As you can see from the slide, across video retrieval, visual documents, as well as text, the model delivers great accuracy in terms of your retrieval tasks. It’s a great model that we’ve been really proud about, and we’ve started to get some great feedback over the past few weeks.
Box’s Journey: Unlocking Unstructured Content with Nova Multimodal Embeddings
We’re excited to have more folks test it out and give us a sense of how it is being applied within your business use cases. Now I will transition over to Tyan, who will take us through a bit more about how Amazon Nova multimodal embeddings is unlocking new use cases at Box. I don’t think the click is working.
Welcome everybody. I’m excited to talk to you about how Box is specifically using these new models. For those of you who are not familiar with Box, we are the leading intelligent content management company. What does that mean? Companies trust us to store, manage, and share their information so that they have the ability to interact with that information, share it securely with other customers or with other people in their organization, and actually get really useful value out of that information. Over 115,000 organizations trust Box, and these are across many different industries, including highly regulated industries like government, finance, and life sciences.
Obviously, it’s really critical for Box to be able to be secure about how we manage that content, but we also need to make sure that we’re providing access to the full breadth of the content. There’s a whole bunch of use cases that our customers have that they want to use this information for across a wide variety of different verticals and industries, all the way from digital asset management and insurance claims management to real product design and development. One of the big challenges with this kind of information is that most of it is in unstructured data. We’re talking about PDFs, Word documents, Excel spreadsheets, all kinds of things, video files, and traditionally it’s been really hard to get access to that information.
Previous paradigms have really focused on structured data and being able to do database queries. That’s not super useful to a lot of our customers because so much of the information that they need is in that 90 percent of unstructured content. There’s a lot of really untapped value that we want to unlock, and obviously AI is a huge way to do that. That’s where Box AI comes in, and that’s the particular product that I work on. This is the platform that we built that allows Box customers to use AI and apply it to their content. We’ve had things out for a few years now, but we’re continually improving and making sure that we can provide even more value to that content, not just being able to find things and ask questions, but also take that information that you get out of that content and use it to create new assets or power workflows.
A couple of key use cases: the very first thing that people like to do is really just being able to generate instant insights from content. How many of you have had a time where you have a really long document that you probably need to read before a meeting and you just need to know a few things? You don’t have time to read the 50-page document, so being able to just ask questions really quickly on that is super helpful. Once you have that information, you can do interesting things with it. For example, if you have a long document and you know that you need to be able to search for it later, you can extract that information and save it as metadata. You can save that back to the document, or you can take that information—for example, if you have a loan document—you can pull that information out and then take that metadata and put it in another system as well.
Really, we’re being able to take that information and spread it across not just the Box ecosystem but across your entire ecosystem so that you can really get more value. Of course, the real thing that we’re really focused on is being able to use that information to automate workflows. It’s great to be able to go and read through a document or look through a video and find the information that you want, but what we really want to do is empower our customers to actually take that information and use it to power new processes so that people don’t have to do that manual step and it just gets done automatically.
This is not a sales pitch. I just wanted to set some context for what Box is and why we care about this stuff. Let’s actually get into the details of what we’re trying to do. Let’s go back to that 90 percent of content that is unstructured. That’s a lot of content across many different file sizes and file types. In order to access all of that, we need to use RAG, or Retrieval Augmented Generation.
If we have a 10-hour video, we can’t just plug that into a model and get the information. We need to use RAG to do that, especially when we’re talking about comparing across different documents. The real challenge is that current models tend to just be very text-focused. There are some models that do images as well, and we talked a little bit earlier in the presentation about that.
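A rough sketch of the retrieval-augmented pattern being described: split a large asset into segments, embed each segment once, retrieve only the most relevant segments for a question, and hand those to the generating model. The embed and generate helpers are hypothetical stand-ins for calls to Nova Multimodal Embeddings and a Nova 2 model.

```python
from typing import Callable, Dict, List
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray, index: Dict[str, np.ndarray], k: int = 3) -> List[str]:
    """Return the IDs of the k segments whose embeddings are closest to the query."""
    return sorted(index, key=lambda seg_id: cosine(query_vec, index[seg_id]), reverse=True)[:k]

def rag_answer(question: str,
               index: Dict[str, np.ndarray],
               embed: Callable[[str], np.ndarray],
               generate: Callable[[str, List[str]], str]) -> str:
    # Only the retrieved segments are passed to the generating model, so a
    # 10-hour video never has to fit into a single request.
    top_segments = retrieve(embed(question), index)
    return generate(question, top_segments)
```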
But it’s a real challenge to access anything beyond text. This is where we’re trying to figure out how to solve this problem. For our Box customers, this is really important. There’s a lot of content—think PDFs, CAD files, presentations—where we can access the text, but we can’t necessarily access all the embedded images and charts. We’re losing a lot of context because we’re only looking at one dimension of that file. These kinds of files make up a huge percentage of the files that our customers store in Box, so we’re losing a lot of information.
We also have a lot of video files, audio presentations, and PDFs. These tend to be disproportionately large files in Box. We actually have a customer with a video that is more than 10 years long. I don’t quite know how they’ve managed that, but as a result, we have some very large files within Box that we have to manage. The only way to get value out of those really large files is using RAG.
We’ve been trying to solve this problem for a while. We’ve done a couple of different things. Obviously, the first thing is to convert audio from your video and audio files into text. There are very good existing models that can do that. I think this is what a lot of multimodal models do right now—they’re really just extracting text and then doing the embedding on the text. That allows them to search across text files, audio files, and video files. The other thing is human annotation, which is a very traditional approach. This is what we did even before AI. If you wanted to get information out of a document, you hired a person, told them to find these things, and they would annotate it. We’ve tried both approaches, but there are some big limitations to those.
Especially when you’re talking about human annotation, it’s super difficult to scale. It’s very expensive. You have to hire people who are experts in a particular field, so it’s quite challenging. Both approaches are quite slow. Transcription is getting better and faster, but human annotation takes a really long time. No matter what approach you use, your potential for search is limited to whatever keywords were extracted during human annotation or what is specifically in the transcript from the audio or video. As a result, really important context is lost as part of that process.
Up to this point, even though we’ve been trying to solve this particular problem, we haven’t really found a great solution. The whole goal is to look at more than just text. We want to look at all the information within a document together at the same time. We want to look across all different file types. My team’s goal is simple: if you store it in Box, you should be able to use AI on it. That’s not just that you can use AI on a file—you should be able to use AI on the entire content of that file. That’s what we’re working towards.
This is where we get to the new Nova multimodal embeddings model, and this really has been a game changer for us. This is going to allow us to unlock a lot of what we’ve been trying to do. We finally have a single multimodal embeddings model that handles all content types, not just text. It’s not pulling a transcript and doing it—it’s actually pulling the text, the image, the video, everything from whatever that file is. So we have all that information, and that means we’re not losing that critical information that might come from the non-text pieces in that file. For us, that is huge.
We can really look at documents with different kinds of file types within a document as well as looking across the entire depth and breadth of the files stored within Box. Of course, that unlocks a ton of new use cases that we just couldn’t do before. It’s an added bonus for us that it’s very fast, very scalable, and very affordable, which is great for us.
Real-World Impact: Customer Success Stories and Future Directions
I’m going to give you some real examples of customers who have actually used this technology. We have a leading engineering and architecture firm that does a lot of materials testing. They receive these reports every month that are extremely long—around 80 pages—with lots of in-depth technical information about the results of all their tests. Previously, they had to hire someone at their company to go through and read the entire report, pulling out project-specific information for different projects to identify anything in those results that might need additional action. This was incredibly time-consuming to extract the information from the report.
Additionally, there are often attached videos or images. For example, if someone is trying to look at a particular room that’s been built, they might want to take pictures of what things look like and then see if that actually matches the specific requirements they have. All of this required a human to do manually. However, with this new embeddings model, this customer can now create an agent that can go through all that information and pull out the insights they need in project-specific summaries as well as summaries for executives, along with a list of action items they need to take based on that report so they know what to do next. This is huge for them and saves days and days of work just by being able to do this.
Let me talk about a fun one too. We definitely have some really cool media and entertainment use cases. You can imagine they use a lot of videos and audio. We have a very large production studio that is a customer of ours, and one of the things that’s a real challenge for them is continuity checks. When you’re filming something, you have many people working on different sets at different times, which is quite complex. You might film a scene or a few scenes at one time with everything set up in your location, and then for whatever reason you might need to tear that set down, go somewhere else, and then come back to that location at a different time.
You need to set everything up to be exactly like it was before. In the past, that meant someone had to go through the video from that previous time they were on that set and look at where all the objects were and what all the context was. Then they would go and set the set up to match exactly what it was last time. This was super time-consuming, especially because depending on where your camera is, you can’t just look at the last shot—you have to go and look at all the different shots so that you have the whole set set up the correct way. Now they can use this to search the video and really be able to find that complex information right down to finding where that coffee cup was located and what direction the particular writing was oriented so that they can get everything set up looking exactly like it was when they finished. This saves them hours and hours of work.
We’re really excited about this one as well. What’s next? We’re doing a lot of really great stuff with the teams here at Amazon to really start getting this in the hands of our customers. We’ve proven that it works and solves our use cases. So now we really need to make sure that all our customers have access to this. That’s a big effort for us now—to really start applying this embeddings model at scale in our production environment. We’re also working on testing and integrating those Nova 2 models that were just announced. My team is actually actively working on testing them right now and looking at them on our use cases so we can see where they really do well.
One thing that’s really great is we’re now able to extend the kinds of things that Box AI can do to new use cases. Every time we have some new capability that comes out, there’s always new use cases that come up that we just couldn’t do before. So we want to be able to really start figuring out now that we have this new capability with these new embeddings models to be able to look at all this different content, what new kinds of things we can do. There’s a lot of really active work happening right now to see what new use cases we can apply these models to. That’s it. Thank you so much everybody for joining us. We really appreciate you coming out and listening. Please make sure that you complete the session survey—it’s in the mobile app. For those of you who have questions, we’ll be off to the side over here to answer questions for a few more minutes. Thanks so much everyone.
; This article is entirely auto-generated using Amazon Bedrock.