Building Production-Ready RAG in FastAPI with Vector Databases

From Prompting to Production-Ready RAG

Retrieval-Augmented Generation (RAG) is often presented as a prompting technique or a lightweight runtime enhancement for LLMs. While this may work for demos, it breaks down quickly once you try to build a production-ready AI backend with FastAPI.

The moment you want persistence, reproducibility, scalability, and clear separation of responsibilities, RAG inevitably leads to a vector database, because similarity-based retrieval cannot be treated as a stateless runtime concern. The vector database is not an optional optimization but the central infrastructure component that makes retrieval reliable and operational.

This article focuses exactly on that transition by integrating RAG into a FastAPI backend and treating the vector store as a first-class backend dependency that is configured, injected, and consumed like any other production system component.

RAG as a Backend Component

In current literature, Retrieval-Augmented Generation is not described as a single step or a simple pipeline, but as a composition of multiple responsibilities that together enable grounded generation.

At its core, RAG combines:

  • persistent knowledge storage,
  • similarity-based retrieval,
  • and context-aware generation.

While these responsibilities are often collapsed into a single conceptual flow, in a production backend they naturally separate into distinct backend concerns. Knowledge must first be ingested, transformed, embedded, and stored in a way that allows efficient semantic access. Retrieval then becomes a query-time operation that selects relevant information based on vector similarity. Only after this retrieval step does the generative model come into play.

Seen this way, RAG is not a monolithic process but an architectural pattern that explicitly connects storage, retrieval, and generation. The vector database forms the backbone of this pattern, acting as the system of record for knowledge and as the execution layer for retrieval. This perspective makes it clear why RAG belongs in the backend infrastructure and not inside the AI logic itself.

Scope of RAG in This Article

To keep the focus sharp, this article deliberately excludes:

  • PDF parsing and document ingestion pipelines
  • Chunking strategies
  • Embedding model comparisons
  • Advanced retrieval or ranking techniques

The goal here is not to explain how to generate embeddings, but how to integrate a vector-based RAG component cleanly into a FastAPI backend.

With the architectural role of RAG clarified, the next step is to materialize it as an actual backend dependency.

Introducing the Vector Store Dependency

At the center of the RAG setup sits the vector store. In this project, it is implemented using Qdrant via LangChain. The QdrantVectorStore used here is a LangChain-provided abstraction that encapsulates all communication with the underlying vector database. It is responsible for embedding queries, executing similarity searches, and mapping results back into document objects. For simplicity and clarity, this project relies on this existing LangChain implementation rather than introducing a custom database layer. By returning the QdrantVectorStore as a VectorStore dependency, the application stays decoupled from Qdrant-specific details while still leveraging a production-ready vector database.

from fastapi import Depends
from langchain_core.vectorstores import VectorStore
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams

# Settings, get_settings and get_openai_embeddings are project-specific helpers.


def init_qdrant_vector_store(settings: Settings = Depends(get_settings)) -> VectorStore:
    """
    Initialize and return the vector store used for retrieval.
    """
    embeddings = get_openai_embeddings(
        settings.qdrant_vector_store.embedding_model,
        settings.openai_model.api_key
    )

    # Local, file-based Qdrant instance; a remote deployment would pass url/api_key instead.
    client = QdrantClient(path=settings.qdrant_vector_store.path)

    # Create the collection on first use so the dependency is self-bootstrapping.
    if not client.collection_exists(
        collection_name=settings.qdrant_vector_store.collection_name
    ):
        client.create_collection(
            collection_name=settings.qdrant_vector_store.collection_name,
            vectors_config=VectorParams(
                size=settings.qdrant_vector_store.vector_size,
                distance=settings.qdrant_vector_store.distance
            )
        )

    return QdrantVectorStore(
        client=client,
        collection_name=settings.qdrant_vector_store.collection_name,
        embedding=embeddings,
    )
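
Everything the dependency needs comes from a nested Settings object. Its exact shape is project-specific; the sketch below is a plausible reconstruction using pydantic-settings, with field names taken from the code above and placeholder defaults.

from pydantic import BaseModel
from pydantic_settings import BaseSettings
from qdrant_client.models import Distance


class QdrantVectorStoreSettings(BaseModel):
    path: str = "./qdrant_storage"                   # placeholder default
    collection_name: str = "documents"               # placeholder default
    embedding_model: str = "text-embedding-3-small"  # placeholder default
    vector_size: int = 1536                          # must match the embedding model
    distance: Distance = Distance.COSINE


class OpenAIModelSettings(BaseModel):
    api_key: str


class Settings(BaseSettings):
    qdrant_vector_store: QdrantVectorStoreSettings = QdrantVectorStoreSettings()
    openai_model: OpenAIModelSettings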

The init_qdrant_vector_store dependency is responsible for:

  • ensuring the vector collection exists,
  • configuring embeddings,
  • returning a ready-to-use vector store abstraction.

From the rest of the application’s perspective, this behaves exactly like a database connection.
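
Any route can request it the same way it would request a database session. The /search route below is not part of the project; it is only an illustrative consumer of the dependency.

from fastapi import APIRouter, Depends
from langchain_core.vectorstores import VectorStore

demo_router = APIRouter()


@demo_router.get("/search")
def search(query: str, vector_store: VectorStore = Depends(init_qdrant_vector_store)):
    # The route sees only the VectorStore interface, never Qdrant itself.
    results = vector_store.similarity_search(query, k=3)
    return [doc.page_content for doc in results]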

Uploading Data into the Vector Store

Before retrieval can happen, the vector store must be populated. This is handled via a deliberately minimal upload endpoint /upload/chunks.

from uuid import uuid4

from fastapi import Depends, HTTPException

# DocumentChunks and UploadResponse are project-specific Pydantic models;
# `router` is the APIRouter mounted under the /upload prefix.


@router.post("/chunks", response_model=UploadResponse)
def upload_chunks(
    documents: DocumentChunks,
    vector_store: QdrantVectorStore = Depends(init_qdrant_vector_store)
):
    if len(documents.chunks) == 0:
        raise HTTPException(status_code=400, detail="No chunks to upload")

    # Give every chunk a stable ID so it can be referenced or removed later.
    uuids = [str(uuid4()) for _ in range(len(documents.chunks))]

    # add_documents embeds the chunks and persists them in the Qdrant collection.
    chunk_ids_added = vector_store.add_documents(
        documents=documents.chunks,
        ids=uuids
    )

    if len(chunk_ids_added) == 0:
        raise HTTPException(
            status_code=500,
            detail="Uploading chunks to vector store failed"
        )

    return UploadResponse(
        success=True,
        message=f"{len(documents.chunks)} chunks uploaded"
    )

This endpoint assumes:

  • chunks are already prepared,
  • embeddings are generated implicitly via the vector store,
  • and data is persisted for future queries.

This reinforces the idea that RAG starts with data ingestion, not with prompting.
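
The DocumentChunks and UploadResponse models used above live in the project and are not shown here. A plausible sketch, assuming the chunks arrive as LangChain Document objects (field names inferred from the endpoint code):

from langchain_core.documents import Document
from pydantic import BaseModel


class DocumentChunks(BaseModel):
    # Pre-chunked documents; chunking itself is out of scope for this article.
    chunks: list[Document]


class UploadResponse(BaseModel):
    success: bool
    message: str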

Querying with RAG via Dependency Injection

With data in place, the query endpoint becomes straightforward.

@router.post(path="/query", response_model=Insight)
def create_insight(
    request: InsightQuery,
    settings: Settings = Depends(get_settings),
    llm: BaseChatModel = Depends(init_openai_chat_model),
    vector_store: VectorStore = Depends(init_qdrant_vector_store)
):

Here, the vector store is injected alongside the LLM. Neither depends on the other. They are simply resources orchestrated by the endpoint.
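
The body of create_insight (omitted above) does little more than wire these resources together. A minimal sketch of that wiring, in which the question field on InsightQuery, the location of the prompt messages, and the retriever's k are assumptions rather than the project's actual names:

    # Illustrative body for create_insight; attribute names are assumptions.
    retriever = vector_store.as_retriever(search_kwargs={"k": 4})

    return run_rag_insight_chain(
        prompt_messages=settings.insight_prompt,  # hypothetical: wherever the prompt lives
        llm=llm,
        retriever=retriever,
        question=request.question                 # assumed field on InsightQuery
    )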

The RAG chain then pulls context from the retriever and passes it into the prompt.

The important point is not the retrieval logic itself, but where it lives:

  • Retrieval is fully owned by the vector store
  • The LLM only receives already-prepared context
  • No AI component depends on Qdrant, embeddings, or storage details

This clean separation is what allows RAG to exist as a backend component rather than bleeding into AI logic.

from langchain_core.language_models import BaseChatModel
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.vectorstores import VectorStoreRetriever

# ChatModelPrompt and Insight are project-specific Pydantic models.


def run_rag_insight_chain(
    prompt_messages: ChatModelPrompt,
    llm: BaseChatModel,
    retriever: VectorStoreRetriever,
    question: str
) -> Insight:
    # Retrieval: the vector store returns the documents most similar to the question.
    context = retriever.invoke(question)

    prompt_template = ChatPromptTemplate([
        ("system", prompt_messages.system),
        ("human", prompt_messages.human)
    ])

    # Parse the LLM output into the structured Insight model.
    parser = PydanticOutputParser(pydantic_object=Insight)

    chain = prompt_template | llm | parser

    return chain.invoke({
        "format_instruction": parser.get_format_instructions(),
        "question": question,
        "context": context
    })

The agent itself remains completely unaware of how the context was created — exactly as it should be.

Why This Architecture Matters

By modeling the vector database as a dependency:

  • RAG becomes configurable and replaceable
  • The AI layer stays clean and testable
  • Knowledge management is fully decoupled from generation

Once the vector store is treated as a regular backend resource, it naturally fits into dependency injection, lifecycle management, and clean system boundaries.
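
Replaceability here is concrete: because the vector store enters the system through a single dependency function, a test suite (or an alternative deployment) can swap it without touching any AI code. A minimal sketch, assuming the FastAPI app object is importable as app and using langchain-core's in-memory vector store with deterministic fake embeddings:

from langchain_core.embeddings import DeterministicFakeEmbedding
from langchain_core.vectorstores import InMemoryVectorStore


def init_in_memory_vector_store() -> InMemoryVectorStore:
    # No Qdrant and no OpenAI calls: retrieval runs entirely in memory.
    return InMemoryVectorStore(embedding=DeterministicFakeEmbedding(size=1536))


# FastAPI now injects the in-memory store everywhere init_qdrant_vector_store is requested.
app.dependency_overrides[init_qdrant_vector_store] = init_in_memory_vector_store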

In a production AI backend, RAG works through a vector store rather than through ad-hoc logic embedded in the AI layer. The vector store is the RAG system. Everything else simply consumes it.

Final Thoughts

RAG does not need to complicate your AI backend. When implemented via a vector store and injected like any other backend resource, it becomes predictable, scalable, and maintainable.

By separating ingestion, retrieval, and generation, you gain the freedom to evolve each part independently without turning your AI code into a tightly coupled system.

In the end, RAG is just another backend component. And treating it that way is exactly what makes it powerful.

💻 Code on GitHub: hamluk/fastapi-ai-backend/part-3
