NutriVision AI is an example application from the Qubrid AI Cookbook that demonstrates how to build a multimodal vision-language nutrition analyzer from the ground up. It uses a multimodal model to provide comprehensive nutritional insights from a food image, then lets users query those insights conversationally.
This app is more than a playful demo: it is a reference implementation showing how to integrate real multimodal inference into a practical interface, with structured outputs you can build on and extend.
Why NutriVision Matters
A lot of nutrition and diet tracking applications still rely on manually entered text. NutriVision removes that friction by letting users take or upload a photo and receive a meaningful, structured analysis automatically.
Behind the scenes, a multimodal model analyzes the image and generates a clean representation of calories, macronutrients, health score, dish name, and more. Then that structured data is used for both display and grounded follow-up conversation.
This pattern of strict structured inference plus grounded chat is powerful and generalizes well beyond nutrition. It shows how vision-language models can be applied to everyday tasks.
Overview
NutriVision supports two core capabilities:
- Image-based nutritional analysis using a multimodal model
- Context-aware follow-up conversation grounded in structured nutrition data
The system enforces strict JSON output during analysis and uses streaming for conversational interaction.
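Strict JSON output is typically enforced through the analysis prompt itself. The snippet below is only a sketch of what a prompt like DETAILED_NUTRITION_PROMPT in app.py might look like; the field names and wording are illustrative, not the app's exact prompt:

# Illustrative sketch -- the real DETAILED_NUTRITION_PROMPT in app.py may differ.
DETAILED_NUTRITION_PROMPT = """
Analyze the food in this image and respond with ONLY a JSON object, no extra text:
{
  "dish_name": "<string>",
  "calories": <number>,
  "protein_g": <number>,
  "carbs_g": <number>,
  "fat_g": <number>,
  "health_score": <integer from 1 to 10>
}
"""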
Prerequisites
Before running the application, ensure you have:
- Python 3.9 or higher to run the code
- pip for installing dependencies
- A Qubrid API key from the Qubrid dashboard to access and use the models
Clone the Repository
git clone https://github.com/QubridAI-Inc/qubrid-cookbook.git
cd qubrid-cookbook/Multimodal/nutri_vision_app
Create Virtual Environment
python -m venv venv
source venv/bin/activate # macOS/Linux
venv\Scripts\activate # Windows
Install Dependencies
pip install -r requirements.txt
Configure Environment Variables
Set your Qubrid API key so the app can authenticate inference requests:
export QUBRID_API_KEY="your_api_key_here"
Windows:
setx QUBRID_API_KEY "your_api_key_here"
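Optionally, you can make the app fail fast when the key is missing. The guard below is a sketch, not part of the cookbook code, and assumes it sits near the top of app.py:

import os
import streamlit as st

# Hypothetical guard: stop early with a clear message if the key is not set.
if not os.getenv("QUBRID_API_KEY"):
    st.error("QUBRID_API_KEY is not set. Export it before launching the app.")
    st.stop()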
Run the Application
Once the environment and key are set:
streamlit run app.py
The application will launch locally in your browser.
Multimodal API Integration
NutriVision integrates Qubrid’s multimodal endpoint for image-based nutrition analysis.
Image Analysis Call (Non-Streaming)
This function wraps the Qubrid API call:
import os
import requests

QUBRID_API_KEY = os.getenv("QUBRID_API_KEY")
BASE_URL = "https://platform.qubrid.com/v1/chat/completions"

def call_qubrid_api(messages):
    # Send a non-streaming chat completion request and return the model's text reply.
    payload = {
        "model": "your-multimodal-model-name",
        "messages": messages,
        "temperature": 0.2
    }
    headers = {
        "Authorization": f"Bearer {QUBRID_API_KEY}",
        "Content-Type": "application/json"
    }
    response = requests.post(BASE_URL, json=payload, headers=headers)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
Inside app.py, the request is constructed as:
messages = [{
    "role": "user",
    "content": DETAILED_NUTRITION_PROMPT,
    "image": st.session_state.image_base64
}]

response_text = call_qubrid_api(messages)
This call returns structured JSON containing dish name, calories, macronutrients, and health score.
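Because the model is instructed to return strict JSON, the response can be parsed and validated before it is rendered or reused. The helper below is a minimal sketch; the field names are illustrative and may not match the app's actual schema:

import json

REQUIRED_FIELDS = {"dish_name", "calories", "protein_g", "carbs_g", "fat_g", "health_score"}

def parse_nutrition_response(response_text):
    # Strip markdown code fences defensively, since some models wrap JSON in them.
    cleaned = response_text.strip().removeprefix("```json").removeprefix("```").removesuffix("```").strip()
    data = json.loads(cleaned)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"Analysis response is missing fields: {missing}")
    return data

nutrition_data = parse_nutrition_response(response_text)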
Streaming Chat Integration
After analysis, the structured nutrition data is injected into the system prompt and streamed for conversational reasoning.
Recommended Model: Qwen3-VL-30B, a high-capacity vision-language model optimized for advanced image understanding, structured extraction, OCR, and multimodal reasoning tasks.
import json

def call_qubrid_api_stream(messages):
    payload = {
        "model": "your-chat-model-name",
        "messages": messages,
        "temperature": 0.4,
        "stream": True
    }
    headers = {
        "Authorization": f"Bearer {QUBRID_API_KEY}",
        "Content-Type": "application/json"
    }
    with requests.post(BASE_URL, json=payload, headers=headers, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if line:
                decoded = line.decode("utf-8")
                if decoded.startswith("data: "):
                    chunk = decoded.removeprefix("data: ")
                    if chunk != "[DONE]":
                        # Parse the SSE payload as JSON (safer than eval) and yield the text delta.
                        yield json.loads(chunk)["choices"][0]["delta"].get("content", "")
Used in the chat layer:
full_response = ""
for chunk in call_qubrid_api_stream(api_messages):
    full_response += chunk
This enables real-time conversational responses grounded in previously parsed nutrition data.
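The grounding step itself can be as simple as serializing the parsed analysis into the system message. The snippet below is a sketch that reuses the hypothetical nutrition_data dict from the parsing helper above; user_question stands in for the visitor's follow-up message, and the app's actual prompt wording may differ:

import json

# Inject the parsed analysis into the system prompt so follow-up answers
# stay anchored to what was actually extracted from the image.
api_messages = [
    {
        "role": "system",
        "content": "You are a nutrition assistant. Answer questions using ONLY this analysis:\n"
                   + json.dumps(nutrition_data, indent=2)
    },
    {"role": "user", "content": user_question}
]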
Design Approach
NutriVision follows a deterministic inference pipeline:
- Structured constrained generation for reliable JSON output
- Dedicated parsing layer for validation
- Context injection to reduce hallucination
- Streaming for conversational UX
The model performs multimodal reasoning, while the application layer ensures reliability and usability.
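Put together, the whole pipeline fits in a few lines. The sketch below uses the hypothetical helpers from the earlier snippets and is not the exact structure of app.py:

import json

def analyze_and_chat(image_base64, user_question):
    # 1. Constrained generation: one non-streaming call that must return strict JSON.
    analysis_text = call_qubrid_api([{
        "role": "user",
        "content": DETAILED_NUTRITION_PROMPT,
        "image": image_base64
    }])

    # 2. Parsing layer: validate the JSON before it reaches the UI or the chat.
    nutrition_data = parse_nutrition_response(analysis_text)

    # 3. Context injection + streaming: ground the follow-up conversation.
    grounded = [
        {"role": "system", "content": "Answer using ONLY this analysis:\n" + json.dumps(nutrition_data)},
        {"role": "user", "content": user_question}
    ]
    return "".join(call_qubrid_api_stream(grounded))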
Real-World Applications
Although NutriVision focuses on nutrition, the general pattern it implements (vision input + structured generation + context-aware chat) can be applied to many domains:
- Health and fitness tracking tools
- Diet coaching assistants
- Industrial quality inspection
- Medical image interpretation
- Educational visual assistants
The Qubrid Cookbook contains other multimodal examples that apply this same pattern to different use cases.
Where to Learn More
This app is part of a broader set of cookbooks provided by Qubrid AI, offering examples ranging from OCR agents to reasoning chatbots.
👉 Explore the full source code and related projects in our cookbooks.
👉 Watch implementation tutorials and walkthroughs on YouTube for step-by-step demos and model integrations.
Thanks for Reading!
If you found this helpful, feel free to like the post 👍 and star ⭐ the repository, try the app, and experiment with your own multimodal builds using Qubrid AI. We’d love to see what you create!


