NutriVision AI is an example application from the Qubrid AI Cookbook that demonstrates how to build a multimodal vision-language nutrition analyzer from the ground up. It uses a multimodal model to provide comprehensive nutritional insights from a food image, then lets users query those insights conversationally.
This app is more than a playful demo: it is a reference implementation showing how to integrate real multimodal inference into a practical interface, with structured outputs you can build on and extend.
Why NutriVision Matters
A lot of nutrition and diet tracking applications still rely on manually entered text. NutriVision removes that friction by letting users take or upload a photo and receive a meaningful, structured analysis automatically.
Behind the scenes, a multimodal model analyzes the image and generates a clean representation of calories, macronutrients, health score, dish name, and more. Then that structured data is used for both display and grounded follow-up conversation.
This pattern of strict structured inference plus grounded chat is powerful and generalizes well beyond nutrition. It shows how vision-language models can be applied to everyday tasks.
Overview
NutriVision supports two core capabilities:
- Image-based nutritional analysis using a multimodal model
- Context-aware follow-up conversation grounded in structured nutrition data
The system enforces strict JSON output during analysis and uses streaming for conversational interaction.
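Strict JSON output is typically enforced through the analysis prompt itself. The snippet below is only a sketch of what a prompt like DETAILED_NUTRITION_PROMPT in app.py might look like; the field names and wording are illustrative, not the app's exact prompt:

# Illustrative sketch -- the real DETAILED_NUTRITION_PROMPT in app.py may differ.
DETAILED_NUTRITION_PROMPT = """
Analyze the food in this image and respond with ONLY a JSON object, no extra text:
{
  "dish_name": "<string>",
  "calories": <number>,
  "protein_g": <number>,
  "carbs_g": <number>,
  "fat_g": <number>,
  "health_score": <integer from 1 to 10>
}
"""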
Prerequisites
Before running the application, ensure you have:
- Python 3.9 or higher to run the code
- pip for installing dependencies
- A Qubrid API key from the Qubrid dashboard to access and use the models
Clone the Repository
git clone https://github.com/QubridAI-Inc/qubrid-cookbook.git
cd qubrid-cookbook/Multimodal/nutri_vision_app
Create Virtual Environment
python -m venv venv
source venv/bin/activate # macOS/Linux
venv\Scripts\activate # Windows
Install Dependencies
pip install -r requirements.txt
Configure Environment Variables
Set your Qubrid API key so the app can authenticate inference requests:
export QUBRID_API_KEY="your_api_key_here"
Windows:
setx QUBRID_API_KEY "your_api_key_here"
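Optionally, you can make the app fail fast when the key is missing. The guard below is a sketch, not part of the cookbook code, and assumes it sits near the top of app.py:

import os
import streamlit as st

# Hypothetical guard: stop early with a clear message if the key is not set.
if not os.getenv("QUBRID_API_KEY"):
    st.error("QUBRID_API_KEY is not set. Export it before launching the app.")
    st.stop()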
Run the Application
Once the environment and key are set:
streamlit run app.py
The application will launch locally in your browser.
Multimodal API Integration
NutriVision integrates Qubrid’s multimodal endpoint for image-based nutrition analysis.
Image Analysis Call (Non-Streaming)
This function wraps the Qubrid API call:
import os
import requests

QUBRID_API_KEY = os.getenv("QUBRID_API_KEY")
BASE_URL = "https://platform.qubrid.com/v1/chat/completions"

def call_qubrid_api(messages):
    # Send a non-streaming chat completion request and return the model's text reply.
    payload = {
        "model": "your-multimodal-model-name",
        "messages": messages,
        "temperature": 0.2
    }
    headers = {
        "Authorization": f"Bearer {QUBRID_API_KEY}",
        "Content-Type": "application/json"
    }
    response = requests.post(BASE_URL, json=payload, headers=headers)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
Inside app.py, the request is constructed as:
messages = [{
    "role": "user",
    "content": DETAILED_NUTRITION_PROMPT,
    "image": st.session_state.image_base64
}]

response_text = call_qubrid_api(messages)
This call returns structured JSON containing dish name, calories, macronutrients, and health score.
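Because the model is instructed to return strict JSON, the response can be parsed and validated before it is rendered or reused. The helper below is a minimal sketch; the field names are illustrative and may not match the app's actual schema:

import json

REQUIRED_FIELDS = {"dish_name", "calories", "protein_g", "carbs_g", "fat_g", "health_score"}

def parse_nutrition_response(response_text):
    # Strip markdown code fences defensively, since some models wrap JSON in them.
    cleaned = response_text.strip().removeprefix("```json").removeprefix("```").removesuffix("```").strip()
    data = json.loads(cleaned)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"Analysis response is missing fields: {missing}")
    return data

nutrition_data = parse_nutrition_response(response_text)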
Streaming Chat Integration
After analysis, the structured nutrition data is injected into the system prompt and streamed for conversational reasoning.
Recommended Model: Qwen3-VL-30B, a high-capacity vision-language model optimized for advanced image understanding, structured extraction, OCR, and multimodal reasoning tasks.
import json

def call_qubrid_api_stream(messages):
    payload = {
        "model": "your-chat-model-name",
        "messages": messages,
        "temperature": 0.4,
        "stream": True
    }
    headers = {
        "Authorization": f"Bearer {QUBRID_API_KEY}",
        "Content-Type": "application/json"
    }
    with requests.post(BASE_URL, json=payload, headers=headers, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if line:
                decoded = line.decode("utf-8")
                if decoded.startswith("data: "):
                    chunk = decoded.removeprefix("data: ")
                    if chunk != "[DONE]":
                        # Parse the SSE payload as JSON (safer than eval) and yield the text delta.
                        yield json.loads(chunk)["choices"][0]["delta"].get("content", "")
Used in the chat layer:
full_response = ""
for chunk in call_qubrid_api_stream(api_messages):
    full_response += chunk
This enables real-time conversational responses grounded in previously parsed nutrition data.
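The grounding step itself can be as simple as serializing the parsed analysis into the system message. The snippet below is a sketch that reuses the hypothetical nutrition_data dict from the parsing helper above; user_question stands in for the visitor's follow-up message, and the app's actual prompt wording may differ:

import json

# Inject the parsed analysis into the system prompt so follow-up answers
# stay anchored to what was actually extracted from the image.
api_messages = [
    {
        "role": "system",
        "content": "You are a nutrition assistant. Answer questions using ONLY this analysis:\n"
                   + json.dumps(nutrition_data, indent=2)
    },
    {"role": "user", "content": user_question}
]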
Design Approach
NutriVision follows a deterministic inference pipeline:
- Structured constrained generation for reliable JSON output
- Dedicated parsing layer for validation
- Context injection to reduce hallucination
- Streaming for conversational UX
The model performs multimodal reasoning, while the application layer ensures reliability and usability.
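Put together, the whole pipeline fits in a few lines. The sketch below uses the hypothetical helpers from the earlier snippets and is not the exact structure of app.py:

import json

def analyze_and_chat(image_base64, user_question):
    # 1. Constrained generation: one non-streaming call that must return strict JSON.
    analysis_text = call_qubrid_api([{
        "role": "user",
        "content": DETAILED_NUTRITION_PROMPT,
        "image": image_base64
    }])

    # 2. Parsing layer: validate the JSON before it reaches the UI or the chat.
    nutrition_data = parse_nutrition_response(analysis_text)

    # 3. Context injection + streaming: ground the follow-up conversation.
    grounded = [
        {"role": "system", "content": "Answer using ONLY this analysis:\n" + json.dumps(nutrition_data)},
        {"role": "user", "content": user_question}
    ]
    return "".join(call_qubrid_api_stream(grounded))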
Real-World Applications
Although NutriVision focuses on nutrition, the general pattern it implements (vision input + structured generation + context-aware chat) can be applied to many domains:
- Health and fitness tracking tools
- Diet coaching assistants
- Industrial quality inspection
- Medical image interpretation
- Educational visual assistants
The Qubrid Cookbook contains other multimodal examples that apply this same pattern to different use cases.
Where to Learn More
This app is part of a broader set of cookbooks provided by Qubrid AI, offering examples ranging from OCR agents to reasoning chatbots.
👉 Explore the full source code and related projects in our cookbooks.
👉 Watch implementation tutorials and walkthroughs on YouTube for step-by-step demos and model integrations.
Thanks for Reading!
If you found this helpful, feel free to like the post 👍 and star ⭐ the repository, try the app, and experiment with your own multimodal builds using Qubrid AI. We’d love to see what you create!


