Build production RAG systems in Node.js – Know where it breaks, why it works, and when to use it
Introduction: Why Node.js for RAG?
👦 Nephew: Uncle, why would I build RAG in Node.js? I thought this was AI stuff?
👨🦳 Uncle: Good question. Node.js is perfect for RAG because:
- Fast I/O (API calls to Claude, PostgreSQL queries)
- Real-time capable (WebSockets for streaming)
- Easy deployment (Vercel, Railway, Render)
- Async/await makes complex flows clean
Plus, you probably already have Node.js running your backend. Why add Python?
👦 Nephew: So I can build the whole thing in JavaScript?
👨🦳 Uncle: Yes. Frontend, backend, RAG – all JavaScript. That’s the beauty.
But we need to be honest about limitations. Let’s talk about that too.
PART 1: PROJECT SETUP
Initialize Your Project
# Create project
mkdir rag-system
cd rag-system
# Initialize Node.js
npm init -y
# Install dependencies
npm install express dotenv @anthropic-ai/sdk pg pg-promise cors body-parser
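npm install --save-dev nodemon typescript ts-node @types/node @types/express @types/cors @types/compression
# Logging, security headers, compression (imported by the server and logger code below)
npm install winston helmet compression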
Project Structure
rag-system/
├── src/
│ ├── config/
│ │ ├── database.ts # PostgreSQL + pgvector setup
│ │ └── embedding.ts # Claude embeddings
│ ├── services/
│ │ ├── retrieval.ts # Vector search logic
│ │ ├── reranking.ts # Two-stage ranking
│ │ ├── queryProcessing.ts # Query expansion
│ │ └── safety.ts # Hallucination prevention
│ ├── routes/
│ │ └── rag.ts # API endpoints
│ ├── utils/
│ │ ├── logger.ts # Logging (critical for debugging)
│ │ └── metrics.ts # Track recall, precision
│ └── index.ts # Main server
├── .env # Secrets
├── package.json
└── tsconfig.json
TypeScript Configuration
{
"compilerOptions": {
"target": "ES2020",
"module": "commonjs",
"lib": ["ES2020"],
"outDir": "./dist",
"rootDir": "./src",
"strict": true,
"esModuleInterop": true,
"skipLibCheck": true,
"forceConsistentCasingInFileNames": true
}
}
PART 2: DATABASE SETUP (PostgreSQL + pgvector)
1. Create PostgreSQL Database with pgvector
👨🦳 Uncle: This is your foundation. Get it wrong, everything breaks.
-- Connect to PostgreSQL
psql -U postgres
-- Create database
CREATE DATABASE rag_system;
-- Connect to the database
\c rag_system
-- Install pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create resumes table
CREATE TABLE resumes (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL,
candidate_name VARCHAR(255) NOT NULL,
raw_text TEXT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- CRITICAL: tenant isolation
CONSTRAINT tenant_isolation UNIQUE(tenant_id, id)
);
-- Create chunks table (where vectors live)
CREATE TABLE resume_chunks (
id SERIAL PRIMARY KEY,
resume_id UUID NOT NULL REFERENCES resumes(id) ON DELETE CASCADE,
tenant_id UUID NOT NULL,
chunk_text TEXT NOT NULL,
chunk_index INTEGER NOT NULL,
-- The vector: 1536 dimensions (must match the output size of your embedding model)
embedding vector(1536) NOT NULL,
-- Metadata for debugging
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- CRITICAL: Always check tenant
-- (run the CREATE TABLE tenants statement below first, or this foreign key fails)
CONSTRAINT tenant_isolation_chunks
FOREIGN KEY (tenant_id) REFERENCES tenants(id)
);
-- Create indexes
-- 1. Vector index for fast search (MOST IMPORTANT)
-- ivfflat derives its clusters from existing rows, so build it after loading some data
CREATE INDEX idx_resume_chunks_embedding
ON resume_chunks USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
-- 2. Tenant index (security)
CREATE INDEX idx_resume_chunks_tenant
ON resume_chunks(tenant_id, resume_id);
-- 3. Text search index (keyword search)
CREATE INDEX idx_resume_chunks_text
ON resume_chunks USING GIN (to_tsvector('english', chunk_text));
-- Create tenants table (multi-tenancy)
CREATE TABLE tenants (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name VARCHAR(255) NOT NULL,
api_key VARCHAR(255) UNIQUE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Create query logs (for metrics)
CREATE TABLE query_logs (
id SERIAL PRIMARY KEY,
tenant_id UUID NOT NULL REFERENCES tenants(id),
query TEXT NOT NULL,
latency_ms INTEGER NOT NULL,
recall DECIMAL(3,2),
precision DECIMAL(3,2),
cost_cents INTEGER,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Create index on query logs for analytics
CREATE INDEX idx_query_logs_tenant
ON query_logs(tenant_id, created_at DESC);
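The `embedding vector(1536)` column only accepts vectors of exactly that length, and pgvector's text input format is a bracketed, comma-separated list. A minimal sketch of a helper that checks both before a value reaches SQL; `toVectorLiteral` and `EXPECTED_DIMENSIONS` are illustrative names, not part of pgvector or of the code in this guide:

```typescript
// Hypothetical helper: the name and the constant are assumptions for illustration.
const EXPECTED_DIMENSIONS = 1536; // must match the vector(1536) column

function toVectorLiteral(embedding: number[], dimensions: number = EXPECTED_DIMENSIONS): string {
  if (embedding.length !== dimensions) {
    throw new Error(`Expected ${dimensions} dimensions, got ${embedding.length}`);
  }
  if (!embedding.every(v => Number.isFinite(v))) {
    throw new Error('Embedding contains a non-finite value');
  }
  // pgvector parses text such as "[0.1,0.2,0.3]" when the parameter is cast with ::vector
  return `[${embedding.join(',')}]`;
}
```

Failing here gives a clearer error than a dimension mismatch raised by the database in the middle of an upload.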
2. Database Connection Service
👨🦳 Uncle: This is where your first failure point lives.
// src/config/database.ts
import pgPromise from 'pg-promise';
import dotenv from 'dotenv';
import logger from '../utils/logger';
dotenv.config();
const initOptions = {
// Detailed error info (critical for debugging)
error(error: any, context: any) {
logger.error('Database Error', {
error: error.message,
query: context.query,
params: context.params
});
},
// Connection events
connect(client: any) {
logger.info('Database connected');
},
disconnect(client: any) {
logger.info('Database disconnected');
}
};
const pgp = pgPromise(initOptions);
const db = pgp({
host: process.env.DB_HOST || 'localhost',
port: parseInt(process.env.DB_PORT || '5432'),
database: process.env.DB_NAME || 'rag_system',
user: process.env.DB_USER || 'postgres',
password: process.env.DB_PASSWORD,
// Connection pooling
max: 20,
// Timeout after 5 seconds
connectionTimeoutMillis: 5000,
// Idle timeout
idleTimeoutMillis: 30000,
});
// Test connection on startup
export async function initializeDatabase() {
try {
await db.one('SELECT 1');
logger.info('✓ Database connection verified');
} catch (error) {
logger.error('✗ Database connection failed', { error });
process.exit(1);
}
}
export default db;
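`parseInt(process.env.DB_PORT || '5432')` silently yields `NaN` for a malformed port, and a missing `DB_PASSWORD` only surfaces when the first connection fails. A small sketch of validating the settings up front; `readDbConfig` is a hypothetical helper, and it takes the environment as a parameter so it can be exercised without touching `process.env`:

```typescript
interface DbConfig {
  host: string;
  port: number;
  database: string;
  user: string;
  password: string;
}

type EnvVars = Record<string, string | undefined>;

// Hypothetical helper: validates connection settings before the pool is created.
function readDbConfig(env: EnvVars): DbConfig {
  const port = Number.parseInt(env.DB_PORT ?? '5432', 10);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid DB_PORT: ${env.DB_PORT}`);
  }
  if (!env.DB_PASSWORD) {
    throw new Error('DB_PASSWORD is not set');
  }
  return {
    host: env.DB_HOST ?? 'localhost',
    port,
    database: env.DB_NAME ?? 'rag_system',
    user: env.DB_USER ?? 'postgres',
    password: env.DB_PASSWORD,
  };
}
```

In `database.ts` this could be called as `readDbConfig(process.env)` and its result passed to `pgp(...)` together with the pool options.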
3. Environment Variables
# .env
DB_HOST=localhost
DB_PORT=5432
DB_NAME=rag_system
DB_USER=postgres
DB_PASSWORD=your_secure_password
ANTHROPIC_API_KEY=sk-ant-...
ANTHROPIC_MODEL=claude-3-5-sonnet-20241022
NODE_ENV=production
PORT=3000
# Security
ADMIN_API_KEY=super-secret-key-change-this
PART 3: EMBEDDING SERVICE
Create Embeddings with Claude
👨🦳 Uncle: This is where the first real cost happens. Know what can fail here.
// src/config/embedding.ts
import Anthropic from '@anthropic-ai/sdk';
import logger from '../utils/logger';
const client = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY
});
interface EmbeddingResult {
text: string;
embedding: number[];
}
/**
* Get embeddings for text chunks.
*
* ⚠️ FAILURE POINTS:
* 1. API rate limit (429) - retried after a fixed 5-second wait (use exponential backoff in production)
* 2. Token too long (4096 tokens max) - long texts are truncated to 16,000 chars
* 3. Network timeout - not retried here; only 429 responses are
* 4. Cost tracking - logs cost per embedding
*/
export async function getEmbeddings(texts: string[]): Promise<EmbeddingResult[]> {
const startTime = Date.now();
try {
// VALIDATION: Prevent token overrun
// Rough estimate: ~1 token = 4 chars of English text
const validTexts = texts.map(text => {
if (text.length > 16000) { // ~4000 tokens
logger.warn('Text truncated for embedding', {
originalLength: text.length,
truncatedTo: 16000
});
return text.substring(0, 16000);
}
return text;
});
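// Call Claude API for embeddings
// NOTE: Anthropic's API has no dedicated embeddings endpoint. Asking a chat model
// to write out vectors, as below, will not produce stable semantic embeddings, and
// 1536 numbers per text will not fit in max_tokens: 1024. Treat this call as a
// placeholder and swap in a real embedding model that returns 1536-dimension vectors.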
const response = await client.messages.create({
model: 'claude-3-5-sonnet-20241022',
max_tokens: 1024,
messages: [{
role: 'user',
content: `Generate embeddings for the following texts. Return ONLY valid JSON array with "embeddings" key containing array of number arrays.
Texts:
${validTexts.map((t, i) => `${i}: ${t}`).join('\n\n')}
Return format: {"embeddings": [[...], [...], ...]}`
}]
});
// Parse response
const responseText = response.content[0].type === 'text'
? response.content[0].text
: '';
let embeddings: number[][];
try {
const parsed = JSON.parse(responseText);
embeddings = parsed.embeddings || [];
} catch (parseError) {
logger.error('Failed to parse embeddings response', {
response: responseText.substring(0, 500)
});
throw new Error('Invalid embeddings response format');
}
// Validate embeddings
if (embeddings.length !== validTexts.length) {
throw new Error(
`Embedding count mismatch: got ${embeddings.length}, expected ${validTexts.length}`
);
}
// Calculate cost (Claude 3.5 Sonnet: $3 per 1M input tokens)
const inputTokens = response.usage.input_tokens;
const costCents = (inputTokens / 1_000_000) * 3 * 100;
const latency = Date.now() - startTime;
logger.info('Embeddings generated', {
count: embeddings.length,
latency,
inputTokens,
costCents: costCents.toFixed(4)
});
return validTexts.map((text, i) => ({
text,
embedding: embeddings[i]
}));
} catch (error: any) {
logger.error('Embedding API error', {
error: error.message,
status: error.status
});
// Retry logic for rate limits
if (error.status === 429) {
logger.warn('Rate limited. Waiting before retry...');
await new Promise(resolve => setTimeout(resolve, 5000));
return getEmbeddings(texts); // Exponential backoff in real system
}
throw error;
}
}
/**
* Embed a single text (convenience function)
*/
export async function embedText(text: string): Promise<number[]> {
const results = await getEmbeddings([text]);
return results[0].embedding;
}
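The 429 handler in `getEmbeddings` waits a fixed five seconds and then calls itself again with no limit, so a sustained rate limit would retry forever. A minimal sketch of capped exponential backoff; `backoffDelayMs` and `withRetry` are hypothetical helpers, and the base delay, cap, and attempt count are assumed values to tune:

```typescript
// Delay for a given attempt (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number = 1000, maxMs: number = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Retry an async operation while it fails with HTTP 429, up to maxAttempts tries in total.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number = 5,
  sleep: (ms: number) => Promise<void> = ms => new Promise<void>(resolve => setTimeout(resolve, ms))
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      const status = (error as { status?: number } | undefined)?.status;
      const isLastAttempt = attempt >= maxAttempts - 1;
      // Anything other than a rate limit, or running out of attempts, is rethrown.
      if (status !== 429 || isLastAttempt) throw error;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

With these defaults the waits are 1s, 2s, 4s, and 8s before the fifth and final attempt. The `sleep` parameter exists only so the delay can be replaced in tests.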
Chunking Function
👨🦳 Uncle: Remember: 1000-1500 tokens, 200-token overlap.
// src/utils/chunking.ts
import logger from './logger';
interface Chunk {
text: string;
index: number;
tokens: number;
}
/**
* Break text into chunks with sliding window overlap.
*
* ⚠️ FAILURE POINTS:
* 1. Overlap larger than chunk size
* 2. Single chunk can't hold meaningful text
*/
export function chunkText(
text: string,
windowTokens: number = 1000,
overlapTokens: number = 200
): Chunk[] {
// Simple tokenization (1 token ≈ 4 chars for English)
const estimatedTokens = Math.ceil(text.length / 4);
if (estimatedTokens < windowTokens) {
// Text is smaller than chunk size
logger.debug('Text smaller than chunk window', {
estimatedTokens,
windowTokens
});
return [{
text,
index: 0,
tokens: estimatedTokens
}];
}
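// Calculate character window (1 token ≈ 4 chars)
const charWindow = windowTokens * 4;
const charOverlap = overlapTokens * 4;
const step = charWindow - charOverlap;
// Guard against failure point 1: a non-positive step would never advance the loop
if (step <= 0) {
throw new Error('overlapTokens must be smaller than windowTokens');
}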
const chunks: Chunk[] = [];
let index = 0;
for (let i = 0; i < text.length; i += step) {
let end = i + charWindow;
// Find sentence boundary to avoid splitting mid-sentence
if (end < text.length) {
const periodIndex = text.lastIndexOf('.', end);
const newlineIndex = text.lastIndexOf('\n', end);
const boundaryIndex = Math.max(periodIndex, newlineIndex);
if (boundaryIndex > i + (charWindow * 0.8)) {
// Found good boundary
end = boundaryIndex + 1;
}
} else {
end = text.length;
}
const chunk = text.substring(i, end).trim();
if (chunk.length > 0) {
chunks.push({
text: chunk,
index,
tokens: Math.ceil(chunk.length / 4)
});
index++;
}
// Stop if we've reached the end
if (end >= text.length) break;
}
logger.debug('Text chunked', {
originalLength: text.length,
chunkCount: chunks.length,
avgChunkTokens: Math.round(
chunks.reduce((sum, c) => sum + c.tokens, 0) / chunks.length
)
});
return chunks;
}
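With the defaults above, a 1000-token window becomes 4000 characters, the 200-token overlap becomes 800 characters, and the window advances 3200 characters per step. A rough sketch of the resulting chunk count, using the same 4-characters-per-token estimate; `estimateChunkCount` is a hypothetical helper for capacity planning, and it ignores whitespace-only chunks, which `chunkText` skips:

```typescript
// Hypothetical planning helper mirroring the window arithmetic in chunkText.
function estimateChunkCount(textLength: number, windowTokens: number = 1000, overlapTokens: number = 200): number {
  if (overlapTokens >= windowTokens) {
    throw new Error('overlapTokens must be smaller than windowTokens');
  }
  const charWindow = windowTokens * 4;
  const step = (windowTokens - overlapTokens) * 4;
  if (textLength <= charWindow) return 1;
  // One chunk for the first window, plus one per additional step needed to reach the end.
  return 1 + Math.ceil((textLength - charWindow) / step);
}
```

For example, a 10,000-character resume yields three chunks, starting at offsets 0, 3200, and 6400. Each chunk costs one embedding, so this number feeds directly into the upload cost.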
PART 4: RETRIEVAL SERVICE
Vector Search with Hybrid (Vector + Keyword)
👨🦳 Uncle: This is the heart. Where everything lives or dies.
// src/services/retrieval.ts
import db from '../config/database';
import { embedText } from '../config/embedding';
import logger from '../utils/logger';
interface RetrievalResult {
chunkText: string;
chunkIndex: number;
vectorDistance: number;
keywordScore: number;
combinedScore: number;
}
/**
* Retrieve relevant chunks using hybrid search.
*
* ⚠️ FAILURE POINTS:
* 1. Missing tenant_id check → DATA BREACH
* 2. Vector index not built → Slow queries (10s+ instead of 100ms)
* 3. Query too long → API error
* 4. No results → Need to handle gracefully
* 5. Typos in query → Keyword search might fail
*/
export async function hybridSearch(
tenantId: string,
resumeId: string,
query: string,
topK: number = 5
): Promise<RetrievalResult[]> {
const startTime = Date.now();
try {
// Validate inputs
if (!tenantId || !resumeId) {
throw new Error('tenant_id and resume_id are required');
}
if (query.length === 0) {
throw new Error('Query cannot be empty');
}
if (query.length > 500) {
logger.warn('Query truncated', { originalLength: query.length });
query = query.substring(0, 500);
}
// Step 1: Get query embedding
logger.debug('Embedding query', { query });
const queryEmbedding = await embedText(query);
// Step 2: Vector search (fast)
// Convert embedding to PostgreSQL format: [0.1, 0.2, ...]
const embeddingString = `[${queryEmbedding.join(',')}]`;
const vectorResults = await db.manyOrNone(`
SELECT
chunk_text,
chunk_index,
embedding <=> $1::vector AS vector_distance
FROM resume_chunks
WHERE
tenant_id = $2
AND resume_id = $3
ORDER BY vector_distance ASC
LIMIT $4
`, [embeddingString, tenantId, resumeId, topK * 2]); // Get 2x to filter
if (vectorResults.length === 0) {
logger.warn('No vector results found', { query, resumeId });
return [];
}
// Step 3: Keyword search (precision)
// Chunks that also match the query keywords get a score boost in Step 4
const keywordResults = await db.manyOrNone(`
SELECT
chunk_text,
chunk_index,
ts_rank(
to_tsvector('english', chunk_text),
plainto_tsquery('english', $1)
) AS keyword_score
FROM resume_chunks
WHERE
tenant_id = $2
AND resume_id = $3
AND to_tsvector('english', chunk_text) @@
plainto_tsquery('english', $1)
ORDER BY keyword_score DESC
LIMIT $4
`, [query, tenantId, resumeId, topK]);
// Step 4: Combine results
// Chunks that appear in both vector AND keyword search are best
const combined = vectorResults
.map(vr => {
const kr = keywordResults.find(k => k.chunk_index === vr.chunk_index);
const keywordScore = kr ? kr.keyword_score : 0;
return {
// Map snake_case columns to the camelCase RetrievalResult fields
chunkText: vr.chunk_text,
chunkIndex: vr.chunk_index,
vectorDistance: vr.vector_distance,
keywordScore,
// Weighted score: 60% vector, 40% keyword
combinedScore: (1 - vr.vector_distance) * 0.6 + keywordScore * 0.4
};
})
.sort((a, b) => b.combinedScore - a.combinedScore)
.slice(0, topK);
const latency = Date.now() - startTime;
logger.info('Hybrid search complete', {
query,
resultsCount: combined.length,
latency,
vectorResultsCount: vectorResults.length,
keywordResultsCount: keywordResults.length
});
// Log for metrics
if (combined.length > 0) {
await db.none(`
INSERT INTO query_logs (tenant_id, query, latency_ms)
VALUES ($1, $2, $3)
`, [tenantId, query.substring(0, 255), latency]);
}
return combined as RetrievalResult[];
} catch (error: any) {
logger.error('Retrieval error', {
error: error.message,
query,
resumeId,
tenantId
});
throw error;
}
}
/**
* Multi-query retrieval - search with multiple variations.
*
* Better recall, but slower and more expensive.
*/
export async function multiQueryRetrieval(
tenantId: string,
resumeId: string,
queries: string[],
topK: number = 5
): Promise<RetrievalResult[]> {
try {
const allResults: RetrievalResult[] = [];
for (const query of queries) {
const results = await hybridSearch(tenantId, resumeId, query, topK * 2);
allResults.push(...results);
}
// Deduplicate by chunk text, keep highest score
const unique = Array.from(
allResults
.reduce((map, item) => {
const existing = map.get(item.chunkText);
if (!existing || item.combinedScore > existing.combinedScore) {
map.set(item.chunkText, item);
}
return map;
}, new Map<string, RetrievalResult>())
.values()
);
return unique
.sort((a, b) => b.combinedScore - a.combinedScore)
.slice(0, topK);
} catch (error) {
logger.error('Multi-query retrieval error', { error });
throw error;
}
}
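The 60/40 weighting in `hybridSearch` adds `1 - vector_distance` directly to the raw `ts_rank` value. The two are on different scales (`ts_rank` values are typically small fractions), so the keyword term may contribute far less than its 40% weight suggests. One possible adjustment, sketched here as an assumption rather than what the code above does, is to scale keyword scores by the best keyword score in the result set before weighting. `fuseScores`, `ScoredChunk`, and `FusedChunk` are illustrative names:

```typescript
interface ScoredChunk {
  chunkIndex: number;
  vectorDistance: number; // cosine distance from pgvector's <=> operator
  keywordScore: number;   // raw ts_rank value, 0 when there was no keyword match
}

interface FusedChunk extends ScoredChunk {
  combinedScore: number;
}

// Scale keyword scores to 0..1 relative to the best match, then apply the weights.
function fuseScores(chunks: ScoredChunk[], vectorWeight: number = 0.6): FusedChunk[] {
  const maxKeyword = Math.max(0, ...chunks.map(c => c.keywordScore));
  return chunks
    .map(c => {
      const vectorSimilarity = 1 - c.vectorDistance;
      const keywordNormalized = maxKeyword > 0 ? c.keywordScore / maxKeyword : 0;
      return {
        ...c,
        combinedScore: vectorSimilarity * vectorWeight + keywordNormalized * (1 - vectorWeight),
      };
    })
    .sort((a, b) => b.combinedScore - a.combinedScore);
}
```

Take three chunks with distances 0.2, 0.3, 0.5 and keyword scores 0, 0.08, 0.04. The raw formula ranks the first chunk (no keyword match) highest at 0.48, ahead of 0.452 for the second. With scaling, the second chunk, which has the strongest keyword match, moves to the top at 0.82.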
PART 5: RERANKING SERVICE
👨🦳 Uncle: Two-stage is where quality happens. First stage is fast, second is accurate.
// src/services/reranking.ts
import Anthropic from '@anthropic-ai/sdk';
import logger from '../utils/logger';
interface RerankedResult {
text: string;
score: number;
rank: number;
}
/**
* Rerank chunks using Claude (more accurate but slower).
*
* ⚠️ FAILURE POINTS:
* 1. Claude API timeout (fix with timeout wrapper)
* 2. Chunks too long (truncate before sending)
* 3. Response parsing fails
* 4. Cost explosion (reranking costs money - track it)
*/
export async function rerank(
query: string,
chunks: string[],
topK: number = 5
): Promise<RerankedResult[]> {
const startTime = Date.now();
try {
if (chunks.length === 0) {
return [];
}
const client = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY,
timeout: 30 * 1000, // 30 second timeout
});
// Truncate chunks to prevent token overflow
const truncatedChunks = chunks.map(c =>
c.length > 2000 ? c.substring(0, 2000) + '...' : c
);
// Build reranking prompt
const chunksFormatted = truncatedChunks
.map((chunk, i) => `[${i}] ${chunk}`)
.join('\n\n---\n\n');
const prompt = `You are a search relevance expert. Rank the following chunks by relevance to the query.
Query: "${query}"
Chunks to rank:
${chunksFormatted}
Return ONLY valid JSON with this format:
{
"rankings": [
{"index": 0, "relevance_score": 0.95},
{"index": 1, "relevance_score": 0.72}
]
}
Relevance score: 0.0 (irrelevant) to 1.0 (highly relevant)
Sort by relevance_score descending.`;
const response = await client.messages.create({
model: 'claude-3-5-sonnet-20241022',
max_tokens: 1024,
messages: [{
role: 'user',
content: prompt
}]
});
// Parse response
const responseText = response.content[0].type === 'text'
? response.content[0].text
: '';
let rankings: any[];
try {
// Extract JSON from response (might be wrapped in markdown)
const jsonMatch = responseText.match(/\{[\s\S]*\}/);
const jsonStr = jsonMatch ? jsonMatch[0] : responseText;
const parsed = JSON.parse(jsonStr);
rankings = parsed.rankings || [];
} catch (parseError) {
logger.error('Failed to parse reranking response', {
response: responseText.substring(0, 500)
});
// Fallback: return original order
return chunks.slice(0, topK).map((text, i) => ({
text,
score: 1.0 - (i * 0.1),
rank: i + 1
}));
}
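// Convert to results, sorted by score (the model may not return them in order)
const results = rankings
.filter(r => Number.isInteger(r.index) && r.index >= 0 && r.index < chunks.length)
.sort((a, b) => b.relevance_score - a.relevance_score)
.map((r, rank) => ({
text: chunks[r.index],
score: r.relevance_score,
rank: rank + 1
}))
.slice(0, topK);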
const latency = Date.now() - startTime;
logger.info('Reranking complete', {
query,
inputCount: chunks.length,
outputCount: results.length,
latency,
topScore: results[0]?.score
});
return results;
} catch (error: any) {
logger.error('Reranking error', {
error: error.message,
chunksCount: chunks.length,
query: query.substring(0, 100)
});
// Fallback: return original order
return chunks.slice(0, topK).map((text, i) => ({
text,
score: 1.0 - (i * 0.1),
rank: i + 1
}));
}
}
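Failure point 3 (response parsing) covers more than invalid JSON: the model can also return duplicate indices, indices outside the chunk list, or scores that are not numbers. A hedged sketch of a stricter parser that could sit between the API call and the result mapping; `parseRankings` and `ChunkRanking` are hypothetical names:

```typescript
interface ChunkRanking {
  index: number;
  relevance_score: number;
}

// Extract the outermost {...} block from a model reply and keep only well-formed rankings.
function parseRankings(responseText: string, chunkCount: number): ChunkRanking[] {
  const jsonMatch = responseText.match(/\{[\s\S]*\}/);
  if (!jsonMatch) return [];
  let parsed: unknown;
  try {
    parsed = JSON.parse(jsonMatch[0]);
  } catch (_e) {
    return [];
  }
  const raw = (parsed as { rankings?: unknown } | null)?.rankings;
  if (!Array.isArray(raw)) return [];

  const seen = new Set<number>();
  const valid: ChunkRanking[] = [];
  for (const item of raw) {
    const index = item?.index;
    const score = item?.relevance_score;
    const indexOk = Number.isInteger(index) && index >= 0 && index < chunkCount && !seen.has(index);
    const scoreOk = typeof score === 'number' && score >= 0 && score <= 1;
    if (indexOk && scoreOk) {
      seen.add(index);
      valid.push({ index, relevance_score: score });
    }
  }
  // Highest relevance first, regardless of the order the model used.
  return valid.sort((a, b) => b.relevance_score - a.relevance_score);
}
```

An empty return value can then trigger the same original-order fallback that `rerank` already uses.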
PART 6: QUERY PROCESSING
👨🦳 Uncle: Expand the query so you find more relevant chunks.
// src/services/queryProcessing.ts
import Anthropic from '@anthropic-ai/sdk';
import logger from '../utils/logger';
/**
* Expand a query into related search terms.
*
* ⚠️ FAILURE POINTS:
* 1. LLM generates irrelevant expansions
* 2. Original query lost in expansion
* 3. Too many expansions → slow retrieval
*/
export async function expandQuery(originalQuery: string): Promise<string[]> {
try {
const client = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY
});
const response = await client.messages.create({
model: 'claude-3-5-sonnet-20241022',
max_tokens: 200,
messages: [{
role: 'user',
content: `Given this query about a job candidate, generate 2-3 alternative phrasings or related concepts that would help find relevant information.
Original query: "${originalQuery}"
Return ONLY a JSON array of strings:
["alternative1", "alternative2", "alternative3"]
These should help find the same information using different keywords.`
}]
});
const responseText = response.content[0].type === 'text'
? response.content[0].text
: '';
let expansions: string[];
try {
expansions = JSON.parse(responseText);
} catch (e) {
logger.warn('Failed to parse query expansion', { response: responseText });
return [originalQuery];
}
// Always include original query
const allQueries = [originalQuery, ...expansions].filter(Boolean);
logger.debug('Query expanded', {
original: originalQuery,
expansions: allQueries.length
});
return allQueries;
} catch (error) {
logger.error('Query expansion error', { error });
return [originalQuery]; // Fallback
}
}
/**
* Normalize query (lowercase, collapse whitespace, strip punctuation).
*/
export function normalizeQuery(query: string): string {
return query
.toLowerCase()
.trim()
// Remove extra spaces
.replace(/\s+/g, ' ')
// Remove special characters (keep alphanumeric and spaces)
.replace(/[^\w\s]/g, '');
}
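Stripping every non-word character turns `C++` into `c` and `Node.js` into `nodejs`, and because spaces are collapsed before punctuation is removed, it can leave double spaces behind. For resume search that may hurt keyword matching. A hedged alternative, assuming `+`, `#`, and in-word `.` are worth keeping; `normalizeTechQuery` is a hypothetical name:

```typescript
// Hypothetical variant of normalizeQuery that preserves common tech-term punctuation.
function normalizeTechQuery(query: string): string {
  return query
    .toLowerCase()
    // Replace punctuation with spaces, but keep + # . (as in c++, c#, node.js, .net)
    .replace(/[^\w\s+#.]/g, ' ')
    // Drop periods that end a word or sentence rather than sit inside a term
    .replace(/\.(?=\s|$)/g, ' ')
    // Collapse whitespace last so removed characters cannot leave gaps
    .replace(/\s+/g, ' ')
    .trim();
}
```

For instance, `normalizeTechQuery('  Does John know C++ / Node.js?  ')` returns `'does john know c++ node.js'`.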
PART 7: SAFETY & HALLUCINATION PREVENTION
👨🦳 Uncle: This is where you prevent the AI from lying. Critical.
// src/services/safety.ts
import Anthropic from '@anthropic-ai/sdk';
import logger from '../utils/logger';
interface SafeAnswer {
answer: string;
confidence: number;
evidence: string[];
isSafe: boolean;
reason?: string;
}
/**
* Get answer from AI with multiple safety layers.
*
* ⚠️ FAILURE POINTS:
* 1. AI answer not in JSON format
* 2. Confidence calculation wrong
* 3. Evidence doesn't exist in chunks
* 4. Excessive cost for failed attempts
*/
export async function safeAnswer(
query: string,
chunks: string[],
confidenceThreshold: number = 0.7
): Promise<SafeAnswer> {
const startTime = Date.now();
try {
if (chunks.length === 0) {
return {
answer: 'No relevant information found.',
confidence: 0,
evidence: [],
isSafe: false,
reason: 'No source chunks provided'
};
}
const client = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY
});
// Layer 1: Retrieval boundaries
// Show ONLY the chunks, nothing from training
const chunksText = chunks
.map((c, i) => `[Chunk ${i}]\n${c}`)
.join('\n\n---\n\n');
const prompt = `You are evaluating a candidate resume based on specific chunks.
INSTRUCTIONS:
1. Answer ONLY based on the provided chunks
2. Do NOT use any knowledge from training data
3. If information is not in chunks, say "Unknown"
4. Always cite which chunk supports your answer
5. Return valid JSON ONLY - no other text
Query: "${query}"
Chunks provided:
${chunksText}
Return JSON in this exact format:
{
"answer": "your answer here",
"confidence": 0.0 to 1.0,
"evidence_chunks": [0, 1, 2],
"explanation": "why you're confident"
}`;
// Layer 2: Structured output
const response = await client.messages.create({
model: 'claude-3-5-sonnet-20241022',
max_tokens: 500,
messages: [{
role: 'user',
content: prompt
}]
});
const responseText = response.content[0].type === 'text'
? response.content[0].text
: '';
// Parse response
let parsed: any;
try {
const jsonMatch = responseText.match(/\{[\s\S]*\}/);
const jsonStr = jsonMatch ? jsonMatch[0] : responseText;
parsed = JSON.parse(jsonStr);
} catch (e) {
logger.error('Failed to parse safety response', {
response: responseText.substring(0, 300)
});
return {
answer: 'Error processing answer',
confidence: 0,
evidence: [],
isSafe: false,
reason: 'Invalid response format'
};
}
// Layer 3: Validation
// Check evidence chunks actually exist
const validEvidenceIndices = (parsed.evidence_chunks || [])
.filter((i: number) => i >= 0 && i < chunks.length);
if (validEvidenceIndices.length === 0 && parsed.answer !== 'Unknown') {
logger.warn('No valid evidence for answer', {
answer: parsed.answer,
requestedIndices: parsed.evidence_chunks,
chunksCount: chunks.length
});
}
const evidence = validEvidenceIndices.map((i: number) => chunks[i]);
// Layer 4: Confidence gating
const isSafe = parsed.confidence >= confidenceThreshold;
if (!isSafe) {
logger.warn('Low confidence answer', {
answer: parsed.answer,
confidence: parsed.confidence,
threshold: confidenceThreshold
});
}
const latency = Date.now() - startTime;
logger.info('Safe answer generated', {
query: query.substring(0, 50),
confidence: parsed.confidence,
isSafe,
latency,
evidenceCount: evidence.length
});
return {
answer: parsed.answer,
confidence: parsed.confidence,
evidence,
isSafe,
reason: isSafe ? 'Confident' : 'Low confidence'
};
} catch (error: any) {
logger.error('Safety check error', { error: error.message });
return {
answer: 'Error',
confidence: 0,
evidence: [],
isSafe: false,
reason: error.message
};
}
}
/**
* Validate that answer is faithful to evidence.
* Post-check: does answer match the chunks?
*/
export async function validateFaithfulness(
answer: string,
evidence: string[],
threshold: number = 0.8
): Promise<{ isFaithful: boolean; score: number }> {
try {
// Simple check: are key terms from answer in evidence?
// Only terms longer than 3 chars count, in both numerator and denominator
const answerTerms = answer.toLowerCase().split(/\s+/)
.filter(term => term.length > 3);
const evidenceText = evidence.join(' ').toLowerCase();
const matchedTerms = answerTerms.filter(term =>
evidenceText.includes(term)
);
const score = answerTerms.length > 0
? matchedTerms.length / answerTerms.length
: 0;
return {
isFaithful: score >= threshold,
score
};
} catch (error) {
logger.error('Faithfulness validation error', { error });
return { isFaithful: false, score: 0 };
}
}
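Layers 3 and 4 in `safeAnswer` log a warning when the cited chunks are invalid, but the answer is still marked safe if the reported confidence clears the threshold. A stricter gate could require both valid evidence and sufficient confidence. This is a sketch of that idea, with `gateAnswer`, `ModelAnswer`, and `GateResult` as hypothetical names:

```typescript
interface ModelAnswer {
  answer?: unknown;
  confidence?: unknown;
  evidence_chunks?: unknown;
}

interface GateResult {
  isSafe: boolean;
  confidence: number;
  evidenceIndices: number[];
  reason: string;
}

// Mark an answer safe only if it cites at least one real chunk AND clears the threshold.
function gateAnswer(parsed: ModelAnswer, chunkCount: number, threshold: number = 0.7): GateResult {
  const confidence =
    typeof parsed.confidence === 'number' && parsed.confidence >= 0 && parsed.confidence <= 1
      ? parsed.confidence
      : 0; // missing, non-numeric, or out-of-range confidence is treated as zero
  const requested: unknown[] = Array.isArray(parsed.evidence_chunks) ? parsed.evidence_chunks : [];
  const evidenceIndices = requested.filter(
    (i): i is number => typeof i === 'number' && Number.isInteger(i) && i >= 0 && i < chunkCount
  );
  if (evidenceIndices.length === 0) {
    return { isSafe: false, confidence, evidenceIndices, reason: 'No valid evidence cited' };
  }
  if (confidence < threshold) {
    return { isSafe: false, confidence, evidenceIndices, reason: 'Low confidence' };
  }
  return { isSafe: true, confidence, evidenceIndices, reason: 'Confident' };
}
```

The confidence value is still self-reported by the model, so this gate narrows the failure modes rather than removing them.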
PART 8: API ROUTES & ENDPOINTS
👨🦳 Uncle: This is what the client calls. Make it robust.
// src/routes/rag.ts
import express, { Router, Request, Response } from 'express';
import db from '../config/database';
import { hybridSearch, multiQueryRetrieval } from '../services/retrieval';
import { rerank } from '../services/reranking';
import { expandQuery } from '../services/queryProcessing';
import { safeAnswer, validateFaithfulness } from '../services/safety';
import { chunkText } from '../utils/chunking';
import { getEmbeddings } from '../config/embedding';
import logger from '../utils/logger';
const router = Router();
// Middleware: Check authentication
function authMiddleware(req: Request, res: Response, next: Function) {
const apiKey = req.headers['x-api-key'] as string;
if (!apiKey) {
return res.status(401).json({ error: 'Missing API key' });
}
// In production, validate against database
if (apiKey !== process.env.ADMIN_API_KEY) {
return res.status(401).json({ error: 'Invalid API key' });
}
next();
}
router.use(authMiddleware);
/**
* Upload and process a resume.
* POST /rag/upload
*/
router.post('/upload', async (req: Request, res: Response) => {
try {
const { tenantId, candidateName, resumeText } = req.body;
if (!tenantId || !candidateName || !resumeText) {
return res.status(400).json({
error: 'Missing required fields: tenantId, candidateName, resumeText'
});
}
// Step 1: Save resume
const resumeResult = await db.one(`
INSERT INTO resumes (tenant_id, candidate_name, raw_text)
VALUES ($1, $2, $3)
RETURNING id
`, [tenantId, candidateName, resumeText]);
const resumeId = resumeResult.id;
// Step 2: Chunk the resume
const chunks = chunkText(resumeText, 1000, 200);
logger.info('Resume chunked', { resumeId, chunkCount: chunks.length });
// Step 3: Get embeddings for all chunks
const chunkTexts = chunks.map(c => c.text);
const embeddingResults = await getEmbeddings(chunkTexts);
// Step 4: Save chunks with embeddings
for (let i = 0; i < chunks.length; i++) {
const chunk = chunks[i];
const embedding = embeddingResults[i].embedding;
const embeddingArray = `[${embedding.join(',')}]`;
await db.none(`
INSERT INTO resume_chunks
(resume_id, tenant_id, chunk_text, chunk_index, embedding)
VALUES ($1, $2, $3, $4, $5::vector)
`, [resumeId, tenantId, chunk.text, chunk.index, embeddingArray]);
}
logger.info('Resume uploaded successfully', { resumeId, chunkCount: chunks.length });
res.json({
success: true,
resumeId,
chunkCount: chunks.length,
message: `Resume for ${candidateName} processed successfully`
});
} catch (error: any) {
logger.error('Upload error', { error: error.message });
res.status(500).json({ error: error.message });
}
});
/**
* Query a resume.
* POST /rag/query
*/
router.post('/query', async (req: Request, res: Response) => {
try {
const { tenantId, resumeId, question, useExpansion = false } = req.body;
if (!tenantId || !resumeId || !question) {
return res.status(400).json({
error: 'Missing required fields: tenantId, resumeId, question'
});
}
const startTime = Date.now();
// Step 1: Expand query if requested
let queries = [question];
if (useExpansion) {
queries = await expandQuery(question);
logger.debug('Query expanded', { count: queries.length });
}
// Step 2: Retrieve chunks (multi-query if expanded)
const retrieved = useExpansion
? await multiQueryRetrieval(tenantId, resumeId, queries, 10)
: await hybridSearch(tenantId, resumeId, question, 10);
if (retrieved.length === 0) {
return res.json({
answer: 'No relevant information found in resume.',
confidence: 0,
evidence: [],
isSafe: false,
latency: Date.now() - startTime
});
}
// Step 3: Rerank for accuracy
const chunks = retrieved.map(r => r.chunkText);
const reranked = await rerank(question, chunks, 5);
const topChunks = reranked.map(r => r.text);
// Step 4: Get safe answer with evidence
const safeAns = await safeAnswer(question, topChunks, 0.7);
// Step 5: Validate faithfulness (optional)
const faithfulness = await validateFaithfulness(safeAns.answer, safeAns.evidence);
const latency = Date.now() - startTime;
res.json({
success: true,
answer: safeAns.answer,
confidence: safeAns.confidence,
evidence: safeAns.evidence,
isSafe: safeAns.isSafe,
faithfulness: faithfulness.score,
latency,
chunksRetrieved: retrieved.length,
chunksReranked: reranked.length
});
} catch (error: any) {
logger.error('Query error', { error: error.message });
res.status(500).json({ error: error.message });
}
});
/**
* Get metrics for a tenant.
* GET /rag/metrics/:tenantId
*/
router.get('/metrics/:tenantId', async (req: Request, res: Response) => {
try {
const { tenantId } = req.params;
// Query logs aggregation
const metrics = await db.one(`
SELECT
COUNT(*) as query_count,
AVG(latency_ms) as avg_latency,
MAX(latency_ms) as max_latency,
MIN(latency_ms) as min_latency,
AVG(recall) as avg_recall,
AVG(precision) as avg_precision,
SUM(cost_cents) / 100.0 as total_cost_dollars
FROM query_logs
WHERE tenant_id = $1
`, [tenantId]);
res.json({
success: true,
metrics
});
} catch (error: any) {
logger.error('Metrics error', { error: error.message });
res.status(500).json({ error: error.message });
}
});
export default router;
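The auth middleware accepts one shared `ADMIN_API_KEY`, while `tenantId` comes from the request body, so any caller holding the key can name any tenant. Since the `tenants` table already stores an `api_key` per tenant, one option is to derive the tenant from the key and reject mismatches. A minimal sketch with an in-memory map standing in for the database lookup; `resolveTenantId`, `TenantDirectory`, and `TenantResolution` are hypothetical names:

```typescript
// api_key -> tenant id; a stand-in for a lookup against the tenants table
type TenantDirectory = Map<string, string>;

type TenantResolution =
  | { ok: true; tenantId: string }
  | { ok: false; status: 401 | 403; error: string };

// Derive the tenant from the API key; a tenant id sent by the client must agree with it.
function resolveTenantId(
  apiKey: string | undefined,
  requestedTenantId: string | undefined,
  directory: TenantDirectory
): TenantResolution {
  if (!apiKey) return { ok: false, status: 401, error: 'Missing API key' };
  const tenantId = directory.get(apiKey);
  if (!tenantId) return { ok: false, status: 401, error: 'Invalid API key' };
  if (requestedTenantId && requestedTenantId !== tenantId) {
    return { ok: false, status: 403, error: 'API key does not belong to this tenant' };
  }
  return { ok: true, tenantId };
}
```

In the middleware, the resolved `tenantId` would be attached to the request and used for every query in place of the value from `req.body`.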
PART 9: MAIN SERVER
// src/index.ts
import express from 'express';
import cors from 'cors';
import compression from 'compression';
import helmet from 'helmet';
import dotenv from 'dotenv';
import ragRoutes from './routes/rag';
import { initializeDatabase } from './config/database';
import logger from './utils/logger';
dotenv.config();
const app = express();
const PORT = process.env.PORT || 3000;
// Middleware
app.use(helmet()); // Security headers
app.use(compression()); // Compress responses
app.use(cors());
app.use(express.json());
// Health check
app.get('/health', (req, res) => {
res.json({ status: 'ok', timestamp: new Date().toISOString() });
});
// RAG routes
app.use('/rag', ragRoutes);
// Error handler
app.use((err: any, req: express.Request, res: express.Response, next: express.NextFunction) => {
logger.error('Unhandled error', { error: err.message });
res.status(500).json({ error: 'Internal server error' });
});
// Start server
async function start() {
try {
// Initialize database
await initializeDatabase();
app.listen(PORT, () => {
logger.info(`Server running on port ${PORT}`);
});
} catch (error) {
logger.error('Failed to start server', { error });
process.exit(1);
}
}
start();
Logging Service
// src/utils/logger.ts
import winston from 'winston';
const logger = winston.createLogger({
level: process.env.LOG_LEVEL || 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.json()
),
transports: [
// Console in development
new winston.transports.Console({
format: winston.format.combine(
winston.format.colorize(),
winston.format.printf(({ timestamp, level, message, ...meta }) => {
return `${timestamp} [${level}] ${message} ${
Object.keys(meta).length ? JSON.stringify(meta, null, 2) : ''
}`;
})
)
}),
// File for production
new winston.transports.File({
filename: 'logs/error.log',
level: 'error'
}),
new winston.transports.File({
filename: 'logs/combined.log'
})
]
});
export default logger;
PART 10: FAILURE POINTS & SOLUTIONS
Summary Table
| Failure Point | Symptom | Root Cause | Solution |
|---|---|---|---|
| Missing tenant_id | Data leak between companies | No isolation check | Add WHERE tenant_id = X to EVERY query |
| Vector index missing | Queries take 10+ seconds | Sequential scan of 500K vectors | Create IVFFLAT index on embedding column |
| Query too long | API error 4096 tokens exceeded | Question >16000 chars | Truncate queries to 500 chars |
| No results | Empty array returned | Chunks don’t exist or embeddings wrong | Check if chunks were saved, verify vector distance threshold |
| Hallucination | AI invents information | No retrieval boundaries in prompt | Use the safety layers from Part 7 (retrieval boundaries, structured output, evidence validation, confidence gating) |
| Rate limit (429) | API call fails | Too many requests to Claude | Implement exponential backoff, queue requests |
| Database connection lost | “Cannot connect to server” | Network issue, DB down, wrong credentials | Add retry logic, connection pooling, health checks |
| Embedding dimension mismatch | “Vector dimension 1536 != 768” | Indexing and querying use different embedding models | Use one embedding model everywhere and size the vector column to its output dimension |
| Memory overload | Node.js crashes | Trying to embed entire 100MB file | Chunk before embedding, process in batches |
| Cost explosion | Unexpected $10k bill | Each embedding/rerank/answer costs money | Track costs, log them, set spending limits |
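The rate-limit row is the one that bites first in practice, so here is a minimal sketch of exponential backoff with jitter. It is not the article's implementation: `withRetry`, `backoffDelayMs`, and `RetryOptions` are hypothetical names, and the check assumes the thrown error carries a numeric `status` of 429.

```typescript
// Hypothetical helper names; not part of any SDK.
interface RetryOptions {
  maxAttempts: number; // total tries, including the first call
  baseDelayMs: number; // cap for the first retry delay
  maxDelayMs: number;  // upper bound for any single delay
}

// Delay cap for a given retry (0-based): base * 2^retry, limited by maxDelayMs.
function backoffDelayMs(retry: number, opts: RetryOptions): number {
  return Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** retry);
}

// Assumption: rate-limit errors expose a numeric `status` equal to 429.
function isRateLimit(err: unknown): boolean {
  return (
    typeof err === 'object' &&
    err !== null &&
    (err as { status?: unknown }).status === 429
  );
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  fn: () => Promise<T>,
  opts: RetryOptions
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const lastAttempt = attempt >= opts.maxAttempts - 1;
      // Only rate limits are retried; everything else fails fast.
      if (!isRateLimit(err) || lastAttempt) throw err;
      // Full jitter: wait a random time up to the exponential cap.
      await sleep(Math.random() * backoffDelayMs(attempt, opts));
    }
  }
}
```

Wrap any rate-limited call in it, for example `await withRetry(() => callClaude(prompt), { maxAttempts: 5, baseDelayMs: 500, maxDelayMs: 8000 })`, where `callClaude` stands in for whatever function makes the API request. The jitter matters: without it, every queued request retries at the same instant and hits the limit again.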
PART 11: PROS AND CONS OF RAG
PROS ✓
1. Accuracy
- ✓ AI answers from evidence, not training data
- ✓ Far fewer hallucinations (with proper safety layers)
- ✓ Can trace every answer to source
Without RAG: "Does John know Docker?" → Guess → "maybe, looks like it"
With RAG: "Does John know Docker?" → Evidence → "Yes. His resume says: 'Docker, Kubernetes, 4 years'"
2. Up-to-Date
- ✓ Works with information from last month
- ✓ No retraining needed
- ✓ Can add 1000 new resumes per day
3. Explainability
- ✓ Users see why the system answered
- ✓ Can audit decisions
- ✓ Legally defensible
4. Cost
- ✓ Cheaper than retraining models
- ✓ Only pay for what you use
- ✓ PostgreSQL is free (open source)
5. Flexibility
- ✓ Can use with ANY language
- ✓ Works with specialized domains
- ✓ Easy to update documents
CONS ✗
1. Complexity
- ✗ More moving parts: embeddings, database, retrieval, reranking, safety
- ✗ More things to break
- ✗ More things to debug
Single LLM: Simple
RAG: Embeddings → Vector DB → Retrieval → Reranking → Safety checks
Each stage can fail independently
2. Retrieval Quality
- ✗ If wrong chunks retrieved, answer is wrong
- ✗ Embeddings are fuzzy (not perfect matches)
- ✗ Rare topics might be poorly represented in the embedding model's training data
Example problem:
Resume: "Worked with distributed ledger technology"
Search: "blockchain"
Result: Possible miss (the two phrases may not land close together in embedding space)
3. Cost of Operations
- ✗ Each embedding: ~$0.0002
- ✗ Each rerank: ~$0.003
- ✗ Each LLM answer: ~$0.0045
- ✗ Adds up fast with scale
100K queries/month ≈ $770/month just for operations
(Not including infrastructure, salaries, etc)
4. Latency
- ✗ Vector search: 100ms
- ✗ Reranking: 300-500ms
- ✗ Safety check: 200ms
- ✗ Total: 600-800ms
User expects <200ms. RAG adds delay.
5. Data Quality Matters
- ✗ If source documents are bad, answers are bad
- ✗ Garbage in = garbage out
- ✗ No amount of engineering fixes bad data
Resume full of typos: "React" → "Rreact" → Embeddings confused
→ System can't find React skills
→ Answer is wrong
6. Scaling is Hard
- ✗ Vector index maintenance complex
- ✗ Multi-tenancy requires isolation at every level
- ✗ Database queries become slow with 10M+ chunks
When to Use RAG
| Scenario | Use RAG? | Why |
|---|---|---|
| Customer support | YES | Up-to-date, explainable, fewer hallucinations |
| Medical diagnosis | YES | Safety-critical, needs evidence |
| Resume screening | YES | Domain-specific, needs accuracy |
| General chatbot | NO | Training data sufficient, latency matters |
| Quick facts | NO | Simple lookup is faster |
| Creative writing | NO | Hallucinations are features, not bugs |
| Code search | MAYBE | Depends on code freshness |
| Legal documents | YES | Must cite sources, no mistakes |
When NOT to Use RAG
1. Simple lookup → Use database
2. Conversational → Use base LLM
3. Speed critical → Too slow (600ms+)
4. Data quality poor → Garbage in/out
5. Training data sufficient → No value-add
6. Cost-sensitive → Each query costs money
PART 12: COST ANALYSIS
Breakdown of Costs
Per-Query Costs (approximate; LLM prices are per 1K tokens, input/output):
1. Embedding query:
- 50 tokens @ $0.000003/token = $0.00015
2. Vector search + keyword filter:
- Database operation ≈ $0 (hosted: ~$0.00001)
3. Reranking (Claude):
- 500 tokens input + 100 output @ $0.003/$0.015 = $0.0015 + $0.0015 = $0.003
4. Final answer:
- 500 tokens input + 200 output @ $0.003/$0.015 = $0.0015 + $0.003 = $0.0045
Total per query: ≈ $0.0077 (~0.77 cents)
At scale:
- 1K queries/month ≈ $7.70
- 100K queries/month ≈ $770
- 1M queries/month ≈ $7,700
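Recomputing the breakdown in code is a cheap sanity check before you trust any monthly estimate. This is a sketch under stated assumptions: the rates are the approximate ones listed above, the LLM prices are read as USD per 1,000 tokens, and the function names are made up for illustration.

```typescript
// Approximate rates from the breakdown; LLM prices assumed to be per 1K tokens.
const INPUT_PER_1K = 0.003;
const OUTPUT_PER_1K = 0.015;
const EMBEDDING_PER_TOKEN = 0.000003;

interface LlmCall {
  inputTokens: number;
  outputTokens: number;
}

// Cost of one LLM call: input and output tokens are priced separately.
function llmCallCost(call: LlmCall): number {
  return (
    (call.inputTokens / 1000) * INPUT_PER_1K +
    (call.outputTokens / 1000) * OUTPUT_PER_1K
  );
}

// Cost of one RAG query: embed the question, rerank, then answer.
// The database step is treated as free here.
function queryCost(embedTokens: number, rerank: LlmCall, answer: LlmCall): number {
  return (
    embedTokens * EMBEDDING_PER_TOKEN +
    llmCallCost(rerank) +
    llmCallCost(answer)
  );
}

const perQuery = queryCost(
  50,
  { inputTokens: 500, outputTokens: 100 },
  { inputTokens: 500, outputTokens: 200 }
);

for (const monthlyQueries of [1000, 100000, 1000000]) {
  const monthly = (perQuery * monthlyQueries).toFixed(2);
  console.log(`${monthlyQueries} queries/month ≈ $${monthly}`);
}
```

Swap in your real token counts from the logs. Output tokens cost five times as much as input tokens at these rates, so a long answer moves the total far more than a long prompt of the same size.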
How to Reduce Costs
1. Cache results
- Same question asked twice? Use Redis cache
- Saves: 80% of cost
2. Batch processing
- Don’t embed one by one
- Embed 100 at a time
- Saves: 20% (batch discounts)
3. Smart reranking
- Only rerank if needed
- Don’t rerank for obvious matches
- Saves: 30-50%
4. Cheaper models
- Use Claude Haiku for simple tasks
- Claude Opus only for complex reasoning
- Saves: 80%
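Caching is the easiest of these to try first. The sketch below uses an in-memory `Map` with a time-to-live instead of Redis, purely to show the shape of the idea; `AnswerCache` and its methods are hypothetical names, and a shared store such as Redis would replace the `Map` once you run more than one server process.

```typescript
interface CacheEntry<T> {
  value: T;
  expiresAt: number; // epoch milliseconds
}

// Hypothetical in-memory answer cache with a time-to-live per entry.
class AnswerCache<T> {
  private readonly entries = new Map<string, CacheEntry<T>>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // The tenant is part of the key so cached answers never cross tenants.
  static key(tenantId: string, question: string): string {
    const normalized = question.trim().toLowerCase().replace(/\s+/g, ' ');
    return `${tenantId}::${normalized}`;
  }

  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

Check the cache before the embedding call and store the final answer after the safety checks pass. How much it saves depends entirely on how often questions repeat, so log the hit rate before counting on any number.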
PART 13: DEPLOYMENT & PRODUCTION
Docker Deployment
# Dockerfile
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY dist ./dist
EXPOSE 3000
CMD ["node", "dist/index.js"]
Docker Compose (Local Dev)
# docker-compose.yml
version: '3.8'
services:
  postgres:
    # pgvector is a PostgreSQL extension, so it has to live in the same server
    # the app connects to. This image ships Postgres 15 with pgvector installed.
    image: pgvector/pgvector:pg15
    environment:
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: rag_system
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
  app:
    build: .
    environment:
      DB_HOST: postgres
      DB_USER: postgres
      DB_PASSWORD: postgres
      DB_NAME: rag_system
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
    ports:
      - "3000:3000"
    depends_on:
      - postgres
volumes:
  postgres_data:
Deployment Checklist
- [ ] Environment variables set correctly
- [ ] Database indexes created
- [ ] API key securely stored (use AWS Secrets Manager)
- [ ] Logging configured (CloudWatch or similar)
- [ ] Monitoring set up (CPU, memory, latency)
- [ ] Rate limiting enabled
- [ ] Database backups enabled
- [ ] Error alerts configured
- [ ] Cost monitoring enabled
- [ ] Security scan passed
- [ ] Load test passed (1000 req/sec)
- [ ] Graceful shutdown implemented
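The last checklist item is the one most often skipped, so here is a minimal sketch of graceful shutdown. The helper names are hypothetical, the 10-second timeout is an arbitrary choice, and `cleanup` stands in for whatever closes your database pool. The `Closable` interface matches the `close(callback)` method on the `http.Server` that `app.listen()` returns.

```typescript
// Anything with an http.Server-style close(), e.g. the value returned by app.listen().
interface Closable {
  close(callback: (err?: Error) => void): unknown;
}

// Hypothetical helper: stop accepting new connections, wait for in-flight
// requests to finish, then run cleanup (for example, closing the DB pool).
function shutdownGracefully(
  server: Closable,
  cleanup: () => Promise<void>,
  timeoutMs: number
): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    // Give up if open connections keep close() from completing.
    const timer = setTimeout(
      () => reject(new Error('Shutdown timed out')),
      timeoutMs
    );
    const finish = (err?: unknown) => {
      clearTimeout(timer);
      if (err) reject(err);
      else resolve();
    };
    server.close((err) => {
      if (err) return finish(err);
      cleanup().then(() => finish(), finish);
    });
  });
}

// Wire the helper to the signals most platforms send before stopping a container.
function installShutdownHandlers(
  server: Closable,
  cleanup: () => Promise<void>
): void {
  for (const signal of ['SIGTERM', 'SIGINT'] as const) {
    process.once(signal, () => {
      shutdownGracefully(server, cleanup, 10000).then(
        () => process.exit(0),
        () => process.exit(1)
      );
    });
  }
}
```

In the server file this would look like `const server = app.listen(PORT); installShutdownHandlers(server, closeDatabase);`, where `closeDatabase` is an assumed function that ends the connection pool. Without this, a deploy can cut off a query halfway through a paid LLM call.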
PART 14: EXAMPLE QUERIES
Complete End-to-End Example
# 1. Upload a resume
curl -X POST http://localhost:3000/rag/upload \
  -H "Content-Type: application/json" \
  -H "X-API-Key: super-secret-key" \
  -d '{
    "tenantId": "company-1",
    "candidateName": "John Doe",
    "resumeText": "John has 5 years React experience, built e-commerce platforms with Node.js..."
  }'
# Response:
# {
# "success": true,
# "resumeId": "uuid-123",
# "chunkCount": 8
# }
# 2. Query the resume
curl -X POST http://localhost:3000/rag/query \
  -H "Content-Type: application/json" \
  -H "X-API-Key: super-secret-key" \
  -d '{
    "tenantId": "company-1",
    "resumeId": "uuid-123",
    "question": "Does John have React experience?",
    "useExpansion": true
  }'
# Response:
# {
# "success": true,
# "answer": "Yes, John has 5 years of React experience.",
# "confidence": 0.95,
# "evidence": ["John has 5 years React experience, built e-commerce..."],
# "isSafe": true,
# "faithfulness": 0.92,
# "latency": 720,
# "chunksRetrieved": 10,
# "chunksReranked": 5
# }
# 3. Get metrics
curl -X GET http://localhost:3000/rag/metrics/company-1 \
  -H "X-API-Key: super-secret-key"
# Response:
# {
# "success": true,
# "metrics": {
# "query_count": 42,
# "avg_latency": 680,
# "max_latency": 1200,
# "avg_recall": 0.91,
# "avg_precision": 0.88,
# "total_cost_dollars": 0.32
# }
# }
FINAL CHECKLIST: BEFORE YOU SHIP
Functionality
- [ ] Embeddings working (call Claude API successfully)
- [ ] PostgreSQL storing vectors correctly
- [ ] Hybrid search returning results
- [ ] Reranking improving quality
- [ ] Safety layers preventing hallucinations
- [ ] Multi-tenancy isolated (tenant1 can’t see tenant2)
Performance
- [ ] Vector search <200ms
- [ ] Full query <1000ms
- [ ] No N+1 queries
- [ ] Database queries indexed
- [ ] Caching working
Security
- [ ] API keys not logged
- [ ] Tenant isolation verified (manual test)
- [ ] SQL injection prevented (parameterized queries)
- [ ] Rate limiting enabled
- [ ] HTTPS enforced
Operations
- [ ] Logging configured
- [ ] Error alerts set up
- [ ] Cost tracking enabled
- [ ] Database backups tested
- [ ] Graceful error handling
Monitoring
- [ ] Query latency tracked
- [ ] Recall/precision measured
- [ ] Cost per query logged
- [ ] Error rate tracked
- [ ] Uptime monitored
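Measuring recall and precision needs a small labelled set: for each test question, the chunk IDs a human marked as relevant. Given that, the arithmetic is short. This is a sketch with a hypothetical function name, not the article's `metrics.ts`.

```typescript
interface RetrievalScore {
  precision: number; // share of the top-k results that are relevant
  recall: number;    // share of all relevant chunks found in the top k
}

// Hypothetical helper: precision@k and recall@k for one query.
// `retrieved` is ordered best-first; `relevant` is the human-labelled set.
function precisionRecallAtK(
  retrieved: string[],
  relevant: Set<string>,
  k: number
): RetrievalScore {
  // De-duplicate so a chunk returned twice is not counted as two hits.
  const topK = Array.from(new Set(retrieved.slice(0, k)));
  const hits = topK.filter((id) => relevant.has(id)).length;
  return {
    precision: topK.length > 0 ? hits / topK.length : 0,
    recall: relevant.size > 0 ? hits / relevant.size : 0
  };
}
```

Average the scores over the whole test set and track them per release. A drop in recall after changing the chunk size or the embedding model is exactly the kind of regression this catches.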
Quick Reference: Common Issues & Fixes
Issue: Queries Take 10+ Seconds
Cause: Missing vector index
Fix:
CREATE INDEX ON resume_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
Issue: Same Query Returns Different Results
Cause: Stale embeddings or non-deterministic reranking
Fix: Always embed with the same model, and run the reranking call at temperature 0 to reduce variation
Issue: AI Answers with Wrong Information
Cause: Insufficient safety layers
Fix: Add confidence thresholding, require citations, validate faithfulness
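A rough sketch of two of those checks, confidence thresholding and citation verification, assuming the response shape from the example queries (`answer`, `confidence`, `evidence`). The function name and the 0.7 threshold are illustrative assumptions, and it only accepts quotes that appear verbatim in the retrieved chunks.

```typescript
interface RagAnswer {
  answer: string;
  confidence: number; // 0..1, as reported by the pipeline
  evidence: string[]; // quotes the model claims to rely on
}

// Compare text case-insensitively and ignore whitespace differences.
const normalizeText = (s: string) =>
  s.toLowerCase().replace(/\s+/g, ' ').trim();

// Hypothetical gate: accept an answer only if it is confident enough and
// every evidence quote really appears in one of the retrieved chunks.
function isAnswerAcceptable(
  result: RagAnswer,
  retrievedChunks: string[],
  minConfidence = 0.7
): boolean {
  if (result.confidence < minConfidence) return false;
  if (result.evidence.length === 0) return false; // citations are required
  const haystack = retrievedChunks.map(normalizeText);
  return result.evidence.every((quote) => {
    const needle = normalizeText(quote);
    return needle.length > 0 && haystack.some((chunk) => chunk.includes(needle));
  });
}
```

When the gate returns false, return something like "I could not find this in the resume" instead of the model's answer. A substring check will not catch a true quote used to support a wrong conclusion, so it complements a faithfulness check rather than replacing one.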
Issue: “Tenant1 sees Tenant2 data”
Cause: Missing tenant_id check in WHERE clause
Fix: Add WHERE tenant_id = $X to EVERY query
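One way to make that rule hard to forget is to route every read through a helper that cannot build a query without a tenant. This is a sketch, not the article's data layer: `tenantScopedSelect` is a hypothetical name, `resumes` is an assumed table next to `resume_chunks`, and the returned `{ text, values }` object is the parameterized form that `pg` accepts in `query()`.

```typescript
interface TenantQuery {
  text: string;      // SQL with $1, $2, ... placeholders
  values: unknown[]; // bound parameters, tenant first
}

// Hypothetical helper: every SELECT it builds has tenant_id bound as $1.
function tenantScopedSelect(
  table: 'resume_chunks' | 'resumes',
  tenantId: string,
  filters: Record<string, string | number> = {}
): TenantQuery {
  if (!tenantId) throw new Error('tenantId is required');
  const clauses = ['tenant_id = $1'];
  const values: unknown[] = [tenantId];
  for (const [column, value] of Object.entries(filters)) {
    // Column names cannot be parameterized, so allow only plain identifiers.
    if (!/^[a-z_][a-z0-9_]*$/.test(column)) {
      throw new Error(`Invalid column name: ${column}`);
    }
    values.push(value);
    clauses.push(`${column} = $${values.length}`);
  }
  return {
    text: `SELECT * FROM ${table} WHERE ${clauses.join(' AND ')}`,
    values
  };
}
```

Calling `tenantScopedSelect('resume_chunks', 'company-1', { resume_id: 'uuid-123' })` yields a query whose first condition is always the tenant, with values passed separately from the SQL text. The helper does not replace the manual cross-tenant test from the checklist; it just removes the most common way to fail it.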
Issue: API calls get rate limited (429)
Cause: Too many rapid requests to Claude
Fix: Implement exponential backoff, queue requests, batch operations
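For the "queue requests" part, a small concurrency limiter is often enough: it keeps at most a fixed number of API calls in flight while working through a list. A minimal sketch with a hypothetical name, `mapWithConcurrency`, and no external queue library:

```typescript
// Hypothetical helper: run `fn` over `items` with at most `limit` calls
// in flight at once. Results come back in the same order as the input.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // index of the next item to claim

  // Each worker claims the next unprocessed index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }

  const workerCount = Math.min(Math.max(1, limit), items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Embedding the chunks of a large resume then becomes `await mapWithConcurrency(chunks, 3, embedChunk)`, where `embedChunk` is an assumed function that makes one API call. If any call rejects, the whole promise rejects, so combine it with retries when partial failure is expected.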
Issue: Cost explodes unexpectedly
Cause: No cost tracking, inefficient queries, excessive reranking
Fix: Log cost per operation, implement budgets, use cheaper models for easy tasks
The Bottom Line
👨🦳 Uncle’s Final Word:
RAG is powerful but complex. Every layer serves a purpose:
- Embeddings = understand meaning
- Vector DB = store and search fast
- Retrieval = find relevant information
- Reranking = improve quality
- Safety = prevent lies
- Operations = track costs and metrics
You don’t need all of this on day one. Start simple:
Day 1: PostgreSQL + embeddings + basic search
Week 1: Add reranking
Month 1: Add safety layers
Month 3: Add monitoring and optimization
Each layer buys you something. Know what you’re buying.
👦 Nephew: When should I NOT use RAG?
👨🦳 Uncle: When:
- Data is stable and doesn’t change
- You need sub-200ms latency
- You’re paying $10/month and can’t afford $100+
- You’re building a simple chatbot
- Your source data has low quality
Otherwise? RAG is the way.
Now go build. Start simple. Measure everything. Ship fast.
Good luck. —SurajK
