What is RAG and Why Does It Matter
Standard GPT-4o doesn't know about your company's internal docs, your product catalog, or your customer data. RAG fixes this: you store your content as vector embeddings in a database, and at query time, you find the most relevant chunks and send them as context to the LLM.
The result: an AI assistant that answers questions about your specific data, accurately and with citations.
RAG is now the standard architecture for enterprise AI assistants, internal knowledge bases, and customer support bots. The alternative, fine-tuning a model on your data, is 10× more expensive, slower to update, and less reliable for factual retrieval. For 95% of business use cases, RAG with a good retrieval system outperforms fine-tuning.
Step 1: Enable pgvector in Supabase
Supabase comes with pgvector built in. In the Supabase SQL editor, run: CREATE EXTENSION IF NOT EXISTS vector;
Then create your documents table: CREATE TABLE documents ( id bigint primary key generated always as identity, content text, embedding vector(1536), metadata jsonb, created_at timestamptz default now() );
Create an index for fast similarity search: CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);
This is the only SQL you need to write.
Step 2: Generate Embeddings with Xano
In Xano, create an endpoint that accepts a text string, calls the OpenAI Embeddings API (text-embedding-3-small), and stores the result in your Supabase documents table.
The function stack: (1) Get text input, (2) Call OpenAI API POST /v1/embeddings, (3) Extract the embedding array from the response, (4) Insert into Supabase via the Supabase API connector.
Run this for every document, article, or FAQ entry you want your AI to know about.
Step 3: Semantic Search in Xano
Create a second endpoint that: (1) Accepts a user query string, (2) Generates an embedding for the query (same OpenAI call), (3) Runs a vector similarity search in Supabase.
The Supabase RPC function for this: SELECT content, 1 - (embedding <=> query_embedding) as similarity FROM documents ORDER BY embedding <=> query_embedding LIMIT 5;
This returns the 5 most semantically relevant chunks to the user's question.
Step 4: Generate the Answer with GPT-4o
With the retrieved chunks, call OpenAI's Chat Completions API. The system prompt:
"You are a helpful assistant. Answer the user's question using ONLY the context provided below. If the answer is not in the context, say so.
Context: [INSERT RETRIEVED CHUNKS HERE]"
This grounds the LLM's response in your actual data, preventing hallucinations.
Step 5: Build the Chat UI in WeWeb
In WeWeb, create a chat interface with a message list and input field. On send: call your Xano search endpoint, then stream the GPT response using Xano's streaming support or a direct OpenAI call from WeWeb's custom code.
For production, add: message history (store in Supabase), source citations (show which documents were retrieved), and a feedback mechanism (thumbs up/down to improve retrieval quality).
Choosing the Right Embedding Model
OpenAI offers three embedding models in 2025: text-embedding-3-small (1536 dimensions, $0.02/1M tokens), text-embedding-3-large (3072 dimensions, $0.13/1M tokens), and the legacy text-embedding-ada-002. For most RAG applications, text-embedding-3-small is the right choice: it balances cost, speed, and retrieval quality for standard document search.
text-embedding-3-large is worth considering when retrieval precision is critical, legal contract analysis, medical literature search, or applications where a missed relevant passage has meaningful consequences. In our projects, we benchmarked 3-small vs 3-large on a 10,000-document knowledge base and found 3-small achieved 94% of 3-large's retrieval accuracy at 15% of the cost.
For multilingual applications (common in our European client base), both OpenAI models handle French, German, Swedish, and Dutch reasonably well. However, for specialized industry terminology in non-English languages, consider testing against multilingual-specific models like Cohere's multilingual-22-12 via the Cohere API as a benchmark.
Chunking Strategies: The Hidden Variable in RAG Quality
How you split your documents into chunks is arguably the most important variable in RAG quality, more impactful than model choice for most applications. Too large and your retrieved chunks contain irrelevant content that confuses the LLM. Too small and you lose the context the LLM needs to answer coherently.
In our production RAG projects, we use a hybrid chunking strategy: split on paragraph boundaries first (natural semantic units), then enforce a maximum chunk size of 512 tokens and a minimum of 100 tokens. Chunks that are too small are merged with their neighbor. This preserves semantic coherence while preventing oversized chunks. For structured documents (FAQ pages, product specs), we chunk at the Q&A or section level rather than by token count.
One often-overlooked technique: add overlap between chunks. If chunk N ends at sentence 10 and chunk N+1 starts at sentence 11, context that bridges the boundary is lost. We add a 50-token overlap, sentences 9–10 from chunk N are repeated at the start of chunk N+1. This 10% overhead significantly improves retrieval for queries that reference concepts spanning chunk boundaries.
Vector Similarity Thresholds and Retrieval Tuning
Not all retrieved chunks are equally relevant. The cosine similarity score returned by pgvector ranges from 0 (no similarity) to 1 (identical). In practice, relevant chunks typically score above 0.75, and you should filter out anything below 0.60 to avoid feeding irrelevant context to the LLM.
In your Supabase RPC function, add a similarity threshold: WHERE 1 - (embedding <=> query_embedding) > 0.70 ORDER BY embedding <=> query_embedding LIMIT 5. Start with 0.70 and tune based on your data. If your assistant says "I don't have information about that" for queries you know it should answer, lower the threshold. If it's giving answers from weakly related chunks, raise it.
For advanced retrieval, implement a hybrid search combining vector similarity with keyword search (PostgreSQL's built-in full-text search). This hybrid approach, often called "reciprocal rank fusion", improves precision for queries with specific proper nouns, product names, or technical terms that embeddings sometimes handle poorly. Supabase supports both vector and full-text search natively, making hybrid retrieval achievable without custom infrastructure.
Cost Optimization for Production RAG
The main cost drivers in a RAG system are: embedding generation (when you add new documents), embedding queries (each user message), and LLM inference (each GPT-4o call). For a typical internal knowledge base with 5,000 documents and 500 daily queries, the monthly cost breakdown is approximately: embedding ingestion (one-time) $0.20, daily query embeddings $0.30/month, GPT-4o answers at 3,000 tokens average context $45/month. Total: roughly $50/month for 500 daily users.
To reduce LLM costs, implement a caching layer for common queries. Store the top 100 questions and their answers in Supabase. Before calling GPT-4o, check if a semantically similar question has been answered recently (cosine similarity > 0.95 to a cached query embedding). Cache hits serve the stored answer instantly, no LLM call needed. In our production RAG apps, 30–40% of queries are cache hits, reducing LLM costs by a third.
For very high-volume applications, consider switching the answer generation from GPT-4o to GPT-4o-mini for low-complexity queries. Implement a routing layer that classifies queries as simple (factual lookup, single-source answer) vs complex (multi-source synthesis, analysis), and routes to the cheaper model for simple queries. This hybrid LLM approach reduces inference costs by 60–70% on typical knowledge base workloads.
Production Considerations: Reliability, Monitoring, and Data Freshness
Production RAG systems have operational requirements that prototype tutorials ignore. Document freshness is critical: if your knowledge base is out of date, your AI gives wrong answers with confident-sounding language. Implement an automated re-ingestion pipeline, when a document is updated in your CMS or database, trigger a Xano background task to re-embed the affected chunks and update the vector index.
Monitoring in production means tracking: query latency (embedding + retrieval + LLM combined, should be under 3 seconds for good UX), retrieval quality (log the similarity scores of retrieved chunks, a drop in average score signals your document set may not cover recent queries), and user satisfaction (thumbs up/down feedback stored in Supabase, reviewed weekly). In our projects, weekly review of low-rated answers is the single most effective quality improvement activity.
For data isolation in multi-tenant RAG apps, add a workspace_id column to your documents table and add it to every RLS policy and retrieval query. Each tenant's knowledge base is completely isolated, a user can only retrieve and receive answers from their own documents. This is non-negotiable for any app handling confidential business data, and Supabase's RLS makes it straightforward to implement correctly.