The Problem
We built a substantial library of how-to videos: more than a hundred of them, covering everything from electrical systems to water heater maintenance. Our customers watched these videos all the time. The problem: finding the right one felt impossible.
The UI was a dense grid of thumbnails with cryptic titles. No search. No indexing. Want to know how to reset the water heater? Scroll through 30 video cards and hope you guess correctly. Want to verify something about your electrical system? Same problem.
I built a retrieval-augmented generation (RAG) pipeline: ingest transcripts (and generate them if not provided), chunk them by semantic boundaries (with preserved timecodes), store them in a vector database, and return grounded answers with a direct link to the exact second in the source video where the answer lives.
Why Not Just Feed Everything to an LLM?
The obvious approach: use an LLM's context window, feed it all transcripts, ask a question, get a summary. This fails for three reasons:
Context limits. Hundreds of thousands of words across many videos (and growing) don't fit comfortably in one context window, especially while also maintaining conversation history.
Hallucinations kill trust. A customer asking about their water heater needs reliable information, not a plausible-sounding answer that's half right. They need to verify by watching the source.
Authority matters. "An AI told me" doesn't carry the same weight as "here's the video our team made about this exact topic." Grounding answers in actual sources builds trust.
RAG solves this by retrieving the actual documentation, letting the model answer based on that, not on training data or confabulation.
How It Works
The system is built in three phases: ingestion, indexing, and retrieval + response.
Ingestion: Getting transcripts at scale
I pulled transcripts from two sources: YouTube (YouTube Transcript API) and Vimeo (captions API). For videos without captions, I used Deepgram's speech-to-text API. This meant every video in the library — regardless of source — could be indexed.
Transcripts are messy: timestamps, filler words, structural artifacts. I normalized them using the sbd (sentence boundary detection) library to split on actual sentence boundaries rather than whitespace or timecode markers. That gives a cleaner baseline for what counts as "a thought."
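A minimal sketch of that normalization pass, assuming caption segments arrive as `{ text, start }` pairs (the exact shape differs between YouTube, Vimeo, and Deepgram) and using the sbd package's `sentences()` helper; the filler-word list is illustrative, and the timecode bookkeeping the real pipeline carries forward is omitted here for brevity.

```typescript
import * as sbd from "sbd";

// Illustrative shape; each provider's transcript format is mapped into this.
interface TranscriptSegment {
  text: string;  // raw caption text, often cut mid-sentence
  start: number; // offset into the video, in seconds
}

// Collapse caption segments into clean sentences.
function normalizeTranscript(segments: TranscriptSegment[]): string[] {
  const raw = segments
    .map((s) => s.text.replace(/\s+/g, " ").trim()) // strip stray newlines/whitespace
    .join(" ")
    .replace(/\b(um|uh|you know)\b/gi, "")          // illustrative filler-word pass
    .replace(/\s{2,}/g, " ");

  // sbd splits on real sentence boundaries rather than caption line breaks.
  return sbd.sentences(raw, { sanitize: true });
}
```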
Chunking: Intelligence at the core
Raw transcripts are too long for vector search. A 30-minute video is thousands of words. Send that to an embedding model and you lose the specificity of individual topics.
I chunk by semantic boundaries, not arbitrary token counts. Using Gemini's embedding model, I identify logical breaks in the transcript while keeping related information together. A chunk might be "how to reset the water heater" (200 words) or "troubleshooting the fridge" (400 words). Each chunk is self-contained enough to answer a specific question without losing context.
Critically: I preserve the original timecode for every chunk. When search returns a result, the user gets a direct link to the exact second in the video where the answer starts. No scrolling through 30 minutes to find the relevant 2-minute segment.
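A sketch of the boundary-detection idea under a few assumptions: sentences arrive already paired with their start times, `embed` stands in for a call to Gemini's embedding model, and the 0.75 cutoff is illustrative rather than a tuned production value. Splitting where adjacent-sentence similarity drops is one common way to do semantic chunking, not necessarily the exact heuristic used here.

```typescript
interface TimedSentence {
  text: string;
  start: number; // seconds into the video, carried over from the caption segment
}

interface Chunk {
  text: string;
  start: number; // timecode of the chunk's first sentence, used for deep links
}

// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Start a new chunk when the next sentence's embedding drifts away from the
// previous one, i.e. when the topic appears to shift.
async function chunkBySemanticBoundaries(
  sentences: TimedSentence[],
  embed: (text: string) => Promise<number[]>, // e.g. a Gemini embedding call
  boundaryThreshold = 0.75 // illustrative cutoff
): Promise<Chunk[]> {
  const chunks: Chunk[] = [];
  let current: TimedSentence[] = [];
  let previous: number[] | null = null;

  const flush = () => {
    if (current.length > 0) {
      chunks.push({ text: current.map((s) => s.text).join(" "), start: current[0].start });
      current = [];
    }
  };

  for (const sentence of sentences) {
    const embedding = await embed(sentence.text);
    if (previous !== null && cosine(previous, embedding) < boundaryThreshold) {
      flush(); // similarity dropped: this sentence starts a new topic
    }
    current.push(sentence);
    previous = embedding;
  }
  flush();
  return chunks;
}
```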
Vector storage and search with diversity re-ranking
Chunks and their embeddings live in PostgreSQL with pgvector. When a user asks a question, three things happen (a sketch of the search query follows this list):
- Embed their query (same model: Gemini)
- Search for similar vectors using cosine similarity
- Apply maximum marginal relevance (MMR) re-ranking to avoid clustering around a single source
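A sketch of the search step, assuming chunks live in a `chunks` table with a pgvector `embedding` column and that the app reaches Supabase's Postgres through the node-postgres client; the schema, column names, and candidate limit are illustrative. pgvector's `<=>` operator returns cosine distance, so similarity is 1 minus that distance.

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from environment variables

interface RetrievedChunk {
  videoId: string;
  startSeconds: number;
  content: string;
  similarity: number; // 1 - cosine distance, higher is closer
}

// Fetch a wide candidate set by cosine similarity; MMR re-ranking and the
// minimum-similarity check happen afterwards in application code.
async function searchChunks(queryEmbedding: number[], limit = 20): Promise<RetrievedChunk[]> {
  const { rows } = await pool.query(
    `SELECT video_id, start_seconds, content,
            1 - (embedding <=> $1::vector) AS similarity
       FROM chunks
      ORDER BY embedding <=> $1::vector
      LIMIT $2`,
    [`[${queryEmbedding.join(",")}]`, limit]
  );
  return rows.map((r) => ({
    videoId: r.video_id,
    startSeconds: r.start_seconds,
    content: r.content,
    similarity: r.similarity,
  }));
}
```

Pulling a wider candidate set than the final answer needs (20 here) leaves the re-ranking step something to diversify over.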
That third step matters more than it sounds. The naive approach returns the single most similar chunk, then the second-most (often from the same video), then the third. You get lopsided answers—three chunks from the water heater video, zero from elsewhere. MMR balances relevance with diversity, ensuring users get perspectives from multiple videos rather than myopic answers from one source.
The system also respects a minimum similarity threshold. If the highest-scoring chunk scores below ~0.6, the system admits it doesn't know rather than returning a weak match. Confidence thresholds maintain trust.
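A sketch of the re-ranking and the confidence check, assuming candidates arrive sorted by query similarity (as returned by the search above) with their stored embeddings fetched alongside them; it reuses the `cosine()` helper from the chunking sketch, and the lambda value is illustrative.

```typescript
interface Candidate {
  videoId: string;
  startSeconds: number;
  content: string;
  similarity: number;  // similarity to the query, computed by the search step
  embedding: number[]; // the chunk's stored embedding, used for the diversity term
}

const MIN_SIMILARITY = 0.6; // below this, answer "I don't know" instead of guessing

// Maximum marginal relevance: trade off similarity to the query against
// similarity to chunks already picked, so answers draw on multiple videos.
function rerank(candidates: Candidate[], k = 5, lambda = 0.7): Candidate[] {
  // Candidates are sorted by similarity; if even the best one is weak, bail out.
  if (candidates.length === 0 || candidates[0].similarity < MIN_SIMILARITY) return [];

  const selected: Candidate[] = [];
  const remaining = [...candidates];

  while (selected.length < k && remaining.length > 0) {
    let bestIndex = 0;
    let bestScore = -Infinity;

    remaining.forEach((candidate, i) => {
      const redundancy = selected.length
        ? Math.max(...selected.map((s) => cosine(candidate.embedding, s.embedding)))
        : 0;
      const score = lambda * candidate.similarity - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIndex = i;
      }
    });

    selected.push(remaining.splice(bestIndex, 1)[0]);
  }
  return selected;
}
```

A lambda of 1.0 would reduce this to plain similarity ranking; lower values push harder for diversity across videos.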
Response generation
Retrieval returns the top N chunks (usually 3 to 5, depending on thresholds). These chunks are fed to OpenAI's API as context; the sketch after this list shows the shape of that call. The model is instructed to:
- Answer only based on provided context
- Cite which video(s) the answer comes from
- Be honest if context doesn't fully answer the question
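Putting those instructions together, a sketch of the generation step using the official openai Node SDK; the model name, system prompt wording, and context formatting are illustrative rather than the production values.

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// `chunks` are the re-ranked results; `history` is the prior conversation.
async function* answer(
  question: string,
  chunks: { videoTitle: string; startSeconds: number; content: string }[],
  history: { role: "user" | "assistant"; content: string }[]
) {
  const context = chunks
    .map((c) => `[${c.videoTitle} @ ${c.startSeconds}s]\n${c.content}`)
    .join("\n\n");

  const stream = await openai.chat.completions.create({
    model: "gpt-4o-mini", // illustrative model choice
    stream: true,
    messages: [
      {
        role: "system",
        content:
          "Answer only from the provided video transcript excerpts. " +
          "Cite which video(s) you used. If the excerpts do not fully " +
          "answer the question, say so.",
      },
      ...history, // conversation memory lives in the prompt, not in retrieval
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });

  for await (const part of stream) {
    const token = part.choices[0]?.delta?.content;
    if (token) yield token; // forwarded to the client as it arrives
  }
}
```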
Responses stream to the user in real time. With each message, we attach source metadata: video ID, title, timecode, and a deep link to the exact moment.
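That metadata can be a handful of fields plus a timestamped URL; this sketch assumes YouTube- and Vimeo-style timestamp parameters, which both players support, and the field names are illustrative.

```typescript
interface VideoSource {
  videoId: string;
  title: string;
  startSeconds: number;
  url: string; // deep link that starts playback at the chunk's timecode
}

// Build a provider-appropriate link to the exact moment in the video.
function deepLink(provider: "youtube" | "vimeo", videoId: string, startSeconds: number): string {
  const t = Math.floor(startSeconds);
  return provider === "youtube"
    ? `https://www.youtube.com/watch?v=${videoId}&t=${t}s`
    : `https://vimeo.com/${videoId}#t=${t}s`;
}
```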
Technical Decisions
Gemini for embedding, OpenAI for generation. Gemini's embedding model is solid and cost-effective. OpenAI's language model is more reliable for coherent, grounded responses. This meant integrating two API providers, which adds minor operational complexity, but the results justify it.
No fine-tuning. Both models run as-is. Fine-tuning requires labeled training data, which I didn't have. The system works well enough out of the box that chasing 3-5% improvements didn't justify the effort or cost.
Serverless database. Using Supabase (serverless Postgres) instead of self-hosting. Elastic scaling, no infrastructure management. Trade-off: queries can incur cold-start latency, but that's acceptable for this use case.
Conversation history separate from retrieval. Every message gets stored in the database, but retrieval only looks at the current query, not full conversation history. This is deliberate—including full history in retrieval introduces noise and hallucinations. The model can reference earlier parts of the conversation through its context window.
Next Steps to Level Up
I'd like to experiment with more aggressive semantic chunking. The current approach works, but smaller, more focused chunks would improve specificity.
I'd also invest more in prompt engineering. The system's default instructions are relatively generic. Tailoring them to the LTV context (emphasizing safety, model-specific answers, and so on) would make responses more useful.
The MMR re-ranking works, but tuning is somewhat arbitrary. A/B testing different lambda values and threshold settings would dial in the optimal balance between relevance and diversity.
Why This Matters
This project is fundamentally about respecting user time and attention. Customers don't want to sift through a library. They want an answer. They want to verify it. They want to move on.
The AI part is the mechanism—the efficient way to find the right video in a haystack of 115. The real value is that a customer with a problem gets a grounded, verified answer in seconds and can jump straight to the source to see it in context.