June 19, 2026

How to build a RAG application: a practical guide from a production build

Artem Panasiuk

Chief of Delivery at Brocoders

A buyer is comparing compressor specs at midnight. They type a question into a chat box on the site, and a few seconds later they get a precise answer with a link to the exact spec sheet it came from.

The support team never sees the question, the buyer gets the answer at the moment they need it, and they stay on the site instead of leaving for a search engine. That assistant is AskAC.ai, a Retrieval-Augmented Generation (RAG) application we built for Compressor World. We indexed 4,090 documents and shipped it to production. Wiring a model to a vector store and getting a passable demo takes a weekend. Getting an assistant that holds up in production is a different job, and most of that job lives in three places: retrieval quality, the grounding constraint, and the pipeline that keeps your index fresh.

This guide walks through how to build a RAG application end to end, and shows the decisions we actually made on a real build.

TL;DR: A RAG application retrieves relevant chunks from your own documents and feeds them to a language model so answers stay grounded in your data. The pipeline has seven stages: ingest, chunk, embed, store, retrieve, generate a grounded answer, and cite and refresh. The demo is easy. Production quality depends on retrieval, the rule that the model answers only from retrieved context, and an ingestion loop that keeps the index current. We proved this building AskAC.ai on 4,090 documents.

Table of Contents

What a RAG application actually is
RAG architecture: the seven stages
How to build a RAG application step by step
Proof: how we built AskAC.ai on 4,090 documents
Going beyond Q&A: hybrid search and an action layer
Where RAG builds break in production
When this architecture is worth building
Your RAG build checklist
Frequently asked questions

What a RAG application actually is

A RAG application answers questions using your documents instead of the model's general training. It retrieves the most relevant passages from a knowledge base you control, then asks a language model to write an answer from those passages.

This matters because a general-purpose chatbot pulls from the open web and its training data. For industrial equipment, medical devices, or regulated products, a wrong number has real cost. RAG grounds the output in verifiable source material, which reduces hallucination and improves factual accuracy (IBM). The model still cannot be made error-proof, but constraining what it draws from means an answer can be checked against a specific source document.

The business case is straightforward. Your proprietary data is the part of your stack a competitor cannot copy, and RAG is how you give a model access to it without retraining anything.

RAG architecture: the seven stages

Every production RAG application moves through the same seven stages. Understanding them is the difference between a demo that impresses and a system that survives real traffic.

  1. Document ingestion: parse source files (PDF, DOCX, CSV, spec sheets) from wherever they already live.
  2. Chunking: split each document into focused passages a model can reason over.
  3. Embedding: convert each chunk into a vector with an embedding model.
  4. Storage: store those vectors in an index built for semantic retrieval.
  5. Retrieval: embed the user's query and pull the most relevant chunks.
  6. Grounded generation: the model writes an answer using only the retrieved chunks.
  7. Citations and refresh: link every answer to its source, and re-index on a schedule so the knowledge base stays current.

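The first four stages run ahead of time and build your index. Retrieval, grounded generation, and citations run on every query, while the refresh in stage seven reruns the indexing stages on a schedule. Keeping the indexing and query pipelines separate is one of the choices that makes a build production-ready rather than a prototype.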
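
To make the split concrete, here is a minimal sketch of the two pipelines as two functions that share only the index. It is written in Python for brevity and is not the code from our build. The names `Entry`, `build_index`, and `answer` are hypothetical, and the chunking, embedding, ranking, and generation steps are passed in as plain functions so that any library can sit behind them.

```python
from dataclasses import dataclass
from typing import Callable

Vector = list[float]


@dataclass
class Entry:
    source: str      # document the chunk came from, kept for citations
    text: str
    vector: Vector


def build_index(
    docs: dict[str, str],
    chunk: Callable[[str], list[str]],
    embed: Callable[[str], Vector],
) -> list[Entry]:
    """Offline pipeline: ingest, chunk, embed, store (stages 1 to 4)."""
    return [
        Entry(source, piece, embed(piece))
        for source, text in docs.items()
        for piece in chunk(text)
    ]


def answer(
    question: str,
    index: list[Entry],
    embed: Callable[[str], Vector],
    rank: Callable[[Vector, list[Entry]], list[Entry]],
    generate: Callable[[str, list[str]], str],
    top_k: int = 3,
) -> dict:
    """Online pipeline: retrieve, generate, cite (stages 5 to 7)."""
    hits = rank(embed(question), index)[:top_k]
    return {
        "answer": generate(question, [hit.text for hit in hits]),
        "sources": sorted({hit.source for hit in hits}),
    }
```

Because `answer` only reads the index, a scheduled refresh can call `build_index` again and swap the result in without touching the query path.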

How to build a RAG application step by step

Here is how to build a RAG system in practice, with the decisions that matter most at each step.

Step 1: Prepare your documents. Collect the source material and parse it into clean text. We use LlamaIndex to handle PDFs, CSVs, and Excel files, since those make up the bulk of most product documentation. Image-heavy PDFs such as scanned manuals need an extra OCR layer, which adds processing time but is not otherwise a problem.

Step 2: Chunk for retrieval, not for storage. Split documents into passages small enough that a single chunk carries one coherent idea. Start with page-level chunking for accuracy and adjust from there. Chunks that are too large dilute the match; chunks that are too small lose context.
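
As an illustration of page-level chunking, the sketch below keeps one chunk per page and only splits a page when it exceeds a size limit. It assumes the parser has already produced a list of page texts. `Chunk`, `chunk_by_page`, and the `max_chars` default are hypothetical choices for this example, not settings from our build.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    page: int      # 1-based page number, kept so answers can cite the page
    text: str


def chunk_by_page(doc_id: str, pages: list[str], max_chars: int = 2000) -> list[Chunk]:
    """One chunk per page; pages longer than max_chars are split on blank lines."""
    chunks: list[Chunk] = []
    for page_no, page_text in enumerate(pages, start=1):
        buffer = ""
        for para in (p.strip() for p in page_text.split("\n\n")):
            if not para:
                continue
            # Start a new chunk when adding this paragraph would exceed the limit.
            # A single paragraph longer than max_chars is kept whole.
            if buffer and len(buffer) + len(para) + 2 > max_chars:
                chunks.append(Chunk(doc_id, page_no, buffer))
                buffer = para
            else:
                buffer = f"{buffer}\n\n{para}" if buffer else para
        if buffer:
            chunks.append(Chunk(doc_id, page_no, buffer))
    return chunks
```

Splitting on paragraph boundaries rather than at a fixed character offset keeps each chunk readable on its own, which is the property the retrieval step depends on.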

Step 3: Generate embeddings. Run each chunk through an embedding model to produce a vector. Every time you change your embedding model you re-embed the whole corpus, which costs money, so pick deliberately.
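
One way to keep a model change manageable is to record which embedding model produced each stored vector, since vectors from different models should not be compared with each other. The sketch below is a hypothetical bookkeeping helper, not part of any library: `StoredVector`, `stale_chunk_ids`, and the model names in the usage example are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class StoredVector:
    chunk_id: str
    model: str            # embedding model that produced `vector`
    vector: list[float]


def stale_chunk_ids(records: list[StoredVector], current_model: str) -> list[str]:
    """Return the chunks whose vectors came from a different embedding model."""
    return [r.chunk_id for r in records if r.model != current_model]


# Usage: after switching from "embed-v1" to "embed-v2" (hypothetical names),
# everything still tagged "embed-v1" has to be re-embedded.
records = [
    StoredVector("manual-p1", "embed-v1", [0.1, 0.2]),
    StoredVector("manual-p2", "embed-v2", [0.3, 0.4]),
]
to_redo = stale_chunk_ids(records, current_model="embed-v2")
```

The length of that list, multiplied by your per-chunk embedding price, is the cost of the switch.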

Step 4: Store the vectors. Insert the embedded chunks into a vector index. A query for "how do I size a compressor for sandblasting" then matches the right spec sections by meaning rather than exact wording.
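
Matching by meaning comes down to comparing vectors. The sketch below is a toy in-memory index that ranks chunks by cosine similarity. A production build would use a real vector store instead, and the three-dimensional vectors in the usage example are invented values standing in for real embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VectorIndex:
    """Brute-force index: fine for a sketch, too slow for a large corpus."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, chunk_id: str, vector: list[float]) -> None:
        self._items.append((chunk_id, vector))

    def search(self, query_vector: list[float], top_k: int = 3) -> list[tuple[str, float]]:
        scored = [(cid, cosine(query_vector, vec)) for cid, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


# Usage with made-up vectors.
index = VectorIndex()
index.add("sizing", [0.9, 0.1, 0.0])
index.add("wiring", [0.0, 0.2, 0.9])
index.add("warranty", [0.1, 0.9, 0.1])
results = index.search([1.0, 0.2, 0.0], top_k=2)
```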

Step 5: Retrieve on each query. Embed the incoming question and pull the top relevant chunks. For precise domains, combine semantic vector search with keyword search so exact part numbers and model names match cleanly.
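
There are several ways to merge a semantic ranking with a keyword ranking. The sketch below uses reciprocal rank fusion, a common and simple option. It is shown as one possibility and is not a description of how our build combines the two. The chunk ids are hypothetical, and `k = 60` is the conventional default for this method.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk ids; each list adds 1 / (k + rank) per id."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


# Usage: one list from vector search, one from keyword search.
vector_hits = ["sizing-guide", "cfm-table", "model-spec"]
keyword_hits = ["model-spec", "sizing-guide", "parts-list"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

A chunk that appears in both lists rises above one that ranks well in only one of them, which is the behavior you want when a query mixes a description with an exact part number.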

Step 6: Generate a grounded answer. Send the retrieved chunks plus the question to the model and instruct it to answer only from that context. If the answer is not in the retrieved material, the assistant says so rather than inventing one.
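
The grounding rule is ultimately a prompt plus a fixed refusal. The sketch below assembles such a prompt. The wording, the `REFUSAL` text, and the chunk dictionary shape (`source`, `text`) are assumptions made for this example, and real prompts need tuning against your own queries.

```python
REFUSAL = "I could not find that in the documentation."


def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Build a prompt that limits the model to the numbered context passages."""
    context = "\n\n".join(
        f"[{i}] (source: {c['source']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered context below.\n"
        "Cite the context numbers you used, like [1].\n"
        f'If the context does not contain the answer, reply exactly: "{REFUSAL}"\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Numbering the passages gives the model something concrete to cite, and a fixed refusal string is easy to detect in logs when you want to count unanswered questions.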

Step 7: Cite and refresh. Return a link to the source document with every answer, and schedule the index to rebuild so new and corrected documents flow in automatically.
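
A scheduled refresh does not have to reprocess every file. One option, sketched below, is to store a content hash per document and compare it on each cycle, so only new, changed, and deleted files are touched. This is an illustration of the idea rather than the refresh logic from our build, and `plan_refresh` and its dictionary shapes are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def plan_refresh(indexed: dict[str, str], current: dict[str, bytes]) -> dict[str, list[str]]:
    """Compare stored hashes (path -> hash) with current files (path -> bytes)."""
    plan: dict[str, list[str]] = {"add": [], "update": [], "delete": []}
    for path, data in current.items():
        if path not in indexed:
            plan["add"].append(path)
        elif indexed[path] != content_hash(data):
            plan["update"].append(path)
    # Anything indexed but no longer present should leave the index too.
    plan["delete"] = [path for path in indexed if path not in current]
    return plan
```

Deletions matter as much as additions: a withdrawn spec sheet that stays in the index keeps getting cited.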

Proof: how we built AskAC.ai on 4,090 documents

Compressor World has sold industrial compressors for over 20 years. Their catalog covers rotary screw and reciprocating compressors, dryers, filters, and accessories from brands like Quincy, Atlas Copco, and Ingersoll Rand. Their support and sales team spent the day answering documented questions: which compressor for a 200 CFM application, where the manual lives, how to wire 208V to a unit. Answers existed, buried inside thousands of PDFs and spec sheets.

We built AskAC.ai, a source-grounded assistant on a RAG architecture, and launched it to production with 4,090 documents indexed for semantic retrieval.

The stack we chose for reliability and the demands of a document-heavy application:

  • NestJS for the API, the document indexing pipeline, and admin endpoints
  • Next.js for the public chat page, the embeddable widget, and the admin interface
  • PostgreSQL for user records, query logs, and per-account request limits
  • AWS S3 for PDF and manual storage
  • LlamaIndex for parsing, chunking, and semantic retrieval
  • OpenAI GPT-4o for generating answers from retrieved context

Every answer carries a citation back to the specific source document, so a mechanic can verify a wiring diagram and the team can audit accuracy over time. The indexing pipeline runs on a schedule: when Compressor World updates a spec sheet in Google Drive, the assistant picks up the new content on the next cycle with no manual re-indexing. For a product recall or a corrected spec, the pipeline can be triggered immediately.

We also built lead capture in from the start. After answering a technical query, AskAC.ai surfaces a contextual follow-up like "Want a quote for this unit?" at the moment a buyer has confirmed a part fits. Initial Q&A testing against real user queries showed high answer accuracy, and post-launch feedback on tone and length is informing prompt work for the next phase.

A small team carried it: a PM/BA for coordination, backend and frontend developers, and an AI engineer on index architecture, semantic search, and prompt tuning. You can read the full build in our write-up on the AI assistant for technical documentation.

Going beyond Q&A: hybrid search and an action layer

Basic RAG answers questions. The advanced version retrieves better and then does something with the answer. We productized that pattern in Bridge, our platform for deploying grounded AI agents.

Two capabilities separate an advanced build from a basic one:

  • Hybrid search: Bridge combines vector search for semantic understanding with keyword search for precise matching, so a query lands on the right document by meaning and by exact term. Grounded generation keeps every response tied to the provided dataset, with links to the source documents used.
  • An action layer: Bridge is built on the Model Context Protocol (MCP), which lets an agent read real-time data and execute changes. An agent can check a CRM for lead status, then book a meeting or update an order, with human approval required on critical actions. Our write-up on MCP AI agents for SaaS goes deeper on this pattern.
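
The approval requirement on critical actions can be pictured as a small gate in front of the action handlers. The sketch below is plain Python and uses neither MCP nor Bridge's actual API. `Action`, `ActionLayer`, and the action names are hypothetical and exist only to show the control flow.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    name: str
    handler: Callable[[dict], str]
    critical: bool = False     # critical actions wait for a human


@dataclass
class ActionLayer:
    actions: dict[str, Action] = field(default_factory=dict)
    pending: list[tuple[str, dict]] = field(default_factory=list)

    def register(self, action: Action) -> None:
        self.actions[action.name] = action

    def request(self, name: str, args: dict) -> str:
        """Run read-style actions at once; queue critical ones for approval."""
        action = self.actions[name]
        if action.critical:
            self.pending.append((name, args))
            return "pending_approval"
        return action.handler(args)

    def approve(self, index: int) -> str:
        """Called by a human reviewer; only now does the handler run."""
        name, args = self.pending.pop(index)
        return self.actions[name].handler(args)
```

The point of the design is that the agent can ask for anything, but a write to an order or a calendar happens only inside `approve`.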

Bridge stays vendor-agnostic, so you can switch the underlying model between OpenAI, Anthropic, Google, or open-source Llama based on cost and performance (OpenAI alternatives). For automated support built on internal documentation, the platform targets a 40 to 60% reduction in ticket volume.

Where RAG builds break in production

Most RAG demos look great and then disappoint in production. The failure points are predictable.

Retrieval is the usual culprit. If the system pulls the wrong chunks, the model writes a confident answer from bad context, and citations make that error look trustworthy. Chunking strategy and hybrid retrieval are where you fix this, and they deserve real attention early.
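
Retrieval quality is easier to fix when it is measured separately from answer quality. A minimal sketch, assuming you have a small set of test questions labeled with the chunk that should be retrieved: compute the share of questions whose expected chunk shows up in the top k results. The function name and data shapes are hypothetical.

```python
def hit_rate_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int = 5) -> float:
    """Share of questions whose expected chunk id is in the top k retrieved ids."""
    if not expected:
        return 0.0
    hits = sum(
        1 for question, chunk_id in expected.items()
        if chunk_id in results.get(question, [])[:k]
    )
    return hits / len(expected)
```

Rerunning this after every chunking or retrieval change shows whether the change helped before any model output is involved.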

A stale index is the second. A pipeline that indexes once and never updates slowly drifts away from the truth as documents change. Schedule the refresh and let the system maintain itself.

Skipping the grounding constraint is the third. If the model is allowed to fall back on general knowledge when retrieval comes up empty, you have rebuilt a generic chatbot with extra steps. The assistant should say it does not know.
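
One way to enforce this in code, instead of relying on the prompt alone, is to refuse before the model is ever called when nothing relevant was retrieved. The sketch below assumes retrieval returns similarity scores between 0 and 1. The `min_score` cutoff of 0.75 is an arbitrary placeholder, and `generate` stands in for whatever function calls your model.

```python
from typing import Callable

NO_ANSWER = "I don't know. The documentation I have does not cover that."


def answer_or_refuse(
    question: str,
    hits: list[tuple[str, float]],               # (chunk text, similarity score)
    generate: Callable[[str, list[str]], str],   # wraps the model call
    min_score: float = 0.75,                     # assumed cutoff; tune on real queries
) -> str:
    usable = [text for text, score in hits if score >= min_score]
    if not usable:
        # No context means no model call, so there is no general-knowledge fallback.
        return NO_ANSWER
    return generate(question, usable)
```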

Security is the fourth, and it is easy to defer. You are pointing a model at proprietary documents, so access control, data isolation, and encryption belong in the design, not in a later sprint. Our notes on SaaS security cover the baseline.
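
For retrieval, access control usually means filtering chunks by who is asking before anything is ranked, so restricted text never reaches the prompt. A minimal sketch, assuming each chunk is tagged with a tenant and a set of allowed roles at indexing time. The field names and role names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    chunk_id: str
    tenant_id: str
    allowed_roles: frozenset[str]


def visible_chunks(chunks: list[IndexedChunk], tenant_id: str, roles: set[str]) -> list[IndexedChunk]:
    """Keep chunks from the caller's tenant that at least one of their roles may read."""
    return [
        c for c in chunks
        if c.tenant_id == tenant_id and not c.allowed_roles.isdisjoint(roles)
    ]
```

Filtering after generation is too late: once a restricted passage is in the context, the model can paraphrase it into the answer.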

When this architecture is worth building

A source-grounded RAG application earns its cost when several conditions hold at once:

  • Your support load is documentation queries. If most tickets are "where is X" or "what is the spec for Y," you are paying people to be a search engine.
  • You already have documentation. Spec sheets, manuals, FAQs, and compatibility guides are the raw material. Thin or scattered docs mean content work comes first.
  • Accuracy carries consequences. Regulated or technical domains are where the citation layer pays for itself.
  • You lose visitors to competitor search. An on-site assistant keeps buyers on your page and captures intent before they leave.

When three or four of these apply, the build is worth scoping. When fewer than two apply, better site search will cover most of the gap at lower cost.

Your RAG build checklist

Before you write a line of code, run through this:

  • Inventory your documents and confirm they are clean enough to parse
  • Decide your chunking strategy and start page-level
  • Choose an embedding model deliberately, since switching means re-embedding everything
  • Plan hybrid retrieval from the start if precise matching matters
  • Write the grounding rule explicitly: answer only from retrieved context, otherwise say so
  • Add a citation layer so every answer is verifiable and auditable
  • Schedule index refresh, with a manual trigger for urgent updates
  • Design access control, data isolation, and encryption into the architecture
  • Ship an MVP first, then layer in billing, integrations, and advanced features

If you are scoping a build like this, our AI development and integration services team can help you size the work and the stack. We have shipped this pattern to production and can tell you where the effort really goes.

Frequently asked questions

What is a RAG application?

A RAG application retrieves relevant passages from a defined set of documents and uses a language model to generate an answer from them, with a citation to the source. The model answers from your data rather than from its general training, which keeps responses grounded and verifiable.

How do you build a RAG application step by step?

You ingest and parse your documents, chunk them into focused passages, embed each chunk into a vector, store the vectors in a searchable index, retrieve the most relevant chunks for each query, generate an answer constrained to those chunks, and schedule re-indexing so the knowledge base stays current.

How is RAG different from fine-tuning a model?

RAG retrieves from your live document set at query time, so updating the assistant means updating documents. Fine-tuning bakes patterns into the model weights and requires retraining to change. For knowledge that changes often, RAG is cheaper to keep current and easier to audit.

How do you stop a RAG application from hallucinating?

Constrain generation to the retrieved chunks, instruct the model to say it does not know when the answer is not present, and attach a citation to every response. RAG reduces hallucination by anchoring answers to trusted sources, though it does not eliminate the risk entirely.

How much does it cost and how long does it take to build?

Cost and timeline depend on corpus size, documentation quality, and the number of integrations. Embedding and vector storage are ongoing costs that scale with your data volume. We structure builds in phases, and a working MVP with a real document corpus is typically achievable within a few months.

What do you need before building one?

Existing documentation worth indexing, a domain where accuracy matters, and a support or search problem that is mostly documentation queries. With those in place, the raw material for a RAG build is ready.
