
Building a RAG Pipeline from Scratch

LLMs Are Confident Liars

Ask an LLM to cite a source and it will invent a plausible-sounding paper that doesn't exist. Ask it about your company's internal docs and it will hallucinate policies with the unwavering confidence of a toddler explaining how the moon works.

The root problem: an LLM's knowledge is frozen at training time. It has no memory of your data, no way to verify its claims, and no mechanism to distinguish "I know this" from "I'm pattern-matching and hoping for the best." It's a brilliant improviser with no fact-checker.

Retrieval-Augmented Generation (RAG) is the fix. Instead of hoping the model memorized the answer somewhere in its billions of parameters, we give it the answer — retrieved from your actual documents — and ask it to synthesize a response grounded in real evidence.

In this post, we'll build a complete RAG pipeline from scratch in Python. No LangChain, no LlamaIndex — just raw code so you understand every piece. By the end, you'll have a working system that chunks documents, embeds them, retrieves relevant passages, and generates grounded answers.

The Pipeline at 10,000 Feet

Every RAG system follows the same five-step pattern. The first three happen at ingestion time (when you load your documents). The last two happen at query time (when someone asks a question).

Documents → Chunk → Embed → Store          (ingestion)
Query → Retrieve → Augment → Generate      (query)

If you've read our earlier posts, some of this will feel familiar. In the embeddings post, we saw how training on word co-occurrence produces vectors where similar meanings cluster together. In the vector search benchmarks, we showed that brute-force NumPy handles 100K vectors in milliseconds. RAG ties these ideas together into a practical system.

Let's build each piece.

Step 1: Chunking Documents

You can't embed an entire 50-page document as one vector — the embedding would dilute every topic into a single blurry point. Instead, we split documents into chunks: passages small enough to carry specific meaning, but large enough to preserve context.

The industry default is recursive character splitting. It tries to split on paragraph breaks first (\n\n), falls back to line breaks (\n), then sentences (. ), then words. This preserves natural document structure where possible.

Here's a from-scratch implementation:

def chunk_text(text, max_chars=1000, overlap=200):
    """Split text into overlapping chunks using recursive separators."""
    separators = ["\n\n", "\n", ". ", " "]
    chunks = []

    def flush(chunk, sep_idx):
        # If a chunk is still too big, retry it with a finer separator;
        # this catches single parts that exceed max_chars on their own
        if len(chunk) > max_chars and sep_idx + 1 < len(separators):
            split_recursive(chunk, sep_idx + 1)
        elif chunk.strip():
            chunks.append(chunk.strip())

    def split_recursive(text, sep_idx=0):
        # Base case: text fits in one chunk
        if len(text) <= max_chars:
            if text.strip():
                chunks.append(text.strip())
            return

        # Try current separator
        sep = separators[sep_idx]
        parts = text.split(sep)

        current_chunk = ""
        for part in parts:
            candidate = current_chunk + sep + part if current_chunk else part
            if len(candidate) > max_chars and current_chunk:
                flush(current_chunk, sep_idx)
                # Start next chunk with overlap from end of current
                overlap_text = current_chunk[-overlap:] if overlap else ""
                current_chunk = overlap_text + sep + part
            else:
                current_chunk = candidate

        flush(current_chunk, sep_idx)

    split_recursive(text)
    return chunks

Two parameters matter most:

- max_chars caps chunk size. Too large and the embedding blurs several topics into one vector; too small and chunks lose the context needed to stand alone.
- overlap repeats the tail of each chunk at the start of the next, so facts that straddle a chunk boundary survive intact in at least one chunk.

Let's test it on a real passage:

sample = """Solar panels convert sunlight into electricity through the
photovoltaic effect. When photons hit silicon cells, they knock electrons
loose, creating an electrical current.

Installation requires careful roof assessment. South-facing roofs with
15-40 degree pitch are ideal in the northern hemisphere. Shading from
trees or neighboring buildings can reduce output by 10-25%.

A typical residential system is 6-10 kW, requiring 15-25 panels. At
average US electricity rates, payback period is 6-10 years. Federal tax
credits currently cover 30% of installation costs."""

chunks = chunk_text(sample, max_chars=300, overlap=60)
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i} ({len(chunk)} chars) ---")
    print(chunk[:80] + "..." if len(chunk) > 80 else chunk)
    print()
--- Chunk 0 (175 chars) ---
Solar panels convert sunlight into electricity through the
photovoltaic effect. ...

--- Chunk 1 (248 chars) ---
they knock electrons
loose, creating an electrical current.

Installation requi...

--- Chunk 2 (268 chars) ---
trees or neighboring buildings can reduce output by 10-25%.

A typical residenti...

Notice how each chunk after the first starts with the tail end of the previous one: that's the overlap doing its job. If someone asks how shading affects output, both Chunk 1 and Chunk 2 carry relevant text.

Step 2: Embedding Chunks

Next, we convert each text chunk into a dense vector that captures its meaning. As we explored in the embeddings post, words (and passages) that appear in similar contexts end up as nearby points in vector space. A question about "solar panel installation" will land close to chunks about solar panel installation — even if they use different words like "PV system setup."

For a local, free embedding model, all-MiniLM-L6-v2 from Sentence Transformers is the workhorse. It produces 384-dimensional vectors and runs on CPU in milliseconds:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed all chunks at once — returns shape (N, 384)
embeddings = model.encode(chunks, normalize_embeddings=True)

print(f"Embedded {len(chunks)} chunks → {embeddings.shape}")
# Embedded 3 chunks → (3, 384)

The normalize_embeddings=True flag is important — it L2-normalizes each vector so cosine similarity reduces to a simple dot product. This makes retrieval a single matrix multiplication.
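To see why normalization buys this, here's a tiny check with made-up 3-dimensional vectors standing in for real 384-dim embeddings:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 0.5])

# Cosine similarity the long way
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize first, and it collapses to a plain dot product
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot = a_unit @ b_unit

assert np.isclose(cosine, dot)
```

Normalizing once at embedding time means every later query costs only one matrix multiply, with no per-query norm computations.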

How does this compare to other models?

Model                    Dims  Speed      Quality    Cost
all-MiniLM-L6-v2          384  Very fast  Good       Free (local)
nomic-embed-text          768  Fast       Very good  Free (local)
text-embedding-3-small   1536  API call   Very good  $0.02/1M tokens
text-embedding-3-large   3072  API call   Excellent  $0.13/1M tokens

For prototyping, all-MiniLM-L6-v2 is perfect. For production with large document collections, consider OpenAI's v3 models — they support Matryoshka embeddings, meaning you can truncate a 3072-dim vector to 256 dims and still outperform older models at full dimensionality.
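The mechanics of that truncation are simple, sketched here as a hypothetical helper (not part of any SDK): slice the vector, then re-normalize, since truncation breaks unit length and would otherwise corrupt cosine math.

```python
import numpy as np

def truncate_embedding(vec, dims):
    """Truncate a Matryoshka-style embedding to its first `dims` values
    and re-normalize so dot products remain cosine similarities."""
    v = np.asarray(vec, dtype=float)[:dims]
    return v / np.linalg.norm(v)

# e.g. shrink a stand-in 3072-dim vector down to 256 dims
small = truncate_embedding(np.random.rand(3072), 256)
```

The payoff is a 12x smaller index with only a modest quality hit, because Matryoshka-trained models pack the most important information into the leading dimensions.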

Step 3: Store and Retrieve

With chunks embedded, we need to find which chunks are most relevant to a given question. As the vector search benchmarks showed, brute-force NumPy handles small-to-medium datasets beautifully — no FAISS, no database, just matrix math:

def retrieve(query, chunks, embeddings, model, k=3):
    """Find the k most relevant chunks for a query."""
    # Embed the query with the same model
    query_vec = model.encode([query], normalize_embeddings=True)

    # Dot product against all chunk embeddings (cosine sim for normalized vecs)
    scores = (embeddings @ query_vec.T).squeeze()

    # Get top-k indices
    top_k = np.argsort(-scores)[:k]

    return [(chunks[i], float(scores[i])) for i in top_k]

That's it. Three lines of math. Since we normalized our embeddings, the dot product is cosine similarity — no extra computation needed.

Let's test retrieval with a real question:

results = retrieve(
    "How long until solar panels pay for themselves?",
    chunks, embeddings, model, k=2
)

for chunk, score in results:
    print(f"Score: {score:.3f}")
    print(chunk[:100] + "...")
    print()
Score: 0.612
A typical residential system is 6-10 kW, requiring 15-25 panels. At
average US electricity rates, pa...

Score: 0.384
Solar panels convert sunlight into electricity through the
photovoltaic effect. When photons hit sili...

The retriever correctly surfaces the chunk about payback period and costs, even though the query says "pay for themselves" while the document says "payback period." Embeddings handle this semantic equivalence automatically.

When to graduate from NumPy: If your chunk count exceeds ~50K, or you need metadata filtering, consider FAISS or ChromaDB. For most RAG prototypes and even many production systems, brute-force search is more than enough. See the vector search benchmarks for exact numbers.
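Before graduating, there's one cheap NumPy speedup worth knowing: np.argsort sorts all N scores even though we only need k of them. np.argpartition does an O(N) partial selection instead, so only the k winners get sorted (a sketch, assuming k < len(scores)):

```python
import numpy as np

def top_k_indices(scores, k):
    """Top-k indices by score without fully sorting all N entries.
    np.argpartition pulls the k best to the front in O(N);
    the final argsort runs over only those k candidates."""
    candidates = np.argpartition(-scores, k)[:k]
    return candidates[np.argsort(-scores[candidates])]

scores = np.array([0.1, 0.9, 0.5, 0.7])
print(top_k_indices(scores, 2))  # → [1 3]
```

At a few thousand chunks the difference is negligible; at hundreds of thousands it's the difference between milliseconds and tens of milliseconds per query.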

Step 4: Augmented Generation

This is where the "G" in RAG earns its keep. We take the retrieved chunks, stuff them into a prompt, and ask the LLM to answer only from the provided context.

The prompt template is the most important piece of engineering in the entire pipeline:

def build_rag_prompt(question, retrieved_chunks):
    """Assemble retrieved context into a grounded prompt."""
    context_block = "\n\n".join(
        f"[Document {i+1}]\n{chunk}"
        for i, (chunk, score) in enumerate(retrieved_chunks)
    )

    return f"""You are a helpful assistant that answers questions using
ONLY the provided context documents. Follow these rules strictly:

1. Answer based ONLY on the context below. Do not use prior knowledge.
2. If the context doesn't contain enough information, say:
   "I don't have enough information to answer this question."
3. Cite which document(s) you used, e.g. [Document 1].
4. Keep answers concise and direct.

Context:
{context_block}

Question: {question}

Answer:"""

Two design choices here are critical:

- The explicit refusal option ("I don't have enough information...") gives the model a sanctioned way out, so it isn't pressured into guessing when retrieval comes up empty.
- The citation requirement ([Document 1]) makes every answer auditable: you can trace each claim back to the exact chunk that supports it.

The "Lost in the Middle" Problem

Research by Liu et al. (2023) found that LLMs exhibit a U-shaped attention pattern: they focus heavily on information at the beginning and end of the context window, but underweight text in the middle. This means if you retrieve 10 chunks, the model may effectively ignore chunks 4–7.

The fix is simple: retrieve fewer, better chunks. Three to five highly relevant passages beat fifteen sort-of-relevant ones. If you must include more, put the most relevant chunks first and last.
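That last tactic can be implemented with a hypothetical helper like this (my own sketch): take chunks already sorted by relevance and interleave them so the strongest land at the edges of the context.

```python
def reorder_for_attention(chunks_by_relevance):
    """Place the most relevant chunks at the start and end of the context,
    pushing the weakest toward the middle (a lost-in-the-middle mitigation).
    Input must be sorted best-first."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Relevance ranks 1 (best) to 5 (worst): best ends up first, second-best last
print(reorder_for_attention([1, 2, 3, 4, 5]))  # → [1, 3, 5, 4, 2]
```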

The Full Pipeline: Putting It Together

Let's assemble everything into a single class you can copy, paste, and run:

from sentence_transformers import SentenceTransformer
import numpy as np
import anthropic  # pip install anthropic

class RAGPipeline:
    def __init__(self, embed_model="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(embed_model)
        self.chunks = []
        self.embeddings = None
        self.client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var

    def ingest(self, text, max_chars=1000, overlap=200):
        """Chunk and embed a document."""
        self.chunks = chunk_text(text, max_chars, overlap)
        self.embeddings = self.model.encode(
            self.chunks, normalize_embeddings=True
        )
        print(f"Ingested {len(self.chunks)} chunks")

    def query(self, question, k=3):
        """Retrieve relevant chunks and generate a grounded answer."""
        # Retrieve
        query_vec = self.model.encode(
            [question], normalize_embeddings=True
        )
        scores = (self.embeddings @ query_vec.T).squeeze()
        top_k = np.argsort(-scores)[:k]
        retrieved = [(self.chunks[i], float(scores[i])) for i in top_k]

        # Build prompt
        prompt = build_rag_prompt(question, retrieved)

        # Generate
        response = self.client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}]
        )

        return {
            "answer": response.content[0].text,
            "sources": retrieved,
        }

That's about 40 lines for a working RAG system. Let's use it:

rag = RAGPipeline()
rag.ingest(sample)  # the solar panel guide from earlier

result = rag.query("What tax incentives are available for solar panels?")
print(result["answer"])
Based on the provided context, federal tax credits currently cover 30%
of solar panel installation costs [Document 1]. The context does not
mention state-level incentives or other tax benefits beyond this
federal credit.

The model answers from context and explicitly flags what it doesn't know. Now let's ask something the documents can't answer:

result = rag.query("What's the best brand of solar panel?")
print(result["answer"])
I don't have enough information to answer this question. The provided
context discusses solar panel installation, system sizing, and costs,
but does not mention specific panel brands or manufacturers.

That refusal is the most important feature of a well-built RAG system. A vanilla LLM would have happily recommended SunPower or LG — RAG makes it honest.

When RAG Goes Wrong

RAG isn't magic. Here are the failure modes you'll hit and how to fix them:

- Irrelevant retrieval: retrieved chunks don't match the question's intent. Fix: add a reranking step (cross-encoder) or try hybrid search (BM25 + vectors).
- Chunk boundary splits: the answer needs information split across two chunks. Fix: increase overlap, try semantic chunking, or use parent-document retrieval.
- Hallucination despite context: the model adds facts not in the retrieved text. Fix: stronger grounding prompts, lower temperature, faithfulness evaluation.
- Context overflow: too many chunks dilute attention. Fix: retrieve fewer chunks (3-5), rerank, compress context before prompting.
- Domain mismatch: the embedding model doesn't understand your jargon. Fix: fine-tune embeddings on domain data, or use a domain-specific model.

The most common beginner mistake is retrieving too many chunks. More context feels safer, but it costs 2–3x more tokens, adds latency, and paradoxically reduces answer quality due to the lost-in-the-middle effect. Start with k=3 and increase only if you're missing information.

Evaluation tip: The RAGAS framework provides automated metrics: faithfulness (is the answer grounded in context?), answer relevancy (does it address the question?), and context recall (did retrieval find everything needed?). Scores above 0.8 indicate a solid pipeline.
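A full RAGAS setup uses an LLM judge, but even a crude lexical check catches gross failures. This toy stand-in (my own sketch, not part of RAGAS) scores the fraction of answer words that also appear in the retrieved context:

```python
def naive_faithfulness(answer, context):
    """Fraction of answer words that also occur in the context.
    A blunt lexical proxy for faithfulness: 1.0 means every answer
    word is at least lexically grounded in the retrieved text."""
    answer_words = set(answer.lower().split())
    if not answer_words:
        return 0.0
    context_words = set(context.lower().split())
    return len(answer_words & context_words) / len(answer_words)

score = naive_faithfulness(
    "Federal tax credits cover 30% of installation costs.",
    "Federal tax credits currently cover 30% of installation costs."
)
print(f"{score:.2f}")  # → 1.00
```

It misses paraphrases and can't spot subtle distortions, which is exactly why production pipelines use model-based judges instead, but a score near zero is a reliable alarm.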


What's Next

You now have a working RAG pipeline in ~60 lines of Python. From here, the fixes listed under "When RAG Goes Wrong" are your roadmap: reranking, hybrid search, smarter chunking, and automated evaluation.

This post completes a trilogy. We taught words to become vectors. We benchmarked how to search those vectors. And now we've connected retrieval to generation — turning an LLM from a confident improviser into a grounded research assistant.

RAG isn't magic. It's plumbing. But good plumbing makes everything downstream work.

References & Further Reading