Building a RAG Pipeline from Scratch
LLMs Are Confident Liars
Ask an LLM to cite a source and it will invent a plausible-sounding paper that doesn't exist. Ask it about your company's internal docs and it will hallucinate policies with the unwavering confidence of a toddler explaining how the moon works.
The root problem: an LLM's knowledge is frozen at training time. It has no memory of your data, no way to verify its claims, and no mechanism to distinguish "I know this" from "I'm pattern-matching and hoping for the best." It's a brilliant improviser with no fact-checker.
Retrieval-Augmented Generation (RAG) is the fix. Instead of hoping the model memorized the answer somewhere in its billions of parameters, we give it the answer — retrieved from your actual documents — and ask it to synthesize a response grounded in real evidence.
In this post, we'll build a complete RAG pipeline from scratch in Python. No LangChain, no LlamaIndex — just raw code so you understand every piece. By the end, you'll have a working system that chunks documents, embeds them, retrieves relevant passages, and generates grounded answers.
The Pipeline at 10,000 Feet
Every RAG system follows the same five-step pattern. The first three happen at ingestion time (when you load your documents). The last two happen at query time (when someone asks a question).
If you've read our earlier posts, some of this will feel familiar. In the embeddings post, we saw how training on word co-occurrence produces vectors where similar meanings cluster together. In the vector search benchmarks, we showed that brute-force NumPy handles 100K vectors in milliseconds. RAG ties these ideas together into a practical system.
Let's build each piece.
Step 1: Chunking Documents
You can't embed an entire 50-page document as one vector — the embedding would dilute every topic into a single blurry point. Instead, we split documents into chunks: passages small enough to carry specific meaning, but large enough to preserve context.
The industry default is recursive character splitting. It tries to split on paragraph breaks first (\n\n), falls back to line breaks (\n), then sentences (. ), then words. This preserves natural document structure where possible.
Here's a from-scratch implementation:
def chunk_text(text, max_chars=1000, overlap=200):
    """Split text into overlapping chunks using recursive separators."""
    separators = ["\n\n", "\n", ". ", " "]
    chunks = []

    def split_recursive(text, sep_idx=0):
        # Base case: text fits in one chunk
        if len(text) <= max_chars:
            if text.strip():
                chunks.append(text.strip())
            return
        # Try current separator
        sep = separators[sep_idx]
        parts = text.split(sep)
        current_chunk = ""
        for part in parts:
            candidate = current_chunk + sep + part if current_chunk else part
            if len(candidate) > max_chars and current_chunk:
                chunks.append(current_chunk.strip())
                # Start next chunk with overlap from end of current
                overlap_text = current_chunk[-overlap:] if overlap else ""
                current_chunk = overlap_text + sep + part
            else:
                current_chunk = candidate
        if current_chunk.strip():
            # If this chunk is still too big, try a finer separator
            if len(current_chunk) > max_chars and sep_idx + 1 < len(separators):
                split_recursive(current_chunk, sep_idx + 1)
            else:
                chunks.append(current_chunk.strip())

    split_recursive(text)
    return chunks
Two parameters matter most:
- Chunk size: Smaller chunks give more precise retrieval but less context per chunk. Start at ~1,000 characters (roughly 250 tokens) and adjust.
- Overlap (10–20% of chunk size): If a critical sentence falls exactly at a boundary, overlap ensures both neighboring chunks contain it. Without overlap, you risk splitting "Do NOT use this near open flames" into two chunks — one saying "use this" and the other "near open flames."
Let's test it on a real passage:
sample = """Solar panels convert sunlight into electricity through the
photovoltaic effect. When photons hit silicon cells, they knock electrons
loose, creating an electrical current.

Installation requires careful roof assessment. South-facing roofs with
15-40 degree pitch are ideal in the northern hemisphere. Shading from
trees or neighboring buildings can reduce output by 10-25%.

A typical residential system is 6-10 kW, requiring 15-25 panels. At
average US electricity rates, payback period is 6-10 years. Federal tax
credits currently cover 30% of installation costs."""

chunks = chunk_text(sample, max_chars=300, overlap=60)
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i} ({len(chunk)} chars) ---")
    print(chunk[:80] + "..." if len(chunk) > 80 else chunk)
    print()
--- Chunk 0 (171 chars) ---
Solar panels convert sunlight into electricity through the
photovoltaic effect. ...
--- Chunk 1 (261 chars) ---
they knock electrons
loose, creating an electrical current.

Installation requir...
--- Chunk 2 (251 chars) ---
trees or neighboring buildings can reduce output by 10-25%.

A typical residenti...
Notice how Chunk 1 starts with the tail end of Chunk 0, and Chunk 2 with the tail end of Chunk 1 — that's the overlap doing its job. If someone asks about shading, both Chunk 1 and Chunk 2 contain the relevant sentence.
Step 2: Embedding Chunks
Next, we convert each text chunk into a dense vector that captures its meaning. As we explored in the embeddings post, words (and passages) that appear in similar contexts end up as nearby points in vector space. A question about "solar panel installation" will land close to chunks about solar panel installation — even if they use different words like "PV system setup."
For a local, free embedding model, all-MiniLM-L6-v2 from Sentence Transformers is the workhorse. It produces 384-dimensional vectors and runs on CPU in milliseconds:
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
# Embed all chunks at once — returns shape (N, 384)
embeddings = model.encode(chunks, normalize_embeddings=True)
print(f"Embedded {len(chunks)} chunks → {embeddings.shape}")
# Embedded 3 chunks → (3, 384)
The normalize_embeddings=True flag is important — it L2-normalizes each vector so cosine similarity reduces to a simple dot product. This makes retrieval a single matrix multiplication.
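To see that equivalence concretely, here's a tiny NumPy sanity check. The random vectors are stand-ins for real embeddings; nothing here depends on the model:

```python
import numpy as np

# Two random "embedding" vectors (stand-ins for model output)
rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)

# L2-normalize, as normalize_embeddings=True does
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

# Cosine similarity of the raw vectors...
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# ...equals a plain dot product of the normalized ones
assert np.isclose(np.dot(a_n, b_n), cosine)
```

Normalizing once at ingestion time means every query afterward is just a matrix-vector product.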
How does this compare to other models?
| Model | Dims | Speed | Quality | Cost |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Very fast | Good | Free (local) |
| nomic-embed-text | 768 | Fast | Very good | Free (local) |
| text-embedding-3-small | 1536 | API call | Very good | $0.02/1M tokens |
| text-embedding-3-large | 3072 | API call | Excellent | $0.13/1M tokens |
For prototyping, all-MiniLM-L6-v2 is perfect. For production with large document collections, consider OpenAI's v3 models — they support Matryoshka embeddings, meaning you can truncate a 3072-dim vector to 256 dims and still outperform older models at full dimensionality.
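Mechanically, Matryoshka truncation is just slicing plus renormalization. A sketch, assuming you already have the embeddings as a NumPy array (the function name and sizes are illustrative, and the renormalization step is required before dot-product search):

```python
import numpy as np

def truncate_embeddings(embeddings: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions, then L2-renormalize each row.

    Matryoshka-trained models front-load information into the leading
    dimensions, so truncated vectors remain usable for retrieval —
    but only after renormalization restores unit length.
    """
    truncated = embeddings[:, :dims]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# e.g. shrink hypothetical 3072-dim vectors down to 256 dims
full = np.random.default_rng(1).normal(size=(10, 3072))
small = truncate_embeddings(full, 256)
print(small.shape)  # (10, 256)
```

Storing 256 dims instead of 3072 cuts index memory by 12x, which matters once your corpus reaches millions of chunks.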
Step 3: Store and Retrieve
With chunks embedded, we need to find which chunks are most relevant to a given question. As the vector search benchmarks showed, brute-force NumPy handles small-to-medium datasets beautifully — no FAISS, no database, just matrix math:
def retrieve(query, chunks, embeddings, model, k=3):
    """Find the k most relevant chunks for a query."""
    # Embed the query with the same model
    query_vec = model.encode([query], normalize_embeddings=True)
    # Dot product against all chunk embeddings (cosine sim for normalized vecs)
    scores = (embeddings @ query_vec.T).squeeze()
    # Get top-k indices
    top_k = np.argsort(-scores)[:k]
    return [(chunks[i], float(scores[i])) for i in top_k]
That's it. Three lines of math. Since we normalized our embeddings, the dot product is cosine similarity — no extra computation needed.
Let's test retrieval with a real question:
results = retrieve(
    "How long until solar panels pay for themselves?",
    chunks, embeddings, model, k=2
)
for chunk, score in results:
    print(f"Score: {score:.3f}")
    print(chunk[:100] + "...")
    print()
Score: 0.612
trees or neighboring buildings can reduce output by 10-25%.

A typical residential system is 6-10 kW...
Score: 0.384
Solar panels convert sunlight into electricity through the
photovoltaic effect. When photons hit sili...
The retriever correctly surfaces the chunk about payback period and costs, even though the query says "pay for themselves" while the document says "payback period." Embeddings handle this semantic equivalence automatically.
When to graduate from NumPy: If your chunk count exceeds ~50K, or you need metadata filtering, consider FAISS or ChromaDB. For most RAG prototypes and even many production systems, brute-force search is more than enough. See the vector search benchmarks for exact numbers.
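Even metadata filtering doesn't immediately require a database: a boolean mask over the rows works fine at small scale. A sketch with a hypothetical `metadata` list kept alongside the embeddings (the field names and predicate are illustrative):

```python
import numpy as np

def retrieve_filtered(query_vec, embeddings, metadata, predicate, k=3):
    """Brute-force search restricted to rows whose metadata passes `predicate`."""
    mask = np.array([predicate(m) for m in metadata])
    if not mask.any():
        return []
    idx = np.flatnonzero(mask)            # original row indices that survive
    scores = embeddings[idx] @ query_vec  # cosine sim for normalized vectors
    top = np.argsort(-scores)[:k]
    return [(int(idx[i]), float(scores[i])) for i in top]

# Toy example: 5 normalized 4-dim vectors, each tagged with a source
rng = np.random.default_rng(2)
embs = rng.normal(size=(5, 4))
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
meta = [{"source": s} for s in ["faq", "blog", "faq", "manual", "faq"]]
q = embs[0]  # pretend the query embeds exactly like row 0

hits = retrieve_filtered(q, embs, meta, lambda m: m["source"] == "faq")
# Row 0 passes the filter and matches itself perfectly, so it ranks first
print(hits[0][0])  # 0
```

The filter runs before the dot product, so you pay only for the rows that survive it.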
Step 4: Augmented Generation
This is where the "G" in RAG earns its keep. We take the retrieved chunks, stuff them into a prompt, and ask the LLM to answer only from the provided context.
The prompt template is the most important piece of engineering in the entire pipeline:
def build_rag_prompt(question, retrieved_chunks):
    """Assemble retrieved context into a grounded prompt."""
    context_block = "\n\n".join(
        f"[Document {i+1}]\n{chunk}"
        for i, (chunk, score) in enumerate(retrieved_chunks)
    )
    return f"""You are a helpful assistant that answers questions using
ONLY the provided context documents. Follow these rules strictly:

1. Answer based ONLY on the context below. Do not use prior knowledge.
2. If the context doesn't contain enough information, say:
   "I don't have enough information to answer this question."
3. Cite which document(s) you used, e.g. [Document 1].
4. Keep answers concise and direct.

Context:
{context_block}

Question: {question}

Answer:"""
Two design choices here are critical:
- Explicit grounding instructions — "ONLY the provided context" and the explicit fallback for missing information. Without this, the LLM will cheerfully hallucinate to fill gaps.
- Numbered documents with citation requests — This gives the model a mechanism to attribute claims, and gives you a mechanism to verify them.
The "Lost in the Middle" Problem
Research by Liu et al. (2023) found that LLMs exhibit a U-shaped attention pattern: they focus heavily on information at the beginning and end of the context window, but underweight text in the middle. This means if you retrieve 10 chunks, the model may effectively ignore chunks 4–7.
The fix is simple: retrieve fewer, better chunks. Three to five highly relevant passages beat fifteen sort-of-relevant ones. If you must include more, put the most relevant chunks first and last.
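One simple way to implement that ordering is to alternate ranked chunks between the front and back of the list, pushing the weakest into the middle. This helper is not from any library — just a sketch of the idea:

```python
def order_for_context(ranked_chunks):
    """Reorder a best-first list so the strongest chunks sit at the
    start and end of the prompt.

    Rank 1 goes first, rank 2 goes last, rank 3 second, rank 4
    second-to-last, and so on — the weakest land in the middle,
    where attention is lowest anyway.
    """
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

print(order_for_context(["r1", "r2", "r3", "r4", "r5"]))
# ['r1', 'r3', 'r5', 'r4', 'r2']
```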
The Full Pipeline: Putting It Together
Let's assemble everything into a single class you can copy, paste, and run:
from sentence_transformers import SentenceTransformer
import numpy as np
import anthropic  # pip install anthropic

class RAGPipeline:
    def __init__(self, embed_model="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(embed_model)
        self.chunks = []
        self.embeddings = None
        self.client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var

    def ingest(self, text, max_chars=1000, overlap=200):
        """Chunk and embed a document."""
        self.chunks = chunk_text(text, max_chars, overlap)
        self.embeddings = self.model.encode(
            self.chunks, normalize_embeddings=True
        )
        print(f"Ingested {len(self.chunks)} chunks")

    def query(self, question, k=3):
        """Retrieve relevant chunks and generate a grounded answer."""
        # Retrieve
        query_vec = self.model.encode(
            [question], normalize_embeddings=True
        )
        scores = (self.embeddings @ query_vec.T).squeeze()
        top_k = np.argsort(-scores)[:k]
        retrieved = [(self.chunks[i], float(scores[i])) for i in top_k]
        # Build prompt
        prompt = build_rag_prompt(question, retrieved)
        # Generate
        response = self.client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}]
        )
        return {
            "answer": response.content[0].text,
            "sources": retrieved,
        }
That's about 40 lines for a working RAG system. Let's use it:
rag = RAGPipeline()
rag.ingest(solar_panel_guide) # our sample document from earlier
result = rag.query("What tax incentives are available for solar panels?")
print(result["answer"])
Based on the provided context, federal tax credits currently cover 30%
of solar panel installation costs [Document 1]. The context does not
mention state-level incentives or other tax benefits beyond this
federal credit.
The model answers from context and explicitly flags what it doesn't know. Now let's ask something the documents can't answer:
result = rag.query("What's the best brand of solar panel?")
print(result["answer"])
I don't have enough information to answer this question. The provided
context discusses solar panel installation, system sizing, and costs,
but does not mention specific panel brands or manufacturers.
That refusal is the most important feature of a well-built RAG system. A vanilla LLM would have happily recommended SunPower or LG — RAG makes it honest.
When RAG Goes Wrong
RAG isn't magic. Here are the failure modes you'll hit and how to fix them:
| Failure Mode | Symptom | Fix |
|---|---|---|
| Irrelevant retrieval | Retrieved chunks don't match the question's intent | Add a reranking step (cross-encoder), try hybrid search (BM25 + vectors) |
| Chunk boundary splits | Answer requires info split across two chunks | Increase overlap, try semantic chunking, or use parent-document retrieval |
| Hallucination despite context | Model adds facts not in the retrieved text | Stronger grounding prompts, lower temperature, add faithfulness evaluation |
| Context overflow | Too many chunks dilute attention | Retrieve fewer chunks (3–5), rerank, compress context before prompting |
| Domain mismatch | Embedding model doesn't understand jargon | Fine-tune embeddings on domain data, or use a domain-specific model |
The most common beginner mistake is retrieving too many chunks. More context feels safer, but it costs 2–3x more tokens, adds latency, and paradoxically reduces answer quality due to the lost-in-the-middle effect. Start with k=3 and increase only if you're missing information.
Evaluation tip: The RAGAS framework provides automated metrics: faithfulness (is the answer grounded in context?), answer relevancy (does it address the question?), and context recall (did retrieval find everything needed?). Scores above 0.8 indicate a solid pipeline.
Try It: RAG Pipeline Explorer
Paste some text (or use the sample), then ask a question. Watch the pipeline chunk, retrieve, and answer step by step.
This demo uses word-overlap similarity as a browser-friendly stand-in for real embeddings. A production system would use a proper embedding model like all-MiniLM-L6-v2.
What's Next
You now have a working RAG pipeline in ~60 lines of Python. From here, there are several paths to make it better:
- Hybrid search — Combine vector similarity with BM25 keyword matching. Catches cases where exact terms matter (product names, error codes).
- Reranking — After retrieving 15–20 candidates with fast vector search, score them with a cross-encoder model for much higher precision. Take the top 3–5.
- Query transformation — Rewrite the user's question before searching. HyDE (Hypothetical Document Embeddings) generates a hypothetical answer first, then uses that as the search query.
- Agentic RAG — Let the LLM decide when to search and what to search for, iteratively refining its retrieval until it has enough context.
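To make the hybrid-search idea concrete, here's a sketch of reciprocal rank fusion (RRF), a common way to merge a BM25 ranking with a vector ranking. The constant k=60 is the conventional default from the RRF literature; the doc IDs below are made up:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc IDs into one fused ranking.

    Each list contributes 1 / (k + rank) per document, so documents
    that rank well in multiple lists float to the top, and a single
    bad rank can't sink an otherwise strong document.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]    # keyword ranking
vector_hits = ["doc1", "doc5", "doc3"]  # embedding ranking
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# ['doc1', 'doc3', 'doc5', 'doc7']
```

Because RRF only consumes ranks, not raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.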
This post completes a trilogy. We taught words to become vectors. We benchmarked how to search those vectors. And now we've connected retrieval to generation — turning an LLM from a confident improviser into a grounded research assistant.
RAG isn't magic. It's plumbing. But good plumbing makes everything downstream work.
References & Further Reading
- DadOps — Embeddings from Scratch — How words become vectors through training on co-occurrence
- DadOps — Vector Search at Small Scale — Benchmarks for pgvector, FAISS, and brute-force NumPy
- Liu et al. — Lost in the Middle (2023) — Research on how LLMs underweight information in the middle of long contexts
- RAGAS Documentation — Automated evaluation framework for RAG systems
- Sentence Transformers — The library behind all-MiniLM-L6-v2 and many other embedding models
- Lewis et al. — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020) — The original RAG paper from Facebook AI Research
- Gao et al. — Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE, 2022) — Hypothetical document embeddings for query transformation