When I decided to put an AI chat widget on my portfolio, the first question was what it should actually do. Another generic "chat with my CV" felt unappealing. But I have something more interesting sitting on a shelf: six peer-reviewed papers, 169+ citations, and a dense technical vocabulary that no thirty-second elevator pitch could ever capture. So I built a retrieval-augmented generation (RAG) pipeline over my own research — and wired it into the hero section of this site.
This is a writeup of how it works, the decisions I made, and the gotchas that only show up once you actually try to ship the thing.
What RAG actually is
RAG is a simple idea wrapped in some acronyms. A language model is trained on a fixed corpus; it doesn't know about your documents. Retrieval-augmented generation fixes this by chunking your documents into small pieces, embedding each piece into a high-dimensional vector that captures its semantic meaning, storing those vectors in a database, and — at query time — embedding the user's question with the same model, finding the most similar chunks, and stuffing them into the prompt as context. The LLM then answers the question while grounding its response in the retrieved material.
The appeal: no fine-tuning, a much smaller hallucination surface when the prompt is disciplined, and a knowledge base you can update without retraining anything. Add a new paper, re-run the ingestion, push to production. That's it.
The pipeline, end to end
At query time the full flow takes around one second. Here's what happens in that second:
- The question is embedded into a 384-dim vector.
- The vector store is searched by cosine similarity.
- The top-matching chunks come back with source tags.
- The LLM generates a grounded answer, via Groq.
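The similarity step in the middle of that flow is plain cosine similarity over vectors. A minimal stdlib sketch, with toy 3-dim vectors standing in for the real 384-dim embeddings (the chunk names and numbers here are made up for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dim "embeddings" standing in for real 384-dim MiniLM vectors.
chunks = {
    "chunk on sarbecovirus spillover": [0.9, 0.1, 0.0],
    "chunk on human movement networks": [0.1, 0.8, 0.3],
    "chunk on Mastomys natalensis ecology": [0.0, 0.2, 0.9],
}
query_vec = [0.85, 0.15, 0.05]  # pretend embedding of the user's question

best = max(chunks, key=lambda name: cosine(chunks[name], query_vec))
# best -> "chunk on sarbecovirus spillover"
```

The real store does this over 779 chunks, which is why the latency rounds to zero.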
The ingestion side of the pipeline is offline and runs once when the paper collection changes. It loads each PDF page by page, splits the text into overlapping chunks, embeds every chunk, and persists the vectors to disk. The whole thing is about seventy lines of LangChain code.
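Stripped of the LangChain wrappers, the ingestion loop reduces to something like this. This is a stdlib-only sketch, not the production code: `embed` is a stand-in for the real embedding model, and the fixed-stride splitting is cruder than the recursive splitter used in practice.

```python
def ingest(pages, embed, chunk_size=1000, overlap=200):
    """Toy ingestion: split each page into overlapping chunks and embed them."""
    records = []
    for source, text in pages:
        step = chunk_size - overlap
        for start in range(0, max(len(text) - overlap, 1), step):
            chunk = text[start:start + chunk_size]
            records.append({"source": source, "text": chunk, "vector": embed(chunk)})
    return records

# Stand-in embedder; the real pipeline uses nomic-embed-text / MiniLM.
fake_embed = lambda chunk: [len(chunk), chunk.count("0")]

text = "".join(str(i % 10) for i in range(2500))
records = ingest([("paper1.pdf", text)], fake_embed)
# Consecutive chunks share a 200-character overlap.
```

Persisting `records` to disk is all a vector store does at heart; ChromaDB just does it with an index on top.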
Design decision 1: Chunking
Writing academic papers isn't like writing tweets. A single paragraph in a Methods section can contain three dependent clauses, one equation, and a reference to a figure on the next page. Chunk too small and you slice the sentence in half. Chunk too large and your similarity search dilutes — the chunk ends up containing multiple topics, and the embedding becomes an average that matches nothing well.
I settled on 1000 characters with 200-character overlap, split recursively on paragraph boundaries first, then sentences, then words. The overlap is insurance: if a key claim straddles a chunk boundary, both chunks carry part of it, so neither is lost at retrieval time.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(docs)
Could I do better? Yes — section-aware chunking (detecting Abstract, Methods, Results headers and splitting within sections) would preserve more document structure. But for six papers, the gains wouldn't justify the code. That's a decision I'd revisit at one hundred papers, not six.
Design decision 2: Embedding model
For local development I used Ollama's nomic-embed-text — 768 dimensions, runs on my laptop, free. For deployment I had to swap it out, because Ollama can't run on Render's free tier. I went with sentence-transformers/all-MiniLM-L6-v2: 384 dimensions, 80MB on disk, runs on CPU, widely supported, and specifically designed for semantic similarity.
The tradeoff: MiniLM is smaller and slightly less accurate than nomic-embed-text on retrieval benchmarks. For a portfolio demo over six papers, the difference is imperceptible. At scale — thousands of domain-specific documents — I'd re-evaluate. Domain-tuned embeddings like BioBERT or SciBERT would likely improve recall on technical terms like sarbecovirus or Mastomys natalensis that appear rarely in general-purpose training data.
Design decision 3: Vector store
ChromaDB. I wanted boring. ChromaDB runs in-process, persists to a directory on disk, has a Python API that reads like a dictionary, and handles my 779 chunks with latency that rounds to zero. The entire vector store is a 30MB SQLite file I commit to the repo.
If this ever grew to thousands of documents, I'd move to Pinecone or Weaviate for managed scaling, metadata filtering, and hybrid search out of the box. For six papers, a SQLite file is the right answer. Choosing the simplest thing that works is an engineering virtue.
Design decision 4: LLM
This one took a few attempts. My initial plan was Google Gemini — free tier, generous quota, well-documented. When I tested it from Auckland, the API returned RESOURCE_EXHAUSTED with a free-tier limit of zero. Regional restrictions, apparently. A new API key from the same Google account produced the same error. Lesson learned.
I switched to Groq, which hosts Llama 3.3 70B on custom inference hardware and gives away 1,000 requests per day on its free tier. For a portfolio chat widget that might get fifty queries a week from curious recruiters, this is effectively unlimited. Groq's latency is also genuinely impressive — responses start streaming in under a second even for a 70-billion-parameter model.
The API keeps the provider swappable via an LLM_PROVIDER environment variable. If Groq's free tier ever disappears, I can flip to OpenAI gpt-4o-mini or Gemini with a one-line change. Hedging against provider lock-in is cheap when LangChain handles the abstraction.
import os

LLM_PROVIDER = os.getenv("LLM_PROVIDER", "groq")

if LLM_PROVIDER == "gemini":
    from langchain_google_genai import ChatGoogleGenerativeAI
    llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0.1)
elif LLM_PROVIDER == "openai":
    from langchain_openai import ChatOpenAI
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)
else:
    from langchain_groq import ChatGroq
    llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0.1)
Design decision 5: The prompt
The prompt is where you actually earn the "grounded" in retrieval-augmented generation. Mine is short but opinionated:
# SYSTEM_TEMPLATE
You are a research assistant for Dr Reju Sam John, a computational
epidemiologist and data scientist based in Auckland, New Zealand.
INSTRUCTIONS:
- Answer the question using ONLY the provided context from his
published peer-reviewed papers.
- If the context does not contain enough information to answer,
say so honestly — do not hallucinate.
- Cite papers using short names like (John et al., 2024) — derive
the author and year from the filename in the [Source:] tags.
- Keep answers under 150 words for readability.
CONTEXT:
{context}
QUESTION:
{question}
ANSWER:
Two instructions matter most. "Use ONLY the provided context" is the hallucination fence. "Say so honestly when context is insufficient" is the escape hatch — without it, models will invent plausible-sounding answers to avoid looking stupid. Together these two lines turn an unconstrained language model into a cited research assistant.
The citation format was a late addition. My first version told the model to cite using the raw [Source: filename] tags that appear in the context. It dutifully did — and the output read like this: [Source: john-et-al-2024-high-connectivity-and-human-movement-limits-...]. Correct and unreadable. I changed the instruction to derive author-year citations from the filename, and the output now reads like a paper: (John et al., 2024). Small prompt change, big UX win.
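For completeness, the [Source:] tags the prompt refers to get attached when the retrieved chunks are assembled into the context string. A sketch of that assembly step (the field names are my illustration, not the exact production code):

```python
def build_context(chunks):
    """Join retrieved chunks, each prefixed with a source tag the prompt can cite."""
    return "\n\n".join(f"[Source: {c['source']}]\n{c['text']}" for c in chunks)

context = build_context([
    {"source": "john-et-al-2024.pdf", "text": "Movement between districts..."},
    {"source": "john-et-al-2021.pdf", "text": "The model was calibrated..."},
])
```

The filename in the tag is what the model mines for the author and year.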
What actually broke
Three things tripped me up that the tutorials don't warn you about.
1. The embedding model has to match
The vector store is a collection of vectors produced by one specific embedding model. If I ingest with nomic-embed-text and then query with all-MiniLM-L6-v2, the query vector lives in a completely different geometric space from the stored vectors. Similarity scores become statistical noise. I now maintain two separate vector stores — chroma_db/ for local Ollama and chroma_db_deploy/ for the hosted API — specifically to avoid this landmine.
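One cheap defence is to persist the embedding model's name next to the store and refuse to query on a mismatch. A hypothetical guard (my own helper functions, not something ChromaDB enforces for you):

```python
import json
import os
import tempfile

def save_store_meta(path, model_name):
    """Record which embedding model produced the vectors in this store."""
    with open(os.path.join(path, "meta.json"), "w") as f:
        json.dump({"embedding_model": model_name}, f)

def check_store_meta(path, model_name):
    """Raise if the store was built with a different embedding model."""
    with open(os.path.join(path, "meta.json")) as f:
        stored = json.load(f)["embedding_model"]
    if stored != model_name:
        raise ValueError(f"store built with {stored}, queried with {model_name}")

store_dir = tempfile.mkdtemp()
save_store_meta(store_dir, "all-MiniLM-L6-v2")
check_store_meta(store_dir, "all-MiniLM-L6-v2")  # passes silently
```

Two lines at startup, and the mismatch fails loudly instead of returning noise.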
2. Free tiers are not portable
Gemini's free tier is zero in New Zealand. Hugging Face's Inference API quietly dropped Mistral-7B support last year. OpenRouter's free models rate-limit aggressively. I didn't build the pluggable LLM layer because I wanted flexibility — I wanted one working provider. I built it because I didn't trust any single provider to stay free for the lifetime of my portfolio.
3. Retrieval quality degrades on meta-questions
If you ask "What methods did you use in the sarbecovirus study?", the embedding of that question is dominated by generic words like methods and study that appear in every paper. The retrieved chunks end up being a mix of generic methodology text from several papers, not the specific Methods section of the one you meant.
Fixing this properly needs one of: metadata filtering (search within a named paper), query expansion (rephrase the question with content-specific terms), or hybrid search combining BM25 keyword matching with dense retrieval. I haven't implemented any of them yet. It's on the list.
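Of the three, metadata filtering is the smallest change: restrict the similarity search to chunks from a named paper before ranking. A stdlib sketch under that assumption (record fields and the toy vectors are hypothetical):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(records, query_vec, k=4, paper=None):
    """Rank chunks by similarity, optionally restricted to one source paper."""
    pool = [r for r in records if paper is None or r["source"] == paper]
    pool.sort(key=lambda r: cosine(r["vector"], query_vec), reverse=True)
    return pool[:k]

records = [
    {"source": "a.pdf", "text": "methods of paper A", "vector": [1.0, 0.0]},
    {"source": "b.pdf", "text": "methods of paper B", "vector": [0.9, 0.1]},
]
hits = retrieve(records, [1.0, 0.0], k=1, paper="b.pdf")
# The filter forces the b.pdf chunk even though a.pdf scores higher overall.
```

ChromaDB exposes the same idea through metadata filters on queries; the sketch just shows why it fixes the "what methods..." failure.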
Local vs deployed: two stacks, one codebase
Running the app locally and serving it in production turned out to need two different stacks. Here's the split:
| Component | Local (Streamlit) | Deployed (FastAPI) |
|---|---|---|
| Embeddings | nomic-embed-text via Ollama | all-MiniLM-L6-v2 via sentence-transformers |
| LLM | llama3.2 via Ollama | llama-3.3-70b-versatile via Groq |
| Vector store | chroma_db/ | chroma_db_deploy/ |
| Interface | Streamlit chat UI | FastAPI → website JS fetch |
| Hosting | localhost | Render (free tier) |
| Cost | $0 (local compute) | $0 (all free tiers) |
The two stacks share every other line of code: chunking logic, retriever setup, prompt template, chain composition. The only variation is in the embeddings module and the LLM factory. That separation is what made the pivot from Streamlit-on-localhost to FastAPI-on-Render take an afternoon instead of a week.
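The embeddings module follows the same pattern as the LLM factory: one environment variable decides which stack you're on. A simplified sketch (the function name and the APP_ENV variable are my illustration, not the repo's actual identifiers):

```python
import os

def embedding_config(env=None):
    """Return (model_name, vector_dim, store_dir) for the current environment."""
    env = env or os.environ.get("APP_ENV", "deploy")
    if env == "local":
        # Ollama on the laptop: bigger model, its own vector store.
        return ("nomic-embed-text", 768, "chroma_db/")
    # Render free tier: CPU-friendly MiniLM, a separate store.
    return ("sentence-transformers/all-MiniLM-L6-v2", 384, "chroma_db_deploy/")
```

Keeping the store directory in the same tuple as the model name is what makes the "embedding model has to match" landmine hard to step on.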
Does it actually work?
I tested three example queries against the live chat. The cited paper was correct in all three cases, and the answers were concise and faithful to the source. When I ask a question the papers don't address ("Who is Reju Sam John?"), the model refuses gracefully rather than inventing a biography — which is exactly the failure mode the prompt engineering is designed to prevent.
What I'd build next
- Hybrid search. BM25 keyword retrieval fused with dense vector search. This directly fixes the meta-question problem and handles proper nouns better than pure semantic matching.
- Cross-encoder re-ranking. Initial retrieval pulls maybe twenty candidate chunks; a small cross-encoder model (e.g. ms-marco-MiniLM) re-scores them for relevance to the specific question. Expensive per query, but dramatic precision gains.
- Section-aware chunking. Detect paper structure (Abstract, Methods, Results, Discussion) and chunk within sections, with section type stored as metadata. Enables filtered retrieval for questions like "what methods…".
- Citation-aware synthesis. When multiple chunks from different papers agree, the model should combine them into a single answer with multiple citations, not pick one and ignore the rest.
- Evaluation harness. A held-out set of questions with ground-truth answers and hand-labelled relevant chunks, wired into CI so I can tell when a config change helps or hurts retrieval quality.
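The hybrid-search item is less work than it sounds: run BM25 and dense retrieval separately, then fuse the two rankings. Reciprocal Rank Fusion is the standard trick, and it fits in a few lines (the chunk names below are placeholders):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: merge several ranked lists into one.
    Each document scores sum(1 / (k + rank + 1)) across the input lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["chunk_a", "chunk_c", "chunk_b"]   # keyword-match order
dense_ranking = ["chunk_a", "chunk_b", "chunk_c"]  # semantic-match order
fused = rrf([bm25_ranking, dense_ranking])
# chunk_a tops both input lists, so it tops the fused ranking too.
```

The constant k=60 is the value commonly used in the RRF literature; it damps the influence of any single list.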
The stack: sentence-transformers (all-MiniLM-L6-v2) and Ollama (nomic-embed-text) for embeddings, Groq for inference (llama-3.3-70b-versatile), FastAPI for the REST API, Streamlit for the local dev UI, Render for deployment. Full code and papers list on GitHub.