When I decided to put an AI chat widget on my portfolio, the first question was what it should actually do. Another generic "chat with my CV" felt unappealing. But I have something more interesting sitting on a shelf: six peer-reviewed papers, 169+ citations, and a dense technical vocabulary that no thirty-second elevator pitch could ever capture. So I built a retrieval-augmented generation (RAG) pipeline over my own research — and wired it into the hero section of this site.

This is a writeup of how it works, the decisions I made, and the gotchas that only show up once you actually try to ship the thing.

6 peer-reviewed papers · 779 text chunks indexed · 384 embedding dimensions · 4 chunks retrieved per query

What RAG actually is

RAG is a simple idea wrapped in some acronyms. A language model is trained on a fixed corpus; it doesn't know about your documents. Retrieval-augmented generation fixes this by chunking your documents into small pieces, embedding each piece into a high-dimensional vector that captures its semantic meaning, storing those vectors in a database, and — at query time — embedding the user's question with the same model, finding the most similar chunks, and stuffing them into the prompt as context. The LLM then answers the question while grounding its response in the retrieved material.

The appeal: no fine-tuning, far fewer hallucinations when it's done right, and a knowledge base you can update without retraining anything. Add a new paper, re-run the ingestion, push to production. That's it.

The pipeline, end to end

At query time the full flow takes around one second. Here's what happens in that second:

1. Embed: all-MiniLM-L6-v2 turns the question into a 384-dim vector.
2. Search: ChromaDB ranks stored chunks by cosine similarity.
3. Retrieve: the top-4 chunks come back with their source tags.
4. Generate: Llama 3.3 70B, via Groq, answers from those chunks.

The ingestion side of the pipeline is offline and runs once when the paper collection changes. It loads each PDF page by page, splits the text into overlapping chunks, embeds every chunk, and persists the vectors to disk. The whole thing is about seventy lines of LangChain code.

Design decision 1: Chunking

Writing academic papers isn't like writing tweets. A single paragraph in a Methods section can contain three dependent clauses, one equation, and a reference to a figure on the next page. Chunk too small and you slice the sentence in half. Chunk too large and your similarity search dilutes — the chunk ends up containing multiple topics, and the embedding becomes an average that matches nothing well.

I settled on 1000 characters with 200-character overlap, split recursively on paragraph boundaries first, then sentences, then words. The overlap is insurance: if a key claim straddles a chunk boundary, both chunks carry part of it, so neither is lost at retrieval time.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters, not tokens
    chunk_overlap=200,  # insurance against claims straddling a boundary
    # Try paragraph breaks first, then lines, sentences, words, characters.
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(docs)

Could I do better? Yes — section-aware chunking (detecting Abstract, Methods, Results headers and splitting within sections) would preserve more document structure. But for six papers, the gains wouldn't justify the code. That's a decision I'd revisit at one hundred papers, not six.
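For a flavour of what that revisit would look like: splitting at recognised header lines before the character splitter runs is only a few lines. This is a dependency-free sketch, not part of the pipeline; the header list and the `split_sections` name are my own.

```python
import re

# Headers commonly found in the papers; the list is illustrative only.
SECTION_HEADERS = re.compile(
    r"^(Abstract|Introduction|Methods|Results|Discussion|Conclusion)\s*$",
    re.MULTILINE | re.IGNORECASE,
)

def split_sections(text: str) -> list[str]:
    """Split a paper's full text at recognised section headers,
    keeping each header attached to the body that follows it."""
    positions = [m.start() for m in SECTION_HEADERS.finditer(text)]
    if not positions:
        return [text]  # no headers found: fall back to one big section
    bounds = positions + [len(text)]
    head = text[: positions[0]]
    sections = [text[a:b] for a, b in zip(bounds, bounds[1:])]
    return ([head] if head.strip() else []) + sections
```

Each section would then be fed to the character splitter individually, so no chunk ever mixes Methods text with Results text.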

Design decision 2: Embedding model

For local development I used Ollama's nomic-embed-text — 768 dimensions, runs on my laptop, free. For deployment I had to swap it out, because Ollama can't run on Render's free tier. I went with sentence-transformers/all-MiniLM-L6-v2: 384 dimensions, 80MB on disk, runs on CPU, widely supported, and specifically designed for semantic similarity.

The tradeoff: MiniLM is smaller and slightly less accurate than nomic-embed-text on retrieval benchmarks. For a portfolio demo over six papers, the difference is imperceptible. At scale — thousands of domain-specific documents — I'd re-evaluate. Domain-tuned embeddings like BioBERT or SciBERT would likely improve recall on technical terms like sarbecovirus or Mastomys natalensis that appear rarely in general-purpose training data.

The embedding model is the most important choice in a RAG pipeline. Everything downstream assumes that "semantically similar text ends up geometrically close in the vector space." If your embedding model fails that promise for your domain, no amount of prompt engineering will save you.
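That promise is concrete: "geometrically close" here means cosine similarity, the metric the search step uses over these vectors. A dependency-free sketch of the comparison:

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors:
    1.0 for identical direction, 0.0 for orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

With all-MiniLM-L6-v2 the lists are 384 floats long; the arithmetic is identical at any dimension.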

Design decision 3: Vector store

ChromaDB. I wanted boring. ChromaDB runs in-process, persists to a directory on disk, has a Python API that mirrors a dictionary, and handles my 779 chunks with a latency that rounds to zero. The entire vector store is a 30MB SQLite file I commit to the repo.

If this ever grew to thousands of documents, I'd move to Pinecone or Weaviate for managed scaling, metadata filtering, and hybrid search out of the box. For six papers, a SQLite file is the right answer. Choosing the simplest thing that works is an engineering virtue.
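At this scale, "latency that rounds to zero" is unsurprising: an exhaustive scan over 779 vectors is trivial. A dependency-free sketch of what any in-process vector store is effectively computing here (the function is mine, not ChromaDB's API):

```python
from math import sqrt

def top_k(query: list[float], store: list[tuple[str, list[float]]], k: int = 4):
    """Exhaustively score every stored chunk against the query vector
    and return the k best (chunk_text, score) pairs, best first."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))
    scored = [(text, cos(query, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Approximate-nearest-neighbour indexes only start paying for their complexity when brute force stops being instant, which is nowhere near 779 chunks.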

Design decision 4: LLM

This one took a few attempts. My initial plan was Google Gemini — free tier, generous quota, well-documented. When I tested it from Auckland, the API returned RESOURCE_EXHAUSTED with a free-tier limit of zero. Regional restrictions, apparently. A new API key from the same Google account produced the same error. Lesson learned.

I switched to Groq, which hosts Llama 3.3 70B on custom inference hardware and gives away 1,000 requests per day on its free tier. For a portfolio chat widget that might get fifty queries a week from curious recruiters, this is effectively unlimited. Groq's latency is also genuinely impressive — responses start streaming in under a second even for a 70-billion-parameter model.

The API keeps the provider swappable via an LLM_PROVIDER environment variable. If Groq's free tier ever disappears, I can flip to OpenAI gpt-4o-mini or Gemini with a one-line change. Hedging against provider lock-in is cheap when LangChain handles the abstraction.

import os

# Provider is selected at deploy time; Groq is the default.
LLM_PROVIDER = os.getenv("LLM_PROVIDER", "groq")

if LLM_PROVIDER == "gemini":
    from langchain_google_genai import ChatGoogleGenerativeAI
    llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0.1)
elif LLM_PROVIDER == "openai":
    from langchain_openai import ChatOpenAI
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)
else:
    from langchain_groq import ChatGroq
    llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0.1)

Design decision 5: The prompt

The prompt is where you actually earn the "grounded" in retrieval-augmented generation. Mine is short but opinionated:

# SYSTEM_TEMPLATE
You are a research assistant for Dr Reju Sam John, a computational
epidemiologist and data scientist based in Auckland, New Zealand.

INSTRUCTIONS:
- Answer the question using ONLY the provided context from his
  published peer-reviewed papers.
- If the context does not contain enough information to answer,
  say so honestly — do not hallucinate.
- Cite papers using short names like (John et al., 2024) — derive
  the author and year from the filename in the [Source:] tags.
- Keep answers under 150 words for readability.

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:

Two instructions matter most. "Use ONLY the provided context" is the hallucination fence. "Say so honestly when context is insufficient" is the escape hatch — without it, models will invent plausible-sounding answers to avoid looking stupid. Together these two lines turn an unconstrained language model into a cited research assistant.

The citation format was a late addition. My first version told the model to cite using the raw [Source: filename] tags that appear in the context. It dutifully did — and the output read like this: [Source: john-et-al-2024-high-connectivity-and-human-movement-limits-...]. Correct and unreadable. I changed the instruction to derive author-year citations from the filename and the output now reads like a paper: (John et al., 2024). Small prompt change, big UX win.
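The transformation the prompt delegates to the model is mechanical enough to sketch in code. This is a hypothetical helper, not part of the pipeline (the model does this in-context), for filenames shaped like john-et-al-2024-title-words:

```python
import re

def short_citation(filename: str) -> str:
    """Turn a filename like 'john-et-al-2024-high-connectivity-...'
    into a reader-friendly '(John et al., 2024)'."""
    m = re.match(r"([a-z]+)-et-al-(\d{4})", filename)
    if not m:
        return f"[Source: {filename}]"  # fall back to the raw tag
    surname, year = m.groups()
    return f"({surname.capitalize()} et al., {year})"
```

Asking the model to do this in-context avoids a brittle parsing step in the pipeline, at the cost of trusting the model to get it right.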

What actually broke

Three things tripped me up that the tutorials don't warn you about.

1. The embedding model has to match

The vector store is a collection of vectors produced by one specific embedding model. If I ingest with nomic-embed-text and then query with all-MiniLM-L6-v2, the query vector lives in a completely different geometric space from the stored vectors. Similarity scores become statistical noise. I now maintain two separate vector stores — chroma_db/ for local Ollama and chroma_db_deploy/ for the hosted API — specifically to avoid this landmine.
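One cheap extra guard (a convention sketch of my own, not a ChromaDB feature) is to record which embedding model produced a store's vectors at ingestion time, and fail loudly at query time on a mismatch:

```python
import json
from pathlib import Path

def write_fingerprint(store_dir: str, model_name: str) -> None:
    # At ingestion time: record which model produced these vectors.
    Path(store_dir).mkdir(parents=True, exist_ok=True)
    (Path(store_dir) / "embedding_model.json").write_text(
        json.dumps({"model": model_name})
    )

def check_fingerprint(store_dir: str, model_name: str) -> None:
    # At query time: fail loudly instead of returning statistical noise.
    path = Path(store_dir) / "embedding_model.json"
    recorded = json.loads(path.read_text())["model"]
    if recorded != model_name:
        raise RuntimeError(
            f"Vector store was built with {recorded!r}, "
            f"but you are querying with {model_name!r}"
        )
```

An exception at startup is a far kinder failure mode than retrieval that silently returns garbage.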

2. Free tiers are not portable

Gemini's free tier is zero in New Zealand. Hugging Face's Inference API quietly dropped Mistral-7B support last year. OpenRouter's free models rate-limit aggressively. I wrote the API with a pluggable LLM layer not because I wanted flexibility for its own sake; all I wanted was one working provider, and I didn't trust any single one to stay free for the lifetime of my portfolio.

3. Retrieval quality degrades on meta-questions

If you ask "What methods did you use in the sarbecovirus study?", the embedding of that question is dominated by generic words like methods and study that appear in every paper. The retrieved chunks end up being a mix of generic methodology text from several papers, not the specific Methods section of the one you meant.

Fixing this properly needs one of: metadata filtering (search within a named paper), query expansion (rephrase the question with content-specific terms), or hybrid search combining BM25 keyword matching with dense retrieval. I haven't implemented any of them yet. It's on the list.
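To give the hybrid option some flavour: the standard trick is a weighted blend of the dense similarity score with a keyword score, so rare exact terms like sarbecovirus reassert themselves. This is a toy sketch; the linear blend and the naive overlap measure stand in for a real BM25 implementation.

```python
def keyword_overlap(query: str, chunk: str) -> float:
    """Fraction of query terms that appear verbatim in the chunk."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0

def hybrid_score(dense: float, query: str, chunk: str, alpha: float = 0.7) -> float:
    """Blend dense similarity with exact keyword overlap.
    alpha=1.0 is pure dense retrieval; lower values favour rare exact terms."""
    return alpha * dense + (1 - alpha) * keyword_overlap(query, chunk)
```

With two chunks at the same dense score, the one that literally contains "sarbecovirus" now outranks the generic methodology text.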

Local vs deployed: two stacks, one codebase

Running the app locally and serving it in production turned out to need two different stacks. Here's the split:

Component    | Local (Streamlit)           | Deployed (FastAPI)
-------------|-----------------------------|-------------------------------------------
Embeddings   | nomic-embed-text via Ollama | all-MiniLM-L6-v2 via sentence-transformers
LLM          | llama3.2 via Ollama         | llama-3.3-70b-versatile via Groq
Vector store | chroma_db/                  | chroma_db_deploy/
Interface    | Streamlit chat UI           | FastAPI → website JS fetch
Hosting      | localhost                   | Render (free tier)
Cost         | $0 (local compute)          | $0 (all free tiers)

The two stacks share every other line of code: chunking logic, retriever setup, prompt template, chain composition. The only variation is in the embeddings module and the LLM factory. That separation is what made the pivot from Streamlit-on-localhost to FastAPI-on-Render take an afternoon instead of a week.

Does it actually work?

Three example queries, responses verbatim from the live chat:

Q: How does human movement affect infectious disease transmission?
A: Human movement limits the impact of travel time on infectious disease transmission, and its restriction can have a negligible effect on epidemic dynamics unless severely limited. (John et al., 2024)

Q: How does Lassa virus transmit from rodents to humans?
A: Lassa virus is transmitted to humans probably through the ingestion of food or contact with contaminated rodent urine, droppings, or saliva. (John et al., 2024)

Q: What is the connectivity paradox in pandemic spread?
A: Highly connected networks lead to rapid pandemic spread, making travel times less important as a predictor of spread, unless human movement is severely restricted. (John et al., 2024)

The cited paper is correct in all three cases. The answers are concise and faithful to the source. When I ask a question the papers don't address ("Who is Reju Sam John?"), the model refuses gracefully rather than inventing a biography — which is exactly the failure mode the prompt engineering is designed to prevent.

What I'd build next

Section-aware chunking, metadata filtering for within-paper queries, query expansion, and hybrid keyword-plus-dense retrieval: everything flagged above as worth revisiting or "on the list".

The stack

LangChain for pipeline orchestration, ChromaDB for vector storage, sentence-transformers (all-MiniLM-L6-v2) and Ollama (nomic-embed-text) for embeddings, Groq for inference (llama-3.3-70b-versatile), FastAPI for the REST API, Streamlit for the local dev UI, Render for deployment. Full code and papers list on GitHub.

Try the live chat widget or explore the code.
