When I decided to put an AI chat widget on my portfolio, the first question was what it should actually do. Another generic "chat with my CV" felt unappealing. But I have something more interesting sitting on a shelf: a body of peer-reviewed research in infectious disease modelling — dense technical vocabulary, formal notation, results that no thirty-second elevator pitch could ever capture. So I built a retrieval-augmented generation (RAG) pipeline over my own research and wired it into the hero section of this site.

This is a writeup of how it works, the decisions I made, and the gotchas that only show up once you actually try to ship the thing.

A note on scope. I have eleven peer-reviewed publications, but the demo corpus is a deliberate six-paper subset — all in epidemiological modelling and network science. The papers I excluded are from an earlier chapter of my research life in astrophysics and computational physics, and mixing them in would have hurt the demo rather than helped it. A question like "how does Lassa virus spread?" would start pulling in passages about dark matter halos just because both corpora share words like "network" and "density". That is exactly the kind of cross-domain retrieval noise a single-namespace RAG system is bad at handling — and it is worth a few paragraphs at the end of this article, because it is the most interesting extension of this work.

By the numbers: 6 peer-reviewed papers · 779 text chunks indexed · 384 embedding dimensions · 4 chunks retrieved per query

What RAG actually is

RAG is a simple idea wrapped in some acronyms. A language model is trained on a fixed corpus; it doesn't know about your documents. Retrieval-augmented generation fixes this by chunking your documents into small pieces, embedding each piece into a high-dimensional vector that captures its semantic meaning, storing those vectors in a database, and — at query time — embedding the user's question with the same model, finding the most similar chunks, and stuffing them into the prompt as context. The LLM then answers the question while grounding its response in the retrieved material.
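To make that loop concrete, here is a minimal, dependency-free sketch. It is not the pipeline used on this site: `embed` is a toy bag-of-words counter standing in for a real embedding model, and the chunk texts and the `build_prompt` helper are made up for illustration.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, chunks: list, k: int = 2) -> list:
    """Embed the question with the same function and return the k closest chunks."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question: str, context_chunks: list) -> str:
    """Stuff the retrieved chunks into the prompt as context."""
    context = "\n\n".join(context_chunks)
    return f"CONTEXT:\n{context}\n\nQUESTION:\n{question}\n\nANSWER:"

# Hypothetical chunks, not text from the real corpus.
chunks = [
    "Lassa virus is carried by the rodent Mastomys natalensis.",
    "Travel time between cities predicts epidemic arrival order.",
    "Dark matter halos form along cosmic filaments.",
]
question = "Which rodent carries Lassa virus?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
```

A real embedding model replaces word overlap with learned semantic similarity, but the shape of the loop (embed, compare, select, stuff) stays the same.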

The appeal: no fine-tuning, far fewer hallucinations when it's done right, and a knowledge base that can be updated without retraining anything. Add a new paper, re-run the ingestion, push to production. That's it.

The pipeline, end to end

At query time the full flow takes around one second. Here's what happens in that second:

  1. Embed: all-MiniLM-L6-v2, 384-dim vector
  2. Search: ChromaDB, cosine similarity
  3. Retrieve: top-4 chunks with source tags
  4. Generate: Llama 3.3 70B via Groq

The ingestion side of the pipeline is offline and runs once when the paper collection changes. It loads each PDF page by page, splits the text into overlapping chunks, embeds every chunk, and persists the vectors to disk. The whole thing is about seventy lines of LangChain code.

Design decision 1: Chunking

Writing academic papers isn't like writing tweets. A single paragraph in a Methods section can contain three dependent clauses, one equation, and a reference to a figure on the next page. Chunk too small and you slice the sentence in half. Chunk too large and your similarity search dilutes — the chunk ends up containing multiple topics, and the embedding becomes an average that matches nothing well.

I settled on 1000 characters with 200-character overlap, split recursively on paragraph boundaries first, then sentences, then words. The overlap is insurance: if a key claim straddles a chunk boundary, both chunks carry part of it, so neither is lost at retrieval time.

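from langchain_text_splitters import RecursiveCharacterTextSplitter

# `docs` is the list of LangChain Document objects produced by the PDF loader
# (loaded page by page, as described above).
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # maximum characters per chunk
    chunk_overlap=200,   # characters shared between neighbouring chunks
    # Try paragraph breaks first, then line breaks, sentences, words, characters.
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(docs)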

Could I do better? Yes — section-aware chunking (detecting Abstract, Methods, Results headers and splitting within sections) would preserve more document structure. But for six papers, the gains wouldn't justify the code. That's a decision I'd revisit at one hundred papers, not six.
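For reference, here is roughly what section-aware chunking could look like. This is a sketch I have not shipped: the header list and the assumption that each header sits alone on its own line are guesses that real PDFs will often violate, and any text before the first recognised header is simply dropped.

```python
import re

# Assumed header names; real papers vary, so this list would need tuning.
SECTION_HEADERS = ["Abstract", "Introduction", "Methods", "Results", "Discussion"]
HEADER_RE = re.compile(
    r"^(%s)[ \t]*$" % "|".join(SECTION_HEADERS), re.MULTILINE | re.IGNORECASE
)

def split_by_section(text: str) -> list:
    """Return (section_name, section_text) pairs in document order."""
    matches = list(HEADER_RE.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(1).title(), text[m.end():end].strip()))
    return sections

def chunk_sections(text: str, chunk_size: int = 1000, overlap: int = 200) -> list:
    """Overlapping fixed-size chunks that never cross a section boundary."""
    step = chunk_size - overlap  # requires overlap < chunk_size
    chunks = []
    for name, body in split_by_section(text):
        if not body:
            continue
        # Stop once the previous window has already reached the end of the body.
        for start in range(0, max(len(body) - overlap, 1), step):
            chunks.append({"section": name, "text": body[start:start + chunk_size]})
    return chunks

# Hypothetical miniature "paper" to exercise the splitter.
paper = "Abstract\nWe model spread.\nMethods\n" + "x" * 1500 + "\nResults\nIt spreads."
section_chunks = chunk_sections(paper)
```

Each chunk carries its section name, which doubles as metadata for filtering at query time.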

Design decision 2: Embedding model

For local development I used Ollama's nomic-embed-text — 768 dimensions, runs on my laptop, free. For deployment I had to swap it out, because Ollama can't run on Render's free tier. I went with sentence-transformers/all-MiniLM-L6-v2: 384 dimensions, 80MB on disk, runs on CPU, widely supported, and specifically designed for semantic similarity.

The tradeoff: MiniLM is smaller and slightly less accurate than nomic-embed-text on retrieval benchmarks. For a portfolio demo over six papers, the difference is imperceptible. At scale — thousands of domain-specific documents — I'd re-evaluate. Domain-tuned embeddings like BioBERT or SciBERT would likely improve recall on technical terms like sarbecovirus or Mastomys natalensis that appear rarely in general-purpose training data.

The embedding model is the most important choice in a RAG pipeline. Everything downstream assumes that "semantically similar text ends up geometrically close in the vector space." If your embedding model fails that promise for your domain, no amount of prompt engineering will save you.

Design decision 3: Vector store

ChromaDB. I wanted boring. ChromaDB runs in-process, persists to a directory on disk, has a Python API that mirrors a dictionary, and handles my 779 chunks with a latency that rounds to zero. The entire vector store is a 30MB sqlite file I commit to the repo.

If this ever grew to thousands of documents, I'd move to Pinecone or Weaviate for managed scaling, metadata filtering, and hybrid search out of the box. For six papers, a sqlite file is the right answer. Choosing the simplest thing that works is an engineering virtue.
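To show why "boring" is enough at this scale, here is a toy persisted vector store in plain Python. It is a hypothetical stand-in, not ChromaDB's API: the class name, the JSON file layout, and the three-dimensional vectors are all invented for illustration. With a few hundred vectors, brute-force cosine search like this is already fast.

```python
import json
import math
import os
import tempfile

class TinyVectorStore:
    """Hypothetical stand-in for a persisted vector store (not ChromaDB's API)."""

    def __init__(self, path: str):
        self.path = path
        self.rows = []
        if os.path.exists(path):
            with open(path) as f:
                self.rows = json.load(f)

    def add(self, text: str, vector: list, source: str) -> None:
        self.rows.append({"text": text, "vector": vector, "source": source})

    def persist(self) -> None:
        with open(self.path, "w") as f:
            json.dump(self.rows, f)

    def query(self, vector: list, k: int = 4) -> list:
        """Brute-force cosine search over every stored vector."""
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            return dot / (na * nb) if na and nb else 0.0
        ranked = sorted(self.rows, key=lambda r: cos(vector, r["vector"]), reverse=True)
        return ranked[:k]

# Made-up vectors and file names, purely for the demo.
path = os.path.join(tempfile.mkdtemp(), "store.json")
store = TinyVectorStore(path)
store.add("chunk about Lassa virus", [1.0, 0.0, 0.1], "lassa.pdf")
store.add("chunk about travel networks", [0.0, 1.0, 0.2], "mobility.pdf")
store.persist()

reloaded = TinyVectorStore(path)          # simulates a fresh process
hits = reloaded.query([0.9, 0.1, 0.0], k=1)
```

A real store adds an approximate-nearest-neighbour index and metadata handling on top, which is where managed services start to earn their keep.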

Design decision 4: LLM

This one took a few attempts. My initial plan was Google Gemini — free tier, generous quota, well-documented. When I tested it from Auckland, the API returned RESOURCE_EXHAUSTED with a free-tier limit of zero. Regional restrictions, apparently. A new API key from the same Google account produced the same error. Lesson learned.

I switched to Groq, which hosts Llama 3.3 70B on custom inference hardware and gives away 1,000 requests per day on its free tier. For a portfolio chat widget that might get fifty queries a week from curious recruiters, this is effectively unlimited. Groq's latency is also genuinely impressive — responses start streaming in under a second even for a 70-billion-parameter model.

The API keeps the provider swappable via an LLM_PROVIDER environment variable. If Groq's free tier ever disappears, I can flip to OpenAI gpt-4o-mini or Gemini with a one-line change. Hedging against provider lock-in is cheap when LangChain handles the abstraction.

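import os

# Provider is read from the environment at startup. Defaulting to "groq" here
# is an assumption that mirrors the fallback branch below.
LLM_PROVIDER = os.getenv("LLM_PROVIDER", "groq").lower()

# Import lazily so only the selected provider's package has to be installed.
if LLM_PROVIDER == "gemini":
    from langchain_google_genai import ChatGoogleGenerativeAI
    llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0.1)
elif LLM_PROVIDER == "openai":
    from langchain_openai import ChatOpenAI
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)
else:
    from langchain_groq import ChatGroq
    llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0.1)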

Design decision 5: The prompt

The prompt is where you actually earn the "grounded" in retrieval-augmented generation. Mine is short but opinionated:

# SYSTEM_TEMPLATE
You are a research assistant for Dr Reju Sam John, a computational
epidemiologist and data scientist based in Auckland, New Zealand.

INSTRUCTIONS:
- Answer the question using ONLY the provided context from his
  published peer-reviewed papers.
- If the context does not contain enough information to answer,
  say so honestly — do not hallucinate.
- Cite papers using short names like (John et al., 2024) — derive
  the author and year from the filename in the [Source:] tags.
- Keep answers under 150 words for readability.

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:

Two instructions matter most. "Use ONLY the provided context" is the hallucination fence. "Say so honestly when context is insufficient" is the escape hatch — without it, models will invent plausible-sounding answers to avoid looking stupid. Together these two lines turn an unconstrained language model into a cited research assistant.
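The `{context}` slot is filled with the retrieved chunks, each prefixed with its source tag. A minimal sketch of that step follows; the chunk dictionaries, the file name, and the shortened template are illustrative assumptions rather than my exact code.

```python
# Shortened tail of the template above; the instruction block is omitted here.
PROMPT_TAIL = "CONTEXT:\n{context}\n\nQUESTION:\n{question}\n\nANSWER:"

def format_context(chunks: list) -> str:
    """Prefix every chunk with a [Source: ...] tag so the model can cite it."""
    return "\n\n".join(f"[Source: {c['source']}]\n{c['text']}" for c in chunks)

def fill_prompt(question: str, chunks: list) -> str:
    """Substitute the tagged context and the question into the template."""
    return PROMPT_TAIL.format(context=format_context(chunks), question=question)

# Hypothetical retrieved chunks; the file name is made up.
retrieved = [
    {"source": "john-et-al-2024-example-title.pdf", "text": "Highly connected networks spread fast."},
    {"source": "john-et-al-2024-example-title.pdf", "text": "Travel time matters less at high connectivity."},
]
filled = fill_prompt("What limits the role of travel time?", retrieved)
```

Because the tags travel with the text, the "ONLY the provided context" instruction and the citation instruction both have something concrete to refer to.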

The citation format was a late addition. My first version told the model to cite using the raw [Source: filename] tags that appear in the context. It dutifully did — and the output read like this: [Source: john-et-al-2024-high-connectivity-and-human-movement-limits-...]. Correct and unreadable. I changed the instruction to derive author-year citations from the filename and the output now reads like a paper: (John et al., 2024). Small prompt change, big UX win.
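If you would rather not rely on the model for this, the same conversion can be done deterministically after retrieval. The sketch below is an alternative to the prompt-based approach, not what the site does, and it assumes every file is named `surname-et-al-year-...`; the example file name is hypothetical.

```python
import re
from typing import Optional

def citation_from_filename(filename: str) -> Optional[str]:
    """Turn 'john-et-al-2024-some-title.pdf' into '(John et al., 2024)'."""
    m = re.match(r"^([a-z]+)-et-al-((?:19|20)\d{2})\b", filename.lower())
    if not m:
        return None  # naming convention not followed; caller decides what to do
    surname, year = m.groups()
    return f"({surname.capitalize()} et al., {year})"

cite = citation_from_filename("john-et-al-2024-example-title.pdf")
no_cite = citation_from_filename("notes.pdf")
```

The trade-off is rigidity: the regex breaks on single-author or two-author file names, which the LLM handles without being told.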

What actually broke

Three things tripped me up that the tutorials don't warn you about.

1. The embedding model has to match

The vector store is a collection of vectors produced by one specific embedding model. If I ingest with nomic-embed-text and then query with all-MiniLM-L6-v2, the query vector lives in a completely different geometric space from the stored vectors. Similarity scores become statistical noise. I now maintain two separate vector stores — chroma_db/ for local Ollama and chroma_db_deploy/ for the hosted API — specifically to avoid this landmine.
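A cheap extra guard is to record which model built a store and refuse to query it with anything else. The sketch below is a suggestion rather than part of the current code; the manifest file name and both helper functions are hypothetical.

```python
import json
import os
import tempfile

MANIFEST_NAME = "embedding_manifest.json"  # hypothetical file name

def write_manifest(store_dir: str, model_name: str, dimensions: int) -> None:
    """Call at ingestion time, next to the persisted vectors."""
    os.makedirs(store_dir, exist_ok=True)
    with open(os.path.join(store_dir, MANIFEST_NAME), "w") as f:
        json.dump({"model": model_name, "dimensions": dimensions}, f)

def check_manifest(store_dir: str, model_name: str, dimensions: int) -> None:
    """Call at startup on the query side; raises if the models differ."""
    with open(os.path.join(store_dir, MANIFEST_NAME)) as f:
        manifest = json.load(f)
    expected = {"model": model_name, "dimensions": dimensions}
    if manifest != expected:
        raise ValueError(f"store was built with {manifest}, query side uses {expected}")

store_dir = os.path.join(tempfile.mkdtemp(), "chroma_db_deploy")
write_manifest(store_dir, "all-MiniLM-L6-v2", 384)
check_manifest(store_dir, "all-MiniLM-L6-v2", 384)  # matching model: passes silently

try:
    check_manifest(store_dir, "nomic-embed-text", 768)  # wrong model for this store
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

A dimension mismatch (768 vs 384) usually fails loudly on its own; the dangerous case is two different models with the same dimension, which only a name check like this catches.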

2. Free tiers are not portable

Gemini's free tier is zero in New Zealand. Hugging Face's Inference API quietly dropped Mistral-7B support last year. OpenRouter's free models rate-limit aggressively. I didn't write the API with a pluggable LLM layer because I wanted flexibility for its own sake; all I wanted was one working provider. I wrote it that way because I didn't trust any single provider to stay free for the lifetime of my portfolio.
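One step beyond an environment-variable switch is to try providers in order and use the first that answers. The following sketch is an extension I have not implemented; the two provider functions are fakes that stand in for real API clients.

```python
from typing import Callable, Dict, Tuple

def first_working_provider(
    providers: Dict[str, Callable[[str], str]], prompt: str
) -> Tuple[str, str]:
    """Try each provider in insertion order; return (name, answer) from the first success."""
    errors = {}
    for name, call in providers.items():
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: quota, network, auth, etc.
            errors[name] = repr(exc)
    raise RuntimeError(f"all providers failed: {errors}")

# Fake providers for the demo; real ones would wrap the actual client calls.
def fake_gemini(prompt: str) -> str:
    raise RuntimeError("RESOURCE_EXHAUSTED")

def fake_groq(prompt: str) -> str:
    return f"(stub answer to: {prompt})"

provider, answer = first_working_provider(
    {"gemini": fake_gemini, "groq": fake_groq},
    "How does Lassa virus spread?",
)
```

The cost is extra latency on every request that hits a dead provider first, so in practice you would cache the last known-good provider.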

3. Retrieval quality degrades on meta-questions

If you ask "What methods did you use in the sarbecovirus study?", the embedding of that question is dominated by generic words like methods and study that appear in every paper. The retrieved chunks end up being a mix of generic methodology text from several papers, not the specific Methods section of the one you meant.

Fixing this properly needs one of: metadata filtering (search within a named paper), query expansion (rephrase the question with content-specific terms), or hybrid search combining BM25 keyword matching with dense retrieval. I haven't implemented any of them yet. It's on the list.
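As a rough picture of the hybrid option, the sketch below scores chunks with a small Okapi BM25 implementation and merges that ranking with a dense ranking using reciprocal rank fusion (RRF). RRF is one common way to combine the two, chosen here for simplicity. The four chunks are invented, and the dense ranking is hard-coded to mimic a retriever that prefers generic methodology text.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_ranking(query: str, docs: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Return document indices ordered by Okapi BM25 score, best first."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

def rrf(rankings: list, k: int = 60) -> list:
    """Reciprocal rank fusion: sum 1 / (k + rank) across the input rankings."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda d: fused[d], reverse=True)

# Invented chunks; index 1 is the one the question is really about.
docs = [
    "Methods: we fit a compartmental model to weekly case counts.",
    "Methods: sarbecovirus host ranges were mapped across bat species.",
    "Methods: networks were built from flight data.",
    "Results: epidemics spread rapidly on dense networks.",
]
query = "What methods did you use in the sarbecovirus study?"
keyword_ranking = bm25_ranking(query, docs)
dense_ranking = [0, 1, 2, 3]  # hypothetical dense result, generic text first
fused_ranking = rrf([dense_ranking, keyword_ranking])
```

The rare term "sarbecovirus" carries a high IDF weight, so the keyword side pulls the right chunk to the top even when the dense side is distracted by "methods".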

Local vs deployed: two stacks, one codebase

Running the app locally and serving it in production turned out to need two different stacks. Here's the split:

| Component    | Local (Streamlit)           | Deployed (FastAPI)                         |
|--------------|-----------------------------|--------------------------------------------|
| Embeddings   | nomic-embed-text via Ollama | all-MiniLM-L6-v2 via sentence-transformers |
| LLM          | llama3.2 via Ollama         | llama-3.3-70b-versatile via Groq           |
| Vector store | chroma_db/                  | chroma_db_deploy/                          |
| Interface    | Streamlit chat UI           | FastAPI → website JS fetch                 |
| Hosting      | localhost                   | Render (free tier)                         |
| Cost         | $0 (local compute)          | $0 (all free tiers)                        |

The two stacks share every other line of code: chunking logic, retriever setup, prompt template, chain composition. The only variation is in the embeddings module and the LLM factory. That separation is what made the pivot from Streamlit-on-localhost to FastAPI-on-Render take an afternoon instead of a week.
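One way to keep that split explicit is a small config object selected by an environment variable. The sketch below is illustrative: the `StackConfig` class and the `APP_STACK` variable are hypothetical names, while the model and directory values are the ones from the table.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class StackConfig:
    embedding_model: str
    llm_model: str
    vector_store_dir: str

STACKS = {
    "local": StackConfig("nomic-embed-text", "llama3.2", "chroma_db/"),
    "deploy": StackConfig("all-MiniLM-L6-v2", "llama-3.3-70b-versatile", "chroma_db_deploy/"),
}

def load_stack(env=None) -> StackConfig:
    """Pick the stack from APP_STACK (hypothetical variable), defaulting to local."""
    env = os.environ if env is None else env
    return STACKS[env.get("APP_STACK", "local")]

deployed = load_stack({"APP_STACK": "deploy"})
local = load_stack({})
```

Tying the embedding model and the vector store directory together in one object also makes it harder to pair a store with the wrong model by accident.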

Does it actually work?

Three example queries, responses verbatim from the live chat:

Q: How does human movement affect infectious disease transmission?
A: Human movement limits the impact of travel time on infectious disease transmission, and its restriction can have a negligible effect on epidemic dynamics unless severely limited. (John et al., 2024)

Q: How does Lassa virus transmit from rodents to humans?
A: Lassa virus is transmitted to humans probably through the ingestion of food or contact with contaminated rodent urine, droppings, or saliva. (John et al., 2024)

Q: What is the connectivity paradox in pandemic spread?
A: Highly connected networks lead to rapid pandemic spread, making travel times less important as a predictor of spread, unless human movement is severely restricted. (John et al., 2024)

The cited paper is correct in all three cases, and the answers are concise and faithful to the source. When I ask a question the papers don't address ("Who is Reju Sam John?"), the model declines gracefully rather than inventing a biography. Inventing one is exactly the failure mode the prompt is designed to prevent.

What I'd build next

Extending this to a multi-disciplinary corpus

I mentioned at the top that this demo deliberately excludes my earlier astrophysics and computational-physics papers. That was a pragmatic choice for a portfolio piece, but the underlying problem is one of the more interesting open questions in applied RAG: what happens when the corpus spans disciplines that share vocabulary but mean different things by it? The word "network" in an epidemiology paper is a contact graph between hosts; in an astrophysics paper it is a cosmic filament structure. Dense embeddings learn a single average meaning and happily return the wrong one.

A single flat vector index is the wrong shape for that problem. Four techniques, stacked, solve almost all of it.

  1. Per-discipline namespaces, not a single collection. ChromaDB collections, Pinecone namespaces, Weaviate classes — the primitive exists in every serious vector store. Ingest each paper with a domain tag and store it in the collection for that domain. Retrieval is then scoped to one domain at a time, and cross-domain pollution becomes architecturally impossible rather than statistically unlikely.
  2. Metadata-filtered retrieval. Alongside the chunks, store a structured metadata record: domain, year, venue, authors, section. Filter at query time, either by explicit user choice (a dropdown on the chat widget) or by a classifier that infers the domain from the question itself. This is a single parameter in most modern vector DB APIs and handles most of the problem for almost no engineering cost.
  3. Query routing. A small LLM call classifies the question ("epidemiology", "astrophysics", "methodology", spanning both) and dispatches the retrieval to the correct sub-index. LlamaIndex ships a RouterQueryEngine for exactly this; LangChain calls the same pattern a multi-retriever chain. For questions that span domains, the router fans out, retrieves from both, and merges with explicit provenance so the LLM can discuss the overlap honestly.
  4. Domain-specific embeddings where it matters. A generic MiniLM embedding is a reasonable default, but for scientific corpora a model trained on academic text — SPECTER, SciBERT, or BGE fine-tuned on your own corpus — measurably improves retrieval on domain terminology. You can even run different embedding models per namespace: SPECTER for biology, a physics-tuned model for physics, and store both in their own collections. They never need to share a vector space.

A good sanity check for this kind of architecture is to write the obvious failing query by hand — "describe the role of network structure in disease spread" — and confirm the retriever no longer returns dark-matter halo passages. That single test is worth a hundred abstract discussions about embedding quality.
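Here is a toy version of the namespace-plus-routing idea together with that sanity check. Everything in it is a stand-in: the two collections hold invented sentences, the router matches keywords where a real system would use a small LLM call, and word overlap replaces embedding similarity.

```python
import re

# Hypothetical per-domain collections; a real system would use one
# vector-store collection (or namespace) per domain.
COLLECTIONS = {
    "epidemiology": [
        "Contact network structure shapes how fast a disease can spread.",
        "Travel restrictions delay epidemic arrival only when they are severe.",
    ],
    "astrophysics": [
        "Dark matter halos trace the network structure of the cosmic web.",
        "Halo density profiles steepen towards the centre.",
    ],
}

# Keyword stand-in for the LLM classifier described in technique 3.
ROUTING_KEYWORDS = {
    "epidemiology": {"disease", "virus", "epidemic", "pandemic", "transmission"},
    "astrophysics": {"halo", "halos", "galaxy", "cosmic", "dark"},
}

def tokens(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))

def route(question: str) -> list:
    """Return the domains to search; fan out to all of them if nothing matches."""
    words = tokens(question)
    matched = [d for d, kws in ROUTING_KEYWORDS.items() if words & kws]
    return matched or list(COLLECTIONS)

def retrieve(question: str, k: int = 2) -> list:
    """Search only the routed collections and keep provenance on every hit."""
    words = tokens(question)
    results = []
    for domain in route(question):
        ranked = sorted(
            COLLECTIONS[domain], key=lambda c: len(words & tokens(c)), reverse=True
        )
        results.extend({"domain": domain, "text": c} for c in ranked[:k])
    return results

# The sanity check: the obvious failing query must stay inside epidemiology.
hits = retrieve("describe the role of network structure in disease spread")
assert all(h["domain"] == "epidemiology" for h in hits)
assert not any("halo" in h["text"].lower() for h in hits)

both = route("compare epidemic networks with the cosmic web")
```

Note that the astrophysics sentence about "network structure" would have scored well in a single flat index; scoping the search is what keeps it out.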

The honest answer to "how do you handle a mixed-discipline corpus?" is therefore not "use a better embedding model". It is "stop pretending one index can represent every domain — split the index, tag the chunks, route the queries, and let each domain have its own retrieval space". The engineering is straightforward. The discipline is remembering to do it before the embeddings quietly lie to you.

Stack. LangChain for pipeline orchestration, ChromaDB for vector storage, sentence-transformers (all-MiniLM-L6-v2) and Ollama (nomic-embed-text) for embeddings, Groq for inference (llama-3.3-70b-versatile), FastAPI for the REST API, Streamlit for the local dev UI, Render for deployment. Full code and papers list on GitHub.

Try the live chat widget or explore the code.
