The Problem Worth Solving

Most analytics portfolio projects demonstrate a skill in isolation: a Pandas notebook here, a Power BI dashboard there. What they rarely show is the full chain — from a real, messy dataset through a validated pipeline to a BI-ready data model that a team could actually trust in production.

This project was built to close that gap deliberately. Using the MIMIC-III Waveform Database — a publicly available dataset of ICU patient vital signs from PhysioNet — the goal was to build exactly the kind of system a Data Centre of Excellence maintains: automated, validated, documented, and maintainable by anyone on the team.

The domain choice was intentional. Respiratory biometrics — SpO₂, respiratory rate, heart rate, blood pressure — map directly to the kind of measurement data that healthcare analytics teams work with daily: time-series measurements, patient cohorts, clinical thresholds, and risk stratification. The engineering challenge is identical regardless of whether the underlying metric is a vital sign or a manufacturing sensor reading.

The benchmark for every design decision: if I left tomorrow, could another analyst run this confidently — and trust the outputs?

How It's Built

The system has three layers: an ETL and validation layer that processes raw data, a Power BI export layer that generates the star-schema model, and an interactive dashboard layer for exploratory analysis. All three are orchestrated by a single pipeline runner with gate-based validation logic.

Architecture overview (diagram): RAW DATA — MIMIC-III via PhysioNet — feeds the ETL PIPELINE (extract + standardise, feature engineering, quality flags), orchestrated by data_pipeline.py (4-gate validation, JSON audit logs). The POWER BI EXPORT layer emits star-schema CSVs, DAX measures, and data_model.json for Power BI Desktop; alongside it sits the 5-module Streamlit DASHBOARD (Plotly, real-time filter, CSV export). GitHub Actions CI/CD runs lint · pytest · data validation on every push.

The data_pipeline.py orchestrator is the spine of the system. It runs each stage sequentially, logging every gate result before proceeding. A FAIL at any gate halts the pipeline — outputs are never generated from data that has not passed validation.
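The gate-based control flow can be sketched in a few lines. This is a minimal illustration of the halt-on-FAIL behaviour described above, not the actual code in data_pipeline.py — the names `GateResult` and `run_pipeline` are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GateResult:
    name: str
    status: str        # "PASS", "WARN", or "FAIL"
    detail: str = ""

def run_pipeline(gates: list[tuple[str, Callable[[], GateResult]]]) -> list[GateResult]:
    """Run gates in order, logging each result; a FAIL halts the pipeline."""
    results = []
    for name, gate in gates:
        result = gate()
        results.append(result)
        print(f"[{result.status}] {name}: {result.detail}")
        if result.status == "FAIL":
            break  # outputs are never generated past a failed gate
    return results
```

A WARN is recorded but does not stop the run; only a FAIL prevents downstream stages from executing.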

The Four-Gate Pipeline

Most pipelines validate reactively — you notice something is wrong when a dashboard shows nonsense. This pipeline validates proactively, at four explicit checkpoints, with structured JSON logging at each one.

GATE 01

Source Validation

Verify raw data directories exist and are non-empty before anything else runs. Fast-fail before wasting compute on a missing source.

GATE 02

ETL Processing

Extract minute-level vital signs via WFDB, standardise inconsistent monitor naming (SpO2 → spo2), engineer clinical features, apply quality flags.
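The naming standardisation step might look like the following sketch. The alias table and the function `standardise_channel` are illustrative — real monitor exports label the same signal inconsistently ("SpO2", "SPO2", "PULSE"), and the actual mapping lives in the pipeline:

```python
# Hypothetical alias table: raw monitor channel labels → canonical column names.
CANONICAL_NAMES = {
    "spo2": "spo2",
    "%spo2": "spo2",
    "pulse": "heart_rate",
    "hr": "heart_rate",
    "resp": "respiratory_rate",
    "rr": "respiratory_rate",
}

def standardise_channel(raw_name: str) -> str:
    """Map a raw monitor channel label to a canonical column name."""
    key = raw_name.strip().lower()
    return CANONICAL_NAMES.get(key, key)
```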

GATE 03

Quality Validation

Completeness checks per field, range validation against clinical thresholds, duplicate detection, Z-score outlier flagging with structured output.
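A condensed sketch of what Gate 03 computes per field — completeness, range violations against clinical bounds, duplicates, and Z-score outliers. The threshold values shown are illustrative; the project's real thresholds are defined in config.py:

```python
import pandas as pd

# Illustrative plausible-value bounds — the real values live in config.py.
CLINICAL_RANGES = {
    "spo2": (50, 100),
    "heart_rate": (20, 250),
    "respiratory_rate": (4, 60),
}

def quality_report(df: pd.DataFrame, z_cutoff: float = 3.0) -> dict:
    """Completeness, range, duplicate, and Z-score outlier checks per field."""
    report = {"duplicates": int(df.duplicated().sum()), "fields": {}}
    for col, (lo, hi) in CLINICAL_RANGES.items():
        s = df[col]
        z = (s - s.mean()) / s.std(ddof=0)   # NaNs are skipped by mean/std
        report["fields"][col] = {
            "completeness": round(float(s.notna().mean()), 3),
            "out_of_range": int(((s < lo) | (s > hi)).sum()),
            "z_outliers": int((z.abs() > z_cutoff).sum()),
        }
    return report
```

The structured dict maps directly onto the JSON entries the gate writes to the audit log.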

GATE 04

Power BI Export

Generate star-schema CSVs, pre-authored DAX measures JSON, data model documentation. Fires only on clean, fully-validated data.

Each gate writes a timestamped PASS / WARN / FAIL entry to a structured JSON audit log. A complete report is generated per run. You can always answer: when did this pipeline last run cleanly, and what did it find?
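The audit-log entry shape can be sketched as follows — a hedged illustration (function name and field layout are assumptions, not the project's actual logger), writing one JSON-lines record per gate to data/logs/:

```python
import json
import time
from pathlib import Path

def log_gate(run_id: str, gate: str, status: str, findings: dict,
             log_dir: str = "data/logs") -> Path:
    """Append one timestamped PASS/WARN/FAIL entry to the run's JSON-lines log."""
    entry = {
        "run_id": run_id,
        "gate": gate,
        "status": status,  # "PASS" | "WARN" | "FAIL"
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "findings": findings,
    }
    path = Path(log_dir)
    path.mkdir(parents=True, exist_ok=True)
    out = path / f"{run_id}.jsonl"
    with out.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return out
```

Because each record is machine-readable, "when did this last run cleanly?" is a one-line query over the log files rather than an archaeology exercise.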

Power BI Star Schema

The export layer produces a proper star schema — not a flat CSV handed off to an analyst to wrestle into shape. The model is designed for query performance and self-service usability: dimension tables for filtering and slicing, a central fact table for aggregation, and pre-authored DAX measures so analysts aren't writing the same CALCULATE expressions from scratch.

fact_vitals — FACT TABLE · 34,630 rows
  ▸ vital_key           PK · INT
  ◆ patient_key         FK → dim_patient
  ◆ time_key            FK → dim_time
  ◆ quality_key         FK → dim_quality
    heart_rate          FLOAT
    spo2                FLOAT
    respiratory_rate    FLOAT
    hypoxemia_flag      INT · 0/1
    tachypnea_flag      INT · 0/1

dim_patient — DIMENSION
  ▸ patient_key         PK · INT
    patient_id          VARCHAR
    risk_category       VARCHAR
    monitoring_hours    FLOAT
    avg_spo2            FLOAT

dim_time — DIMENSION
  ▸ time_key            PK · INT
    hour                INT
    shift               VARCHAR
    day_period          VARCHAR
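The surrogate-key pattern behind this model is straightforward in pandas. A minimal sketch (assuming a raw frame with a natural `patient_id` key — the function name is illustrative): derive the dimension, assign integer surrogate keys, and swap the natural key out of the fact table:

```python
import pandas as pd

def build_dim_patient(vitals: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Derive dim_patient with surrogate keys and key the fact rows against it."""
    dim = (vitals.groupby("patient_id")
                 .agg(avg_spo2=("spo2", "mean"),
                      monitoring_hours=("spo2", lambda s: len(s) / 60.0))
                 .reset_index())
    dim.insert(0, "patient_key", range(1, len(dim) + 1))   # surrogate PK
    # Replace the natural key with the surrogate FK in the fact table
    fact = vitals.merge(dim[["patient_id", "patient_key"]], on="patient_id")
    return dim, fact.drop(columns=["patient_id"])
```

Integer surrogate keys keep joins fast in the VertiPaq engine and decouple the model from source identifiers.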

Pre-authored DAX measures exported alongside the schema:

Average SpO₂
Average SpO2 =
  AVERAGE(fact_vitals[spo2])
Hypoxemia Rate %
Hypoxemia Rate % =
  DIVIDE(
    COUNTROWS(FILTER(
      fact_vitals,
      fact_vitals[hypoxemia_flag]=1)),
    COUNTROWS(fact_vitals)
  ) * 100
High Risk Patient Count
High Risk Patients =
  CALCULATE(
    DISTINCTCOUNT(
      dim_patient[patient_key]),
    dim_patient[risk_category]
      = "High"
  )
Pre-aggregated rollups

agg_hourly_vitals.csv — hourly rollups pre-computed at export for dashboard performance at scale.
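The rollup itself is a one-step resample. A sketch of the idea (column and function names are assumptions), assuming a fact frame with a `timestamp` column:

```python
import pandas as pd

def hourly_rollup(fact: pd.DataFrame) -> pd.DataFrame:
    """Pre-compute hourly means per patient so the dashboard never scans raw rows."""
    return (fact.set_index("timestamp")
                .groupby("patient_key")
                .resample("1h")[["spo2", "heart_rate", "respiratory_rate"]]
                .mean()
                .reset_index())
```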

Five-Module Streamlit Application

The dashboard is built for multi-level interrogation. A clinical or analytics team can move from cohort-level KPIs to individual patient trajectories to statistical hypothesis testing within the same application. All modules share a real-time patient filter that cascades across pages.
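The cascading-filter mechanic reduces to one shared function. This is a hedged sketch, not the app's actual code — `apply_patient_filter` is an illustrative name, and in the Streamlit app the `selected` list would come from a sidebar widget keyed into `st.session_state` (e.g. `st.sidebar.multiselect("Patients", options, key="patients")`) so the same selection persists across all five pages:

```python
import pandas as pd

def apply_patient_filter(df: pd.DataFrame, selected: list[str]) -> pd.DataFrame:
    """Filter any module's frame to the cohort chosen in the shared sidebar widget."""
    if not selected:  # empty selection == show everything
        return df
    return df[df["patient_id"].isin(selected)]
```

Every module calls the same function on its own frame, so a filter change cascades consistently across pages.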

Overview & Demographics

Cohort KPIs, severity gauges, monitoring duration stats across the full patient set.

Respiratory Biometrics

SpO₂ and respiratory rate distributions, patient-level comparisons with clinical threshold overlays.

Temporal Analysis

Per-patient time-series trajectories, hypoxemia episode detection, shift-period breakdowns.

Data Quality Report

Completeness rates per field, quality flag summaries, outlier counts — the audit trail made visible in the UI.

Statistical Analysis

t-tests, Cohen's d effect sizes, Pearson correlation heatmap, outcome analysis — for the analyst who wants to go deeper.

Every chart supports CSV download. The application is stateless and configuration-driven via config.py — clinical thresholds and parameters are defined once and respected everywhere.
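The shape of that configuration might look like this — example values only, not the project's actual thresholds:

```python
# config.py — illustrative layout; the real thresholds and paths may differ.
from pathlib import Path

# Paths
RAW_DATA_DIR = Path("data/raw")
LOG_DIR = Path("data/logs")
EXPORT_DIR = Path("data/powerbi")

# Clinical thresholds (hypothetical example values)
SPO2_HYPOXEMIA_THRESHOLD = 90      # % — flag readings below this
TACHYPNEA_THRESHOLD = 20           # breaths/min — flag rates above this
HEART_RATE_RANGE = (20, 250)       # plausible-value bounds for validation

# Pipeline parameters
Z_SCORE_OUTLIER_CUTOFF = 3.0
```

ETL flags, validation gates, and dashboard overlays all read from the same constants, so a threshold change is one edit.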

Built to Be Maintained

The features matter less than what they reveal about underlying discipline. Each engineering choice was made with the same question in mind: what does it take for a team to own and extend this independently?

CI/CD

GitHub Actions on every push and pull request

Three automated stages: flake8 lint, pytest unit tests, and data validation schema checks. The CI badge is a live signal of system health — not an afterthought.

CONFIG

Centralised configuration management

config.py holds all clinical thresholds, file paths, and pipeline parameters. Change a threshold once — it propagates everywhere. No scattered magic numbers.

TESTS

pytest unit test suite

Tests cover standardisation logic, metric computation, and summary generation. A breaking code change is caught before it reaches the data model.
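The kind of test in question can be sketched like this — the function under test is a toy stand-in, not the project's actual API:

```python
# test_metrics.py — illustrative unit test in the style of the suite.

def hypoxemia_flag(spo2: float, threshold: float = 90.0) -> int:
    """Toy version of the pipeline's hypoxemia flagging logic."""
    return int(spo2 < threshold)

def test_normal_reading_not_flagged():
    assert hypoxemia_flag(95.0) == 0

def test_boundary_reading_not_flagged():
    assert hypoxemia_flag(90.0) == 0

def test_low_reading_flagged():
    assert hypoxemia_flag(89.9) == 1
```

Because the clinical logic lives in small pure functions, a regression in flagging behaviour fails fast in CI rather than surfacing as a wrong number on a dashboard.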

AUDIT

Structured JSON audit logs per run

Every execution writes a machine-readable log to data/logs/. Complete traceability: what ran, what passed, what was flagged, and when.

DEV ENV

VS Code Dev Container configuration

The .devcontainer/ setup guarantees a reproducible environment across machines. New contributor, same environment — no "works on my machine" surprises.

What This Project Shows

End-to-end BI delivery

Raw data → validated pipeline → star-schema model → Power BI exports with pre-authored DAX. The complete chain, not just a segment of it.

Power BI fluency

Not just connecting a CSV. Proper star-schema design, surrogate keys, dimension tables, and authored DAX measures — the architecture a BI team operates within.

Analytics engineering craft

CI/CD, pytest, structured logging, configuration management, Dev Container. The infrastructure that makes data products maintainable at team scale.

Multi-audience design

Five dashboard modules calibrated for different analytical depths — from executive KPIs to statistical hypothesis testing — in a single coherent application.

Clinical domain depth

Hypoxemia detection logic, clinical threshold validation, ventilation type classification. Domain knowledge embedded in the pipeline, not bolted on afterward.

Reproducible by design

Any analyst with the repo and PhysioNet credentials can reproduce the full pipeline from scratch — no handover session required.

Technology Stack

Python 3.11
Pandas
NumPy
SciPy
Streamlit
Plotly
Power BI
DAX
Star Schema
GitHub Actions
pytest
flake8
WFDB
Dev Containers
Makefile

The full code, tests, and CI configuration are open on GitHub.
