The Problem Worth Solving
Most analytics portfolio projects demonstrate a skill in isolation: a Pandas notebook here, a Power BI dashboard there. What they rarely show is the full chain — from a real, messy dataset through a validated pipeline to a BI-ready data model that a team could actually trust in production.
This project was built to close that gap deliberately. Using the MIMIC-III Waveform Database — a publicly available dataset of ICU patient vital signs from PhysioNet — the goal was to build exactly the kind of system a Data Centre of Excellence maintains: automated, validated, documented, and maintainable by anyone on the team.
The domain choice was intentional. Respiratory biometrics — SpO₂, respiratory rate, heart rate, blood pressure — map directly to the kind of measurement data that healthcare analytics teams work with daily: time-series measurements, patient cohorts, clinical thresholds, and risk stratification. The engineering challenge is identical regardless of whether the underlying metric is a vital sign or a manufacturing sensor reading.
How It's Built
The system has three layers: an ETL and validation layer that processes raw data, a Power BI export layer that generates the star-schema model, and an interactive dashboard layer for exploratory analysis. All three are orchestrated by a single pipeline runner with gate-based validation logic.
The data_pipeline.py orchestrator is the spine of the system. It runs each stage sequentially, logging every gate result before proceeding. A FAIL at any gate halts the pipeline — outputs are never generated from data that has not passed validation.
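The halt-on-FAIL behaviour can be sketched as a short runner loop. This is an illustrative sketch only — stage names and the string-status convention are assumptions, not the project's actual API:

```python
from typing import Callable

def run_pipeline(stages: list[tuple[str, Callable[[], str]]]) -> list[tuple[str, str]]:
    """Run stages in order; halt at the first FAIL so no output is
    ever generated from data that has not passed validation."""
    results = []
    for name, stage in stages:
        status = stage()  # each stage returns "PASS", "WARN", or "FAIL"
        results.append((name, status))
        if status == "FAIL":
            break  # later stages (e.g. the Power BI export) never run
    return results

# Hypothetical run in which quality validation fails:
results = run_pipeline([
    ("source_validation", lambda: "PASS"),
    ("etl", lambda: "PASS"),
    ("quality_validation", lambda: "FAIL"),
    ("powerbi_export", lambda: "PASS"),  # never reached
])
```

A WARN, by contrast, is logged but does not stop the run — only a hard FAIL blocks downstream outputs.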
The Four-Gate Pipeline
Most pipelines validate reactively — you notice something is wrong when a dashboard shows nonsense. This pipeline validates proactively, at four explicit checkpoints, with structured JSON logging at each one.
Source Validation
Verify raw data directories exist and are non-empty before anything else runs. Fast-fail before wasting compute on a missing source.
ETL Processing
Extract minute-level vital signs via WFDB, standardise inconsistent monitor naming (SpO2→spo2), engineer clinical features, apply quality flags.
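The naming standardisation step can be illustrated with a small alias map. The specific aliases below are assumptions for illustration; only SpO2→spo2 is taken from the text:

```python
# Monitors export the same vital under several labels; map each to one
# canonical snake_case name so downstream logic sees a single schema.
CANONICAL = {
    "SpO2": "spo2", "SPO2": "spo2",          # from the text
    "RESP": "resp_rate", "RR": "resp_rate",  # assumed aliases
    "HR": "heart_rate", "PULSE": "heart_rate",
}

def standardise(signal_names: list[str]) -> list[str]:
    """Known aliases map to the canonical name; unknown names are
    lower-cased as a fallback."""
    return [CANONICAL.get(name, name.lower()) for name in signal_names]
```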
Quality Validation
Completeness checks per field, range validation against clinical thresholds, duplicate detection, Z-score outlier flagging with structured output.
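The Z-score outlier check follows a standard pattern; a minimal sketch (the real gate emits structured flags rather than bare booleans):

```python
from statistics import mean, stdev

def zscore_outlier_flags(values: list[float], threshold: float = 3.0) -> list[bool]:
    """Flag each value whose |z-score| exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    return [abs((v - mu) / sigma) > threshold for v in values]

# A run of normal SpO2 readings with one implausible dip:
readings = [97.0] * 29 + [50.0]
flags = zscore_outlier_flags(readings)
```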
Power BI Export
Generate star-schema CSVs, pre-authored DAX measures JSON, data model documentation. Fires only on clean, fully-validated data.
Each gate writes a timestamped PASS / WARN / FAIL entry to a structured JSON audit log. A complete report is generated per run. You can always answer: when did this pipeline last run cleanly, and what did it find?
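The shape of such an entry might look like the following. The field names and JSON Lines layout are assumptions for illustration, not the project's exact schema:

```python
import json
from datetime import datetime, timezone

def gate_entry(gate: str, status: str, detail: str) -> dict:
    """Build one timestamped audit record for a gate result."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "gate": gate,
        "status": status,  # "PASS" | "WARN" | "FAIL"
        "detail": detail,
    }

def append_audit(log_path: str, entry: dict) -> None:
    """Append as JSON Lines so each run's log stays machine-readable."""
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```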
Power BI Star Schema
The export layer produces a proper star schema — not a flat CSV handed off to an analyst to wrestle into shape. The model is designed for query performance and self-service usability: dimension tables for filtering and slicing, a central fact table for aggregation, and pre-authored DAX measures so analysts aren't writing the same CALCULATE expressions from scratch.
Pre-authored DAX measures exported alongside the schema:
Average SpO2 = AVERAGE(fact_vitals[spo2])

Hypoxemia Rate % =
DIVIDE(
    COUNTROWS(FILTER(
        fact_vitals,
        fact_vitals[hypoxemia_flag] = 1)),
    COUNTROWS(fact_vitals)
) * 100

High Risk Patients =
CALCULATE(
    DISTINCTCOUNT(dim_patient[patient_key]),
    dim_patient[risk_category] = "High"
)

agg_hourly_vitals.csv: Hourly rollups pre-computed at export for dashboard performance at scale.
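The dimension/fact split behind these measures can be sketched in a few lines. Record and column names are illustrative, and the hypoxemia threshold of 90% is an assumption (the real pipeline reads it from config):

```python
def build_star_schema(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Assign surrogate keys per patient, then reference them from fact rows."""
    dim_patient, key_of = [], {}
    for r in records:
        pid = r["patient_id"]
        if pid not in key_of:
            key_of[pid] = len(key_of) + 1  # surrogate key, not the source ID
            dim_patient.append({"patient_key": key_of[pid], "patient_id": pid})
    fact_vitals = [
        {"patient_key": key_of[r["patient_id"]],
         "spo2": r["spo2"],
         "hypoxemia_flag": int(r["spo2"] < 90)}  # assumed threshold
        for r in records
    ]
    return dim_patient, fact_vitals

dim, fact = build_star_schema([
    {"patient_id": "p1", "spo2": 97},
    {"patient_id": "p2", "spo2": 85},
    {"patient_id": "p1", "spo2": 92},
])
```

Each table then lands in its own CSV, which is what lets the DAX measures above join cleanly on patient_key.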
Five-Module Streamlit Application
The dashboard is built for multi-level interrogation. A clinical or analytics team can move from cohort-level KPIs to individual patient trajectories to statistical hypothesis testing within the same application. All modules share a real-time patient filter that cascades across pages.
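One way to picture the cascading filter is as a single piece of shared selection state that every module applies before rendering. This sketch abstracts away Streamlit itself (in the app this would live in session state); the class and field names are hypothetical:

```python
class FilterState:
    """Shared patient selection applied by every dashboard module."""

    def __init__(self):
        self.selected_patients: set | None = None  # None means "all patients"

    def apply(self, rows: list[dict]) -> list[dict]:
        if self.selected_patients is None:
            return rows
        return [r for r in rows if r["patient_id"] in self.selected_patients]

state = FilterState()
rows = [{"patient_id": "p1"}, {"patient_id": "p2"}]
all_rows = state.apply(rows)       # no selection yet: everything passes
state.selected_patients = {"p2"}
filtered = state.apply(rows)       # every page now sees only p2
```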
Overview & Demographics
Cohort KPIs, severity gauges, monitoring duration stats across the full patient set.
Respiratory Biometrics
SpO₂ and respiratory rate distributions, patient-level comparisons with clinical threshold overlays.
Temporal Analysis
Per-patient time-series trajectories, hypoxemia episode detection, shift-period breakdowns.
Data Quality Report
Completeness rates per field, quality flag summaries, outlier counts — the audit trail made visible in the UI.
Statistical Analysis
t-tests, Cohen's d effect sizes, Pearson correlation heatmap, outcome analysis — for the analyst who wants to go deeper.
Every chart supports CSV download. The application is stateless and configuration-driven via config.py — clinical thresholds and parameters are defined once and respected everywhere.
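A config.py of this kind often takes the shape of a frozen dataclass. The names and threshold values below are assumptions for illustration, not the project's actual settings:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: thresholds can't be mutated at runtime
class ClinicalThresholds:
    spo2_hypoxemia: float = 90.0    # % SpO2 below this flags hypoxemia (assumed)
    resp_rate_high: float = 24.0    # breaths/min (assumed)
    heart_rate_high: float = 100.0  # bpm (assumed)
    zscore_outlier: float = 3.0     # |z| above this flags an outlier (assumed)

THRESHOLDS = ClinicalThresholds()
```

Pipeline gates and dashboard modules alike import THRESHOLDS, so changing a value here propagates to every consumer.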
Built to Be Maintained
The features matter less than what they reveal about underlying discipline. Each engineering choice was made with the same question in mind: what does it take for a team to own and extend this independently?
GitHub Actions on every push and pull request
Three automated stages: flake8 lint, pytest unit tests, and data validation schema checks. The CI badge is a live signal of system health — not an afterthought.
Centralised configuration management
config.py holds all clinical thresholds, file paths, and pipeline parameters. Change a threshold once — it propagates everywhere. No scattered magic numbers.
pytest unit test suite
Tests cover standardisation logic, metric computation, and summary generation. A breaking code change is caught before it reaches the data model.
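A test for metric computation might look like this pytest-style sketch — the function and test names are hypothetical, but they show how the suite pins down a metric's contract so a refactor can't silently change it:

```python
def hypoxemia_rate(spo2_values: list[float], threshold: float = 90.0) -> float:
    """Percentage of readings strictly below the hypoxemia threshold."""
    flags = [v < threshold for v in spo2_values]
    return 100.0 * sum(flags) / len(flags)

# pytest collects test_* functions automatically; no runner code needed.
def test_hypoxemia_rate_counts_values_below_threshold():
    assert hypoxemia_rate([95, 85, 97, 88]) == 50.0

def test_hypoxemia_rate_threshold_is_exclusive():
    assert hypoxemia_rate([90.0]) == 0.0
```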
Structured JSON audit logs per run
Every execution writes a machine-readable log to data/logs/. Complete traceability: what ran, what passed, what was flagged, and when.
VS Code Dev Container configuration
The .devcontainer/ setup guarantees a reproducible environment across machines. New contributor, same environment — no "works on my machine" surprises.
What This Project Shows
End-to-end BI delivery
Raw data → validated pipeline → star-schema model → Power BI exports with pre-authored DAX. The complete chain, not just a segment of it.
Power BI fluency
Not just connecting a CSV. Proper star-schema design, surrogate keys, dimension tables, and authored DAX measures — the architecture a BI team operates within.
Analytics engineering craft
CI/CD, pytest, structured logging, configuration management, Dev Container. The infrastructure that makes data products maintainable at team scale.
Multi-audience design
Five dashboard modules calibrated for different analytical depths — from executive KPIs to statistical hypothesis testing — in a single coherent application.
Clinical domain depth
Hypoxemia detection logic, clinical threshold validation, ventilation type classification. Domain knowledge embedded in the pipeline, not bolted on afterward.
Reproducible by design
Any analyst with the repo and PhysioNet credentials can reproduce the full pipeline from scratch — no handover session required.