Graduate coursework

Heart Disease Prediction with Logistic Regression

A 4-person comparative ML study on the UCI Heart Disease dataset. I owned the Logistic Regression model and the Introduction. The model hit a 5-fold cross-validated ROC-AUC of 0.911, and its top coefficients lined up with documented clinical risk factors for coronary artery disease. That alignment is the part that matters.

Role ML modeling + Background section (4-person team)
Timeline Apr - May 2026
Type Machine Learning / Clinical Decision Support
Dataset UCI Heart Disease (Cleveland) · n = 303
Read time 8 min

The question we asked

Can a simple, interpretable model predict whether a patient has coronary artery disease from a routine cardiology workup, and does it reason about features the way a clinician would? Heart disease is the leading cause of death in the United States. The strongest risk factors are already collected during a standard workup. The harder problem is combining them into a consistent risk estimate that a clinician can defend to a patient.

Our team picked the UCI Heart Disease dataset, Cleveland subset (303 patients, 13 features). I owned Logistic Regression and the Introduction section. My teammates ran SVM and Random Forest on the same shared preprocessing pipeline so all three models could be compared like-for-like.

The trade-off

Interpretability vs. raw accuracy

Random Forest typically scores higher on this dataset in published comparisons, but its decision logic is opaque. A clinician who has to defend a screening recommendation needs to know why the model fired. That made the comparison a question about the cost of opacity, not just raw performance.

What we found

LogReg gives up nothing on AUC and keeps all of the interpretability

5-fold CV ROC-AUC: LogReg 0.911, SVM 0.894, Random Forest ~0.87. On the holdout test set LogReg hit 0.939. The top five coefficients matched documented coronary artery disease risk factors. The strongest negative predictor matched the cardiovascular fitness literature.

Modeling pipeline

All three models share the same preprocessing pipeline so the comparison is honest. Median imputation and StandardScaler are wrapped in a sklearn Pipeline so they re-fit inside each cross-validation fold. The 80/20 split uses a fixed random seed for reproducibility.

Pipeline flow
UCI Cleveland: 303 patients, 13 features
Preprocessing: one-hot, impute, scale
80/20 split: stratified, seed 42
Logistic Regression: 5-fold stratified CV
Coefficients + ROC: tied to clinical risk factors

Key decisions

Shared preprocessing module across all three models. If each teammate ran their own train/test split and scaler, the cross-model comparison would be meaningless. We agreed on one preprocessing.py with median imputation and StandardScaler in a Pipeline. All three models import and use the same fixed split.

One-hot encoding for integer-coded categoricals. Chest pain type, resting ECG, ST slope, and thalassemia status are categorical codes, not ordered numerical quantities. Treating them as continuous would have introduced a false ordering. One-hot encoding turned 4 categorical features into 14 binary features that the model could weight independently.
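
Here is a minimal sketch of what the shared preprocessing.py could look like, combining this decision with the shared-module decision above. The column lists, the "target" label, and the function names are my illustrative assumptions (following the UCI Cleveland data dictionary), not a copy of the team's actual file.

# preprocessing.py -- minimal sketch of the shared module (illustrative, not the actual file)
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

CATEGORICAL = ["cp", "restecg", "slope", "thal"]                 # one-hot, no false ordering
NUMERIC = ["age", "trestbps", "chol", "thalach", "oldpeak", "ca",
           "sex", "fbs", "exang"]                                # binary flags scaled with numerics

def make_preprocessor():
    # One ColumnTransformer that every model imports, so all three see identical features.
    numeric_steps = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical_steps = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric_steps, NUMERIC),
        ("cat", categorical_steps, CATEGORICAL),
    ])

def load_split(df: pd.DataFrame):
    # One fixed, stratified 80/20 split shared by every model.
    X = df.drop(columns=["target"])
    y = (df["target"] > 0).astype(int)    # Cleveland codes 1-4 collapse to "disease present"
    return train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)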

5-fold stratified CV plus a holdout split. CV gives a stable estimate of out-of-sample performance with confidence bounds. The holdout split keeps a portion of data the model has never seen for a final, single-shot evaluation. Reporting both keeps the reader from being misled by either alone.
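
Under the same assumptions, the evaluation harness could be as small as this: stratified 5-fold CV for the confidence-bounded estimate, then a single-shot holdout score. The df variable stands in for the loaded Cleveland table.

# Hypothetical evaluation harness using the preprocessing sketch above.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

from preprocessing import make_preprocessor, load_split    # the shared module sketched above

X_train, X_test, y_train, y_test = load_split(df)           # df: Cleveland table, loaded elsewhere

model = Pipeline([
    ("prep", make_preprocessor()),      # imputer, scaler, one-hot re-fit inside every fold
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
print(f"5-fold CV ROC-AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")

# Final, single-shot check on data the model has never seen.
model.fit(X_train, y_train)
holdout_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Holdout ROC-AUC: {holdout_auc:.3f}")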

Why interpretability mattered here

In a clinical screening tool, a model that scores 1-2 points higher on accuracy is not always the right model. Logistic Regression coefficients can be inspected one by one. A clinician can confirm that the model is weighting fluoroscopy vessel count, asymptomatic chest pain, and a flat ST slope as positive risk signals, and weighting maximum exercise heart rate as a protective factor. Black-box models that predict equally well still have to be trusted blind. For a first-pass risk stratification tool that gets followed by confirmatory testing, that trust is the differentiator.

Top predictors and clinical context

The whole point of choosing Logistic Regression was that you can read the coefficients. Here are the strongest signals the model learned, ranked by absolute coefficient magnitude on the standardized features; a short coefficient-extraction sketch follows the list.

ca (vessels on fluoroscopy)

Strongest positive predictor. Coefficient +1.15. Direct anatomical evidence of coronary disease: the more major vessels colored on fluoroscopy, the higher the predicted risk.

cp_4 (asymptomatic chest pain)

Coefficient +0.87. Counter-intuitive at first read, but consistent with the literature: asymptomatic presentations in this dataset are largely incidental findings during a workup, biasing toward true disease.

sex (male)

Coefficient +0.74. Male sex is a non-modifiable cardiovascular risk factor in every major guideline.

thal_7 (reversible perfusion defect)

Coefficient +0.68. Reversible defects on thallium imaging indicate ischemic but viable myocardium. A documented marker of coronary artery disease.

slope_2 (flat ST slope)

Coefficient +0.60. Flat or downsloping ST during exercise is one of the classic ECG signs of myocardial ischemia.

thalach (max exercise heart rate)

Strongest negative predictor. Coefficient -0.36. Higher exercise capacity is consistently associated with lower cardiovascular risk.
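
For completeness, this is the kind of snippet that could produce the ranking above, assuming the fitted Pipeline from the evaluation sketch earlier; the exact feature names depend on the one-hot encoder's output and are not guaranteed to match the labels used in this write-up.

# Hypothetical coefficient ranking on the standardized features.
import pandas as pd

feature_names = model.named_steps["prep"].get_feature_names_out()
coefs = model.named_steps["clf"].coef_.ravel()

ranking = (
    pd.DataFrame({"feature": feature_names, "coef": coefs})
      .assign(abs_coef=lambda d: d["coef"].abs())
      .sort_values("abs_coef", ascending=False)
)
print(ranking[["feature", "coef"]].head(6))   # e.g. ca, cp_4, sex, thal_7, slope_2, thalach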

Key numbers

All numbers are from a Logistic Regression model trained on the shared preprocessing pipeline, evaluated with 5-fold stratified cross-validation and a separate 20% holdout test set.

0.939 · ROC-AUC (holdout, n = 61)
0.911 · ROC-AUC (5-fold CV mean)
0.848 · Accuracy (5-fold CV mean)
0.868 · Precision (5-fold CV mean)

Built with

Python · scikit-learn · Pandas · NumPy · Matplotlib · UCI ML Repository · Jupyter · Logistic Regression

Why these choices?

scikit-learn for the model and the pipeline. Pipeline + StratifiedKFold gives you correct cross-validation behavior almost for free. Imputation and scaling re-fit inside each fold instead of leaking statistics from the test data into training. That detail is the difference between a defensible result and a misleading one.
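
To make the leakage point concrete, here is an illustration (not from the project) of the two setups side by side; X_train_numeric stands in for a numeric-only feature matrix.

# Illustration only: scaler fit outside the CV loop (leaky) vs. inside the Pipeline (clean).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Leaky: the scaler has already seen every fold's "test" rows before CV starts.
X_leaky = StandardScaler().fit_transform(X_train_numeric)
auc_leaky = cross_val_score(LogisticRegression(max_iter=1000),
                            X_leaky, y_train, cv=cv, scoring="roc_auc")

# Clean: the scaler re-fits on each fold's training rows only.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
auc_clean = cross_val_score(pipe, X_train_numeric, y_train, cv=cv, scoring="roc_auc")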

UCI dataset, not a bigger one. 303 patients is small. But the comparative goal here was methodology, not headline accuracy. UCI Cleveland is the well-known reference dataset for this task, which makes our results comparable to a large body of published work.

Coefficient inspection over feature importance from a black box. Random Forest gives feature importance numbers, but they reflect splits across an ensemble of trees, not how a single decision is made. A LogReg coefficient is the actual log-odds contribution of that feature on every prediction. That is the kind of explanation a clinician can read.

Limitations we reported

Small dataset, single source, no clinical validation

303 patients from one institution in 1988 is a tiny dataset by modern ML standards. The class distribution is mildly imbalanced (54/46) but not severely. We did not perform hyperparameter tuning, since the goal was a like-for-like comparison at default settings. The coefficients align with clinical literature, but a real clinical decision support tool would require external validation against contemporary data, calibration analysis, decision-curve analysis, and a regulatory pathway under FDA SaMD guidance. None of that was in scope for a semester project, but a recruiter should know we can articulate the gap.

What I learned

The simplest model that works is usually the right model. In our comparison Logistic Regression actually edged out Random Forest (0.911 vs ~0.87 CV ROC-AUC), and even in settings where a forest scores a hair higher, that gap is not worth the loss of interpretability when the downstream user has to defend a recommendation. This is the same argument Cynthia Rudin has been making for years: stop explaining black boxes for high-stakes decisions, and use models that are interpretable to begin with.

Shared preprocessing is the unsung half of a comparative study. The result table is only meaningful if every model saw the same train/test split, the same imputation, and the same scaler fit. Building the shared preprocessing module before anyone trained a model saved us from comparing a model that benefited from data leakage to one that did not.

Coefficients are a hypothesis-generating tool, not just a deliverable. When the model put +0.87 on asymptomatic chest pain, my first instinct was that the coding was wrong. Reading the data dictionary clarified that the asymptomatic class in this dataset captures incidental findings during workup, not low-risk patients. The model was teaching me about the data. That kind of feedback is impossible to get from a black-box predictor.

What I got wrong.

The model and the comparison were clean. The way I scoped the project had gaps that a real clinical ML role would catch.

01
No calibration analysis.

A model with strong ROC-AUC can still be poorly calibrated. ROC-AUC measures ranking, not probability accuracy. If the model says 70% disease probability, is that actually a 70% rate of disease in that bucket of patients? I did not run a reliability diagram, Brier score, or Platt scaling. For any model that would influence a clinical decision, calibration is non-optional. I would add it on a v2.
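
A v2 calibration check could be as small as the sketch below, assuming the fitted model and holdout split from the earlier snippets; with n = 61 the bins are thin, so this shows the mechanics rather than a definitive answer.

# Hypothetical v2 calibration check: reliability curve + Brier score on the holdout set.
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

proba = model.predict_proba(X_test)[:, 1]
print("Brier score:", round(brier_score_loss(y_test, proba), 3))

frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=5)   # 5 bins: holdout is only n=61
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted ~{p:.2f} -> observed disease rate {f:.2f}")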

02
No decision-curve analysis or threshold tuning for clinical use.

The default 0.5 probability cutoff is rarely the right operating point for a screening tool. For heart disease, the cost of a false negative (missed disease) is much higher than the cost of a false positive (extra confirmatory test). A decision-curve analysis would show the net benefit at different thresholds and identify where the model adds value over treat-all or treat-none. Without that, the recall-precision tradeoff in the paper is informative but not actionable.
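
A net-benefit sweep is only a few lines; here is a hedged sketch assuming the holdout probabilities from the calibration snippet above.

# Hypothetical decision-curve sketch: net benefit of the model vs. treat-all and treat-none.
import numpy as np

def net_benefit(y_true, proba, t):
    # Net benefit at threshold t: TP/N - FP/N * t/(1-t)
    y_true = np.asarray(y_true)
    pred = proba >= t
    n = len(y_true)
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    return tp / n - fp / n * t / (1 - t)

prevalence = float(np.mean(y_test))                  # "treat all" reference line
for t in np.arange(0.05, 0.65, 0.05):
    nb_model = net_benefit(y_test, proba, t)
    nb_all = prevalence - (1 - prevalence) * t / (1 - t)
    print(f"t={t:.2f}  model={nb_model:+.3f}  treat-all={nb_all:+.3f}  treat-none=+0.000")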

03
I did not stress-test the model on age and sex subgroups.

Performance averaged over the whole test set can hide subgroup failures. The model could be excellent for older male patients and useless for younger female patients, and the aggregate ROC-AUC of 0.94 would not show that. A subgroup analysis stratified by age band and sex would have caught this. In a regulated clinical decision support setting, FDA Good Machine Learning Practice principles call out subgroup performance as a baseline expectation. I would add the table on a v2.
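
The missing subgroup table is mechanically simple; a hypothetical sketch follows, assuming the holdout features and probabilities from the snippets above, with the caveat that strata this small need confidence intervals rather than point estimates.

# Hypothetical subgroup check: ROC-AUC stratified by sex and age band on the holdout set.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

holdout = X_test.copy()
holdout["y"] = np.asarray(y_test)
holdout["proba"] = proba
holdout["age_band"] = pd.cut(holdout["age"], bins=[0, 50, 60, 120],
                             labels=["<50", "50-60", ">60"])

for (sex, band), grp in holdout.groupby(["sex", "age_band"], observed=True):
    if grp["y"].nunique() < 2:      # AUC undefined when a stratum has only one class
        continue
    print(sex, band, len(grp), round(roc_auc_score(grp["y"], grp["proba"]), 3))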

Answers before the interview.

Three questions a recruiter or hiring manager would reasonably ask about this project, answered up front.

Q1
Why is this on a medtech portfolio? Isn't this a standard ML coursework problem?

Two reasons. First, classification on tabular clinical data is the bread-and-butter of clinical decision support, post-market surveillance ML, and risk stratification tools. The methodology I used here (shared preprocessing, stratified CV, holdout evaluation, calibration thinking, and subgroup awareness) is the exact methodology you bring to a regulated clinical model. Second, the AI/ML SaMD space is one of the fastest-growing regulatory areas at the FDA. Engineers who can talk fluently about both the model and the validation framework around it are scarce. This project shows I can do both.

Q2
You only owned one model in a 4-person team. What was the actual contribution?

The Logistic Regression notebook, the verified results on real UCI data (5-fold CV ROC-AUC 0.911 +/- 0.020, holdout 0.939), and the Introduction and Background section of the paper. I also pushed the team to lock in shared preprocessing, a fixed train/test split, and a common metrics format before anyone trained a model. Without that, the comparative analysis at the end of the paper would have been three different models trained on three different datasets and labeled the same. I am explicit about this on the portfolio because misattributing team work is a fast way to fail an interview reference check.

Q3
What would a production version of this look like?

External validation on a contemporary multi-institution dataset (MIMIC-IV, eICU, or institutional EHR data). Calibration analysis with reliability diagrams and Platt or isotonic scaling. Decision-curve analysis to find a clinical operating threshold. Subgroup performance tables for age, sex, and major comorbidities. A model card documenting all of the above. And a regulatory framing under the FDA's AI/ML SaMD action plan and Good Machine Learning Practice principles. The semester version answered the methodology question. The production version would answer the deployment question.

Interested in this work?

Interpretable ML on clinical data, shared preprocessing pipelines for honest comparisons, and the regulatory awareness needed to take a model from notebook to clinical use. These are skills I would bring to a clinical decision support, validation, or applications team.