BMI and ICU Recovery: A Biostatistics Study
Retrospective analysis of whether BMI category predicts ICU length of stay after cardiac surgery. MIMIC-III database, non-parametric tests, honest null result. The finding was that there is no statistically significant association, and that matters.
The question we asked
Does BMI category predict how long cardiac surgery patients stay in the ICU? Intuitively, it seems like it should. Higher BMI is associated with more surgical complications, longer ventilation times, and higher rates of post-operative infection. Multiple studies have investigated the relationship, with mixed results.
We used the MIMIC-III clinical database to pull a retrospective cohort of adult patients who underwent cardiac surgery (ICD-9 procedure codes 35.x-39.x). After applying exclusion criteria (missing BMI, ICU stays under 24 hours, patients under 18), we had 31 patients across 5 BMI categories.
The hypothesis: higher BMI = longer ICU stay
Clinical intuition and some prior literature suggest obese patients recover more slowly. If true, BMI could serve as a pre-operative risk factor for ICU resource planning.
The result: no significant association
Spearman correlation: p = 0.291. Chi-square test: p = 0.413. Kruskal-Wallis: p = 0.603. None of the three tests found a statistically significant relationship between BMI category and ICU length of stay in this cohort.
Analysis pipeline
The study followed a standard retrospective cohort design. We queried the MIMIC-III database for cardiac surgery patients, computed BMI from recorded height and weight, categorized patients using WHO BMI classes, and ran three non-parametric statistical tests.
Key decisions
Non-parametric tests, not parametric. ICU length of stay is right-skewed: most patients cluster around 2-5 days, but a few stay 20+ days. Parametric tests such as the t-test and Pearson correlation assume normality, which doesn't hold here. Spearman rank correlation, Chi-square, and Kruskal-Wallis make no normality assumption.
Three complementary tests, not one. Any single test can miss or overstate a relationship depending on the data structure. Spearman tests monotonic correlation. Chi-square tests categorical independence. Kruskal-Wallis tests whether group distributions differ. All three returning non-significant results strengthens the null finding.
WHO BMI categories, not continuous BMI. Clinicians think in terms of "underweight," "normal," "overweight," "obese." Using WHO-defined cutoffs (18.5, 25, 30, 35) makes the analysis clinically interpretable rather than just statistically convenient.
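The categorization step can be sketched with `pd.cut` using the study's WHO cutoffs (18.5, 25, 30, 35). This is an illustrative sketch, not the project's code; column and label names are assumptions.

```python
import pandas as pd

# Sketch of WHO BMI classification with the cutoffs used in the study.
# Label names are hypothetical stand-ins for the five categories.
def categorize_bmi(bmi: pd.Series) -> pd.Series:
    bins = [0, 18.5, 25, 30, 35, float("inf")]
    labels = ["underweight", "normal", "overweight", "obese I", "obese II+"]
    # right=False makes each interval [lower, upper), so a BMI of exactly
    # 25.0 falls into "overweight", matching the WHO convention.
    return pd.cut(bmi, bins=bins, labels=labels, right=False)

bmi = pd.Series([17.0, 22.4, 27.1, 31.8, 36.5])
print(categorize_bmi(bmi).tolist())
```

Using explicit bin edges rather than ad-hoc comparisons keeps the boundary behavior (what happens at exactly 25.0) in one auditable place, which matters when a rounding issue in categorization is exactly the kind of bug a cross-check can catch.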
Why the null result matters
In clinical research, null results are underreported. Publication bias favors studies that find effects. But knowing that BMI category does not independently predict ICU length of stay (at least in this cohort) is actionable information. It means pre-operative risk models shouldn't over-weight BMI for ICU resource planning without additional covariates. Our small sample size (n = 31) limits generalizability, but the consistency across all three tests suggests the effect, if it exists, is smaller than clinical intuition assumes.
Statistical foundation
Each test was chosen for a specific reason. Together they cover different aspects of the BMI-LOS relationship.
Spearman Rank Correlation
Tests for a monotonic relationship between two ordinal or continuous variables. Does not assume normality. Our result: rho = -0.196, p = 0.291.
Chi-Square Test
Tests independence between two categorical variables (BMI category vs. binned LOS). Our result: chi-squared = 9.26, p = 0.413.
Kruskal-Wallis H Test
Non-parametric alternative to one-way ANOVA. Tests whether ICU LOS distributions differ across BMI groups. Our result: H = 1.85, p = 0.603.
MIMIC-III Database
De-identified critical care data from Beth Israel Deaconess Medical Center, covering over 40,000 patients. Access requires PhysioNet credentialing and CITI training.
Key numbers
Every number here is directly from the analysis. No rounding to make things look better.
Built with
Why these choices?
Python + Pandas for data wrangling. MIMIC-III queries return raw relational data across multiple tables (admissions, chartevents, patients). Joining, filtering, and computing BMI from height/weight records requires reliable data manipulation. Pandas handled the merge logic and missing-value filtering.
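The wrangling step can be sketched with simplified stand-in tables. The real MIMIC-III schema is more involved (height and weight live in chartevents rows keyed by itemid); the table and column names below are hypothetical.

```python
import pandas as pd

# Hypothetical, simplified stand-ins for MIMIC-III tables.
patients = pd.DataFrame({"subject_id": [1, 2, 3],
                         "height_cm": [170.0, 165.0, None],
                         "weight_kg": [80.0, 95.0, 70.0]})
stays = pd.DataFrame({"subject_id": [1, 2, 3],
                      "icu_los_days": [3.2, 6.8, 0.5]})

cohort = patients.merge(stays, on="subject_id", how="inner")

# Exclusions from the study: missing BMI inputs, ICU stays under 24 hours.
cohort = cohort.dropna(subset=["height_cm", "weight_kg"])
cohort = cohort[cohort["icu_los_days"] >= 1.0]

# BMI = weight (kg) / height (m) squared.
cohort["bmi"] = cohort["weight_kg"] / (cohort["height_cm"] / 100) ** 2
print(cohort[["subject_id", "bmi", "icu_los_days"]])
```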
SciPy for statistical testing. Spearman, Chi-square, and Kruskal-Wallis are all available in scipy.stats. No need for custom implementations when the library is well-validated and widely cited in published biostatistics research.
R for verification. We cross-checked Python results against R (cor.test, chisq.test, kruskal.test) to make sure our p-values matched. They did. This dual-language check caught a rounding issue in our BMI categorization early on.
Limitations we reported
Small sample, single center, retrospective
n = 31 is underpowered. We acknowledged this in the paper. MIMIC-III is a single-center database (Beth Israel Deaconess), so the cohort may not represent other hospitals. BMI was computed from a single recorded height and weight, not longitudinal measurements. And retrospective studies can't establish causation, only association. We reported all of this because honest reporting of limitations is part of doing biostatistics correctly, not an afterthought.
What I learned
Null results are results. There's a temptation to keep slicing data until something reaches p < 0.05. We didn't do that. Three tests, three non-significant results, reported as-is. In a field plagued by p-hacking and selective reporting, publishing null findings honestly is more useful than manufacturing significance.
MIMIC-III teaches you to respect data provenance. Every value in that database has a chain of custody: which caregiver entered it, what device recorded it, when it was validated. Working with clinical data made me appreciate that a number on a spreadsheet is only as trustworthy as the process that put it there.
Small samples teach statistical humility. With n = 31, even a moderate effect might not reach significance. That doesn't mean the effect doesn't exist. It means we can't detect it with this sample. The difference between "no effect" and "insufficient power to detect an effect" is a distinction that matters in clinical decision-making, and this project forced me to articulate it clearly.
Post-Mortem
What I got wrong.
The analysis was clean and the reporting was honest. But the study design had problems I should have caught earlier.
We pulled patients matching our criteria and ended up with 31. That's what MIMIC-III gave us after exclusions. But we didn't run a power analysis first to determine the minimum sample size needed to detect a clinically meaningful effect. A post-hoc power calculation showed that detecting a moderate correlation (r = 0.3) at alpha = 0.05 with 80% power requires roughly 85 patients. We had 31. The study was underpowered from the start. We should have broadened our inclusion criteria, added surgical categories, or at minimum acknowledged the power limitation at the design stage rather than discovering it in the discussion section.
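The roughly-85 figure can be reproduced from the standard Fisher z approximation for correlation tests. This is a sketch of the arithmetic, not necessarily how the original post-hoc calculation was done:

```python
import math
from scipy.stats import norm

def n_for_correlation(r: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size to detect correlation r in a two-sided test,
    using the Fisher z transformation of r."""
    z_alpha = norm.ppf(1 - alpha / 2)      # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)               # 0.84 for 80% power
    C = 0.5 * math.log((1 + r) / (1 - r))  # Fisher z of r
    return math.ceil(((z_alpha + z_beta) / C) ** 2) + 3

print(n_for_correlation(0.3))  # → 85; the study had 31
```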
MIMIC-III records height and weight at ICU admission. We computed BMI from that single data point. But BMI at the moment of ICU entry may not reflect the patient's chronic weight status. Post-surgical fluid retention, edema, and perioperative weight changes all affect the number. A patient categorized as "obese" at ICU admission might have been "overweight" the week before surgery. We should have discussed this measurement validity issue more prominently. In retrospective database studies, you use what's available, but acknowledging the gap between what you measure and what you actually want to measure is important.
BMI doesn't exist in isolation. Age, comorbidities, surgical complexity, and pre-operative health status all affect ICU length of stay. Our analysis tested the bivariate relationship between BMI and LOS without adjusting for confounders. A multivariable regression or stratified analysis would have been more informative. We flagged this as a limitation, but the better approach would have been to include at least age and Charlson comorbidity index as covariates from the start. The simple bivariate design was appropriate for a semester project, but it limits what the null result can actually tell us.
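One minimal form the adjusted analysis could take: ordinary least squares on log-transformed LOS with age and comorbidity burden as covariates. The project did not run this model, and the data below are entirely synthetic; a real analysis would want confidence intervals and p-values (e.g. via statsmodels) rather than raw coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 31

# Hypothetical covariates standing in for the real cohort.
bmi = rng.normal(28, 5, n)
age = rng.normal(65, 10, n)
charlson = rng.integers(0, 8, n)
log_los = 1.0 + 0.02 * age + 0.1 * charlson + rng.normal(0, 0.4, n)

# Design matrix: intercept + BMI + confounders.
X = np.column_stack([np.ones(n), bmi, age, charlson])
coefs, residuals, rank, _ = np.linalg.lstsq(X, log_los, rcond=None)

# coefs[1] is the BMI effect on log-LOS after adjusting for age and
# comorbidity burden; the bivariate tests in the study can't separate that.
print(dict(zip(["intercept", "bmi", "age", "charlson"], coefs.round(3))))
```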
Questions You Might Ask
Answers before the interview.
If I were screening this portfolio, these are the three questions I'd ask. So here they are, answered.
Why include a project with a null result?
Because null results are honest results. The temptation in a portfolio is to only show projects where everything worked and the numbers were impressive. But in clinical research and medical device validation, the ability to report what the data actually shows, even when it contradicts your hypothesis, is more important than flashy results. I included this project because it demonstrates that I can design a study, execute the analysis, and report findings without inflating them. If I'm going to work in a regulated industry where data integrity matters, this is the kind of work that proves I take it seriously.
How is this relevant to medical device work?
Clinical outcome analysis is a core competency in medical device post-market surveillance and clinical studies. Understanding how to query clinical databases, apply appropriate statistical tests, handle missing data, and interpret results is directly applicable to clinical investigations under 21 CFR 812, post-market clinical follow-up studies, and real-world evidence submissions. The specific question (BMI vs. ICU LOS) is less important than the methodology: retrospective cohort design, non-parametric testing for non-normal distributions, honest reporting of limitations, and cross-validation of results between Python and R.
What would you do differently with more time?
First, I'd expand the cohort to at least 100 patients by broadening surgical categories or using MIMIC-IV (larger dataset, more recent). Second, I'd include confounders: a multivariable regression with age, Charlson comorbidity index, and surgical complexity as covariates. Third, I'd switch from WHO BMI categories to continuous BMI with restricted cubic splines to capture non-linear relationships. The current study answers a narrow question with limited data. A larger study could answer the more useful clinical question: after adjusting for known risk factors, does BMI independently predict prolonged ICU stay?
Interested in this work?
Clinical data analysis, honest statistical reporting, and the ability to work with regulated healthcare databases. These are the skills I'd bring to a validation or clinical affairs team.