Introduction

In this study, I analyze a dataset of medical statistics for a population of Pima Indian females. The numeric variables record health factors that can help indicate the likelihood of an individual having diabetes: number of pregnancies, plasma glucose level, diastolic blood pressure, skin thickness (indicative of body fat), 2-hour serum insulin, BMI, Diabetes Pedigree Function, and age. The response variable is a binary categorical variable indicating whether or not the individual tests positive for diabetes.

The objective of this study is to analyze how the predictor variables relate to one another and to the outcome, and to determine which ones most strongly predict a positive diabetes diagnosis. The dataset contains 9 variables and 767 observations.

library(tidyverse)
library(dplyr)

diab <- read.csv("https://raw.githubusercontent.com/apatni97/CompProject2/main/pima-indians-diabetes%202.csv")

# The source CSV has no header row, so read.csv treated the first data row as
# column names (X6, X148, ...); rename the columns to descriptive names.
diab <- rename(diab,
    Pregnancies = X6, Glucose = X148, BloodPressure = X72,
    SkinThickness = X35, Insulin = X0, BMI = X33.6,
    Pedi = X0.627, Age = X50, Diagnosis = X1)
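
As a quick check of the dimensions described in the introduction, the following calls could be run (they are not part of the original output; glimpse() is loaded with the tidyverse):

# Dimensions and column types after renaming: expect 767 rows and 9 columns.
dim(diab)
glimpse(diab)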

MANOVA and univariate ANOVA testing

I performed a MANOVA to test for mean differences in the numeric health variables across patient diagnoses (positive or negative). The test was significant (Pillai = 0.302, approximate F(8, 758) = 41.05, p < 2.2e-16), so we reject the null hypothesis and conclude that at least one of the health variables differs significantly between positive and negative diagnoses. I then ran 8 univariate ANOVAs and follow-up pairwise t tests to see which variables drive the difference. All of the ANOVAs were significant except the one for BloodPressure (p ≈ .073). Counting 25 tests overall, the Bonferroni-adjusted significance level is .05/25 = .002 (at which SkinThickness, p ≈ .043, is no longer significant), and the probability of at least one Type I error across the unadjusted tests is .7226.

For the MANOVA, most of the assumptions are plausibly met: the samples are random and the individuals are independent, there are relatively few extreme outliers, none of the predictors are excessively correlated with one another, and while the group variances are not exactly equal, the differences are not extreme. An informal check of the correlations and group spreads is sketched below.
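
A minimal sketch of those informal checks (not part of the original output; across() assumes dplyr >= 1.0):

# Pairwise correlations among the predictors, and per-group standard deviations
# as a rough look at homogeneity of spread across diagnoses.
diab %>% select(Pregnancies:Age) %>% cor() %>% round(2)
diab %>% group_by(Diagnosis) %>% summarize(across(Pregnancies:Age, sd))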

diab %>% group_by(Diagnosis) %>% summarize(mean(Pregnancies), 
    mean(Glucose), mean(BloodPressure), mean(SkinThickness), 
    mean(Insulin), mean(BMI), mean(Pedi), mean(Age))
## # A tibble: 2 x 9
##   Diagnosis `mean(Pregnanci… `mean(Glucose)` `mean(BloodPres… `mean(SkinThick…
##       <int>            <dbl>           <dbl>            <dbl>            <dbl>
## 1         0             3.30            110.             68.2             19.7
## 2         1             4.86            141.             70.8             22.1
## # … with 4 more variables: `mean(Insulin)` <dbl>, `mean(BMI)` <dbl>,
## #   `mean(Pedi)` <dbl>, `mean(Age)` <dbl>
diabmanova <- manova(cbind(Pregnancies, Glucose, BloodPressure, 
    SkinThickness, Insulin, BMI, Pedi, Age) ~ Diagnosis, data = diab)

summary(diabmanova)
##            Df  Pillai approx F num Df den Df    Pr(>F)    
## Diagnosis   1 0.30226   41.045      8    758 < 2.2e-16 ***
## Residuals 765                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(diabmanova)
##  Response Pregnancies :
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## Diagnosis     1  425.4  425.44  39.314 6.028e-10 ***
## Residuals   765 8278.5   10.82                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Glucose :
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## Diagnosis     1 169999  169999  212.04 < 2.2e-16 ***
## Residuals   765 613329     802                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response BloodPressure :
##              Df Sum Sq Mean Sq F value  Pr(>F)  
## Diagnosis     1   1210 1209.63   3.234 0.07252 .
## Residuals   765 286136  374.03                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response SkinThickness :
##              Df Sum Sq Mean Sq F value  Pr(>F)  
## Diagnosis     1   1047  1046.6  4.1285 0.04251 *
## Residuals   765 193925   253.5                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Insulin :
##              Df   Sum Sq Mean Sq F value   Pr(>F)    
## Diagnosis     1   177338  177338  13.562 0.000247 ***
## Residuals   765 10002951   13076                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response BMI :
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## Diagnosis     1   4084  4084.3  71.679 < 2.2e-16 ***
## Residuals   765  43590    57.0                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Pedi :
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## Diagnosis     1  2.526 2.52646  23.671 1.388e-06 ***
## Residuals   765 81.650 0.10673                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Age :
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## Diagnosis     1   5913  5913.3   45.29 3.333e-11 ***
## Residuals   765  99884   130.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pairwise.t.test(diab$Pregnancies, diab$Diagnosis, p.adj = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  diab$Pregnancies and diab$Diagnosis 
## 
##   0    
## 1 6e-10
## 
## P value adjustment method: none
pairwise.t.test(diab$Glucose, diab$Diagnosis, p.adj = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  diab$Glucose and diab$Diagnosis 
## 
##   0     
## 1 <2e-16
## 
## P value adjustment method: none
pairwise.t.test(diab$BloodPressure, diab$Diagnosis, p.adj = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  diab$BloodPressure and diab$Diagnosis 
## 
##   0    
## 1 0.073
## 
## P value adjustment method: none
pairwise.t.test(diab$SkinThickness, diab$Diagnosis, p.adj = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  diab$SkinThickness and diab$Diagnosis 
## 
##   0    
## 1 0.043
## 
## P value adjustment method: none
pairwise.t.test(diab$Insulin, diab$Diagnosis, p.adj = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  diab$Insulin and diab$Diagnosis 
## 
##   0      
## 1 0.00025
## 
## P value adjustment method: none
pairwise.t.test(diab$BMI, diab$Diagnosis, p.adj = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  diab$BMI and diab$Diagnosis 
## 
##   0     
## 1 <2e-16
## 
## P value adjustment method: none
pairwise.t.test(diab$Pedi, diab$Diagnosis, p.adj = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  diab$Pedi and diab$Diagnosis 
## 
##   0      
## 1 1.4e-06
## 
## P value adjustment method: none
pairwise.t.test(diab$Age, diab$Diagnosis, p.adj = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  diab$Age and diab$Diagnosis 
## 
##   0      
## 1 3.3e-11
## 
## P value adjustment method: none
bonferroni_ <- 0.05/25
type1rate_ <- 1 - 0.95^(25)
bonferroni_
## [1] 0.002
type1rate_
## [1] 0.7226104

Randomization Test

I performed a randomization test of the mean difference in number of pregnancies between diagnosis groups, since Diagnosis is categorical and Pregnancies is numeric. The null distribution was generated by repeatedly shuffling the Pregnancies values across the two diagnosis groups and recording the mean difference each time. The two comparisons below yield values of .17 and .0092, and together with the group means reported earlier they suggest a relationship between number of pregnancies and a positive diabetes diagnosis. A sketch comparing the null distribution directly to the observed mean difference follows the code below.

rand_dist <- vector()

for (i in 1:5000) {
    new <- data.frame(preg = sample(diab$Pregnancies), diag1 = diab$Diagnosis)
    rand_dist[i] <- mean(new[new$diag1 == "1", ]$preg) - mean(new[new$diag1 == 
        "0", ]$preg)
}

mean(rand_dist > mean(diab$Diagnosis == "1")) * 2
## [1] 0.17
mean(rand_dist > mean(diab$Diagnosis == "0")) * 2
## [1] 0.0092
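
For reference, a minimal sketch (not part of the original output) that compares the null distribution above to the observed group mean difference in Pregnancies:

# Observed mean difference in Pregnancies between positive and negative diagnoses,
# and a two-sided randomization p-value against the null distribution above.
obs_diff <- mean(diab$Pregnancies[diab$Diagnosis == 1]) -
    mean(diab$Pregnancies[diab$Diagnosis == 0])
obs_diff
mean(abs(rand_dist) > abs(obs_diff))  # two-sided p-value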

Linear Regression

I fit a linear regression to see the effects of Glucose and BloodPressure (and their interaction) on BMI, mean-centering all three numeric variables first. Based on the coefficient estimates, both Glucose (0.047) and BloodPressure (0.103) are positively associated with BMI, while the interaction term is small and negative (-0.002).

The assumptions of linearity and normality appeared reasonable for this model (a sketch of these visual checks appears at the end of this section). Heteroskedasticity was hard to judge visually, and the Breusch-Pagan test returned a p-value below .05, so the null hypothesis of homoskedasticity was rejected; I therefore recomputed the coefficient tests with robust standard errors. With robust standard errors, the p-values for Glucose, BloodPressure, and their interaction all remain below .05, so the conclusions do not change: Glucose and BloodPressure are predictive of BMI, and the interaction between them is also significant, though negative.

library(lmtest)
library(sandwich)
library(ggplot2)

diab$meanBMI <- diab$BMI - mean(diab$BMI)
diab$meanGlucose <- diab$Glucose - mean(diab$Glucose)
diab$meanBP <- diab$BloodPressure - mean(diab$BloodPressure)

regression <- lm(diab$meanBMI ~ diab$meanGlucose * diab$meanBP, 
    diab)
summary(regression)
## 
## Call:
## lm(formula = diab$meanBMI ~ diab$meanGlucose * diab$meanBP, data = diab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.922  -4.608  -0.201   4.523  30.989 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   0.187950   0.269599   0.697    0.486    
## diab$meanGlucose              0.047255   0.008436   5.601 2.97e-08 ***
## diab$meanBP                   0.102960   0.013901   7.407 3.43e-13 ***
## diab$meanGlucose:diab$meanBP -0.001992   0.000472  -4.222 2.72e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.364 on 763 degrees of freedom
## Multiple R-squared:  0.1321, Adjusted R-squared:  0.1287 
## F-statistic: 38.71 on 3 and 763 DF,  p-value: < 2.2e-16
ggplot(regression, aes(diab$meanGlucose, diab$meanBMI)) + geom_point() + 
    geom_smooth(method = "lm")

bptest(regression)
## 
##  studentized Breusch-Pagan test
## 
## data:  regression
## BP = 10.6, df = 3, p-value = 0.0141
coeftest(regression, vcov = vcovHC(regression))
## 
## t test of coefficients:
## 
##                                 Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept)                   0.18795008  0.26663214  0.7049  0.481085    
## diab$meanGlucose              0.04725467  0.00811898  5.8203 8.649e-09 ***
## diab$meanBP                   0.10296034  0.02190726  4.6998 3.089e-06 ***
## diab$meanGlucose:diab$meanBP -0.00199251  0.00063277 -3.1489  0.001703 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
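
Since the linearity and normality checks mentioned above do not appear in the output, here is a minimal sketch of what those visual diagnostics could look like (not part of the original analysis):

# Residuals vs. fitted values (linearity / spread) and a residual histogram
# (approximate normality) for the regression fit above.
resids <- resid(regression)
fitvals <- fitted(regression)
ggplot() + geom_point(aes(fitvals, resids)) +
    geom_hline(yintercept = 0, color = "red")
ggplot() + geom_histogram(aes(resids), bins = 30)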

Logistic Regression Model 1

I fit a logistic regression to see whether Glucose and BloodPressure predict a patient's diabetes diagnosis. Exponentiating the coefficients shows that each additional unit of Glucose multiplies the odds of a positive diagnosis by about 1.039, while the odds ratio for BloodPressure is essentially 1 (0.998), so BloodPressure adds little once Glucose is accounted for.

The sensitivity was .487, the specificity was .884, and the precision was .691. The AUC of the ROC plot (built from the raw sum Glucose + BloodPressure) was .7526, which is adequate but not strong.

fit1 <- glm(Diagnosis ~ Glucose + BloodPressure, data = diab, 
    family = binomial(link = "logit"))

exp(coef(fit1))
##   (Intercept)       Glucose BloodPressure 
##   0.005227388   1.038679930   0.998432808
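
To illustrate how these coefficients translate into a predicted probability, here is a hedged example using hypothetical values (Glucose = 140, BloodPressure = 70; these numbers are not from the original analysis):

# Predicted probability of a positive diagnosis for a hypothetical patient.
predict(fit1, newdata = data.frame(Glucose = 140, BloodPressure = 70),
    type = "response")
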
library(ggplot2)
library(plotROC)

ROCplot1 <- ggplot(diab) + geom_roc(aes(d = Diagnosis, m = Glucose + 
    BloodPressure), n.cuts = 0) + geom_segment(aes(x = 0, xend = 1, 
    y = 0, yend = 1), lty = 2)
ROCplot1

calc_auc(ROCplot1)
##   PANEL group      AUC
## 1     1    -1 0.752618
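
For comparison, a sketch (not in the original output) of an ROC curve built from the model's fitted probabilities rather than the raw sum Glucose + BloodPressure:

# ROC/AUC using the fitted probabilities from fit1.
diab_prob <- diab %>% mutate(prob1 = predict(fit1, type = "response"))
ROCplot1b <- ggplot(diab_prob) + geom_roc(aes(d = Diagnosis, m = prob1), n.cuts = 0) +
    geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1), lty = 2)
calc_auc(ROCplot1b)
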
tdat <- diab %>% mutate(prob = predict(fit1, type = "response"), 
    prediction = ifelse(prob > 0.5, 1, 0))
classify <- tdat %>% transmute(prob, prediction, truth = Diagnosis)
classify
##          prob prediction truth
## 1  0.10606722          0     0
## 2  0.83072004          1     1
## 3  0.12134494          0     0
## 4  0.47069688          0     1
## 5  0.27535193          0     0
## 6  0.08532377          0     1
## 7  0.29120817          0     0
## 8  0.89212536          1     1
## 9  0.34060801          0     1
## 10 0.22730857          0     0
## 11 0.73219653          1     1
## 12 0.47397788          0     0
## 13 0.86113108          1     1
## 14 0.71768984          1     1
## 15 0.18865359          0     1
## 16 0.28752826          0     1
## 17 0.21262216          0     1
## 18 0.19909130          0     0
## 19 0.26907651          0     1
## 20 0.35203914          0     0
## 21 0.16403860          0     0
## 22 0.88527068          1     1
## 23 0.29667229          0     1
## 24 0.50641512          1     1
## 25 0.34982536          0     1
## 26 0.55124343          1     1
## 27 0.15760591          0     0
## 28 0.53005975          1     0
## 29 0.27729543          0     0
## 30 0.22533216          0     0
## 31 0.65093522          1     1
## 32 0.11866239          0     0
## 33 0.12935404          0     0
##  [ reached 'max' / getOption("max.print") -- omitted 734 rows ]
table(prediction = classify$prediction, truth = classify$truth) %>% 
    addmargins()
##           truth
## prediction   0   1 Sum
##        0   442 137 579
##        1    58 130 188
##        Sum 500 267 767
TPR <- 130/267
TNR <- 442/500
PPV <- 130/188
TPR
## [1] 0.4868914
TNR
## [1] 0.884
PPV
## [1] 0.6914894

Logistic Regression Model Continued

I fit a logistic regression predicting diabetes diagnosis from all of the remaining variables in the dataset (including the mean-centered copies created earlier). The in-sample AUC was fairly good at around .84, with sensitivity .58, specificity .89, and precision .74.
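
The class_diag() function used below is not defined in this document (it is presumably loaded in a setup chunk). A minimal sketch of a function with the same interface, assuming it reports accuracy, sensitivity, specificity, precision, and a rank-based AUC:

# Sketch of a class_diag-style helper (assumed interface: class_diag(probs, truth)).
class_diag_sketch <- function(probs, truth) {
    pred <- as.numeric(probs > 0.5)
    tab <- table(factor(pred, levels = c(0, 1)), factor(truth, levels = c(0, 1)))
    acc <- sum(diag(tab))/sum(tab)
    sens <- tab[2, 2]/sum(tab[, 2])  # true positive rate
    spec <- tab[1, 1]/sum(tab[, 1])  # true negative rate
    ppv <- tab[2, 2]/sum(tab[2, ])   # precision
    # AUC via the Mann-Whitney rank formulation
    n_pos <- sum(truth == 1)
    n_neg <- sum(truth == 0)
    auc <- (sum(rank(probs)[truth == 1]) - n_pos * (n_pos + 1)/2)/(n_pos * n_neg)
    data.frame(acc, sens, spec, ppv, auc)
}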

fit2 <- glm(Diagnosis ~ ., data = diab, family = "binomial")
prob <- predict(fit2, type = "response")

class_diag(prob, diab$Diagnosis)
##         acc      sens spec       ppv       auc
## 1 0.7822686 0.5805243 0.89 0.7380952 0.8390637
table(predict = as.numeric(prob > 0.5), truth = diab$Diagnosis) %>% 
    addmargins
##        truth
## predict   0   1 Sum
##     0   445 112 557
##     1    55 155 210
##     Sum 500 267 767

Subsequently, I performed a 10-fold cross-validation to obtain out-of-sample classification diagnostics. The AUC was slightly lower than the in-sample value, at about .83, and sensitivity, specificity, and precision were each slightly lower than the in-sample metrics as well.

set.seed(1234)
k = 10

data <- diab[sample(nrow(diab)), ]
folds <- cut(seq(1:nrow(diab)), breaks = k, labels = F)

diags <- NULL
for (i in 1:k) {
    
    train <- data[folds != i, ]
    test <- data[folds == i, ]
    truth <- test$Diagnosis
    
    fit3 <- glm(Diagnosis ~ ., data = train, family = "binomial")
    prob3 <- predict(fit3, newdata = test, type = "response")
    
    diags <- rbind(diags, class_diag(prob3, truth))
    
}

summarize_all(diags, mean)
##         acc      sens      spec       ppv       auc
## 1 0.7718387 0.5649954 0.8829649 0.7193941 0.8270645

After running a LASSO (using lambda.1se), the model retained Pregnancies, Glucose, BMI, Pedi (the pedigree function), and Age, along with a negligible coefficient on the mean-centered BMI copy; BloodPressure, SkinThickness, and Insulin were dropped.

library(glmnet)
set.seed(1234)

y <- as.matrix(diab$Diagnosis)
x <- model.matrix(Diagnosis ~ ., data = diab)[, -1]
x <- scale(x)

cv <- cv.glmnet(x, y, family = "binomial")
lasso <- glmnet(x, y, family = "binomial", lambda = cv$lambda.1se)
coef(lasso)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                          s0
## (Intercept)   -7.669011e-01
## Pregnancies    2.415250e-01
## Glucose        8.500326e-01
## BloodPressure  .           
## SkinThickness  .           
## Insulin        .           
## BMI            3.879291e-01
## Pedi           1.034209e-01
## Age            5.243686e-02
## meanBMI        6.470962e-05
## meanGlucose    .           
## meanBP         .
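
A small sketch (not in the original output) for pulling the retained predictors programmatically instead of reading them off the coefficient matrix:

# Names of predictors with nonzero LASSO coefficients.
cm <- as.matrix(coef(lasso))
setdiff(rownames(cm)[cm[, 1] != 0], "(Intercept)")
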
lasso_dat <- diab %>% dplyr::select(Pregnancies, Glucose, BMI:Age, 
    Diagnosis)

data1 <- lasso_dat %>% sample_frac
folds <- ntile(1:nrow(data1), n = 10)

diags1 <- NULL
for (i in 1:k) {
    train <- data1[folds != i, ]
    test <- data1[folds == i, ]
    truth <- test$Diagnosis
    
    fit4 <- glm(Diagnosis ~ ., data = train, family = "binomial")
    prob4 <- predict(fit4, newdata = test, type = "response")
    
    diags1 <- rbind(diags1, class_diag(prob4, truth))
}

diags1 %>% summarize_all(mean)
##         acc      sens    spec      ppv       auc
## 1 0.7654135 0.5619553 0.88422 0.715493 0.8371973
table(predict = as.numeric(prob > 0.5), truth = lasso_dat$Diagnosis) %>% 
    addmargins
##        truth
## predict   0   1 Sum
##     0   445 112 557
##     1    55 155 210
##     Sum 500 267 767

After performing a 10-fold CV on the variables that the LASSO retained, I found an AUC of around .83, which is close to the values from the logistic regressions above.

set.seed(1234)
k = 10

data <- lasso_dat[sample(nrow(lasso_dat)), ]
folds <- cut(seq(1:nrow(lasso_dat)), breaks = k, labels = F)

diags3 <- NULL
for (i in 1:k) {
    
    train <- data[folds != i, ]
    test <- data[folds == i, ]
    truth <- test$Diagnosis
    
    fit5 <- glm(Diagnosis ~ ., data = train, family = "binomial")
    prob5 <- predict(fit5, newdata = test, type = "response")
    
    diags3 <- rbind(diags3, class_diag(prob5, truth))
    
}

summarize_all(diags3, mean)
##         acc      sens      spec       ppv       auc
## 1 0.7725098 0.5675042 0.8826954 0.7201104 0.8277691