(Refer back to the Essential Statistics lesson).
- Descriptive statistics
- Hypothesis testing
- Normality assumptions
- Cross tabulation
- Logistic regression
- Interpreting model summaries
The datasets we’ll be using for this assignment are both curated and hosted by the Vanderbilt Department of Biostatistics.
Dobutamine is a drug that is used during echocardiograms (aka “stress tests”), which are clinical tests used to estimate heart function. The treatment causes heart rate to increase, and its effects at different dosages were measured in a study published in 1999. We’ll be using the data behind this paper to answer the questions that follow.
# Load libraries
library(readr)
library(dplyr)
# Read data
stress <- read_csv("data/stressEcho.csv")
# Take a look
stress
## # A tibble: 558 × 32
## id bhr basebp basedp pkhr sbp dp dose maxhr pctMphr mbp dpmaxdo
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 92 103 9476 114 86 9804 40 100 74 121 12100
## 2 2 62 139 8618 120 158 18960 40 120 82 158 18960
## 3 3 62 139 8618 120 157 18840 40 120 82 157 18840
## 4 4 93 118 10974 118 105 12390 30 118 72 105 12390
## 5 5 89 103 9167 129 173 22317 40 129 69 176 22704
## 6 6 58 100 5800 123 140 17220 40 123 83 140 17220
## 7 7 63 120 7560 98 130 12740 40 98 71 130 12740
## 8 8 86 161 13846 144 157 22608 40 144 111 157 22608
## 9 9 69 143 9867 115 118 13570 40 113 81 151 17063
## 10 10 76 105 7980 126 125 15750 40 126 94 125 15750
## # … with 548 more rows, and 20 more variables: dobdose <dbl>, age <dbl>,
## # gender <chr>, baseEF <dbl>, dobEF <dbl>, chestpain <dbl>, restwma <dbl>,
## # posSE <dbl>, newMI <dbl>, newPTCA <dbl>, newCABG <dbl>, death <dbl>,
## # hxofHT <dbl>, hxofDM <dbl>, hxofCig <chr>, hxofMI <dbl>, hxofPTCA <dbl>,
## # hxofCABG <dbl>, any.event <dbl>, ecg <chr>
Note that in addition measuring dobutamine dosages during each stress test, the authors collected information on other variables including: resting heart rate, max heart rate, blood pressure, age and (most importantly) whether or not the patient experienced any cardiac event in the 12 months that followed the test.
Before answering the questions, make sure to review the data dictionary:
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/stressEcho.html
Note that the data were originally coded by the authors of the study but were then partially recoded when curated. Pay particular attention to the any.event
variable (did the subject experience any cardiac event over 12 months following the stress test), which should be interpeted as 0=NO, 1=YES. The top of the data dicionary makes this clear that all event variables are recoded such that 0=no and 1=yes, which is more conventional.
dpmaxdo
)?max(stress$dpmaxdo)
## [1] 45114
HINT: The quantile()
function defaults to 0, 0.25, .5, .75 and 1 but can accept arbitrary threshholds. See ?quantile
and look for the probs
argument.
quantile(stress$dpmaxdo, probs = 0.99)
## 99%
## 31949
dpmaxdo
values.library(ggplot2)
ggplot(stress, aes(dpmaxdo)) +
geom_histogram()
any.event
). Assume equal variances between these groups.t.test(dpmaxdo ~ any.event, data = stress, var.equal = TRUE)
##
## Two Sample t-test
##
## data: dpmaxdo by any.event
## t = 2, df = 556, p-value = 0.04
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 60.1 2279.9
## sample estimates:
## mean in group 0 mean in group 1
## 18736 17566
$
operator) rather than simply re-typing the value. Feel free to extract the p-value directly from the t-test, or first use broom::tidy() to tidy the model first.t.test(dpmaxdo ~ any.event, data = stress, var.equal = TRUE)$p.value
## [1] 0.0389
hxofCig
) is represented categorically as “heavy”, “moderate” and “non-smoker”. Create a margin table showing the total counts of individuals in each smoking history category, for all individuals who either did or did not have any cardiac event by smoking status. Next, show proportions over the row margin (what percentage of each category had any cardiac event?).xt <- xtabs(~ hxofCig + any.event, data = stress)
addmargins(xt)
## any.event
## hxofCig 0 1 Sum
## heavy 98 24 122
## moderate 115 23 138
## non-smoker 256 42 298
## Sum 469 89 558
round(prop.table(xt, margin=1), 3)
## any.event
## hxofCig 0 1
## heavy 0.803 0.197
## moderate 0.833 0.167
## non-smoker 0.859 0.141
mosaicplot(xt)
chisq.test(xt)
##
## Pearson's Chi-squared test
##
## data: xt
## X-squared = 2, df = 2, p-value = 0.4
# If you don't have broom installed:
# install.packages("broom")
library(broom)
tidy(chisq.test(xt))
## # A tibble: 1 × 4
## statistic p.value parameter method
## <dbl> <dbl> <int> <chr>
## 1 2.08 0.354 2 Pearson's Chi-squared test
The questions that follow are based on a data collected to examine several blood serum markers believed to be associated with genetics for a specific kind of muscular dystrophy (DMD). The data were analyzed and results reported in a 1985 paper. In particular, the authors were interested in whether a woman’s DMD carrier status (carrier
) was related to the blood serum markers creatine kinase (ck
), hemopexin (h
), pyruvate kinase (pk
) and lactate dehydrogenase (ld
).
Use the following to read and store the data:
dmd <- read_csv("data/dmd.csv")
For more information on the data set see:
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/dmd.html
mean(dmd$ld, na.rm = TRUE)
## [1] 199
HINT: The plot below uses gather()
from tidyr to transform the data so all histograms can be rendered in a single “facet wrapped” plot. Feel free to give this a shot or create separate histograms for each variable. Either method is acceptable.
library(ggplot2)
library(tidyr)
library(dplyr)
dmd %>%
gather(marker, value,ck:ld) %>%
ggplot(aes(value)) +
geom_histogram(bins = 25) +
facet_wrap(~marker, scales = "free")
ck
seems to require the most attention. Try using a log transformation on that column and create another histogram.dmd %>%
mutate(logck = log(ck)) %>%
ggplot(aes(logck)) +
geom_histogram(bins = 25)
ck
, and to use summary()
on the model object to view the coefficients.fit <- glm(carrier ~ log(ck) + h + pk + ld, data = dmd, family= "binomial")
summary(fit)
##
## Call:
## glm(formula = carrier ~ log(ck) + h + pk + ld, family = "binomial",
## data = dmd)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3512 -0.4151 -0.1539 0.0829 2.4006
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -25.95891 4.28709 -6.06 1.4e-09
## log(ck) 2.86251 0.65632 4.36 1.3e-05
## h 0.11280 0.02825 3.99 6.5e-05
## pk 0.10321 0.04378 2.36 0.018
## ld 0.01284 0.00526 2.44 0.015
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 250.077 on 193 degrees of freedom
## Residual deviance: 99.005 on 189 degrees of freedom
## (15 observations deleted due to missingness)
## AIC: 109
##
## Number of Fisher Scoring iterations: 7
round(exp(fit$coefficients), 4)
## (Intercept) log(ck) h pk ld
## 0.00 17.51 1.12 1.11 1.01