This notebook summarizes the most important steps in inference from small samples, where the t-test is frequently applied. It demonstrates how to run hypothesis tests about sample means and which factors influence their outcome. We will work in the domain of blood pressure-lowering medications and ask how well each drug meets our expectations.
Conventional blood pressure treatment results in stable pressure values represented by the sample of 10 different patients shown below. Decide whether the treatment is adequate, given that the goal is to achieve a mean pressure of no more than 95.
cgroup <- c(90,95,67,120,89,92,100,82,79,85)
max_target_mean <- 95
# mean blood pressure is smaller than the desired value
paste("Mean blood pressure and standard deviation in the conventional treatment sample:", mean(cgroup), "and", round(sd(cgroup),2))
## [1] "Mean blood pressure and standard deviation in the conventional treatment sample: 89.9 and 14.02"
The mean blood pressure \(\bar{x}_c\) calculated from the sample is lower than the desired value. However, since it is a random variable, we need to determine how often it will fall below 95 when sampling repeatedly. We can proceed in two ways: to employ hypothesis testing or calculate confidence intervals. The key parameters will be the standard deviation of the sample \(s_c\) and the sample size \(n_c\).
Before we do so, there are a couple of basic observations and assumptions: 1) we deal with a small sample (the most common small/large sample size threshold is 30), 2) we may assume that blood pressure is normally distributed (a commonly known medical fact), 3) the measurements are independent (see the design of the study mentioned above), 4) the blood pressure variance in our population is unknown (and will have to be estimated from the sample).
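Assumption 2) can also be probed empirically. A minimal sketch, using the Shapiro-Wilk normality test on the conventional sample (a formal test has limited power at n = 10, so treat it as a rough check alongside a visual inspection):

```r
# conventional treatment sample as defined above
cgroup <- c(90,95,67,120,89,92,100,82,79,85)
# Shapiro-Wilk test: H0 = the sample comes from a normal distribution
sw <- shapiro.test(cgroup)
# a large p-value gives no evidence against normality
round(sw$p.value, 3)
```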
Consequently, we will approximate the distribution of sample means with the t-distribution. We will plot it, run the hypothesis test and compute the 95% confidence interval for the sample mean.
c_mean <- mean(cgroup)
c_sd <- sd(cgroup)
# call the one-sided t-test
t.test(x=cgroup,alternative="less",mu=max_target_mean)
##
## One Sample t-test
##
## data: cgroup
## t = -1.1504, df = 9, p-value = 0.1398
## alternative hypothesis: true mean is less than 95
## 95 percent confidence interval:
## -Inf 98.0268
## sample estimates:
## mean of x
## 89.9
# Let us reach the same outcomes slower:
m_sd <- c_sd/sqrt(length(cgroup))
t.value <- (c_mean-max_target_mean)/m_sd
# use the distribution function of t-distribution with 9 degrees of freedom to obtain p-value
paste("P-value of one-sided t-test:", round(pt(t.value,df=9,lower.tail=T),2))
## [1] "P-value of one-sided t-test: 0.14"
We performed the one-sided t-test with \(H_0\): “the population mean is 95, i.e., \(\mu_c=95\)” and \(H_a\): “the population mean is smaller than 95, i.e., \(\mu_c<95\)”. We employed the formula: \[t_c=\frac{\bar{x}_c-\mu}{s_c/\sqrt{n_c}}=\frac{89.9-95}{14.02/\sqrt{10}}=-1.15\]
where \(\bar{x}_c\) is the sample mean, \(\mu\) is the (desired) population mean, \(s_c\) is the sample standard deviation and \(n_c\) is the sample size. In terms of hypothesis testing, the value of t-statistic is compared with the quantiles of standardized t-distribution with \(n_c-1\) degrees of freedom.
The p-value is 0.14. If we choose the common significance level \(\alpha=0.05\), we see that \(p>\alpha\): the p-value of 0.14 means there is a 14% probability of observing a sample like ours, or one even more extreme, by chance if the null hypothesis holds. Consequently, we cannot reject the null hypothesis in favor of the alternative, and we have failed to confirm the initial goal statistically.
Let us explain this failure to reject with two plots. Both plots assume that \(H_0\) holds. The first one constructs a t-distribution with \(n_c-1=9\) degrees of freedom, centered at 95 and scaled by \(s_c/\sqrt{n_c}\). The probability of observing a sample mean \(\bar{x}_c=89.9\) or smaller from a population where \(H_0\) holds is relatively high. It corresponds to the area under the left tail of the distribution and matches the p-value obtained from the one-sided t-test, i.e., 0.14.
The second plot shows the cumulative distribution function of the same distribution, which maps values of the random variable to their cumulative probabilities. We observed a t-statistic of -1.15, which maps to a probability of 0.14. To reject the null hypothesis at \(\alpha=0.05\), we would need a t-statistic of -1.83 or lower.
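The rejection threshold of -1.83 mentioned above comes directly from the quantile function of the t-distribution:

```r
# critical value of the lower-tailed test at alpha = 0.05 with n - 1 = 9 degrees of freedom
alpha <- 0.05
t_crit <- qt(alpha, df = 9)
round(t_crit, 2)
## [1] -1.83
```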
# the upper bound of the confidence interval
paste("The upper bound of the 95%-confidence interval of the conventional treatment sample mean:", c_mean+qt(0.95,df=9)*m_sd)
## [1] "The upper bound of the 95%-confidence interval of the conventional treatment sample mean: 98.0268006681093"
The one-tailed 95% confidence interval is: \[(-\infty,89.9+t_{1-\alpha,n_c-1}\times s_c/\sqrt{n_c}]=(-\infty,89.9+t_{0.95,9}\times 14.02/\sqrt{10}]=(-\infty,98.03]\] The true population mean will be captured by intervals constructed this way in 95% of sampling trials. The observation that the 95% confidence interval contains the blood pressure threshold of 95 leads us to the same conclusion as hypothesis testing: the sample does not provide sufficient evidence to demonstrate the desired effect of the drug. In practice, a larger sample would likely be taken at this stage.
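The remark about taking a larger sample can be made concrete with a power calculation. A sketch, under the assumption that the true mean is 90 (so the effect we want to detect is 5 points below the threshold) and the standard deviation matches the sample value of 14.02:

```r
# sample size needed to detect a true mean 5 below the threshold with 80% power
pwr <- power.t.test(delta = 5, sd = 14.02, sig.level = 0.05, power = 0.8,
                    type = "one.sample", alternative = "one.sided")
ceiling(pwr$n)
```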
Meanwhile, suppose a new blood pressure drug has emerged. Physicians have identified a new treatment group of 10 people who received the new drug, which is expected to lower blood pressure more effectively than the conventional one. After a couple of months of treatment (following the same procedure as in the conventional group), we need to decide whether the new drug meets the original goal and, additionally, whether it works better than the conventional treatment.
ngroup <- c(71,79,69,98,91,85,89,75,78,80)
paste("Mean blood pressure and standard deviation in the new treatment sample:", mean(ngroup), "and", round(sd(ngroup),2))
## [1] "Mean blood pressure and standard deviation in the new treatment sample: 81.5 and 9.19"
The new treatment looks promising, but we need to test this formally. The sample mean \(\bar{x}_n\) is 81.5, which is less than \(\bar{x}_c\). The histogram above checks whether the sample distributions agree with our expectations for both treatments. It largely confirms our assumptions (no outliers, no multiple modes, etc.), so we do not need to resort to non-parametric tests. We will proceed with t-distributions as before.
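The histogram itself is not reproduced in this text; a sketch of how such a check could be plotted (colors and breaks are illustrative choices):

```r
cgroup <- c(90,95,67,120,89,92,100,82,79,85)
ngroup <- c(71,79,69,98,91,85,89,75,78,80)
breaks <- seq(60, 125, by = 5)
# overlay both samples with semi-transparent colors
hist(cgroup, breaks = breaks, col = rgb(0, 0, 1, 0.4),
     main = "Blood pressure by treatment", xlab = "blood pressure")
hist(ngroup, breaks = breaks, col = rgb(1, 0, 0, 0.4), add = TRUE)
legend("topright", legend = c("conventional", "new"),
       fill = c(rgb(0, 0, 1, 0.4), rgb(1, 0, 0, 0.4)))
```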
When plotting the t-distributions for both sample means, we can see that the new treatment has a good chance of outperforming the conventional one; however, since the densities of the sample means overlap, the relationship between the treatments could still be the opposite.
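A sketch of that overlap plot (an illustrative reconstruction using shifted and scaled t-densities of the two sample means, not the notebook's original plotting code):

```r
cgroup <- c(90,95,67,120,89,92,100,82,79,85)
ngroup <- c(71,79,69,98,91,85,89,75,78,80)
n <- 10
# density of the sample mean: a t-density centered at the sample mean, scaled by s/sqrt(n)
mean_density <- function(x, m, s) dt((x - m)/(s/sqrt(n)), df = n - 1)/(s/sqrt(n))
x <- seq(65, 110, length.out = 300)
plot(x, mean_density(x, mean(cgroup), sd(cgroup)), type = "l",
     xlab = "blood pressure", ylab = "density of the sample mean")
lines(x, mean_density(x, mean(ngroup), sd(ngroup)), lty = 2)
legend("topright", legend = c("conventional", "new"), lty = c(1, 2))
```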
t.test(x=ngroup,alternative="less",mu=max_target_mean)
##
## One Sample t-test
##
## data: ngroup
## t = -4.6441, df = 9, p-value = 0.0006061
## alternative hypothesis: true mean is less than 95
## 95 percent confidence interval:
## -Inf 86.82865
## sample estimates:
## mean of x
## 81.5
Before comparing the two treatments, we check whether the new drug fulfills the original goal of reaching a mean blood pressure below 95. We work with \(H_0\): “the population mean is 95”, and \(H_a\): “the population mean is smaller than 95”. The p-value of 0.0006 indicates only a very small probability that a sample like ours would appear by chance if \(H_0\) held, so we reject the null hypothesis in favor of the alternative.
We will now use a two-sample t-test to compare the treatments. We additionally assume that the blood pressure variance is equal in both treatment groups (even though \(s_c\) and \(s_n\) do not match exactly). This assumption determines the type of t-test: we will use the pooled (equal-variance) t-test.
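The equal-variance assumption can itself be checked before running the pooled test. A sketch using the F-test of equal variances (itself sensitive to non-normality, so only a rough check):

```r
cgroup <- c(90,95,67,120,89,92,100,82,79,85)
ngroup <- c(71,79,69,98,91,85,89,75,78,80)
# H0: the two population variances are equal
vt <- var.test(cgroup, ngroup)
round(vt$p.value, 2)  # a large p-value gives no evidence against equal variances
```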
t.test(x=cgroup,y=ngroup,var.equal=T,alternative="greater")
##
## Two Sample t-test
##
## data: cgroup and ngroup
## t = 1.5845, df = 18, p-value = 0.06525
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -0.7928998 Inf
## sample estimates:
## mean of x mean of y
## 89.9 81.5
We work with \(H_0\): “the population means are equal, i.e., \(\mu_n=\mu_c\)”, and \(H_a\): “the new drug population mean is smaller than the conventional drug population mean, i.e., \(\mu_n<\mu_c\)”. The p-value of 0.065 indicates that there is around a 7% probability that samples like ours would appear by chance if \(H_0\) held, so we cannot reject the null hypothesis in favor of the alternative at the significance level of 0.05.
Technically, the test formula is: \[df=n_c+n_n-2=18\;\;\; s_p^2=\frac{(n_c-1)s_c^2+(n_n-1)s_n^2}{df}=\frac{9\times 14.02^2+9\times 9.19^2}{18}=140.5\] \[t=\frac{\bar{x}_c-\bar{x}_n}{\sqrt{s_p^2(\frac{1}{n_c}+\frac{1}{n_n})}}=\frac{89.9-81.5}{\sqrt{\frac{140.5\times 2}{10}}}=1.58\] where \(df\) is the number of degrees of freedom in our test and \(s_p\) is the pooled sample standard deviation. The t-statistic is 1.58 and the p-value is around 7%, larger than the selected significance level \(\alpha\). We would need a t-statistic of at least 1.73 to reject the null hypothesis.
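The same numbers can be recomputed step by step in R:

```r
cgroup <- c(90,95,67,120,89,92,100,82,79,85)
ngroup <- c(71,79,69,98,91,85,89,75,78,80)
nc <- length(cgroup); nn <- length(ngroup)
df <- nc + nn - 2
# pooled variance
sp2 <- ((nc - 1)*var(cgroup) + (nn - 1)*var(ngroup))/df
# t-statistic of the pooled two-sample test
t_stat <- (mean(cgroup) - mean(ngroup))/sqrt(sp2*(1/nc + 1/nn))
# one-sided p-value (alternative: conventional mean is greater)
p_val <- pt(t_stat, df = df, lower.tail = FALSE)
round(c(t = t_stat, p = p_val), 4)
```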
Further questions and tasks:
Now let us assume that both blood pressure populations are known. In particular, they are normally distributed, the mean blood pressure in both the conventional group and the new drug group is 90, and both populations share a standard deviation of 12. In other words, \(\mu_c=\mu_n=90\), \(\sigma_c=\sigma_n=12\). Let us generate a large number of small samples from both populations and see how well the t-test works.
mu_c <- 90
mu_n <- 90
sigma_c <- 12
sigma_n <- 12
## sample size and the number of repeats
sample_size <- 10
reps <- 10000
## generate the samples and run the tests
# is mean in cgroup <95?
tcs <- c()
pcs <- c()
# is mean in ngroup <95?
tns <- c()
pns <- c()
# compare cgroup and ngroup
tcns <- c()
pcns <- c()
for (rep in seq(reps)){
## generate two samples with the random normal generator
cgroup <- rnorm(sample_size,mean=mu_c,sd=sigma_c)
ngroup <- rnorm(sample_size,mean=mu_n,sd=sigma_n)
## compare groups with 95
## t-test statistic
tcs[rep] <- (mean(cgroup)-max_target_mean)/(sd(cgroup)/sqrt(sample_size))
tns[rep] <- (mean(ngroup)-max_target_mean)/(sd(ngroup)/sqrt(sample_size))
## do the same with built in t-test (remember only p-values)
pcs[rep] <- t.test(x=cgroup,alternative="less",mu=max_target_mean)$p.value
pns[rep] <- t.test(x=ngroup,alternative="less",mu=max_target_mean)$p.value
## compare cgroup and ngroup
## t-test statistic
tcns[rep] <- (mean(cgroup)-mean(ngroup))/(sqrt((sd(cgroup)^2+sd(ngroup)^2)/2)*sqrt(2/sample_size))
## do the same with built in t-test (remember only p-values), run two-tailed test this time
pcns[rep] <- t.test(cgroup,ngroup,alternative = "two.sided",var.equal = TRUE,conf.level = 0.95)$p.value
}
Hypothesis tests now become purely confirmatory, since the truth is known. In the code above, we ran two types of t-tests. The first one verifies whether the treatments achieve a mean blood pressure of no more than 95; we run the same test independently for both treatments. Since we know that \(\mu_c=\mu_n=90\), which is clearly less than 95, we also know that the null hypotheses \(\mu_c=95\) and \(\mu_n=95\) should definitely be rejected in favor of their alternatives \(\mu_c<95\) and \(\mu_n<95\). Whenever we failed to reject the null hypothesis, we made a Type II error. The power of the t-test is the probability that we correctly reject the null hypothesis.
## how often will we correctly reject the null hypothesis?
## the power of test
## can be reached in two ways, to compare the p-values with alpha or the t-statistics with the corresponding t-distribution quantile
thres_05 <- qt(0.05,df=sample_size-1)
## the conventional group
## the same outcome can be reached with: mean(pcs<0.05)
paste("The experimental power of t-test in the conventional group:", mean(tcs<thres_05))
## [1] "The experimental power of t-test in the conventional group: 0.3247"
## the new drug group
## the same outcome can be reached with: mean(pns<0.05)
paste("The experimental power of t-test in the new group:", mean(tns<thres_05))
## [1] "The experimental power of t-test in the new group: 0.3281"
## t-statistic that is expected under ideal conditions when the sample mean and variance agree with the population mean and variance
t_expected <- (mu_c-max_target_mean)/(sigma_c/sqrt(sample_size))
## the power of the test approximated from theory
## shift the distribution function to the left by |t_expected| (the exact power uses the noncentral t-distribution)
paste("The theoretical power of t-test in each group:", pt(thres_05-t_expected,df=sample_size-1))
## [1] "The theoretical power of t-test in each group: 0.309314170128656"
Under the given experimental setting, the power of the tests for both treatments is only a bit more than 30%. The populations are identical, so the small difference between the two empirical outcomes is due to randomness. As a rule of thumb, statistical experiments should be designed to have a power of around 80%. We will return to this issue later.
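The 0.309 figure above comes from shifting the central t-distribution and is only an approximation; the exact power of a one-sample t-test follows the noncentral t-distribution, which `power.t.test` uses internally. A sketch that also derives the sample size needed for the 80% rule of thumb:

```r
mu <- 90; sigma <- 12; target <- 95
# exact power of the one-sided one-sample t-test at n = 10
pw <- power.t.test(n = 10, delta = target - mu, sd = sigma, sig.level = 0.05,
                   type = "one.sample", alternative = "one.sided")
round(pw$power, 3)
# sample size required for roughly 80% power with the same effect size
ns <- power.t.test(delta = target - mu, sd = sigma, sig.level = 0.05, power = 0.8,
                   type = "one.sample", alternative = "one.sided")
ceiling(ns$n)
```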
This observation can be reinforced with the probability density plot. The power of the test corresponds to the area under the left tail. You can compare the experimental areas reached for the conventional and new treatment samples with the area under the shifted t-distribution with 9 degrees of freedom shown as the dotted line. The dashed vertical line shows the threshold that leads to rejection of the null hypothesis, \(t_{0.05,9}=-1.83\).
The second type of t-test verifies whether the treatments match in their mean blood pressures. Since we know that \(\mu_c=\mu_n=90\), we also know that the null hypothesis \(\mu_c=\mu_n\) should not be rejected against its alternative \(\mu_c\neq\mu_n\). Whenever we rejected the null hypothesis, we made a Type I error.
## how often will we incorrectly reject the null hypothesis about the identity of population means?
## Type I error, false positive decisions
thres_025 <- qt(0.975,df=2*sample_size-2)
## the same outcome can be reached with: mean(pcns<0.05)
paste("The probability of Type I error of the t-test that compares both the groups:", mean(abs(tcns)>thres_025))
## [1] "The probability of Type I error of the t-test that compares both the groups: 0.0488"
The Type I error rate should match the selected significance level \(\alpha=0.05\). As expected, the null hypothesis was incorrectly rejected in around 5% of the tests.
This observation can be reinforced with the probability density plot. The bold line shows the experimental density of t-statistics generated from 10,000 sample pairs. The dotted line corresponds to the theoretical t-distribution with 18 degrees of freedom. The Type I error corresponds to the area in both tails beyond the quantiles \(t_{0.025,18}\) and \(t_{0.975,18}\).
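That density plot is not reproduced in this text; a sketch of how it could be regenerated (re-running a compact version of the simulation above, seed chosen arbitrarily):

```r
set.seed(1)
# t-statistics of the pooled two-sample test under H0 (both populations N(90, 12))
tcns <- replicate(10000, t.test(rnorm(10, 90, 12), rnorm(10, 90, 12),
                                var.equal = TRUE)$statistic)
plot(density(tcns), lwd = 2, main = "t-statistics under the null hypothesis")
x <- seq(-4, 4, length.out = 200)
lines(x, dt(x, df = 18), lty = 3)                  # theoretical t-distribution
abline(v = qt(c(0.025, 0.975), df = 18), lty = 2)  # rejection thresholds
```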
Further questions and tasks: