---
title: "About bootstrap"
output: html_notebook
author: Kai Puolamäki
date: 13 December 2019
---

## Mean and variance from a sample

Assume that our data set consists of $n=100$ real numbers sampled from a normal distribution with zero mean and unit variance.

Given the data, we want to estimate the mean and variance (standard deviation) of the distribution that generated it. The correct answer would be mean = 0 and sd = 1, but we never get exactly the correct answer from a sample. Instead, we need to estimate these values from our empirical sample.

Let's sample 1000 datasets of $n=100$ numbers, compute the mean and variance of each, and look at the distributions of the means and variances.

### Mean

```{r}
set.seed(42)

## Make a matrix where columns are datasets of 100 real numbers
x <- replicate(1000,rnorm(100))

means <- apply(x,2,mean) # means for data sets
sds0 <- apply(x,2,function(y) sqrt(sum((y-mean(y))^2)/length(y))) # "biased" standard deviations
sds1 <- apply(x,2,function(y) sqrt(sum((y-mean(y))^2)/(length(y)-1))) # "unbiased" standard deviations

mv <- function(x) sprintf(" = %.4f ± %.4f",mean(x),sd(x))

hist(means,
     xlab=expression(hat(mu)),
     main=bquote(paste(hat(mu),.(mv(means)))))
```

From the theory we know that given $n$ real numbers $x_i$, $i\in\{1,\ldots,n\}$, the unbiased estimate of the mean is given by
$$\hat\mu=\sum\nolimits_{i=1}^n{x_i}/n,$$
which is confirmed by the histogram above.

We also know that the standard deviation of the mean estimate is
$$ \sigma_{\hat\mu}=\sigma/\sqrt{n},$$
where $\sigma=1$ in our case. We notice that the theoretical values agree well with the ones that we see from the same histogram.
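
As a quick sanity check, the empirical standard deviation of the 1000 sample means should be close to $1/\sqrt{100}=0.1$. A self-contained sketch (regenerating the same data as above with the same seed):

```{r}
set.seed(42)
x <- replicate(1000, rnorm(100))   # same datasets as above
means <- apply(x, 2, mean)
sd(means)                          # theory: sigma/sqrt(n) = 1/sqrt(100) = 0.1
```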

```{r}
plot(means,sds1,
     xlab=expression(hat(mu)),
     ylab=expression(hat(sigma)[1]),
     main="estimated means and variances")
abline(v=0,col="red")
abline(h=1,col="red")
legend("topleft",
       legend=c("true values","estimates for datasets"),
       lty=c("solid",NA),
       pch=c(NA,1),
       col=c("red","black"),
       cex=0.5)
```

### Variance (standard deviation)

Let's then look at the estimator of variance (standard deviation). If an estimator is "biased", it means that (for finite data) the expected value of the estimate differs from the true value. The best-known instance of a biased estimator is the naive estimator of the standard deviation.

```{r}
hist(sds0,
     xlab=expression(hat(sigma)[0]),
     main=bquote(paste(hat(sigma)[0],.(mv(sds0)))))
hist(sds1,
     xlab=expression(hat(sigma)[1]),
     main=bquote(paste(hat(sigma)[1],.(mv(sds1)))))
```

Naive "biased" estimate of variance (standard deviation) is given by
$$\hat\sigma_0=\left(\sum\nolimits_{i=1}^n{\left(x_i-\hat\mu\right)^2}/n\right)^{1/2}.$$
"Unbiased" estimate of variance (standard deviation) is given by
$$\hat\sigma_1=\left(\sum\nolimits_{i=1}^n{\left(x_i-\hat\mu\right)^2}/(n-1)\right)^{1/2}.$$
We indeed notice that the unbiased estimate gives on average a result that is very close to the true value of $\sigma=1$, while the biased estimate underestimates the variance.
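The bias is easiest to see with very small samples. The following sketch (with a hypothetical sample size $n=5$, not the $n=100$ used above) compares the averages of the two estimators over many datasets:

```{r}
set.seed(1)
n <- 5                             # small n makes the bias clearly visible
z <- replicate(10000, rnorm(n))
sd0 <- apply(z, 2, function(y) sqrt(sum((y - mean(y))^2) / n))  # biased
sd1 <- apply(z, 2, sd)             # R's sd() uses the n-1 denominator
mean(sd0)                          # clearly below the true value 1
mean(sd1)                          # closer to 1 (sd itself remains slightly biased)
```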

## Estimated regression coefficients

Assume that we have again $n=100$ data points $(x_i,y_i)\in{\mathbb{R}^2}$, where $i\in\{1,\ldots,n\}$, $y_i=2+3x_i+\epsilon_i$, and $\epsilon_i\in{\mathbb{R}}$ is a noise term. Assume that $x_i$ and $\epsilon_i$ are sampled independently from a normal distribution with zero mean and unit variance. 

```{r}
y <- 2+3*x+rnorm(length(x))
plot(x[,1],y[,1],xlab="x",ylab="y")
```

Consider OLS linear regression where a regression function $$
f(x)=\beta_0+\beta_1 x$$
is fitted to the data. Obviously, the "true" values of the parameters are $\beta_0=2$ and $\beta_1=3$, but from a finite sample we never get exactly the correct result; instead, the fits vary slightly.
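
For a single dataset, the OLS estimates also have a simple closed form, $\hat\beta_1=\mathrm{cov}(x,y)/\mathrm{var}(x)$ and $\hat\beta_0=\bar y-\hat\beta_1\bar x$. A quick sketch on hypothetical data (independent of the matrices `x` and `y` used elsewhere):

```{r}
set.seed(7)
x1 <- rnorm(100)
y1 <- 2 + 3 * x1 + rnorm(100)
b1 <- cov(x1, y1) / var(x1)        # slope
b0 <- mean(y1) - b1 * mean(x1)     # intercept
c(b0, b1)                          # agrees with coef(lm(y1 ~ x1))
```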

```{r}
bb <- sapply(1:ncol(x),
             function(i) lm(y~x,data.frame(x=x[,i],y=y[,i]))$coefficients)

plot(x[,1],y[,1],xlab="x",ylab="y",main="OLS fits for different data sets")
abline(a=2,b=3)
for(i in 1:10) {
  abline(a=bb[1,i],b=bb[2,i],lty="dotted")
}
legend("topleft",legend=c("true model","fitted models"),
       lty=c("solid","dotted"))
```

We next study the distribution of regression coefficients $\hat\beta_0$ and $\hat\beta_1$ in various fits.

```{r}
plot(t(bb),
     xlab=expression(hat(beta)[0]),ylab=expression(hat(beta)[1]),
     main="estimated regression coefficients")
abline(v=2,col="red")
abline(h=3,col="red")
legend("topleft",
       legend=c("true coefficients","estimates for datasets"),
       lty=c("solid",NA),
       pch=c(NA,1),
       col=c("red","black"),
       cex=0.5)

hist(bb[1,],
     xlab=expression(hat(beta)[0]),
     main=bquote(paste(hat(beta)[0],.(mv(bb[1,])))))
hist(bb[2,],
     xlab=expression(hat(beta)[1]),
     main=bquote(paste(hat(beta)[1],.(mv(bb[2,])))))
```

Not surprisingly, the "error" in the estimates of both $\hat\beta_0$ and $\hat\beta_1$ obeys the $\sigma/\sqrt{n}=0.1$ rule here; we leave the derivation as an exercise. On average we get quite accurately the "correct" result $\hat\beta_0=2$ and $\hat\beta_1=3$, but for individual data sets there is some variance of the order of $0.1^2$.
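
The $\sigma/\sqrt{n}$ rule for the slope can also be checked numerically; a small simulation sketch over fresh hypothetical datasets:

```{r}
set.seed(3)
slopes <- replicate(2000, {
  xi <- rnorm(100)
  yi <- 2 + 3 * xi + rnorm(100)
  coef(lm(yi ~ xi))[2]             # slope estimate for this dataset
})
sd(slopes)                         # close to 1/sqrt(100) = 0.1
```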

## Bootstrap

We were able to do the above exercise because we knew the distribution that generated the data. What if we do not know the distribution? Then we can use **bootstrap** sampling; see the textbook, Sec. 5.2, for discussion.

First, we choose the first sampled dataset of size $n=100$ and assume that we have no access to the 999 other datasets.

For this specific dataset we have the following estimates $\hat\mu=$ `r mean(x[,1])`, $\hat\sigma_0=$ `r sd(x[,1])*sqrt(99/100)`, $\hat\sigma_1=$ `r sd(x[,1])`, $\hat\beta_0=$ `r bb[1,1]`, and $\hat\beta_1=$ `r bb[2,1]`.

Let's generate 1000 bootstrap replicates out of this chosen dataset only.

```{r}
## Resample each of the columns (datasets) with replacement
idx <- c(replicate(1000,sample.int(100,replace=TRUE)))

## Bootstrap samples of x[,1] and y[,1]
xb <- matrix(x[idx,1],nrow=nrow(x),ncol=ncol(x))
yb <- matrix(y[idx,1],nrow=nrow(y),ncol=ncol(y))
```

We then repeat, with the bootstrap samples, everything that we did with the samples from the "true" distribution.

```{r}
meansb <- apply(xb,2,mean) # means for data sets
sds0b <- apply(xb,2,function(y) sqrt(sum((y-mean(y))^2)/length(y))) # "biased" standard deviations
sds1b <- apply(xb,2,function(y) sqrt(sum((y-mean(y))^2)/(length(y)-1))) # "unbiased" standard deviations
```

The bootstrap estimates of $\hat\mu$ and $\hat\sigma_1$, and their variances, look very similar to the corresponding values obtained from the true distribution.

```{r}
hist(meansb,
     xlab=expression(hat(mu)),
     main=bquote(paste(hat(mu),.(mv(meansb)))))
hist(sds0b,
     xlab=expression(hat(sigma)[0]),
     main=bquote(paste(hat(sigma)[0],.(mv(sds0b)))))
hist(sds1b,
     xlab=expression(hat(sigma)[1]),
     main=bquote(paste(hat(sigma)[1],.(mv(sds1b)))))
```

We next take a look at regression. More specifically, we do the OLS regression for bootstrap replicates of the dataset. Remember that for the original dataset the estimates of the regression coefficients are $\hat\beta_0=$ `r bb[1,1]` and $\hat\beta_1=$ `r bb[2,1]`.

```{r}
bbb <- sapply(1:ncol(xb),
              function(i) lm(y~x,data.frame(x=xb[,i],y=yb[,i]))$coefficients)

plot(x[,1],y[,1],xlab="x",ylab="y")
abline(a=2,b=3)
for(i in 1:10) {
  abline(a=bbb[1,i],b=bbb[2,i],lty="dotted")
}
```

The bootstrap distribution of the regression coefficients is a bit more interesting.

```{r}
plot(t(bbb),
     xlab=expression(hat(beta)[0]),ylab=expression(hat(beta)[1]),
     main="estimated regression coefficients")
abline(v=2,col="red")
abline(h=3,col="red")
abline(v=bb[1,1],col="red",lty="dashed")
abline(h=bb[2,1],col="red",lty="dashed")
legend("topleft",
       legend=c("true coefficients","original estimates","bootstrap estimates"),
       lty=c("solid","dashed",NA),
       pch=c(NA,NA,1),
       col=c("red","red","black"),
       cex=0.5)

hist(bbb[1,],
     xlab=expression(hat(beta)[0]),
     main=bquote(paste(hat(beta)[0],.(mv(bbb[1,])))))
hist(bbb[2,],
     xlab=expression(hat(beta)[1]),
     main=bquote(paste(hat(beta)[1],.(mv(bbb[2,])))))
```

A miracle happened! It seems that we do not need to be able to sample from the true distribution at all: "simulating" the true distribution with bootstrap samples seems to work just fine. Notice that there are of course some subtle assumptions we need to make in order for this to be consistent, but they are outside the scope of this course.

Notice that the bootstrap estimates are centered around the empirical estimates from the chosen dataset, not the true values. For example, $\hat\beta_1=$ `r bb[2,1]` for our dataset, and this is where the bootstrap distribution of $\hat\beta_1$ is centered. This is logical, as the bootstrap cannot magically find out that it should be centered around 3. However, the estimate of the variance is at least approximately correct. Therefore, with the bootstrap you cannot magically find the "true" mean (only the mean of the empirical sample), but you can find a good estimate of the variance of an estimator. When you know the variance you can find confidence intervals for the estimator. For example, the $\pm 1.96\hat\sigma_1$ confidence interval for the parameter $\hat\beta_1=$ `r bb[2,1]`, where $\hat\sigma_1$ is here the standard deviation of the bootstrap estimates, would be given by $[$ `r bb[2,1]-1.96*sd(bbb[2,])` $,$ `r bb[2,1]+1.96*sd(bbb[2,])` $]$. The "true" value $\beta_1=3$ is indeed inside this interval.
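
An alternative to the $\pm 1.96\hat\sigma_1$ normal-approximation interval is the *percentile* bootstrap, which reads the confidence interval directly off the quantiles of the bootstrap distribution. A sketch on hypothetical data generated the same way as above:

```{r}
set.seed(5)
x1 <- rnorm(100)
y1 <- 2 + 3 * x1 + rnorm(100)
boot_slopes <- replicate(1000, {
  i <- sample.int(100, replace = TRUE)  # resample (x, y) pairs together
  coef(lm(y1[i] ~ x1[i]))[2]
})
quantile(boot_slopes, c(0.025, 0.975))  # percentile 95% CI for beta_1
```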

## About confidence intervals and statistical testing

Why $\pm 1.96\hat\sigma_1$? If a random variable $X$ is normally distributed with mean $\mu$ and standard deviation $\sigma$, then its value is with 95% probability in the interval $[\mu-1.96\sigma,\mu+1.96\sigma]$. We call this interval the *95% confidence interval*.
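
The factor 1.96 is just the 97.5% quantile of the standard normal distribution, which can be verified directly in R:

```{r}
qnorm(0.975)                       # approximately 1.96
pnorm(1.96) - pnorm(-1.96)         # approximately 0.95
```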

Often an interesting question is whether the null hypothesis that a linear coefficient $\beta_1$ vanishes can be rejected. We can reject the null hypothesis, i.e., claim that $\beta_1$ really is non-zero, if zero is not included in the confidence interval. R directly reports standard errors (i.e., $1\sigma$ errors) for linear regression coefficients, as well as p-values for the null hypothesis that each coefficient vanishes:

```{r}
print(summary(lm(y~x,data.frame(x=x[,1],y=y[,1]))))
```

Please read Sec. 3.1.2 of the textbook for more information.