- Source: Pooled variance
In statistics, pooled variance (also known as combined variance, composite variance, or overall variance, and written σ²) is a method for estimating the variance of several different populations when the mean of each population may be different, but one may assume that the variance of each population is the same. The numerical estimate resulting from the use of this method is also called the pooled variance.
Under the assumption of equal population variances, the pooled sample variance provides a higher-precision estimate of variance than the individual sample variances. This higher precision can lead to increased statistical power when used in statistical tests that compare the populations, such as the t-test.
The square root of a pooled variance estimator is known as a pooled standard deviation (also known as combined standard deviation, composite standard deviation, or overall standard deviation).
Motivation
In statistics, many times, data are collected for a dependent variable, y, over a range of values for the independent variable, x. For example, the observation of fuel consumption might be studied as a function of engine speed while the engine load is held constant. If, in order to achieve a small variance in y, numerous repeated tests are required at each value of x, the expense of testing may become prohibitive. Reasonable estimates of variance can be determined by using the principle of pooled variance after repeating each test at a particular x only a few times.
Definition and computation
The pooled variance is an estimate of the fixed common variance σ² underlying various populations that have different means.
We are given a set of sample variances s_i², where the populations are indexed i = 1, …, m:
{\displaystyle s_{i}^{2}={\frac {1}{n_{i}-1}}\sum _{j=1}^{n_{i}}\left(y_{i,j}-{\overline {y}}_{i}\right)^{2}.}
Assuming uniform sample sizes, n_i = n, the pooled variance s_p² can be computed by the arithmetic mean:
{\displaystyle s_{p}^{2}={\frac {\sum _{i=1}^{m}s_{i}^{2}}{m}}={\frac {s_{1}^{2}+s_{2}^{2}+\cdots +s_{m}^{2}}{m}}.}
If the sample sizes are non-uniform, then the pooled variance s_p² can be computed as a weighted average, using as weights w_i = n_i − 1 the respective degrees of freedom (see also: Bessel's correction):
{\displaystyle s_{p}^{2}={\frac {\sum _{i=1}^{m}(n_{i}-1)s_{i}^{2}}{\sum _{i=1}^{m}(n_{i}-1)}}={\frac {(n_{1}-1)s_{1}^{2}+(n_{2}-1)s_{2}^{2}+\cdots +(n_{m}-1)s_{m}^{2}}{n_{1}+n_{2}+\cdots +n_{m}-m}}.}
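As a concrete illustration (a minimal sketch, not from the source; the helper name and sample groups are hypothetical), the degrees-of-freedom-weighted formula can be computed directly from raw data:

```python
# Illustrative sketch: pooled variance as the degrees-of-freedom-weighted
# average of per-group sample variances, w_i = n_i - 1.
from statistics import variance  # unbiased sample variance (divides by n - 1)

def pooled_variance(groups):
    """Pool the sample variances s_i^2 of several groups."""
    num = sum((len(g) - 1) * variance(g) for g in groups)
    den = sum(len(g) - 1 for g in groups)
    return num / den

# Hypothetical groups with different means but comparable spread:
groups = [[1.0, 2.0, 3.0], [10.0, 12.0, 14.0], [5.0, 5.5, 6.0, 6.5]]
print(pooled_variance(groups))
```

With equal group sizes this reduces to the plain arithmetic mean of the group variances, matching the uniform-sample-size formula above.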
The scaled quantity (∑_i n_i − m) s_p²/σ² follows a chi-squared distribution with ∑_i n_i − m degrees of freedom:
{\displaystyle \left(\sum _{i}n_{i}-m\right)s_{p}^{2}/\sigma ^{2}\sim \chi ^{2}\left(\sum _{i}n_{i}-m\right).}
Proof. When there is a single mean, the distribution of
{\displaystyle (y_{1}-{\bar {y}},\dots ,y_{n}-{\bar {y}})}
is a Gaussian in {\displaystyle \Delta _{n-1}}, the (n − 1)-dimensional simplex, with standard deviation σ. When there are multiple means, the distribution of
{\displaystyle (y_{1,1}-{\bar {y}}_{1},\dots ,y_{1,n_{1}}-{\bar {y}}_{1},\dots ,y_{m,1}-{\bar {y}}_{m},\dots ,y_{m,n_{m}}-{\bar {y}}_{m})}
is a Gaussian in {\displaystyle \Delta _{n_{1}-1}\times \dots \times \Delta _{n_{m}-1}}.
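The distributional claim can be checked by simulation. The following Monte Carlo sketch (illustrative; the group sizes, means, and σ are hypothetical) verifies that the sample mean of (∑ n_i − m)·s_p²/σ² is close to the chi-squared mean, ∑ n_i − m:

```python
# Monte Carlo sanity check: with a common true variance sigma^2,
# (sum(n_i) - m) * s_p^2 / sigma^2 should average out to sum(n_i) - m,
# the mean of the chi-squared distribution with that many degrees of freedom.
import random

random.seed(0)
sigma = 2.0
sizes = [5, 8, 12]            # hypothetical n_i for m = 3 groups
means = [0.0, 10.0, -3.0]     # different group means, same variance
df = sum(sizes) - len(sizes)  # sum(n_i) - m = 22

def sample_var(xs):
    """Unbiased sample variance (divides by n - 1)."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)

vals = []
for _ in range(5000):
    svars = [sample_var([random.gauss(mu, sigma) for _ in range(n)])
             for n, mu in zip(sizes, means)]
    sp2 = sum((n - 1) * s for n, s in zip(sizes, svars)) / df
    vals.append(df * sp2 / sigma ** 2)

print(sum(vals) / len(vals))  # should be close to df = 22
```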
Variants
The unbiased least squares estimate of σ² (as presented above) and the biased maximum likelihood estimate below:
{\displaystyle s_{p}^{2}={\frac {\sum _{i=1}^{N}(n_{i}-1)s_{i}^{2}}{\sum _{i=1}^{N}n_{i}}},}
are used in different contexts. The former can give an unbiased s_p² to estimate σ² when the two groups share an equal population variance. The latter can give a more efficient s_p² to estimate σ², although subject to bias. Note that the quantities s_i² on the right-hand sides of both equations are the unbiased estimates.
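A short sketch contrasting the two estimators (hypothetical numbers; the helper names are ours, not from the source). Both share the degrees-of-freedom-weighted numerator, but the maximum likelihood version divides by ∑ n_i rather than ∑ n_i − m, so it is always the smaller of the two:

```python
# Unbiased least squares pooled estimate: divide by sum(n_i) - m.
def pooled_unbiased(ns, svars):
    return sum((n - 1) * s for n, s in zip(ns, svars)) / (sum(ns) - len(ns))

# Biased maximum likelihood pooled estimate: divide by sum(n_i).
def pooled_mle(ns, svars):
    return sum((n - 1) * s for n, s in zip(ns, svars)) / sum(ns)

ns = [10, 15, 20]
svars = [4.0, 5.0, 6.0]  # hypothetical per-group sample variances s_i^2
print(pooled_unbiased(ns, svars))  # divides by 45 - 3 = 42
print(pooled_mle(ns, svars))       # divides by 45; smaller but biased
```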
Example
Consider the following set of data for y obtained at various levels of the independent variable x.
The number of trials, mean, variance and standard deviation are presented in the next table.
These statistics represent the variance and standard deviation for each subset of data at the various levels of x. If we can assume that the same phenomena are generating random error at every level of x, the above data can be “pooled” to express a single estimate of variance and standard deviation. In a sense, this suggests finding a mean variance or standard deviation among the five results above. This mean variance is calculated by weighting the individual values with the size of the subset for each level of x. Thus, the pooled variance is defined by
{\displaystyle s_{p}^{2}={\frac {(n_{1}-1)s_{1}^{2}+(n_{2}-1)s_{2}^{2}+\cdots +(n_{k}-1)s_{k}^{2}}{(n_{1}-1)+(n_{2}-1)+\cdots +(n_{k}-1)}}}
where n_1, n_2, …, n_k are the sizes of the data subsets at each level of the variable x, and s_1², s_2², …, s_k² are their respective variances.
The pooled variance of the data shown above is therefore:
{\displaystyle s_{p}^{2}=2.764\,}
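The original data table for this example is not reproduced above, so the 2.764 figure cannot be re-derived here; as an illustration only, the same weighted calculation on hypothetical subsets looks like this:

```python
# Pooled variance from subset sizes and subset variances
# (hypothetical values, not the original example's data).
def pooled_variance(ns, svars):
    num = sum((n - 1) * s for n, s in zip(ns, svars))
    den = sum(n - 1 for n in ns)
    return num / den

ns = [4, 5, 6]            # hypothetical subset sizes n_1..n_k
svars = [2.5, 3.0, 2.8]   # hypothetical subset variances s_1^2..s_k^2
print(pooled_variance(ns, svars))
```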
Effect on precision
Pooled variance is only an estimate when there is correlation between the pooled data sets or when their means differ; the pooled estimate becomes less precise as the correlation grows or as the means move farther apart.
The variance of the data for non-overlapping data sets is:
{\displaystyle \sigma _{X}^{2}={\frac {\sum _{i}\left[(N_{X_{i}}-1)\sigma _{X_{i}}^{2}+N_{X_{i}}\mu _{X_{i}}^{2}\right]-\left[\sum _{i}N_{X_{i}}\right]\mu _{X}^{2}}{\sum _{i}N_{X_{i}}-1}}}
where the mean is defined as:
{\displaystyle \mu _{X}={\frac {\sum _{i}N_{X_{i}}\mu _{X_{i}}}{\sum _{i}N_{X_{i}}}}}
Given a biased maximum likelihood estimate defined as:
{\displaystyle s_{p}^{2}={\frac {\sum _{i=1}^{k}(n_{i}-1)s_{i}^{2}}{\sum _{i=1}^{k}n_{i}}},}
the error in the biased maximum likelihood estimate is:
{\displaystyle {\begin{aligned}{\text{Error}}&=s_{p}^{2}-\sigma _{X}^{2}\\[6pt]&={\frac {\sum _{i}(N_{X_{i}}-1)s_{i}^{2}}{\sum _{i}N_{X_{i}}}}-{\frac {1}{\sum _{i}N_{X_{i}}-1}}\left(\sum _{i}\left[(N_{X_{i}}-1)\sigma _{X_{i}}^{2}+N_{X_{i}}\mu _{X_{i}}^{2}\right]-\left[\sum _{i}N_{X_{i}}\right]\mu _{X}^{2}\right)\end{aligned}}}
Assuming N is large, such that:
{\displaystyle \sum _{i}N_{X_{i}}\approx \sum _{i}N_{X_{i}}-1}
the error in the estimate reduces to:
{\displaystyle {\begin{aligned}E&=-{\frac {\left(\sum _{i}\left[N_{X_{i}}\mu _{X_{i}}^{2}\right]-\left[\sum _{i}N_{X_{i}}\right]\mu _{X}^{2}\right)}{\sum _{i}N_{X_{i}}}}\\[3pt]&=\mu _{X}^{2}-{\frac {\sum _{i}\left[N_{X_{i}}\mu _{X_{i}}^{2}\right]}{\sum _{i}N_{X_{i}}}}\end{aligned}}}
Or alternatively:
{\displaystyle {\begin{aligned}E&=\left[{\frac {\sum _{i}N_{X_{i}}\mu _{X_{i}}}{\sum _{i}N_{X_{i}}}}\right]^{2}-{\frac {\sum _{i}\left[N_{X_{i}}\mu _{X_{i}}^{2}\right]}{\sum _{i}N_{X_{i}}}}\\[3pt]&={\frac {\left[\sum _{i}N_{X_{i}}\mu _{X_{i}}\right]^{2}-\sum _{i}N_{X_{i}}\sum _{i}\left[N_{X_{i}}\mu _{X_{i}}^{2}\right]}{\left[\sum _{i}N_{X_{i}}\right]^{2}}}\end{aligned}}}
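A quick numeric check of the large-N error expression (hypothetical subgroup sizes and means). The bias term depends only on the subgroup sizes and means, and it is never positive, since it equals minus the size-weighted spread of the subgroup means around μ_X:

```python
# Hypothetical subgroup sizes and means (not from the source).
Ns = [100, 200, 300]
mus = [1.0, 3.0, 4.0]

# Overall mean mu_X, as defined above.
mu_x = sum(N * m for N, m in zip(Ns, mus)) / sum(Ns)

# Large-N error: E = mu_X^2 - sum(N_i * mu_i^2) / sum(N_i).
E = mu_x ** 2 - sum(N * m * m for N, m in zip(Ns, mus)) / sum(Ns)

# Equivalently, minus the size-weighted variance of the subgroup means:
E_alt = -sum(N * (m - mu_x) ** 2 for N, m in zip(Ns, mus)) / sum(Ns)
print(E, E_alt)  # equal; negative whenever the subgroup means differ
```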
Aggregation of standard deviation data
Rather than estimating a pooled standard deviation, standard deviations can be aggregated exactly as follows when more statistical information is available.
Population-based statistics
The populations of sets, which may overlap, can be calculated simply as follows:
{\displaystyle {\begin{aligned}&&N_{X\cup Y}&=N_{X}+N_{Y}-N_{X\cap Y}\\\end{aligned}}}
The populations of sets that do not overlap can be calculated simply as follows:
{\displaystyle {\begin{aligned}X\cap Y=\varnothing &\Rightarrow &N_{X\cap Y}&=0\\&\Rightarrow &N_{X\cup Y}&=N_{X}+N_{Y}\end{aligned}}}
Standard deviations of non-overlapping (X ∩ Y = ∅) sub-populations can be aggregated as follows if the size (actual or relative to one another) and means of each are known:
{\displaystyle {\begin{aligned}\mu _{X\cup Y}&={\frac {N_{X}\mu _{X}+N_{Y}\mu _{Y}}{N_{X}+N_{Y}}}\\[3pt]\sigma _{X\cup Y}&={\sqrt {{\frac {N_{X}\sigma _{X}^{2}+N_{Y}\sigma _{Y}^{2}}{N_{X}+N_{Y}}}+{\frac {N_{X}N_{Y}}{(N_{X}+N_{Y})^{2}}}(\mu _{X}-\mu _{Y})^{2}}}\end{aligned}}}
For example, suppose it is known that the average American man has a mean height of 70 inches with a standard deviation of three inches and that the average American woman has a mean height of 65 inches with a standard deviation of two inches. Also assume that the number of men, N, is equal to the number of women. Then the mean and standard deviation of heights of American adults could be calculated as
{\displaystyle {\begin{aligned}\mu &={\frac {N\cdot 70+N\cdot 65}{N+N}}={\frac {70+65}{2}}=67.5\\[3pt]\sigma &={\sqrt {{\frac {3^{2}+2^{2}}{2}}+{\frac {(70-65)^{2}}{2^{2}}}}}={\sqrt {12.75}}\approx 3.57\end{aligned}}}
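The worked example above can be verified directly; with equal group sizes, N cancels out of both formulas:

```python
# Two-group population aggregation for the height example:
# men (mean 70 in, sd 3 in) and women (mean 65 in, sd 2 in), equal counts.
import math

N_x = N_y = 1.0          # equal sizes; only the ratio matters
mu_x, sd_x = 70.0, 3.0
mu_y, sd_y = 65.0, 2.0

mu = (N_x * mu_x + N_y * mu_y) / (N_x + N_y)
var = ((N_x * sd_x**2 + N_y * sd_y**2) / (N_x + N_y)
       + N_x * N_y / (N_x + N_y) ** 2 * (mu_x - mu_y) ** 2)
print(mu, math.sqrt(var))  # 67.5 and sqrt(12.75) ≈ 3.57
```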
For the more general case of M non-overlapping populations, X_1 through X_M, and the aggregate population {\textstyle X\,=\,\bigcup _{i}X_{i}},
{\displaystyle {\begin{aligned}\mu _{X}&={\frac {\sum _{i}N_{X_{i}}\mu _{X_{i}}}{\sum _{i}N_{X_{i}}}}\\[3pt]\sigma _{X}&={\sqrt {{\frac {\sum _{i}N_{X_{i}}\sigma _{X_{i}}^{2}}{\sum _{i}N_{X_{i}}}}+{\frac {\sum _{i<j}N_{X_{i}}N_{X_{j}}\left(\mu _{X_{i}}-\mu _{X_{j}}\right)^{2}}{\left(\sum _{i}N_{X_{i}}\right)^{2}}}}}\end{aligned}}}
where
{\displaystyle X_{i}\cap X_{j}=\varnothing ,\quad \forall \,i<j.}
If the size (actual or relative to one another), mean, and standard deviation of two overlapping populations are known for the populations as well as their intersection, then the standard deviation of the overall population can still be calculated as follows:
{\displaystyle {\begin{aligned}\mu _{X\cup Y}&={\frac {1}{N_{X\cup Y}}}\left(N_{X}\mu _{X}+N_{Y}\mu _{Y}-N_{X\cap Y}\mu _{X\cap Y}\right)\\[3pt]\sigma _{X\cup Y}&={\sqrt {{\frac {1}{N_{X\cup Y}}}\left(N_{X}[\sigma _{X}^{2}+\mu _{X}^{2}]+N_{Y}[\sigma _{Y}^{2}+\mu _{Y}^{2}]-N_{X\cap Y}[\sigma _{X\cap Y}^{2}+\mu _{X\cap Y}^{2}]\right)-\mu _{X\cup Y}^{2}}}\end{aligned}}}
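These overlapping-population formulas can be verified on explicit lists (illustrative, hypothetical data): build X and Y sharing some elements, compute population moments of each part, and compare against direct moments of the union:

```python
# Overlapping-population aggregation check (population moments, ddof = 0).
X = [1.0, 2.0, 3.0, 4.0]
Y = [3.0, 4.0, 8.0, 9.0, 10.0]
inter = [3.0, 4.0]                          # X ∩ Y
union = [1.0, 2.0, 3.0, 4.0, 8.0, 9.0, 10.0]  # X ∪ Y

def pop_stats(xs):
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # population variance
    return n, mu, var

(Nx, mx, vx), (Ny, my, vy), (Ni, mi, vi) = map(pop_stats, (X, Y, inter))
Nu = Nx + Ny - Ni
mu_u = (Nx * mx + Ny * my - Ni * mi) / Nu
var_u = (Nx * (vx + mx**2) + Ny * (vy + my**2)
         - Ni * (vi + mi**2)) / Nu - mu_u**2

n_d, mu_d, var_d = pop_stats(union)
print(mu_u, var_u)  # matches the direct computation on the union
```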
If two or more sets of data are being added together datapoint by datapoint, the standard deviation of the result can be calculated if the standard deviation of each data set and the covariance between each pair of data sets is known:
{\displaystyle \sigma _{X}={\sqrt {\sum _{i}{\sigma _{X_{i}}^{2}}+2\sum _{i,j}\operatorname {cov} (X_{i},X_{j})}}}
For the special case where no correlation exists between any pair of data sets, then the relation reduces to the root sum of squares:
{\displaystyle \operatorname {cov} (X_{i},X_{j})=0,\quad \forall \,i<j\quad \Rightarrow \quad \sigma _{X}={\sqrt {\sum _{i}\sigma _{X_{i}}^{2}}}.}
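The datapoint-by-datapoint relation can be confirmed numerically; this sketch (illustrative, using NumPy with population moments, ddof = 0 throughout) builds two correlated series and checks the identity:

```python
# Var(a + b) = Var(a) + Var(b) + 2 * Cov(a, b), checked on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = 0.5 * a + rng.normal(size=1000)  # correlated with a
total = a + b                        # datapoint-by-datapoint sum

lhs = total.var()                                   # ddof = 0 by default
rhs = a.var() + b.var() + 2 * np.cov(a, b, ddof=0)[0, 1]
print(lhs, rhs)  # equal up to floating-point error
```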
Sample-based statistics
Standard deviations of non-overlapping (X ∩ Y = ∅) sub-samples can be aggregated as follows if the actual size and means of each are known:
{\displaystyle {\begin{aligned}\mu _{X\cup Y}&={\frac {1}{N_{X\cup Y}}}\left(N_{X}\mu _{X}+N_{Y}\mu _{Y}\right)\\[3pt]\sigma _{X\cup Y}&={\sqrt {{\frac {1}{N_{X\cup Y}-1}}\left([N_{X}-1]\sigma _{X}^{2}+N_{X}\mu _{X}^{2}+[N_{Y}-1]\sigma _{Y}^{2}+N_{Y}\mu _{Y}^{2}-[N_{X}+N_{Y}]\mu _{X\cup Y}^{2}\right)}}\end{aligned}}}
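As a sanity check (hypothetical data), the sample-based formula for two disjoint samples reproduces the sample variance computed directly on the concatenation:

```python
# Exact aggregation of two disjoint samples (sample moments, ddof = 1).
from statistics import mean, variance

X = [2.0, 4.0, 4.0, 5.0]
Y = [8.0, 9.0, 11.0]

Nx, Ny = len(X), len(Y)
mx, my = mean(X), mean(Y)
vx, vy = variance(X), variance(Y)  # unbiased sample variances

Nu = Nx + Ny
mu_u = (Nx * mx + Ny * my) / Nu
var_u = ((Nx - 1) * vx + Nx * mx**2 + (Ny - 1) * vy + Ny * my**2
         - Nu * mu_u**2) / (Nu - 1)

print(var_u, variance(X + Y))  # identical
```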
For the more general case of M non-overlapping data sets, X_1 through X_M, and the aggregate data set {\textstyle X\,=\,\bigcup _{i}X_{i}},
{\displaystyle {\begin{aligned}\mu _{X}&={\frac {1}{\sum _{i}{N_{X_{i}}}}}\left(\sum _{i}{N_{X_{i}}\mu _{X_{i}}}\right)\\[3pt]\sigma _{X}&={\sqrt {{\frac {1}{\sum _{i}{N_{X_{i}}-1}}}\left(\sum _{i}{\left[(N_{X_{i}}-1)\sigma _{X_{i}}^{2}+N_{X_{i}}\mu _{X_{i}}^{2}\right]}-\left[\sum _{i}{N_{X_{i}}}\right]\mu _{X}^{2}\right)}}\end{aligned}}}
where
{\displaystyle X_{i}\cap X_{j}=\varnothing ,\quad \forall \,i<j.}
If the size, mean, and standard deviation of two overlapping samples are known for the samples as well as their intersection, then the standard deviation of the aggregated sample can still be calculated. In general,
{\displaystyle {\begin{aligned}\mu _{X\cup Y}&={\frac {1}{N_{X\cup Y}}}\left(N_{X}\mu _{X}+N_{Y}\mu _{Y}-N_{X\cap Y}\mu _{X\cap Y}\right)\\[3pt]\sigma _{X\cup Y}&={\sqrt {\frac {[N_{X}-1]\sigma _{X}^{2}+N_{X}\mu _{X}^{2}+[N_{Y}-1]\sigma _{Y}^{2}+N_{Y}\mu _{Y}^{2}-[N_{X\cap Y}-1]\sigma _{X\cap Y}^{2}-N_{X\cap Y}\mu _{X\cap Y}^{2}-[N_{X}+N_{Y}-N_{X\cap Y}]\mu _{X\cup Y}^{2}}{N_{X\cup Y}-1}}}\end{aligned}}}
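This overlapping-sample formula can likewise be checked on explicit lists (illustrative, hypothetical data): X and Y share the elements of `inter`, and the union's sample variance is recovered exactly from the parts' sizes, means, and sample variances:

```python
# Overlapping-sample aggregation check (sample moments, ddof = 1).
from statistics import mean, variance

X = [1.0, 2.0, 3.0, 4.0]
Y = [3.0, 4.0, 8.0, 9.0, 10.0]
inter = [3.0, 4.0]                           # X ∩ Y
union = [1.0, 2.0, 3.0, 4.0, 8.0, 9.0, 10.0]  # X ∪ Y

def stats(xs):
    return len(xs), mean(xs), variance(xs)

(Nx, mx, vx), (Ny, my, vy), (Ni, mi, vi) = map(stats, (X, Y, inter))
Nu = Nx + Ny - Ni
mu_u = (Nx * mx + Ny * my - Ni * mi) / Nu
var_u = ((Nx - 1) * vx + Nx * mx**2 + (Ny - 1) * vy + Ny * my**2
         - (Ni - 1) * vi - Ni * mi**2 - Nu * mu_u**2) / (Nu - 1)

print(var_u, variance(union))  # identical
```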
See also
Chi-squared distribution#Asymptotic properties
Cohen's d (effect size), which is calculated using the pooled standard deviation
Distribution of the sample variance
Pooled covariance matrix
Pooled degree of freedom
Pooled mean
External links
IUPAC Gold Book – pooled standard deviation