- Source: Pseudo-R-squared
Pseudo-R-squared values are used when the outcome variable is nominal or ordinal, so that the coefficient of determination R2 cannot be applied as a measure of goodness of fit, and when a likelihood function is used to fit the model.
In linear regression, the squared multiple correlation, R2, is used to assess goodness of fit, as it represents the proportion of variance in the criterion that is explained by the predictors.
In logistic regression analysis, there is no agreed-upon analogous measure, but there are several competing measures, each with limitations.
Four of the most commonly used indices and one less commonly used one are examined in this article:
Likelihood ratio R2L
Cox and Snell R2CS
Nagelkerke R2N
McFadden R2McF
Tjur R2T
R2L by Cohen
R2L is given by Cohen:
\[
R_{\text{L}}^{2} = \frac{D_{\text{null}} - D_{\text{fitted}}}{D_{\text{null}}}.
\]
This is the most analogous index to the squared multiple correlations in linear regression. It represents the proportional reduction in the deviance wherein the deviance is treated as a measure of variation analogous but not identical to the variance in linear regression analysis. One limitation of the likelihood ratio R2 is that it is not monotonically related to the odds ratio, meaning that it does not necessarily increase as the odds ratio increases and does not necessarily decrease as the odds ratio decreases.
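As a rough illustration of the deviance-based definition, the sketch below computes R2L for a binary outcome. It assumes ungrouped binary data, where the deviance equals −2 times the log-likelihood, and takes the null model to be an intercept-only model that predicts the observed base rate. The function names, the data, and the predicted probabilities are hypothetical and do not come from an actual fitted model.

```python
import math

def bernoulli_log_likelihood(y, p):
    """Log-likelihood of binary outcomes y under predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

def likelihood_ratio_r2(y, p_fitted):
    """R2L = (D_null - D_fitted) / D_null for an ungrouped binary outcome."""
    # Assumption: the null (intercept-only) model predicts the base rate for every case.
    base_rate = sum(y) / len(y)
    d_null = -2.0 * bernoulli_log_likelihood(y, [base_rate] * len(y))
    d_fitted = -2.0 * bernoulli_log_likelihood(y, p_fitted)
    return (d_null - d_fitted) / d_null

# Hypothetical outcomes and predicted probabilities, for illustration only.
y = [0, 0, 1, 1]
p_fitted = [0.2, 0.4, 0.6, 0.8]
r2_l = likelihood_ratio_r2(y, p_fitted)
print(round(r2_l, 4))
```

If the predicted probabilities are no better than the base rate, the two deviances coincide and the index is 0.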
R2CS by Cox and Snell
R2CS is an alternative index of goodness of fit related to the R2 value from linear regression. It is given by:
\[
\begin{aligned}
R_{\text{CS}}^{2} &= 1 - \left(\frac{L_{0}}{L_{M}}\right)^{2/n} \\
&= 1 - \exp\left(\frac{2}{n}\left(\ln(L_{0}) - \ln(L_{M})\right)\right)
\end{aligned}
\]
where LM and L0 are the likelihoods for the model being fitted and the null model, respectively. The Cox and Snell index corresponds to the standard R2 in the case of a linear model with normal errors. In certain situations, R2CS may be problematic because its maximum value is \(1 - L_{0}^{2/n}\). For example, for logistic regression, the upper bound is \(R_{\text{CS}}^{2} \leq 0.75\) for a symmetric marginal distribution of events (with half of the cases being events, \(L_{0} = 0.5^{n}\), so \(L_{0}^{2/n} = 0.25\)), and it decreases further for an asymmetric distribution of events.
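The upper bound can be checked numerically. The following is a minimal sketch that works from log-likelihoods rather than raw likelihoods, to avoid numerical underflow, and assumes a binary outcome with an intercept-only null model. The helper names and the sample size are hypothetical.

```python
import math

def cox_snell_r2(ll_null, ll_model, n):
    """R2CS = 1 - exp((2/n) * (ln L0 - ln LM)), computed from log-likelihoods."""
    return 1.0 - math.exp((2.0 / n) * (ll_null - ll_model))

def cox_snell_max(ll_null, n):
    """Upper bound 1 - L0^(2/n), reached when ln LM = 0 (a perfect fit)."""
    return 1.0 - math.exp((2.0 / n) * ll_null)

def null_log_likelihood(n, base_rate):
    """Log-likelihood of an intercept-only binary model with the given event proportion."""
    events = n * base_rate
    return events * math.log(base_rate) + (n - events) * math.log(1.0 - base_rate)

n = 100  # hypothetical sample size
bound_balanced = cox_snell_max(null_log_likelihood(n, 0.5), n)  # approximately 0.75
bound_skewed = cox_snell_max(null_log_likelihood(n, 0.1), n)    # smaller than the balanced bound
print(bound_balanced, bound_skewed)
```

The bound depends only on the event proportion, not on n, because the null log-likelihood scales linearly with the sample size.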
R2N by Nagelkerke
R2N, proposed by Nico Nagelkerke in a highly cited Biometrika paper, provides a correction to the Cox and Snell R2 so that the maximum value is equal to 1. Nevertheless, the Cox and Snell and likelihood ratio R2s show greater agreement with each other than either does with the Nagelkerke R2. This might not hold for values exceeding 0.75, as the Cox and Snell index is capped at this value. The likelihood ratio R2 is often preferred to the alternatives because it is most analogous to R2 in linear regression, is independent of the base rate (both the Cox and Snell and Nagelkerke R2s increase as the proportion of cases increases from 0 to 0.5), and varies between 0 and 1.
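The text does not spell out the correction. In its usual form, the Nagelkerke index divides the Cox and Snell value by its maximum, 1 − L0^(2/n). The sketch below assumes that form and works from log-likelihoods. The function name and the numeric inputs are hypothetical.

```python
import math

def nagelkerke_r2(ll_null, ll_model, n):
    """Cox and Snell R2 rescaled by its upper bound, so the maximum becomes 1."""
    r2_cs = 1.0 - math.exp((2.0 / n) * (ll_null - ll_model))
    r2_cs_max = 1.0 - math.exp((2.0 / n) * ll_null)
    return r2_cs / r2_cs_max

# Hypothetical log-likelihoods for a balanced binary outcome with n = 100.
n = 100
ll_null = n * math.log(0.5)
ll_model = -50.0
r2_n = nagelkerke_r2(ll_null, ll_model, n)
r2_n_perfect = nagelkerke_r2(ll_null, 0.0, n)  # a perfect fit (ln LM = 0) gives 1
print(round(r2_n, 4), r2_n_perfect)
```

Because the divisor is below 1, the rescaled value is always at least as large as the Cox and Snell value it is built from.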
R2McF by McFadden
The pseudo R2 by McFadden (sometimes called likelihood ratio index) is defined as
\[
R_{\text{McF}}^{2} = 1 - \frac{\ln(L_{M})}{\ln(L_{0})},
\]
and is preferred over R2CS by Allison. Since the definition of R2McF gives \(\ln(L_{0}) - \ln(L_{M}) = R_{\text{McF}}^{2}\,\ln(L_{0})\), the two expressions R2McF and R2CS are related by
\[
\begin{aligned}
R_{\text{CS}}^{2} &= 1 - L_{0}^{\,2 R_{\text{McF}}^{2}/n} \\[1.5em]
R_{\text{McF}}^{2} &= \frac{n}{2}\cdot\frac{\ln(1 - R_{\text{CS}}^{2})}{\ln L_{0}}
\end{aligned}
\]
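The link between the two indices can be checked numerically, working only from their definitions in terms of ln L0 and ln LM. Because the McFadden definition implies ln L0 − ln LM = R2McF · ln L0, the Cox and Snell value can be recovered from R2McF, n, and ln L0, and vice versa. The sketch below uses hypothetical log-likelihood values.

```python
import math

def mcfadden_r2(ll_null, ll_model):
    """R2McF = 1 - ln(LM) / ln(L0)."""
    return 1.0 - ll_model / ll_null

def cox_snell_r2(ll_null, ll_model, n):
    """R2CS = 1 - exp((2/n) * (ln L0 - ln LM))."""
    return 1.0 - math.exp((2.0 / n) * (ll_null - ll_model))

# Hypothetical log-likelihoods for a balanced binary outcome with n = 100.
n = 100
ll_null = n * math.log(0.5)
ll_model = -50.0

r2_mcf = mcfadden_r2(ll_null, ll_model)
r2_cs = cox_snell_r2(ll_null, ll_model, n)

# Substituting ln L0 - ln LM = R2McF * ln L0 into the Cox and Snell definition:
r2_cs_from_mcf = 1.0 - math.exp((2.0 / n) * r2_mcf * ll_null)
# Solving the same identity for R2McF:
r2_mcf_from_cs = (n / 2.0) * math.log(1.0 - r2_cs) / ll_null

print(round(r2_mcf, 4), round(r2_cs, 4))
```

Both ln(1 − R2CS) and ln L0 are negative for a discrete outcome, so the recovered R2McF is positive.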
R2T by Tjur
Allison prefers R2T which is a relatively new measure developed by Tjur. It can be calculated in two steps:
For each level of the dependent variable, find the mean of the predicted probabilities of an event.
Take the absolute value of the difference between these means.
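The two steps above can be sketched as follows for a binary outcome coded 0 and 1. The function name, the data, and the predicted probabilities are hypothetical.

```python
def tjur_r2(y, p):
    """Absolute difference between mean predicted probabilities of events and non-events."""
    p_events = [pi for yi, pi in zip(y, p) if yi == 1]
    p_nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    # Step 1: mean predicted probability within each level of the outcome.
    mean_events = sum(p_events) / len(p_events)
    mean_nonevents = sum(p_nonevents) / len(p_nonevents)
    # Step 2: absolute value of the difference between the two means.
    return abs(mean_events - mean_nonevents)

# Hypothetical outcomes and predicted probabilities, for illustration only.
y = [0, 0, 1, 1]
p = [0.2, 0.4, 0.6, 0.8]
r2_t = tjur_r2(y, p)  # means are about 0.7 and 0.3, so the result is about 0.4
print(round(r2_t, 4))
```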
Interpretation
A word of caution is in order when interpreting pseudo-R2 statistics. The reason these indices of fit are referred to as pseudo R2 is that they do not represent the proportionate reduction in error as the R2 in linear regression does. Linear regression assumes homoscedasticity, that the error variance is the same for all values of the criterion. Logistic regression will always be heteroscedastic – the error variances differ for each value of the predicted score. For each value of the predicted score there would be a different value of the proportionate reduction in error. Therefore, it is inappropriate to think of R2 as a proportionate reduction in error in a universal sense in logistic regression.
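To make the heteroscedasticity point concrete, the sketch below assumes a binary outcome, for which the variance of the outcome at a predicted probability p is p(1 − p). The list of predicted probabilities is hypothetical.

```python
def bernoulli_variance(p):
    """Variance of a binary outcome whose event probability is p."""
    return p * (1.0 - p)

# Hypothetical predicted probabilities spanning the range of the model.
predicted = [0.1, 0.3, 0.5, 0.7, 0.9]
variances = [bernoulli_variance(p) for p in predicted]

for p, v in zip(predicted, variances):
    print(f"p = {p:.1f}  variance = {v:.2f}")
```

The variance peaks at p = 0.5 and shrinks toward 0 as p approaches 0 or 1, so no single error variance applies across all predicted scores.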