- Source: Bregman divergence
In mathematics, specifically statistics and information geometry, a Bregman divergence or Bregman distance is a measure of difference between two points, defined in terms of a strictly convex function; they form an important class of divergences. When the points are interpreted as probability distributions – notably as either values of the parameter of a parametric model or as a data set of observed values – the resulting distance is a statistical distance. The most basic Bregman divergence is the squared Euclidean distance.
Bregman divergences are similar to metrics, but satisfy neither the triangle inequality (ever) nor symmetry (in general). However, they satisfy a generalization of the Pythagorean theorem, and in information geometry the corresponding statistical manifold is interpreted as a (dually) flat manifold. This allows many techniques of optimization theory to be generalized to Bregman divergences, geometrically as generalizations of least squares.
Bregman divergences are named after Russian mathematician Lev M. Bregman, who introduced the concept in 1967.
Definition
Let
F
:
Ω
→
R
{\displaystyle F\colon \Omega \to \mathbb {R} }
be a continuously-differentiable, strictly convex function defined on a convex set
Ω
{\displaystyle \Omega }
.
The Bregman distance associated with F for points
p
,
q
∈
Ω
{\displaystyle p,q\in \Omega }
is the difference between the value of F at point p and the value of the first-order Taylor expansion of F around point q evaluated at point p:
D
F
(
p
,
q
)
=
F
(
p
)
−
F
(
q
)
−
⟨
∇
F
(
q
)
,
p
−
q
⟩
.
{\displaystyle D_{F}(p,q)=F(p)-F(q)-\langle \nabla F(q),p-q\rangle .}
Properties
Non-negativity:
D
F
(
p
,
q
)
≥
0
{\displaystyle D_{F}(p,q)\geq 0}
for all
p
{\displaystyle p}
,
q
{\displaystyle q}
. This is a consequence of the convexity of
F
{\displaystyle F}
.
Positivity: When
F
{\displaystyle F}
is strictly convex,
D
F
(
p
,
q
)
=
0
{\displaystyle D_{F}(p,q)=0}
iff
p
=
q
{\displaystyle p=q}
.
Uniqueness up to affine difference:
D
F
=
D
G
{\displaystyle D_{F}=D_{G}}
iff
F
−
G
{\displaystyle F-G}
is an affine function.
Convexity:
D
F
(
p
,
q
)
{\displaystyle D_{F}(p,q)}
is convex in its first argument, but not necessarily in the second argument. If F is strictly convex, then
D
F
(
p
,
q
)
{\displaystyle D_{F}(p,q)}
is strictly convex in its first argument.
For example, Take f(x) = |x|, smooth it at 0, then take
y
=
1
,
x
1
=
0.1
,
x
2
=
−
0.9
,
x
3
=
0.9
x
1
+
0.1
x
2
{\displaystyle y=1,x_{1}=0.1,x_{2}=-0.9,x_{3}=0.9x_{1}+0.1x_{2}}
, then
D
f
(
y
,
x
3
)
≈
1
>
0.9
D
f
(
y
,
x
1
)
+
0.1
D
f
(
y
,
x
2
)
≈
0.2
{\displaystyle D_{f}(y,x_{3})\approx 1>0.9D_{f}(y,x_{1})+0.1D_{f}(y,x_{2})\approx 0.2}
.
Linearity: If we think of the Bregman distance as an operator on the function F, then it is linear with respect to non-negative coefficients. In other words, for
F
1
,
F
2
{\displaystyle F_{1},F_{2}}
strictly convex and differentiable, and
λ
≥
0
{\displaystyle \lambda \geq 0}
,
D
F
1
+
λ
F
2
(
p
,
q
)
=
D
F
1
(
p
,
q
)
+
λ
D
F
2
(
p
,
q
)
{\displaystyle D_{F_{1}+\lambda F_{2}}(p,q)=D_{F_{1}}(p,q)+\lambda D_{F_{2}}(p,q)}
Duality: If F is strictly convex, then the function F has a convex conjugate
F
∗
{\displaystyle F^{*}}
which is also strictly convex and continuously differentiable on some convex set
Ω
∗
{\displaystyle \Omega ^{*}}
. The Bregman distance defined with respect to
F
∗
{\displaystyle F^{*}}
is dual to
D
F
(
p
,
q
)
{\displaystyle D_{F}(p,q)}
as
D
F
∗
(
p
∗
,
q
∗
)
=
D
F
(
q
,
p
)
{\displaystyle D_{F^{*}}(p^{*},q^{*})=D_{F}(q,p)}
Here,
p
∗
=
∇
F
(
p
)
{\displaystyle p^{*}=\nabla F(p)}
and
q
∗
=
∇
F
(
q
)
{\displaystyle q^{*}=\nabla F(q)}
are the dual points corresponding to p and q.
Moreover, using the same notations :
D
F
(
p
,
q
)
=
F
(
p
)
+
F
∗
(
q
∗
)
−
⟨
p
,
q
∗
⟩
{\displaystyle D_{F}(p,q)=F(p)+F^{*}(q^{*})-\langle p,q^{*}\rangle }
Integral form: by the integral remainder form of Taylor's Theorem, a Bregman divergence can be written as the integral of the Hessian of
F
{\displaystyle F}
along the line segment between the Bregman divergence's arguments.
Mean as minimizer: A key result about Bregman divergences is that, given a random vector, the mean vector minimizes the expected Bregman divergence from the random vector. This result generalizes the textbook result that the mean of a set minimizes total squared error to elements in the set. This result was proved for the vector case by (Banerjee et al. 2005), and extended to the case of functions/distributions by (Frigyik et al. 2008). This result is important because it further justifies using a mean as a representative of a random set, particularly in Bayesian estimation.
Bregman balls are bounded, and compact if X is closed: Define Bregman ball centered at x with radius r by
B
f
(
x
,
r
)
:=
{
y
∈
X
:
D
f
(
y
,
x
)
≤
r
}
{\displaystyle B_{f}(x,r):=\left\{y\in X:D_{f}(y,x)\leq r\right\}}
. When
X
⊂
R
n
{\displaystyle X\subset \mathbb {R} ^{n}}
is finite dimensional,
∀
x
∈
X
{\displaystyle \forall x\in X}
, if
x
{\displaystyle x}
is in the relative interior of
X
{\displaystyle X}
, or if
X
{\displaystyle X}
is locally closed at
x
{\displaystyle x}
(that is, there exists a closed ball
B
(
x
,
r
)
{\displaystyle B(x,r)}
centered at
x
{\displaystyle x}
, such that
B
(
x
,
r
)
∩
X
{\displaystyle B(x,r)\cap X}
is closed), then
B
f
(
x
,
r
)
{\displaystyle B_{f}(x,r)}
is bounded for all
r
{\displaystyle r}
. If
X
{\displaystyle X}
is closed, then
B
f
(
x
,
r
)
{\displaystyle B_{f}(x,r)}
is compact for all
r
{\displaystyle r}
.
Law of cosines:
For any
p
,
q
,
z
{\displaystyle p,q,z}
D
F
(
p
,
q
)
=
D
F
(
p
,
z
)
+
D
F
(
z
,
q
)
−
(
p
−
z
)
T
(
∇
F
(
q
)
−
∇
F
(
z
)
)
{\displaystyle D_{F}(p,q)=D_{F}(p,z)+D_{F}(z,q)-(p-z)^{T}(\nabla F(q)-\nabla F(z))}
Parallelogram law: for any
θ
,
θ
1
,
θ
2
{\displaystyle \theta ,\theta _{1},\theta _{2}}
,
B
F
(
θ
1
:
θ
)
+
B
F
(
θ
2
:
θ
)
=
B
F
(
θ
1
:
θ
1
+
θ
2
2
)
+
B
F
(
θ
2
:
θ
1
+
θ
2
2
)
+
2
B
F
(
θ
1
+
θ
2
2
:
θ
)
{\displaystyle B_{F}\left(\theta _{1}:\theta \right)+B_{F}\left(\theta _{2}:\theta \right)=B_{F}\left(\theta _{1}:{\frac {\theta _{1}+\theta _{2}}{2}}\right)+B_{F}\left(\theta _{2}:{\frac {\theta _{1}+\theta _{2}}{2}}\right)+2B_{F}\left({\frac {\theta _{1}+\theta _{2}}{2}}:\theta \right)}
Bregman projection: For any
W
⊂
Ω
{\displaystyle W\subset \Omega }
, define the "Bregman projection" of
q
{\displaystyle q}
onto
W
{\displaystyle W}
:
P
W
(
q
)
=
argmin
ω
∈
W
D
F
(
ω
,
q
)
{\displaystyle P_{W}(q)={\text{argmin}}_{\omega \in W}D_{F}(\omega ,q)}
. Then
if
W
{\displaystyle W}
is convex, then the projection is unique if it exists;
if
W
{\displaystyle W}
is nonempty, closed, and convex and
Ω
⊂
R
n
{\displaystyle \Omega \subset \mathbb {R} ^{n}}
is finite dimensional, then the projection exists and is unique.
Generalized Pythagorean Theorem:
For any
v
∈
Ω
,
a
∈
W
{\displaystyle v\in \Omega ,a\in W}
,
D
F
(
a
,
v
)
≥
D
F
(
a
,
P
W
(
v
)
)
+
D
F
(
P
W
(
v
)
,
v
)
.
{\displaystyle D_{F}(a,v)\geq D_{F}(a,P_{W}(v))+D_{F}(P_{W}(v),v).}
This is an equality if
P
W
(
v
)
{\displaystyle P_{W}(v)}
is in the relative interior of
W
{\displaystyle W}
.
In particular, this always happens when
W
{\displaystyle W}
is an affine set.
Lack of triangle inequality: Since the Bregman divergence is essentially a generalization of squared Euclidean distance, there is no triangle inequality. Indeed,
D
F
(
z
,
x
)
−
D
F
(
z
,
y
)
−
D
F
(
y
,
x
)
=
⟨
∇
f
(
y
)
−
∇
f
(
x
)
,
z
−
y
⟩
{\displaystyle D_{F}(z,x)-D_{F}(z,y)-D_{F}(y,x)=\langle \nabla f(y)-\nabla f(x),z-y\rangle }
, which may be positive or negative.
= Proofs
=Non-negativity and positivity: use Jensen's inequality.
Uniqueness up to affine difference: Fix some
x
∈
Ω
{\displaystyle x\in \Omega }
, then for any other
y
∈
Ω
{\displaystyle y\in \Omega }
, we have by definition
F
(
y
)
−
G
(
y
)
=
F
(
x
)
−
G
(
x
)
+
⟨
∇
F
(
x
)
−
∇
G
(
x
)
,
y
−
x
⟩
{\displaystyle F(y)-G(y)=F(x)-G(x)+\langle \nabla F(x)-\nabla G(x),y-x\rangle }
.
Convexity in the first argument: by definition, and use convexity of F. Same for strict convexity.
Linearity in F, law of cosines, parallelogram law: by definition.
Duality: See figure 1 of.
Bregman balls are bounded, and compact if X is closed:
Fix
x
∈
X
{\displaystyle x\in X}
. Take affine transform on
f
{\displaystyle f}
, so that
∇
f
(
x
)
=
0
{\displaystyle \nabla f(x)=0}
.
Take some
ϵ
>
0
{\displaystyle \epsilon >0}
, such that
∂
B
(
x
,
ϵ
)
⊂
X
{\displaystyle \partial B(x,\epsilon )\subset X}
. Then consider the "radial-directional" derivative of
f
{\displaystyle f}
on the Euclidean sphere
∂
B
(
x
,
ϵ
)
{\displaystyle \partial B(x,\epsilon )}
.
⟨
∇
f
(
y
)
,
(
y
−
x
)
⟩
{\displaystyle \langle \nabla f(y),(y-x)\rangle }
for all
y
∈
∂
B
(
x
,
ϵ
)
{\displaystyle y\in \partial B(x,\epsilon )}
.
Since
∂
B
(
x
,
ϵ
)
⊂
R
n
{\displaystyle \partial B(x,\epsilon )\subset \mathbb {R} ^{n}}
is compact, it achieves minimal value
δ
{\displaystyle \delta }
at some
y
0
∈
∂
B
(
x
,
ϵ
)
{\displaystyle y_{0}\in \partial B(x,\epsilon )}
.
Since
f
{\displaystyle f}
is strictly convex,
δ
>
0
{\displaystyle \delta >0}
. Then
B
f
(
x
,
r
)
⊂
B
(
x
,
r
/
δ
)
∩
X
{\displaystyle B_{f}(x,r)\subset B(x,r/\delta )\cap X}
.
Since
D
f
(
y
,
x
)
{\displaystyle D_{f}(y,x)}
is
C
1
{\displaystyle C^{1}}
in
y
{\displaystyle y}
,
D
f
{\displaystyle D_{f}}
is continuous in
y
{\displaystyle y}
, thus
B
f
(
x
,
r
)
{\displaystyle B_{f}(x,r)}
is closed if
X
{\displaystyle X}
is.
Projection
P
W
{\displaystyle P_{W}}
is well-defined when
W
{\displaystyle W}
is closed and convex.
Fix
v
∈
X
{\displaystyle v\in X}
. Take some
w
∈
W
{\displaystyle w\in W}
, then let
r
:=
D
f
(
w
,
v
)
{\displaystyle r:=D_{f}(w,v)}
. Then draw the Bregman ball
B
f
(
v
,
r
)
∩
W
{\displaystyle B_{f}(v,r)\cap W}
. It is closed and bounded, thus compact. Since
D
f
(
⋅
,
v
)
{\displaystyle D_{f}(\cdot ,v)}
is continuous and strictly convex on it, and bounded below by
0
{\displaystyle 0}
, it achieves a unique minimum on it.
Pythagorean inequality.
By cosine law,
D
f
(
w
,
v
)
−
D
f
(
w
,
P
W
(
v
)
)
−
D
f
(
P
W
(
v
)
,
v
)
=
⟨
∇
y
D
f
(
y
,
v
)
|
y
=
P
W
(
v
)
,
w
−
P
W
(
v
)
⟩
{\displaystyle D_{f}(w,v)-D_{f}(w,P_{W}(v))-D_{f}(P_{W}(v),v)=\langle \nabla _{y}D_{f}(y,v)|_{y=P_{W}(v)},w-P_{W}(v)\rangle }
, which must be
≥
0
{\displaystyle \geq 0}
, since
P
W
(
v
)
{\displaystyle P_{W}(v)}
minimizes
D
f
(
⋅
,
v
)
{\displaystyle D_{f}(\cdot ,v)}
in
W
{\displaystyle W}
, and
W
{\displaystyle W}
is convex.
Pythagorean equality when
P
W
(
v
)
{\displaystyle P_{W}(v)}
is in the relative interior of
X
{\displaystyle X}
.
If
⟨
∇
y
D
f
(
y
,
v
)
|
y
=
P
W
(
v
)
,
w
−
P
W
(
v
)
⟩
>
0
{\displaystyle \langle \nabla _{y}D_{f}(y,v)|_{y=P_{W}(v)},w-P_{W}(v)\rangle >0}
, then since
w
{\displaystyle w}
is in the relative interior, we can move from
P
W
(
v
)
{\displaystyle P_{W}(v)}
in the direction opposite of
w
{\displaystyle w}
, to decrease
D
f
(
y
,
v
)
{\displaystyle D_{f}(y,v)}
, contradiction.
Thus
⟨
∇
y
D
f
(
y
,
v
)
|
y
=
P
W
(
v
)
,
w
−
P
W
(
v
)
⟩
=
0
{\displaystyle \langle \nabla _{y}D_{f}(y,v)|_{y=P_{W}(v)},w-P_{W}(v)\rangle =0}
.
= Classification theorems
=The only symmetric Bregman divergences on
X
⊂
R
n
{\displaystyle X\subset \mathbb {R} ^{n}}
are squared generalized Euclidean distances (Mahalanobis distance), that is,
D
f
(
y
,
x
)
=
(
y
−
x
)
T
A
(
y
−
x
)
{\displaystyle D_{f}(y,x)=(y-x)^{T}A(y-x)}
for some positive definite
A
{\displaystyle A}
.
The following two characterizations are for divergences on
Γ
n
{\displaystyle \Gamma _{n}}
, the set of all probability measures on
{
1
,
2
,
.
.
.
,
n
}
{\displaystyle \{1,2,...,n\}}
, with
n
≥
2
{\displaystyle n\geq 2}
.
Define a divergence on
Γ
n
{\displaystyle \Gamma _{n}}
as any function of type
D
:
Γ
n
×
Γ
n
→
[
0
,
∞
]
{\displaystyle D:\Gamma _{n}\times \Gamma _{n}\to [0,\infty ]}
, such that
D
(
x
,
x
)
=
0
{\displaystyle D(x,x)=0}
for all
x
∈
Γ
n
{\displaystyle x\in \Gamma _{n}}
, then:
The only divergence on
Γ
n
{\displaystyle \Gamma _{n}}
that is both a Bregman divergence and an f-divergence is the Kullback–Leibler divergence.
If
n
≥
3
{\displaystyle n\geq 3}
, then any Bregman divergence on
Γ
n
{\displaystyle \Gamma _{n}}
that satisfies the data processing inequality must be the Kullback–Leibler divergence. (In fact, a weaker assumption of "sufficiency" is enough.) Counterexamples exist when
n
=
2
{\displaystyle n=2}
.
Given a Bregman divergence
D
F
{\displaystyle D_{F}}
, its "opposite", defined by
D
F
∗
(
v
,
w
)
=
D
F
(
w
,
v
)
{\displaystyle D_{F}^{*}(v,w)=D_{F}(w,v)}
, is generally not a Bregman divergence. For example, the Kullback-Leiber divergence is both a Bregman divergence and an f-divergence. Its reverse is also an f-divergence, but by the above characterization, the reverse KL divergence cannot be a Bregman divergence.
Examples
The squared Mahalanobis distance
D
F
(
x
,
y
)
=
1
2
(
x
−
y
)
T
Q
(
x
−
y
)
{\displaystyle D_{F}(x,y)={\tfrac {1}{2}}(x-y)^{T}Q(x-y)}
is generated by the convex quadratic form
F
(
x
)
=
1
2
x
T
Q
x
{\displaystyle F(x)={\tfrac {1}{2}}x^{T}Qx}
.
The canonical example of a Bregman distance is the squared Euclidean distance
D
F
(
x
,
y
)
=
‖
x
−
y
‖
2
{\displaystyle D_{F}(x,y)=\|x-y\|^{2}}
. It results as the special case of the above, when
Q
{\displaystyle Q}
is the identity, i.e. for
F
(
x
)
=
‖
x
‖
2
{\displaystyle F(x)=\|x\|^{2}}
. As noted, affine differences, i.e. the lower orders added in
F
{\displaystyle F}
, are irrelevant to
D
F
{\displaystyle D_{F}}
.
The generalized Kullback–Leibler divergence
D
F
(
p
,
q
)
=
∑
i
p
(
i
)
log
p
(
i
)
q
(
i
)
−
∑
p
(
i
)
+
∑
q
(
i
)
{\displaystyle D_{F}(p,q)=\sum _{i}p(i)\log {\frac {p(i)}{q(i)}}-\sum p(i)+\sum q(i)}
is generated by the negative entropy function
F
(
p
)
=
∑
i
p
(
i
)
log
p
(
i
)
{\displaystyle F(p)=\sum _{i}p(i)\log p(i)}
When restricted to the simplex, the last two terms cancel, giving the usual Kullback–Leibler divergence for distributions.
The Itakura–Saito distance,
D
F
(
p
,
q
)
=
∑
i
(
p
(
i
)
q
(
i
)
−
log
p
(
i
)
q
(
i
)
−
1
)
{\displaystyle D_{F}(p,q)=\sum _{i}\left({\frac {p(i)}{q(i)}}-\log {\frac {p(i)}{q(i)}}-1\right)}
is generated by the convex function
F
(
p
)
=
−
∑
i
log
p
(
i
)
{\displaystyle F(p)=-\sum _{i}\log p(i)}
Generalizing projective duality
A key tool in computational geometry is the idea of projective duality, which maps points to hyperplanes and vice versa, while preserving incidence and above-below relationships. There are numerous analytical forms of the projective dual: one common form maps the point
p
=
(
p
1
,
…
p
d
)
{\displaystyle p=(p_{1},\ldots p_{d})}
to the hyperplane
x
d
+
1
=
∑
1
d
2
p
i
x
i
{\displaystyle x_{d+1}=\sum _{1}^{d}2p_{i}x_{i}}
. This mapping can be interpreted (identifying the hyperplane with its normal) as the convex conjugate mapping that takes the point p to its dual point
p
∗
=
∇
F
(
p
)
{\displaystyle p^{*}=\nabla F(p)}
, where F defines the d-dimensional paraboloid
x
d
+
1
=
∑
x
i
2
{\displaystyle x_{d+1}=\sum x_{i}^{2}}
.
If we now replace the paraboloid by an arbitrary convex function, we obtain a different dual mapping that retains the incidence and above-below properties of the standard projective dual. This implies that natural dual concepts in computational geometry like Voronoi diagrams and Delaunay triangulations retain their meaning in distance spaces defined by an arbitrary Bregman divergence. Thus, algorithms from "normal" geometry extend directly to these spaces (Boissonnat, Nielsen and Nock, 2010)
Generalization of Bregman divergences
Bregman divergences can be interpreted as limit cases of skewed Jensen divergences (see Nielsen and Boltz, 2011). Jensen divergences can be generalized using comparative convexity, and limit cases of these skewed Jensen divergences generalizations yields generalized Bregman divergence (see Nielsen and Nock, 2017).
The Bregman chord divergence is obtained by taking a chord instead of a tangent line.
Bregman divergence on other objects
Bregman divergences can also be defined between matrices, between functions, and between measures (distributions). Bregman divergences between matrices include the Stein's loss and von Neumann entropy. Bregman divergences between functions include total squared error, relative entropy, and squared bias; see the references by Frigyik et al. below for definitions and properties. Similarly Bregman divergences have also been defined over sets, through a submodular set function which is known as the discrete analog of a convex function. The submodular Bregman divergences subsume a number of discrete distance measures, like the Hamming distance, precision and recall, mutual information and some other set based distance measures (see Iyer & Bilmes, 2012 for more details and properties of the submodular Bregman.)
For a list of common matrix Bregman divergences, see Table 15.1 in.
Applications
In machine learning, Bregman divergences are used to calculate the bi-tempered logistic loss, performing better than the softmax function with noisy datasets.
Bregman divergence is used in the formulation of mirror descent, which includes optimization algorithms used in machine learning such as gradient descent and the hedge algorithm.