Properties of the Relative Entropy

In this article, we introduce some important properties of the relative entropy, including lower semi-continuity, convexity, and compactness of sublevel sets, all based on the Donsker-Varadhan variational formula.

Let $X$ be a Polish space, and let $\mathcal{B}$ be the corresponding Borel $\sigma$-algebra. We denote by $\mathcal{P}(X)$ the set of probability measures on $(X,\mathcal{B})$.

Definition 1. For $\nu\in \mathcal{P}(X)$, the relative entropy functional (also called KL divergence) is a mapping $H(\cdot \Vert\nu ): \mathcal{P}(X)\to \overline{\mathbb{R}}$, defined by

$$
H(\mu\Vert\nu):=\begin{cases}
\int_X \log \left(\frac{d\mu}{d\nu}\right)d\mu,& \text{if } \mu\ll \nu\\
+\infty,& \text{otherwise}
\end{cases}.
$$
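
For a concrete instance, take $X=\{0,1\}$ with $\mu=\operatorname{Ber}(p)$ and $\nu=\operatorname{Ber}(q)$, where $p\in[0,1]$ and $q\in(0,1)$. Then $\mu\ll\nu$ and the definition reduces to the familiar discrete formula
$$
H(\mu\Vert\nu)=p\log\frac{p}{q}+(1-p)\log\frac{1-p}{1-q},
$$
with the convention $0\log 0=0$; for instance, $p=\tfrac{1}{2}$ and $q=\tfrac{1}{4}$ give $H(\mu\Vert\nu)=\tfrac{1}{2}\log\tfrac{4}{3}\approx 0.144$.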

Remark 1. Since $s(\log s)^{-}$ is bounded on $[0,+\infty)$, whenever $\mu\in \mathcal{P}(X)$ is absolutely continuous with respect to $\nu$,

$$
\int_X \left(\log \frac{d\mu}{d\nu}\right)^{-}d\mu=\int_X \frac{d\mu}{d\nu}\left(\log \frac{d\mu}{d\nu}\right)^{-}d\nu<+\infty.
$$

It follows that the relative entropy is well-defined.
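
Explicitly, $s(\log s)^{-}=-s\log s$ for $s\in(0,1)$ and vanishes elsewhere, and $-s\log s$ is maximized at $s=e^{-1}$, so
$$
\sup_{s\ge 0}\ s(\log s)^{-}=\frac{1}{e}\qquad\text{and hence}\qquad \int_X \frac{d\mu}{d\nu}\left(\log \frac{d\mu}{d\nu}\right)^{-}d\nu\le \frac{1}{e}.
$$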

Lemma 1. For $\nu\in \mathcal{P}(X)$, the relative entropy functional $H(\cdot\Vert\nu)$ is positive definite; namely, $H(\mu\Vert\nu)\ge 0$, and $H(\mu\Vert\nu)=0$ if and only if $\mu=\nu$.

Proof. It suffices to consider the case where $H(\mu\Vert\nu)<+\infty$, so that $\mu\ll\nu$. Since $s\log s\ge s-1$ for $s\ge 0$, with equality if and only if $s=1$,
$$
H(\mu\Vert\nu)=\int_X \frac{d\mu}{d\nu}\log \left(\frac{d\mu}{d\nu}\right) d\nu\ge \int_X \left(\frac{d\mu}{d\nu}-1\right) d\nu=\int_X d\mu-\int_Xd\nu=\mu(X)-\nu(X)=0.
$$
Equality holds if and only if
$$
\frac{d\mu}{d\nu}=1\quad \nu\text{-a.e.},
$$
that is, if and only if $\mu=\nu$. This completes the proof. $\square$
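
As a sanity check (not part of the proof), here is a minimal numerical verification of Lemma 1 on random discrete distributions; the helper `relative_entropy` below is our own illustrative implementation of Definition 1 for probability vectors.

```python
import numpy as np

def relative_entropy(mu, nu):
    """H(mu || nu) for probability vectors, with the convention 0*log 0 = 0.

    Returns +inf when mu is not absolutely continuous with respect to nu.
    """
    mu, nu = np.asarray(mu, dtype=float), np.asarray(nu, dtype=float)
    if np.any((nu == 0) & (mu > 0)):  # mu is not << nu
        return np.inf
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / nu[mask])))

rng = np.random.default_rng(0)
for _ in range(1000):
    mu = rng.dirichlet(np.ones(5))
    nu = rng.dirichlet(np.ones(5))
    assert relative_entropy(mu, nu) >= 0.0   # Lemma 1: H >= 0
    assert relative_entropy(nu, nu) < 1e-12  # Lemma 1: H(nu || nu) = 0
```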

Proposition 1. Let $f:X\to\mathbb{R}$ be a bounded measurable function and $\nu\in \mathcal{P}(X)$. The following conclusions hold.

(a) We have the variational formula
$$
-\log \int_X e^{-f} d\nu=\inf_{\mu\in \mathcal{P}(X)} \left\{ H(\mu\Vert \nu)+\int_X f\ d\mu \right\}.\tag{1}
$$
(b) Let $\mu_0$ denote the probability measure on $X$ which is absolutely continuous with respect to $\nu$ and satisfies
$$
\frac{d\mu_0}{d\nu}(x) = \frac{e^{-f(x)}}{\int_X e^{-f}d\nu}.
$$
Then the infimum in the variational formula (1) is uniquely attained at $\mu_0$.

Proof. For part (a), it suffices to prove that
$$
-\log \int_X e^{-f} d\nu=\inf \left\{ H(\mu\Vert \nu)+\int_X f\ d\mu: \mu \in \mathcal{P}(X),\ H(\mu\Vert\nu)<+\infty\right\}.
$$
If $H(\mu\Vert\nu)<+\infty$, then $\mu\ll\nu$. Moreover, since $f$ is bounded, the density $\frac{d\mu_0}{d\nu}$ is strictly positive, so $\nu\ll \mu_0$ and hence $\mu\ll\mu_0$. Thus

$$
\begin{aligned}
H(\mu\Vert \nu)+\int_X f\ d\mu&= \int_X \log \left(\frac{d\mu}{d\nu}\right)d\mu +\int_X f\ d\mu\\
&=\int_X \log \left(\frac{d\mu}{d\mu_0}\right)d\mu+\int_X \log \left(\frac{d\mu_0}{d\nu}\right)d\mu +\int_X f\ d\mu\\
&=H(\mu\Vert\mu_0)-\log \int_X e^{-f} d\nu.
\end{aligned}
$$
By Lemma 1, $H(\mu\Vert\mu_0)\ge 0$, with equality if and only if $\mu=\mu_0$; both (a) and (b) follow. $\square$
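
To make Proposition 1 concrete, here is a small numerical check on a four-point space, reusing the `relative_entropy` helper above; the space and the random choices are illustrative only.

```python
# Check Proposition 1 on a four-point space.
rng = np.random.default_rng(1)
nu = rng.dirichlet(np.ones(4))
f = rng.normal(size=4)  # a bounded measurable f

lhs = -np.log(np.sum(np.exp(-f) * nu))           # -log int e^{-f} dnu
mu0 = np.exp(-f) * nu / np.sum(np.exp(-f) * nu)  # the claimed minimizer

# mu0 attains the infimum in (1) ...
assert abs(relative_entropy(mu0, nu) + np.dot(f, mu0) - lhs) < 1e-12
# ... and no competitor does better.
for _ in range(1000):
    mu = rng.dirichlet(np.ones(4))
    assert relative_entropy(mu, nu) + np.dot(f, mu) >= lhs - 1e-12
```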

Now, we denote by $C_b(X)$ the space of bounded continuous functions mapping $X$ into $\mathbb{R}$ and by $B_b(X)$ the space of bounded Borel measurable functions mapping $X$ into $\mathbb{R}$.

Theorem 1 (Donsker-Varadhan variational formula). For each $\mu$ and $\nu$ in $\mathcal{P}(X)$,
$$
H(\mu\Vert\nu)=\sup_{g\in C_b(X)}\left\{\int_X g\ d\mu-\log\int_X e^{g}\ d\nu\right\}=\sup_{h\in B_b(X)}\left\{\int_X h\ d\mu-\log\int_X e^{h}\ d\nu\right\}.
$$

Proof. We first show that
$$
\sup_{g\in C_b(X)}\left\{\int_X g\ d\mu-\log\int_X e^{g}\ d\nu\right\}=\sup_{h\in B_b(X)}\left\{\int_X h\ d\mu-\log\int_X e^{h}\ d\nu\right\}.
$$
On the one hand, given $\varepsilon>0$: since $X$ is a Polish space and $\mu$, $\nu$ are Borel probability measures, Ulam’s Theorem implies that $\mu$ and $\nu$ are tight. Hence, there is a compact subset $K$ of $X$ such that
$$
\mu(K^c)\le \varepsilon\quad \text{and}\quad \nu(K^c)\le \varepsilon.
$$
By Lusin’s Theorem and the Tietze-Urysohn Extension Theorem, for any $h\in B_b(X)$ there exist a closed (hence compact) subset $F$ of $K$ such that
$$
\mu(K\setminus F)\le \varepsilon \quad \text{and}\quad \nu(K\setminus F)\le \varepsilon,
$$
and a function $g\in C_b(X)$ such that
$$
g|_F=h|_F \quad\text{and}\quad \lVert g\rVert_{\infty}\le \lVert h\rVert_{\infty}.
$$
It follows that
$$
\mu(F^c)\le \mu(K^c)+ \mu(K\setminus F)\le 2\varepsilon \quad \text{and}\quad \nu(F^c)\le \nu(K^c)+ \nu(K\setminus F)\le 2\varepsilon.
$$
Since $g=h$ on $F$ and $\lVert g\rVert_{\infty}\le \lVert h\rVert_{\infty}$,
$$
\int_X h\ d\mu\le \int_X g\ d\mu+2\lVert h\rVert_{\infty}\,\mu(F^c)\le \int_X g\ d\mu+4\lVert h\rVert_{\infty}\,\varepsilon,
$$
and, since $\int_X e^{h}\ d\nu\ge e^{-\lVert h\rVert_{\infty}}$,
$$
\int_X e^{g}\ d\nu\le \int_X e^{h}\ d\nu+2e^{\lVert h\rVert_{\infty}}\,\nu(F^c)\le \left(1+4e^{2\lVert h\rVert_{\infty}}\varepsilon\right)\int_X e^{h}\ d\nu.
$$
Taking logarithms in the second estimate and combining it with the first,
$$
\int_X h\ d\mu-\log\int_X e^{h}\ d\nu \le \int_{X} g\ d\mu-\log\int_{X} e^{g}\ d\nu+4\lVert h\rVert_{\infty}\,\varepsilon+\log\left(1+4e^{2\lVert h\rVert_{\infty}}\varepsilon\right).
$$

Taking the supremum over $g\in C_b(X)$ and letting $\varepsilon\to 0$ yields
$$
\int_X h\ d\mu-\log\int_X e^{h}\ d\nu \le \sup_{g\in C_b(X)}\left\{\int_X g\ d\mu-\log\int_X e^{g}\ d\nu\right\},
$$
and since $h\in B_b(X)$ is arbitrary, we conclude that
$$
\sup_{h\in B_b(X)}\left\{\int_X h\ d\mu-\log\int_X e^{h}\ d\nu\right\}\le \sup_{g\in C_b(X)}\left\{\int_X g\ d\mu-\log\int_X e^{g}\ d\nu\right\}.
$$
On the other hand, since $C_b(X)\subset B_b(X)$, we have
$$
\sup_{h\in B_b(X)}\left\{\int_X h\ d\mu-\log\int_X e^{h}\ d\nu\right\}\ge \sup_{g\in C_b(X)}\left\{\int_X g\ d\mu-\log\int_X e^{g}\ d\nu\right\}.
$$
Hence
$$
\sup_{g\in C_b(X)}\left\{\int_X g\ d\mu-\log\int_X e^{g}\ d\nu\right\}=\sup_{h\in B_b(X)}\left\{\int_X h\ d\mu-\log\int_X e^{h}\ d\nu\right\}.
$$
Next, we denote
$$
R(\mu,\nu):=\sup_{g\in C_b(X)}\left\{\int_X g\ d\mu-\log\int_X e^{g}\ d\nu\right\}=\sup_{h\in B_b(X)}\left\{\int_X h\ d\mu-\log\int_X e^{h}\ d\nu\right\}
$$
and we need to show
$$
R(\mu,\nu)=H(\mu\Vert \nu).
$$
Since the function $g_0\equiv 0$ on $X$ is bounded and continuous, we have
$$
R(\mu,\nu)\ge\int_X g_0\ d\mu-\log\int_X e^{g_0}\ d\nu=0.
$$
By Proposition 1, for any $f\in B_b(X)$,
$$
H(\mu\Vert\nu)\ge -\int_X f\ d\mu-\log \int_X e^{-f}\ d\nu.
$$
Replacing $f$ by $h:=-f$, and taking the supremum over $h\in B_b(X)$, we obtain
$$
H(\mu\Vert\nu)\ge\sup_{h\in B_b(X)}\left\{\int_X h\ d\mu-\log\int_X e^{h}\ d\nu\right\}=R(\mu,\nu).
$$
Now, we only need to show that
$$
R(\mu,\nu)\ge H(\mu\Vert\nu).
$$
We may assume that $R(\mu,\nu)<+\infty$ for otherwise there is nothing to prove.

We claim that under this condition, $\mu\ll\nu$. Indeed, let $A\in \mathcal{B}$ with $\nu(A)=0$ and take $h=r\mathbf{1}_A$ with $r>0$. Since $\int_X e^{h}\ d\nu=\nu(A^c)+e^{r}\nu(A)=1$, we have
$$
r\mu(A)=\int_X h\ d\mu-\log\int_X e^{h}\ d\nu\le R(\mu,\nu)<\infty.
$$
Taking $r\to\infty$ gives $\mu(A)=0$ as claimed.

Since $\mu\ll\nu$, the Radon-Nikodym derivative
$$
f:=\frac{d\mu}{d\nu}
$$
exists.

Case 1: there exist constants $0<c\le C$ such that $c\le f\le C$ on $X$. Then $h=\log f$ is bounded and measurable. Hence by
$$
\log\int_X e^{h}\ d\nu=\log\int_X f\ d\nu=\log\int_X d\mu=0,
$$

we have

$$
H(\mu\Vert \nu)=\int_X \log f\ d\mu= \int_X h\ d\mu- \log\int_X e^{h}\ d\nu \le R(\mu,\nu).
$$
Case 2: $f\ge c$ on $X$ for some constant $c>0$, but $f$ is not bounded above. For $n\in \mathbb{N}$, set $f_n=f\wedge n$; then $h_n:=\log f_n$ is bounded (between $\log c$ and $\log n$) and measurable, so $\int_X \log f_n\ d\mu-\log\int_X f_n\ d\nu\le R(\mu,\nu)$. By the Monotone Convergence Theorem (applied to $\log f_n-\log c\ge 0$ and to $f_n$), we obtain
$$
H(\mu\Vert \nu)=\int_X \log f\ d\mu=\lim_{n\to\infty}\int_X \log f_n\ d\mu\le R(\mu,\nu)+\lim_{n\to\infty} \log\int_X f_n\ d\nu=R(\mu,\nu).
$$
Case 3: general $f$. For $t\in [0,1]$, define
$$
\mu_t:=t\nu+(1-t)\mu\quad\text{and}\quad f_t:= \frac{d\mu_t}{d\nu}=t\cdot 1+(1-t)f.
$$
For each $t\in (0,1]$, we have $f_t\ge t>0$ on $X$, so Case 2 applies (with $c=t$) and
$$
H(\mu_t\Vert\nu)\le R(\mu_t,\nu).
$$
We now prove that
$$
\lim_{t\to 0} H(\mu_t\Vert\nu)=H(\mu\Vert\nu)\quad \text{and}\quad \lim_{t\to 0}R(\mu_t,\nu)=R(\mu,\nu),
$$
which will complete the proof.

Since $s\log s$ is convex on $[0,+\infty)$,
$$
H(\mu_t\Vert \nu)=\int_X f_t\log f_t\ d\nu\le (1-t) \int_X f\log f\ d\nu=(1-t)H(\mu\Vert \nu).
$$
Therefore,
$$
\limsup_{t\to 0} H(\mu_t\Vert \nu)\le H(\mu\Vert \nu).
$$

Moreover, since $f_t\ge t$, we have $\log f_t\ge \log t$; and by the concavity of $\log s$,
$$
\log f_t=\log\bigl(t\cdot 1+(1-t)f\bigr)\ge (1-t)\log f.
$$
It follows that
$$
H(\mu_t\Vert \nu)=\int_X f_t\log f_t\ d\nu=t\int_X \log f_t \ d\nu+(1-t) \int_X f\log f_t\ d\nu\ge t\log t +(1-t)^2 H(\mu\Vert\nu).
$$
Therefore,
$$
\liminf_{t\to 0} H(\mu_t\Vert \nu)\ge H(\mu\Vert \nu).
$$

Thus we have
$$
\lim_{t\to 0} H(\mu_t\Vert \nu)=H(\mu\Vert \nu).
$$
Turning to $R(\mu_t,\nu)$: by Jensen’s inequality, for any $g\in C_b(X)$,
$$
-\log\int_X e^g\ d\nu\le \int_X (-\log e^g)\ d\nu=-\int_X g\ d\nu,
$$
and therefore
$$
0\le R(\nu,\nu)=\sup_{g\in C_b(X)}\left\{\int_X g\ d\nu-\log\int_X e^{g}\ d\nu\right\}\le 0.
$$
Therefore $R(\nu,\nu)=0$. The mapping $t\in [0,1]\mapsto R(\mu_t,\nu)$ is convex and lower semi-continuous, being a supremum of functions that are affine and continuous in $t$. Furthermore,
$$
0\le R(\mu_t,\nu)\le tR(\nu,\nu)+(1-t) R(\mu,\nu)=(1-t) R(\mu,\nu)\le R(\mu,\nu)<+\infty
$$
and so it is also bounded. A bounded, convex, lower semi-continuous function on $[0,1]$ is continuous, and therefore
$$
\lim_{t\to 0}R(\mu_t,\nu)=R(\mu,\nu).
$$
This completes the proof. $\square$
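
On a finite space with strictly positive $\mu$ and $\nu$, the supremum in Theorem 1 is attained at $h=\log(d\mu/d\nu)$, exactly as in Case 1 of the proof. A minimal numerical sketch, again reusing `relative_entropy` from above:

```python
# Check Theorem 1 on a five-point space with strictly positive mu, nu,
# so that h* = log(dmu/dnu) is bounded (Case 1 of the proof).
rng = np.random.default_rng(2)
mu = rng.dirichlet(np.ones(5))
nu = rng.dirichlet(np.ones(5))

def dv_objective(h):
    """int h dmu - log int e^h dnu for a vector h."""
    return np.dot(h, mu) - np.log(np.sum(np.exp(h) * nu))

h_star = np.log(mu / nu)
assert abs(dv_objective(h_star) - relative_entropy(mu, nu)) < 1e-12
# Every other bounded h gives a value that is no larger.
for _ in range(1000):
    h = rng.normal(size=5)
    assert dv_objective(h) <= relative_entropy(mu, nu) + 1e-12
```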

Theorem 2. The relative entropy $H(\mu\Vert \nu)$ is a convex, lower semi-continuous function of $(\mu,\nu)\in \mathcal{P}(X)\times \mathcal{P}(X)$, with respect to the topology of weak convergence. In particular, $H(\mu\Vert \nu)$ is a convex, lower semi-continuous function of $\mu$ and of $\nu$ separately. In addition, for fixed $\nu\in \mathcal{P}(X)$, the relative entropy $H(\cdot\Vert\nu)$ is strictly convex on the set
$$
\{\mu\in \mathcal{P}(X): H(\mu\Vert\nu)<+\infty\}.
$$

Proof. By the Donsker-Varadhan variational formula (Theorem 1), we have
$$
H(\mu\Vert\nu)=\sup_{g\in C_b(X)}\left\{\int_X g\ d\mu-\log\int_X e^{g}\ d\nu\right\}.
$$
For each fixed $g\in C_b(X)$, the mapping
$$
(\mu,\nu)\in \mathcal{P}(X)\times \mathcal{P}(X)\mapsto \int_X g\ d\mu-\log\int_X e^{g}\ d\nu
$$
is convex (it is linear in $\mu$, and convex in $\nu$ because $-\log$ is convex) and continuous with respect to weak convergence. As a supremum of such mappings, $H(\mu\Vert \nu)$ is a convex, lower semi-continuous function of $(\mu,\nu)\in \mathcal{P}(X)\times \mathcal{P}(X)$.

For the strict convexity, note that every $\mu$ in the set
$$
\{\mu\in \mathcal{P}(X): H(\mu\Vert\nu)<+\infty\}
$$
satisfies $\mu\ll\nu$, so that
$$
H(\mu\Vert\nu)=\int_X \frac{d\mu}{d\nu}\log \left(\frac{d\mu}{d\nu}\right) d\nu.
$$
If $\mu_1\neq\mu_2$ both lie in this set, their densities differ on a set of positive $\nu$-measure, and the strict convexity of $s\mapsto s\log s$ on $[0,+\infty)$ yields $H(t\mu_1+(1-t)\mu_2\Vert\nu)<tH(\mu_1\Vert\nu)+(1-t)H(\mu_2\Vert\nu)$ for $t\in(0,1)$. This completes the proof. $\square$
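
Lower semi-continuity cannot be upgraded to continuity with respect to weak convergence. For instance, take $\nu=\operatorname{Unif}[0,1]$ and the discrete approximations $\mu_n:=\frac{1}{n}\sum_{k=1}^{n}\delta_{k/n}$. Then $\mu_n$ converges weakly to $\nu$, yet $\mu_n\not\ll\nu$ for every $n$, so
$$
\liminf_{n\to\infty}H(\mu_n\Vert\nu)=+\infty>0=H(\nu\Vert\nu).
$$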

Remark. The relative entropy is in general not continuous, even with respect to strong (total-variation) convergence of probability measures.

Counterexample. For $n>e^{e}$ (so that $a_n:=\log\log n>1$), let
$$
\mu_n:=\frac{1}{a_n}\operatorname{Unif}[1,n]+\left(1-\frac{1}{a_n}\right)\operatorname{Unif}[-2,-1]
$$
and
$$
\mu:=\operatorname{Unif}[-2,-1].
$$
Then it is easy to show that
$$
\operatorname{TV}(\mu_n,\mu)=\frac{1}{a_n}\to 0.
$$
However, let $\gamma$ be the standard Gaussian measure on $\mathbb{R}$; then one can check that
$$
H(\mu_n\Vert \gamma)\to +\infty\neq H(\mu\Vert \gamma)\quad \text{as } n\to\infty.
$$
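
For the skeptical reader, here is a numerical illustration of the counterexample. The closed-form expression below is our own direct computation: we write out the Lebesgue density of $\mu_n$ ($\frac{1}{a_n(n-1)}$ on $[1,n]$ and $1-\frac{1}{a_n}$ on $[-2,-1]$), integrate it against the logarithm of the standard Gaussian density, and evaluate the resulting elementary integrals.

```python
import numpy as np

def H_mu_n_gamma(n):
    """H(mu_n || gamma) in closed form, with a = log(log n) and gamma = N(0, 1)."""
    a = np.log(np.log(n))
    c = 0.5 * np.log(2 * np.pi)  # -log phi(x) = x^2/2 + c
    p1 = 1.0 / (a * (n - 1))     # density of mu_n on [1, n]
    H1 = (np.log(p1) + c) / a + (n**3 - 1) / (6 * a * (n - 1))
    p2 = 1.0 - 1.0 / a           # density of mu_n on [-2, -1]
    H2 = p2 * (np.log(p2) + c + 7.0 / 6.0)  # int_{-2}^{-1} x^2/2 dx = 7/6
    return H1 + H2

for n in (100, 10_000, 1_000_000):
    a = np.log(np.log(n))
    print(f"n = {n:>9}: TV = {1 / a:.3f}, H(mu_n || gamma) = {H_mu_n_gamma(n):.3g}")
```

The total-variation distance tends to $0$ only at rate $1/\log\log n$, while the dominant term of $H(\mu_n\Vert \gamma)$ grows like $n^2/(6\log\log n)$.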

Theorem 3. For each $\nu\in \mathcal{P}(X)$, the relative entropy has compact sublevel sets. That is, for each $M<+\infty$ the sublevel set
$$
\{\mu\in \mathcal{P}(X): H(\mu\Vert\nu)\le M\}
$$
is a compact subset of $\mathcal{P}(X)$, equipped with the topology of weak convergence.

Proof. Since $X$ is Polish, the weak topology on $\mathcal{P}(X)$ is metrizable (for instance, by the Prohorov metric), so it suffices to prove sequential compactness. Let $\{\mu_n,\ n\in \mathbb{N}\}$ be any sequence in the sublevel set
$$
\{\mu\in \mathcal{P}(X): H(\mu\Vert\nu)\le M\},
$$
which implies
$$
\sup_{n\in \mathbb{N}} H(\mu_n\Vert \nu)\le M<+\infty.
$$
By the Donsker-Varadhan variational formula, for any $h\in B_b(X)$ and each $n\in \mathbb{N}$, we have
$$
\int_X h\ d\mu_n-\log\int_X e^h\ d\nu\le H(\mu_n\Vert \nu)\le M.
$$

Let $\delta>0$ and $\varepsilon>0$ be given. Since $\nu$ is a Borel probability measure on the Polish space $X$, Ulam’s Theorem implies that $\nu$ is tight: there exists a compact set $K$ such that
$$
\nu(K^c)\le \delta.
$$
Take
$$
h(x)=\begin{cases}
0,& x\in K\\
\log \left(1+\frac{1}{\delta}\right),& x\in K^c
\end{cases},
$$
which is bounded and Borel measurable. Since $\int_X e^{h}\ d\nu=\nu(K)+\left(1+\frac{1}{\delta}\right)\nu(K^c)$, we have for each $n\in \mathbb{N}$,
$$
\int_X h\ d\mu_n-\log\int_X e^h\ d\nu=\log \left(1+\frac{1}{\delta}\right) \mu_n(K^c)-\log \left[\nu(K)+\left(1+\frac{1}{\delta}\right)\nu(K^c)\right] \le M.
$$
It follows that
$$
\mu_n(K^c)\le \frac{1}{\log \left(1+\frac{1}{\delta}\right)} \left(M+\log \left[ \nu(K)+\left(1+\frac{1}{\delta}\right) \nu(K^c)\right]\right)\le \frac{M+\log 2}{\log \left(1+\frac{1}{\delta}\right)},
$$
where the last inequality uses $\nu(K)+\left(1+\frac{1}{\delta}\right)\nu(K^c)=1+\frac{\nu(K^c)}{\delta}\le 2$.
Hence, given $\varepsilon>0$, we can choose $\delta>0$ such that
$$
\frac{M+\log 2}{\log \left(1+\frac{1}{\delta}\right)}\le \varepsilon,
$$

which shows that $\{\mu_n,\ n\in \mathbb{N}\}$ is tight. By Prohorov’s Theorem, there exist $\mu\in \mathcal{P}(X)$ and a subsequence $(\mu_{n_k})$ such that $\mu_{n_k}$ converges weakly to $\mu$. The lower semi-continuity of $H(\cdot\Vert\nu)$ (Theorem 2) yields
$$
H(\mu\Vert\nu)\le\liminf_{k\to\infty} H(\mu_{n_k}\Vert\nu)\le M,
$$
which means that $\mu$ also lies in the sublevel set. This completes the proof. $\square$
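
For concreteness, the display in the proof can be solved for $\delta$ explicitly: given $\varepsilon>0$, it suffices to take
$$
\delta\le \frac{1}{e^{(M+\log 2)/\varepsilon}-1};
$$
for example, $M=1$ and $\varepsilon=0.1$ require roughly $\delta\le 4.4\times 10^{-8}$.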

The cover image of this article was taken during a helicopter sightseeing flight over Aoraki / Mount Cook in New Zealand.
