<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <author>
    <name>Handstein Wang</name>
  </author>
  <generator uri="https://hexo.io/">Hexo</generator>
  <id>https://handsteinwang.github.io/</id>
  <link href="https://handsteinwang.github.io/" rel="alternate"/>
  <link href="https://handsteinwang.github.io/atom.xml" rel="self"/>
  <rights>All rights reserved 2026, Handstein Wang</rights>
  <title>Handstein Wang</title>
  <updated>2026-04-15T16:00:00.000Z</updated>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Gradient Flows" scheme="https://handsteinwang.github.io/categories/Gradient-Flows/"/>
    <category term="Gradient Flows" scheme="https://handsteinwang.github.io/tags/Gradient-Flows/"/>
    <content>
<![CDATA[<p>In this article, we will introduce the gradient flows in Wasserstein space.</p><span id="more"></span><p>For the Wasserstein space $W_p := (\mathcal P_p(X), W_p)$, the standard considerations from fluid mechanics tell us that the density $\mu_t$ of a family of moving particles satisfies the continuity equation<br>$$<br>\partial_t \mu_t + \nabla \cdot (\mu_t v_t) = 0<br>$$<br>for some velocity field $v_t\in L^p(\mu_t;\mathbb R^d)$. Moreover, we want to connect the $L^p$ norm of $v_t$ with the metric derivative $|\mu’|(t)$.</p><p><strong>Theorem 1.</strong> Let $(\mu_t)_{t\in[0,1]}$ be an absolutely continuous curve in $W_p(\Omega)$ (for $p&gt;1$, and $\Omega\subset \mathbb R^d$ an open domain). Then for a.e. $t\in[0,1]$, there exists a vector field $v_t\in L^p(\mu_t;\mathbb R^d)$ such that</p><ul><li>the continuity equation $\partial_t \mu_t + \nabla\cdot(\mu_t v_t)=0$ is satisfied in the sense of distributions,</li><li>for a.e. $t$, we have<br>$$<br>\Vert v_t\Vert_{L^p(\mu_t)} \le |\mu’|(t),<br>$$<br>where $|\mu’|(t)$ denotes the metric derivative at the time $t$ of the curve $t\mapsto \mu_t$ w.r.t. the distance $W_p$.</li></ul><p>Conversely, if $(\mu _t)$ is a family of measures in $\mathcal P_p (\Omega)$ and for each $t$, we have a vector field $v_t\in L^p(\mu_t;\mathbb R^d)$ with $\int_0^1 \Vert v_t\Vert _{L^p(\mu_t)}\ dt&lt;+\infty$ solving</p><p>$$<br>\partial_t \mu_t + \nabla\cdot(\mu_t v_t)=0,<br>$$</p><p>then $(\mu _t)$ is absolutely continuous in $W _p(\Omega)$ and for a.e. 
$t$, we have</p><p>$$<br>|\mu’|(t)\le \Vert v_t\Vert_{L^p(\mu_t)}.<br>$$</p><p>Note that as a consequence of the second part of the statement, the vector field $v_t$ introduced in the first part must satisfy<br>$$<br>\Vert v_t\Vert_{L^p(\mu_t)} = |\mu’|(t).<br>$$</p><h2 id="McCann’s-displacement-interpolation"><a href="#McCann’s-displacement-interpolation" class="headerlink" title="McCann’s displacement interpolation"></a>McCann’s displacement interpolation</h2><p><strong>Theorem 2.</strong> If $\Omega\subset \mathbb R^d$ is convex, then all the spaces $W_p(\Omega)$ are length spaces and if $\mu$ and $\nu$ belong to $W_p(\Omega)$ and $\gamma$ is the optimal transport plan from $\mu$ to $\nu$ for the cost $c_p(x,y)=|x-y|^p$, then the curve<br>$$<br>\mu^\gamma(t):=(\pi_t)_\sharp \gamma<br>$$<br>where $\pi_t:\Omega\times\Omega\to\Omega$ is given by<br>$$<br>\pi_t(x,y)=(1-t)x+ty<br>$$<br>is a constant-speed geodesic from $\mu$ to $\nu$. In the case $p&gt;1$, all constant-speed geodesics are of this form, and, if $\mu$ is absolutely continuous, then there is only one geodesic and it has the form<br>$$<br>\mu_t=(T_t)_\sharp \mu,\qquad T_t=(1-t)\operatorname{id}+tT,<br>$$<br>where $T$ is the optimal transport map from $\mu$ to $\nu$. 
In this case, the velocity field $v_t$ of the geodesic $\mu_t$ is given by<br>$$<br>v_t=(T-\operatorname{id})\circ (T_t)^{-1}.<br>$$</p><p>In particular, for $t=0$, we have $v_0=-\nabla\varphi$ and for $t=1$, we have $v_1=-\nabla\psi$, where $\varphi$ is the Kantorovich potential in the transport from $\mu$ to $\nu$ and $\psi=\varphi^c$.</p><p>Using the characterization of constant-speed geodesics as minimizers of a strictly convex kinetic energy, we have</p><ul><li>Looking for an optimal transport for the cost $c(x,y)=|x-y|^p$ is equivalent to looking for constant-speed geodesics in $W_p$.</li><li>Constant-speed geodesics may be found by minimizing $\int_0^1 |\mu’|(t)^p\ dt$.</li><li>In the case of $W_p$, we have $|\mu’|(t)^p=\int_\Omega |v_t|^p\ d\mu_t$, where $v$ is a velocity field solving the continuity equation together with $\mu$.</li></ul><p>As a consequence of these considerations, for $p&gt;1$, solving the kinetic energy minimization problem<br>$$<br>\min \left\{ \int_0^1 \int_\Omega |v_t|^p\ d\rho_t\ dt \  ; \  \partial_t \rho_t+\nabla\cdot(\rho_t v_t)=0,\ \rho_0=\mu,\ \rho_1=\nu \right\}<br>$$<br>selects constant-speed geodesics connecting $\mu$ to $\nu$ and hence allows us to find the optimal transport between $\mu$ and $\nu$. This is what is usually called the <strong>Benamou–Brenier formula</strong>.</p><h2 id="Minimizing-Movement-Scheme-in-the-Wasserstein-Space-and-Evolution-PDEs"><a href="#Minimizing-Movement-Scheme-in-the-Wasserstein-Space-and-Evolution-PDEs" class="headerlink" title="Minimizing Movement Scheme in the Wasserstein Space and Evolution PDEs"></a>Minimizing Movement Scheme in the Wasserstein Space and Evolution PDEs</h2><p>From now on, we consider $W_2(\Omega)$ and the Minimizing Movement Scheme (MMS)<a id="eq1"></a><br>$$<br>\rho_{k+1}^\tau \in \operatorname{argmin}_\rho \left\{ F(\rho)+\frac{W_2^2(\rho,\rho_k^\tau)}{2\tau} \right\}. 
\tag{1}<br>$$</p><p>and denote<br>$$<br>\mathcal T_c(\rho,\nu):=\min\left\{ \int c(x,y)\ d\gamma,\ \gamma\in\Pi(\rho,\nu) \right\},<br>$$<br>for $\nu=\rho_k^\tau$, $c(x,y)=|x-y|^2$.</p><p>Given a functional $G:\mathcal P(\Omega)\to\mathbb R$, we call $\dfrac{\delta G}{\delta \rho}(\rho)$, if it exists, the unique (up to additive constants) function such that<br>$$<br>\left.\frac{d}{d\varepsilon}G(\rho+\varepsilon\chi)\right|_{\varepsilon=0}<br>=<br>\int \frac{\delta G}{\delta \rho}(\rho)\ d\chi,<br>$$<br>for every perturbation $\chi$ such that, at least for $\varepsilon\in[0,\bar\varepsilon]$, $\rho+\varepsilon\chi\in\mathcal P(\Omega)$. The function $\dfrac{\delta G}{\delta \rho}(\rho)$ is called the <strong>first variation</strong> of the functional $G$ at $\rho$.</p><p>Examples: Let $f:\mathbb R\to\mathbb R$ be a convex superlinear function, $V:\Omega\to\mathbb R$, $W:\mathbb R^d\to\mathbb R$ be regular enough, and $W$ is taken symmetric, i.e.<br>$$<br>W(z)=W(-z).<br>$$</p><p>We have three functionals<br>$$<br>\mathcal F(\rho)=<br>\begin{cases}<br>\displaystyle \int f(\rho(x))\ dx &amp; \text{if }\rho\ll \operatorname{leb},\\\<br>+\infty &amp; \text{otherwise},<br>\end{cases}<br>$$<br>$$<br>\mathcal V(\rho)=\int V\ d\rho,\qquad<br>\mathcal W(\rho)=\frac12\iint W(x-y)\ d\rho(x)\ d\rho(y).<br>$$</p><p>Then we have<br>$$<br>\frac{\delta\mathcal F}{\delta \rho}(\rho)=f’(\rho),\qquad<br>\frac{\delta\mathcal V}{\delta \rho}(\rho)=V,\qquad<br>\frac{\delta\mathcal W}{\delta \rho}(\rho)=W\star \rho.<br>$$</p><p><strong><em>Proof</em></strong>: We have</p><p>$$<br>\frac{d}{d\varepsilon} \mathcal F(\rho+\varepsilon\chi)\Big| _{\varepsilon=0}<br>=<br>\lim _{\varepsilon\to 0} \frac{\mathcal F(\rho+\varepsilon\chi)-\mathcal F(\rho)}{\varepsilon}<br>=<br>\lim _{\varepsilon\to 0}\int \frac{f(\rho(x)+\varepsilon\chi(x))-f(\rho(x))}{\varepsilon}\ dx<br>=<br>\int f^\prime(\rho(x))\chi(x)\ dx<br>=<br>\int f^\prime(\rho(x))\ d\chi(x).<br>$$</p><p>Hence</p><p>$$<br>\frac{\delta\mathcal 
F}{\delta \rho}(\rho)=f’(\rho).<br>$$</p><p>Moreover,<br>$$<br>\frac{d}{d\varepsilon}\mathcal V(\rho+\varepsilon\chi)\Big| _{\varepsilon=0}<br>=<br>\lim _{\varepsilon\to 0}\frac{\mathcal V(\rho+\varepsilon\chi)-\mathcal V(\rho)}{\varepsilon}<br>=<br>\lim _{\varepsilon\to 0}\int V\ \frac{d(\rho+\varepsilon\chi)-d\rho}{\varepsilon}<br>=<br>\int V\ d\chi.<br>$$<br>Hence<br>$$<br>\frac{\delta\mathcal V}{\delta \rho}(\rho)=V.<br>$$</p><p>Finally,</p><p>$$<br>\begin{aligned}<br>\frac{d}{d\varepsilon}\mathcal W(\rho+\varepsilon\chi)\Big| _{\varepsilon=0}<br>&amp;=<br>\lim _{\varepsilon\to 0}\frac{\mathcal W(\rho+\varepsilon\chi)-\mathcal W(\rho)}{\varepsilon}<br>=<br>\lim _{\varepsilon\to 0}\frac{1}{2\varepsilon}\iint W(x-y)\Big[d((\rho+\varepsilon\chi)(x))\ d((\rho+\varepsilon\chi)(y))-d\rho(x)\ d\rho(y)\Big]\\\<br>&amp;=<br>\frac12\iint W(x-y)\ d\chi(x)\ d\rho(y)+\frac12\iint W(x-y)\ d\rho(x)\ d\chi(y)<br>=<br>\iint W(x-y)\ d\rho(y)\ d\chi(x).<br>\end{aligned}<br>$$</p><p>Hence<br>$$<br>\frac{\delta\mathcal W}{\delta \rho}(\rho)<br>=<br>\int W(x-y)\rho(y)\ dy<br>=<br>W\star \rho.<br>$$</p><p><strong>Proposition 1.</strong> Let $c:\Omega\times\Omega\to\mathbb R$ be a continuous cost function. 
Then the functional<br>$$<br>\rho\mapsto \mathcal T_c(\rho,\nu)<br>$$<br>is convex and its subdifferential at $\rho_0$ coincides with the set of Kantorovich potentials<br>$$<br>\left\{ \varphi\in C^0(\Omega):\ \int \varphi\ d\rho_0+\int \varphi^c\ d\nu = \mathcal T_c(\rho_0,\nu) \right\}.<br>$$</p><p>Moreover, if there is a unique $c$-concave Kantorovich potential $\varphi$ from $\rho_0$ to $\nu$, up to additive constants, then we also have<br>$$<br>\frac{\delta \mathcal T_c(\cdot,\nu)}{\delta \rho}(\rho_0)=\varphi.<br>$$</p><p>For <a href="#eq1">(1)</a>, the objective function is just<br>$$<br>F(\rho)+\frac{1}{\tau} \mathcal T_{c/2}(\rho,\rho_k^\tau).<br>$$</p><p>The optimality condition is<br>$$<br>\frac{\delta F}{\delta \rho}(\rho_{k+1}^\tau)+\frac{\varphi}{\tau}=\text{const},<br>$$<br>where $\varphi$ is the Kantorovich potential for the cost $\dfrac{c}{2}=\dfrac12|x-y|^2$ in the transport from $\rho_{k+1}^\tau$ to $\rho_k^\tau$.</p><p>Combining this with the fact that the optimal transport map is $T(x)=x-\nabla\varphi(x)$, we get</p><p>$$<br>-v(x) := \frac{T(x)-x}{\tau}=-\frac{\nabla\varphi(x)}{\tau} = \nabla\left( \frac{\delta F}{\delta \rho} (\rho) \right)(x).<br>$$</p><p>This suggests that in the limit $\tau\to 0$, we will find a solution of<br>$$<br>\partial_t \rho_t-\nabla\cdot\left(\rho \nabla\left[\frac{\delta F}{\delta \rho}(\rho)\right]\right)=0.<br>$$</p><p>Examples: </p><ul><li>For $\mathcal F(\rho)=\int f(\rho(x))\ dx$ with $f(u)=u\log u$, we have</li></ul><p>$$<br>\frac{\delta\mathcal F}{\delta \rho}(\rho)=f’(\rho)=\log\rho+1,\qquad<br>\nabla\frac{\delta\mathcal F}{\delta \rho}(\rho)=\frac{\nabla\rho}{\rho}<br>$$<br>so we obtain<br>$$<br>\partial_t \rho_t=\Delta \rho_t<br>$$<br>which is just the heat equation.</p><ul><li><p>For $F(\rho)=\int f(\rho(x))\ dx+\int V(x)\ d\rho(x)$, we get<br>$$<br>\frac{\delta F}{\delta \rho}(\rho)=\log\rho+1+V<br>\Longrightarrow<br>\nabla\frac{\delta F}{\delta \rho}=\frac{\nabla\rho}{\rho}+\nabla V.<br>$$</p><p>We get the Fokker–Planck equation<br>$$<br>\partial_t \rho_t-\Delta 
\rho_t-\nabla\cdot(\rho_t\nabla V)=0.<br>$$</p></li></ul><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><p>Santambrogio, F. Euclidean, metric, and Wasserstein gradient flows: an overview. <em>Bull. Math. Sci.</em> <strong>7</strong>, 87–154 (2017).</p><blockquote><p>The cover image in this article was taken on North Stradbroke Island, Brisbane, Australia.</p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/04/16/gradient-flows-4/</id>
    <link href="https://handsteinwang.github.io/2026/04/16/gradient-flows-4/"/>
    <published>2026-04-15T16:00:00.000Z</published>
    <summary>
      <![CDATA[<p>In this article, we will introduce the gradient flows in Wasserstein space.</p>]]>
    </summary>
    <title>Gradient Flows in Wasserstein Space</title>
    <updated>2026-04-15T16:00:00.000Z</updated>
  </entry>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Gradient Flows" scheme="https://handsteinwang.github.io/categories/Gradient-Flows/"/>
    <category term="Gradient Flows" scheme="https://handsteinwang.github.io/tags/Gradient-Flows/"/>
    <content>
<![CDATA[<p>In this article, we will introduce the general theory in metric spaces.</p><span id="more"></span><h2 id="Preliminaries"><a href="#Preliminaries" class="headerlink" title="Preliminaries"></a>Preliminaries</h2><ul><li><strong>Metric derivative.</strong> Given a curve $x:[0,T]\to X$ valued in a metric space, we can define the speed</li></ul><p>$$<br>|x’|(t):=\lim_{h\to0}\frac{d(x(t),x(t+h))}{|h|}<br>$$<br>provided the limit exists.</p><ul><li><p><strong>Slope and modulus of the gradient.</strong></p><p><strong>Upper gradient.</strong> We say that $g:X\to\mathbb{R}$ is an upper gradient of $F$ if for every Lipschitz curve $x$ defined on $[t_0,t_1]$,</p></li></ul><p>$$<br>|F(x(t_1))-F(x(t_0))|\le \int_{t_0}^{t_1} g(x(t))|x’|(t)\ dt.<br>$$</p><p>If $F$ is Lipschitz continuous, a possible choice is the local Lipschitz constant<br>$$<br>|\nabla F|(x):=\limsup_{y\to x}\frac{|F(y)-F(x)|}{d(x,y)}.<br>$$</p><ul><li><strong>Descending slope (slope in short)</strong></li></ul><p>$$<br>|\nabla^-F|(x):=\limsup_{y\to x}\frac{[F(x)-F(y)]_+}{d(x,y)}.<br>$$</p><ul><li><strong>Geodesic convexity.</strong> On a geodesic metric space, we say a function $F$ is geodesically convex if for every pair of points $(x_0,x_1)$ there exists a constant-speed geodesic $x$ with $x(0)=x_0$ and $x(1)=x_1$ such that</li></ul><p>$$<br>F(x(t))\le (1-t)F(x(0))+tF(x(1)).<br>$$</p><p>We can also define $\lambda$-geodesic convexity by<br>$$<br>F(x(t))\le (1-t)F(x(0))+tF(x(1))-\lambda\frac{t(1-t)}{2}d^2(x(0),x(1)).<br>$$</p><h1 id="Existence-of-a-gradient-flow"><a href="#Existence-of-a-gradient-flow" class="headerlink" title="Existence of a gradient flow"></a>Existence of a gradient flow</h1><p>Let us suppose that the space $X$ and the function $F$ are such that every sub-level set $\{F\le c\}$ is compact in $X$, either for the topology induced by the distance $d$, or for a weaker topology such that $d$ is lower semi-continuous. $F$ is required to be l.s.c. in the same topology. 
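</p><p>Before developing the general theory, it may help to see the scheme in the simplest possible setting. The following is a toy sketch (my own illustration with hypothetical names, not taken from the text): on $X=\mathbb R$ with $F(x)=x^2/2$, each step minimizes $F(x)+|x-x_k^\tau|^2/(2\tau)$, whose optimality condition gives the implicit Euler step $x_{k+1}^\tau=x_k^\tau/(1+\tau)$, and as $\tau\to0$ the discrete curve approaches the gradient flow $x(t)=e^{-t}x_0$.</p>

```python
import math

# Toy Minimizing Movement Scheme on X = R with F(x) = x**2 / 2.
# Each step solves min_x F(x) + (x - x_k)**2 / (2 tau); the minimizer is
# explicit here: x_{k+1} = x_k / (1 + tau), i.e. implicit Euler for x' = -x.

def mms(x0, tau, T):
    """Iterate the scheme up to time T and return the final point."""
    x = x0
    for _ in range(round(T / tau)):
        x = x / (1.0 + tau)  # exact minimizer of the one-step problem
    return x

x0, T = 1.0, 1.0
for tau in (0.1, 0.01, 0.001):
    # Compare with the limit curve x(t) = exp(-t) * x0.
    print(tau, mms(x0, tau, T), abs(mms(x0, tau, T) - math.exp(-T) * x0))
```

<p>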
This is the minimal framework to guarantee existence of the minimizers at each step.</p><p>Even if estimate (11) is enough to provide compactness and thus the existence of GMM, it will never be enough to characterize the limit curve (indeed, it is satisfied by any discrete evolution where $x_{k+1}^\tau$ gives a better value than $x_k^\tau$, without any need for optimality).</p><p><strong>Variational interpolation</strong> (by De Giorgi). Once we fix $x_k^\tau$, for every $\theta\in(0,1]$, consider<br>$$<br>\min_x\ F(x)+\frac{d^2(x,x_k^\tau)}{2\theta\tau}<br>$$<br>and call $x(\theta)$ any minimizer of this problem and $\varphi(\theta)$ the minimal value.</p><p>Then we have</p><p><strong>(1)</strong> for $\theta\to0^+$, we have $x(\theta)\to x_k^\tau$ and $\varphi(\theta)\to F(x_k^\tau)$.</p><p><strong><em>Proof.</em></strong> Since $x(\theta)$ is the minimizer, we have<br>$$<br>F(x(\theta))\le F(x(\theta))+\frac{d^2(x(\theta),x_k^\tau)}{2\theta\tau}\le F(x_k^\tau).<br>$$<br>That is<br>$$<br>\frac{d^2(x(\theta),x_k^\tau)}{2\theta\tau}\le F(x_k^\tau)-F(x(\theta)).<br>$$<br>Since $x(\theta)\in\{F\le F(x_k^\tau)\}$, which is compact, and $F$ is l.s.c. (hence bounded from below on this compact set), the RHS above is bounded. Therefore $d(x(\theta),x_k^\tau)\to0$ as $\theta\to0^+$.</p><p>For $\varphi(\theta)$, we have $\varphi(\theta)\le F(x_k^\tau)$.</p><p>On the other hand<br>$$<br>\varphi(\theta)=F(x(\theta))+\frac{d^2(x(\theta),x_k^\tau)}{2\theta\tau}\ge F(x(\theta)).<br>$$<br>By the l.s.c. of $F$, we have<br>$$<br>F(x_k^\tau)\le \liminf_{\theta\to0^+}F(x(\theta))<br>\le \liminf_{\theta\to0^+}\varphi(\theta)<br>\le \limsup_{\theta\to0^+}\varphi(\theta)\le F(x_k^\tau).<br>$$<br>Hence<br>$$<br>\lim_{\theta\to0^+}\varphi(\theta)=F(x_k^\tau).\quad \square<br>$$</p><p><strong>(2)</strong> for $\theta=1$, we get back to the original problem with minimizer $x_{k+1}^\tau$.</p><p><strong>(3)</strong> the function $\varphi$ is non-increasing and hence a.e. differentiable. 
Moreover<br>$$<br>\varphi’(\theta)=-\frac{d^2(x(\theta),x_k^\tau)}{2\theta^2\tau}.<br>$$<br>which also means $d(x(\theta),x_k^\tau)$ does not depend on the minimizer $x(\theta)$ for all $\theta$ such that $\varphi’(\theta)$ exists.</p><p><strong><em>Proof.</em></strong> Let $G_\theta(x):=F(x)+\dfrac{d^2(x,x_k^\tau)}{2\theta\tau}$, then<br>$$<br>\varphi(\theta)=\min_x G_\theta(x)=G_\theta(x(\theta)).<br>$$</p><p>For sufficiently small $h&gt;0$, we have<br>$$<br>\varphi(\theta+h)\le G_{\theta+h}(x(\theta))<br>=F(x(\theta))+\frac{d^2(x(\theta),x_k^\tau)}{2(\theta+h)\tau}<br>$$<br>and<br>$$<br>\varphi(\theta)=F(x(\theta))+\frac{d^2(x(\theta),x_k^\tau)}{2\theta\tau}.<br>$$<br>Hence<br>$$<br>\frac{\varphi(\theta+h)-\varphi(\theta)}{h}<br>\le \frac{d^2(x(\theta),x_k^\tau)}{2\tau}\frac{\frac{1}{\theta+h}-\frac{1}{\theta}}{h}<br>= -\frac{d^2(x(\theta),x_k^\tau)}{2\theta(\theta+h)\tau}.<br>$$<br>Then<br>$$<br>\varphi’(\theta)=\lim_{h\to0}\frac{\varphi(\theta+h)-\varphi(\theta)}{h}<br>\le -\frac{d^2(x(\theta),x_k^\tau)}{2\theta^2\tau}.<br>\tag{$\star$}<br>$$</p><p>On the other hand<br>$$<br>\varphi(\theta)\le G_\theta(x(\theta+h))<br>=F(x(\theta+h))+\frac{d^2(x(\theta+h),x_k^\tau)}{2\theta\tau}.<br>$$<br>Meanwhile,<br>$$<br>\varphi(\theta+h)=F(x(\theta+h))+\frac{d^2(x(\theta+h),x_k^\tau)}{2(\theta+h)\tau}.<br>$$<br>We have<br>$$<br>\frac{\varphi(\theta+h)-\varphi(\theta)}{h}<br>\ge \frac{d^2(x(\theta+h),x_k^\tau)}{2\tau}\frac{\frac{1}{\theta+h}-\frac{1}{\theta}}{h}<br>= -\frac{d^2(x(\theta+h),x_k^\tau)}{2\theta(\theta+h)\tau}.<br>$$<br>Let $h\to0$, we have<br>$$<br>\varphi’(\theta)\ge -\limsup_{h\to0}\frac{d^2(x(\theta+h),x_k^\tau)}{2\theta(\theta+h)\tau}.<br>\tag{$\star\star$}<br>$$</p><p>Since all $x(\theta+h)\in\{F\le F(x_k^\tau)\}$ which is compact, there exists $h_j\to0$ such that<br>$$<br>x(\theta+h_j)\to \bar x.<br>$$</p><p>Since $F$ and $d$ are l.s.c., then $G_\theta$ is also l.s.c. 
Hence<br>$$<br>\begin{aligned}<br>G_\theta(\bar x)\le \liminf_{j\to\infty}G_\theta(x(\theta+h_j))&amp;=\liminf_{j\to\infty}\left[G_{\theta+h_j}(x(\theta+h_j))<br>+\frac{d^2(x(\theta+h_j),x_k^\tau)}{2\tau}<br>\left(\frac{1}{\theta}-\frac{1}{\theta+h_j}\right)\right]\\\<br>&amp;=\lim_{j\to\infty}\varphi(\theta+h_j)=\varphi(\theta)\le G_\theta(\bar x),<br>\end{aligned}<br>$$</p><p>where we used that $d^2(x(\theta+h_j),x_k^\tau)$ is bounded (shown as before), that $\varphi$ is differentiable, hence continuous, at $\theta$, and that $x(\theta)$ is a minimizer of $G_\theta$. Therefore<br>$$<br>G_\theta(\bar x)=\varphi(\theta),<br>$$<br>i.e. $\bar x$ is also a minimizer at $\theta$. On the other hand<br>$$<br>\varphi(\theta+h_j)=F(x(\theta+h_j))<br>+\frac{d^2(x(\theta+h_j),x_k^\tau)}{2(\theta+h_j)\tau},<br>$$<br>then by the l.s.c. of $F$ and $d$,<br>$$<br>\varphi(\theta)=\liminf_{j\to\infty}\varphi(\theta+h_j)<br>\ge F(\bar x)+\frac{d^2(\bar x,x_k^\tau)}{2\theta\tau}<br>=G_\theta(\bar x)=\varphi(\theta).<br>$$<br>Hence all the inequalities obtained by lower semi-continuity become equalities; in particular,<br>$$<br>\lim_{j\to\infty}\frac{d^2(x(\theta+h_j),x_k^\tau)}{2(\theta+h_j)\tau}<br>=\frac{d^2(\bar x,x_k^\tau)}{2\theta\tau}.<br>$$<br>Therefore,<br>$$<br>d(x(\theta+h_j),x_k^\tau)\to d(\bar x,x_k^\tau),<br>$$<br>and then by ($\star\star$),<br>$$<br>\varphi’(\theta)\ge -\lim_{j\to\infty}\frac{d^2(x(\theta+h_j),x_k^\tau)}{2(\theta+h_j)\tau}<br>= -\frac{d^2(\bar x,x_k^\tau)}{2\theta^2\tau}.<br>$$</p><p>Combining ($\star$) and ($\star\star$), we get (note that ($\star$) holds for every minimizer $x(\theta)$, and $\bar x$ is also a minimizer)<br>$$<br>\varphi’(\theta)= -\frac{d^2(x(\theta),x_k^\tau)}{2\theta^2\tau}.\quad \square<br>$$</p><p><strong>(4)</strong> we have<br>$$<br>|\nabla^-F|(x(\theta))\le \frac{d(x(\theta),x_k^\tau)}{\theta\tau}.<br>$$</p><p><strong><em>Proof.</em></strong> Since $x(\theta)$ is optimal, for all $y$<br>$$<br>F(y)+\frac{d^2(y,x_k^\tau)}{2\theta\tau}<br>\ge F(x(\theta))+\frac{d^2(x(\theta),x_k^\tau)}{2\theta\tau}.<br>$$<br>We have<br>$$<br>F(x(\theta))-F(y)\le \frac{1}{2\theta\tau}<br>\big(d^2(y,x_k^\tau)-d^2(x(\theta),x_k^\tau)\big)\le \frac{1}{2\theta\tau}\big(d(x(\theta),x_k^\tau)+d(y,x_k^\tau)\big)d(x(\theta),y),<br>$$<br>where the last inequality uses the factorization $d^2(y,x_k^\tau)-d^2(x(\theta),x_k^\tau)=\big(d(y,x_k^\tau)+d(x(\theta),x_k^\tau)\big)\big(d(y,x_k^\tau)-d(x(\theta),x_k^\tau)\big)$ together with the triangle inequality.</p><p>Therefore, letting $y\to x(\theta)$ and using the continuity of $d$,<br>$$<br>\limsup_{y\to x(\theta)}<br>\frac{[F(x(\theta))-F(y)]_+}{d(x(\theta),y)}<br>\le \frac{1}{2\theta\tau}\cdot 2d(x(\theta),x_k^\tau)<br>=\frac{d(x(\theta),x_k^\tau)}{\theta\tau}.\quad \square<br>$$</p><p><strong>(5)</strong> due to the possible singular part of the derivative for monotone functions, we have<br>$$<br>\varphi(0)-\varphi(1)\ge -\int_0^1 \varphi’(\theta)\ d\theta.<br>$$<br>Together with the inequality<br>$$<br>-\varphi’(\theta)=\frac{d^2(x(\theta),x_k^\tau)}{2\theta^2\tau}<br>\ge \frac{\tau}{2}|\nabla^-F|^2(x(\theta)),<br>$$<br>we get<br>$$<br>F(x_k^\tau)-\left(F(x_{k+1}^\tau)+\frac{d^2(x_{k+1}^\tau,x_k^\tau)}{2\tau}\right)<br>\ge \frac{\tau}{2}\int_0^1 |\nabla^-F|^2(x(\theta))\ d\theta.<br>$$</p><p>If we sum up for $k=0,1,\cdots$ and take the limit $\tau\to0$, under some suitable assumptions, we can prove that every GMM $x$ satisfies<br>$$<br>F(x(t))+\frac12\int_0^t |x’|^2(r)\ dr<br>+\frac12\int_0^t |\nabla^-F|^2(x(r))\ dr<br>\le F(x(0)).<br>$$</p><p>But this is not exactly the EDE:</p><ul><li>it is an inequality rather than an equality,</li><li>it only compares the instants $t$ and $0$ instead of arbitrary $t$ and $s$.</li></ul><p>If we want equality for every pair $(t,s)$ we need to require the slope to be an upper gradient. 
Indeed, in this case,<br>$$<br>F(x(0))-F(x(t))<br>\le \int_0^t |\nabla^-F|(x(r))|x’|(r)\ dr,<br>$$<br>and then, combining this with the Young inequality $ab\le \frac12 a^2+\frac12 b^2$, the previous inequality becomes an equality; subtracting the equalities for $t$ and $s$ gives, for $s&lt;t$,<br>$$<br>F(x(t))+\frac12\int_s^t |x’|^2(r)\ dr<br>+\frac12\int_s^t |\nabla^-F|^2(x(r))\ dr<br>=F(x(s)).<br>$$</p><p>Remarkably, it turns out that the assumption that $F$ is $\lambda$-geodesically convex makes all these assumptions hold true.</p><h2 id="Uniqueness-and-contractivity"><a href="#Uniqueness-and-contractivity" class="headerlink" title="Uniqueness and contractivity"></a>Uniqueness and contractivity</h2><p>Concerning the relation between EDE and EVI, following Savaré, we have the following:</p><ul><li>All curves which are gradient flows in the EVI sense also satisfy the EDE condition.</li><li>The EDE condition is not in general enough to guarantee uniqueness of the gradient flow. A simple example is $X=\mathbb{R}^2$, with the $l^\infty$ distance<br>$$<br>d((x_1,x_2),(y_1,y_2))=|x_1-y_1|\vee |x_2-y_2|<br>$$<br>and take $F(x_1,x_2)=x_1$, then any curve $(x_1(t),x_2(t))$ with $x_1’(t)=-1$ and $|x_2’(t)|\le1$ satisfies EDE.</li></ul><p><strong><em>Proof.</em></strong> It is easy to see that $|\nabla^-F|(x)=1$ for all $x\in X$, and<br>$$<br>|x’|(t)=\lim_{h\to0}\frac{d(x(t+h),x(t))}{|h|}<br>=\lim_{h\to0}\frac{\max\{|x_1(t+h)-x_1(t)|,\ |x_2(t+h)-x_2(t)|\}}{|h|}<br>=\max\{|x_1’(t)|,\ |x_2’(t)|\}.<br>$$<br>If $x_1’(t)=-1$ and $|x_2’(t)|\le1$, then $|x’|(t)=1$. Now, we need to check the EDE: for $s&lt;t$<br>$$<br>F(x(s))-F(x(t))<br>=\int_s^t \frac12|x’|^2(r)+\frac12|\nabla^-F|^2(x(r))\ dr.<br>$$<br>LHS $=x_1(s)-x_1(t)$, and by $x_1’(t)=-1$, we have $x_1(t)=x_1(0)-t$ for all $t$, hence<br>$$<br>\text{LHS}=(x_1(0)-s)-(x_1(0)-t)=t-s. \quad \text{RHS}=\int_s^t \frac12+\frac12\ dr=t-s.<br>$$<br>Hence the EDE holds. 
$\square$</p><ul><li>Existence of gradient flows in the EDE sense is easy to obtain.</li><li>The EVI condition is in general too strong to guarantee existence, but it always guarantees uniqueness and stability.</li></ul><p><strong>Proposition.</strong> If two curves $x,y:[0,T]\to X$ satisfy the $\mathrm{EVI}_\lambda$ condition, then we have<br>$$<br>\frac{d}{dt}d(x(t),y(t))^2\le -2\lambda\  d(x(t),y(t))^2<br>$$<br>and<br>$$<br>d(x(t),y(t))\le e^{-\lambda t}d(x(0),y(0)).<br>$$</p><p><strong><em>Proof.</em></strong> By the $\mathrm{EVI}_\lambda$ condition, for all $y\in X$<br>$$<br>\frac{d}{dt}\frac12 d(x(t),y)^2\le F(y)-F(x(t))-\frac{\lambda}{2}d(x(t),y)^2.<br>$$</p><p>Take $y=y(t_0)$.<br>$$<br>\left.\frac{d}{dt}\frac12 d(x(t),y(t_0))^2\right|_{t=t_0}<br>\le F(y(t_0))-F(x(t_0))-\frac{\lambda}{2}d(x(t_0),y(t_0))^2.<br>$$</p><p>Similarly,<br>$$<br>\left.\frac{d}{ds}\frac12 d(x(t_0),y(s))^2\right|_{s=t_0}<br>\le F(x(t_0))-F(y(t_0))-\frac{\lambda}{2}d(x(t_0),y(t_0))^2.<br>$$</p><p>Adding them up, we get<br>$$<br>\frac{d}{dt}\frac12 d(x(t),y(t))^2\Big| _{t=t_0}<br>=\frac{d}{dt}\frac12 d(x(t),y(t_0))^2\Big| _{t=t_0}<br>+\frac{d}{ds}\frac12 d(x(t_0),y(s))^2\Big| _{s=t_0}\le -\lambda\  d(x(t_0),y(t_0))^2.<br>$$</p><p>Hence<br>$$<br>\frac{d}{dt}d(x(t),y(t))^2\le -2\lambda\  d(x(t),y(t))^2.<br>$$<br>By the Gronwall inequality<br>$$<br>d(x(t),y(t))\le e^{-\lambda t}d(x(0),y(0)).\quad \square<br>$$</p><p>If we want a satisfying theory for gradient flows which includes uniqueness, we just need to prove the existence of curves which satisfy the EVI condition, accepting that this will probably require additional assumptions.</p><p>This assumption, that we will call <strong>$C^2G^2$ (Compatible Convexity along Generalized Geodesics)</strong>, is the following: suppose that for every pair $(x_0,x_1)$, and every $y\in X$, there is a curve $x(t)$ connecting $x(0)=x_0$ to $x(1)=x_1$, such that</p><ul><li>($F$ is $\lambda$-convex)</li></ul><p>$$<br>F(x(t))\le 
(1-t)F(x_0)+tF(x_1)-\lambda\frac{t(1-t)}{2}d^2(x_0,x_1)<br>$$</p><ul><li>($x\mapsto d^2(x,y)$ is $2$-convex)</li></ul><p>$$<br>d^2(x(t),y)\le (1-t)d^2(x_0,y)+t\ d^2(x_1,y)-t(1-t)d^2(x_0,x_1).<br>$$</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><p>Santambrogio, F. Euclidean, metric, and Wasserstein gradient flows: an overview. <em>Bull. Math. Sci.</em> <strong>7</strong>, 87–154 (2017).</p><blockquote><p><em>The cover image in this article was taken on Tinian Island, a U.S. territory in the Northern Mariana Islands.</em></p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/04/15/gradient-flows-3/</id>
    <link href="https://handsteinwang.github.io/2026/04/15/gradient-flows-3/"/>
    <published>2026-04-14T16:00:00.000Z</published>
    <summary>
      <![CDATA[<p>In this article, we will introduce the general theory in metric spaces.</p>]]>
    </summary>
    <title>The General Theory in Metric Spaces</title>
    <updated>2026-04-14T16:00:00.000Z</updated>
  </entry>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Gradient Flows" scheme="https://handsteinwang.github.io/categories/Gradient-Flows/"/>
    <category term="Gradient Flows" scheme="https://handsteinwang.github.io/tags/Gradient-Flows/"/>
    <content>
<![CDATA[<p>In this article, we will introduce the gradient flows in the metric setting.</p><span id="more"></span><p>If one has a metric space $(X,d)$ and a l.s.c. function $F:X\to \mathbb{R}\cup\{+\infty\}$ (under suitable compactness assumptions to guarantee existence of the minimum), one can define<br>$$<br>x_{k+1}^{\tau}\in \arg\min_{x\in X}\left\{F(x)+\frac{d(x,x_k^{\tau})^2}{2\tau}\right\} \tag{9}<br>$$<br>and study the limit as $\tau\to 0$. Then we use the piecewise constant interpolation<br>$$<br>x^{\tau}(t):=x_k^{\tau}\qquad \text{for every } t\in ((k-1)\tau,k\tau].\tag{10}<br>$$<br>and study the limit of $x^{\tau}$ as $\tau\to 0$.</p><p><strong>Definition 1.</strong> A curve $x:[0,T]\to X$ is called a <strong>Generalized Minimizing Movement (GMM)</strong> if there exists a sequence of time steps $\tau_j\to 0$ such that the sequence of curves $x^{\tau_j}$, defined in (10) using the iterated solutions of (9), uniformly converges to $x$ in $[0,T]$.</p><p>We start from<br>$$<br>F(x_{k+1}^{\tau})+\frac{d(x_{k+1}^{\tau},x_k^{\tau})^2}{2\tau}\le F(x_k^{\tau})<br>$$<br>and, assuming $F$ is bounded from below,<br>$$<br>\sum_{k=0}^{\ell}\frac{d(x_{k+1}^{\tau},x_k^{\tau})^2}{2\tau}\le F(x_0^{\tau})-F(x_{\ell+1}^{\tau})\le C.<br>$$</p><p>The Cauchy–Schwarz inequality gives, for $t&lt;s$, $t\in [i\tau,(i+1)\tau]$ and $s\in [j\tau,(j+1)\tau]$ (which implies $j-i\le \frac{|t-s|}{\tau}+1$),</p><p>$$<br>d(x^{\tau}(t),x^{\tau}(s))<br>\le d(x_{i+1}^{\tau},x_{j+1}^{\tau})<br>\le \sum_{k=i+1}^{j} d(x_{k+1}^{\tau},x_k^{\tau})\le \left(\sum_{k=i+1}^{j} d(x_{k+1}^{\tau},x_k^{\tau})^2\right)^{1/2}(j-i)^{1/2}<br>\le \left(\sum_{k=0}^{\ell} 
d(x_{k+1}^{\tau},x_k^{\tau})^2\right)^{1/2}\left(\frac{|t-s|}{\tau}\right)^{1/2}<br>\le C|t-s|^{1/2}.<br>$$</p><p>This shows the curves $x^{\tau}$ are morally equi-Hölder with exponent $\frac12$. Since they start from the same $x^{\tau}(0)=x_0$, they are also equibounded. By the Ascoli–Arzelà theorem, we can extract a uniformly converging subsequence.</p><h2 id="Curves-and-geodesics-in-metric-spaces"><a href="#Curves-and-geodesics-in-metric-spaces" class="headerlink" title="Curves and geodesics in metric spaces"></a>Curves and geodesics in metric spaces</h2><p><strong>Definition 2.</strong> If $w:[0,1]\to X$ is a curve valued in the metric space $(X,d)$, we define the metric derivative of $w$ at the time $t$, denoted by $|w’|(t)$, through<br>$$<br>|w’|(t):=\lim_{h\to 0}\frac{d(w(t+h),w(t))}{|h|}<br>$$<br>provided this limit exists.</p><p>In the spirit of the Rademacher theorem, it is possible to prove that if $w:[0,1]\to X$ is Lipschitz continuous, then the metric derivative $|w’|(t)$ exists for a.e. $t$. 
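</p><p>As a quick numerical illustration of this definition (my own sketch, not part of the text): for the curve $w(t)=(\cos t,\sin t)$ in $\mathbb R^2$ with the Euclidean distance, the difference quotient converges to $|w’|(t)=1$ for every $t$.</p>

```python
import math

# Approximate the metric derivative |w'|(t) of w(t) = (cos t, sin t) in
# (R^2, Euclidean distance) by a symmetric difference quotient
# d(w(t+h), w(t-h)) / (2 h); the exact value is 1 for every t.

def w(t):
    return (math.cos(t), math.sin(t))

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def metric_derivative(curve, t, h=1e-6):
    return dist(curve(t + h), curve(t - h)) / (2.0 * h)

for t in (0.0, 0.7, 2.0):
    print(t, metric_derivative(w, t))  # all values close to 1
```

<p>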
Moreover, we have for $t_0&lt;t_1$<br>$$<br>d(w(t_0),w(t_1))\le \int_{t_0}^{t_1}|w’|(s)\ ds.<br>$$</p><p><strong>Definition 3.</strong> A curve $w:[0,1]\to X$ is said to be absolutely continuous whenever there exists $g\in L^1[0,1]$ such that<br>$$<br>d(w(t_0),w(t_1))\le \int_{t_0}^{t_1} g(s)\ ds<br>\qquad \text{for every } t_0&lt;t_1.<br>$$</p><p>The set of absolutely continuous curves defined on $[0,1]$ and valued in $X$ is denoted by $AC(X)$.</p><p>It is well-known that every absolutely continuous curve can be reparameterized in time (through a monotone increasing reparametrization) so as to become Lipschitz continuous, and the metric derivative also exists for a.e. $t$ when $w\in AC(X)$.</p><p><strong>Definition 4.</strong> For a curve $w:[0,1]\to X$, let us define<br>$$<br>\operatorname{Length}(w):=\sup\left\{\sum_{k=0}^{n-1} d\bigl(w(t_k),w(t_{k+1})\bigr): n\ge 1,\ 0=t_0&lt;t_1&lt;\cdots&lt;t_n=1\right\}.<br>$$</p><p>It is easy to see that all curves $w\in AC(X)$ satisfy<br>$$<br>\operatorname{Length}(w)\le \int_0^1 g(t)\ dt&lt;+\infty.<br>$$</p><p>We can also prove that for any curve $w\in AC(X)$, we have<br>$$<br>\operatorname{Length}(w)=\int_0^1 |w’|(t)\ dt.<br>$$</p><p><strong>Definition 5.</strong> A curve $w:[0,1]\to X$ is said to be a geodesic between $x_0$ and $x_1\in X$ if $w(0)=x_0$, $w(1)=x_1$, and<br>$$<br>\operatorname{Length}(w)=\min\left\{\operatorname{Length}(\widetilde w): \widetilde w(0)=x_0,\ \widetilde w(1)=x_1\right\}.<br>$$</p><p>A space $(X,d)$ is said to be a length space if for every $x$ and $y$, we have<br>$$<br>d(x,y)=\inf\left\{\operatorname{Length}(w): w\in AC(X),\ w(0)=x,\ w(1)=y\right\}.<br>$$</p><p>A space $(X,d)$ is said to be a geodesic space if for every $x$ and $y$, we have<br>$$<br>d(x,y)=\min\left\{\operatorname{Length}(w): w\in AC(X),\ w(0)=x,\ w(1)=y\right\}.<br>$$<br>i.e. 
if it is a length space and there exist geodesics between arbitrary points.</p><p>In a length space, a curve $w:[t_0,t_1]\to X$ is said to be a constant-speed geodesic between $w(t_0)$ and $w(t_1)$ if it satisfies<br>$$<br>d(w(t),w(s))=\frac{|t-s|}{t_1-t_0}d(w(t_0),w(t_1))<br>\qquad \text{for all } t,s\in [t_0,t_1].<br>$$</p><p>The following three facts are equivalent:</p><ol><li>$w$ is a constant-speed geodesic defined on $[t_0,t_1]$ and joining $x_0$ and $x_1$.</li><li>$w\in AC(X)$ and<br>$$<br>|w’|(t)=\frac{d(w(t_0),w(t_1))}{|t_1-t_0|}<br>\qquad \text{a.e.}<br>$$</li><li>$w$ solves<br>$$<br>\min\left\{\int_{t_0}^{t_1}|w’|(t)^p\ dt:\ w(t_0)=x_0,\ w(t_1)=x_1\right\}<br>\qquad \text{for all } p&gt;1.<br>$$</li></ol><p>Coming back to the interpolation of the points obtained through the Minimizing Movement Scheme: if $(X,d)$ is a geodesic space, then the piecewise affine interpolation that we used in the Euclidean space may be usefully replaced by a piecewise geodesic interpolation. This means defining a curve<br>$$<br>x^{\tau}:[0,T]\to X<br>$$<br>such that<br>$$<br>x^{\tau}(k\tau)=x_k^{\tau}<br>$$<br>and such that $x^{\tau}$ restricted to any interval $[k\tau,(k+1)\tau]$ is a constant-speed geodesic with speed equal to $\frac{d(x_k^{\tau},x_{k+1}^{\tau})}{\tau}$. Then the same equicontinuity will hold.</p><p>The next question is how to characterize the limit curve obtained when $\tau\to 0$. In a general metric space<br>$$<br>x’(t)=-\nabla F(x(t))<br>$$<br>has no meaning!</p><ul><li><strong>EDE (Energy Dissipation Equality) viewpoint.</strong></li></ul><p>For $s&lt;t$<br>$$<br>F(x(s))-F(x(t))<br>=\int_s^t -\nabla F(x(r))\cdot x’(r)\ dr<br>\le \int_s^t |\nabla F(x(r))|\ |x’(r)|\ dr<br>\le \int_s^t \left(\frac12|x’(r)|^2+\frac12|\nabla F(x(r))|^2\right)\ dr.<br>$$</p><p>Here the first inequality becomes equality if and only if $x’(r)$ and $\nabla F(x(r))$ are vectors with opposite direction for a.e. 
$r$, and the second inequality becomes equality if and only if their norms are the same. Therefore<br>$$<br>F(x(s))-F(x(t))<br>=\int_s^t \left(\frac12|x’(r)|^2+\frac12|\nabla F(x(r))|^2\right)\ dr<br>\qquad \text{for all } s&lt;t<br>$$<br>if and only if<br>$$<br>x’(t)=-\nabla F(x(t))<br>\qquad \text{a.e. } t.<br>$$</p><ul><li><strong>EVI (Evolution Variational Inequality) viewpoint.</strong></li></ul><p>If $F$ is $\lambda$-convex, the inequality characterizing the elements $p\in\partial F(x)$ is<br>$$<br>F(y)\ge F(x)+\frac{\lambda}{2}|y-x|^2+p\cdot (y-x)<br>\qquad \text{for all } y\in \mathbb{R}^n.<br>$$</p><p>Therefore<br>$$<br>\frac{d}{dt}\frac12|x(t)-y|^2=(y-x(t))\cdot \bigl(-x’(t)\bigr)<br>\le F(y)-F(x(t))-\frac{\lambda}{2}|x(t)-y|^2<br>\qquad \text{for all } y,<br>$$<br>which is equivalent to<br>$$<br>-x’(t)\in \partial F(x(t)).<br>$$</p><p>Using $\mathrm{EVI}_{\lambda}$ for two curves $x(t)$ and $y(s)$, we have<br>$$<br>\frac{d}{dt}\frac12 d(x(t),y(s))^2<br>\le F(y(s))-F(x(t))-\frac{\lambda}{2}d(x(t),y(s))^2<br>\tag{12}<br>$$<br>and<br>$$<br>\frac{d}{ds}\frac12 d(x(t),y(s))^2<br>\le F(x(t))-F(y(s))-\frac{\lambda}{2}d(x(t),y(s))^2.<br>\tag{13}<br>$$</p><p>Define<br>$$<br>E(t)=\frac12 d(x(t),y(t))^2,<br>\qquad<br>G(t,s)=\frac12 d(x(t),y(s))^2.<br>$$<br>Then<br>$$<br>\frac{d}{dt}E(t)=\frac{d}{dt}G(t,t)=\partial_t G(t,t)+\partial_s G(t,t)<br>\le -\lambda d(x(t),y(t))^2=-2\lambda E(t).<br>$$</p><p>By Gronwall’s inequality, $E(t)\le E(0)e^{-2\lambda t}$, which provides uniqueness and stability.</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><p>Santambrogio, F. Euclidean, metric, and Wasserstein gradient flows: an overview. <em>Bull. Math. Sci.</em> <strong>7</strong>, 87–154 (2017).</p><blockquote><p>The cover image in this article was taken at Château de Chillon in Switzerland.</p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/04/14/gradient-flows-2/</id>
    <link href="https://handsteinwang.github.io/2026/04/14/gradient-flows-2/"/>
    <published>2026-04-13T16:00:00.000Z</published>
    <summary>
      <![CDATA[<p>In this article, we will introduce the gradient flows in the metric setting.</p>]]>
    </summary>
    <title>Introduction to the Metric Setting</title>
    <updated>2026-04-13T16:00:00.000Z</updated>
  </entry>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Gradient Flows" scheme="https://handsteinwang.github.io/categories/Gradient-Flows/"/>
    <category term="Gradient Flows" scheme="https://handsteinwang.github.io/tags/Gradient-Flows/"/>
    <content>
<![CDATA[<p>In this article, we will introduce some basic results of gradient flows in the Euclidean space.</p><span id="more"></span><p>Given a function $F:\mathbb{R}^n \to \mathbb{R}$, smooth enough, and a point $x_0 \in \mathbb{R}^n$, a gradient flow is just defined as a curve $x(t)$, with starting point at $t=0$ given by $x_0$, which moves by choosing at each instant of time the direction which makes the function $F$ decrease as much as possible. More precisely, we consider the solution of the Cauchy problem<br>$$<br>\begin{cases}<br>x’(t) = -\nabla F(x(t)) &amp; \text{for } t&gt;0, \\\<br>x(0)=x_0<br>\end{cases}<br>\tag{1}<br>$$</p><p>This is a standard Cauchy problem which has a unique solution if $\nabla F$ is Lipschitz continuous.</p><p>If $F$ is convex, we can replace the gradient with the subdifferential, looking for an absolutely continuous curve $x:[0,T]\to \mathbb{R}^n$ satisfying<br>$$<br>\begin{cases}<br>x’(t)\in -\partial F(x(t)) &amp; \text{for a.e. } t&gt;0, \\\<br>x(0)=x_0<br>\end{cases}<br>\tag{2}<br>$$<br>where<br>$$<br>\partial F(x):=\left\{ p\in \mathbb{R}^n \mid F(y)\ge F(x)+p\cdot (y-x)\ \text{for all } y\in \mathbb{R}^n \right\}.<br>$$</p><p>Properties of the subdifferential:</p><ol><li>$F$ is differentiable at $x$ iff $\partial F(x)=\{\nabla F(x)\}$.</li><li>$\partial F(x)$ is a convex set.</li><li>$\partial F(x)$ is non-empty for every $x$ in the interior of $\operatorname{dom} F$.</li><li>For every $x_1,x_2$ and $p_1\in \partial F(x_1)$, $p_2\in \partial F(x_2)$,<br>$$<br>(x_1-x_2)\cdot (p_1-p_2)\ge 0.<br>$$</li></ol><p>We denote by $\partial^\circ F(x)$ its element of minimal norm.</p><p><strong>Proposition 1.</strong> Suppose that $F$ is convex and let $x_1,x_2$ be two solutions of (2). Then we have<br>$$<br>|x_1(t)-x_2(t)|\le |x_1(0)-x_2(0)|<br>$$<br>for every $t\ge 0$. 
In particular, this gives uniqueness of the solution of the Cauchy problem.</p><p><strong><em>Proof.</em></strong> Let us consider<br>$$<br>g(t)=\frac12 |x_1(t)-x_2(t)|^2.<br>$$<br>Then we have<br>$$<br>g’(t)=(x_1(t)-x_2(t))\cdot (x_1’(t)-x_2’(t)).<br>$$<br>Since $x_1’(t)\in -\partial F(x_1(t))$, $x_2’(t)\in -\partial F(x_2(t))$, by the monotonicity of the subdifferential (property 4 above), we have<br>$$<br>g’(t)\le 0.<br>$$<br>Therefore<br>$$<br>g(t)\le g(0).\quad \square<br>$$</p><p><strong>Remark.</strong> We recall that $F$ being semi-convex means that there exists $\lambda\in \mathbb{R}$ such that<br>$$<br>x\mapsto F(x)-\frac{\lambda}{2}|x|^2<br>$$<br>is convex.</p><p>For $\lambda$-convex function $F$, we can define the subdifferential by<br>$$<br>\partial F(x)=\left\{ p\in \mathbb{R}^n \ \middle| \ F(y)\ge F(x)+p\cdot (y-x)+\frac{\lambda}{2}|y-x|^2 \ \text{for all } y\in \mathbb{R}^n \right\}.<br>$$</p><p>Again, we define $\partial^\circ F(x)$ as the element of minimal norm of $\partial F(x)$.</p><p>If $F$ is $\lambda$-convex, then for $x_1,x_2,p_1,p_2$ with $p_i\in \partial F(x_i)$, $i=1,2$, one has<br>$$<br>(x_1-x_2)\cdot (p_1-p_2)\ge \lambda |x_1-x_2|^2.<br>$$</p><p>This implies<br>$$<br>g’(t)\le -2\lambda g(t).<br>$$<br>By Gronwall’s inequality<br>$$<br>g(t)\le g(0)e^{-2\lambda t}.<br>$$</p><p>If $\lambda&gt;0$, then $F$ is strictly convex and admits a unique minimizer $\bar{x}$.</p><p>Take $x_2(t)\equiv \bar{x}$, which is also a solution since $0\in \partial F(\bar{x})$. Then we get<br>$$<br>|x_1(t)-\bar{x}|\le e^{-\lambda t}|x_1(0)-\bar{x}|.<br>$$</p><p><strong>Proposition 2.</strong> Suppose that $F$ is $\lambda$-convex and let $x$ be a solution of (2). Then, for all times $t_0$ such that both $t\mapsto x(t)$ and $t\mapsto F(x(t))$ are differentiable at $t=t_0$, the subdifferential $\partial F(x(t_0))$ is contained in a hyperplane orthogonal to $x’(t_0)$. In particular, we have<br>$$<br>x’(t)=-\partial^\circ F(x(t)) \qquad \text{for a.e. 
} t.<br>$$</p><p><strong><em>Proof.</em></strong> Let $t_0$ be as in the statement and $p\in \partial F(x(t_0))$. By the definition of subdifferential<br>$$<br>F(x(t))\ge F(x(t_0))+p\cdot (x(t)-x(t_0))+\frac{\lambda}{2}|x(t)-x(t_0)|^2<br>\qquad \text{for all } t,<br>$$<br>and it is equality when $t=t_0$. Hence<br>$$<br>t\mapsto F(x(t))-F(x(t_0))-p\cdot (x(t)-x(t_0))-\frac{\lambda}{2}|x(t)-x(t_0)|^2<br>$$<br>is minimal for $t=t_0$, and then<br>$$<br>\frac{d}{dt}F(x(t))\Big|_{t=t_0}=p\cdot x’(t_0).<br>$$<br>Hence<br>$$<br>\partial F(x(t_0))\subseteq \left\{ p\in \mathbb{R}^n \mid p\cdot x’(t_0)=\mathrm{const} \right\},<br>$$<br>which is a hyperplane orthogonal to $x’(t_0)$.</p><p>Since for a.e. $t_0$,<br>$$<br>-x’(t_0)\in \partial F(x(t_0)),<br>$$<br>we have<br>$$<br>\mathrm{const}=-x’(t_0)\cdot x’(t_0)=-|x’(t_0)|^2.<br>$$<br>The hyperplane is<br>$$<br>\left\{ p\in \mathbb{R}^n \mid p\cdot x’(t_0)=-|x’(t_0)|^2 \right\}.<br>$$<br>Hence the orthogonal projection of $0$ onto the hyperplane which contains $\partial F(x(t_0))$ is<br>$$<br>\frac{x’(t_0)}{|x’(t_0)|^2}\cdot \big(-|x’(t_0)|^2\big)=-x’(t_0).<br>$$<br>This provides<br>$$<br>x’(t_0)=-\partial^\circ F(x(t_0))<br>$$<br>for a.e. $t_0$. 
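</p><p>As a numerical illustration of the contraction estimate above (not part of the original argument), the following Python sketch integrates two trajectories of $x’=-F’(x)$ for the $\lambda$-convex energy $F(x)=x^4/4+\frac{\lambda}{2}x^2$ (an illustrative choice with $F’’\ge\lambda$) by a small explicit Euler step, and checks $|x_1(t)-x_2(t)|\le e^{-\lambda t}|x_1(0)-x_2(0)|$.</p>

```python
import math

# Two explicit-Euler trajectories of x' = -F'(x) for the illustrative
# lambda-convex energy F(x) = x**4/4 + (lam/2)*x**2, so F'' >= lam = 1.
lam, h, T = 1.0, 1e-3, 1.0

def fprime(x):
    return x**3 + lam * x

x1, x2 = 1.0, -0.5            # two different starting points
dist_0 = abs(x1 - x2)
for _ in range(int(T / h)):
    x1 -= h * fprime(x1)
    x2 -= h * fprime(x2)
dist_T = abs(x1 - x2)

# contraction bound g(t) <= g(0) e^{-2*lam*t}, i.e. distances contract
# at least by the factor e^{-lam*T}
bound = math.exp(-lam * T) * dist_0
```

<p>The step size $h$ only controls the integration error; the contraction itself comes from the $\lambda$-convexity of $F$, and here the quartic term makes it even stronger than the bound.</p><p>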
$\square$</p><h2 id="Discretization-in-time"><a href="#Discretization-in-time" class="headerlink" title="Discretization in time"></a>Discretization in time</h2><p>Fix a small time step parameter $\tau&gt;0$, define</p><p>$$<br>x_{k+1}^\tau \in \operatorname*{argmin} _x \left\{ F(x)+\frac{|x-x_k^\tau|^2}{2\tau} \right\}.<br>\tag{3}<br>$$</p><p>and we call $(x_k^\tau) _k$ the <strong>Minimizing Movement Scheme</strong>.</p><p>If $F$ is smooth, then</p><p>$$<br>x _{k+1}^\tau \in \operatorname*{argmin} _x \left\{ F(x)+\frac{| x-x_k^\tau |^2}{2\tau} \right\}<br>\implies<br>\nabla F(x _{k+1}^\tau)+\frac{x _{k+1}^\tau-x _k^\tau}{\tau}=0.<br>$$</p><p>Hence,<br>$$<br>\frac{x_{k+1}^\tau-x_k^\tau}{\tau}=-\nabla F(x_{k+1}^\tau),<br>$$<br>which is exactly the discrete-time implicit Euler scheme.</p><p>Given the ODE<br>$$<br>\begin{cases}<br>x’(t)=V(x(t)),\\\<br>x(0)=x_0,<br>\end{cases}<br>$$</p><ul><li><p><strong>Explicit scheme</strong><br>$$<br>x_{k+1}^\tau=x_k^\tau+\tau V(x_k^\tau), \qquad x_0^\tau=x_0.<br>$$</p></li><li><p><strong>Implicit scheme</strong><br>$$<br>x_{k+1}^\tau=x_k^\tau+\tau V(x_{k+1}^\tau), \qquad x_0^\tau=x_0.<br>$$</p></li></ul><p>Now, we want to show that as $\tau\to 0$, the sequence $(x_k^\tau)$ we found, suitably interpolated, converges to the solution of (2).</p><p>First, define<br>$$<br>v_{k+1}^\tau:=\frac{x_{k+1}^\tau-x_k^\tau}{\tau}.<br>$$<br>Then we set<br>$$<br>x^\tau(t):=x_{k+1}^\tau, \qquad \text{for } t\in (k\tau,(k+1)\tau],<br>$$<br>$$<br>\tilde x^\tau(t):=x_k^\tau+(t-k\tau)v_{k+1}^\tau, \qquad \text{for } t\in (k\tau,(k+1)\tau],<br>$$<br>$$<br>v^\tau(t):=v_{k+1}^\tau, \qquad \text{for } t\in (k\tau,(k+1)\tau].<br>$$</p><p>It is easy to see that $\tilde x^\tau$ is continuous and piecewise affine, satisfying<br>$$<br>(\tilde x^\tau)’=v^\tau.<br>$$</p><p>What’s more, $x^\tau$ is not continuous, but by definition<br>$$<br>v^\tau(t)\in -\partial F(x^\tau(t)).<br>$$</p><p>By the definition of $x_{k+1}^\tau$, we 
have<br>$$<br>F(x_{k+1}^\tau)+\frac{|x_{k+1}^\tau-x_k^\tau|^2}{2\tau}\le F(x_k^\tau).<br>\tag{4}<br>$$</p><p>If $F(x_0)&lt;+\infty$ and $\inf F&gt;-\infty$, summing over $k$, we get<br>$$<br>\sum_{k=0}^{\ell}\frac{|x_{k+1}^\tau-x_k^\tau|^2}{2\tau}<br>\le<br>\big(F(x_0^\tau)-F(x_{\ell+1}^\tau)\big)\le C.<br>\tag{5}<br>$$</p><p>Note that<br>$$<br>\frac{|x_{k+1}^\tau-x_k^\tau|^2}{2\tau}<br>=<br>\frac{\tau}{2}\left|\frac{x_{k+1}^\tau-x_k^\tau}{\tau}\right|^2<br>=<br>\frac{\tau}{2}|v_{k+1}^\tau|^2<br>=<br>\frac12\int_{k\tau}^{(k+1)\tau}|(\tilde x^\tau)’(t)|^2\ dt.<br>$$</p><p>This means that we have<br>$$<br>\int_0^T \frac12 |(\tilde x^\tau)’(t)|^2\ dt\le C.<br>\tag{6}<br>$$</p><p>Hence $\tilde x^\tau$ is bounded in $H^1$ and $v^\tau$ in $L^2$. By Hölder’s inequality, for $t\ge s$,<br>$$<br>|\tilde x^\tau(t)-\tilde x^\tau(s)|<br>=<br>\left|\int_s^t (\tilde x^\tau)’(u)\ du\right|<br>\le<br>\int_s^t |(\tilde x^\tau)’(u)|\ du<br>\le<br>\left(\int_s^t |(\tilde x^\tau)’(u)|^2\ du\right)^{1/2}|t-s|^{1/2}<br>\le C|t-s|^{1/2}.<br>\tag{7}<br>$$</p><p>Therefore the curves $\tilde x^\tau$, $\tau&gt;0$, are equicontinuous. Moreover,<br>$$<br>|\tilde x^\tau(t)-x_0|<br>=<br>|\tilde x^\tau(t)-\tilde x^\tau(0)|<br>\le Ct^{1/2}\le CT^{1/2}.<br>$$<br>Hence $\tilde x^\tau$ is also equibounded. Furthermore, since $x^\tau(t)=\tilde x^\tau(s)$ for a certain $s=k\tau$ with $|s-t|\le \tau$, we have<br>$$<br>|x^\tau(t)-\tilde x^\tau(t)|\le C\tau^{1/2}.<br>\tag{8}<br>$$</p><p><strong>Proposition 3.</strong> Let $\tilde x^\tau$, $x^\tau$ and $v^\tau$ be constructed as above using the minimizing movement scheme. Suppose that $F(x_0)&lt;+\infty$, that $\inf F&gt;-\infty$, and that $F$ is lower semicontinuous. 
Then, up to a subsequence $\tau_j\to 0$ (still denoted by $\tau$), both $\tilde x^\tau$ and $x^\tau$ converge uniformly to the same curve $x\in H^1$, and $v^\tau$ weakly converges in $L^2$ to a vector function $v$ such that $x’=v$ and</p><ol><li><p>if $F$ is $\lambda$-convex, we have<br>$$<br>v(t)\in -\partial F(x(t)) \qquad \text{for a.e. } t,<br>$$<br>i.e. $x$ is a solution of (2);</p></li><li><p>if $F$ is $C^1$, we have<br>$$<br>v(t)=-\nabla F(x(t)) \qquad \text{for all } t,<br>$$<br>i.e. $x$ is a solution of (1).</p></li></ol><p><strong><em>Proof.</em></strong> By the above analysis, $\tilde x^\tau$ is continuous, equicontinuous and equibounded. By applying the Arzelà–Ascoli theorem, we get a uniformly converging subsequence (still denoted by $\tau$)<br>$$<br>\tilde x^{\tau}\to x<br>\qquad \text{as } \tau\to 0.<br>$$</p><p>By (8), $x^\tau$ also uniformly converges to the same limit $x$. Moreover by (6), we know that $v^\tau$ is equibounded in $L^2$, which is a reflexive Banach space. Thus, up to an extra subsequence extraction (still denoted by $\tau$),<br>$$<br>v^\tau \rightharpoonup v \qquad \text{in } L^2 \quad \text{as } \tau\to 0.<br>$$</p><p>Therefore, as a consequence of distributional convergence,<br>$$<br>x’=v.<br>$$</p><p>To prove (1), fix a point $y\in \mathbb{R}^n$. 
By $v^\tau(t)\in -\partial F(x^\tau(t))$, we have<br>$$<br>F(y)\ge F(x^\tau(t)) - v^\tau(t)\cdot (y-x^\tau(t)) + \frac{\lambda}{2}|y-x^\tau(t)|^2.<br>$$</p><p>For every nonnegative measurable function $a:[0,T]\to \mathbb{R}_+$,<br>$$<br>\int_0^T a(t)\left( F(y)-F(x^\tau(t))+v^\tau(t)\cdot (y-x^\tau(t))-\frac{\lambda}{2}|y-x^\tau(t)|^2 \right)\ dt \ge 0.<br>$$</p><p>Since $x^\tau\to x$ uniformly (and hence strongly in $L^2$), $v^\tau\rightharpoonup v$ weakly, and $F$ is lower semicontinuous, letting $\tau\to 0$ we get<br>$$<br>\int_0^T a(t)\left( F(y)-F(x(t))+v(t)\cdot (y-x(t))-\frac{\lambda}{2}|y-x(t)|^2 \right)\ dt \ge 0.<br>$$</p><p>From the arbitrariness of $a$, we get<br>$$<br>F(y)\ge F(x(t)) - v(t)\cdot (y-x(t)) + \frac{\lambda}{2}|y-x(t)|^2<br>\qquad \text{for a.e. } t.<br>$$</p><p>i.e. there exists a negligible set $N_y$ such that<br>$$<br>G_y(t):=F(y)-F(x(t))+v(t)\cdot (y-x(t))-\frac{\lambda}{2}|y-x(t)|^2 \ge 0<br>\qquad \text{for } t\notin N_y.<br>$$</p><p>Let $D$ be a countable dense set in the interior of $\operatorname{dom}F$, where $F$ is continuous, and<br>$$<br>N=\bigcup_{y\in D}N_y<br>$$<br>which is also negligible. For all $t\notin N$, and all $y\in \operatorname{dom}F$, there exists $y_m\in D$ such that $y_m\to y$ as $m\to\infty$ and then<br>$$<br>G_y(t)=\lim_{m\to\infty} G_{y_m}(t)\ge 0<br>\qquad \text{for all } t\notin N.<br>$$<br>Hence,<br>$$<br>v(t)\in -\partial F(x(t)).<br>$$</p><p>To prove (2), we have<br>$$<br>-\nabla F(x^\tau(t))=v^\tau(t)=(\tilde x^\tau)’(t).<br>$$<br>Since $x^\tau\to x$ uniformly, $F\in C^1$, and $v^\tau\rightharpoonup v$, we get<br>$$<br>v(t)=-\nabla F(x(t)) \qquad \text{a.e. } t.<br>$$<br>Since $t\mapsto -\nabla F(x(t))$ is uniformly continuous, the above equality holds for all $t$. 
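</p><p>The convergence just proved can also be observed numerically. For the quadratic energy $F(x)=x^2/2$ (an illustrative choice, not from the text), each step of scheme (3) has the closed form $x_{k+1}^\tau=x_k^\tau/(1+\tau)$, and the scheme approaches the exact gradient flow $x(t)=x_0e^{-t}$ as $\tau\to 0$; a minimal Python sketch:</p>

```python
import math

def mms_step(x, tau):
    # Minimizing movement / implicit Euler step for F(x) = x**2/2:
    # the optimality condition x_new + (x_new - x)/tau = 0
    # gives x_new = x / (1 + tau).
    return x / (1.0 + tau)

def mms_value(x0, tau, T):
    # Iterate the scheme up to time T and return the final point.
    x = x0
    for _ in range(int(round(T / tau))):
        x = mms_step(x, tau)
    return x

x0, T = 1.0, 1.0
exact = x0 * math.exp(-T)              # solution of x' = -x, x(0) = x0
err_coarse = abs(mms_value(x0, 0.1, T) - exact)
err_fine = abs(mms_value(x0, 0.001, T) - exact)
```

<p>Refining the step from $\tau=0.1$ to $\tau=0.001$ shrinks the error by roughly two orders of magnitude, consistent with the first-order accuracy of the implicit Euler scheme.</p><p>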
$\square$</p><p><strong>Remark.</strong> If the sequence $x_k^\tau$ is defined by</p><p>$$<br>x_{k+1}^\tau \in \operatorname*{argmin} _x \left\{ 2F\left(\frac{x+x _k^\tau}{2}\right)+\frac{|x-x _k^\tau |^2}{2\tau} \right\},<br>$$</p><p>then we have</p><p>$$<br>\frac{x_{k+1}^\tau-x_k^\tau}{\tau}<br>=<br>-\nabla F\left(\frac{x_{k+1}^\tau+x_k^\tau}{2}\right),<br>$$</p><p>and the convergence is of order $\tau^2$.</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><p>Santambrogio, F. Euclidean, metric, and Wasserstein gradient flows: an overview. <em>Bull. Math. Sci.</em> <strong>7</strong>, 87–154 (2017).</p><blockquote><p><em>The cover image in this article was taken in Hobbiton, New Zealand.</em></p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/04/13/gradient-flow-1/</id>
    <link href="https://handsteinwang.github.io/2026/04/13/gradient-flow-1/"/>
    <published>2026-04-12T16:00:00.000Z</published>
    <summary>
      <![CDATA[<p>In this article, we will introduce some basic results of gradient flows in the Euclidean space.</p>]]>
    </summary>
    <title>Gradient Flows in the Euclidean Space</title>
    <updated>2026-04-12T16:00:00.000Z</updated>
  </entry>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Optimal Transport" scheme="https://handsteinwang.github.io/categories/Optimal-Transport/"/>
    <category term="Optimal Transport" scheme="https://handsteinwang.github.io/tags/Optimal-Transport/"/>
    <content>
<![CDATA[<p>In this article, we will introduce monotone transport maps and plans in 1D.</p><span id="more"></span><p><strong>Definition 2.1</strong> Given a nondecreasing and right continuous function $F:\mathbb{R}\to[0,1]$, its pseudo-inverse is the function $F^{[-1]}:[0,1]\to\overline{\mathbb{R}}$ given by<br>$$<br>F^{[-1]}(x)=\inf\{t\in\mathbb{R}\mid F(t)\ge x\}.<br>$$<br>Thanks to the right continuity of $F$, the infimum is attained (and is thus a minimum) as soon as the set is nonempty.</p><p><strong>Lemma 2.1</strong> We have</p><ol><li>$F^{[-1]}(x)\le a \iff F(a)\ge x$,</li><li>$F^{[-1]}(x)&gt;a \iff F(a)&lt;x$.</li></ol><p><strong><em>Proof</em></strong> : We only prove (1). If $F^{[-1]}(x)\le a$, then the set<br>$$<br>\{t\in\mathbb{R}\mid F(t)\ge x\}\neq\varnothing<br>$$<br>and the infimum is attained. Suppose that $F(a)&lt;x$. Then for each $b\le a$, $F(b)\le F(a)&lt;x$, hence<br>$$<br>b\notin\{t\in\mathbb{R}\mid F(t)\ge x\}\Rightarrow (-\infty,a]\subset\{t\in\mathbb{R}\mid F(t)&lt;x\}.<br>$$<br>Hence by definition, $F^{[-1]}(x)&gt;a$, which gives a contradiction. Conversely, if $F(a)\ge x$, it is obvious that $F^{[-1]}(x)\le a$. 
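</p><p>The pseudo-inverse is easy to experiment with on a discrete measure. The Python sketch below (the three-atom measure is an illustrative choice, not from the text) computes $F^{[-1]}$ for a measure on $\{0,1,2\}$ and checks the equivalence (1) of Lemma 2.1 on a grid of values.</p>

```python
import bisect

# A small discrete measure on {0, 1, 2} with weights 0.2, 0.5, 0.3
# (an illustrative choice). Its CDF F is right continuous and piecewise
# constant, with values cdf_vals at the atoms.
atoms = [0.0, 1.0, 2.0]
weights = [0.2, 0.5, 0.3]
cdf_vals = [0.2, 0.7, 1.0]

def F(t):
    # cumulative distribution function F(t) = mu((-inf, t])
    return sum(w for a, w in zip(atoms, weights) if a <= t)

def F_inv(x):
    # pseudo-inverse inf { t : F(t) >= x }; for a discrete measure the
    # infimum is attained at the first atom whose cumulative mass reaches x
    return atoms[bisect.bisect_left(cdf_vals, x)]

# Lemma 2.1 (1): F_inv(x) <= a  iff  F(a) >= x
ok = all((F_inv(x) <= a) == (F(a) >= x)
         for x in (0.1, 0.2, 0.5, 0.7, 0.9, 1.0)
         for a in (-1.0, 0.0, 0.5, 1.0, 1.5, 2.0))
```

<p>Right continuity is what makes the infimum attained: for instance $\{t: F(t)\ge 0.2\}=[0,+\infty)$ is closed, so $F^{[-1]}(0.2)=0$.</p><p>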
$\square$</p><p><strong>Proposition 2.1</strong> If $\mu\in\mathcal{P}(\mathbb{R})$ and $F_\mu^{[-1]}$ is the pseudo-inverse of its cumulative distribution function $F_\mu$, then</p><p>$$<br>(F_\mu^{[-1]}) _{\sharp}\big(\mathcal{L}^1\llcorner[0,1]\big) =\mu.<br>$$</p><p>Moreover, given $\mu,\nu\in\mathcal{P}(\mathbb{R})$, if we set<br>$$<br>\eta:=\big(F_\mu^{[-1]},F_\nu^{[-1]}\big)_{\sharp}\big(\mathcal{L}^1\llcorner[0,1]\big),<br>$$<br>then $\eta\in\Pi(\mu,\nu)$ and<br>$$<br>\eta\big((-\infty,a]\times(-\infty,b]\big)=F_\mu(a)\wedge F_\nu(b).<br>$$</p><p><strong><em>Proof</em></strong> : By Lemma 2.1,<br>$$<br>\mathcal{L}^1\big(\{x\in[0,1]\ ;\ F_\mu^{[-1]}(x)\le a\}\big)<br>=\mathcal{L}^1\big(\{x\in[0,1]\ ;\ x\le F_\mu(a)\}\big)=F_\mu(a).<br>$$<br>Hence<br>$$<br>\{(-\infty,a]\ ;\ a\in\mathbb{R}\}\subset \mathcal{G}:=\{A\in\mathcal{B}(\mathbb{R})\ ;\ (\mathcal{L}^1\llcorner[0,1])((F_\mu^{[-1]})^{-1}(A))=\mu(A)\}.<br>$$<br>By the monotone class theorem, we get $\mathcal{G}=\mathcal{B}(\mathbb{R})$, which gives the desired result. Moreover $\eta\in\Pi(\mu,\nu)$ is just a consequence of the above statement.</p><p>Further,<br>$$<br>\eta((-\infty,a]\times(-\infty,b])<br>=\mathcal{L}^1\big(\{x\in[0,1]\ ;\ F_\mu^{[-1]}(x)\le a,\ F_\nu^{[-1]}(x)\le b\}\big)<br>$$<br>and by Lemma 2.1 this equals<br>$$<br>\mathcal{L}^1\big(\{x\in[0,1]\ ;\ x\le F_\mu(a),\ x\le F_\nu(b)\}\big)<br>=F_\mu(a)\wedge F_\nu(b),<br>$$<br>which is the desired equality. $\square$</p><p><strong>Definition 2.2</strong> We will call the transport plan</p><p>$$<br>\eta:=\big(F_\mu^{[-1]},F_\nu^{[-1]}\big) _{\sharp}\big(\mathcal{L}^1\llcorner[0,1]\big)<br>$$</p><p>the co-monotone transport plan between $\mu$ and $\nu$ and denote it by<br>$$<br>\gamma_{\mathrm{mon}}:=\eta.<br>$$</p><p><strong>Lemma 2.2</strong> If $\mu\in\mathcal{P}(\mathbb{R})$ is atomless, then $(F_\mu)_{\sharp}\mu=\mathcal{L}^1\llcorner[0,1]$. 
As a consequence, for every $\ell\in[0,1]$, the set $\{x:F_\mu(x)=\ell\}$ is $\mu$-negligible.</p><p><strong><em>Proof</em></strong> : First, $F_\mu$ is continuous since $\mu$ is atomless. Hence for $a\in(0,1)$, the set<br>$$<br>\{x\ ;\ F_\mu(x)\le a\}=(-\infty,x_a],<br>$$<br>where $F_\mu(x_a)=a$. Hence<br>$$<br>\mu\big(F_\mu^{-1}((-\infty,a])\big)=\mu((-\infty,x_a])=F_\mu(x_a)=a.<br>$$<br>Therefore</p><p>$$<br>(F_\mu) _{\sharp}\mu=\mathcal{L}^1\llcorner[0,1].<br>$$</p><p>As a consequence, for $\ell\in[0,1]$, the set<br>$$<br>\{x\ ;\ F_\mu(x)=\ell\}<br>$$<br>is $\mu$-negligible. Indeed, if<br>$$<br>\mu(\{x\ ;\ F_\mu(x)=\ell\})&gt;0,<br>$$<br>then<br>$$<br>0=\mathcal{L}^1(\{\ell\})=(F_\mu)_{\sharp}\mu(\{\ell\})&gt;0,<br>$$<br>which is a contradiction. $\square$</p><p><strong>Theorem 2.1</strong> Given $\mu,\nu\in\mathcal{P}(\mathbb{R})$, suppose that $\mu$ is atomless. Then there exists a nondecreasing map $T_{\mathrm{mon}}:\mathbb{R}\to\mathbb{R}$, unique up to $\mu$-a.e. equivalence, such that<br>$$<br>(T_{\mathrm{mon}})_{\sharp}\mu=\nu.<br>$$</p><p><strong><em>Proof</em></strong> : Define<br>$$<br>T_{\mathrm{mon}}(x):=F_\nu^{[-1]}(F_\mu(x)).<br>$$<br>By Lemma 2.2, it is easy to see that $T_{\mathrm{mon}}$ is well-defined. The fact that $T_{\mathrm{mon}}$ is monotone nondecreasing is obvious. Now, we only need to show $(T_{\mathrm{mon}})_{\sharp}\mu=\nu$.</p><p>In fact,</p><p>$$<br>(T_{\mathrm{mon}}) _{\sharp}\mu<br>=(F_\nu^{[-1]}\circ F_\mu) _{\sharp}\mu<br>=(F_\nu^{[-1]}) _{\sharp}\big((F_\mu) _{\sharp}\mu\big)<br>=(F_\nu^{[-1]}) _{\sharp}\big(\mathcal{L}^1\llcorner[0,1]\big)=\nu.<br>$$</p><p>Finally, we need to show the uniqueness. Let $T$ be any nondecreasing map with $T_{\sharp}\mu=\nu$. From monotonicity,<br>$$<br>T^{-1}((-\infty,T(x)])\supset (-\infty,x].<br>$$<br>We deduce<br>$$<br>F_\mu(x)=\mu((-\infty,x])\le \mu\big(T^{-1}((-\infty,T(x)])\big)<br>=\nu((-\infty,T(x)])=F_\nu(T(x)),<br>$$<br>which means<br>$$<br>T(x)\ge F_\nu^{[-1]}(F_\mu(x)).<br>$$<br>Suppose $T(x)&gt;F_\nu^{[-1]}(F_\mu(x))$. 
This means there exists $\varepsilon_0&gt;0$ such that<br>$$<br>F_\nu(T(x)-\varepsilon)\ge F_\mu(x)\qquad\text{for all }\varepsilon\in(0,\varepsilon_0).<br>$$<br>On the other hand, from<br>$$<br>T^{-1}((-\infty,T(x)-\varepsilon])\subset (-\infty,x],<br>$$<br>we get<br>$$<br>F_\nu(T(x)-\varepsilon)=\nu((-\infty,T(x)-\varepsilon])<br>=\mu\big(T^{-1}((-\infty,T(x)-\varepsilon])\big)\le \mu((-\infty,x])=F_\mu(x).<br>$$<br>Therefore<br>$$<br>F_\nu(T(x)-\varepsilon)=F_\mu(x)\qquad\text{for every }\varepsilon\in(0,\varepsilon_0).<br>$$<br>Thus $F_\mu(x)$ is a value that $F_\nu$ takes on an interval where it is constant. There are at most countably many such intervals, hence the values of $F_\nu$ on those intervals are also countable; let us call these values $\ell_i$. As a consequence, the points $x$ where<br>$$<br>T(x)&gt;F_\nu^{[-1]}(F_\mu(x))<br>$$<br>are contained in<br>$$<br>\bigcup_i \{x\ ;\ F_\mu(x)=\ell_i\},<br>$$<br>which is $\mu$-negligible. Thus, $T(x)=F_\nu^{[-1]}(F_\mu(x))$ $\mu$-a.e. $\square$</p><p><strong>Lemma 2.3</strong> Let $\gamma\in\Pi(\mu,\nu)$ be a transport plan between two measures $\mu,\nu\in\mathcal{P}(\mathbb{R})$. Suppose that it satisfies the property:<br>$$<br>(x,y),(x’,y’)\in\operatorname{Supp}(\gamma),\ x&lt;x’ \Rightarrow y\le y’. \tag{1}<br>$$<br>Then we have $\gamma=\gamma_{\mathrm{mon}}$. In particular, there is a unique $\gamma$ satisfying (1). 
Moreover, if $\mu$ is atomless, then $\gamma$ is induced by the map $T_{\mathrm{mon}}$, i.e. $\gamma=(\mathrm{id},T_{\mathrm{mon}})_{\sharp}\mu$.</p><p><strong><em>Proof</em></strong> : First, we need to prove<br>$$<br>\gamma((-\infty,a]\times(-\infty,b])=F_\mu(a)\wedge F_\nu(b).<br>$$<br>Consider<br>$$<br>A:=(-\infty,a]\times(b,+\infty),\qquad B:=(a,+\infty)\times(-\infty,b].<br>$$<br>Claim: It is not possible to have both $\gamma(A)&gt;0$ and $\gamma(B)&gt;0$.</p><p>Indeed, if both $\gamma(A)&gt;0$ and $\gamma(B)&gt;0$, then for some $(x,y)\in A$ and $(x’,y’)\in B$ we have<br>$$<br>x&lt;x’,\qquad y&gt;y’,<br>$$<br>with $(x,y),(x’,y’)\in\operatorname{Supp}(\gamma)$, which contradicts (1). Therefore, at least one of $\gamma(A)$ and $\gamma(B)$ vanishes. Since<br>$$<br>\gamma((-\infty,a]\times(-\infty,b])<br>=F_\mu(a)-\gamma(A)<br>=F_\nu(b)-\gamma(B),<br>$$<br>and since $\gamma((-\infty,a]\times(-\infty,b])\le F_\mu(a)\wedge F_\nu(b)$ always holds, we get<br>$$<br>\gamma((-\infty,a]\times(-\infty,b])=F_\mu(a)\wedge F_\nu(b).<br>$$<br>Hence $\gamma=\gamma_{\mathrm{mon}}$.</p><p>Now, assume $\mu$ is atomless, and define $I_x$ as the minimal interval such that<br>$$<br>\operatorname{Supp}(\gamma)\cap(\{x\}\times\mathbb{R})\subset \{x\}\times I_x.<br>$$<br>The assumption (1) shows that the interiors of these intervals are disjoint. There can be at most countably many points $x$ such that $I_x$ is not a singleton. Since $\mu$ is atomless, we can define a map $T$ such that $\gamma$ is concentrated on the graph of $T$, $\mu$-a.e. By Theorem 2.1, $T=T_{\mathrm{mon}}$. $\square$</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><p>Santambrogio, Filippo. <em>Optimal transport for applied mathematicians</em>. Birkhäuser, New York (2015).</p><blockquote><p>The cover image of this article was taken in Sydney, Australia.</p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/04/07/OT_7/</id>
    <link href="https://handsteinwang.github.io/2026/04/07/OT_7/"/>
    <published>2026-04-06T16:00:00.000Z</published>
    <summary>
      <![CDATA[<p>In this article, we will introduce monotone transport maps and plans in 1D.</p>]]>
    </summary>
    <title>Monotone Transport Maps and Plans in 1D</title>
    <updated>2026-04-06T16:00:00.000Z</updated>
  </entry>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Reproducing Kernel Hilbert Space" scheme="https://handsteinwang.github.io/categories/Reproducing-Kernel-Hilbert-Space/"/>
    <category term="Reproducing Kernel Hilbert Space" scheme="https://handsteinwang.github.io/tags/Reproducing-Kernel-Hilbert-Space/"/>
    <category term="Functional Analysis" scheme="https://handsteinwang.github.io/tags/Functional-Analysis/"/>
    <content>
      <![CDATA[<p>In this article we will introduce the definition of Reproducing Kernel Hilbert Space (RKHS).</p><span id="more"></span><p>Let $E$ be a non empty set. Let $\mathcal H$ be a vector space of functions defined on $E$ and taking their values in the set $\mathbb C$ of complex numbers. $\mathcal H$ is endowed with the structure of Hilbert space defined by an inner product $\langle \cdot,\cdot\rangle_{\mathcal H}$,</p><p>$$<br>\begin{aligned}<br>\mathcal H\times \mathcal H &amp;\longrightarrow \mathbb C\\\<br>(\varphi,\psi)&amp;\longmapsto \langle \varphi,\psi\rangle_{\mathcal H}.<br>\end{aligned}<br>$$</p><p>Let $\Vert\cdot\Vert_{\mathcal H}$ denote the associated norm,</p><p>$$<br>\forall \varphi\in\mathcal H,\qquad \Vert\varphi\Vert_{\mathcal H}=\langle \varphi,\varphi\rangle_{\mathcal H}^{1/2}.<br>$$</p><p>For any $t\in E$, we will denote by $e_t$ the evaluation functional at the point $t$, i.e. the mapping</p><p>$$<br>\begin{aligned}<br>e_t:\mathcal H&amp;\longrightarrow \mathbb C\\\<br>g&amp;\longmapsto e_t(g):=g(t).<br>\end{aligned}<br>$$</p><p><strong>Definition 1.</strong> (Reproducing kernel) A function</p><p>$$<br>\begin{aligned}<br>K:E\times E&amp;\longrightarrow \mathbb C\\\<br>(s,t)&amp;\longmapsto K(s,t)<br>\end{aligned}<br>$$</p><p>is a reproducing kernel of the Hilbert space $\mathcal H$ if and only if</p><p>(a) $\forall t\in E,\quad K(\cdot,t)\in\mathcal H$,</p><p>(b) $\forall t\in E,\ \forall \varphi\in\mathcal H$,</p><p>$$<br>\langle \varphi,K(\cdot,t)\rangle_{\mathcal H}=\varphi(t).<br>$$</p><p><strong>Note.</strong> The property (b) is called the “reproducing property”: the value of $\varphi$ at point $t$ is reproduced by the inner product of $\varphi$ with $K(\cdot,t)$.</p><p><strong>Remark:</strong> It is clear that $\forall (s,t)\in E\times E$,</p><p>$$<br>K(s,t)=\langle K(\cdot,t),K(\cdot,s)\rangle.<br>$$</p><p>A Hilbert space of complex-valued functions which possesses a reproducing kernel is called a <strong>“reproducing kernel 
Hilbert space (RKHS)”</strong>, or a “proper Hilbert space”.</p><p><strong>Example 1:</strong> Let $e_1,\ldots,e_n$ be an orthonormal basis in $\mathcal H$ and define</p><p>$$<br>K(x,y)=\sum_{i=1}^n e_i(x)\overline{e_i(y)}.<br>$$</p><p>Then for any $y\in E$,</p><p>$$<br>K(\cdot,y)=\sum_{i=1}^n \overline{e_i(y)}\,e_i(\cdot)\in\mathcal H,<br>$$</p><p>and for any function</p><p>$$<br>\varphi(\cdot)=\sum_{i=1}^n \lambda_i e_i(\cdot)\in\mathcal H<br>$$</p><p>we have $\forall y\in E$,</p><p>$$<br>\langle \varphi,K(\cdot,y)\rangle_{\mathcal H}<br>=\left\langle \sum_{i=1}^n \lambda_i e_i(\cdot),\sum_{i=1}^n \overline{e_i(y)}\,e_i(\cdot)\right\rangle_{\mathcal H}<br>=\sum_{i=1}^n \lambda_i e_i(y)<br>=\varphi(y).<br>$$</p><p>Hence any finite-dimensional Hilbert space of functions has a reproducing kernel.</p><p><strong>Example 2:</strong> Let $E=\mathbb N$ be the set of positive integers, and let $\mathcal H=\ell^2(E)$ be the set of complex sequences $(x_i)$ such that</p><p>$$<br>\sum_{i\in\mathbb N} |x_i|^2 &lt;\infty.<br>$$</p><p>$\mathcal H$ is a Hilbert space with inner product, for $x=(x_i)$ and $y=(y_i)$,<br>$$<br>\langle x,y\rangle_{\ell^2(E)}=\sum_{i\in\mathbb N} x_i\overline{y_i}.<br>$$</p><p>Let $K(i,j)=\delta_{ij}$ (the Kronecker symbol). 
Then,</p><p>$$<br>\forall j\in\mathbb N,\qquad K(\cdot,j)=(0,\ldots,0,1,0,\ldots)\in\mathcal H\quad \text{($1$ at the $j$-th place),}<br>$$</p><p>and</p><p>$$<br>\forall j\in\mathbb N\quad \forall x=(x_i),<br>\quad<br>\langle x,K(\cdot,j)\rangle _{\mathcal H}<br>=\sum _{i\in\mathbb N} x_i\delta _{ij}<br>=x_j.<br>$$</p><p>Hence $K$ is the reproducing kernel of $\mathcal H$.</p><p><strong>Theorem 1.</strong> A Hilbert space of complex-valued functions on $E$ has a reproducing kernel if and only if all the evaluation functionals $e_t$, $t\in E$ are continuous on $\mathcal H$.</p><p><strong><em>Proof</em></strong> : If $\mathcal H$ has a reproducing kernel $K$, then for any $t\in E$, we have for all $\varphi\in\mathcal H$</p><p>$$<br>e_t(\varphi)=\langle \varphi,K(\cdot,t)\rangle_{\mathcal H}.<br>$$</p><p>Thus, $e_t$ is linear and</p><p>$$<br>|e_t(\varphi)|<br>\le \Vert\varphi\Vert_{\mathcal H}\Vert K(\cdot,t)\Vert_{\mathcal H}<br>=\Vert\varphi\Vert_{\mathcal H}\,\langle K(\cdot,t),K(\cdot,t)\rangle_{\mathcal H}^{1/2}<br>=\Vert\varphi\Vert_{\mathcal H}\,[K(t,t)]^{1/2}.<br>$$</p><p>Hence $e_t$ is continuous.</p><p>Conversely, if $e_t$, $t\in E$ are continuous, by Riesz’s representation theorem, there exists a function $N_t(\cdot)\in\mathcal H$ such that for all $\varphi\in\mathcal H$</p><p>$$<br>\varphi(t)=e_t(\varphi)=\langle \varphi,N_t(\cdot)\rangle_{\mathcal H}<br>\qquad \text{for all } t\in E.<br>$$</p><p>Hence $K(s,t)=N_t(s)$ is the reproducing kernel of $\mathcal H$. 
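</p><p>The reproducing property can be checked numerically in the finite-dimensional setting of Example 1. The sketch below (the orthonormal pair is an arbitrary illustrative choice) takes $\mathcal H=\operatorname{span}\{e_1,e_2\}$ inside the real-valued functions on $E=\{0,1,2\}$ with the $\ell^2$ inner product, builds $K(x,y)=\sum_i e_i(x)e_i(y)$, and verifies $\langle \varphi,K(\cdot,y)\rangle=\varphi(y)$.</p>

```python
import math

# An arbitrary orthonormal pair of real-valued functions on E = {0, 1, 2},
# represented by their value vectors; H = span{e1, e2}.
e1 = [1 / math.sqrt(2), 1 / math.sqrt(2), 0.0]
e2 = [1 / math.sqrt(6), -1 / math.sqrt(6), 2 / math.sqrt(6)]

def inner(f, g):
    # the ell^2 inner product on real-valued functions E -> R
    return sum(a * b for a, b in zip(f, g))

# kernel K(x, y) = sum_i e_i(x) * e_i(y), as a 3x3 table
K = [[e1[x] * e1[y] + e2[x] * e2[y] for y in range(3)] for x in range(3)]

# a function phi = 2*e1 - 3*e2 in H
phi = [2 * a - 3 * b for a, b in zip(e1, e2)]

# reproducing property: <phi, K(., y)> = phi(y) for every y in E
reproduced = [inner(phi, [K[x][y] for x in range(3)]) for y in range(3)]
max_err = max(abs(r - p) for r, p in zip(reproduced, phi))
```

<p>As a table, $K$ is exactly the matrix of the orthogonal projection onto $\mathcal H$, which is why the reproducing identity holds for members of $\mathcal H$ (but not for arbitrary functions on $E$).</p><p>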
$\square$</p><p><strong>Remark:</strong> From the proof above, we see that $e_t$ has norm $\Vert e_t\Vert_{\mathcal H’}=[K(t,t)]^{1/2}$.</p><p><strong>Corollary 1</strong> In an RKHS, a sequence converging in the norm sense converges pointwise to the same limit.</p><p><strong><em>Proof</em></strong> : If $\varphi_n \xrightarrow{\Vert\cdot\Vert_{\mathcal H}} \varphi$, then for each $t\in E$,</p><p>$$<br>|\varphi_n(t)-\varphi(t)|<br>=|e_t(\varphi_n)-e_t(\varphi)|<br>\le \Vert e_t\Vert\,\Vert \varphi_n-\varphi\Vert_{\mathcal H}<br>\to 0.<br>$$</p><p>Hence $\varphi_n\to\varphi$ pointwise. $\square$</p><p><strong>Question:</strong> When is a complex-valued function $K$ defined on $E\times E$ a reproducing kernel?</p><p><strong>Answer:</strong> If and only if $K$ is a positive type function.</p><p><strong>Definition 2.</strong> (Positive type function). A function $K:E\times E\to\mathbb C$ is called a positive type function (or a positive definite function) if</p><p>$$<br>\forall n\ge 1,\qquad \forall (a_1,\ldots,a_n)\in\mathbb C^n,\qquad \forall (x_1,\ldots,x_n)\in E^n,<br>$$</p><p>we have</p><p>$$<br>\sum_{i=1}^n\sum_{j=1}^n a_i\overline{a_j}\,K(x_i,x_j)\in \mathbb R^+,<br>$$</p><p>where $\mathbb R^+$ denotes the set of nonnegative real numbers.</p><p>It is worth noting that $K$ is a positive type function if and only if the matrix</p><p>$$<br>\bigl(K(x_i,x_j)\bigr)_{1\le i,j\le n}<br>$$</p><p>is positive semidefinite for any choice of $n\in\mathbb N$ and $(x_1,\ldots,x_n)\in E^n$.</p><p><strong>Examples.</strong></p><p>(1) Any constant non-negative function on $E\times E$ is of positive type.</p><p><strong><em>Proof</em></strong> : If $K:E\times E\to\mathbb C$ is given by</p><p>$$<br>K(s,t)=c\ge 0<br>$$</p><p>then for all $n\in\mathbb N$, $(a_1,\ldots,a_n)\in\mathbb C^n$, $(x_1,\ldots,x_n)\in E^n$,</p><p>$$<br>\sum_{i=1}^n\sum_{j=1}^n a_i\overline{a_j}\,K(x_i,x_j)<br>=c\sum_{i=1}^n\sum_{j=1}^n a_i\overline{a_j}<br>=c\left|\sum_{i=1}^n a_i\right|^2\in \mathbb R^+.<br>$$</p><p>(2) 
The delta function.</p><p>$$<br>\begin{aligned}<br>\delta:E\times E&amp;\longrightarrow \mathbb C\\\<br>(x,y)&amp;\longmapsto \delta_{xy}<br>=<br>\begin{cases}<br>1,&amp;\text{if }x=y, \\\<br>0,&amp;\text{if }x\ne y.<br>\end{cases}<br>\end{aligned}<br>$$</p><p>is of positive type.</p><p><strong><em>Proof</em></strong> : Let $n\in\mathbb N$, $(a_1,\ldots,a_n)\in\mathbb C^n$, $(x_1,\ldots,x_n)\in E^n$.</p><p>Let $\{\alpha_1,\ldots,\alpha_p\}$ be the set of different values among $x_1,\ldots,x_n$.</p><p>We can write</p><p>$$<br>\sum_{i=1}^n\sum_{j=1}^n a_i\overline{a_j}\,\delta_{x_i x_j}<br>=<br>\sum_{i=1}^n\sum_{x_j=x_i} a_i\overline{a_j}<br>=<br>\sum_{k=1}^p \sum_{x_i=x_j=\alpha_k} a_i\overline{a_j}<br>=<br>\sum_{k=1}^p \left|\sum_{x_j=\alpha_k} a_j\right|^2<br>\in \mathbb R^+.<br>$$</p><p>(3) The product $\alpha K$ of a positive type function $K$ with a non-negative constant $\alpha$ is a positive type function.</p><p><strong>Question:</strong> How to prove that a given function is of positive type?</p><p><strong>Lemma 1.</strong> Let $\mathcal H$ be some Hilbert space with inner product $\langle \cdot,\cdot\rangle_{\mathcal H}$ and let $\varphi:E\to\mathcal H$. Then the function</p><p>$$<br>\begin{aligned}<br>K:E\times E&amp;\longrightarrow \mathbb C\\\<br>(x,y)&amp;\longmapsto K(x,y)=\langle \varphi(x),\varphi(y)\rangle_{\mathcal H}<br>\end{aligned}<br>$$</p><p>is of positive type.</p><p><strong><em>Proof</em></strong> : The conclusion follows easily from the following equalities</p><p>$$<br>\sum_{i=1}^n\sum_{j=1}^n a_i\overline{a_j}\,K(x_i,x_j)<br>=<br>\sum_{i=1}^n\sum_{j=1}^n \langle a_i\varphi(x_i),a_j\varphi(x_j)\rangle_{\mathcal H}<br>=<br>\left\Vert \sum_{i=1}^n a_i\varphi(x_i)\right\Vert_{\mathcal H}^2<br>\in \mathbb R^+.\quad \square<br>$$</p><p>Lemma 1 tells us that writing</p><p>$$<br>K(x,y)=\langle \varphi(x),\varphi(y)\rangle_{\mathcal H}<br>$$</p><p>in some space $\mathcal H$ is sufficient to prove that $K$ is of positive type. 
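</p><p>Lemma 1 can be tried out directly. With the illustrative feature map $\varphi(x)=(1,x,x^2)$ (a hypothetical choice, not from the text), the kernel $K(x,y)=\langle \varphi(x),\varphi(y)\rangle$ must produce nonnegative quadratic forms $\sum_{i,j}a_i\overline{a_j}\,K(x_i,x_j)$; a real-valued Python sketch:</p>

```python
# Kernel built from an illustrative feature map phi(x) = (1, x, x**2),
# as in Lemma 1 (real-valued case).
def phi(x):
    return (1.0, x, x * x)

def K(x, y):
    # K(x, y) = <phi(x), phi(y)>
    return sum(a * b for a, b in zip(phi(x), phi(y)))

points = [-1.0, 0.0, 0.5, 2.0]
gram = [[K(x, y) for y in points] for x in points]

def quad_form(coeffs):
    # sum_{i,j} a_i * a_j * K(x_i, x_j); equals |sum_i a_i phi(x_i)|^2 >= 0
    return sum(a * b * gram[i][j]
               for i, a in enumerate(coeffs)
               for j, b in enumerate(coeffs))

trials = [(1.0, -2.0, 3.0, -4.0), (0.5, 0.5, 0.5, 0.5), (-1.0, 1.0, -1.0, 1.0)]
min_value = min(quad_form(c) for c in trials)
```

<p>Every such quadratic form is a squared norm in the feature space, so it is nonnegative regardless of the chosen coefficients and points.</p><p>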
</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><p>Berlinet, A., &amp; Thomas-Agnan, C. (2011). <em>Reproducing kernel Hilbert spaces in probability and statistics</em>. Springer Science &amp; Business Media.</p><blockquote><p>The cover image of this article was taken at Lake Tekapo, New Zealand.</p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/04/06/Definition-of-Reproducing-Kernel-Hilbert-Space/</id>
    <link href="https://handsteinwang.github.io/2026/04/06/Definition-of-Reproducing-Kernel-Hilbert-Space/"/>
    <published>2026-04-05T16:00:00.000Z</published>
    <summary>
      <![CDATA[<p>In this article we will introduce the definition of Reproducing Kernel Hilbert Space (RKHS).</p>]]>
    </summary>
    <title>Definition of Reproducing Kernel Hilbert Space</title>
    <updated>2026-04-05T16:00:00.000Z</updated>
  </entry>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Optimal Transport" scheme="https://handsteinwang.github.io/categories/Optimal-Transport/"/>
    <category term="Optimal Transport" scheme="https://handsteinwang.github.io/tags/Optimal-Transport/"/>
    <content>
      <![CDATA[<p>In this article, we will show some sufficient conditions for optimality and stability.</p><span id="more"></span><p><strong>Theorem 1.12</strong> Let $\Omega \subset \mathbb{R}^d$ be compact and let $c$ be a $C^1$ cost function satisfying the twist condition on $\Omega \times \Omega$. Suppose that $\mu \in \mathcal{P}(\Omega)$ and $\varphi \in c\text{-conc}(\Omega)$ are given, that $\varphi$ is differentiable $\mu$-a.e. and that $\mu(\partial \Omega)=0$. Suppose that the map $T$ satisfies<br>$$<br>\nabla_x c(x,T(x))=\nabla \varphi(x).<br>$$<br>Then $T$ is optimal for the transport cost $c$ between the measures $\mu$ and<br>$$<br>\nu:=T_{\sharp}\mu.<br>$$</p><p><strong><em>Proof</em></strong> : Since $\varphi \in c\text{-conc}(\Omega)$, we have $\varphi^{cc}=\varphi$; taking $\psi=\varphi^c$, we have $\varphi=\psi^c$.</p><p>Fix $x_0\in \Omega$ such that $\nabla \varphi(x_0)$ exists and $x_0\notin \partial \Omega$. By compactness and continuity, there exists $y_0\in \Omega$ such that<br>$$<br>\varphi(x_0)=\inf_y [c(x_0,y)-\psi(y)]=c(x_0,y_0)-\psi(y_0).<br>$$<br>This gives<br>$$<br>\varphi(x)\leq c(x,y_0)-\psi(y_0)<br>$$<br>and<br>$$<br>\varphi(x_0)=c(x_0,y_0)-\psi(y_0),<br>$$<br>and hence<br>$$<br>x\mapsto c(x,y_0)-\varphi(x)<br>$$<br>is minimal at $x=x_0$. As a consequence, we get<br>$$<br>\nabla \varphi(x_0)=\nabla_x c(x_0,y_0).<br>$$<br>By the twist condition, $y\mapsto \nabla_x c(x_0,y)$ is injective, and $\nabla_x c(x_0,T(x_0))=\nabla \varphi(x_0)$, so $y_0=T(x_0)$. This proves<br>$$<br>\varphi(x_0)+\psi(T(x_0))=c(x_0,T(x_0)),<br>$$<br>and this same equality is true for $\mu$-a.e. 
$x_0$.</p><p>Integrating with respect to $\mu$, we have<br>$$<br>\int_{\Omega} c(x,T(x))\ d\mu(x)<br>=<br>\int_{\Omega} \varphi(x)+\psi(T(x))\ d\mu(x)<br>=<br>\int_{\Omega} \varphi\ d\mu+\int_{\Omega} \psi\ d\nu.<br>$$<br>Hence<br>$$<br>\max(DP)\geq \int_{\Omega}\varphi\ d\mu+\int_{\Omega}\psi\ d\nu<br>=<br>\int_{\Omega} c(x,T(x))\ d\mu(x)<br>\geq \min(KP)\geq \max(DP).<br>$$<br>All the inequalities are therefore equalities, so $T$ is optimal. $\square$</p><p><strong>Theorem 1.13</strong> Suppose $\mu\in \mathcal{P}(\mathbb{R}^d)$ is such that<br>$$<br>\int_{\mathbb{R}^d} |x|^2\ d\mu(x)&lt;+\infty,<br>$$<br>and that $u:\mathbb{R}^d\to \mathbb{R}\cup\{+\infty\}$ is convex and differentiable $\mu$-a.e.</p><p>Set<br>$$<br>T=\nabla u<br>$$<br>and suppose<br>$$<br>\int_{\mathbb{R}^d} |T(x)|^2\ d\mu(x)&lt;+\infty.<br>$$<br>Then $T$ is optimal for the transport cost<br>$$<br>c(x,y):=\frac{1}{2}|x-y|^2<br>$$<br>between the measures $\mu$ and<br>$$<br>\nu:=T_{\sharp}\mu.<br>$$</p><p><strong><em>Proof</em></strong> : Note that for a convex function $u$<br>$$<br>u(x)+u^{\star}(y)\geq x\cdot y<br>$$<br>for all $x,y\in \mathbb{R}^d$ and<br>$$<br>u(x)+u^{\star}(y)=x\cdot y<br>$$<br>if $y=\nabla u(x)$.</p><p>Now consider $\gamma\in \Pi(\mu,\nu)$ and write $\gamma_T:=(\mathrm{id},T)_{\sharp}\mu$ for the plan induced by $T$. Then<br>$$<br>\int_{\mathbb{R}^d\times \mathbb{R}^d} x\cdot y\ d\gamma(x,y)<br>\leq<br>\int_{\mathbb{R}^d\times \mathbb{R}^d} (u(x)+u^{\star}(y))\ d\gamma(x,y)<br>=<br>\int_{\mathbb{R}^d} u(x)\ d\mu(x)+\int_{\mathbb{R}^d} u^{\star}(T(x))\ d\mu(x)=<br>\int_{\mathbb{R}^d} x\cdot T(x)\ d\mu(x)<br>=<br>\int_{\mathbb{R}^d\times \mathbb{R}^d} x\cdot y\ d\gamma_T.<br>$$</p><p>Therefore,<br>$$<br>\int_{\mathbb{R}^d\times \mathbb{R}^d} \frac{1}{2}|x-y|^2\ d\gamma(x,y)<br>=<br>\int_{\mathbb{R}^d\times \mathbb{R}^d} \frac{1}{2}(|x|^2+|y|^2)\ d\gamma(x,y)<br>-<br>\int_{\mathbb{R}^d\times \mathbb{R}^d} x\cdot y\ d\gamma(x,y)\geq<br>\int_{\mathbb{R}^d\times \mathbb{R}^d} \frac{1}{2}(|x|^2+|y|^2)\ d\gamma_T<br>-<br>\int_{\mathbb{R}^d\times \mathbb{R}^d} x\cdot y\ d\gamma_T=<br>\int_{\mathbb{R}^d\times 
\mathbb{R}^d} \frac{1}{2}|x-y|^2\ d\gamma_T,<br>$$</p><p>which shows $T$ is optimal. $\square$</p><p><strong>Theorem 1.14</strong> Suppose that $\gamma\in \mathcal{P}(X\times Y)$ is given, that $X$ and $Y$ are Polish spaces, that $c:X\times Y\to \mathbb{R}$ is uniformly continuous and bounded, and that $\operatorname{supp}(\gamma)$ is $c$-CM.</p><p>Then $\gamma$ is an optimal transport plan between its marginals<br>$$<br>\mu=(\pi_x) _{\sharp}\gamma<br>\quad \text{and} \quad<br>\nu=(\pi_y) _{\sharp}\gamma<br>$$<br>for the cost $c$.</p><p><strong><em>Proof</em></strong> : By Theorem 1.6, there exists a $c$-concave function $\varphi$ such that<br>$$<br>\operatorname{supp}(\gamma)\subset \{(x,y):\varphi(x)+\varphi^c(y)=c(x,y)\}<br>$$<br>and both $\varphi$ and $\varphi^c$ are continuous and bounded.</p><p>By Theorem 1.8, the duality result $\max(DP)=\min(KP)$ holds, hence<br>$$<br>\min(KP)\leq \int_{X\times Y} c(x,y)\ d\gamma<br>=<br>\int_{X\times Y} \varphi(x)+\varphi^c(y)\ d\gamma<br>=<br>\int_X \varphi\ d\mu+\int_Y \varphi^c\ d\nu\leq \max(DP)\leq \min(KP),<br>$$</p><p>which shows that $\gamma$ is optimal. 
$\square$</p><p><strong>Definition 1.9 (Hausdorff distance)</strong> For a compact metric space $X$, we define the Hausdorff distance on pairs of compact subsets of $X$ by<br>$$<br>d_H(A,B):=<br>\max\left\{<br>\max\{d(x,A):x\in B\},<br>\max\{d(x,B):x\in A\}<br>\right\}.<br>$$</p><p><strong>Remark</strong> : We have the following equivalent definitions.</p><ol><li><p>$d_H(A,B)=\max\{|d(x,A)-d(x,B)|:x\in X\}$.</p></li><li><p>$d_H(A,B)=\inf\{\varepsilon&gt;0:A\subset B_{\varepsilon},\ A_{\varepsilon}\supset B\}$, where $A_{\varepsilon},B_{\varepsilon}$ stand for the $\varepsilon$-neighborhoods of $A$ and $B$ respectively.</p></li></ol><p><strong>Theorem 1.15 (Blaschke)</strong> $d_H$ is a distance.</p><p><strong>Proposition 1.5</strong> If $d_H(A_n,A)\to 0$ and $\mu_n$ is a sequence of positive measures such that<br>$$<br>\operatorname{supp}(\mu_n)\subset A_n<br>$$<br>with $\mu_n\rightharpoonup \mu$, then<br>$$<br>\operatorname{supp}(\mu)\subset A.<br>$$</p><p><strong>Theorem 1.16</strong> Suppose that $X$ and $Y$ are compact metric spaces and that $c:X\times Y\to \mathbb{R}$ is continuous. Suppose that $\gamma_n\in \mathcal{P}(X\times Y)$ is a sequence of transport plans which are optimal for the cost $c$ between their own marginals</p><p>$$<br>\mu_n=(\pi_x) _{\sharp}\gamma_n<br>\quad \text{and} \quad<br>\nu_n=(\pi_y) _{\sharp}\gamma_n<br>$$<br>and suppose $\gamma_n\rightharpoonup \gamma$. Then</p><p>$$<br>\mu_n \to \mu:=(\pi_x) _{\sharp} \gamma,<br>\quad<br>\nu_n \to \nu:= (\pi_y) _{\sharp} \gamma,<br>$$</p><p>and $\gamma$ is optimal in the transport between $\mu$ and $\nu$.</p><p><strong><em>Proof</em></strong> : Set $\Gamma_n:=\operatorname{supp}(\gamma_n)$. Up to a subsequence, we can assume $\Gamma_n\to \Gamma$ in Hausdorff distance. 
Each $\Gamma_n$ is a $c$-CM set; we now claim that $\Gamma$ is also a $c$-CM set.</p><p>Fix a permutation $\sigma$ and points $(x_1,y_1),\dots,(x_k,y_k)\in \Gamma$. Since $\Gamma_n\to \Gamma$ in Hausdorff distance, there are points $(x_1^n,y_1^n),\dots,(x_k^n,y_k^n)\in \Gamma_n$ such that<br>$$<br>(x_i^n,y_i^n)\to (x_i,y_i),\qquad i=1,2,\dots,k.<br>$$<br>The cyclical monotonicity of $\Gamma_n$ gives<br>$$<br>\sum_{i=1}^k c(x_i^n,y_i^n)\leq \sum_{i=1}^k c(x_i^n,y_{\sigma(i)}^n).<br>$$<br>Letting $n\to\infty$,<br>$$<br>\sum_{i=1}^k c(x_i,y_i)\leq \sum_{i=1}^k c(x_i,y_{\sigma(i)}).<br>$$<br>Hence $\Gamma$ is a $c$-CM set. Since $\gamma_n\rightharpoonup \gamma$ and $\Gamma_n\to \Gamma$ in Hausdorff distance, Proposition 1.5 gives<br>$$<br>\operatorname{supp}(\gamma)\subset \Gamma.<br>$$<br>Hence $\operatorname{supp}(\gamma)$ is itself a $c$-CM set, and Theorem 1.14 implies the optimality of $\gamma$. $\square$</p><p><strong>Notation</strong> : For a given cost $c:X\times Y\to \mathbb{R}$ and $\mu\in \mathcal{P}(X)$, $\nu\in \mathcal{P}(Y)$, define<br>$$<br>J_c(\mu,\nu):=<br>\min\left\{<br>\int_{X\times Y} c(x,y)\ d\gamma;\ \gamma\in \Pi(\mu,\nu)<br>\right\}.<br>$$</p><p><strong>Theorem 1.17</strong> Suppose that $X$ and $Y$ are compact metric spaces and that $c:X\times Y\to \mathbb{R}$ is continuous. Suppose that $\mu_n\in \mathcal{P}(X)$, $\nu_n\in \mathcal{P}(Y)$ with<br>$$<br>\mu_n\to \mu<br>\qquad \text{and} \qquad<br>\nu_n\to \nu.<br>$$<br>Then we have<br>$$<br>J_c(\mu_n,\nu_n)\to J_c(\mu,\nu).<br>$$</p><p><strong><em>Proof</em></strong> : Let $\gamma_n$ be an optimal transport plan from $\mu_n$ to $\nu_n$ for the cost $c$. Up to subsequences, we can assume<br>$$<br>\gamma_n\rightharpoonup \gamma.<br>$$<br>By Theorem 1.16, $\gamma$ is optimal in the transport from $\mu$ to $\nu$. This means<br>$$<br>J_c(\mu_n,\nu_n)=\int_{X\times Y} c(x,y)\ d\gamma_n<br>\to<br>\int_{X\times Y} c\ d\gamma<br>=<br>J_c(\mu,\nu),<br>$$<br>which proves the claim, since the limit does not depend on the chosen subsequence. $\square$</p><p><strong>Theorem 1.18</strong> Suppose that $X$ and $Y$ are compact metric spaces, and that $c:X\times Y\to \mathbb{R}$ is continuous. 
Suppose that $\mu_n\in \mathcal{P}(X)$ and $\nu_n\in \mathcal{P}(Y)$, with<br>$$<br>\mu_n\to \mu<br>\qquad \text{and} \qquad<br>\nu_n\to \nu.<br>$$<br>Let $(\varphi_n,\psi_n)$ be a pair of $c$-concave Kantorovich potentials for the cost $c$ in the transport from $\mu_n$ to $\nu_n$. Then, up to a subsequence, we have<br>$$<br>\varphi_n\to \varphi<br>\qquad \text{and} \qquad<br>\psi_n\to \psi<br>$$<br>where the convergence is uniform and $(\varphi,\psi)$ is a pair of Kantorovich potentials for $\mu$ and $\nu$.</p><p><strong><em>Proof</em></strong> : It is easy to see that $\{\varphi_n\}$ and $\{\psi_n\}$ are equibounded and equicontinuous.</p><p>By the Arzelà–Ascoli theorem, we have<br>$$<br>\varphi_n\to \varphi,\qquad \psi_n\to \psi<br>$$<br>up to a subsequence, and the convergence is uniform.</p><p>Since<br>$$<br>\varphi_n(x)+\psi_n(y)\leq c(x,y),<br>$$<br>we have<br>$$<br>\varphi(x)+\psi(y)\leq c(x,y).<br>$$<br>Moreover<br>$$<br>J_c(\mu_n,\nu_n)<br>=<br>\int_X \varphi_n\ d\mu_n+\int_Y \psi_n\ d\nu_n<br>\to<br>\int_X \varphi\ d\mu+\int_Y \psi\ d\nu.<br>$$<br>By Theorem 1.17, we also have<br>$$<br>J_c(\mu_n,\nu_n)\to J_c(\mu,\nu).<br>$$<br>Hence<br>$$<br>J_c(\mu,\nu)=\int_X \varphi\ d\mu+\int_Y \psi\ d\nu,<br>$$<br>which implies that $(\varphi,\psi)$ is a pair of Kantorovich potentials. $\square$</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><p>Santambrogio, Filippo. <em>Optimal transport for applied mathematicians</em>. Birkhäuser, New York, 2015.</p><blockquote><p>The cover image of this article was taken in Sydney, Australia.</p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/04/05/OT_6/</id>
    <link href="https://handsteinwang.github.io/2026/04/05/OT_6/"/>
    <published>2026-04-04T16:00:00.000Z</published>
    <summary>
      <![CDATA[<p>In this article, we will show some sufficient conditions for optimality and stability.</p>]]>
    </summary>
    <title>Sufficient Conditions for Optimality and Stability</title>
    <updated>2026-04-04T16:00:00.000Z</updated>
  </entry>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Optimal Transport" scheme="https://handsteinwang.github.io/categories/Optimal-Transport/"/>
    <category term="Optimal Transport" scheme="https://handsteinwang.github.io/tags/Optimal-Transport/"/>
    <content>
      <![CDATA[<p>In this article, we will prove the duality result for the Kantorovich problem.</p><span id="more"></span><p><strong>Theorem 1.8</strong> Suppose that $X$ and $Y$ are Polish spaces, and that $c:X \times Y \to \mathbb{R}$ is uniformly continuous and bounded. Then the problem $(\mathrm{DP})$ admits a solution $(\varphi,\varphi^c)$ and we have<br>$$<br>\max (\mathrm{DP}) = \min (\mathrm{KP}).<br>$$</p><p><strong>Proof.</strong> Consider the minimization problem $(\mathrm{KP})$. Since $c$ is continuous, it admits a solution $\gamma$. By Theorem 1.7, the support $\Gamma=\operatorname{supp}(\gamma)$ is $c$-cyclically monotone. Since $c$ is real-valued, we can apply Theorem 1.6 and obtain the existence of a $c$-concave function $\varphi$ such that<br>$$<br>\Gamma \subset \{(x,y)\in X\times Y:\ \varphi(x)+\varphi^c(y)=c(x,y)\}.<br>$$</p><p>Since $c$ is uniformly continuous, $\varphi$ and $\varphi^c$ are continuous. Moreover, since $c$ is bounded, the relation</p><p>$$<br>\varphi^c(y)=\inf_{x\in X}[c(x,y)-\varphi(x)]<br>$$</p><p>yields a global upper bound on $\varphi^c$, which turns into a lower bound on $\varphi$. Symmetrically, we can also obtain upper bounds on $\varphi$ and lower bounds on $\varphi^c$, which proves that $\varphi$ and $\varphi^c$ are both continuous and bounded.</p><p>Hence, $(\varphi,\varphi^c)$ can be used as an admissible pair in $(\mathrm{DP})$. Consider now</p><p>$$<br>\int_X \varphi\ d\mu + \int_Y \varphi^c\ d\nu<br>= \int_{X\times Y} (\varphi(x)+\varphi^c(y))\ d\gamma<br>= \int_{\operatorname{supp}(\gamma)} (\varphi(x)+\varphi^c(y))\ d\gamma= \int_{\operatorname{supp}(\gamma)} c(x,y)\ d\gamma<br>= \int_{X\times Y} c(x,y)\ d\gamma.<br>$$</p><p>Hence</p><p>$$<br>\sup(\mathrm{DP})\ge \int_X \varphi\ d\mu+\int_Y \varphi^c\ d\nu<br>= \int_{X\times Y} c(x,y)\ d\gamma<br>= \min(\mathrm{KP}).<br>$$</p><p>Since we already saw $\sup(\mathrm{DP})\le \min(\mathrm{KP})$, we get $\max(\mathrm{DP})=\min(\mathrm{KP})$ and $(\varphi,\varphi^c)$ is optimal for $(\mathrm{DP})$. 
$\square$</p><p><strong>Theorem 1.9</strong> Let $\mu$ and $\nu$ be probability measures over $\mathbb{R}^d$, and $c(x,y):=\frac12|x-y|^2$. Suppose<br>$$<br>\int_{\mathbb{R}^d} |x|^2\ \mu(dx)&lt;+\infty<br>\qquad\text{and}\qquad<br>\int_{\mathbb{R}^d} |y|^2\ \nu(dy)&lt;+\infty.<br>$$</p><p>Consider the following variant of $(\mathrm{DP})$:</p><p>$$<br>(\mathrm{DP-var})\qquad<br>\sup\left\{<br>\int_{\mathbb{R}^d}\varphi\ d\mu+\int_{\mathbb{R}^d}\psi\ d\nu<br>:\ \varphi\in L^1(\mu),\ \psi\in L^1(\nu),\ \varphi\oplus\psi\le c<br>\right\}.<br>$$</p><p>Then $(\mathrm{DP-var})$ admits a solution $(\varphi,\psi)$, and the functions</p><p>$$<br>x\mapsto \frac12|x|^2-\varphi(x)<br>\qquad\text{and}\qquad<br>y\mapsto \frac12|y|^2-\psi(y)<br>$$</p><p>are convex and conjugate to each other for the Legendre transform. Moreover, we have</p><p>$$<br>\max (\mathrm{DP-var})=\min (\mathrm{KP}).<br>$$</p><p><strong>Proof.</strong> Let $\gamma$ be an optimal transport plan for $(\mathrm{KP})$. Since $c$ is continuous, by Theorem 1.7 the support $\Gamma=\operatorname{supp}(\gamma)$ is $c$-cyclically monotone and then by Theorem 1.6, we get the existence of a pair $(\varphi,\psi)$ with $\psi=\varphi^c$ such that</p><p>$$<br>\varphi(x)+\psi(y)=c(x,y)<br>\qquad\text{for all }(x,y)\in \Gamma.<br>$$</p><p>From Proposition 1.2, we know</p><p>$$<br>x\mapsto \frac12|x|^2-\varphi(x),\qquad<br>y\mapsto \frac12|y|^2-\psi(y)<br>$$</p><p>are convex and conjugate to each other, hence bounded from below by a linear function; therefore $\varphi$ and $\psi$ are bounded from above by a second-order polynomial. 
Since</p><p>$$<br>\int_{\mathbb{R}^d} |x|^2\ d\mu&lt;+\infty<br>\qquad\text{and}\qquad<br>\int_{\mathbb{R}^d} |y|^2\ d\nu&lt;+\infty,<br>$$</p><p>we know $\varphi_+\in L^1(\mu)$ and $\psi_+\in L^1(\nu)$.</p><p>Since</p><p>$$<br>\int_{\mathbb{R}^d}\varphi\ d\mu+\int_{\mathbb{R}^d}\psi\ d\nu<br>=<br>\int_{\mathbb{R}^d\times\mathbb{R}^d} \varphi\oplus\psi\ d\gamma<br>=<br>\int_{\mathbb{R}^d\times\mathbb{R}^d} c(x,y)\ d\gamma<br>\ge 0,<br>$$</p><p>this proves</p><p>$$<br>\int_{\mathbb{R}^d}\varphi\ d\mu,\ \int_{\mathbb{R}^d}\psi\ d\nu &gt;-\infty.<br>$$</p><p>Hence $\varphi\in L^1(\mu)$, $\psi\in L^1(\nu)$.</p><p>Therefore</p><p>$$<br>\sup(\mathrm{DP-var})\ge<br>\int_{\mathbb{R}^d}\varphi\ d\mu+\int_{\mathbb{R}^d}\psi\ d\nu<br>=<br>\int_{\mathbb{R}^d\times\mathbb{R}^d} c(x,y)\ d\gamma<br>=<br>\min(\mathrm{KP}).<br>$$</p><p>However, by $\varphi\oplus\psi\le c$, we know $\sup(\mathrm{DP-var})\le \min(\mathrm{KP})$. We get</p><p>$$<br>\max(\mathrm{DP-var})=\min(\mathrm{KP})<br>$$</p><p>and $(\varphi,\psi)$ is a solution of $(\mathrm{DP-var})$. $\square$</p><p><strong>Lemma 1.8</strong> Suppose that $c_k$ and $c$ are lower semi-continuous and bounded from below, and that $c_k$ converges increasingly to $c$. Then<br>$$<br>\lim_{k\to\infty}\min\left\{\int_{X\times Y} c_k\ d\gamma:\ \gamma\in\Pi(\mu,\nu)\right\}<br>=<br>\min\left\{\int_{X\times Y} c\ d\gamma:\ \gamma\in\Pi(\mu,\nu)\right\}.<br>$$</p><p><strong>Proof.</strong> Since $c_k\le c$, it is obvious that $\mathrm{LHS}\le \mathrm{RHS}$.</p><p>Now consider a sequence $\gamma_k\in\Pi(\mu,\nu)$, where each $\gamma_k$ is a minimizer for the cost $c_k$.</p><p>Up to subsequences, due to the tightness of $\Pi(\mu,\nu)$, we can suppose</p><p>$$<br>\gamma_k \rightharpoonup \overline{\gamma}.<br>$$</p><p>Fix now an index $j$. 
For $k\ge j$ we have $c_k\ge c_j$, and then</p><p>$$<br>\lim_k \min\left\{\int_{X\times Y} c_k\ d\gamma:\ \gamma\in\Pi(\mu,\nu)\right\}<br>=<br>\lim_k \int_{X\times Y} c_k\ d\gamma_k<br>\ge<br>\liminf_k \int_{X\times Y} c_j\ d\gamma_k.<br>$$</p><p>Since $c_j$ is lower semi-continuous, and $\gamma_k\rightharpoonup \overline{\gamma}$,</p><p>$$<br>\liminf_k \int_{X\times Y} c_j\ d\gamma_k<br>\ge<br>\int_{X\times Y} c_j\ d\overline{\gamma}.<br>$$</p><p>Hence, we obtain</p><p>$$<br>\lim_k \min\left\{\int_{X\times Y} c_k\ d\gamma:\ \gamma\in\Pi(\mu,\nu)\right\}<br>\ge<br>\int_{X\times Y} c_j\ d\overline{\gamma}.<br>$$</p><p>Since $j$ is arbitrary and</p><p>$$<br>\lim_{j\to\infty}\int_{X\times Y} c_j\ d\overline{\gamma}<br>=<br>\int_{X\times Y} c\ d\overline{\gamma}<br>$$</p><p>by the monotone convergence theorem, we also have</p><p>$$<br>\lim_k \min\left\{\int_{X\times Y} c_k\ d\gamma:\ \gamma\in\Pi(\mu,\nu)\right\}<br>\ge<br>\int_{X\times Y} c\ d\overline{\gamma}<br>\ge<br>\min\left\{\int_{X\times Y} c\ d\gamma:\ \gamma\in\Pi(\mu,\nu)\right\}.<br>$$</p><p>This completes the proof and $\overline{\gamma}$ is optimal for the cost $c$. $\square$</p><p><strong>Theorem 1.10</strong> If $X,Y$ are Polish spaces and $c:X\times Y\to \mathbb{R}\cup\{+\infty\}$ is lower semi-continuous and bounded from below, then the duality formula<br>$$<br>\min(\mathrm{KP})=\sup(\mathrm{DP})<br>$$</p><p>holds.</p><p><strong>Proof.</strong> Consider a sequence $c_k$ of bounded $k$-Lipschitz functions approaching $c$ increasingly (such a sequence exists because $c$ is lower semi-continuous and bounded from below). Then the duality result holds for each $c_k$, and then</p><p>$$<br>\min\left\{\int_{X\times Y} c_k\ d\gamma:\ \gamma\in\Pi(\mu,\nu)\right\}<br>=<br>\max\left\{<br>\int_X \varphi\ d\mu+\int_Y \psi\ d\nu:\ \varphi\in C_b(X),\ \psi\in C_b(Y),\ \varphi\oplus\psi\le c_k<br>\right\}\le<br>\sup\left\{<br>\int_X \varphi\ d\mu+\int_Y \psi\ d\nu:\ \varphi\oplus\psi\le c<br>\right\}.<br>$$</p><p>Letting $k\to\infty$, by Lemma 1.8, we get the desired result. 
$\square$</p><p><strong>Remark:</strong> For the cost $c$, we cannot guarantee the existence of a maximizing pair $(\varphi,\psi)$.</p><p><strong>Theorem 1.11</strong> If $c$ is lower semi-continuous and $\gamma$ is an optimal transport plan, then $\gamma$ is concentrated on a $c$-CM set $\Gamma$.</p><p><strong>Proof.</strong> By Theorem 1.10, we can take a maximizing sequence of pairs $(\varphi_n,\psi_n)$ in the dual problem. Since $\gamma$ is optimal for $c$, we have</p><p>$$<br>\int_{X\times Y} (\varphi_n(x)+\psi_n(y))\ d\gamma<br>=<br>\int_X \varphi_n(x)\ d\mu+\int_Y \psi_n(y)\ d\nu<br>\to<br>\int_{X\times Y} c\ d\gamma.<br>$$</p><p>Moreover, since</p><p>$$<br>f_n(x,y):=c(x,y)-\varphi_n(x)-\psi_n(y)\ge 0,<br>$$</p><p>we have</p><p>$$<br>\int_{X\times Y} |f_n(x,y)|\ d\gamma<br>=<br>\int_{X\times Y} f_n(x,y)\ d\gamma<br>\to 0<br>\qquad\text{as }n\to\infty,<br>$$</p><p>that is, $f_n\to 0$ in $L^1(X\times Y,\gamma)$. Therefore, up to a subsequence, $f_n$ also converges pointwise $\gamma$-a.e. to $0$. Let $\Gamma\subset X\times Y$ be the set where the convergence holds, so that $\gamma(\Gamma)=1$.</p><p>Take any $k$, any permutation $\sigma$ and $(x_1,y_1),\cdots,(x_k,y_k)\in \Gamma$; we have</p><p>$$<br>\sum_{i=1}^k c(x_i,y_i)<br>=<br>\lim_{n\to\infty}\sum_{i=1}^k(\varphi_n(x_i)+\psi_n(y_i))<br>=<br>\lim_{n\to\infty}\sum_{i=1}^k(\varphi_n(x_i)+\psi_n(y_{\sigma(i)}))<br>\le<br>\sum_{i=1}^k c(x_i,y_{\sigma(i)}),<br>$$</p><p>which proves that $\Gamma$ is $c$-CM. $\square$</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><p>Santambrogio, Filippo. <em>Optimal transport for applied mathematicians</em>. Birkhäuser, New York, 2015.</p><blockquote><p>The cover image of this article was taken from the Luzern–Interlaken Express while passing Lake Lungern in Switzerland.</p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/04/04/OT_5/</id>
    <link href="https://handsteinwang.github.io/2026/04/04/OT_5/"/>
    <published>2026-04-03T16:00:00.000Z</published>
    <summary>
      <![CDATA[<p>In this article, we will prove the duality result for the Kantorovich problem.</p>]]>
    </summary>
    <title>Duality Results for Kantorovich Problem</title>
    <updated>2026-04-03T16:00:00.000Z</updated>
  </entry>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Optimal Transport" scheme="https://handsteinwang.github.io/categories/Optimal-Transport/"/>
    <category term="Optimal Transport" scheme="https://handsteinwang.github.io/tags/Optimal-Transport/"/>
    <content>
      <![CDATA[<p>In this article, we will introduce some results for $c$-concavity and cyclical monotonicity, which will be used to prove the duality result in the next article.</p><span id="more"></span><p><strong>Proposition 1.3</strong> Let $c:X \times Y \to \mathbb{R}$. For $\varphi:X \to \mathbb{R} \cup \{-\infty\}$, we have<br>$$<br>\varphi^{c\bar c} \ge \varphi.<br>$$</p><p>Moreover, the equality holds if and only if $\varphi$ is $c$-concave. In general, $\varphi^{c\bar c}$ is the smallest $c$-concave function that is larger than $\varphi$.</p><p><strong><em>Proof</em></strong> : First, we show $\varphi^{c\bar c} \ge \varphi$. For any $x \in X$, $y \in Y$,<br>$$<br>\varphi^c(y)=\inf_{x\in X}[c(x,y)-\varphi(x)] \le c(x,y)-\varphi(x).<br>$$</p><p>Therefore,<br>$$<br>\varphi^{c\bar c}(x)=\inf_{y\in Y}[c(x,y)-\varphi^c(y)]<br>\ge \inf_{y\in Y}[c(x,y)-(c(x,y)-\varphi(x))]=\varphi(x).<br>$$</p><p>Then, let us prove $\varphi^{c\bar c}=\varphi$ if and only if $\varphi$ is $c$-concave. It is obvious that if $\varphi^{c\bar c}=\varphi$ then $\varphi=(\varphi^c)^{\bar c}$, which shows that $\varphi$ is $c$-concave. If $\varphi$ is $c$-concave, then there exists a function $\zeta:Y\to\mathbb{R}\cup\{-\infty\}$ such that<br>$$<br>\varphi=\zeta^{\bar c}.<br>$$</p><p>Hence<br>$$<br>\varphi^c=\zeta^{\bar c c}\ge \zeta,<br>$$<br>and then<br>$$<br>\varphi^{c\bar c}\le \zeta^{\bar c}=\varphi,<br>$$<br>which implies $\varphi^{c\bar c}=\varphi$.</p><p>Finally, if $\psi\ge \varphi$ and $\psi$ is $c$-concave, we need to show $\psi\ge \varphi^{c\bar c}$. 
Suppose $\psi=\eta^{\bar c}$, then<br>$$<br>\psi=\eta^{\bar c}\ge \varphi<br>\Rightarrow \eta^{\bar c c}\le \varphi^c<br>\Rightarrow \eta\le \varphi^c<br>\Rightarrow \psi=\eta^{\bar c}\ge \varphi^{c\bar c}.\quad \square<br>$$</p><h3 id="Recall-the-sub-differential-for-convex-functions"><a href="#Recall-the-sub-differential-for-convex-functions" class="headerlink" title="Recall the sub-differential for convex functions"></a>Recall the sub-differential for convex functions</h3><p><strong>Definition 1.5</strong> For every convex function $f:\mathbb{R}^d\to\mathbb{R}\cup\{+\infty\}$, we define its subdifferential at $x$ as the set<br>$$<br>\partial f(x):=\{p\in\mathbb{R}^d:\ f(y)\ge f(x)+p\cdot (y-x),\ \forall y\in\mathbb{R}^d\}.<br>$$</p><p><strong>Proposition 1.4</strong> For every convex function $f:\mathbb{R}^d\to\mathbb{R}$, we have</p><p>(1) $\partial f(x)\neq \emptyset$ if $x\in \operatorname{int}(\operatorname{dom} f)$.</p><p>(2) If $f$ is differentiable at $x$, then $\partial f(x)=\{\nabla f(x)\}$.</p><p>(3) $p\in \partial f(x)$ if and only if $x\in \partial f^\star(p)$ if and only if<br>$$<br>f(x)+f^\star(p)=x\cdot p.<br>$$</p><p>(4) If $p_1\in \partial f(x_1)$, $p_2\in \partial f(x_2)$, then<br>$$<br>(p_1-p_2)\cdot (x_1-x_2)\ge 0.<br>$$</p><p><strong>Definition 1.6</strong> Let us define the graph of the subdifferential of a convex function as<br>$$<br>\operatorname{Graph}(\partial f):=\{(x,p):\ p\in \partial f(x)\}<br>=\{(x,p):\ f(x)+f^\star(p)=x\cdot p\}.<br>$$</p><p><strong>Definition 1.7</strong> A set $A\subset \mathbb{R}^d\times \mathbb{R}^d$ is said to be cyclically monotone if for every $k\in \mathbb{N}$, every permutation $\sigma$ and every finite family of points $(x_1,p_1),\ldots,(x_k,p_k)\in A$, we have<br>$$<br>\sum_{i=1}^k x_i\cdot p_i \ge \sum_{i=1}^k x_i\cdot p_{\sigma(i)}.<br>$$</p><p>Note that if we take $k=2$, we get the usual definition of monotonicity<br>$$<br>x_1\cdot p_1+x_2\cdot p_2\ge x_1\cdot p_2+x_2\cdot 
p_1<br>\Leftrightarrow<br>(p_1-p_2)\cdot (x_1-x_2)\ge 0.<br>$$</p><p>By Rockafellar's theorem, every cyclically monotone set is contained in the graph of the subdifferential of a convex function.</p><h3 id="c-Cyclical-Monotonicity"><a href="#c-Cyclical-Monotonicity" class="headerlink" title="c-Cyclical Monotonicity"></a>c-Cyclical Monotonicity</h3><p><strong>Definition 1.8</strong> Once a function $c:X\times Y\to \mathbb{R}\cup\{+\infty\}$ is given, we say that a set $\Gamma\subset X\times Y$ is $c$-cyclically monotone (briefly $c$-CM) if for every $k\in\mathbb{N}$, every permutation $\sigma$ and every finite family of points $(x_1,y_1),\ldots,(x_k,y_k)\in \Gamma$, we have<br>$$<br>\sum_{i=1}^k c(x_i,y_i)\le \sum_{i=1}^k c(x_i,y_{\sigma(i)}).<br>$$</p><p><strong>Remark</strong> : Taking $c(x,y)=\frac12|x-y|^2$, a set $\Gamma$ is $c$-CM if and only if it is cyclically monotone.</p><p><strong>Theorem 1.6</strong> If $\Gamma\neq \emptyset$ is a $c$-CM set in $X\times Y$ and $c:X\times Y\to\mathbb{R}$ (note that $c$ is required not to take the value $+\infty$), then there exists a $c$-concave function $\varphi:X\to \mathbb{R}\cup\{-\infty\}$ (different from the constant $-\infty$ function) such that<br>$$<br>\Gamma\subset \{(x,y)\in X\times Y:\ \varphi(x)+\varphi^c(y)=c(x,y)\}.<br>$$</p><p><strong><em>Proof</em></strong> : Let us fix a point $(x_0,y_0)\in \Gamma$. 
For $x\in X$ set<br>$$<br>\varphi(x)=\inf \Big\{<br>c(x,y_n)-c(x_n,y_n)+c(x_n,y_{n-1})-c(x_{n-1},y_{n-1})<br>+\cdots +c(x_1,y_0)-c(x_0,y_0): n\in \mathbb{N},\ (x_i,y_i)\in \Gamma \text{ for all } i=1,2,\ldots,n<br>\Big\}.<br>$$</p><p>Since $c$ is real valued and $\Gamma$ is nonempty, $\varphi$ never takes the value $+\infty$.</p><p>For $y\in Y$, set<br>$$<br>-\psi(y)=\inf \Big\{<br>-c(x_n,y_n)+c(x_n,y_{n-1})-c(x_{n-1},y_{n-1})+\cdots +c(x_1,y_0)-c(x_0,y_0): n\in \mathbb{N},\ (x_i,y_i)\in \Gamma \text{ for all } i=1,2,\ldots,n,\ \text{and } y_n=y<br>\Big\}.<br>$$</p><p>By definition<br>$$<br>\psi(y)&gt;-\infty<br>\Leftrightarrow<br>y\in (\pi_Y)(\Gamma)=\{y\in Y:\ \exists x\in X \text{ s.t. } (x,y)\in \Gamma\}.<br>$$</p><p>Next, we prove that $\varphi=\psi^c$, which implies $\varphi$ is $c$-concave. Indeed, if $y\notin (\pi_Y)(\Gamma)$, $\psi(y)=-\infty$, then $c(x,y)-\psi(y)=+\infty$. Hence, we only need to consider $y\in (\pi_Y)(\Gamma)$. For $y\in (\pi_Y)(\Gamma)$,</p><p>$$<br>\begin{aligned}<br>c(x,y)-\psi(y)<br>&amp;=<br>c(x,y)+\inf \Big\{<br>-c(x_n,y_n)+c(x_n,y_{n-1})-c(x_{n-1},y_{n-1})+\cdots +c(x_1,y_0)-c(x_0,y_0):(x_i,y_i)\in \Gamma \text{ for all } i=1,2,\ldots,n,\ \text{and } y_n=y<br>\Big\}\\\<br>&amp;=<br>\inf \Big\{<br>c(x,y_n)-c(x_n,y_n)+c(x_n,y_{n-1})-c(x_{n-1},y_{n-1})+\cdots +c(x_1,y_0)-c(x_0,y_0):(x_i,y_i)\in \Gamma \text{ for all } i=1,2,\ldots,n,\ \text{and } y_n=y<br>\Big\}.<br>\end{aligned}<br>$$</p><p>Taking the infimum over $y\in (\pi_Y)(\Gamma)$, we get<br>$$<br>\inf_y[c(x,y)-\psi(y)]=\varphi(x),<br>$$<br>that is<br>$$<br>\varphi=\psi^c.<br>$$</p><p>Then, we claim: $\varphi(x_0)\ge 0$ and hence $\varphi \not\equiv -\infty$. 
In fact, for all $n\in \mathbb{N}$ and $(x_i,y_i)\in \Gamma$, $i=1,2,\ldots,n$, setting $x_{n+1}:=x_0$, the $c$-cyclical monotonicity of $\Gamma$ gives<br>$$<br>c(x_0,y_n)-c(x_n,y_n)+c(x_n,y_{n-1})-c(x_{n-1},y_{n-1})+\cdots +c(x_1,y_0)-c(x_0,y_0)=<br>\sum_{i=0}^n c(x_{i+1},y_i)-\sum_{i=0}^n c(x_i,y_i)\ge 0.<br>$$</p><p>Hence $\varphi(x_0)\ge 0$.</p><p>Now, we need to show $\varphi(x)+\varphi^c(y)\ge c(x,y)$ for $(x,y)\in \Gamma$. Since $\varphi=\psi^c$, we have $\varphi^c=\psi^{c\bar c}\ge \psi$, so we only need to show<br>$$<br>\varphi(x)+\psi(y)\ge c(x,y)<br>\qquad \text{for } (x,y)\in \Gamma.<br>$$</p><p>Fix $(x,y)\in \Gamma$. Since<br>$$<br>\varphi(x)=\psi^c(x)=\inf_{y\in (\pi_Y)(\Gamma)}[c(x,y)-\psi(y)],<br>$$<br>for all $\varepsilon&gt;0$, there exists $\bar y\in (\pi_Y)(\Gamma)$ such that<br>$$<br>c(x,\bar y)-\psi(\bar y)&lt;\varphi(x)+\varepsilon. \tag{1}<br>$$</p><p><strong>Claim</strong> :<br>$$<br>-\psi(y)\le -c(x,y)+c(x,\bar y)-\psi(\bar y).<br>$$</p><p>In fact, since $\bar y\in (\pi_Y)(\Gamma)$, there exists a chain<br>$$<br>(x_1,y_1),\ldots,(x_n,y_n)\in \Gamma,<br>\qquad \text{with } y_n=\bar y.<br>$$</p><p>Since $(x,y)\in \Gamma$, we can append it to the chain above, and hence<br>$$<br>-\psi(y)\le -c(x,y)+c(x,\bar y)-c(x_n,\bar y)+c(x_n,y_{n-1})-c(x_{n-1},y_{n-1})+\cdots +c(x_1,y_0)-c(x_0,y_0).<br>$$</p><p>Taking the infimum over such chains, we get<br>$$<br>-\psi(y)\le -c(x,y)+c(x,\bar y)-\psi(\bar y). \tag{2}<br>$$</p><p>Therefore, combining (1) and (2), we have<br>$$<br>-\psi(y)&lt;-c(x,y)+\varphi(x)+\varepsilon,<br>$$<br>and since $\varepsilon&gt;0$ is arbitrary, we get<br>$$<br>\varphi(x)+\psi(y)\ge c(x,y).<br>$$</p><p>Since $\varphi^c\ge \psi$, this completes the proof. 
$\square$</p><p><strong>Theorem 1.7</strong> If $\gamma$ is an optimal transport plan for the cost $c$ and $c$ is continuous, then $\operatorname{supp}(\gamma)$ is a $c$-CM set.</p><p><strong><em>Proof</em></strong> : Suppose by contradiction that there exist $k$, a permutation $\sigma$ and $(x_1,y_1),\ldots,(x_k,y_k)\in \operatorname{supp}(\gamma)$ such that<br>$$<br>\sum_{i=1}^k c(x_i,y_i)&gt;\sum_{i=1}^k c(x_i,y_{\sigma(i)}).<br>$$</p><p>Take<br>$$<br>0&lt;\varepsilon&lt;\frac{1}{2k}\left(\sum_{i=1}^k c(x_i,y_i)-\sum_{i=1}^k c(x_i,y_{\sigma(i)})\right).<br>$$</p><p>By continuity of $c$, there exists $r&gt;0$ such that for all $i=1,2,\ldots,k$, we have<br>$$<br>c(x,y)&gt;c(x_i,y_i)-\varepsilon<br>\qquad \text{for all } (x,y)\in B(x_i,r)\times B(y_i,r),<br>$$<br>and<br>$$<br>c(x,y)&lt;c(x_i,y_{\sigma(i)})+\varepsilon<br>\qquad \text{for all } (x,y)\in B(x_i,r)\times B(y_{\sigma(i)},r).<br>$$</p><p>Now, consider<br>$$<br>V_i:=B(x_i,r)\times B(y_i,r).<br>$$</p><p>Note that since $(x_i,y_i)\in \operatorname{supp}(\gamma)$, we have<br>$$<br>\gamma(V_i)&gt;0<br>\qquad \text{for all } i=1,\ldots,k.<br>$$</p><p>Define measures</p><p>$$<br>\gamma_i:=\gamma \llcorner V_i/\gamma(V_i)<br>\quad \text{and} \quad<br>\mu_i=(\pi_x)_\sharp \gamma_i,\ \nu_i=(\pi_y)_\sharp \gamma_i.<br>$$</p><p>Take<br>$$<br>0&lt;\varepsilon_0&lt;\frac{1}{k}\min_i \gamma(V_i).<br>$$</p><p>For every $i$, build a measure $\widetilde\gamma_i\in \Pi(\mu_i,\nu_{\sigma(i)})$ (for instance, $\widetilde\gamma_i:=\mu_i\otimes \nu_{\sigma(i)}$) and define<br>$$<br>\widetilde\gamma:=\gamma-\varepsilon_0\sum_{i=1}^k \gamma_i+\varepsilon_0\sum_{i=1}^k \widetilde\gamma_i.<br>$$</p><p><strong>Claim</strong> : $\widetilde\gamma\in \Pi(\mu,\nu)$ and<br>$$<br>\int_{X\times Y} c\ d\widetilde\gamma&lt;\int_{X\times Y} c\ d\gamma.<br>$$</p><p>First, we check $\widetilde\gamma$ is a positive measure. 
It is sufficient to prove that $\gamma-\varepsilon_0\sum_{i=1}^k \gamma_i$ is positive, and for that, the condition<br>$$<br>\varepsilon_0\gamma_i&lt;\frac1k\gamma<br>$$<br>is enough.</p><p>Since<br>$$<br>\varepsilon_0\gamma_i=\frac{\varepsilon_0}{\gamma(V_i)}\ \gamma\llcorner V_i,<br>$$<br>and<br>$$<br>\frac{\varepsilon_0}{\gamma(V_i)}&lt;\frac1k,<br>$$<br>we get the desired result.</p><p>Now, let us check the marginals of $\widetilde\gamma$. We have</p><p>$$<br>(\pi_x)_\sharp \widetilde \gamma = \mu-\varepsilon_0 \sum _{i=1} ^k \mu_i+\varepsilon_0\sum _{i=1} ^k \mu_i = \mu<br>$$</p><p>and</p><p>$$<br>(\pi_y)_\sharp \widetilde\gamma = \nu-\varepsilon _0 \sum _{i=1} ^k \nu_i + \varepsilon_0 \sum _{i=1} ^k \nu _{\sigma(i)} = \nu.<br>$$</p><p>Finally, let us estimate<br>$$<br>\int_{X\times Y} c\ d\gamma-\int_{X\times Y} c\ d\widetilde\gamma.<br>$$<br>We have<br>$$<br>\begin{aligned}<br>\int_{X\times Y} c\ d\gamma-\int_{X\times Y} c\ d\widetilde\gamma<br>&amp;=<br>\varepsilon_0\sum_{i=1}^k \int_{X\times Y} c\ d\gamma_i<br>-\varepsilon_0\sum_{i=1}^k \int_{X\times Y} c\ d\widetilde\gamma_i\\\<br>&amp;\ge<br>\varepsilon_0\sum_{i=1}^k (c(x_i,y_i)-\varepsilon)<br>-\varepsilon_0\sum_{i=1}^k (c(x_i,y_{\sigma(i)})+\varepsilon)\\\<br>&amp;=<br>\varepsilon_0\left(<br>\sum_{i=1}^k c(x_i,y_i)-\sum_{i=1}^k c(x_i,y_{\sigma(i)})-2k\varepsilon<br>\right)&gt;0,<br>\end{aligned}<br>$$</p><p>where we use the fact that $\gamma_i$ is concentrated on $B(x_i,r)\times B(y_i,r)$ and $\widetilde\gamma_i$ is concentrated on $B(x_i,r)\times B(y_{\sigma(i)},r)$. $\square$</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><p>Santambrogio, Filippo. <em>Optimal transport for applied mathematicians</em>. Birkhäuser, New York, 2015.</p><blockquote><p>The cover image of this article was taken in Cromwell, New Zealand.</p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/04/03/OT_4/</id>
    <link href="https://handsteinwang.github.io/2026/04/03/OT_4/"/>
    <published>2026-04-02T16:00:00.000Z</published>
    <summary>
      <![CDATA[<p>In this article, we will introduce some results for $c$-concavity and cyclical monotonicity, which will be used to prove the duality result in the next article.</p>]]>
    </summary>
    <title>$c$-Concavity and Cyclical Monotonicity</title>
    <updated>2026-04-02T16:00:00.000Z</updated>
  </entry>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Optimal Transport" scheme="https://handsteinwang.github.io/categories/Optimal-Transport/"/>
    <category term="Optimal Transport" scheme="https://handsteinwang.github.io/tags/Optimal-Transport/"/>
    <content>
<![CDATA[<p>In this article, we will consider $X=Y=\Omega \subset \mathbb{R}^d$ with $\Omega$ compact and $c(x,y)=h(x-y)$ for $h$ strictly convex.</p><span id="more"></span><p>From now on, we assume the duality result<br>$$<br>\min (\mathrm{KP})=\max (\mathrm{DP})<br>$$<br>holds true, which means that for an optimal transport plan $\gamma$ and a Kantorovich potential $\varphi$, we have<br>$$<br>\int_{\Omega \times \Omega} c \  d\gamma<br>=<br>\int_{\Omega} \varphi \  d\mu + \int_{\Omega} \varphi^c \  d\nu.<br>$$</p><p>Since $\varphi(x)+\varphi^c(y)\le c(x,y)$ for every $x,y\in \Omega$, the function $c-\varphi\oplus\varphi^c$ is non-negative and has zero integral against $\gamma$, so we have<br>$$<br>\varphi(x)+\varphi^c(y)=c(x,y)\quad \gamma\text{-a.e.}<br>$$</p><p>Furthermore, since $\varphi$, $\varphi^c$ and $c$ are continuous, we have $\varphi(x)+\varphi^c(y)=c(x,y)$ on $\operatorname{supp}(\gamma)$.</p><p>Fix $(x_0,y_0)\in \operatorname{supp}(\gamma)$. By the definition of $\varphi^c$,<br>$$<br>\varphi^c(y_0)=\inf_{x\in \Omega}[c(x,y_0)-\varphi(x)]\le c(x,y_0)-\varphi(x)<br>\qquad \forall x\in \Omega.<br>$$</p><p>On the other hand, since $(x_0,y_0)\in \operatorname{supp}(\gamma)$,<br>$$<br>\varphi(x_0)+\varphi^c(y_0)=c(x_0,y_0),<br>$$<br>which implies<br>$$<br>c(x_0,y_0)-\varphi(x_0)=\varphi^c(y_0)\le c(x,y_0)-\varphi(x)<br>\qquad \forall x\in \Omega.<br>$$</p><p>Therefore<br>$$<br>x\longmapsto c(x,y_0)-\varphi(x)<br>$$<br>is minimal at $x=x_0$.</p><p>If $\varphi$ and $c(\cdot,y_0)$ are differentiable at $x_0$ and $x_0\notin \partial \Omega$, then the first order condition tells us<br>$$<br>\nabla_x c(x_0,y_0)-\nabla \varphi(x_0)=0,<br>$$<br>namely<br>$$<br>\nabla \varphi(x_0)=\nabla_x c(x_0,y_0).<br>$$</p><p><strong>Proposition 1.1</strong> If $c$ is $C^1$, $\varphi$ is a Kantorovich potential for the cost $c$ in the transport from $\mu$ to $\nu$, and $(x_0,y_0)$ belongs to the support of an optimal transport plan $\gamma$. 
Then<br>$$<br>\nabla \varphi(x_0)=\nabla_x c(x_0,y_0),<br>$$<br>provided $\varphi$ is differentiable at $x_0$.</p><p>In particular, the gradients of two different Kantorovich potentials coincide on every point $x_0\in \operatorname{supp}(\mu)$ where both the potentials are differentiable.</p><p><strong><em>Proof</em></strong> : The proof is contained in the above considerations.</p><p><strong>Definition 1.4 (Twist condition)</strong> For $\Omega \subset \mathbb{R}^d$, we say that $c:\Omega \times \Omega \to \mathbb{R}$ satisfies the twist condition whenever $c$ is differentiable w.r.t. $x$ at every point and the map<br>$$<br>y\longmapsto \nabla_x c(x,y)<br>$$<br>is injective for every $x\in \Omega$.</p><p><strong>Remark</strong> By Proposition 1.1, we know for $(x_0,y_0)\in \operatorname{supp}(\gamma)$,<br>$$<br>\nabla_x c(x_0,y_0)=\nabla \varphi(x_0).<br>$$</p><p>If $c$ satisfies the twist condition, then for a given $x_0$ at which $\varphi$ is differentiable, there is a unique $y_0$ such that<br>$$<br>(x_0,y_0)\in \operatorname{supp}(\gamma).<br>$$</p><p>This shows that $\gamma$ is concentrated on a graph.</p><p>For $c(x,y)=h(x-y)$ with $h$ strictly convex, suppose $\varphi$ and $h$ are differentiable at $x_0$ and $x_0-y_0$, respectively, and $x_0\notin \partial \Omega$. Then<br>$$<br>\nabla_x c(x,y)=\nabla h(x-y).<br>$$</p><p>If $\nabla_x c(x,y_1)=\nabla_x c(x,y_2)$, then<br>$$<br>\nabla h(x-y_1)=\nabla h(x-y_2).<br>$$<br>Since $h$ is strictly convex, $\nabla h$ is injective, so<br>$$<br>x-y_1=x-y_2,<br>$$<br>which implies $y_1=y_2$; therefore $c$ satisfies the twist condition.</p><p>Since<br>$$<br>\nabla \varphi(x_0)=\nabla_x c(x_0,y_0)=\nabla h(x_0-y_0),<br>$$<br>we get<br>$$<br>x_0-y_0=(\nabla h)^{-1}(\nabla \varphi(x_0)).<br>$$</p><p>Hence<br>$$<br>y_0=x_0-(\nabla h)^{-1}(\nabla \varphi(x_0)).<br>$$</p><p>By Rademacher's theorem, $\varphi$ is differentiable $\mathcal{L}^d$-a.e. 
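</p><p>As a concrete illustration, take $h(z)=\frac1p|z|^p$ with $p&gt;1$. Then $\nabla h(z)=|z|^{p-2}z$, and a direct computation gives<br>$$<br>(\nabla h)^{-1}(w)=|w|^{q-2}w,<br>\qquad q=\frac{p}{p-1},<br>$$<br>so that<br>$$<br>y_0=x_0-|\nabla \varphi(x_0)|^{q-2}\nabla \varphi(x_0).<br>$$<br>For $p=2$ this reduces to $y_0=x_0-\nabla \varphi(x_0)$.</p><p>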
If $\mu\ll \mathcal{L}^d$, then $\nabla \varphi$ exists $\mu$-a.e.</p><p><strong>Theorem 1.5</strong> Given $\mu$ and $\nu$ probability measures on a compact domain $\Omega\subset \mathbb{R}^d$, there exists an optimal transport plan $\gamma$ for the cost $c(x,y)=h(x-y)$ with $h$ strictly convex. It is unique and of the form $(id \times T)_\sharp \mu$ provided $\mu\ll \mathcal{L}^d$ and $\partial \Omega$ is negligible. Moreover, there exists a Kantorovich potential $\varphi$, and the map $T$ and the potential $\varphi$ are linked by<br>$$<br>T(x)=x-(\nabla h)^{-1}(\nabla \varphi(x)).<br>$$</p><p><strong><em>Proof</em></strong> : The proof is contained in the previous considerations, and the uniqueness follows from Proposition 1.1, which implies that $\nabla \varphi$ is uniquely determined $\mu$-a.e., and hence so is the optimal map $T$. $\square$</p><p><strong>Remark</strong>. Whenever we know that every optimal $\gamma$ must be induced by a map $T$, we have uniqueness.</p><p><strong><em>Proof</em></strong> : Suppose two different plans<br>$$<br>\gamma_1=\gamma_{T_1}, \qquad \gamma_2=\gamma_{T_2}<br>$$<br>are both optimal. 
Then<br>$$<br>\bar{\gamma}:=\frac12\gamma_1+\frac12\gamma_2<br>$$<br>is also optimal, but it cannot be induced by a map unless $T_1=T_2$, $\mu$-a.e., which gives a contradiction.</p><p>Indeed, let<br>$$<br>A:=\{x\in X:\ T_1(x)\ne T_2(x)\}.<br>$$<br>Suppose $\mu(A)&gt;0$ and there exists a map $T:X\to Y$ such that<br>$$<br>\bar{\gamma}=\gamma_T=(id\times T)_\sharp\mu.<br>$$</p><p>For $B\in \mathcal{B}(X)$, consider<br>$$<br>\operatorname{Graph}(T|_B):=\{(x,T(x)):\ x\in B\}.<br>$$<br>We have<br>$$<br>\begin{aligned}<br>\bar{\gamma}(\operatorname{Graph}(T|_B)) &amp;= (id\times T)_\sharp\mu\big(\{(x,T(x)):\ x\in B\}\big)\\\<br>&amp;=\mu\big((id\times T)^{-1}\{(x,T(x)):\ x\in B\}\big)\\\<br>&amp;=\mu(B).<br>\end{aligned}<br>$$</p><p>Define<br>$$<br>B_1=\{x\in A:\ T(x)=T_1(x)\}, \qquad B_2=\{x\in A:\ T(x)=T_2(x)\}.<br>$$<br>By the definition of $A$, we have $B_1\cap B_2=\varnothing$.</p><p>First, we claim $\mu\big(A\setminus (B_1\cup B_2)\big)=0.$</p><p>If not, $\mu(C)&gt;0$, where $C=A\setminus (B_1\cup B_2)$. Consider<br>$$<br>E:=\operatorname{Graph}(T|_C)=\{(x,T(x)):\ x\in C\}.<br>$$<br>We have<br>$$<br>\bar{\gamma}(E)=\mu(C)&gt;0.<br>$$</p><p>However,<br>$$<br>\gamma_1(E)<br>=<br>\mu\big(\{x\in C:\ (x,T_1(x))\in E\}\big)<br>=<br>\mu\big(\{x\in C:\ T_1(x)=T(x)\}\big)=0.<br>$$<br>Similarly $\gamma_2(E)=0$. Therefore,<br>$$<br>\bar{\gamma}(E)=\frac12\gamma_1(E)+\frac12\gamma_2(E)=0,<br>$$<br>which gives a contradiction. Therefore<br>$$<br>\mu\big(A\setminus (B_1\cup B_2)\big)=0.<br>$$</p><p>Next, since<br>$$<br>A=B_1\cup B_2\cup A\setminus (B_1\cup B_2)<br>$$<br>and $B_1\cap B_2=\varnothing$,<br>$$<br>\mu(A)=\mu(B_1)+\mu(B_2)&gt;0.<br>$$<br>Therefore, either $\mu(B_1)&gt;0$ or $\mu(B_2)&gt;0$. 
Without loss of generality, we assume $\mu(B_1)&gt;0$.</p><p>Finally, consider<br>$$<br>F:=\operatorname{Graph}(T_2|_{B_1})=\{(x,T_2(x)):\ x\in B_1\}.<br>$$<br>We have<br>$$<br>\gamma_2(F)=\mu\big(\{x\in B_1:\ (x,T_2(x))\in F\}\big)=\mu(B_1),<br>$$<br>therefore<br>$$<br>\bar{\gamma}(F)\ge \frac12\gamma_2(F)=\frac12\mu(B_1)&gt;0.<br>$$</p><p>However,<br>$$<br>\bar{\gamma}(F)<br>=<br>\mu\big(\{x\in X:\ (x,T(x))\in F\}\big)<br>=<br>\mu\big(\{x\in B_1:\ T(x)=T_2(x)\}\big)<br>=<br>\mu(\varnothing)=0,<br>$$<br>which gives a contradiction. $\square$</p><h3 id="The-quadratic-case-in-mathbb-R-d-c-x-y-frac12-x-y-2"><a href="#The-quadratic-case-in-mathbb-R-d-c-x-y-frac12-x-y-2" class="headerlink" title="The quadratic case in $\mathbb{R}^d$:  $c(x,y)=\frac12|x-y|^2$"></a>The quadratic case in $\mathbb{R}^d$:  $c(x,y)=\frac12|x-y|^2$</h3><p><strong>Proposition 1.2</strong> Given a function $\chi:\mathbb{R}^d\to \mathbb{R}\cup\{+\infty\}$, let us define<br>$$<br>u_\chi:\mathbb{R}^d\to \mathbb{R}\cup\{+\infty\}<br>$$<br>$$<br>u_\chi(x)=\frac12|x|^2-\chi(x).<br>$$</p><p>Then we have<br>$$<br>u_{\chi^c}=(u_\chi)^\star,<br>$$<br>where $f^\star$ denotes the Legendre-Fenchel transform of $f$.</p><p>In particular, a function $\varphi$ is $c$-concave if and only if<br>$$<br>x\longmapsto \frac12|x|^2-\varphi(x)<br>$$<br>is convex and lower semi-continuous.</p><p><strong><em>Proof</em></strong> :<br>$$<br>\begin{aligned}<br>u_{\chi^c}(x)&amp; =\frac12|x|^2-\chi^c(x)\\\<br>&amp;=\sup_y\left[\frac12|x|^2-\frac12|x-y|^2+\chi(y)\right]\\\<br>&amp;=<br>\sup_y\left[x\cdot y-\left(\frac12|y|^2-\chi(y)\right)\right]\\\<br>&amp;=(u_\chi)^\star(x).<br>\end{aligned}<br>$$</p><p>If $\varphi$ is $c$-concave, which means that there exists a function $\chi$ such that<br>$$<br>\varphi=\chi^c,<br>$$<br>then<br>$$<br>u_\varphi(x)=\frac12|x|^2-\varphi(x)=(u_\chi)^\star(x),<br>$$<br>which is convex and lower semi-continuous.</p><p>Conversely, if $u_\varphi=\frac12|x|^2-\varphi(x)$ is convex and lower semi-continuous, then 
there exists a function $\chi$ such that<br>$$<br>u_\varphi=\chi^\star<br>$$<br>(one may take $\chi=(u_\varphi)^\star$, since a convex and lower semi-continuous function coincides with its double Legendre-Fenchel transform).</p><p>Then<br>$$<br>\begin{aligned}<br>\varphi(x)&amp;=\frac12|x|^2-\chi^\star(x)=<br>\frac12|x|^2-\sup_y[x\cdot y-\chi(y)]\\\<br>&amp;=<br>\inf_y\left[\frac12|x-y|^2-\left(\frac12|y|^2-\chi(y)\right)\right]=<br>\left(\frac12|y|^2-\chi(y)\right)^c,<br>\end{aligned}<br>$$</p><p>which shows that $\varphi$ is $c$-concave. $\square$</p><p>By Theorem 1.5 (here $\nabla h=\mathrm{id}$, since $h(z)=\frac12|z|^2$), there exists an optimal transport map<br>$$<br>T(x)=x-\nabla \varphi(x)=\nabla\left(\frac12|x|^2-\varphi(x)\right)=\nabla u_\varphi(x).<br>$$</p><p>By Proposition 1.2, $u_\varphi$ is convex.</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><p>Santambrogio, Filippo. <em>Optimal Transport for Applied Mathematicians</em>. Birkhäuser, 2015.</p><blockquote><p>The cover image of this article was taken upon arriving in Kagoshima, Japan, aboard a Royal Caribbean cruise.</p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/03/31/OT_3/</id>
    <link href="https://handsteinwang.github.io/2026/03/31/OT_3/"/>
    <published>2026-03-30T16:00:00.000Z</published>
    <summary>
      <![CDATA[<p>In this article, we will consider $X=Y=\Omega \subset \mathbb{R}^d$ with $\Omega$ compact and $c(x,y)=h(x-y)$ for $h$ strictly convex.</p>]]>
    </summary>
    <title>The case $c(x,y)=h(x-y)$ for $h$ Strictly Convex and the Existence of an Optimal Transport Map $T$</title>
    <updated>2026-03-30T16:00:00.000Z</updated>
  </entry>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Optimal Transport" scheme="https://handsteinwang.github.io/categories/Optimal-Transport/"/>
    <category term="Optimal Transport" scheme="https://handsteinwang.github.io/tags/Optimal-Transport/"/>
    <content>
<![CDATA[<p>In this article, we will talk about the dual problem of Kantorovich.</p><span id="more"></span><p>Let us express the constraint $\gamma \in \Pi(\mu,\nu)$ in the following way: if $\gamma \in M_+(X\times Y)$ then we have<br>$$<br>\sup_{\varphi,\psi}<br>\left[<br>\int_X \varphi \  d\mu<br>+\int_Y \psi \  d\nu<br>-\int_{X\times Y} (\varphi(x)+\psi(y)) \  d\gamma<br>\right]<br>=<br>\begin{cases}<br>0, &amp; \text{if } \gamma \in \Pi(\mu,\nu),\\\<br>+\infty, &amp; \text{otherwise,}<br>\end{cases}<br>$$<br>where $\varphi,\psi$ range over bounded continuous functions.</p><p><strong><em>Proof</em></strong> : If $\gamma \in \Pi(\mu,\nu)$, the bracket vanishes for every pair $(\varphi,\psi)$, so the supremum is $0$. If $\gamma \notin \Pi(\mu,\nu)$, then one of the marginals of $\gamma$ differs from $\mu$ or $\nu$, so there exists a pair $(\varphi_0,\psi_0)$ such that<br>$$<br>I(\varphi_0,\psi_0)<br>:=<br>\int_X \varphi_0 \  d\mu<br>+\int_Y \psi_0 \  d\nu<br>-\int_{X\times Y} (\varphi_0(x)+\psi_0(y)) \  d\gamma<br>\neq 0.<br>$$<br>W.L.O.G. we can assume $I(\varphi_0,\psi_0)&gt;0$ (otherwise replace $(\varphi_0,\psi_0)$ by $(-\varphi_0,-\psi_0)$); then if we choose $(a\varphi_0,a\psi_0)$ for $a&gt;0$,<br>$$<br>I(a\varphi_0,a\psi_0)=a I(\varphi_0,\psi_0).<br>$$<br>Hence<br>$$<br>\mathrm{LHS}=\sup_{\varphi,\psi} I(\varphi,\psi)=+\infty.<br>$$</p><p>Hence the Kantorovich problem is equivalent to the following problem<br>$$<br>\min_{\gamma \in M_+(X\times Y)}<br>\left\{<br>\int_{X\times Y} c \  d\gamma<br>+\sup_{\varphi,\psi}<br>\left[<br>\int_X \varphi \  d\mu<br>+\int_Y \psi \  d\nu<br>-\int_{X\times Y} (\varphi(x)+\psi(y)) \  d\gamma<br>\right]<br>\right\}.<br>$$</p><p>Consider interchanging $\sup$ and $\inf$:<br>$$<br>\sup_{\varphi,\psi}<br>\left\{<br>\int_X \varphi \  d\mu<br>+\int_Y \psi \  d\nu<br>+\inf_{\gamma \in M_+(X\times Y)}<br>\left[<br>\int_{X\times Y} \bigl(c(x,y)-(\varphi(x)+\psi(y))\bigr)\ d\gamma<br>\right]<br>\right\}.<br>$$</p><p>Now we denote $(\varphi \oplus \psi)(x,y):=\varphi(x)+\psi(y)$; then the infimum in $\gamma$ can be written as<br>$$<br>\inf_{\gamma \in M_+(X\times Y)} \int_{X\times Y} (c-\varphi\oplus\psi)\ d\gamma<br>=<br>\begin{cases}<br>0, &amp; \text{if } 
\varphi\oplus\psi \le c \text{ on } X\times Y,\\\<br>-\infty, &amp; \text{otherwise.}<br>\end{cases}<br>$$</p><p><strong><em>Proof</em></strong> : Let<br>$$<br>\widetilde{J}(\gamma):=\int_{X\times Y} (c-\varphi\oplus\psi)\ d\gamma.<br>$$<br>If $\varphi\oplus\psi \le c$ on $X\times Y$, then for every $\gamma \in M_+(X\times Y)$,<br>$$<br>\widetilde{J}(\gamma)\ge 0,<br>$$<br>and $\widetilde{J}(0)=0$, so<br>$$<br>\inf_{\gamma \in M_+(X\times Y)} \widetilde{J}(\gamma)=0.<br>$$</p><p>If $\varphi\oplus\psi \le c$ does not hold, then there exists $(x_0,y_0)\in X\times Y$ with $\varphi(x_0)+\psi(y_0)&gt;c(x_0,y_0)$. Taking $\gamma=a\delta_{(x_0,y_0)}$ with $a&gt;0$, we have<br>$$<br>\widetilde{J}(\gamma)=a\bigl(c(x_0,y_0)-\varphi(x_0)-\psi(y_0)\bigr)\longrightarrow -\infty<br>\qquad \text{as } a\to+\infty,<br>$$<br>hence<br>$$<br>\inf_{\gamma \in M_+(X\times Y)} \widetilde{J}(\gamma)=-\infty.<br>$$</p><p>Therefore, we get the following dual optimization problem:</p><p><strong>Problem 1.3.</strong> Given $\mu \in \mathcal{P}(X)$, $\nu \in \mathcal{P}(Y)$ and the cost function $c:X\times Y \to [0,+\infty]$, we consider the problem<br>$$<br>(\mathrm{DP})\qquad<br>\sup<br>\left\{<br>\int_X \varphi \  d\mu + \int_Y \psi \  d\nu<br>\ \middle|\<br>\varphi \in C_b(X),\ \psi \in C_b(Y),\ \varphi\oplus\psi \le c<br>\right\}.<br>$$</p><p><strong>Lemma 1.3.</strong> $\sup(\mathrm{DP}) \le \min(\mathrm{KP})$.</p><p><strong><em>Proof</em></strong> : Let $\gamma \in \Pi(\mu,\nu)$. Then for every $\varphi \in C_b(X)$, $\psi \in C_b(Y)$ with $\varphi\oplus\psi \le c$, we have<br>$$<br>\int_X \varphi \  d\mu + \int_Y \psi \  d\nu<br>=<br>\int_{X\times Y} \varphi\oplus\psi \  d\gamma<br>\le<br>\int_{X\times Y} c \  d\gamma.<br>$$<br>Taking the supremum over admissible $(\varphi,\psi)$ and the infimum over $\gamma \in \Pi(\mu,\nu)$ gives the claim. $\square$</p><p><strong>Definition 1.1.</strong> Given a function $\chi:X\to \overline{\mathbb R}$, we define its $c$-transform (also called $c$-conjugate function)<br>$$<br>\chi^c:Y\to \overline{\mathbb R}<br>$$<br>by<br>$$<br>\chi^c(y):=\inf_{x\in X}\ [c(x,y)-\chi(x)].<br>$$</p><p>We also define the $\overline{c}$-transform of $\xi:Y\to 
\overline{\mathbb R}$ by<br>$$<br>\xi^{\overline{c}}:X\to \overline{\mathbb R}<br>$$<br>$$<br>\xi^{\overline{c}}(x):=\inf_{y\in Y}\ [c(x,y)-\xi(y)].<br>$$</p><p>Moreover, we say that a function $\psi$ defined on $Y$ is $\overline{c}$-concave if there exists $\chi:X\to \overline{\mathbb R}$ such that $\psi=\chi^c$, and a function $\varphi$ on $X$ is said to be $c$-concave if there exists $\xi:Y\to \overline{\mathbb R}$ such that $\varphi=\xi^{\overline{c}}$.</p><p>We denote by $c\text{-conc}(X)$ and $\overline{c}\text{-conc}(Y)$ the sets of $c$-concave and $\overline{c}$-concave functions, respectively.</p><p>(When $X=Y$ and $c$ is symmetric, the distinction between $c$ and $\overline{c}$ will no longer play any role and will be dropped whenever convenient.)</p><p><strong>Definition 1.2.</strong> A function $f:X\to \mathbb R$ is said to have modulus of continuity $\omega:\mathbb R_+\to \mathbb R_+$ if for all $x,y\in X$</p><p>$$<br>|f(x)-f(y)|\le \omega(d(x,y)).<br>$$</p><p><strong>Lemma 1.4.</strong> If $c:X\times Y\to \overline{\mathbb R}$ is continuous and finite on a compact set, then there exists an increasing continuous function $\omega:\mathbb R_+\to \mathbb R_+$ with $\omega(0)=0$ such that<br>$$<br>|c(x,y)-c(x',y')|\le \omega\bigl(d(x,x')+d(y,y')\bigr).<br>$$</p><p><strong><em>Proof</em></strong> : Let $\omega:\mathbb R_+\to \mathbb R_+$ be defined by<br>$$<br>\omega(t):=<br>\sup\left\{<br>|c(x,y)-c(x',y')|<br>\ :\<br>d(x,x')+d(y,y')\le t<br>\right\}.<br>$$<br>It is easy to see that $\omega(0)=0$ and $\omega(t)$ is increasing, and by construction, for $r=d(x,x')+d(y,y')$,<br>$$<br>|c(x,y)-c(x',y')|\le \omega(r)=\omega\bigl(d(x,x')+d(y,y')\bigr).<br>$$<br>Since $c$ is uniformly continuous on the compact set, $\omega(t)\to 0$ as $t\to 0^+$; replacing $\omega$ by a continuous increasing function dominating it if necessary, we may assume $\omega$ is continuous.</p><p><strong>Lemma 1.5.</strong> Let $(f_\alpha)$ be a family of functions, all satisfying the same modulus of continuity $\omega:\mathbb R_+\to \mathbb R_+$,</p><p>$$<br>|f_\alpha(x)-f_\alpha(x')|\le \omega(d(x,x')),\qquad \forall 
\alpha.<br>$$</p><p>Then<br>$$<br>f:=\inf_\alpha f_\alpha<br>$$<br>also satisfies the same modulus of continuity.</p><p><strong><em>Proof</em></strong> : For each $\alpha$,<br>$$<br>f_\alpha(x)\le f_\alpha(x’)+\omega(d(x,x’)),<br>$$<br>which implies<br>$$<br>f(x)\le f_\alpha(x’)+\omega(d(x,x’))<br>$$<br>since $f\le f_\alpha$. Taking the infimum at the RHS, we get<br>$$<br>f(x)\le f(x’)+\omega(d(x,x’)).<br>$$<br>Interchanging $x$ and $x’$, we obtain<br>$$<br>|f(x)-f(x’)|\le \omega(d(x,x’)).<br>$$</p><p><strong>Lemma 1.6.</strong> If $c:X\times Y\to \overline{\mathbb R}$ has the modulus of continuity $\omega:\mathbb R_+\to \mathbb R_+$, then $\chi^c$ shares the same modulus of continuity.</p><p><strong><em>Proof</em></strong> : Since<br>$$<br>\chi^c(y)=\inf_{x\in X}\bigl(c(x,y)-\chi(x)\bigr)=: \inf_{x\in X} g_x(y),<br>$$<br>and for $x\in X$, $y,y’\in Y$,<br>$$<br>|g_x(y)-g_x(y’)|=|c(x,y)-c(x,y’)|\le \omega(d(y,y’)),<br>$$<br>by Lemma 1.5 we get the desired result.</p><p><strong>Lemma 1.7.</strong> Let<br>$$<br>DP(\varphi,\psi):=\int_X \varphi \  d\mu+\int_Y \psi \  d\nu.<br>$$<br>For $\varphi\in C_b(X)$, $\psi\in C_b(Y)$ with $\varphi\oplus\psi\le c$. Then<br>$$<br>DP(\varphi,\varphi^c)\ge DP(\varphi,\psi).<br>$$</p><p><strong><em>Proof</em></strong> : Since $\varphi\oplus\psi\le c$, we have<br>$$<br>\psi(y)\le c(x,y)-\varphi(x).<br>$$<br>Furthermore<br>$$<br>\psi(y)\le \inf_x [c(x,y)-\varphi(x)]=\varphi^c(y).<br>$$<br>Hence<br>$$<br>DP(\varphi,\psi)\le \int_X \varphi \  d\mu+\int_Y \varphi^c \  d\nu = DP(\varphi,\varphi^c).<br>$$</p><p><strong>Remark.</strong> Similarly, we will have<br>$$<br>DP(\varphi,\psi)\le DP(\varphi,\varphi^c)\le DP(\varphi^{c\overline{c}},\varphi^c)\le DP(\psi^{\overline{c}},\psi^{\overline{c}c})\le DP(\varphi^{c\overline{c}c},\varphi^{c\overline{c}})\le \cdots<br>$$</p><p><strong>Theorem 1.4.</strong> Suppose that $X$ and $Y$ are compact and $c$ is continuous. 
Then there exists a solution $(\varphi,\psi)$ to problem $(\mathrm{DP})$ and it has the form $\varphi\in c\text{-conc}(X)$, $\psi\in \overline{c}\text{-conc}(Y)$ and $\psi=\varphi^c$. In particular,<br>$$<br>\max(\mathrm{DP})<br>=<br>\max_{\varphi\in c\text{-conc}(X)}<br>\left[<br>\int_X \varphi \  d\mu+\int_Y \varphi^c \  d\nu<br>\right].<br>\tag{$\star$}<br>$$</p><p><strong><em>Proof</em></strong> : From the considerations above, we can take a maximizing sequence $(\varphi_n,\psi_n)$ and improve it by means of $c$- and $\overline{c}$-transforms, and we can assume they have the same modulus of continuity as $c$. Instead of renaming the sequence, we will still call $(\varphi_n,\psi_n)$ the new sequence obtained after these transforms.</p><p>Since the sequence shares the modulus of continuity of $c$, it is equicontinuous; to apply Ascoli-Arzelà's theorem, we only need to check equiboundedness. For every constant $a\in \mathbb R$,<br>$$<br>DP(\varphi,\psi)=DP(\varphi+a,\psi-a)<br>$$<br>and<br>$$<br>(\varphi+a)\oplus(\psi-a)\le c.<br>$$<br>Since $\varphi_n$ is continuous on a compact set and hence bounded, W.L.O.G. (subtracting a constant) we can assume<br>$$<br>\min \varphi_n=0,<br>$$<br>and then, by the modulus of continuity, we get<br>$$<br>\max \varphi_n\le \omega(\operatorname{diam} X).<br>$$<br>If we have chosen $\psi_n=\varphi_n^c$, we also have<br>$$<br>\psi_n(y)=\inf_{x\in X}[c(x,y)-\varphi_n(x)]<br>\in<br>\bigl[\min c-\omega(\operatorname{diam} X),\ \max c\bigr].<br>$$</p><p>Hence the sequence is equibounded. 
By Ascoli-Arzelà's theorem, there exists a subsequence<br>$$<br>\varphi_{n_k}\to \varphi,\qquad \psi_{n_k}\to \psi\quad \text{(uniform convergence).}<br>$$</p><p>Therefore<br>$$<br>\int_X \varphi_{n_k}\ d\mu+\int_Y \psi_{n_k}\ d\nu<br>\to<br>\int_X \varphi\ d\mu+\int_Y \psi\ d\nu<br>$$<br>and<br>$$<br>\varphi_{n_k}(x)+\psi_{n_k}(y)\le c(x,y)<br>\quad\Rightarrow\quad<br>\varphi(x)+\psi(y)\le c(x,y).<br>$$<br>This shows that $(\varphi,\psi)$ is an admissible pair for $(\mathrm{DP})$, and being the limit of a maximizing sequence, it is optimal.</p><p>If $\psi\neq \varphi^c$, then by Lemma 1.7 replacing $\psi$ with $\varphi^c$ would improve the objective functional in $(\mathrm{DP})$, which contradicts the optimality of $(\varphi,\psi)$; hence $\psi=\varphi^c$. Similarly $\varphi=\psi^{\overline{c}}$.</p><p>Hence $\psi=\varphi^c\in \overline{c}\text{-conc}(Y)$ and similarly $\varphi=\psi^{\overline{c}}\in c\text{-conc}(X)$.</p><p><strong>Remark.</strong> If $\min(\mathrm{KP})=\max(\mathrm{DP})$, we also have<br>$$<br>\min(\mathrm{KP})<br>=<br>\max_{\varphi\in c\text{-conc}(X)}<br>\left[<br>\int_X \varphi \  d\mu+\int_Y \varphi^c \  d\nu<br>\right],<br>$$<br>which also shows that the minimum value of $(\mathrm{KP})$ is a convex function of $(\mu,\nu)$, as it is a supremum of linear functionals.</p><p><strong>Definition 1.3.</strong> The functions $\varphi$ realizing the maximum in $(\star)$ are called Kantorovich potentials for the transport from $\mu$ to $\nu$.</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><p>Santambrogio, Filippo. <em>Optimal Transport for Applied Mathematicians</em>. Birkhäuser, 2015.</p><blockquote><p><em>The cover image of this article was taken in Auckland, New Zealand.</em></p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/03/30/OT_2/</id>
    <link href="https://handsteinwang.github.io/2026/03/30/OT_2/"/>
    <published>2026-03-29T16:00:00.000Z</published>
    <summary>
      <![CDATA[<p>In this article, we will talk about the dual problem of Kantorovich.</p>]]>
    </summary>
    <title>Dual Problem of Kantorovich</title>
    <updated>2026-03-29T16:00:00.000Z</updated>
  </entry>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Optimal Transport" scheme="https://handsteinwang.github.io/categories/Optimal-Transport/"/>
    <category term="Optimal Transport" scheme="https://handsteinwang.github.io/tags/Optimal-Transport/"/>
    <content>
<![CDATA[<p>In this article, we will talk about the problems of Monge and Kantorovich and the corresponding existence of solutions.</p><span id="more"></span><h2 id="Monge-Problem"><a href="#Monge-Problem" class="headerlink" title="Monge Problem"></a>Monge Problem</h2><p><strong>Problem 1.1 (Monge Problem).</strong> Given $\mu \in \mathcal{P}(X)$, $\nu \in \mathcal{P}(Y)$ and a Borel measurable cost function<br>$$<br>c:X \times Y \to [0,+\infty],<br>$$<br>solve<br>$$<br>(\mathrm{MP}) \qquad \inf \left\{ M(T):=\int_X c(x,T(x))\ d\mu(x) \ \middle|\  T_\sharp\mu=\nu \right\},<br>$$<br>where<br>$$<br>T_\sharp\mu(A)=\mu\bigl(T^{-1}(A)\bigr)<br>$$<br>for all $A \in \mathcal{B}(Y)$. Find also the optimal transport map $T$.</p><p><strong>Remark.</strong> The difficulty of the Monge Problem is that the constraint<br>$$<br>\{T:X \to Y \mid T_\sharp\mu=\nu\}<br>$$<br>is often not closed under weak convergence.</p><p><strong>Example.</strong> $X=[0,2\pi]$, $\mu=\dfrac{1}{2\pi}\ dx$, $T_n(x)=\sin(nx)$, $Y=\mathbb R$.</p><p><strong>Claim.</strong> For all $n \in \mathbb{N}$,</p><p>$$<br>(T_n)_\sharp\mu=\nu,<br>$$</p><p>where</p><p>$$<br>\nu(dy)=\frac{1}{\pi\sqrt{1-y^2}}\ 1_{(-1,1)}(y)\ dy.<br>$$</p><p><strong><em>Proof</em></strong> : For all $\varphi \in C_b(\mathbb R)$,</p><p>$$<br>\begin{aligned}<br>\int_Y \varphi(y)\ ((T_n)_\sharp\mu)(dy)<br>&amp;=<br>\int_X \varphi(\sin nx)\ \mu(dx)<br>=<br>\frac{1}{2\pi}\int_0^{2\pi}\varphi(\sin nx)\ dx \\\<br>&amp;=\frac{1}{2\pi n} \int_0^{2\pi n}\varphi(\sin u)\ du<br>=\frac{1}{2\pi}\int_0^{2\pi}\varphi(\sin u)\ du<br>=\int_Y \varphi(y)\ \nu(dy).<br>\end{aligned}<br>$$</p><p>Hence $(T_n)_\sharp\mu=\nu$ for all $n \in \mathbb{N}$.</p><p>However, by the Riemann-Lebesgue Lemma, in $L^p[0,2\pi]$, $1 \leq p &lt; +\infty$, we have for all $g \in L^q[0,2\pi]$ (with $\frac1p+\frac1q=1$)<br>$$<br>\int_0^{2\pi} g(x)\sin(nx)\ dx \to 0.<br>$$<br>Hence<br>$$<br>T_n \rightharpoonup 0.<br>$$<br>But for $T=0$, $T_\sharp\mu=\delta_0 \neq \nu$, which shows that<br>$$<br>\left\{ T \in 
L^p([0,2\pi]) \  ; \  T_\sharp\mu=\nu \right\}<br>$$<br>is not closed under weak topology of $L^p[0,2\pi]$. </p><h2 id="Kantorovich-Problem"><a href="#Kantorovich-Problem" class="headerlink" title="Kantorovich Problem"></a>Kantorovich Problem</h2><p><strong>Problem 1.2 (Kantorovich Problem).</strong> Given $\mu \in \mathcal{P}(X)$, $\nu \in \mathcal{P}(Y)$, and a Borel measurable cost function<br>$$<br>c:X \times Y \to [0,\infty],<br>$$<br>find<br>$$<br>(\mathrm{KP}) \qquad \inf \left\{ K(\gamma):=\int_{X \times Y} c\ d\gamma \ \middle|\  \gamma \in \Pi(\mu,\nu) \right\},<br>$$<br>where $\Pi(\mu,\nu)$ is the set of so-called transport plans, i.e.<br>$$<br>\Pi(\mu,\nu)<br>=<br>\left\{<br>\gamma \in \mathcal{P}(X \times Y)<br>\ \middle|\<br>(\pi_x)_\sharp\gamma=\mu,\ (\pi_y)_\sharp\gamma=\nu<br>\right\},<br>$$<br>where $\pi_x$ and $\pi_y$ are two projections of $X \times Y$ onto $X$ and $Y$, respectively.</p><p><strong>Remark.</strong> If $T_\sharp\mu=\nu$ then the plan<br>$$<br>\pi=(\mathrm{id} \times T)_\sharp\mu \in \Pi(\mu,\nu).<br>$$<br>Conversely, if $\pi=(\mathrm{id} \times T)_\sharp\mu \in \Pi(\mu,\nu)$, then $T_\sharp\mu=\nu$.</p><p><strong><em>Proof</em></strong> : First, if $T_\sharp\mu=\nu$, then for all $A \in \mathcal{B}(X)$, $B \in \mathcal{B}(Y)$,<br>$$<br>\pi(A \times Y)<br>=<br>((\mathrm{id} \times T)_\sharp\mu)(A \times Y)<br>=<br>\mu\bigl((\mathrm{id} \times T)^{-1}(A \times Y)\bigr)<br>=<br>\mu(A),<br>$$<br>and<br>$$<br>\pi(X \times B)<br>=<br>((\mathrm{id} \times T)_\sharp\mu)(X \times B)<br>=<br>\mu\bigl((\mathrm{id} \times T)^{-1}(X \times B)\bigr)<br>=<br>\mu(T^{-1}(B))<br>=<br>\nu(B).<br>$$<br>Hence $\pi \in \Pi(\mu,\nu)$.</p><p>Second, if $\pi \in \Pi(\mu,\nu)$, then for all $B \in \mathcal{B}(Y)$,<br>$$<br>\nu(B)<br>=<br>\pi(X \times B)<br>=<br>((\mathrm{id} \times T)_\sharp\mu)(X \times B)<br>=<br>\mu\bigl((\mathrm{id} \times T)^{-1}(X \times B)\bigr)<br>=<br>\mu(T^{-1}(B)).<br>$$<br>Hence $T_\sharp\mu=\nu$. 
$\square$</p><p><strong>Remark.</strong> In the problem of Monge, it may happen that there is no map $T$ such that $T_\sharp\mu=\nu$. Indeed, if $\mu=\delta_{x_0}$, $x_0 \in X$ and $\nu$ is any measure which is not a Dirac mass, then there is no such map, as<br>$$<br>T_\sharp\mu=\delta_{T(x_0)}.<br>$$<br>But the set $\Pi(\mu,\nu)$ is always non-empty, as<br>$$<br>\mu \otimes \nu \in \Pi(\mu,\nu).<br>$$</p><h2 id="Weak-star-Convergence-and-Weak-Convergence"><a href="#Weak-star-Convergence-and-Weak-Convergence" class="headerlink" title="Weak-$\star$ Convergence and Weak Convergence"></a>Weak-$\star$ Convergence and Weak Convergence</h2><p>Let $\mathcal{M}(X)$ be the set of finite signed measures on $X$. Suppose $X$ is a separable, locally compact metric space, and let $\mathcal{X}=C_0(X)$ be the set of continuous functions on $X$ vanishing at infinity, i.e. $f \in C_0(X)$ if and only if $f \in C_b(X)$ and for every $\varepsilon&gt;0$, there exists a compact subset $K \subset X$ such that<br>$$<br>|f(x)|&lt;\varepsilon<br>$$<br>on $X \setminus K$.</p><p>Endow this space with the sup norm. Then $C_0(X)$ is a closed subspace of the Banach space $C_b(X)$, hence itself a Banach space. 
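</p><p>For instance, on $X=\mathbb R$, the function $f(x)=e^{-x^2}$ belongs to $C_0(\mathbb R)$, while the constant function $f\equiv 1$ belongs to $C_b(\mathbb R)$ but not to $C_0(\mathbb R)$.</p><p>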
By Riesz Representation Theorem for all $\xi \in \mathcal{X}’$, there exists a unique $\lambda \in \mathcal{M}(X)$ such that<br>$$<br>\langle \xi,\varphi\rangle=\int_X \varphi\ d\lambda<br>$$<br>for every $\varphi \in \mathcal{X}=C_0(X)$.</p><p>Moreover $\mathcal{X}^\prime \simeq \mathcal{M}(X)$ endowed with the norm $\Vert\lambda\Vert:=|\lambda|(X).$</p><p>For signed measures $\mathcal{M}(X)$:</p><ul><li>weak-$\star$ convergence: the convergence in the duality with $C_0(X)$, that is $\mu_n \overset{\mathrm{weak}-\star}{\longrightarrow} \mu$  if and only if for all $\varphi \in C_0(X)$</li></ul><p>$$<br>\int_X \varphi\ d\mu_n \to \int_X \varphi\ d\mu<br>\qquad \text{as } n \to \infty.<br>$$</p><ul><li>weak convergence (also called narrow convergence): $\mu_n \rightharpoonup \mu$ if and only if for all $\varphi \in C_b(X)$</li></ul><p>$$<br>\int_X \varphi\ d\mu_n \to \int_X \varphi\ d\mu<br>\qquad \text{as } n \to \infty.<br>$$</p><p>If $X$ is compact, then $C_0(X)=C_b(X)=C(X)$, weak-$\star$ convergence and weak convergence are the same.</p><h2 id="Existence-of-Solutions"><a href="#Existence-of-Solutions" class="headerlink" title="Existence of Solutions"></a>Existence of Solutions</h2><p><strong>Theorem 1.1.</strong> Suppose that $X,Y$ are compact and $c:X \times Y \to \mathbb R$ is continuous. Then the Kantorovich Problem admits a minimizer.</p><p><strong><em>Proof</em></strong> : We just need to show that the set $\Pi(\mu,\nu)$ is compact and that<br>$$<br>\gamma \longmapsto K(\gamma)=\int_{X \times Y} c\ d\gamma<br>$$<br>is continuous. Since $X \times Y$ compact, $C_0(X \times Y)=C_b(X \times Y)=C(X \times Y)$, $\gamma \mapsto K(\gamma)$ is continuous.</p><p>For the compactness, take $(\gamma_n) \subset \Pi(\mu,\nu)$, since they are probability measures, they are bounded in $\mathcal{M}(X \times Y) \simeq (C(X \times Y))’$. 
By Banach-Alaoglu theorem, there exists a subsequence $\gamma_{n_k}$,<br>$$<br>\gamma_{n_k} \overset{\mathrm{weak}-\star}{\longrightarrow} \gamma \in \mathcal{M}(X \times Y).<br>$$<br>Since $X \times Y$ is compact,<br>$$<br>\gamma(X \times Y)<br>=<br>\int_{X \times Y} 1\ d\gamma<br>=<br>\lim_{k \to \infty}\int_{X \times Y} 1\ d\gamma_{n_k}<br>=<br>1.<br>$$<br>Hence $\gamma \in \mathcal{P}(X \times Y)$. Moreover, for all $\varphi \in C(X)$, and since $\gamma_{n_k} \in \Pi(\mu,\nu)$,<br>$$<br>\int_{X \times Y} \varphi\ d\gamma_{n_k}<br>=<br>\int_X \varphi\ d\mu.<br>$$<br>Let $k \to \infty$, we have<br>$$<br>\int_{X \times Y} \varphi\ d\gamma<br>=<br>\int_X \varphi\ d\mu.<br>$$<br>Hence $(\pi_x)_\sharp\gamma=\mu$ and similarly, $(\pi_y)_\sharp\gamma=\nu$. Hence $\gamma \in \Pi(\mu,\nu)$. Therefore, $\Pi(\mu,\nu)$ is compact. $\square$</p><p><strong>Lemma 1.1.</strong> Let $f:X \to \mathbb R \cup \{+\infty\}$ be a function bounded from below. Then $f$ is lower semi-continuous if and only if there exists a sequence $f_k$ of $k$-Lipschitz functions such that for every $x \in X$, $f_k(x)$ converges increasingly to $f(x)$. Furthermore, $f_k$ can also be made bounded.</p><p><strong><em>Proof</em></strong> : On the one hand, if<br>$$<br>f(x)=\lim_{k \to \infty} f_k(x)<br>$$<br>for all $x \in X$ with $f_k$ $k$-Lipschitz and increasing, then<br>$$<br>f=\sup_k f_k<br>$$<br>is also lower semi-continuous.</p><p>On the other hand, if $f$ is lower semi-continuous and bounded from below, define<br>$$<br>f_k(x)=\inf_y \{ f(y)+k\ d(x,y) \}.<br>$$<br>It is easy to show that $f_k$ is $k$-Lipschitz. For fixed $x \in X$, $f_k(x)$ is increasing and we have<br>$$<br>\inf f \leq f_k(x) \leq f(x).<br>$$<br>We just need to show that<br>$$<br>\ell:=\lim_{k \to \infty} f_k(x)=\sup_k f_k(x)=f(x).<br>$$<br>Suppose by contradiction $\ell&lt;f(x)$, which implies $\ell&lt;+\infty$. For every $k$, choose $y_k$ such that<br>$$<br>f(y_k)\leq f(y_k)+k\ d(x,y_k)&lt;f_k(x)+\frac{1}{k}\leq \ell+\frac{1}{k}. 
\tag{1}<br>$$<br>We get<br>$$<br>d(y_k,x)\leq \frac{\ell+\frac{1}{k}-f(y_k)}{k}\leq \frac{C}{k}<br>$$<br>(since $\ell&lt;\infty$, $f$ bounded from below). Hence $y_k \to x$. Let $k \to \infty$ in $(1)$. Since $f$ is lower semi-continuous,<br>$$<br>f(x)\leq \liminf_{k \to \infty} f(y_k)\leq \lim_{k \to \infty} f_k(x)=\ell,<br>$$<br>which gives a contradiction.</p><p>Finally, $f_k$ can be made bounded by taking $f_k \wedge k$. $\square$</p><p><strong>Lemma 1.2.</strong> If $f:X \to \mathbb R \cup \{+\infty\}$ is a lower semi-continuous function, bounded from below on a metric space $X$, then the functional<br>$$<br>J:\mathcal{M}_+(X)\to \mathbb R\cup\{+\infty\}<br>$$<br>defined on positive measures on $X$ through<br>$$<br>J(\mu):=\int_X f\ d\mu<br>$$<br>is lower semi-continuous for the weak convergence of measures.</p><p><strong><em>Proof</em></strong> : By Lemma 1.1, there exists a sequence $f_k$ of continuous and bounded functions converging increasingly to $f$. Then write<br>$$<br>J(\mu)=\sup_k J_k(\mu):=\int_X f_k\ d\mu<br>$$<br>(Actually $J_k \leq J$ and $J_k(\mu)\to J(\mu)$ for every $\mu$ by monotone convergence.)</p><p>Since every $J_k$ is continuous for the weak convergence, hence $J$ is a lower semi-continuous functional. $\square$</p><p><strong>Theorem 1.2.</strong> Let $X$ and $Y$ be compact metric spaces, $\mu \in \mathcal{P}(X)$, $\nu \in \mathcal{P}(Y)$ and<br>$$<br>c:X \times Y \to \mathbb R\cup\{+\infty\}<br>$$<br>be lower semi-continuous and bounded from below. Then the Kantorovich Problem $(\mathrm{KP})$ admits a solution.</p><p><strong><em>Proof</em></strong> : By Lemma 1.2<br>$$<br>\gamma \mapsto K(\gamma)=\int_{X \times Y} c\ d\gamma<br>$$<br>is lower semi-continuous. By the proof of Theorem 1.1, we know $\Pi(\mu,\nu)$ is compact. 
$\square$</p><p><strong>Theorem 1.3.</strong> Let $X$ and $Y$ be Polish spaces, $\mu \in \mathcal{P}(X)$, $\nu \in \mathcal{P}(Y)$ and<br>$$<br>c:X \times Y \to \mathbb R\cup\{+\infty\}<br>$$<br>be lower semi-continuous and bounded from below. Then $(\mathrm{KP})$ admits a solution.</p><p><strong><em>Proof</em></strong> : We only need to prove the compactness of $\Pi(\mu,\nu)$.</p><p>Since $\mu$ and $\nu$ are finite Borel measures on Polish spaces, by Ulam's theorem they are tight: for every $\varepsilon&gt;0$ there exist compact sets $K_X \subset X$ and $K_Y \subset Y$ such that<br>$$<br>\mu(X \setminus K_X)&lt;\frac{\varepsilon}{2},<br>\qquad<br>\nu(Y \setminus K_Y)&lt;\frac{\varepsilon}{2}.<br>$$<br>Then the set $K_X \times K_Y$ is compact in $X \times Y$ and for any $(\gamma_n)\subset \Pi(\mu,\nu)$<br>$$<br>\gamma_n\bigl((X \times Y)\setminus (K_X \times K_Y)\bigr)<br>\leq<br>\gamma_n\bigl((X \setminus K_X)\times Y\bigr)+\gamma_n\bigl(X \times (Y \setminus K_Y)\bigr)=<br>\mu(X \setminus K_X)+\nu(Y \setminus K_Y)<br>&lt;<br>\frac{\varepsilon}{2}+\frac{\varepsilon}{2}<br>=<br>\varepsilon.<br>$$<br>Hence $(\gamma_n)$ is tight, and by Prokhorov's theorem there exist $\gamma \in \mathcal{P}(X \times Y)$ and a subsequence $(\gamma_{n_k})$ such that<br>$$<br>\gamma_{n_k}\rightharpoonup  \gamma.<br>$$</p><p>Now, we need to show that $\gamma \in \Pi(\mu,\nu)$. For all $\varphi \in C_b(X)$, since $\gamma_{n_k}\in \Pi(\mu,\nu)$,<br>$$<br>\int_{X \times Y} \varphi(x)\ d\gamma_{n_k}(x,y)<br>=<br>\int_X \varphi(x)\ d\mu(x).<br>$$<br>Letting $k \to \infty$, we obtain<br>$$<br>\int_{X \times Y} \varphi(x)\ d\gamma(x,y)<br>=<br>\int_X \varphi(x)\ d\mu(x).<br>$$<br>Hence $(\pi_x)_\sharp\gamma=\mu$ and similarly $(\pi_y)_\sharp\gamma=\nu$. Therefore, $\gamma \in \Pi(\mu,\nu)$, and the conclusion follows as in Theorem 1.2.
Birkhäuser, 2015.</p><blockquote><p><em>The cover image of this article was taken in Saipan, Commonwealth of the Northern Mariana Islands (U.S.).</em></p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/03/29/ot_1/</id>
    <link href="https://handsteinwang.github.io/2026/03/29/ot_1/"/>
    <published>2026-03-28T16:00:00.000Z</published>
    <summary>
      <![CDATA[<p>In this article, we will talk about the problems of Monge and Kantorovich and the corresponding existence of solutions.</p>]]>
    </summary>
    <title>Problems of Monge and Kantorovich</title>
    <updated>2026-03-28T16:00:00.000Z</updated>
  </entry>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Mean-Field Langevin Dynamics" scheme="https://handsteinwang.github.io/categories/Mean-Field-Langevin-Dynamics/"/>
    <category term="Mean-Field Langevin Dynamics" scheme="https://handsteinwang.github.io/tags/Mean-Field-Langevin-Dynamics/"/>
    <category term="Optimization over the Space of Probability Measures" scheme="https://handsteinwang.github.io/tags/Optimization-over-the-Space-of-Probability-Measures/"/>
    <content>
<![CDATA[<p>Optimization over the space of probability measures is not only widely applicable, but also offers a useful perspective for analyzing certain complicated finite-dimensional nonconvex optimization problems. In particular, lifting such problems to optimization problems over probability measures can lead to better structural properties, such as convexity. Mean-field Langevin dynamics provides a representative example of this idea. Its central motivation is that some highly nonconvex optimization problems arising in neural network training become better behaved when reformulated as the optimization of a functional on the space of probability measures. This viewpoint also makes it possible to build a theoretical foundation for understanding the convergence of SGD. In what follows, we briefly introduce this perspective, mainly based on the paper by <a href="#refhu">[Hu, Kaitong, et al]</a>. The main analytical framework of this theory can be illustrated by <a href="#fig2">Figure 2</a>.</p><span id="more"></span>    <h2 id="Motivation-Neural-Networks"><a href="#Motivation-Neural-Networks" class="headerlink" title="Motivation: Neural Networks"></a>Motivation: Neural Networks</h2><p>Suppose we have a dataset $(z_m,y_m)_{m=1}^N$, where $z_m\in \mathbb{R}^{l_0}$ is a data point and $y_m\in \mathbb{R}^{l_L}$ is the corresponding label.
The aim of the neural network is to reconstruct the map from all $z_m$ to $y_m$.</p><p>Let $\varphi:\mathbb{R}\to\mathbb{R}$ be a nonlinear activation function and $\varphi^l:\mathbb{R}^l\to\mathbb{R}^l$ be its pointwise action, that is<br>$$<br>\varphi^l(z)=(\varphi(z_1),\varphi(z_2),\cdots,\varphi(z_l))\quad \text{for } z=(z_1,z_2,\cdots,z_l).<br>$$<br>Then the framework of the neural network is shown as follows.</p><div align="center">  <img src="/images/meanfield_neural.png" style="zoom:100%; display:block; margin:0 auto;" />  Figure 1: The framework of the neural network.</div><p>The parameters we need to optimize can be listed as<br>$$<br>\Psi=\left((\alpha^1,\beta^1),(\alpha^2,\beta^2),\cdots,(\alpha^L,\beta^L)\right)\in \Pi=\left((\mathbb{R}^{l_1\times l_0}\times \mathbb{R}^{l_1})\times (\mathbb{R}^{l_2\times l_1}\times \mathbb{R}^{l_2})\times \cdots \times (\mathbb{R}^{l_L\times l_{L-1}}\times \mathbb{R}^{l_L})\right).<br>$$<br>The reconstruction map can then be defined as<br>$$<br>\begin{aligned}<br>\operatorname{R\Psi}: \mathbb{R}^{l_0}&amp;\to \mathbb{R}^{l_L}\\\<br>z^0&amp;\mapsto z^L.<br>\end{aligned}<br>$$<br>The learning task can be formulated as the following optimization problem (often <strong>non-convex</strong>)<br>$$<br>\inf_{\Psi\in\Pi} \int_{\mathbb{R}^{l_0}\times \mathbb{R}^{l_L}} \Phi\Big(y-(\operatorname{R\Psi})(z)\Big)\  \nu(dz,dy),<br>$$<br>where $\Phi: \mathbb{R}^{l_L}\to \mathbb{R}$ is a convex function.</p><h3 id="Two-Layer-Neural-Networks"><a href="#Two-Layer-Neural-Networks" class="headerlink" title="Two Layer Neural Networks"></a>Two Layer Neural Networks</h3><p>Now we focus on two-layer neural networks. Let $l_0=d-1,L=2, l_1=n, l_2=1$ and $\beta^1=\beta^2=0$.
We can partition the matrix $\alpha^1\in \mathbb{R}^{n\times (d-1)}$ into blocks<br>$$<br>\alpha^1=\begin{pmatrix}<br>(\alpha_1^1)^T\\\<br>(\alpha_2^1)^T\\\<br>\vdots\\\<br>(\alpha_n^1)^T<br>\end{pmatrix},\quad \alpha_i^1\in \mathbb{R}^{d-1},\ i=1,2,\cdots,n<br>$$<br>and let<br>$$<br>\alpha^2=\Big(\frac{c_1}{n},\frac{c_2}{n},\cdots,\frac{c_n}{n}\Big),\quad c_i\in \mathbb{R},\ i=1,2,\cdots,n.<br>$$<br>Then for input $z^0\in \mathbb{R}^{d-1}$, we have<br>$$<br>z^1=\varphi^n(\alpha^1 z^0)=\begin{pmatrix}<br>\varphi(\alpha_1^1\cdot z^0)\\\<br>\varphi(\alpha_2^1\cdot z^0)\\\<br>\vdots\\\<br>\varphi(\alpha_n^1\cdot z^0)<br>\end{pmatrix}<br>$$<br>and<br>$$<br>z^2=\alpha^2z^1=\frac{1}{n}\sum_{i=1}^n c_i\varphi(\alpha_i^1\cdot z^0),<br>$$<br>where “$\cdot$” denotes the inner product. Hence the reconstruction map for the two-layer neural network is<br>$$<br>\begin{aligned}<br>\operatorname{R\Psi}: \mathbb{R}^{d-1}&amp;\to \mathbb{R}\\\<br>z&amp;\mapsto \frac{1}{n}\sum_{i=1}^n c_i\varphi(\alpha_i\cdot z).<br>\end{aligned}<br>$$<br>Then the learning task can be formulated as the following optimization problem (often <strong>non-convex</strong>)<br>$$<br>\inf_{\alpha_i,\ c_i,\  i=1,2,\cdots,n} \int_{\mathbb{R}^{d}} \Phi\Big(y-\frac{1}{n}\sum_{i=1}^n c_i\varphi(\alpha_i\cdot z)\Big)\  \nu(dz,dy),<br>$$<br>where $\Phi: \mathbb{R}\to \mathbb{R}$ is a convex function.
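As a quick numerical illustration of the two-layer reconstruction map $z\mapsto \frac{1}{n}\sum_{i=1}^n c_i\varphi(\alpha_i\cdot z)$, here is a minimal sketch of our own (the ReLU is chosen as the activation $\varphi$ and the function names `relu`, `reconstruct` are ours, purely for concreteness). It also checks that the output only depends on the empirical measure of the parameters $(\alpha_i,c_i)$: duplicating all particles leaves it unchanged.

```python
import numpy as np

def relu(s):
    # activation function phi (our illustrative choice)
    return np.maximum(s, 0.0)

def reconstruct(alpha, c, z):
    """Two-layer map z -> (1/n) * sum_i c_i * phi(alpha_i . z).

    alpha: (n, d-1) array of inner-layer rows alpha_i
    c:     (n,) array of outer-layer weights c_i
    z:     (d-1,) input vector
    """
    return np.mean(c * relu(alpha @ z))

rng = np.random.default_rng(0)
n, d_in = 50, 3
alpha = rng.normal(size=(n, d_in))
c = rng.normal(size=n)
z = rng.normal(size=d_in)

# Duplicating every particle (alpha_i, c_i) doubles n but keeps the
# empirical measure, hence the output, unchanged.
out = reconstruct(alpha, c, z)
out_dup = reconstruct(np.vstack([alpha, alpha]), np.concatenate([c, c]), z)
assert abs(out - out_dup) < 1e-12
```

The $\frac{1}{n}$ averaging is exactly what makes the $n\to\infty$ limit a measure-theoretic object.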
To simplify the notations, we denote $x=(\alpha,c)\in \mathbb{R}^{d}$ and<br>$$<br>\widehat\varphi (x,z):= c\varphi(\alpha\cdot z).<br>$$<br>Then the optimization problem can be written as<a id="problem1"></a><br>$$<br>\inf_{x^i=(\alpha_i,c_i),\  i=1,2,\cdots,n} F^n(x^1,\cdots,x^n):= \int_{\mathbb{R}^{d}} \Phi\Big(y-\frac{1}{n}\sum_{i=1}^n \widehat\varphi(x^i,z)\Big)\  \nu(dz,dy).\tag{P1}<br>$$<br><strong>One of the most important observations is that if we lift the finite-dimensional optimization problem <a href="#problem1">(P1)</a> to the infinite-dimensional optimization problem over the space of probability measures <a href="#problem2">(P2)</a><a id="problem2"></a>, it will become convex.</strong><br>$$<br>\inf_{m\in \mathcal{P}(\mathbb{R}^d)} F(m):=\int_{\mathbb{R}^{d}} \Phi\Big(y-\mathbb{E}_{X\sim m}[\widehat\varphi(X,z)]\Big)\  \nu(dz,dy).\tag{P2}<br>$$</p><p><strong>Proposition 1.</strong> The functional $F: \mathcal{P}(\mathbb{R}^d)\to\mathbb{R}$ is convex.</p><p><strong><em>Proof</em></strong> : For all $m,m^\prime\in \mathcal{P}(\mathbb{R}^d)$ and all $t\in [0,1]$, </p><p>$$<br>\begin{aligned}<br>F((1-t)m+tm^\prime) &amp;=\int_{\mathbb{R}^{d}} \Phi\Big(y-\mathbb E_{X\sim (1-t)m+tm^\prime}[\widehat\varphi(X,z)]\Big)\  \nu(dz,dy)\\\<br>&amp;=\int_{\mathbb{R}^{d}} \Phi\Big((1-t)(y-\mathbb E_{X\sim m}[\widehat\varphi(X,z)])+t(y-\mathbb E_{X\sim m^\prime}[\widehat\varphi(X,z)])\Big)\  \nu(dz,dy)\\\<br>&amp;\le (1-t)\int_{\mathbb{R}^{d}} \Phi\Big(y-\mathbb E_{X\sim m}[\widehat\varphi(X,z)]\Big)\  \nu(dz,dy)+t \int_{\mathbb{R}^{d}} \Phi\Big(y-\mathbb E_{X\sim m^\prime}[\widehat\varphi(X,z)]\Big)\  \nu(dz,dy)\\\<br>&amp;=(1-t)F(m)+tF(m^\prime).<br>\end{aligned}<br>$$</p><p>Hence $F$ is convex. $\square$</p><p>Now you may wonder: what is the relationship between <a href="#problem1">(P1)</a> and <a href="#problem2">(P2)</a>?
It is obvious that<br>$$<br>F^n(x^1,\cdots,x^n)=F(\frac{1}{n}\sum_{i=1}^n \delta_{x^i}).<br>$$<br>Hence we have<br>$$<br>\inf_{x^i,\  i=1,2,\cdots,n} F^n(x^1,\cdots,x^n)\ge \inf_{m\in \mathcal{P}(\mathbb{R}^d)} F(m).<br>$$<br>Moreover, we have the following theorem, which shows that the infima of <a href="#problem1">(P1)</a> and <a href="#problem2">(P2)</a> are very close as long as $n$ is sufficiently large.</p><p><strong>Theorem 1.</strong> <a id="theorem1"></a> We assume that the 2nd order linear functional derivative of $F$ exists, is jointly continuous in both variables and that there is $L&gt;0$ such that for any random variables $\eta_1,\eta_2$ with $\mathbb{E}[|\eta_i|^2]&lt;\infty$, $i=1,2$, it holds that</p><p>$$<br>\mathbb{E} \left[\sup_{\nu\in\mathcal P_2(\mathbb{R}^d)}\left\vert\frac{\delta F}{\delta m}(\nu,\eta_1)\right\vert\right]<br>+<br>\mathbb{E}\left[\sup_{\nu\in\mathcal P_2(\mathbb{R}^d)}\left\vert\frac{\delta^2 F}{\delta m^2}(\nu,\eta_1,\eta_2)\right\vert\right]<br>\leq L.<br>$$</p><p>If there is an $m^\star\in\mathcal P_2(\mathbb{R}^d)$ such that $F(m^\star)=\inf_{m\in\mathcal P_2(\mathbb{R}^d)}F(m)$, then we have that</p><p>$$<br>\left|<br>\inf_{x^i,\  i=1,2,\cdots,n}<br>F\left(\frac{1}{n}\sum_{i=1}^n\delta_{x^i}\right)<br>-<br>F(m^\star)<br>\right|<br>\leq<br>\frac{2L}{n}.<br>$$</p><hr><p>So the question becomes how to solve  <a href="#problem2">(P2)</a>?
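The identity $F^n(x^1,\cdots,x^n)=F\bigl(\frac{1}{n}\sum_{i=1}^n\delta_{x^i}\bigr)$ and the convexity of the lifted functional are easy to check numerically. Below is a small sketch of our own (the square loss $\Phi(s)=s^2$, the choice $\widehat\varphi(x,z)=c\tanh(\alpha\cdot z)$ and taking $\nu$ to be an empirical data measure are all purely illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, d_in = 20, 10, 2           # data points, particles, input dimension
Z = rng.normal(size=(N, d_in))    # data z_m
Y = rng.normal(size=N)            # labels y_m

def phi_hat(x, z):
    # widehat-phi(x, z) = c * tanh(alpha . z) with x = (alpha, c)
    alpha, c = x[:-1], x[-1]
    return c * np.tanh(alpha @ z)

def F(weights, particles):
    """Lifted objective at the discrete measure m = sum_i w_i delta_{x^i},
    with square loss Phi(s) = s^2 and nu the empirical data measure."""
    preds = np.array([sum(wi * phi_hat(xi, z) for wi, xi in zip(weights, particles))
                      for z in Z])
    return np.mean((Y - preds) ** 2)

def F_n(particles):
    # finite-dimensional objective F^n(x^1, ..., x^n)
    return F(np.full(len(particles), 1.0 / len(particles)), particles)

X = rng.normal(size=(n, d_in + 1))
w = np.full(n, 1.0 / n)
assert abs(F_n(X) - F(w, X)) < 1e-12    # F^n = F at the empirical measure

# Convexity of F in the measure: F((1-t)m + t m') <= (1-t)F(m) + tF(m')
X2 = rng.normal(size=(n, d_in + 1))
t = 0.3
mix = F(np.concatenate([(1 - t) * w, t * w]), np.vstack([X, X2]))
assert mix <= (1 - t) * F_n(X) + t * F_n(X2) + 1e-12
```

The mixture measure is represented here by concatenating the two particle clouds with reweighted masses, which is exactly how convex combinations of empirical measures act.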
From now on, we assume the functional $F$ has the following properties.</p><p><strong>Assumption 1.</strong> <a id="assumption1"></a> Assume that $F \in \mathcal C^1$ is convex and bounded from below, where we say a function $F : \mathcal{P}(\mathbb{R}^d) \to \mathbb{R}$ is in $\mathcal C^1$ if there exists a bounded continuous function<br>$$<br>\frac{\delta F}{\delta m} : \mathcal{P}(\mathbb{R}^d) \times \mathbb{R}^d \to \mathbb{R}<br>$$<br>such that<a id="equation1"></a><br>$$<br>F(m’) - F(m) = \int_0^1 \int_{\mathbb{R}^d} \frac{\delta F}{\delta m}\bigl((1-\lambda)m + \lambda m’, x\bigr)\ (m’ - m)(dx)\ d\lambda.<br>\tag{1}<br>$$</p><p>We will refer to $\frac{\delta F}{\delta m}$ as the linear functional derivative. There is at most one $\frac{\delta F}{\delta m}$, up to a constant shift, satisfying <a href="#equation1">(1)</a>. To avoid this ambiguity, we impose<br>$$<br>\int_{\mathbb{R}^d} \frac{\delta F}{\delta m}(m,x)\ m(dx)=0.<br>$$</p><p>If $(m,x)\mapsto \frac{\delta F}{\delta m}(m,x)$ is continuously differentiable in $x$, we define its intrinsic derivative $D_m F : \mathcal{P}(\mathbb{R}^d)\times \mathbb{R}^d \to \mathbb{R}^d$ by<br>$$<br>D_m F(m,x)=\nabla\left(\frac{\delta F}{\delta m}(m,x)\right).<br>$$</p><p>However, even under <a href="#assumption1">Assumption 1</a>, the existence and uniqueness of the minimizer of the optimization problem <a href="#problem2">(P2)</a> are not clear. So instead, we can consider the regularized problem <a href="#problem3">(P3)</a>.<a id="problem3"></a></p><p>$$<br>\inf_{m\in \mathcal{P}(\mathbb{R}^d)} V^\sigma(m):=F(m)+\frac{\sigma^2}{2}H(m),\tag{P3}<br>$$<br>where $H(m)$ denotes the relative entropy of $m$ with respect to the Gibbs measure (we abuse notation, writing $m$ also for the density when it exists),<br>$$<br>H(m):=\int_{\mathbb{R}^d} \log\left(\frac{m(x)}{g(x)}\right)m(x)\ dx \quad \text{if } m \text{ has a density, and } H(m):=+\infty \text{ otherwise},<br>$$<br>where<br>$$<br>g(x)=e^{-U(x)} \text{ with } U \text{ s.t. } \int_{\mathbb{R}^d} e^{-U(x)}dx=1,<br>$$<br>is the density of the Gibbs measure and the function $U$ satisfies the following conditions.</p><p><strong>Assumption 2.</strong> <a id="assumption2"></a>The function $U:\mathbb{R}^d \to \mathbb{R}$ belongs to $C^\infty$.
Further,</p><p>(i) there exist constants $C_U&gt;0$ and $C’_U\in\mathbb{R}$ such that<br>$$<br>\nabla U(x)\cdot x \ge C_U |x|^2 + C’_U \quad \text{for all } x\in\mathbb{R}^d.<br>$$<br>(ii) $\nabla U$ is Lipschitz continuous.</p><p>Immediately, we obtain that there exist $0\le C’ \le C$ such that for all $x\in\mathbb{R}^d$<br>$$<br>C’|x|^2 - C \le U(x) \le C(1+|x|^2), \qquad |\Delta U(x)| \le C.<br>$$</p><p>For the regularized optimization problem <a href="#problem3">(P3)</a>, we have a necessary and sufficient first-order optimality condition.</p><p><strong>Theorem 2.</strong> <a id="theorem2"></a> Under <a href="#assumption1">Assumption 1</a> and <a href="#assumption2">Assumption 2</a>, the function $V^\sigma$ has a unique minimizer, which is absolutely continuous with respect to the Lebesgue measure $\ell$ and belongs to $\mathcal{P}_2(\mathbb{R}^d)$. Moreover, $m^\star \in \mathcal{P}_2(\mathbb{R}^d)$ satisfies<br>$$<br>m^\star = \underset{m \in \mathcal{P}(\mathbb{R}^d)}{\arg\min}\  V^\sigma(m)<br>$$<br>if and only if $m^\star$ is equivalent to the Lebesgue measure and<a id="equation2"></a></p><p>$$<br>\frac{\delta F}{\delta m}(m^\star, \cdot) + \frac{\sigma^2}{2} \log(m^\star) + \frac{\sigma^2}{2} U<br>\text{ is a constant, } \ell\text{-a.s.},<br>\tag{2}<br>$$</p><p>where we abuse the notation, still denoting by $m^\star$ the density with respect to the Lebesgue measure.</p><p>What’s more, as shown in <a href="https://handsteinwang.github.io/2026/02/18/An-Important-Example-of-Gamma-Convergence/">my previous blog: An Example of Gamma Convergence</a>, we know that<br>$$<br>V^\sigma = F + \frac{\sigma^2}{2} H \xrightarrow{\Gamma} F \quad\text{as } \sigma \downarrow 0.<br>$$<br>Therefore, every cluster point of<br>$$<br>\left(\arg\min_m V^\sigma(m)\right)_\sigma<br>$$<br>is a minimizer of $F$.</p><hr><p>By <a href="#theorem2">Theorem 2</a>, we know that if $m^\star$ is the unique minimizer of <a href="#problem3">(P3)</a>, then<a id="equation3"></a><br>$$<br>D_m F(m^\star, x) + \frac{\sigma^2}{2} \frac{\nabla
m^\star(x)}{m^\star(x)} + \frac{\sigma^2}{2} \nabla U(x)<br>=0\quad \ell\text{-a.s.},<br>\tag{3}<br>$$<br>which means $m^\star$ is the invariant measure of the following Fokker-Planck equation<a id="equation4"></a><br>$$<br>\partial_t m_t=\nabla\cdot \left[\left(D_m F(m_t, x)+\frac{\sigma^2}{2} \nabla U(x)\right)m_t\right]+\frac{\sigma^2}{2}  \Delta m_t.\tag{4}<br>$$</p><p>Therefore, $m^\star$ is also the invariant measure of the following mean-field Langevin dynamic<a id="equation5"></a><br>$$<br>dX_t = -\left(D_m F(m_t,X_t) + \frac{\sigma^2}{2}\nabla U(X_t)\right)dt + \sigma dW_t,<br>\qquad \text{where } m_t := \mathrm{Law}(X_t).<br>\tag{5}<br>$$</p><hr><p>However, the dynamic <a href="#equation5">(5)</a> is hard to discretize directly. Consider independent random variables $X_0^i\sim m_0$ and independent Brownian motions $(W^i),\ i=1,2,\cdots,n$. By approximating the law of the process <a href="#equation5">(5)</a> by its empirical law, we arrive at the following interacting particle system<a id="equation6"></a></p><p>$$<br>\begin{cases}<br>dX_t^i=-\left(D_mF(m_t^n,X_t^i)+\frac{\sigma^2}{2}\nabla U(X_t^i)\right)dt+\sigma dW_t^i,\quad i=1,\ldots,n,\\\<br>m_t^n=\dfrac{1}{n} \sum\limits_{i=1}^n\delta_{X_t^i}.<br>\end{cases}<br>\tag{6}<br>$$<br>By the theory of mean-field limits, we know that as $n\to\infty$, the dynamic <a href="#equation6">(6)</a> tends to the dynamic <a href="#equation5">(5)</a>.
By the following lemma, the term $D_mF(m_t^n,X_t^i)$ is easy to compute.</p><p><strong>Lemma 1.</strong> <a id="lemma1"></a> We have<br>$$<br>\partial_{x^i} F^n(x^1,\cdots,x^n) = \frac{1}{n}D_mF\left(\frac{1}{n}\sum_{j=1}^n \delta_{x^j},x^i\right).<br>$$</p><p>Hence by <a href="#lemma1">Lemma 1</a>, the dynamic <a href="#equation6">(6)</a> can be written as<br>$$<br>dX_t^i = -\left( n \partial_{x^i} F^n(X_t^1,\cdots,X_t^n) + \frac{\sigma^2}{2} \nabla U(X_t^i) \right) dt + \sigma dW_t^i ,\quad i=1,2,\cdots,n.<br>$$<br>By the definition of $F^n$ in <a href="#problem1">(P1)</a>, we have<br>$$<br>\partial_{x^i} F^n(x^1,\cdots,x^n)<br>=<br>-\frac{1}{n}<br>\int_{\mathbb{R}^d}<br>\Phi^\prime\left(<br>y-\frac{1}{n}\sum_{j=1}^n \hat{\varphi}(x^j,z)<br>\right)<br>\nabla \hat{\varphi}(x^i,z)\ \nu(dz,dy).<br>$$</p><p>We thus see that the dynamic <a href="#equation6">(6)</a> corresponds to</p><p>$$<br>dX_t^i<br>=<br>\left(<br>\int_{\mathbb{R}^d}<br>\Phi^\prime\left(<br>y-\frac{1}{n}\sum_{j=1}^n \hat{\varphi}(X_t^j,z)<br>\right)<br>\nabla \hat{\varphi}(X_t^i,z)\ \nu(dz,dy)<br>-<br>\frac{\sigma^2}{2}\nabla U(X_t^i)<br>\right)dt<br>+\sigma dW_t^i ,\quad i=1,2,\cdots,n.<br>$$</p><p>For a fixed time step $\tau &gt; 0$ and the grid of time points $t_k = k\tau$, $k = 0,1,\ldots$, we can write the explicit Euler scheme</p><p>$$<br>X_{t_{k+1}}^{\tau,i} - X_{t_k}^{\tau,i}<br>=<br>\left(<br>\int_{\mathbb{R}^d}<br>\Phi^\prime \left(<br>y - \frac{1}{n}\sum_{j=1}^n \hat{\varphi}(X_{t_k}^{\tau,j},z)<br>\right)<br>\nabla \hat{\varphi}(X_{t_k}^{\tau,i},z)\ \nu(dz,dy)<br>-<br>\frac{\sigma^2}{2}\nabla U(X_{t_k}^{\tau,i})<br>\right)\tau<br>+<br>\sigma\bigl(W_{t_{k+1}}^i - W_{t_k}^i\bigr).<br>$$</p><p>To relate this to the gradient descent algorithm, we consider the case where we are given data points $(y_m,z_m),\ m=1,2,\cdots,N$, which are i.i.d. samples from $\nu$. 
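The explicit Euler scheme above can be implemented in a few lines. The sketch below is our own minimal version (assuming, for illustration only, the square loss $\Phi(s)=s^2$, $\hat{\varphi}(x,z)=c\tanh(\alpha\cdot z)$, the Gaussian regularizer $U(x)=\frac{|x|^2}{2}+\frac{d}{2}\log(2\pi)$ so that $\nabla U(x)=x$, and $\nu$ an empirical data measure); it performs one full-batch Euler step of the interacting particle system:

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, d_in = 30, 8, 2
Z = rng.normal(size=(N, d_in))   # data z_m
Y = rng.normal(size=N)           # labels y_m
sigma, tau = 0.5, 0.01

def phi_hat(X, z):
    # X: (n, d_in+1) particles x^i = (alpha_i, c_i); returns phi_hat(x^i, z) for all i
    return X[:, -1] * np.tanh(X[:, :-1] @ z)

def grad_phi_hat(x, z):
    # gradient of c * tanh(alpha . z) with respect to x = (alpha, c)
    s = np.tanh(x[:-1] @ z)
    return np.concatenate([x[-1] * (1 - s ** 2) * z, [s]])

def euler_step(X):
    """One explicit Euler step of the particle system (6) with
    Phi(s) = s^2 (so Phi'(s) = 2s) and U(x) = |x|^2/2 + const."""
    X_new = np.empty_like(X)
    for i in range(n):
        drift = np.zeros(d_in + 1)
        for z, y in zip(Z, Y):
            residual = y - phi_hat(X, z).mean()
            drift += 2.0 * residual * grad_phi_hat(X[i], z) / N
        drift -= 0.5 * sigma ** 2 * X[i]   # -(sigma^2/2) * grad U, with grad U(x) = x
        X_new[i] = X[i] + tau * drift + sigma * np.sqrt(tau) * rng.normal(size=d_in + 1)
    return X_new

X = rng.normal(size=(n, d_in + 1))
X = euler_step(X)
assert X.shape == (n, d_in + 1)
assert np.all(np.isfinite(X))
```

Replacing the full-batch average over the data by a single uniformly sampled point $(y_{I_k},z_{I_k})$ yields exactly the noisy SGD update discussed next.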
If the loss function is the square loss $\Phi(x)=x^2$, then the evolution of the parameter $x_k^i$ reads</p><p>$$<br>x_{k+1}^i<br>=<br>x_k^i<br>+<br>\tau<br>\left(<br>2\left(<br>y_{I_k} - \frac{1}{n}\sum_{j=1}^n \hat{\varphi}(x_k^j,z_{I_k})<br>\right)<br>\nabla \hat{\varphi}(x_k^i,z_{I_k})<br>-<br>\frac{\sigma^2}{2}\nabla U(x_k^i)<br>\right)<br>+<br>\sigma\sqrt{\tau}\ \xi_k^i,<br>$$</p><p>with $I_k\sim \mathrm{Unif}\{1,2,\cdots,N\}$ and $\xi_k^i$ independent samples from $N(0,I_d)$, which can be viewed as a version of the <strong>regularized noisy SGD algorithm</strong>.</p><h2 id="Summary"><a href="#Summary" class="headerlink" title="Summary"></a>Summary</h2><p>The main analytical framework of this theory can be illustrated by the following diagram.<a id="fig2"></a></p><div align="center">  <img src="/images/meanfield_intro.png" style="zoom:100%; display:block; margin:0 auto;" />  Figure 2: Summary of the main analytical framework.</div><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><p>Hu, Kaitong, et al. “Mean-field Langevin dynamics and energy landscape of neural networks.” <em>Annales de l’Institut Henri Poincare (B) Probabilites et statistiques</em>. Vol. 57. No. 4. Institut Henri Poincaré, 2021. <a id="refhu"></a></p><blockquote><p>The cover image of this article was taken at Jungfrau in Switzerland.</p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/03/25/An-Introduction-to-Mean-Field-Langevin-Dynamics/</id>
    <link href="https://handsteinwang.github.io/2026/03/25/An-Introduction-to-Mean-Field-Langevin-Dynamics/"/>
    <published>2026-03-24T16:00:00.000Z</published>
    <summary>
      <![CDATA[<p>Optimization over the space of probability measures is not only widely applicable, but also offers a useful perspective for analyzing certain complicated finite-dimensional nonconvex optimization problems. In particular, lifting such problems to optimization problems over probability measures can lead to better structural properties, such as convexity. Mean-field Langevin dynamics provides a representative example of this idea. Its central motivation is that some highly nonconvex optimization problems arising in neural network training become better behaved when reformulated as the optimization of a functional on the space of probability measures. This viewpoint also makes it possible to build a theoretical foundation for understanding the convergence of SGD. In what follows, we briefly introduce this perspective, mainly based on the paper by <a href="#refhu">[Hu, Kaitong, et al]</a>. The main analytical framework of this theory can be illustrated by <a href="#fig2">Figure 2</a>.</p>]]>
    </summary>
    <title>An Introduction to Mean-Field Langevin Dynamics</title>
    <updated>2026-03-26T16:00:00.000Z</updated>
  </entry>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Probability Theory" scheme="https://handsteinwang.github.io/categories/Probability-Theory/"/>
    <category term="Probability Theory" scheme="https://handsteinwang.github.io/tags/Probability-Theory/"/>
    <category term="Measure Theory" scheme="https://handsteinwang.github.io/tags/Measure-Theory/"/>
    <content>
<![CDATA[<p>In this article, we will introduce some important properties of the relative entropy, including lower semi-continuity, convexity, and compactness of sublevel sets, based on the Donsker-Varadhan variational formula.</p><span id="more"></span>    <p>Let $X$ be a Polish space, and $\mathcal{B}$ be the corresponding Borel $\sigma-$algebra. We denote by $\mathcal{P}(X)$ the set of probability measures on $(X,\mathcal{B})$.</p><p><strong>Definition 1.</strong> For $\nu\in \mathcal{P}(X)$, the relative entropy functional (also called KL divergence) is a mapping $H(\cdot \Vert\nu ): \mathcal{P}(X)\to \overline{\mathbb{R}}$, defined by</p><p>$$<br>H(\mu\Vert\nu):=\begin{cases}<br>\int_X \log \left(\frac{d\mu}{d\nu}\right)d\mu,&amp; \text{if } \mu\ll \nu\\\<br>+\infty,&amp; \text{otherwise}<br>\end{cases}.<br>$$</p><p><strong>Remark 1.</strong> Since $s(\log s)^{-}$ is bounded for $s\in [0,+\infty)$, whenever $\mu\in \mathcal{P}(X)$ is absolutely continuous with respect to $\nu$,</p><p>$$<br>\int_X \left(\log \frac{d\mu}{d\nu}\right)^{-}d\mu=\int_X \frac{d\mu}{d\nu}\left(\log \frac{d\mu}{d\nu}\right)^{-}d\nu&lt;+\infty.<br>$$</p><p>It follows that the relative entropy is well-defined.</p><p><strong>Lemma 1.</strong> For $\nu\in \mathcal{P}(X)$, the relative entropy functional $H(\cdot\Vert\nu)$ is positive definite, namely, $H(\mu\Vert\nu)\ge 0$ and $H(\mu\Vert\nu)=0$ if and only if $\mu=\nu$.</p><p><strong>Proof.</strong>  It suffices to consider the case where $H(\mu\Vert\nu)&lt;+\infty$. Since $s\log s\ge s-1$ with equality if and only if $s=1$,<br>$$<br>H(\mu\Vert\nu)=\int_X \frac{d\mu}{d\nu}\log \left(\frac{d\mu}{d\nu}\right) d\nu\ge \int_X  \left(\frac{d\mu}{d\nu}-1\right) d\nu=\int_X d\mu-\int_Xd\nu=\mu(X)-\nu(X)=0.<br>$$<br>The equality holds if and only if<br>$$<br>\frac{d\mu}{d\nu}=1\quad \nu-a.e.\ ,<br>$$<br>which holds if and only if $\mu=\nu$. This completes the proof.
$\square$</p><p><strong>Proposition 1.</strong> Let $f:X\to\mathbb{R}$ be a bounded measurable function and $\nu\in \mathcal{P}(X)$. The following conclusions hold.</p><p>(a) We have the variational formula<br>$$<br>-\log \int_X e^{-f} d\nu=\inf_{\mu\in \mathcal{P}(X)} \left\{ H(\mu\Vert \nu)+\int_X f\  d\mu \right\}.\tag{1}<br>$$<br>(b) Let $\mu_0$ denote the probability measure on $X$ which is absolutely continuous with respect to $\nu$ and satisfies<br>$$<br>\frac{d\mu_0}{d\nu}(x) = \frac{e^{-f(x)}}{\int_X e^{-f}d\nu}.<br>$$<br>Then the infimum in the variational formula (1) is uniquely attained at $\mu_0$.</p><p><strong>Proof.</strong> For part (a), it suffices to prove that<br>$$<br>-\log \int_X e^{-f} d\nu=\inf \left\{ H(\mu\Vert \nu)+\int_X f\  d\mu: \mu \in \mathcal{P}(X),\ H(\mu\Vert\nu)&lt;+\infty\right\}.<br>$$<br>If $H(\mu\Vert\nu)&lt;+\infty$, then $\mu\ll\nu$; and since $f$ is bounded, $\mu_0$ is equivalent to $\nu$, so we also have $\mu\ll\mu_0$. Thus</p><p>$$<br>\begin{aligned}<br>H(\mu\Vert \nu)+\int_X f\  d\mu&amp;= \int_X \log \left(\frac{d\mu}{d\nu}\right)d\mu +\int_X f\  d\mu\\\<br>&amp;=\int_X \log \left(\frac{d\mu}{d\mu_0}\right)d\mu+\int_X \log \left(\frac{d\mu_0}{d\nu}\right)d\mu +\int_X f\  d\mu\\\<br>&amp;=H(\mu\Vert\mu_0)-\log \int_X e^{-f} d\nu.<br>\end{aligned}<br>$$<br>Hence, by Lemma 1, $H(\mu\Vert\mu_0)\ge 0$ with equality if and only if $\mu=\mu_0$, so the infimum equals $-\log \int_X e^{-f} d\nu$ and is uniquely attained at $\mu_0$. This proves both (a) and (b). $\square$</p><p>Now, we denote by $C_b(X)$ the space of bounded continuous functions mapping $X$ into $\mathbb{R}$ and by $B_b(X)$ the space of bounded Borel measurable functions mapping $X$ into $\mathbb{R}$. 
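On a finite state space, both the variational formula (1) and the Gibbs-type minimizer $\mu_0$ of Proposition 1 can be verified by direct computation. A minimal numerical sketch of our own (the finite space and the random choices of $f$ and $\nu$ are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
K = 6                                   # finite state space {0, ..., K-1}
nu = rng.random(K); nu /= nu.sum()      # reference probability measure
f = rng.normal(size=K)                  # bounded "potential" f

def H(mu, nu):
    # relative entropy H(mu || nu) on a finite space (mu << nu automatic here)
    return float(np.sum(mu * np.log(mu / nu)))

lhs = -np.log(np.sum(np.exp(-f) * nu))  # -log int e^{-f} dnu

# minimizer from part (b): d(mu_0)/d(nu) = e^{-f} / int e^{-f} dnu
mu0 = np.exp(-f) * nu
mu0 /= mu0.sum()
assert abs(H(mu0, nu) + np.sum(f * mu0) - lhs) < 1e-10  # infimum is attained at mu_0

# any other probability measure gives a value of H + int f dmu at least as large
for _ in range(100):
    mu = rng.random(K); mu /= mu.sum()
    assert H(mu, nu) + np.sum(f * mu) >= lhs - 1e-10
```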
</p><p><strong>Theorem 1 (Donsker-Varadhan variational formula).</strong> For each $\mu$ and $\nu$ in $\mathcal{P}(X)$,<br>$$<br>H(\mu\Vert\nu)=\sup_{g\in C_b(X)}\left\{\int_X g\ d\mu-\log\int_X e^{g}\ d\nu\right\}=\sup_{h\in B_b(X)}\left\{\int_X h\ d\mu-\log\int_X e^{h}\ d\nu\right\}.<br>$$</p><p><strong>Proof.</strong> We first show that<br>$$<br>\sup_{g\in C_b(X)}\left\{\int_X g\ d\mu-\log\int_X e^{g}\ d\nu\right\}=\sup_{h\in B_b(X)}\left\{\int_X h\ d\mu-\log\int_X e^{h}\ d\nu\right\}.<br>$$<br>On the one hand, given $\varepsilon&gt;0$, since $X$ is a Polish space and $\mu$ and $\nu$ are Borel probability measures, by Ulam’s Theorem, $\mu$ and $\nu$ are tight. Hence, there is a compact subset $K$ of $X$ such that<br>$$<br>\mu(K^c)\le \varepsilon\quad \text{and}\quad \nu(K^c)\le \varepsilon.<br>$$<br>By Lusin’s Theorem (applied to $\mu+\nu$) and the Tietze-Urysohn Extension Theorem, for any $h\in B_b(X)$, there exist a closed subset $F$ of $K$, which is also compact, such that<br>$$<br>\mu(K\setminus F)\le \varepsilon \quad \text{and}\quad \nu(K\setminus F)\le \varepsilon,<br>$$<br>and a function $g\in C_b(X)$ such that<br>$$<br>g|_F=h|_F \quad \text{and}\quad \lVert g\rVert_{\infty}\le \lVert h\rVert_{\infty}.<br>$$<br>It follows that<br>$$<br>\mu(F^c)\le \mu(K^c)+ \mu(K\setminus F)\le 2\varepsilon \quad \text{and}\quad \nu(F^c)\le \nu(K^c)+ \nu(K\setminus F)\le 2\varepsilon.<br>$$<br>Since $g=h$ on $F$,<br>$$<br>\int_X h\ d\mu=\int_X g\ d\mu+\int_{F^c} (h-g)\ d\mu\le \int_X g\ d\mu+2\lVert h\rVert_{\infty}\ \mu(F^c)\le \int_X g\ d\mu+4\lVert h\rVert_{\infty}\varepsilon,<br>$$<br>while<br>$$<br>\int_X e^{h}\ d\nu=\int_X e^{g}\ d\nu+\int_{F^c} \left(e^{h}-e^{g}\right)\ d\nu\ge \int_X e^{g}\ d\nu-4\varepsilon e^{\lVert h\rVert_{\infty}}.<br>$$<br>Since $\int_X e^{g}\ d\nu\ge e^{-\lVert h\rVert_{\infty}}$, for $\varepsilon$ small enough this yields<br>$$<br>\log\int_X e^{h}\ d\nu\ge \log\int_X e^{g}\ d\nu+\log\left(1-4\varepsilon e^{2\lVert h\rVert_{\infty}}\right)<br>$$<br>and therefore<br>$$<br>\int_X h\ d\mu-\log\int_X e^{h}\ d\nu \le \int_{X} g\ d\mu-\log\int_{X} e^{g}\ d\nu+4\lVert h\rVert_{\infty}\varepsilon-\log\left(1-4\varepsilon e^{2\lVert h\rVert_{\infty}}\right).<br>$$</p><p>Taking the supremum over $g\in C_b(X)$ and letting $\varepsilon\to 0$ yields<br>$$<br>\int_X h\ 
d\mu-\log\int_X e^{h}\ d\nu \le \sup_{g\in C_b(X)}\left\{\int_X g\ d\mu-\log\int_X e^{g}\ d\nu\right\},<br>$$<br>and since $h\in B_b(X)$ is arbitrary, we conclude that<br>$$<br>\sup_{h\in B_b(X)}\left\{\int_X h\ d\mu-\log\int_X e^{h}\ d\nu\right\}\le \sup_{g\in C_b(X)}\left\{\int_X g\ d\mu-\log\int_X e^{g}\ d\nu\right\}.<br>$$<br>On the other hand, since $C_b(X)\subset B_b(X)$, we have<br>$$<br>\sup_{h\in B_b(X)}\left\{\int_X h\ d\mu-\log\int_X e^{h}\ d\nu\right\}\ge \sup_{g\in C_b(X)}\left\{\int_X g\ d\mu-\log\int_X e^{g}\ d\nu\right\}.<br>$$<br>Hence<br>$$<br>\sup_{g\in C_b(X)}\left\{\int_X g\ d\mu-\log\int_X e^{g}\ d\nu\right\}=\sup_{h\in B_b(X)}\left\{\int_X h\ d\mu-\log\int_X e^{h}\ d\nu\right\}.<br>$$<br>Next, we denote<br>$$<br>R(\mu,\nu):=\sup_{g\in C_b(X)}\left\{\int_X g\ d\mu-\log\int_X e^{g}\ d\nu\right\}=\sup_{h\in B_b(X)}\left\{\int_X h\ d\mu-\log\int_X e^{h}\ d\nu\right\}<br>$$<br>and we need to show<br>$$<br>R(\mu,\nu)=H(\mu\Vert \nu).<br>$$<br>Since the function $g_0\equiv 0$ on $X$ is bounded and continuous, we have<br>$$<br>R(\mu,\nu)\ge\int_X g_0\ d\mu-\log\int_X e^{g_0}\ d\nu=0.<br>$$<br>By Proposition 1, for any $f\in B_b(X)$,<br>$$<br>H(\mu\Vert\nu)\ge -\int_X f\ d\mu-\log \int_X e^{-f}\ d\nu.<br>$$<br>Replacing $f$ by $h:=-f$, and taking the supremum over $h\in B_b(X)$, we obtain<br>$$<br>H(\mu\Vert\nu)\ge\sup_{h\in B_b(X)}\left\{\int_X h\ d\mu-\log\int_X e^{h}\ d\nu\right\}=R(\mu,\nu).<br>$$<br>Now, we only need to show that<br>$$<br>R(\mu,\nu)\ge H(\mu\Vert\nu).<br>$$<br>We may assume that $R(\mu,\nu)&lt;+\infty$, for otherwise there is nothing to prove. </p><p>We claim that under this condition, $\mu\ll\nu$. Indeed, let $A\in \mathcal{B}$ with $\nu(A)=0$ and take $h=r1_A$ with $r&gt;0$. Since $\int_X e^{h}\ d\nu=\nu(A^c)+e^{r}\nu(A)=1$, we have<br>$$<br>r\mu(A)=\int_X h\ d\mu-\log\int_X e^{h}\ d\nu\le R(\mu,\nu)&lt;\infty.<br>$$<br>Letting $r\to\infty$ gives $\mu(A)=0$ as claimed.</p><p>Since $\mu\ll\nu$, the Radon-Nikodym derivative<br>$$<br>f:=\frac{d\mu}{d\nu}<br>$$<br>exists. 
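With $f=\frac{d\mu}{d\nu}$ in hand, the natural candidate for the supremum in the Donsker-Varadhan formula is $h=\log f$, which is exactly the choice exploited in the cases below. On a finite state space, where the boundedness issues disappear, this is immediate to confirm numerically; a small sketch of our own:

```python
import numpy as np

rng = np.random.default_rng(4)
K = 5
nu = rng.random(K) + 0.1; nu /= nu.sum()   # strictly positive reference measure
mu = rng.random(K) + 0.1; mu /= mu.sum()   # strictly positive measure mu << nu

def dv_value(h):
    # Donsker-Varadhan objective: int h dmu - log int e^h dnu
    return float(np.sum(h * mu) - np.log(np.sum(np.exp(h) * nu)))

H = float(np.sum(mu * np.log(mu / nu)))    # relative entropy H(mu || nu)
h_star = np.log(mu / nu)                   # candidate optimizer h = log f

assert abs(dv_value(h_star) - H) < 1e-10   # the candidate attains H(mu || nu)
for _ in range(200):                       # random h never exceed the entropy
    assert dv_value(rng.normal(size=K)) <= H + 1e-10
```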
</p><p>Case 1: $c\le f\le C$ for some constants $0&lt;c\le C&lt;+\infty$, i.e. $f$ is bounded and bounded away from zero. Then $h=\log f$ is bounded and measurable. Hence by<br>$$<br>\log\int_X e^{h}\ d\nu=\log\int_X f\ d\nu=\log\int_X d\mu=0,<br>$$</p><p>we have</p><p>$$<br>H(\mu\Vert \nu)=\int_X \log f\  d\mu= \int_X h\  d\mu- \log\int_X e^{h}\ d\nu \le R(\mu,\nu).<br>$$<br>Case 2: $f\ge c$ for some constant $c&gt;0$, but $f$ is not bounded above. For $n\in \mathbb{N}$ with $n\ge c$, we set $f_n=f\wedge n$ and $h=\log f_n$, which is bounded ($\log c\le h\le \log n$) and measurable. By the Monotone Convergence Theorem,<br>$$<br>H(\mu\Vert \nu)=\int_X \log f\  d\mu=\lim_{n\to\infty}\int_X \log f_n\  d\mu\le R(\mu,\nu)+\lim_{n\to\infty} \log\int_X f_n\  d\nu=R(\mu,\nu),<br>$$<br>where we used $\int_X \log f_n\ d\mu\le R(\mu,\nu)+\log\int_X e^{\log f_n}\ d\nu$ and $\int_X f_n\ d\nu\to \int_X f\ d\nu=1$.</p><p>Case 3: the general case, where $f$ need be neither bounded away from zero nor bounded above. For $t\in [0,1]$, define<br>$$<br>\mu_t:=t\nu+(1-t)\mu\quad\text{and}\quad f_t:= \frac{d\mu_t}{d\nu}=t\cdot 1+(1-t)f.<br>$$<br>For each $t\in (0,1]$, $f_t\ge t&gt;0$ is bounded away from zero, and so by Cases 1 and 2, we have<br>$$<br>H(\mu_t\Vert\nu)\le R(\mu_t,\nu).<br>$$<br>We now prove that<br>$$<br>\lim_{t\to 0} H(\mu_t\Vert\nu)=H(\mu\Vert\nu)\quad \text{and}\quad \lim_{t\to 0}R(\mu_t,\nu)=R(\mu,\nu),<br>$$<br>which will complete the proof.</p><p>Since $s\log s$ is convex on $[0,+\infty)$,<br>$$<br>H(\mu_t\Vert \nu)=\int_X f_t\log f_t\ d\nu\le (1-t) \int_X f\log f\ d\nu=(1-t)H(\mu\Vert \nu).<br>$$<br>Therefore,<br>$$<br>\limsup_{t\to 0} H(\mu_t\Vert \nu)\le H(\mu\Vert \nu).<br>$$</p><p>Moreover, since $f_t\ge t$ we have $\log f_t\ge \log t$, and since $\log s$ is concave,<br>$$<br>\log f_t\ge (1-t)\log f,<br>$$</p><p>which implies that<br>$$<br>H(\mu_t\Vert \nu)=\int_X f_t\log f_t\ d\nu=t\int_X \log f_t \ d\nu+(1-t) \int_X f\log f_t\ d\nu\ge t\log t +(1-t)^2 H(\mu\Vert\nu).<br>$$<br>Therefore,<br>$$<br>\liminf_{t\to 0} H(\mu_t\Vert \nu)\ge H(\mu\Vert \nu).<br>$$</p><p>Thus we have<br>$$<br>\lim_{t\to 0} H(\mu_t\Vert \nu)=H(\mu\Vert \nu).<br>$$<br>For $R(\mu,\nu)$, by Jensen’s inequality, for $g\in C_b(X)$,<br>$$<br>-\log\int_X e^g\  d\nu\le \int_X (-\log e^g)\  d\nu=-\int_X g\ 
d\nu,<br>$$<br>which implies<br>$$<br>0\le R(\nu,\nu)=\sup_{g\in C_b(X)}\left\{\int_X g\ d\nu-\log\int_X e^{g}\ d\nu\right\}\le 0.<br>$$<br>Therefore $R(\nu,\nu)=0$. It is easy to check that the mapping $t\in [0,1]\mapsto R(\mu_t,\nu)$ is convex and lower semi-continuous. Furthermore,<br>$$<br>0\le R(\mu_t,\nu)\le tR(\nu,\nu)+(1-t) R(\mu,\nu)=(1-t) R(\mu,\nu)\le  R(\mu,\nu)&lt;+\infty<br>$$<br> and so it is also bounded. By convex analysis, the mapping $t\in [0,1]\mapsto R(\mu_t,\nu)$ is continuous, and therefore<br>$$<br>\lim_{t\to 0}R(\mu_t,\nu)=R(\mu,\nu).<br>$$<br>This completes the proof. $\square$</p><p><strong>Theorem 2.</strong> The relative entropy $H(\mu\Vert \nu)$ is a convex, lower semi-continuous function of $(\mu,\nu)\in \mathcal{P}(X)\times \mathcal{P}(X)$. In particular, $H(\mu\Vert \nu)$ is a convex, lower semi-continuous function of each $\mu$ and $\nu$ separately. In addition, for fixed $\nu\in \mathcal{P}(X)$, the relative entropy $H(\cdot\Vert\nu)$ is strictly convex on the set<br>$$<br>\{\mu\in \mathcal{P}(X): H(\mu\Vert\nu)&lt;+\infty\}.<br>$$</p><p><strong>Proof.</strong> By the Donsker-Varadhan variational formula, we have<br>$$<br>H(\mu\Vert\nu)=\sup_{g\in C_b(X)}\left\{\int_X g\ d\mu-\log\int_X e^{g}\ d\nu\right\}.<br>$$<br>For each fixed $g\in C_b(X)$, the mapping<br>$$<br>(\mu,\nu)\in \mathcal{P}(X)\times \mathcal{P}(X)\mapsto \int_X g\ d\mu-\log\int_X e^{g}\ d\nu<br>$$<br>is convex and continuous. As the supremum of such mappings, $H(\mu\Vert \nu)$ is a convex, lower semi-continuous function of $(\mu,\nu)\in \mathcal{P}(X)\times \mathcal{P}(X)$.</p><p>To prove the strict convexity, note that on the set<br>$$<br>\{\mu\in \mathcal{P}(X): H(\mu\Vert\nu)&lt;+\infty\},<br>$$<br>we have<br>$$<br>H(\mu\Vert\nu)=\int_X \frac{d\mu}{d\nu}\log \left(\frac{d\mu}{d\nu}\right) d\nu,<br>$$<br>and the strict convexity follows from the strict convexity of $s\log s$ for $s\in [0,+\infty)$. This completes the proof.
$\square$</p><p><strong>Remark.</strong> The relative entropy is often not continuous even with respect to the strong convergence of probability measures.</p><p><strong>Counterexample.</strong> Let<br>$$<br>\mu_n:=\frac{1}{a_n}\operatorname{Unif}[1,n]+\left(1-\frac{1}{a_n}\right)\operatorname{Unif}[-2,-1],<br>$$<br>where $a_n=\log\log n$ (defined for $n\ge 16$, so that $a_n\ge 1$) and<br>$$<br>\mu:=\operatorname{Unif}[-2,-1].<br>$$<br>Then it is easy to show that<br>$$<br>\operatorname{TV}(\mu_n,\mu)=\frac{1}{a_n}\to 0.<br>$$<br>However, if $\gamma$ is the standard Gaussian measure on $\mathbb{R}$, then it is easy to show that<br>$$<br>H(\mu_n\Vert \gamma)\to +\infty\neq H(\mu\Vert \gamma)\quad \text{as } n\to\infty.<br>$$</p><p><strong>Theorem 3.</strong> For each $\nu\in \mathcal{P}(X)$, the relative entropy has compact sublevel sets. That is, for each $M&lt;+\infty$ the sublevel set<br>$$<br>\{\mu\in \mathcal{P}(X): H(\mu\Vert\nu)\le M\}<br>$$<br>is a compact subset of $\mathcal{P}(X)$.</p><p><strong>Proof.</strong> Let $\{\mu_n,\ n\in \mathbb{N}\}$ be any sequence in the sublevel set<br>$$<br>\{\mu\in \mathcal{P}(X): H(\mu\Vert\nu)\le M\},<br>$$<br>which implies<br>$$<br>\sup_{n\in \mathbb{N}} H(\mu_n\Vert \nu)\le M&lt;+\infty.<br>$$<br>By the Donsker-Varadhan variational formula, for any $h\in B_b(X)$, we have for each $n\in \mathbb{N}$<br>$$<br>\int_X h\ d\mu_n-\log\int_X e^h\ d\nu\le H(\mu_n\Vert \nu)\le M.<br>$$</p><p>Let $\delta&gt;0$ and $\varepsilon&gt;0$ be given. 
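Solving the inequality obtained at the end of this argument for $\delta$ makes explicit how small $\delta$ must be in terms of $\varepsilon$ and $M$: it suffices to take any $\delta$ with<br>$$<br>\log\left(1+\frac{1}{\delta}\right)\ge \frac{M+\log 2}{\varepsilon},\quad\text{that is,}\quad \delta\le \frac{1}{e^{(M+\log 2)/\varepsilon}-1}.<br>$$<br>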
Since $\nu$ is a Borel probability measure on the Polish space $X$, by Ulam’s Theorem, $\nu$ is tight, which means there exists a compact set $K$ such that<br>$$<br>\nu(K^c)\le \delta.<br>$$<br>Take<br>$$<br>h(x)=\begin{cases}<br>0,&amp; x\in K\\\<br>\log (1+\frac{1}{\delta}),&amp; x\in K^c<br>\end{cases},<br>$$<br>which is bounded and Borel measurable. Then for each $n\in \mathbb{N}$,<br>$$<br>\int_X h\ d\mu_n-\log\int_X e^h\ d\nu=\log (1+\frac{1}{\delta})\ \mu_n(K^c)-\log \left[\nu(K)+(1+\frac{1}{\delta})\nu(K^c)\right] \le M.<br>$$<br>It follows that<br>$$<br>\mu_n(K^c)\le \frac{1}{\log (1+\frac{1}{\delta})} \left(M+\log \left[ \nu(K)+(1+\frac{1}{\delta}) \nu(K^c)\right]\right)\le \frac{M+\log 2}{\log (1+\frac{1}{\delta})}.<br>$$<br>Hence we can choose $\delta&gt;0$ such that<br>$$<br>\frac{M+\log 2}{\log (1+\frac{1}{\delta})}\le \varepsilon,<br>$$</p><p>so that $\mu_n(K^c)\le \varepsilon$ for all $n\in \mathbb{N}$. Since $\varepsilon&gt;0$ was arbitrary, this implies that $\{\mu_n,\ n\in \mathbb{N}\}$ is tight. By Prohorov’s Theorem, there exists $\mu\in \mathcal{P}(X)$ and a subsequence $\mu_{n_k}$ such that $\mu_{n_k}$ converges weakly to $\mu$. The lower semi-continuity of $H(\cdot\Vert\nu)$ yields<br>$$<br>H(\mu\Vert\nu)\le\liminf_{k\to\infty} H(\mu_{n_k}\Vert\nu)\le M,<br>$$<br>which means $\mu$ also lies in the sublevel set. This completes the proof. $\square$ </p><blockquote><p>The cover image of this article was taken while taking a helicopter sightseeing flight over Aoraki / Mount Cook in New Zealand.</p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/03/17/Properties-of-the-Relative-Entropy/</id>
    <link href="https://handsteinwang.github.io/2026/03/17/Properties-of-the-Relative-Entropy/"/>
    <published>2026-03-16T16:00:00.000Z</published>
    <summary>
<![CDATA[<p>In this article, we will introduce some important properties of the relative entropy, including lower semi-continuity, convexity, and compactness of sublevel sets, based on the Donsker-Varadhan variational formula.</p>]]>
    </summary>
    <title>Properties of the Relative Entropy</title>
    <updated>2026-03-17T16:00:00.000Z</updated>
  </entry>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Investment and Trends" scheme="https://handsteinwang.github.io/categories/Investment-and-Trends/"/>
    <category term="Trends" scheme="https://handsteinwang.github.io/tags/Trends/"/>
    <category term="Finance" scheme="https://handsteinwang.github.io/tags/Finance/"/>
    <content>
<![CDATA[<p>Whether in the past, the present, or the future, cash payments remain extremely important in every country and every place, even in today’s society where digital payment is widespread. The privacy, security, and independence from network infrastructure that cash offers are irreplaceable. Cash not only helps diversify risk, but can also serve as a way to curb spending.</p><span id="more"></span><p>First, cash offers a high level of privacy and security, something online payment simply cannot match. In today’s age of big data, digital payment records can easily be collected, aggregated, and potentially exploited by malicious actors. This poses a serious threat not only to personal privacy, but even to personal safety.</p><p>What’s more, cash does not depend on the internet. Since ancient times, whether people relied on barter, gold, silver, copper coins, or cash, cash-based exchange has never depended on external non-personal factors, especially electricity and network access. In such an unstable era, war could break out around us at any time. If that were to happen, large-scale network failures would be highly likely, making digital payment unusable. This concern is particularly relevant in China, where Starlink has not been introduced and communication base stations are all land-based. Under such circumstances, relying solely on online payment would create enormous inconvenience. This is especially true given China’s large population: once an emergency occurs, if everyone rushes to withdraw cash at the same time, the resulting difficulties would be immense.</p><p>Moreover, even in the absence of war, natural disasters are unavoidable. For example, severe flooding caused by torrential rain may destroy communication base stations, leading to network outages and making online payment impossible. In addition, many cities also face unavoidable earthquake risks. 
For example, <a href="https://www.bjdzj.gov.cn/bjsdzj/index/tzgg/2025101016245227407/2025101016240911204.pdf">Beijing is located at the intersection of the North China Plain fault zone, the Shanxi rift basin zone, and the Zhangjiakou–Bohai seismic tectonic belt. It is also one of the very few capital megacities in the world to have experienced earthquakes above magnitude 8 and to have a basic seismic intensity as high as VIII. Historically, Beijing has suffered multiple destructive earthquakes, and the threat that future earthquakes pose to its sustainable development should not be underestimated.</a> When disasters strike, networks can easily fail, which poses a major challenge to online payment. Even in everyday situations, if your phone runs out of battery, is lost, or is damaged while you are out, and you are not carrying any cash, the payment risk can become substantial. Therefore, the importance of carrying cash, keeping cash at home, and maintaining the habit of paying with cash is beyond question.</p><p>In addition, with telecom and online fraud now so rampant, banks have adopted increasingly strict risk-control measures, and it is not uncommon for bank accounts to be suddenly flagged or restricted. For this reason, aside from using multiple banks to diversify risk, cash payment is also a sensible option.</p><p>Finally, paying with cash can also have the unexpected benefit of helping one save money. Based on my own experience over the past few weeks, this effect does not come from the commonly repeated claim that cash makes people “feel” their spending more vividly. Rather, it comes from the fact that cash payments are more cumbersome than digital ones, which makes me pause before each purchase and think about whether the expense is truly necessary. In this way, cash can significantly curb impulse spending and ultimately help reduce unnecessary expenses.</p><p>I am also grateful for the fact that under Chinese law, refusing to accept cash is illegal. 
This made my experiment with cash payments over the past few weeks go very smoothly. Almost every merchant had enough small change on hand, and I did not encounter any case in which cash was refused or change could not be given. In conclusion, I hope everyone can remain prepared even in peaceful times, carry cash when going out, keep cash at home, and maintain the habit of paying with cash.</p><blockquote><p>The cover image of this article was taken on a sightseeing cruise in Lucerne, Switzerland.</p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/03/15/The-Importance-of-Using-Cash-in-China/</id>
    <link href="https://handsteinwang.github.io/2026/03/15/The-Importance-of-Using-Cash-in-China/"/>
    <published>2026-03-14T16:00:00.000Z</published>
    <summary>
      <![CDATA[<p>Whether in the past, the present, or the future, cash payments remain extremely important in every country and every place, even in today’s society where digital payment is widespread. The privacy, security, and independence from network infrastructure that cash offers are irreplaceable. Cash not only helps diversify risk, but can also serve as a way to curb spending.</p>]]>
    </summary>
    <title>The Importance of Using Cash in China</title>
    <updated>2026-03-14T16:00:00.000Z</updated>
  </entry>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Probability Theory" scheme="https://handsteinwang.github.io/categories/Probability-Theory/"/>
    <category term="Probability Theory" scheme="https://handsteinwang.github.io/tags/Probability-Theory/"/>
    <category term="Analysis" scheme="https://handsteinwang.github.io/tags/Analysis/"/>
    <category term="Measure Theory" scheme="https://handsteinwang.github.io/tags/Measure-Theory/"/>
    <content>
<![CDATA[<p><strong>Theorem 1 (Lusin’s theorem).</strong> Suppose that $X$ and $Y$ are Polish spaces, that $\mu$ is a finite Borel measure on $X$, that $f:X\to Y$ is Borel measurable, and that $\varepsilon&gt;0$. Then there exists a compact subset $K$ of $X$, with<br>$$<br>\mu(X\setminus K)&lt;\varepsilon,<br>$$<br>such that the restriction of $f$ to $K$ is continuous.</p><span id="more"></span>    <p><strong><em>Proof</em></strong> : Let $d$ be a metric on $Y$ which defines the topology of $Y$. Since $Y$ is Polish, in particular it is separable, so there exists a dense sequence $(y_n)_{n=1}^\infty$ in $Y$.</p><p>Fix $j\in \mathbb{N}$. For each $n\in \mathbb{N}$, define<br>$$<br>A_{n,j}=\{x\in X:d(f(x),y_n)&lt;1/j\},<br>$$<br>which is a Borel subset of $X$.</p><p>Now define<br>$$<br>B_{n,j}=A_{n,j}\setminus \bigcup_{m=1}^{n-1}A_{m,j},<br>$$<br>and<br>$$<br>C_{n,j}=\bigcup_{m=1}^n A_{m,j}=\bigcup_{m=1}^n B_{m,j}.<br>$$</p><p>Then $(C_{n,j})_{n=1}^\infty$ is an increasing sequence of Borel subsets of $X$ with</p><p>$$<br>\bigcup_{n=1}^\infty C_{n,j}=X.<br>$$</p><p>Since $\mu$ is finite, continuity from below tells us that there exists $N_j\in \mathbb{N}$ such that<br>$$<br>\mu(X\setminus C_{N_j,j})&lt;\frac{\varepsilon}{2^{j+1}}.<br>$$</p><p>Now fix $1\le n\le N_j$. Since $B_{n,j}$ is Borel in the Polish space $X$ and $\mu$ is a finite Borel measure on $X$, the measure $\mu$ is tight by Ulam’s Theorem, hence there is a compact set $K_{n,j}\subseteq B_{n,j}$ such that<br>$$<br>\mu(B_{n,j}\setminus K_{n,j})&lt;\frac{\varepsilon}{2^{j+1}N_j}.<br>$$</p><p>Let<br>$$<br>K_j=\bigcup_{n=1}^{N_j}K_{n,j}.<br>$$<br>Since this is a finite union of compact sets, $K_j$ is compact.</p><p>We next estimate $\mu(X\setminus K_j)$. 
Since</p><p>$$<br>C_{N_j,j}\setminus K_j<br>=\bigcup_{n=1}^{N_j}(B_{n,j}\setminus K_{n,j}),<br>$$</p><p>we have</p><p>$$<br>\mu(C_{N_j,j}\setminus K_j)<br>\le \sum_{n=1}^{N_j}\mu(B_{n,j}\setminus K_{n,j})<br>&lt; \sum_{n=1}^{N_j}\frac{\varepsilon}{2^{j+1}N_j}<br>=\frac{\varepsilon}{2^{j+1}}.<br>$$<br>Moreover, since<br>$$<br>X\setminus K_j<br>=(X\setminus C_{N_j,j})\cup (C_{N_j,j}\setminus K_j),<br>$$<br>we have<br>$$<br>\mu(X\setminus K_j)<br>\le \mu(X\setminus C_{N_j,j})+\mu(C_{N_j,j}\setminus K_j)<br>&lt;\frac{\varepsilon}{2^{j+1}}+\frac{\varepsilon}{2^{j+1}}<br>=\frac{\varepsilon}{2^j}.<br>$$</p><p>Now define a function $f_j:K_j\to Y$ by setting<br>$$<br>f_j(x)=y_n \qquad \text{for } x\in K_{n,j},\ 1\le n\le N_j.<br>$$<br>This is well-defined because the sets $K_{n,j}$ are pairwise disjoint.</p><p>We claim that $f_j$ is continuous on $K_j$. Indeed, for each $n$, the set $K_{n,j}$ is compact in the metric space $X$, hence closed in $X$, and therefore also closed in the subspace $K_j$. Since there are only finitely many such sets and they are pairwise disjoint, each point $x\in K_j$ lies in exactly one $K_{n,j}$. Let $x\in K_{n,j}$. Because<br>$$<br>K_j\setminus K_{n,j}=\bigcup_{\substack{1\le m\le N_j\\ m\ne n}}K_{m,j},<br>$$<br>which is a finite union of closed sets in $K_j$, it is closed in $K_j$. Hence $K_{n,j}$ is open in $K_j$ as well. So each $K_{n,j}$ is clopen in the subspace $K_j$. Now let $U\subseteq Y$ be open. Then</p><p>$$<br>f_j^{-1}(U)=\bigcup_{\{n:y_n\in U\}}K_{n,j},<br>$$<br>which is open in $K_j$ because each $K_{n,j}$ is open in $K_j$. Thus $f_j$ is continuous.</p><p>Next we show that $f_j$ uniformly approximates $f$ on $K_j$. 
If $x\in K_{n,j}$, then $K_{n,j}\subseteq B_{n,j}\subseteq A_{n,j}$, so by the definition of $A_{n,j}$,<br>$$<br>d(f(x),y_n)&lt;\frac{1}{j}.<br>$$<br>Since $f_j(x)=y_n$, it follows that<br>$$<br>d(f_j(x),f(x))&lt;\frac{1}{j}<br>\qquad \text{for all } x\in K_j.<br>$$</p><p>Finally, define<br>$$<br>K=\bigcap_{j=1}^\infty K_j.<br>$$<br>Since $K\subseteq K_1$ and $K_1$ is compact, and since $K$ is closed in $K_1$, it follows that $K$ is compact. Also,</p><p>$$<br>X\setminus K<br>= X\setminus \bigcap_{j=1}^\infty K_j<br>= \bigcup_{j=1}^\infty (X\setminus K_j).<br>$$<br>Therefore,<br>$$<br>\mu(X\setminus K)<br>\le \sum_{j=1}^\infty \mu(X\setminus K_j)<br>&lt; \sum_{j=1}^\infty \frac{\varepsilon}{2^j}<br>=\varepsilon.<br>$$</p><p>For each $j$, since $K\subseteq K_j$, the restriction $f_j|_K$ is continuous on $K$. Moreover, for every $x\in K$, we have $x\in K_j$, so</p><p>$$<br>d(f_j(x),f(x))&lt;\frac{1}{j}.<br>$$</p><p>Hence</p><p>$$<br>\sup_{x\in K} d(f_j(x),f(x))\le \frac{1}{j}\to 0,<br>$$</p><p>so $f_j|_K\to f|_K$ uniformly on $K$. Because each $f_j|_K$ is continuous and the uniform limit of continuous functions is continuous, it follows that $f|_K$ is continuous. This completes the proof. $\square$</p><p><strong>Corollary 1.</strong> Suppose that $f$ is a non-negative real-valued Borel measurable function on $X$. Then<br>$$<br>\int_X f\ d\mu<br>=<br>\sup\left\{<br>\int_K f\ d\mu<br>:<br>K \text{ compact},\ K\subseteq X,\ f|_K \text{ continuous}<br>\right\}.<br>$$</p><p><strong><em>Proof</em></strong> : By Lusin’s theorem, for each $j\in \mathbb{N}$ there exists a compact set $L_j\subseteq X$ such that<br>$$<br>\mu(X\setminus L_j)&lt;\frac{1}{2^j}<br>$$<br>and such that $f|_{L_j}$ is continuous. Define</p><p>$$<br>K_j=\bigcap_{m=j}^\infty L_m.<br>$$<br>Then each $K_j$ is a closed subset of the compact set $L_j$, and hence compact. Moreover, the sequence $(K_j)_{j=1}^\infty$ is increasing, since $K_j=\bigcap_{m=j}^\infty L_m\subseteq \bigcap_{m=j+1}^\infty L_m=K_{j+1}$. 
</p><p>Since<br>$$<br>X\setminus K_j<br>=<br>X\setminus \bigcap_{m=j}^\infty L_m<br>=<br>\bigcup_{m=j}^\infty (X\setminus L_m),<br>$$<br>we have</p><p>$$<br>\mu(X\setminus K_j)<br>\le \sum_{m=j}^\infty \mu(X\setminus L_m)\le \sum_{m=j}^\infty \frac{1}{2^m}\to 0\quad \text{as } j\to\infty.<br>$$</p><p>Also, since $K_j\subseteq L_j$, the restriction $f|_{K_j}$ is continuous for every $j$.</p><p>Thus we have constructed an increasing sequence $(K_j)_{j=1}^\infty$ of compact subsets of $X$ such that</p><p>$$<br>\mu(X\setminus K_j)\to 0<br>$$</p><p>and such that $f|_{K_j}$ is continuous for every $j$.</p><p>Now define<br>$$<br>f_j=f\ \mathbf{1}_{K_j}.<br>$$<br>Then $f_j\ge 0$ and, since the sets $K_j$ increase and $\mu\left(X\setminus \bigcup_{j=1}^\infty K_j\right)=0$, we have<br>$$<br>f_j(x)\uparrow f(x)<br>\qquad \text{for } \mu\text{-a.e. } x.<br>$$</p><p>By the monotone convergence theorem,</p><p>$$<br>\int_X f\ d\mu<br>=<br>\lim_{j\to\infty}\int_X f_j\ d\mu=\lim_{j\to\infty}<br>\int_X f\ \mathbf 1_{K_j}\ d\mu<br>=\lim_{j\to\infty}<br>\int_{K_j} f\ d\mu.<br>$$</p><p>Now let<br>$$<br>S=<br>\left\{ \int_K f\ d\mu : K \text{ compact},\ K\subseteq X,\ f|_K \text{ continuous} \right\}.<br>$$</p><p>For each $j$, the set $K_j$ is compact and $f|_{K_j}$ is continuous, so</p><p>$$<br>\int_{K_j} f\ d\mu\in S.<br>$$</p><p>Hence</p><p>$$<br>\int_{K_j} f\ d\mu\le \sup S<br>\qquad \text{for every } j.<br>$$</p><p>Letting $j\to\infty$, we obtain<br>$$<br>\int_X f\ d\mu\le \sup S.<br>$$</p><p>On the other hand, if $K\subseteq X$ is compact and $f|_K$ is continuous, then since $f\ge 0$,<br>$$<br>\int_K f\ d\mu\le \int_X f\ d\mu.<br>$$<br>Therefore every element of $S$ is bounded above by $\int_X f\ d\mu$, and so<br>$$<br>\sup S\le \int_X f\ d\mu.<br>$$</p><p>Combining the two inequalities, we conclude that<br>$$<br>\int_X f\ d\mu<br>=<br>\sup\left\{<br>\int_K f\ d\mu<br>:<br>K \text{ compact},\ K\subseteq X,\ f|_K \text{ continuous}<br>\right\}.<br>$$<br>This completes the proof. 
$\square$</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><p>Chapter 16 in the following reference:</p><p>Garling, David JH. <em>Analysis on Polish spaces and an introduction to optimal transportation</em>. Vol. 89. Cambridge University Press, 2018.</p><blockquote><p>The cover image of this article was taken on Rottnest Island in Western Australia, Australia.</p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/03/10/Lusin%E2%80%98s-Theorem/</id>
    <link href="https://handsteinwang.github.io/2026/03/10/Lusin%E2%80%98s-Theorem/"/>
    <published>2026-03-09T16:00:00.000Z</published>
    <summary>
      <![CDATA[<p><strong>Theorem 1 (Lusin’s theorem).</strong> Suppose that $X$ and $Y$ are Polish spaces, that $\mu$ is a finite Borel measure on $X$, that $f:X\to Y$ is Borel measurable, and that $\varepsilon&gt;0$. Then there exists a compact subset $K$ of $X$, with<br>$$<br>\mu(X\setminus K)&lt;\varepsilon,<br>$$<br>such that the restriction of $f$ to $K$ is continuous.</p>]]>
    </summary>
    <title>Lusin’s Theorem</title>
    <updated>2026-03-09T16:00:00.000Z</updated>
  </entry>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Probability Theory" scheme="https://handsteinwang.github.io/categories/Probability-Theory/"/>
    <category term="Probability Theory" scheme="https://handsteinwang.github.io/tags/Probability-Theory/"/>
    <category term="Analysis" scheme="https://handsteinwang.github.io/tags/Analysis/"/>
    <category term="Measure Theory" scheme="https://handsteinwang.github.io/tags/Measure-Theory/"/>
    <content>
<![CDATA[<p><strong>Definition 1.</strong> Suppose that $f$ is a function on the Borel subsets of a metric space $X$ taking values in $[0,\infty]$. We say $f$ is <em>locally finite</em> if for each $x \in X$ there exists a neighborhood $N$ of $x$ with $f(N) &lt; \infty$.</p><span id="more"></span>    <p><strong>Proposition 1.</strong> If $f:\mathcal{B}(X)\to [0,\infty]$ is a locally finite additive function on a metrizable space $(X,\tau)$ and $K$ is a compact subset of $X$ then $f(K) &lt; \infty$.</p><p><strong><em>Proof</em></strong>: For each $x \in K$ there exists a neighborhood $N_x$ of $x$ with $f(N_x) &lt; \infty$. The sets $\{N_x:x\in K\}$ cover $K$, and so there is a finite subcover. Additivity then ensures that $f(K) &lt; \infty$. $\square$</p><p><strong>Definition 2.</strong> A <em>Radon measure</em> $\mu$ on a metrizable space $(X,\tau)$ is a tight additive function from $\mathcal{B}(X)$ to $[0,\infty]$ which is locally finite. </p><p><strong>Remark:</strong> Since a non-negative additive tight function on the Borel sets of a metrizable space $(X,\tau)$ is also $\sigma$-additive, a Radon measure $\mu$ is indeed a measure, and, by Proposition 1, $\mu(K) &lt; \infty$ if $K$ is compact.</p><p><strong>Proposition 2.</strong> If $\mu$ is a Radon measure on a separable metrizable space, then $\mu$ is $\sigma$-finite; there exists a countable set $\mathcal{W}$ of open sets for which<br>$$<br>X=\bigcup_{W\in\mathcal{W}}W<br>$$<br>and $\mu(W)&lt;\infty$, for all $W\in\mathcal{W}$.</p><p><strong><em>Proof</em></strong>: Let $d$ be a metric on $X$ which defines the topology $\tau$. For each $n \in \mathbb{N}$, let</p><p>$$<br>U_n=\{x\in X:\text{there exists }r_x&gt;1/n\text{ such that }\mu(N_{r_x}(x))&lt;\infty\}.<br>$$</p><p>First, we show that $U_n$ is open for each $n\in \mathbb{N}$. 
Let $x\in U_n$ and let $y\in X$ satisfy<br>$$<br>d(x,y)&lt; r_x-\frac{1}{n}.<br>$$<br>Set<br>$$<br>s_y=r_x-d(x,y)&gt;r_x-(r_x-\frac{1}{n})=\frac{1}{n}.<br>$$<br>Now we claim that</p><p>$$<br>N_{s_y}(y)\subseteq N_{r_x}(x).<br>$$</p><p>Indeed, for all $z\in N_{s_y}(y)$, we have<br>$$<br>d(z,x)\le d(z,y)+d(y,x)&lt; s_y+d(y,x)=r_x,<br>$$<br>which means $z\in N_{r_x}(x)$. Therefore, $N_{s_y}(y)\subseteq N_{r_x}(x)$, so $\mu(N_{s_y}(y))\le \mu(N_{r_x}(x))&lt;\infty$ with $s_y&gt;1/n$; that is, $y\in U_n$. Hence $U_n$ is open. </p><p>Let $C_n$ be a countable dense subset of $U_n$, and let</p><p>$$<br>W_n(c)=N_{1/n}(c)\quad \text{ for } c\in C_n.<br>$$</p><p>Then $\mu(W_n(c))&lt;\infty$ for each $c\in C_n$: since $c\in U_n$, there exists $r_c&gt;1/n$ with $\mu(N_{r_c}(c))&lt;\infty$, and $W_n(c)=N_{1/n}(c)\subseteq N_{r_c}(c)$. </p><p>If $x\in U_n$ then there exists $c\in C_n$ with $d(x,c)&lt;1/n$, so that</p><p>$$<br>U_n\subseteq \bigcup_{c\in C_n}W_n(c).<br>$$</p><p>Since $\mu$ is locally finite, every point of $X$ lies in some $U_n$, so $X=\bigcup_{n=1}^\infty U_n$. Let $\mathcal{W}=\{W_n(c):n\in \mathbb{N},\ c\in C_n\}$, then</p><p>$$<br>X=\bigcup_{W\in\mathcal{W}}W.\quad \square<br>$$</p><p><strong>Remark:</strong> Suppose that $X$ and $Y$ are metric spaces, that $f:X\to Y$ is continuous and that $\mu$ is a Radon measure on $X$. Then the push-forward measure $f_\star(\mu)$ need <strong>not</strong> be a Radon measure on $Y$.</p><p><strong>Counterexample</strong>: Let $\mu$ be counting measure on $\mathbb{N}$, and let $f:\mathbb{N}\to [0,\infty]$ be the inclusion mapping. 
Then $f_*(\mu)$ is not locally finite at $\infty$.</p><p><strong>Theorem 1.</strong> Let $(X,\tau)$ be a Polish space, and let $(U_i)_{i=1}^\infty$ be a sequence of open subsets of $X$ such that</p><p>$$<br>X=\bigcup_{i=1}^\infty U_i.<br>$$</p><p>Suppose that for each $i$, $\mu_i$ is a finite measure on the Borel sets of $U_i$ and that these measures are compatible: if $A$ is a Borel set of $U_i\cap U_j$, then</p><p>$$<br>\mu_i(A)=\mu_j(A).<br>$$</p><p>Then there exists a unique Radon measure $\pi$ on $X$ for which</p><p>$$<br>\pi(A)=\mu_i(A)<br>$$</p><p>for each Borel set $A$ in $U_i$, for each $i\in \mathbb{N}$.</p><p><strong><em>Proof</em></strong>: Let</p><p>$$<br>V_j=\bigcup_{i=1}^j U_i.<br>$$</p><p>The compatibility condition ensures that we can define a finite positive Borel measure $\nu_j$ on $V_j$ such that</p><p>$$<br>\nu_j(A)=\mu_i(A)<br>$$</p><p>if $1\le i\le j$ and $A\subseteq U_i$ is Borel.</p><p>Further, if $A$ is a Borel subset of $V_j$ and $j\le k$, then</p><p>$$<br>\nu_j(A)=\nu_k(A).<br>$$</p><p>If $A$ is a Borel subset of $X$, then $(\nu_j(A\cap V_j))_{j=1}^\infty$ is an increasing sequence. Indeed, if $j\le k$, then</p><p>$$<br>A\cap V_j\subseteq A\cap V_k,<br>$$</p><p>and since $\nu_k$ extends $\nu_j$ on $V_j$,</p><p>$$<br>\nu_j(A\cap V_j)=\nu_k(A\cap V_j)\le \nu_k(A\cap V_k).<br>$$</p><p>Let</p><p>$$<br>\pi(A)=\lim_{j\to\infty}\nu_j(A\cap V_j).<br>$$</p><p>We now verify that $\pi$ is tight, locally finite and additive.</p><p>First, $\pi$ is locally finite. Let $x\in X$. Since $\bigcup_{i=1}^\infty U_i=X$, there exists $i$ such that $x\in U_i$. 
Because $U_i\subseteq V_j$ for every $j\ge i$, if $A\subseteq U_i$ is Borel then</p><p>$$<br>\nu_j(A)=\mu_i(A), \qquad j\ge i.<br>$$</p><p>Hence</p><p>$$<br>\pi(A)=\lim_{j\to\infty}\nu_j(A\cap V_j)=\mu_i(A).<br>$$</p><p>Taking $A=U_i$, we obtain</p><p>$$<br>\pi(U_i)=\mu_i(U_i)&lt;\infty.<br>$$</p><p>Thus $x$ has an open neighbourhood of finite $\pi$-measure, and $\pi$ is locally finite.</p><p>Next, $\pi$ is additive. Let $A,B\subseteq X$ be disjoint Borel sets. Then for each $j$,</p><p>$$<br>(A\cup B)\cap V_j=(A\cap V_j)\cup (B\cap V_j),<br>$$</p><p>and the two sets on the right-hand side are disjoint. Since $\nu_j$ is a measure,</p><p>$$<br>\nu_j((A\cup B)\cap V_j)=\nu_j(A\cap V_j)+\nu_j(B\cap V_j).<br>$$</p><p>Passing to the limit gives</p><p>$$<br>\pi(A\cup B)=\pi(A)+\pi(B).<br>$$</p><p>So $\pi$ is additive.</p><p>Finally, $\pi$ is tight. Let $A$ be a Borel subset of $X$ with $\pi(A)&lt;\infty$, and let $\varepsilon&gt;0$. By definition of $\pi(A)$, there exists $j$ such that</p><p>$$<br>\pi(A)-\nu_j(A\cap V_j)&lt;\varepsilon/2.<br>$$</p><p>Since $V_j$ is an open subspace of the Polish space $X$, it is itself Polish. The measure $\nu_j$ is a finite Borel measure on the Polish space $V_j$, hence it is a Radon measure on $V_j$. Therefore there exists a compact set $K\subseteq A\cap V_j$ such that</p><p>$$<br>\nu_j((A\cap V_j)\setminus K)&lt;\varepsilon/2.<br>$$</p><p>Because $K\subseteq V_j$, we have</p><p>$$<br>\pi(K)=\nu_j(K).<br>$$</p><p>Also,</p><p>$$<br>A\setminus K=((A\cap V_j)\setminus K)\cup (A\setminus V_j),<br>$$</p><p>and these two sets are disjoint. 
Hence, by additivity,</p><p>$$<br>\pi(A\setminus K)=\pi((A\cap V_j)\setminus K)+\pi(A\setminus V_j).<br>$$</p><p>Now</p><p>$$<br>\pi((A\cap V_j)\setminus K)=\nu_j((A\cap V_j)\setminus K)&lt;\varepsilon/2,<br>$$</p><p>and</p><p>$$<br>\pi(A\setminus V_j)=\pi(A)-\pi(A\cap V_j)=\pi(A)-\nu_j(A\cap V_j)&lt;\varepsilon/2.<br>$$</p><p>Therefore</p><p>$$<br>\pi(A\setminus K)&lt;\varepsilon.<br>$$</p><p>So $\pi$ is tight.</p><p>Thus $\pi$ is a tight, locally finite additive function on the Borel subsets of the metrizable space $X$. Hence $\pi$ is a Radon measure.</p><p>Moreover, as already shown above, if $A$ is a Borel subset of $U_i$, then for every $j\ge i$,</p><p>$$<br>\nu_j(A)=\mu_i(A),<br>$$</p><p>so</p><p>$$<br>\pi(A)=\mu_i(A).<br>$$</p><p>Thus $\pi$ has the required restriction property.</p><p>It remains to prove uniqueness. Let $\widetilde{\pi}$ be another Radon measure on $X$ such that</p><p>$$<br>\widetilde{\pi}(A)=\mu_i(A)<br>$$</p><p>for every Borel set $A\subseteq U_i$ and every $i$. Then for each $j$, the restrictions of $\widetilde{\pi}$ and $\nu_j$ to every $U_i$, $1\le i\le j$, coincide. Since $V_j=\bigcup_{i=1}^j U_i$, it follows that</p><p>$$<br>\widetilde{\pi}(B)=\nu_j(B)<br>$$</p><p>for every Borel set $B\subseteq V_j$.</p><p>Now let $A\subseteq X$ be Borel. Since</p><p>$$<br>A\cap V_1\subseteq A\cap V_2\subseteq \cdots<br>\quad\text{and}\quad<br>\bigcup_{j=1}^\infty (A\cap V_j)=A,<br>$$</p><p>and since $\widetilde{\pi}$ is a measure, continuity from below gives</p><p>$$<br>\widetilde{\pi}(A)=\lim_{j\to\infty}\widetilde{\pi}(A\cap V_j).<br>$$</p><p>But for each $j$,</p><p>$$<br>\widetilde{\pi}(A\cap V_j)=\nu_j(A\cap V_j).<br>$$</p><p>Therefore</p><p>$$<br>\widetilde{\pi}(A)=\lim_{j\to\infty}\nu_j(A\cap V_j)=\pi(A).<br>$$</p><p>So $\widetilde{\pi}=\pi$, and the proof is complete. $\square$</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><p>Chapter 16 in the following reference:</p><p>Garling, David JH. 
<em>Analysis on Polish spaces and an introduction to optimal transportation</em>. Vol. 89. Cambridge University Press, 2018.</p><blockquote><p><em>The cover image of this article was taken in Lucerne, Switzerland.</em></p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/03/08/Radon-Measure/</id>
    <link href="https://handsteinwang.github.io/2026/03/08/Radon-Measure/"/>
    <published>2026-03-07T16:00:00.000Z</published>
    <summary>
      <![CDATA[<p><strong>Definition 1.</strong> Suppose that $f$ is a function on the Borel subsets of a metric space $X$ taking values in $[0,\infty]$. We say $f$ is <em>locally finite</em> if for each $x \in X$ there exists a neighborhood $N$ of $x$ with $f(N) &lt; \infty$.</p>]]>
    </summary>
    <title>Radon Measure</title>
    <updated>2026-03-07T16:00:00.000Z</updated>
  </entry>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Lie Algebra" scheme="https://handsteinwang.github.io/categories/Lie-Algebra/"/>
    <category term="Lie Algebra" scheme="https://handsteinwang.github.io/tags/Lie-Algebra/"/>
    <content>
<![CDATA[<p><strong>Definition 1.</strong> A Lie algebra $L$ is abelian if $[L,L]=0$, which means for all $x,y\in L,\ [x,y]=0$.</p><span id="more"></span>    <p><strong>Notation:</strong><br>$$<br>L^1=L,\qquad L^{n+1}=[L^n,L]\qquad (n\geq 1)<br>$$</p><p><strong>Lemma 1.</strong> If $I,J$ are ideals of $L$, so is $[I,J]$.</p><p><strong><em>Proof</em></strong>: Let $x\in I,\ y\in J,\ z\in L$, then<br>$$<br>[[x,y],z]=[x,[y,z]]-[y,[x,z]]\in [I,J].\quad \square<br>$$</p><p><strong>Proposition 1.</strong> (i) $L^n$ is an ideal of $L$.</p><p>(ii) Also<br>$$<br>L=L^1\supseteq L^2\supseteq L^3\supseteq \cdots<br>$$</p><p><strong><em>Proof</em></strong>: (i) Follows from the above lemma.</p><p>(ii)<br>$$<br>L^{n+1}=[L^n,L]\subseteq L^n.\quad \square<br>$$</p><p><strong>Definition 2.</strong> A Lie algebra $L$ is nilpotent if $L^n=0$ for some $n\geq 1$.</p><p><strong>Remark:</strong> Every subalgebra and every quotient algebra of a nilpotent Lie algebra is nilpotent.</p><p><strong>Notation:</strong><br>$$<br>L^{(0)}=L,\qquad L^{(n+1)}=[L^{(n)},L^{(n)}]\qquad (n\geq 0).<br>$$</p><p><strong>Proposition 2.</strong> (i) $L^{(n)}$ is an ideal of $L$.</p><p>(ii)<br>$$<br>L=L^{(0)}\supseteq L^{(1)}\supseteq L^{(2)}\supseteq \cdots<br>$$</p><p><strong><em>Proof</em></strong>: Similar to above. $\square$</p><p><strong>Definition 3.</strong> A Lie algebra $L$ is soluble if $L^{(n)}=0$ for some $n\geq 0$.</p><p><strong>Proposition 3.</strong> (i)<br>$$<br>[L^m,L^n]\subseteq L^{m+n}\qquad \text{for }m,n\geq 1,<br>$$</p><p>(ii)<br>$$<br>L^{(n)}\subseteq L^{2^n}\qquad \text{for }n\geq 0,<br>$$</p><p>(iii) nilpotent implies soluble.</p><p><strong><em>Proof</em></strong>: (i) We prove by induction on $n$. Base case $n=1$ is clear. Inductive step:</p><p>Suppose (i) holds for $n\leq r$,<br>$$<br>[L^m,L^{r+1}]<br>=[L^m,[L^r,L]]<br>\subseteq [ [L^m,L],L^r ]+[ [L^m,L^r],L ]<br>$$<br>by the Jacobi identity. 
Hence<br>$$<br>[L^m,L^{r+1}]<br>\subseteq [L^{m+1},L^r]+[L^{m+r},L]<br>\subseteq L^{m+r+1}.<br>$$<br>(ii) We prove by induction on $n$. Base case $n=0$ is clear. Inductive step:</p><p>Suppose (ii) holds for $n\leq r$,<br>$$<br>L^{(r+1)}=[L^{(r)},L^{(r)}]<br>\subseteq [L^{2^r},L^{2^r}]<br>\subseteq L^{2^r+2^r}=L^{2^{r+1}}.<br>$$</p><p>(iii) Suppose $L$ is nilpotent. Then $L^n=0$ for some $n$. We can pick $k$ such that $2^k\geq n$.</p><p>$$<br>L^{(k)}\subseteq L^{2^k}\subseteq L^n=0.<br>$$<br>Hence $L^{(k)}=0$. So $L$ is soluble. $\square$</p><p><strong>Proposition 4.</strong> Let $I$ be an ideal of $L$. If $I$ and $L/I$ are soluble, so is $L$.</p><p><strong><em>Proof</em></strong>: We have $(L/I)^{(n)}=0$ for some $n$. Then<br>$$<br>L^{(n)}\subseteq I.<br>$$</p><p>Also, $I^{(m)}=0$ for some $m$. Hence<br>$$<br>L^{(n+m)}=(L^{(n)})^{(m)}\subseteq I^{(m)}=0.<br>$$<br>So $L$ is soluble. $\square$</p><p><strong>Proposition 5.</strong> (i) Every finite dimensional Lie algebra $L$ contains a unique maximal soluble ideal $R$.</p><p>(ii) Moreover, $L/R$ contains no non-zero soluble ideal.</p><p><strong><em>Proof</em></strong>: (i) Let $I,J$ be soluble ideals of $L$. Then $I+J$ is an ideal of $L$.</p><p>Also, $I$ is a soluble ideal of $I+J$. And</p><p>$$<br>(I+J)/I \cong J/(I\cap J),<br>$$<br>which is soluble. Therefore, by Proposition 4, $I+J$ is soluble.</p><p>We have thus shown that the sum of two soluble ideals is soluble. Since $L$ is finite dimensional, it follows that $L$ has a unique maximal soluble ideal $R$. $\square$</p><p>(ii) Let $I/R$ be a soluble ideal of $L/R$. Then $I$ is a soluble ideal of $L$. By part (i), we have</p><p>$$<br>I\subseteq R.<br>$$<br>Therefore,<br>$$<br>I=R,<br>$$<br>and hence<br>$$<br>I/R=0.\quad \square<br>$$<br><strong>Definition 4.</strong> $R$ is called the soluble radical of $L$.</p><p><strong>Definition 5.</strong> A Lie algebra $L$ is semisimple if $R=0$. 
In other words, $L$ is semisimple if and only if $L$ has no non-zero soluble ideal.</p><p><strong>Definition 6.</strong> $L$ is simple if $L$ has no ideals other than $0$ and $L$.</p><p><strong>Example 1.</strong> Suppose $L$ has $\dim L=1$. Then $L$ has a basis $\{e\}$. Since $[e,e]=0$, we have<br>$$<br>L^2=0.<br>$$<br>So $L$ is abelian. Also, $L$ is simple. $L$ is called the trivial simple Lie algebra.</p><p><strong>Proposition 6.</strong> Every non-trivial simple Lie algebra is semisimple.</p><p><strong><em>Proof</em></strong>: Suppose $L$ is simple but not semisimple. Then<br>$$<br>R\neq 0.<br>$$</p><p>Since $L$ is simple and $R$ is an ideal of $L$, we must have<br>$$<br>R=L.<br>$$</p><p>Because $R$ is soluble, there exists some $n\geq 0$ such that<br>$$<br>L^{(n)}=0.<br>$$</p><p>Then<br>$$<br>L^{(1)}\neq L,<br>$$<br>for otherwise we would have<br>$$<br>L^{(n)}=L\neq 0<br>$$<br>for all $n$, which is a contradiction.</p><p>Since $L^{(1)}$ is an ideal of $L$ and $L$ is simple, it follows that<br>$$<br>L^{(1)}=0.<br>$$</p><p>That is,<br>$$<br>[L,L]=0.<br>$$</p><p>Therefore, every subspace of $L$ is an ideal of $L$.</p><p>But $L$ is simple, so necessarily<br>$$<br>\dim L=1.<br>$$</p><p>This contradicts the assumption that $L$ is non-trivial. Hence $L$ must be semisimple. $\square$</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><p>This series of articles on Lie algebras is based on Carter R., <em>Lie Algebras of Finite and Affine Type</em> (Cambridge University Press, 2005), as well as the class notes for <a href="https://bimsa.net/activity/InttoLiealg/"><em>Introduction to Lie Algebras</em> by Chenwei Ruan at BIMSA</a>.</p><blockquote><p>The cover image of this article was taken in Wellington, New Zealand.</p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/03/06/Abelian-Nilpotent-and-Soluable-Lie-Algebras/</id>
    <link href="https://handsteinwang.github.io/2026/03/06/Abelian-Nilpotent-and-Soluable-Lie-Algebras/"/>
    <published>2026-03-05T16:00:00.000Z</published>
    <summary>
      <![CDATA[<p><strong>Definition 1.</strong> A Lie algebra $L$ is abelian if $[L,L]=0$, which means for all $x,y\in L,\ [x,y]=0$.</p>]]>
    </summary>
    <title>Abelian, Nilpotent and Soluble Lie Algebras</title>
    <updated>2026-03-05T16:00:00.000Z</updated>
  </entry>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Probability Theory" scheme="https://handsteinwang.github.io/categories/Probability-Theory/"/>
    <category term="Probability Theory" scheme="https://handsteinwang.github.io/tags/Probability-Theory/"/>
    <category term="Analysis" scheme="https://handsteinwang.github.io/tags/Analysis/"/>
    <category term="Measure Theory" scheme="https://handsteinwang.github.io/tags/Measure-Theory/"/>
    <content>
      <![CDATA[<p><strong>Definition 1.</strong> A mapping $f$ from the Borel sets of a metrizable space $(X,\tau)$ to $[0,\infty]$ is <em>tight</em> if $f(K)&lt;\infty$ for each compact $K$ in $X$ and<br>$$<br>f(A)=\sup\{f(K):K \text{ compact},\ K\subseteq A\},<br>\quad \text{for each } A\in \mathcal{B}(X).<br>$$</p><span id="more"></span>    <p>Tightness is very powerful, as the next result shows. We consider non-negative functions on the Borel subsets of a metrizable space $(X,\tau)$ which can take infinite values. </p><p>As before, a mapping $f:\mathcal{B}(X)\to [0,\infty]$ is <em>additive</em> if<br>$$<br>f(A\cup B)=f(A)+f(B),<br>$$<br>whenever $A$ and $B$ are disjoint, and is $\sigma$-additive if</p><p>$$<br>f\left(\bigcup_{n=1}^{\infty} A_n\right)=\sum_{n=1}^{\infty} f(A_n)<br>$$</p><p>for each sequence $(A_n)_{n=1}^{\infty}$ of disjoint Borel sets.</p><p><strong>Proposition 1.</strong> Suppose that $f$ is a non-negative additive tight function on the Borel sets of a metrizable space $(X,\tau)$. Then $f$ is $\sigma$-additive, and so it is a tight Borel measure on $X$.</p><p><strong><em>Proof</em></strong> : Suppose that $(A_n)_{n=1}^{\infty}$ is a sequence of disjoint Borel sets whose union is $A$. First we consider the case where $f(A)&lt;\infty$. Let</p><p>$$<br>B_n=\bigcup_{j=1}^n A_j.<br>$$</p><p>Since<br>$$<br>B_n\subseteq A,<br>$$</p><p>we have</p><p>$$<br>\sum_{j=1}^n f(A_j)\le f(A),<br>$$</p><p>and so</p><p>$$<br>\sum_{j=1}^{\infty} f(A_j)\le f(A).<br>$$</p><p>Suppose, if possible, that</p><p>$$<br>\sum_{j=1}^{\infty} f(A_j)=s&lt;f(A).<br>$$</p><p>Let $\varepsilon=\frac{f(A)-s}{2}&gt;0$ and $C_n=A\setminus B_n$. 
We have</p><p>$$<br>f(C_n)\ge f(A)-f(B_n)\ge f(A)-s= 2\varepsilon,<br>\quad \text{for all } n\in \mathbb{N}.<br>$$</p><p>By combining blocks of terms, we can suppose that</p><p>$$<br>f(B_n)&gt;s-\frac{\varepsilon}{2^{n+2}},<br>\quad \text{for all } n\in \mathbb{N}.<br>$$</p><p>For each $n\in \mathbb{N}$ there exists a compact subset $K_n$ of $C_n$ with</p><p>$$<br>f(K_n)&gt;f(C_n)-\frac{\varepsilon}{2^{n+1}}.<br>$$</p><p>Let</p><p>$$<br>L_n=\bigcap_{j=1}^n K_j.<br>$$</p><p>We now show by induction that</p><p>$$<br>f(L_n)\ge \left(1+\frac{1}{2^n}\right)\varepsilon<br>$$</p><p>so that</p><p>$$<br>f(C_n\setminus L_n)\le \left(1-\frac{1}{2^n}\right)\varepsilon.<br>$$</p><p>The result is true when $n=1$; suppose that it is true for $n$. Now</p><p>$$<br>\begin{aligned}<br>f(C_n\setminus C_{n+1})<br>&amp;=f(B_{n+1}\setminus B_n) \\\<br>&amp;=f(B_{n+1})-f(B_n) \\\<br>&amp;\le s-\left(s-\frac{\varepsilon}{2^{n+2}}\right) \\\<br>&amp;=\frac{\varepsilon}{2^{n+2}}.<br>\end{aligned}<br>$$</p><p>Since</p><p>$$<br>C_n\setminus K_{n+1}\subseteq (C_n\setminus C_{n+1})\cup (C_{n+1}\setminus K_{n+1}),<br>$$</p><p>it follows that</p><p>$$<br>f(C_n\setminus K_{n+1})\le f(C_n\setminus C_{n+1})+f(C_{n+1}\setminus K_{n+1})&lt; \frac{\varepsilon}{2^{n+1}}.<br>$$</p><p>Since</p><p>$$<br>C_n=(C_n\setminus K_{n+1})\cup (C_n\setminus L_n)\cup (L_n\cap K_{n+1}),<br>$$</p><p>we have</p><p>$$<br>\begin{aligned}<br>2\varepsilon<br>&amp;\le f(C_n) \\\<br>&amp;\le f(C_n\setminus K_{n+1})+f(C_n\setminus L_n)+f(L_n\cap K_{n+1}) \\\<br>&amp;\le \frac{\varepsilon}{2^{n+1}}+\left(1-\frac{1}{2^n}\right)\varepsilon+f(L_{n+1}),<br>\end{aligned}<br>$$</p><p>so that</p><p>$$<br>f(L_{n+1})\ge \left(1+\frac{1}{2^{n+1}}\right)\varepsilon.<br>$$</p><p>This establishes the induction.</p><p>But</p><p>$$<br>\bigcap_{n=1}^{\infty} L_n\subseteq \bigcap_{n=1}^{\infty} C_n=\varnothing.<br>$$</p><p>Since the sets $L_n$ are compact, it follows that there exists $N\in \mathbb{N}$ for 
which</p><p>$$<br>L_N=\varnothing,<br>$$</p><p>so that</p><p>$$<br>f(L_N)=0,<br>$$</p><p>giving a contradiction.</p><p>Finally, suppose that $f(A)=\infty$. If $M&lt;\infty$, then by tightness there exists a compact subset $K$ of $A$ with $f(K)&gt;M$, and $f(K)&lt;\infty$. Applying the finite case established above to $K$ and the disjoint sets $A_n\cap K$, we get</p><p>$$<br>\sum_{n=1}^{\infty} f(A_n)\ge \sum_{n=1}^{\infty} f(A_n\cap K)=f(K)&gt;M,<br>$$</p><p>so that</p><p>$$<br>\sum_{n=1}^{\infty} f(A_n)=\infty.\quad \square<br>$$</p><p><strong>Proposition 2.</strong> A finite Borel measure $\mu$ on a metric space $(X,d)$ is tight if and only if<br>$$<br>\sup\{\mu(K):K \text{ compact},\ K\subseteq X\}=\mu(X).<br>$$</p><p><strong><em>Proof</em></strong> : The condition is certainly necessary. Conversely, suppose that it is satisfied, that $A$ is a Borel set, and that $\varepsilon&gt;0$. Since every finite Borel measure on a metric space is closed regular, there exists a closed set $B\subseteq A$ such that<br>$$<br>\mu(B)\ge \mu(A)-\varepsilon/2<br>$$</p><p>and there exists a compact $K$ such that</p><p>$$<br>\mu(K)&gt;\mu(X)-\varepsilon/2.<br>$$</p><p>Then $B\cap K$ is a compact subset of $A$, and</p><p>$$<br>\begin{aligned}<br>\mu(B\cap K)&amp;=\mu(B)-\mu(B\cap (X\setminus K))\\\<br>&amp;\ge \mu(B)-\mu(X\setminus K)\\\<br>&amp;\ge  \mu(A)-\varepsilon/2-\varepsilon/2\\\<br>&amp;= \mu(A)-\varepsilon.<br>\end{aligned}<br>$$</p><p>Therefore $\mu$ is tight. $\square$</p><p><strong>Theorem 1 (Ulam’s Theorem).</strong> A finite Borel measure on a Polish space $(X, \tau)$ is tight.</p><p><strong><em>Proof</em></strong> : Let $d$ be a complete metric on $X$ which defines the topology $\tau$. Let $(c_j)_{j=1}^{\infty}$ be a dense sequence in $X$, and let</p><p>$$<br>M_{j,n} = \{ x \in X : d(x, c_j) \leq \frac{1}{n} \}<br>$$</p><p>and</p><p>$$<br>A_{j,n} = \bigcup_{i=1}^{j} M_{i,n}.<br>$$</p><p>Suppose that $\varepsilon &gt; 0$. For $n \in \mathbb{N}$, each $A_{j,n}$ is closed, and $A_{j,n} \uparrow X$ as $j \to \infty$. 
Thus there exists $J_n$ such that if $E_n = A_{J_n,n}$, then<br>$$<br>\mu(E_n) &gt; (1 - \frac{\varepsilon}{2^n}) \mu(X).<br>$$<br>Let<br>$$<br>D_n = \bigcap_{j=1}^{n} E_j.<br>$$</p><p>Then $(D_n)_{n=1}^{\infty}$ is a decreasing sequence of closed sets, and for each $n \in \mathbb{N}$,</p><p>$$<br>\mu(D_n)=\mu(X)-\mu(\bigcup_{j=1}^{n} (X\setminus E_j))&gt; \mu(X)-\sum_{j=1}^{n} \frac{\varepsilon}{2^j} \mu(X) = (1 - (1 - \frac{1}{2^n}) \varepsilon) \mu(X).<br>$$</p><p>Let<br>$$<br>D = \bigcap_{n=1}^{\infty} D_n,<br>$$</p><p>then<br>$$<br>\mu(D)=\lim_{n\to\infty}\mu(D_n) \geq (1 - \varepsilon) \mu(X).<br>$$</p><p>Moreover, $D$ is closed, and $D$ is totally bounded, since for each $n$ we have $D\subseteq E_n$, so that $D$ is covered by the finitely many balls of radius $\frac{1}{n}$ making up $E_n$. Since $(X,d)$ is complete, $D$ is therefore compact. $\square$</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><p>Chapter 16 in the following reference:</p><p>Garling, David JH. <em>Analysis on Polish spaces and an introduction to optimal transportation</em>. Vol. 89. Cambridge University Press, 2018.</p><blockquote><p>The cover image of this article was taken in Lucerne, Switzerland.</p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/03/04/Tightness-of-Borel-Measure/</id>
    <link href="https://handsteinwang.github.io/2026/03/04/Tightness-of-Borel-Measure/"/>
    <published>2026-03-03T16:00:00.000Z</published>
    <summary>
      <![CDATA[<p><strong>Definition 1.</strong> A mapping $f$ from the Borel sets of a metrizable space $(X,\tau)$ to $[0,\infty]$ is <em>tight</em> if $f(K)&lt;\infty$ for each compact $K$ in $X$ and<br>$$<br>f(A)=\sup\{f(K):K \text{ compact},\ K\subseteq A\},<br>\quad \text{for each } A\in \mathcal{B}(X).<br>$$</p>]]>
    </summary>
    <title>Tightness of Borel Measure</title>
    <updated>2026-03-03T16:00:00.000Z</updated>
  </entry>
  <entry>
    <author>
      <name>Handstein Wang</name>
    </author>
    <category term="Gradient Flows" scheme="https://handsteinwang.github.io/categories/Gradient-Flows/"/>
    <category term="Gradient Flows" scheme="https://handsteinwang.github.io/tags/Gradient-Flows/"/>
    <category term="Optimal Transport" scheme="https://handsteinwang.github.io/tags/Optimal-Transport/"/>
    <content>
      <![CDATA[<p>Throughout this article $(X,d)$ will be a given <em>complete metric space</em>.</p><span id="more"></span>    <p>We recall that a map $f:(a,b)\to \mathbb{R}$ is said to be absolutely continuous on $(a,b)$ if for every $\varepsilon&gt;0$ there exists $\delta&gt;0$ such that for every finite family of pairwise disjoint intervals $(x_i,y_i)\subset (a,b)$, if<br>$$<br>   \sum_{i=1}^N (y_i-x_i)&lt;\delta,<br>$$<br>  then<br>$$<br>\sum_{i=1}^N |f(y_i)-f(x_i)|&lt;\varepsilon.<br>$$</p><p>We have the following equivalent characterization of absolute continuity.</p><p><strong>Proposition 1.</strong> A map $f:(a,b)\to \mathbb{R}$ is absolutely continuous if and only if there exists a function $g\in L^1(a,b)$ such that for every $a&lt;s\le t&lt;b$,<br>$$<br>f(t)-f(s)=\int_s^t g(r)\ dr.<br>$$</p><p>Now we give the definition of absolutely continuous curves on $(X,d)$.</p><p><strong>Definition 1 (Absolutely continuous curves).</strong> Let $(X,d)$ be a complete metric space and let $v:(a,b)\to X$ be a curve. We say that $v$ belongs to $AC^p(a,b;X)$, for $p\in[1,+\infty]$, if there exists $m\in L^p(a,b)$ such that<br>$$<br>d(v(s),v(t))\leq \int_s^t m(r)\ dr<br>\qquad \forall\ a&lt;s\leq t&lt;b.\tag{1}<br>$$</p><p>In the case $p=1$ we are dealing with absolutely continuous curves and we will denote the corresponding space simply by $AC(a,b;X)$.</p><p>Any curve in $AC^p(a,b;X)$ is uniformly continuous; if $a&gt;-\infty$ (resp. $b&lt;+\infty$) we will denote by $v(a+)$ (resp. $v(b-)$) the right (resp. left) limit of $v$, which exists since $X$ is complete. The above limits exist even in the case $a=-\infty$ (resp. $b=+\infty$) if $v\in AC(a,b;X)$. Among all the possible choices of $m$ in (1) there exists a minimal one, which is provided by the following theorem.</p><p><strong>Theorem 1 (Metric derivative).</strong> Let $p\in[1,+\infty]$. 
Then for any curve $v$ in $AC^p(a,b;X)$ the limit</p><p>$$<br>|v’|(t):=\lim_{s\to t}\frac{d(v(s),v(t))}{|s-t|}<br>$$</p><p>exists for $\mathcal{L}^1$-a.e. $t\in(a,b)$. Moreover, the function $t\mapsto |v’|(t)$ belongs to $L^p(a,b)$, it is an admissible integrand for the right hand side of (1), and it is minimal in the following sense:<br>$$<br>|v’|(t)\leq m(t)\qquad \text{for }\mathcal{L}^1\text{-a.e. }t\in(a,b),<br>$$</p><p>for each function $m$ satisfying (1).</p><p><strong><em>Proof</em></strong> : Since the space $(a,b)$ is separable and the curve $v$ is continuous, the image $v((a,b))$ is also separable, so we can let $(y_n)\subset X$ be dense in $v((a,b))$. Let</p><p>$$<br>d_n(t):= d(y_n,v(t)).<br>$$</p><p>We first claim that all functions $d_n$ are absolutely continuous. Since $v\in AC^p(a,b;X)$, there exists $m\in L^p(a,b)$ such that for all $s,t\in (a,b)$ with $s\le t$,<br>$$<br>|d_n(s)-d_n(t)|=|d(y_n, v(s))-d(y_n,v(t))|\le d(v(s),v(t))\le \int_s^t m(r)\ dr.<br>$$<br>For all $\varepsilon&gt;0$, by the absolute continuity of the integral for $m\in L^p(a,b)$, there exists $\delta&gt;0$ such that<br>$$<br>\int_E m(r)\ dr&lt; \varepsilon<br>$$<br>as long as $\mathcal{L}^1(E)&lt;\delta$. Hence for every finite family of pairwise disjoint intervals $(s_i,t_i)\subset (a,b)$, if<br>$$<br>   \sum_{i=1}^N (t_i-s_i)&lt;\delta,<br>$$<br> then</p><p>$$<br>\sum_{i=1}^N |d_n(t_i)-d_n(s_i)|\le\int_{\bigcup_{i=1}^N (s_i,t_i)} m(r)\ dr &lt;\varepsilon.<br>$$</p><p>Therefore all functions $d_n$ are absolutely continuous in $(a,b)$, and then the function</p><p>$$<br>h(t):= \sup_{n\in \mathbb{N}} |d_n^\prime (t)|<br>$$</p><p>is well-defined $\mathcal{L}^1$-a.e. in $(a,b)$. 
Let $t\in (a,b)$ be a point where all functions $d_n$ are differentiable and notice again that<br>$$<br>|d_n(s)-d_n(t)|=|d(y_n, v(s))-d(y_n,v(t))|\le d(v(s),v(t))\quad \text{ for all } n\in \mathbb{N}<br>$$<br>hence<br>$$<br>\sup_{n\in \mathbb{N}} |d_n(s)-d_n(t)|\le d(v(s),v(t))<br>$$<br>and then<br>$$<br>h(t)=<br>\sup_{n \in \mathbb{N}} \liminf_{s \to t} \frac{|d_n(s)-d_n(t)|}{|s-t|}\le \liminf_{s \to t} \frac{d(v(s), v(t))}{|s-t|}.<br>$$<br>For each function $m$ satisfying (1), by the Lebesgue differentiation theorem, we further get<br>$$<br>h(t)\le \liminf_{s\to t}\frac{1}{|t-s|} \left|\int_s^t m(r)\ dr\right|=m(t) \quad \text{for } \mathcal{L}^1\text{-a.e. } t.<br>$$<br>Therefore $h\in L^p(a,b)$. On the other hand, fix $s,t\in (a,b)$. Since $(y_n)$ is dense in $v((a,b))$, there exists a subsequence $(y_{n_k})$ with $y_{n_k}\to v(s)$, and then<br>$$<br>d(y_{n_k},v(t))\to d(v(s),v(t))\quad \text{as } k\to \infty.<br>$$</p><p>Hence<br>$$<br>|d_{n_k}(s)-d_{n_k}(t)|=|d(y_{n_k},v(s))-d(y_{n_k},v(t))|\to d(v(s),v(t)),\quad \text{as } k\to \infty,<br>$$<br>which means that for all $\varepsilon&gt;0$ there exists a sufficiently large $k$ such that<br>$$<br>d(v(s),v(t))&lt; |d_{n_k}(s)-d_{n_k}(t)|+\varepsilon.<br>$$<br>Therefore,<br>$$<br>d(v(s),v(t))=\sup_{n\in \mathbb{N}} |d_{n}(s)-d_{n}(t)|.<br>$$<br>Moreover, since the $d_n$ are absolutely continuous, for all $s\le t$,<br>$$<br>d_n(t)-d_n(s)=\int_s^t d_n^\prime (r)\ dr\le \int_s^t h(r)\ dr,<br>$$<br>hence<br>$$<br>d(v(s),v(t))=\sup_{n\in \mathbb{N}} |d_{n}(s)-d_{n}(t)|\le \int_s^t h(r)\ dr \quad \forall s,t\in (a,b),\ s\le t.<br>$$</p><p>Therefore, by the Lebesgue differentiation theorem,<br>$$<br>\limsup_{s\to t}\frac{d(v(s),v(t))}{|s-t|}\le h(t)<br>$$<br>at any Lebesgue point $t$ of $h$. Combining the results above, we get</p><p>$$<br>h(t)\le \liminf_{s \to t} \frac{d(v(s), v(t))}{|s-t|}\le \limsup_{s\to t}\frac{d(v(s),v(t))}{|s-t|}\le h(t),<br>$$<br>therefore, the limit<br>$$<br>|v’|(t):=\lim_{s\to t}\frac{d(v(s),v(t))}{|s-t|}<br>$$</p><p>exists for $\mathcal{L}^1$-a.e. 
$t\in(a,b)$, with<br>$$<br>|v’|(t)=h(t)\leq m(t)\qquad \text{for }\mathcal{L}^1\text{-a.e. }t\in(a,b)<br>$$<br>for each function $m$ satisfying (1); moreover, $|v’|=h \in L^p(a,b)$ is an admissible integrand for the right hand side of (1). $\square$</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><p>Ambrosio, Luigi, Nicola Gigli, and Giuseppe Savaré. <em>Gradient flows: in metric spaces and in the space of probability measures</em>. Basel: Birkhäuser Basel, 2005. </p><blockquote><p>The cover image of this article was taken in Innsbruck, Austria.</p></blockquote>]]>
    </content>
    <id>https://handsteinwang.github.io/2026/03/02/Absolutely-Continuous-Curves-and-Metric-Derivative/</id>
    <link href="https://handsteinwang.github.io/2026/03/02/Absolutely-Continuous-Curves-and-Metric-Derivative/"/>
    <published>2026-03-01T16:00:00.000Z</published>
    <summary>
      <![CDATA[<p>Throughout this article $(X,d)$ will be a given <em>complete metric space</em>.</p>]]>
    </summary>
    <title>Absolutely Continuous Curves and Metric Derivative</title>
    <updated>2026-03-01T16:00:00.000Z</updated>
  </entry>
</feed>
