Wasserstein Distance
The Wasserstein distance has wide applications in problems such as optimal transport. It characterizes the distance between two probability measures. In German, Wasser means water, and Stein means stone (although I have not carefully verified whether Wasserstein is a person’s name; after all, many German names are of this kind, for example: Einstein = one stone). This article will mainly answer the following two questions:
- On what set is the Wasserstein distance defined?
- Why is the Wasserstein distance a distance? That is, why does it satisfy the three axioms of a metric?
In this article, we only consider the Wasserstein distance on the $d$-dimensional Euclidean space $\mathbb{R}^d$. Since $\mathbb{R}^d$ is a special Polish space, the conclusions of this article can also be extended to general Polish spaces.
On what set is the Wasserstein distance defined?
Definition 1 Let $\mathcal{P}(\mathbb{R}^d)$ be the set of all probability measures on $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$; for $p\ge 1$, let $\mathcal{P}_p(\mathbb{R}^d)$ be the set of all probability measures on $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$ with finite $p$-th moment.
Similarly, one may define $\mathcal{P}_p(\mathbb{R}^d\times \mathbb{R}^d)$.
Definition 2 (Coupling) For $p\ge 1$, and given $\mu,\nu\in \mathcal{P}_p(\mathbb{R}^d)$, define
$$
\Pi_p(\mu,\nu) = \{\pi\in \mathcal{P}_p(\mathbb{R}^d\times \mathbb{R}^d)\ ;\ \pi(A\times \mathbb{R}^d) = \mu(A),\ \pi(\mathbb{R}^d\times B) = \nu(B),\ \forall A,B\in \mathcal{B}(\mathbb{R}^d)\}.
$$
This set is called the set of couplings of $\mu$ and $\nu$.
Definition 3 (Wasserstein distance) For $p\ge 1$, define the mapping $W_p: \mathcal{P}_p(\mathbb{R}^d)\times \mathcal{P}_p(\mathbb{R}^d)\to \mathbb{R}$ by
$$
W_p(\mu,\nu) = \inf_{\gamma\in \Pi_p(\mu,\nu)}\left(\int_{\mathbb{R}^d\times \mathbb{R}^d}|x-y|^p \gamma(dx,dy)\right)^{1/p}.
$$
This is called the $p$-Wasserstein distance.
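For intuition, when $\mu$ and $\nu$ are finitely supported on the real line, the infimum in Definition 3 is a small linear program over coupling matrices. A minimal sketch, assuming NumPy and SciPy are available (`wasserstein_discrete` is an illustrative helper, not standard API):

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_discrete(x, y, mu, nu, p=2):
    """p-Wasserstein distance between sum_i mu[i] delta_{x[i]} and
    sum_j nu[j] delta_{y[j]}, via the coupling linear program."""
    n, m = len(x), len(y)
    # cost[i, j] = |x_i - y_j|^p
    cost = np.abs(np.subtract.outer(np.asarray(x, float), np.asarray(y, float))) ** p
    # decision variable: the flattened coupling matrix gamma (n*m nonnegative entries)
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # row sums of gamma = mu
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # column sums of gamma = nu
    b_eq = np.concatenate([mu, nu])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun ** (1.0 / p)

# moving a unit point mass from 0 to 1 costs exactly 1 for every p
print(wasserstein_discrete([0.0], [1.0], np.array([1.0]), np.array([1.0]), p=1))
```

The marginal constraints encode exactly the definition of $\Pi_p(\mu,\nu)$ from Definition 2, restricted to matrices.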
Why is the Wasserstein distance a distance?
Below we need to prove that for $p\ge 1$, and for any $\mu,\nu,\lambda\in \mathcal{P}_p(\mathbb{R}^d)$, we have
(I) $W_p(\mu,\nu)<+\infty$.
(II) $W_p(\mu,\nu)\ge 0$.
(III) $W_p(\mu,\nu) = 0$ if and only if $\mu = \nu$.
(IV) $W_p(\mu,\nu) = W_p(\nu,\mu)$.
(V) $W_p(\mu,\lambda)\le W_p(\mu,\nu)+W_p(\nu,\lambda)$.
Among them, (II) and (IV) are obvious. We first prove (I).
Proof of (I): Fix any $\gamma\in \Pi_p(\mu,\nu)$. For $p\ge 1$, convexity of $t\mapsto t^p$ on $[0,+\infty)$ gives $(a+b)^p\le 2^{p-1}(a^p+b^p)$ for all $a,b\ge 0$. Combining this with the triangle inequality, the marginal property of $\gamma$, and $\mu,\nu\in \mathcal{P}_p(\mathbb{R}^d)$, we obtain
$$
\begin{aligned}
\int_{\mathbb{R}^d\times \mathbb{R}^d}|x-y|^p \gamma(dx,dy)&\le \int_{\mathbb{R}^d\times \mathbb{R}^d}(|x|+|y|)^p \gamma(dx,dy)\le \int_{\mathbb{R}^d\times \mathbb{R}^d}2^{p-1}(|x|^p+|y|^p) \gamma(dx,dy) \\
&= 2^{p-1}\int_{\mathbb{R}^d\times \mathbb{R}^d}|x|^p \gamma(dx,dy) +2^{p-1}\int_{\mathbb{R}^d\times \mathbb{R}^d}|y|^p \gamma(dx,dy) \\
& = 2^{p-1}\int_{\mathbb{R}^d}|x|^p \mu(dx) +2^{p-1}\int_{\mathbb{R}^d}|y|^p \nu(dy)\\
&<+\infty
\end{aligned}
$$
Hence $W_p(\mu,\nu)<+\infty$.
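The elementary bound used above can be spot-checked numerically; a quick sketch assuming NumPy:

```python
import numpy as np

# Numerical spot-check of the bound (a+b)^p <= 2^(p-1) (a^p + b^p)
# (convexity of t -> t^p) used in the finiteness proof above.
rng = np.random.default_rng(0)
a = rng.uniform(0.0, 10.0, 10_000)
b = rng.uniform(0.0, 10.0, 10_000)
for p in (1.0, 1.5, 2.0, 3.0):
    assert np.all((a + b) ** p <= 2 ** (p - 1) * (a ** p + b ** p) + 1e-9)
print("bound verified on random samples")
```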
In order to prove (III) and (V), we must first understand the properties of the set $\Pi_p(\mu,\nu)$. We have the following important result:
Theorem 1 For $p\ge 1$ and any $\mu,\nu\in \mathcal{P}_p(\mathbb{R}^d)$, the set $\Pi_p(\mu,\nu)$ is a nonempty convex set and is compact in the topology of weak convergence.
In fact, Theorem 1 guarantees that the infimum in the Wasserstein distance is attained, namely we have the following theorem.
Theorem 2 There exists $\bar\gamma\in \Pi_p(\mu,\nu)$ such that
$$
W_p(\mu,\nu) = \left(\int_{\mathbb{R}^d\times \mathbb{R}^d}|x-y|^p \bar\gamma(dx,dy)\right)^{1/p}.
$$
If I do not prove Theorem 1 and Theorem 2, then you will certainly think that I am a liar. But before proving Theorem 1 and Theorem 2, let us first review some important definitions and results from advanced probability theory and optimization.
Lower semi-continuous functions
Definition 4 Let $(D,d)$ be a metric space. We say that a function $f: D\to \mathbb{R}$ is lower semi-continuous (l.s.c.) at $\bar x\in D$ if
$$
\liminf_{x\to \bar x}f(x)\ge f(\bar x).
$$
We say that $f$ is lower semi-continuous on $D$ if $f$ is lower semi-continuous at every $\bar x\in D$.
We say that a function $f: D\subseteq \mathbb{R}^d\to \mathbb{R}$ is upper semi-continuous (u.s.c.) at $\bar x\in D$ if
$$
\limsup_{x\to \bar x}f(x)\le f(\bar x).
$$
We say that $f$ is upper semi-continuous on $D$ if $f$ is upper semi-continuous at every $\bar x\in D$.
For lower semi-continuous functions, we have the following equivalent characterization:
Lemma 1 (Equivalent characterization of lower semi-continuous functions) Let $f: D\to \mathbb{R}$. Then the following are equivalent:
(1) $f$ is lower semi-continuous;
(2) For any $\bar x\in D$ and any $\varepsilon>0$, there exists $\delta>0$ such that
$$
f(\bar x)-\varepsilon<f(x),\quad \forall x\in B(\bar x,\delta)\cap D.
$$
(3) For any $\bar x\in D$ and any sequence $\{x_k\}$ in $D$ with $x_k\to \bar x$, one has
$$
\liminf_{k\to\infty}f(x_k)\ge f(\bar x).
$$
(4) The epigraph of $f$
$$
\operatorname{epi}(f) = \{(x,\alpha)\in D\times \mathbb{R}\ ;\ f(x)\le \alpha\}
$$
is closed.
(5) For any $\alpha\in \mathbb{R}$, the sublevel set of $f$
$$
\operatorname{lev}_{\le \alpha}(f) = \{x\in D\ ;\ f(x)\le \alpha\}
$$
is closed.
Proof: The proof of the above lemma is routine. Readers who are unfamiliar with it may refer to [4].
Corollary 1 If $G\subseteq \mathbb{R}^d$ is open, then
$$
1_G(x) = \begin{cases}
1,&x\in G\\
0,&x\notin G
\end{cases}
$$
is a lower semi-continuous function on $\mathbb{R}^d$.
Proof: For $\alpha<0$ the sublevel set $\{1_G\le \alpha\}$ is $\varnothing$; for $0\le \alpha<1$ it is the closed set $G^c$; for $\alpha\ge 1$ it is $\mathbb{R}^d$. All of these are closed, so the claim follows from Lemma 1 (5).
Lower semi-continuous functions can guarantee the existence of solutions to constrained optimization problems. We have the following result.
Theorem 3 (Weierstrass theorem)
If $f: D\to \mathbb{R}\cup\{\pm\infty\}$ is proper, lower semi-continuous, and $D$ is compact, then the constrained optimization problem
$$
\inf_{x\in D} f(x)
$$
admits a solution $x^*\in D$.
Note: A function is proper if its domain
$$
\operatorname{dom} f = \{x\in D\ ;\ f(x)<+\infty\}\neq \varnothing,
$$
and $f$ never takes the value $-\infty$ on $D$. However, in the problem considered in this article, since the objective function is nonnegative and we have already proved (I), properness is obviously satisfied.
Proof: See [5].
For lower semi-continuous functions, we also have the following result:
Lemma 2 A nonnegative lower semi-continuous function can be approximated from below by Lipschitz functions. Namely, if $f: D\to \mathbb{R}$ is nonnegative and lower semi-continuous, then there exists a nondecreasing sequence of $n$-Lipschitz functions $f_n$ with $f_n(x)\to f(x)$ for every $x\in D$.
Proof: Let
$$
f_n(x) = \inf_{y\in D} \{f(y)+n \cdot d(x,y)\}
$$
Then clearly $0\le f_n(x)\le f_{n+1}(x)\le f(x)$ (the first inequality since $f\ge 0$ and $d\ge 0$, the last by taking $y=x$). It remains to prove:
(i) $f_n$ is an $n$-Lipschitz function;
(ii) For any $x\in D$, $\lim\limits_{n\to\infty} f_n(x) = f(x)$.
First prove (i). For any $x,y\in D$,
$$
\begin{aligned}
f_{n}(x) &=\inf_{z\in D} \{f(z)+nd(x,z)\} \le \inf_{z\in D} \{f(z)+nd(x,y)+nd(y,z)\}\\
& = \inf_{z\in D} \{f(z)+nd(y,z)\}+nd(x,y)= f_n(y)+nd(x,y)
\end{aligned}
$$
Symmetrically,
$$
f_n(y)\le f_n(x)+nd(x,y)
$$
Hence
$$
|f_n(x)-f_n(y)|\le nd(x,y)
$$
that is, $f_n$ is an $n$-Lipschitz function.
Now prove (ii). Note that $f_n(x)$ is nondecreasing in $n$ and bounded above by $f(x)$, so the limit $l = \lim\limits_{n\to\infty} f_n(x)$ exists. Suppose by contradiction that $l < f(x)$. By the definition of the infimum, one can choose $y_n\in D$ such that
$$
f(y_n)+nd(y_n,x)<f_n(x)+\frac{1}{n}.\quad (1)
$$
Therefore,
$$
d(y_n,x)<\frac{f_n(x)+\frac{1}{n}-f(y_n)}{n}\le \frac{f(x)+1}{n},
$$
where we used $f(y_n)\ge 0$ and $f_n(x)\le f(x)$. Hence $y_n\to x$. By (1),
$$
f(y_n)<f_n(x)+\frac{1}{n}.
$$
Since $y_n\to x$, lower semi-continuity gives
$$
f(x)\le \liminf_{n\to\infty} f(y_n)\le \lim_{n\to\infty}\left(f_n(x)+\frac{1}{n}\right) = l,
$$
which contradicts $l < f(x)$. Therefore the assumption is false, and the proof is complete.
Remark: For a lower semi-continuous function that is merely bounded below, one may apply the lemma to $f-\inf f$, so the nonnegativity assumption can be relaxed to boundedness below. Conversely, a pointwise nondecreasing limit of continuous functions is always lower semi-continuous, so for functions bounded below this characterization is in fact necessary and sufficient. The proof is left as an exercise.
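The envelope $f_n(x) = \inf_y\{f(y)+n\,d(x,y)\}$ from Lemma 2 can be illustrated numerically for $f = 1_{(0,1)}$, the indicator of an open interval, which is lower semi-continuous by Corollary 1. A sketch assuming NumPy, with the infimum approximated over a grid:

```python
import numpy as np

# Illustrate Lemma 2: f_n(x) = inf_y { f(y) + n |x - y| } increases to f,
# here for f = 1_{(0,1)} discretized on a fine grid.
ys = np.linspace(-1.0, 2.0, 3001)                 # discretization of the domain
f_vals = ((ys > 0.0) & (ys < 1.0)).astype(float)  # f(y) = 1_{(0,1)}(y)

def f_n(x, n):
    """Approximate inf over the grid of f(y) + n |x - y|."""
    return float(np.min(f_vals + n * np.abs(x - ys)))

for n in (1, 10, 100):
    print(n, round(f_n(0.5, n), 3))   # increases toward f(0.5) = 1
print(round(f_n(-0.5, 1), 3))         # f(-0.5) = 0 is attained immediately
```

Each $f_n$ is $n$-Lipschitz, so it smooths out the jump of the indicator over a shrinking band of width about $1/n$.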
Weak convergence of measures
Definition 5 (Weak convergence of measures) We say that a family of probability measures $\mu_n$ on $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$ converges weakly to a probability measure $\mu$ if for every bounded continuous function $f:\mathbb{R}^d\to\mathbb{R}$, one has
$$
\int_{\mathbb{R}^d}f\ d\mu_n\to \int_{\mathbb{R}^d}f\ d\mu,\quad n\to\infty.
$$
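A standard concrete example: $\mu_n = N(0,1/n)$ converges weakly to $\delta_0$, since for any bounded continuous $f$ the integrals tend to $f(0)$. A Monte Carlo sketch, assuming NumPy, with $f=\cos$:

```python
import numpy as np

# mu_n = N(0, 1/n) converges weakly to delta_0: for bounded continuous f,
# int f d(mu_n) -> f(0).  Monte Carlo illustration with f = cos (bounded by 1).
rng = np.random.default_rng(42)
for n in (1, 10, 100, 1000):
    x = rng.normal(0.0, 1.0 / np.sqrt(n), 200_000)
    print(n, np.cos(x).mean())   # tends to cos(0) = 1
```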
We have the famous Portmanteau theorem, which gives a series of equivalent conditions for weak convergence of measures.
Theorem 4 (Portmanteau theorem) Let $\mu_n,\mu$ be probability measures on $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$. Then the following are equivalent:
(1) $\mu_n$ converges weakly to $\mu$.
(2) For every bounded Lipschitz function $f:\mathbb{R}^d\to \mathbb{R}$, one has
$$
\int_{\mathbb{R}^d}f\ d\mu_n\to \int_{\mathbb{R}^d}f\ d\mu,\quad n\to\infty.
$$
(3) For every nonnegative lower semi-continuous function $f:\mathbb{R}^d\to \mathbb{R}$, one has
$$
\int_{\mathbb{R}^d}f\ d\mu\le \liminf_{n\to\infty}\int_{\mathbb{R}^d}f\ d\mu_n
$$
(4) For every nonpositive upper semi-continuous function $f:\mathbb{R}^d\to \mathbb{R}$, one has
$$
\int_{\mathbb{R}^d}f\ d\mu\ge \limsup_{n\to\infty}\int_{\mathbb{R}^d}f\ d\mu_n
$$
(5) For every open set $G\subseteq \mathbb{R}^d$,
$$
\mu(G)\le \liminf_{n\to \infty} \mu_n(G)
$$
(6) For every closed set $F\subseteq \mathbb{R}^d$,
$$
\mu(F)\ge \limsup_{n\to \infty} \mu_n(F)
$$
(7) For every $A\in \mathcal{B}(\mathbb{R}^d)$ with $\mu(\partial A) = 0$,
$$
\lim_{n\to\infty}\mu_n(A) = \mu(A).
$$
Proof: In most books on advanced probability theory or measure theory, one can find the equivalence of (1), (2), (5), (6), and (7); see [6]. Clearly, (3) and (4) are equivalent. Thus it remains to prove: (2)$\Longrightarrow$(3) and (3)$\Longrightarrow$(5).
First, by Corollary 1, it is clear that (3)$\Longrightarrow$(5). Therefore it suffices to prove (2)$\Longrightarrow$(3).
Indeed, according to Lemma 2, it is also easy to prove this implication. The proof is complete!
We also have the famous Prokhorov theorem. This theorem is very important, but we omit its proof here. (Again, see [6].)
Definition 6 (Tightness) A family of probability measures $\Lambda$ on $(X,\mathcal{B}(X))$ is called tight if for every $\varepsilon>0$, there exists a compact set $K$ such that
$$
\mu(K^c)\le \varepsilon,\quad \forall \mu\in \Lambda.
$$
Definition 7 (Relative compactness) A family of probability measures $\Lambda$ on $(X,\mathcal{B}(X))$ is called relatively compact if its closure $\bar{\Lambda}$ is compact in $\mathcal{P}(X)$, that is, every sequence in $\Lambda$ has a subsequence which converges weakly to some measure.
Theorem 5 (Prokhorov theorem) Let $\Lambda\subset \mathcal{P}(X)$. Then
(1) If $\Lambda$ is tight, then $\Lambda$ is relatively compact in $\mathcal{P}(X)$.
(2) If $X$ is a Polish space, then the relative compactness of $\Lambda$ implies the tightness of $\Lambda$.
For $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$, since it is a Polish space, we have the following corollary.
Corollary 2 A family of probability measures $\Lambda$ on $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$ is compact if and only if $\Lambda$ is closed and tight.
Note: Here the topology on $\mathcal{P}(X)$ is the topology corresponding to weak convergence of measures.
With the above preparations, we can now give the proofs of Theorem 1 and Theorem 2 above.
Theorem 1 For $p\ge 1$ and any $\mu,\nu\in \mathcal{P}_p(\mathbb{R}^d)$, the set $\Pi_p(\mu,\nu)$ is nonempty, convex, and compact in the topology of weak convergence.
Proof: First prove nonemptiness: clearly the product measure $\mu\times \nu\in \Pi_p(\mu,\nu)$, so nonemptiness holds.
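In the discrete case the product coupling and its marginal conditions are easy to check directly; a small sketch assuming NumPy:

```python
import numpy as np

# The product coupling mu x nu is always a coupling: for discrete measures it
# is the outer-product matrix, whose row sums are mu and column sums are nu.
mu = np.array([0.2, 0.3, 0.5])
nu = np.array([0.6, 0.4])
pi = np.outer(mu, nu)                    # pi[i, j] = mu[i] * nu[j]
print(np.allclose(pi.sum(axis=1), mu))   # True: first marginal is mu
print(np.allclose(pi.sum(axis=0), nu))   # True: second marginal is nu
```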
Next prove convexity: if $\pi_1,\pi_2\in \Pi_p(\mu,\nu)$, and $\theta\in [0,1]$ is arbitrary, we want to show that $\theta\pi_1+(1-\theta)\pi_2\in \Pi_p(\mu,\nu)$.
Clearly $\theta\pi_1+(1-\theta)\pi_2\in \mathcal{P}_p(\mathbb{R}^d\times \mathbb{R}^d)$ (its $p$-th moment is a convex combination of two finite ones), and for any $A\in \mathcal{B}(\mathbb{R}^d)$,
$$
(\theta\pi_1+(1-\theta)\pi_2)(A\times \mathbb{R}^d) = \theta\pi_1(A\times \mathbb{R}^d)+(1-\theta)\pi_2(A\times \mathbb{R}^d) = \theta\mu(A)+(1-\theta)\mu(A) = \mu(A)
$$
Similarly, for any $B\in \mathcal{B}(\mathbb{R}^d)$,
$$
(\theta\pi_1+(1-\theta)\pi_2)( \mathbb{R}^d\times B) = \nu(B)
$$
Therefore $\theta\pi_1+(1-\theta)\pi_2\in \Pi_p(\mu,\nu)$. Thus $\Pi_p(\mu,\nu)$ is convex.
Finally, prove compactness. By Prokhorov’s theorem, it suffices to prove that it is tight and closed.
Since $\mathbb{R}^d$ is a Polish space, the singletons $\{\mu\}$ and $\{\nu\}$ are relatively compact, hence tight by Prokhorov's theorem (2). Therefore for any $\varepsilon>0$, there exists a compact set $K$ (taking the union of the two compact sets if necessary) such that
$$
\mu(K^c)<\frac{\varepsilon}{2},\quad \nu(K^c)<\frac{\varepsilon}{2}
$$
Since $K\times K$ is compact, for any $\pi\in \Pi_p(\mu,\nu)$,
$$
\pi((K\times K)^c)\le \pi( \mathbb{R}^d\times K^c)+\pi(K^c\times \mathbb{R}^d) = \nu(K^c)+\mu(K^c)<\varepsilon.
$$
Hence $\Pi_p(\mu,\nu)$ is tight.
Now prove that it is closed. Suppose $\pi_n\in \Pi_p(\mu,\nu)$ and $\pi_n$ converges weakly to $\pi$. We want to show that $\pi\in \Pi_p(\mu,\nu)$.
By the definition of weak convergence of measures, for every bounded continuous function $f:\mathbb{R}^d\times \mathbb{R}^d\to\mathbb{R}$, one has
$$
\int_{\mathbb{R}^d\times \mathbb{R}^d}f(x,y)\ \pi_n(dx,dy)\to \int_{\mathbb{R}^d\times \mathbb{R}^d}f(x,y)\ \pi(dx,dy),\quad n\to\infty.
$$
In particular, for any bounded continuous $F:\mathbb{R}^d\to\mathbb{R}$, the function $f(x,y) = F(x)$ is bounded and continuous, and by the marginal property of $\pi_n$,
$$
\int_{\mathbb{R}^d}F(x)\ \mu(dx)=\int_{\mathbb{R}^d\times \mathbb{R}^d}f(x,y)\ \pi_n(dx,dy)\to \int_{\mathbb{R}^d\times \mathbb{R}^d}F(x)\ \pi(dx,dy),\quad n\to\infty.
$$
Since the left-hand side does not depend on $n$, we get
$$
\int_{\mathbb{R}^d}F(x)\ \mu(dx) = \int_{\mathbb{R}^d\times \mathbb{R}^d}F(x)\ \pi(dx,dy)
$$
for every bounded continuous $F$. Two probability measures on $\mathbb{R}^d$ that integrate all bounded continuous functions identically must coincide, so the first marginal of $\pi$ equals $\mu$, i.e. $\pi(A\times \mathbb{R}^d) = \mu(A)$ for all $A\in \mathcal{B}(\mathbb{R}^d)$. Similarly, $\pi(\mathbb{R}^d\times B) = \nu(B)$ for all $B\in \mathcal{B}(\mathbb{R}^d)$. Finally,
$$
\int_{\mathbb{R}^d\times \mathbb{R}^d}(|x|^p+|y|^p)\ \pi(dx,dy) = \int_{\mathbb{R}^d}|x|^p\ \mu(dx)+\int_{\mathbb{R}^d}|y|^p\ \nu(dy)<+\infty,
$$
so $\pi\in \mathcal{P}_p(\mathbb{R}^d\times \mathbb{R}^d)$, and thus $\pi\in \Pi_p(\mu,\nu)$.
Therefore, by Prokhorov’s theorem, $\Pi_p(\mu,\nu)$ is compact.
Theorem 2 There exists $\bar\gamma\in \Pi_p(\mu,\nu)$ such that
$$
W_p(\mu,\nu) = \left(\int_{\mathbb{R}^d\times \mathbb{R}^d}|x-y|^p \bar\gamma(dx,dy)\right)^{1/p}.
$$
Proof: By definition,
$$
W_p(\mu,\nu) = \inf_{\gamma\in \Pi_p(\mu,\nu)}\left(\int_{\mathbb{R}^d\times \mathbb{R}^d}|x-y|^p \gamma(dx,dy)\right)^{1/p}.
$$
Denote
$$
I(\gamma) =\int_{\mathbb{R}^d\times \mathbb{R}^d}|x-y|^p \gamma(dx,dy)
$$
By the definition of infimum, there exists a sequence $\gamma_n\in \Pi_p(\mu,\nu)$ such that
$$
I(\gamma_n)\to W_p^p(\mu,\nu).
$$
By Theorem 1, $\Pi_p(\mu,\nu)$ is compact, so there exists a subsequence $\gamma_{n_k}$ converging weakly to some $\bar\gamma\in \Pi_p(\mu,\nu)$. Since $|x-y|^p$ is continuous and nonnegative, hence nonnegative lower semi-continuous, by the Portmanteau theorem (3),
$$
I(\bar\gamma)\le\liminf_{k\to\infty} I(\gamma_{n_k}) = W_p^p(\mu,\nu).
$$
On the other hand, since $\bar\gamma\in \Pi_p(\mu,\nu)$, by definition $I(\bar\gamma)\ge W_p^p(\mu,\nu)$. Hence $I(\bar\gamma)= W_p^p(\mu,\nu)$, and the proof is complete.
Remark: Although $|x-y|^p$ is continuous, it is unbounded, so we cannot directly use the definition of weak convergence. This is why we invoke the Portmanteau theorem (3), which only requires nonnegativity and lower semi-continuity.
Now let us prove the positive definiteness of the Wasserstein distance, namely (III): $W_p(\mu,\nu) = 0$ if and only if $\mu = \nu$.
Proof: Necessity: if $W_p(\mu,\nu) = 0$, then by Theorem 2, there exists $\pi\in \Pi_p(\mu,\nu)$ such that
$$
\int_{\mathbb{R}^d\times \mathbb{R}^d}|x-y|^p \pi(dx,dy) =W_p^p(\mu,\nu) = 0.
$$
Hence $x=y,\ \pi$-a.s. Therefore, for any $A\in \mathcal{B}(\mathbb{R}^d)$,
$$
\begin{aligned}
\mu(A)&= \int_{\mathbb{R}^d}1_A(x)\ \mu(dx) = \int_{\mathbb{R}^d\times \mathbb{R}^d}1_A(x) \pi(dx,dy)\\
& = \int_{\mathbb{R}^d\times \mathbb{R}^d}1_A(y) \pi(dx,dy) = \int_{\mathbb{R}^d}1_A(y)\ \nu(dy)\\
& = \nu(A).
\end{aligned}
$$
That is, $\mu=\nu$.
Sufficiency: if $\mu=\nu$, take $\pi$ to be the image of $\mu$ under the map $x\mapsto (x,x)$, i.e. $\pi(A\times B) = \mu(A\cap B)$. Both marginals of $\pi$ are $\mu = \nu$ and its $p$-th moment is finite, so $\pi\in \Pi_p(\mu,\nu)$, and
$$
W_p^p(\mu,\nu)\le I(\pi) = \int_{\mathbb{R}^d\times \mathbb{R}^d}|x-y|^p\ \pi(dx,dy) = 0.
$$
Hence $W_p(\mu,\nu) = 0$.
Finally, we still need to prove the triangle inequality, namely (V): $W_p(\mu,\lambda)\le W_p(\mu,\nu)+W_p(\nu,\lambda)$. To do so, we need the gluing lemma. For this purpose, we make some further preparations.
Definition 8 (Probability kernel) Let $(\Omega_1,\mathcal{F}_1)$ and $(\Omega_2,\mathcal{F}_2)$ be measurable spaces. A mapping $K: \Omega_1\times \mathcal{F}_2\to [0,+\infty]$ is called a kernel from $(\Omega_1,\mathcal{F}_1)$ to $(\Omega_2,\mathcal{F}_2)$ if
(1) For every $\omega\in \Omega_1$, $K(\omega,\cdot)$ is a measure on $(\Omega_2,\mathcal{F}_2)$.
(2) For every $A\in \mathcal{F}_2$, $K(\cdot,A)$ is an $\mathcal{F}_1$-measurable mapping.
If in (1) we further require that it is a probability measure, then $K$ is called a probability kernel.
Definition 9 (Disintegration of a measure [7]) Let $(\Omega_1,\mathcal{F}_1,\mu)$ be a measure space and $(\Omega_2,\mathcal{F}_2)$ a measurable space, and let $\lambda$ be a measure on the product space $(\Omega_1\times \Omega_2, \mathcal{F}_1\times \mathcal{F}_2)$. If there exists a kernel $K$ from $(\Omega_1,\mathcal{F}_1)$ to $(\Omega_2,\mathcal{F}_2)$ such that
$$
\lambda(A\times B) = \int_A K(x,B)\ \mu(dx),\quad \forall A\in\mathcal{F}_1,B\in \mathcal{F}_2.
$$
then $K$ is called the disintegration kernel of $\lambda$ with respect to $\mu$.
According to Theorem 14.D.10 in Section D of Chapter 14 in [7], we have the following theorem, which guarantees the existence of a disintegration kernel.
Theorem 6 Let $(\Omega_1,\mathcal{F}_1)$ and $(\Omega_2,\mathcal{F}_2)$ be measurable spaces, and suppose that $(\Omega_2,\mathcal{F}_2)$ is a Polish space. Let $\lambda$ be a $\sigma$-finite measure on the product space $(\Omega_1\times \Omega_2, \mathcal{F}_1\times \mathcal{F}_2)$, and suppose that its marginal measure $\lambda_1$ on $(\Omega_1,\mathcal{F}_1)$ is still $\sigma$-finite. Then the disintegration kernel $K$ of $\lambda$ with respect to $\lambda_1$ exists and is unique in the $\lambda_1$-a.s. sense.
With the above preparations, we can now present the gluing lemma, which also has important applications in optimal transport problems.
Theorem 7 (Gluing lemma) For $\mu,\nu,\lambda\in \mathcal{P}(\mathbb{R}^d)$, and $\pi_1\in \Pi(\mu,\nu),\pi_2\in \Pi(\nu,\lambda)$ (couplings defined as in Definition 2, without the moment restriction), there exists a probability measure $\gamma\in \mathcal{P}(\mathbb{R}^d\times \mathbb{R}^d\times\mathbb{R}^d)$ such that
$$
\gamma P_{12}^{-1} = \pi_1,\quad \gamma P_{23}^{-1} = \pi_2,
$$
where $P_{12}(x,y,z) = (x,y)$ and $P_{23}(x,y,z) = (y,z)$ are projection mappings, and $\gamma P_{12}^{-1}$ and $\gamma P_{23}^{-1}$ are the corresponding pushforward measures, namely for any $A,B\in \mathcal{B}(\mathbb{R}^d\times \mathbb{R}^d)$,
$$
\gamma P_{12}^{-1}(A) = \gamma(P_{12}^{-1}(A)),\quad \gamma P_{23}^{-1}(B) = \gamma(P_{23}^{-1}(B))
$$
Remark: The gluing lemma expresses that when $\pi_1$ and $\pi_2$ share a common marginal distribution $\nu$, one can “glue” them together to obtain a $\gamma\in \mathcal{P}(\mathbb{R}^d\times \mathbb{R}^d\times\mathbb{R}^d)$ whose corresponding two-dimensional marginals are exactly $\pi_1$ and $\pi_2$.
Proof: For any $A,B,C\in \mathcal{B}(\mathbb{R}^d)$, by Theorem 6 (applied to $\pi_1$ with its coordinates swapped, so that both disintegrations are with respect to the common marginal $\nu$), there exist disintegration kernels $K_1,K_2$ such that
$$
\pi_1(A\times B) = \int_B K_1(y,A)\nu(dy),\quad \pi_2(B\times C) = \int_B K_2(y,C)\nu(dy).
$$
Define a set function on the semialgebra
$$
J_0 = \{A\times B\times C\ ;\ A,B,C\in \mathcal{B}(\mathbb{R}^d)\}
$$
by
$$
\gamma(A\times B\times C) = \int_B K_1(y,A)K_2(y,C)\nu(dy).
$$
Then
$$
\gamma(A\times B\times \mathbb{R}^d) = \int_B K_1(y,A)K_2(y,\mathbb{R}^d)\nu(dy) = \int_B K_1(y,A)\nu(dy) = \pi_1(A\times B).
$$
And
$$
\gamma(\mathbb{R}^d\times B\times C) = \int_B K_1(y,\mathbb{R}^d)K_2(y,C)\nu(dy) = \int_B K_2(y,C)\nu(dy) = \pi_2(B\times C).
$$
Then one only needs to extend $\gamma$ uniquely to $\mathcal{B}(\mathbb{R}^d\times \mathbb{R}^d\times\mathbb{R}^d)$ via the Carathéodory extension theorem, imitating the construction of finite product measures.
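In the discrete case the gluing construction is completely explicit: the disintegration kernels become conditional probability matrices, and the glued measure is their product against the shared marginal. A sketch assuming NumPy:

```python
import numpy as np

# Discrete gluing: couplings P1 of (mu, nu) and P2 of (nu, lam) sharing the
# middle marginal nu are glued via gamma[i,j,k] = P1[i,j] * P2[j,k] / nu[j],
# the discrete analogue of multiplying the two disintegration kernels.
nu = np.array([0.25, 0.75])
P1 = np.array([[0.25, 0.25],
               [0.00, 0.50]])      # coupling of mu = (0.5, 0.5) and nu
P2 = np.array([[0.25, 0.00],
               [0.15, 0.60]])      # coupling of nu and lam = (0.4, 0.6)

gamma = P1[:, :, None] * P2[None, :, :] / nu[None, :, None]

print(np.allclose(gamma.sum(axis=2), P1))   # (1,2)-marginal recovers P1: True
print(np.allclose(gamma.sum(axis=0), P2))   # (2,3)-marginal recovers P2: True
print(abs(gamma.sum() - 1.0) < 1e-12)       # gamma is a probability measure: True
```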
With these preparations, we can now give the proof of the triangle inequality, namely (V): $W_p(\mu,\lambda)\le W_p(\mu,\nu)+W_p(\nu,\lambda)$.
Proof of (V): By Theorem 2, there exist optimal couplings $\pi_1\in \Pi_p(\mu,\nu)$ and $\pi_2\in \Pi_p(\nu,\lambda)$, namely
$$
W_p(\mu,\nu) = \left(\int_{\mathbb{R}^d\times \mathbb{R}^d}|x-y|^p\ \pi_1(dx,dy)\right)^{1/p}
$$
and
$$
W_p(\nu,\lambda) = \left(\int_{\mathbb{R}^d\times \mathbb{R}^d}|y-z|^p\ \pi_2(dy,dz)\right)^{1/p}.
$$
By the gluing lemma, there exists a probability measure $\gamma\in \mathcal{P}(\mathbb{R}^d\times \mathbb{R}^d\times\mathbb{R}^d)$ such that
$$
\gamma P_{12}^{-1} = \pi_1,\quad \gamma P_{23}^{-1} = \pi_2,
$$
where $P_{12}(x,y,z) = (x,y)$ and $P_{23}(x,y,z) = (y,z)$ are projection mappings, and $\gamma P_{12}^{-1}$ and $\gamma P_{23}^{-1}$ are the corresponding pushforward measures. Define $\pi\in \mathcal{P}(\mathbb{R}^d\times \mathbb{R}^d)$ by
$$
\pi = \gamma P_{13}^{-1},
$$
where $P_{13}(x,y,z) = (x,z)$ is also a projection mapping. It is easy to prove that $\pi\in \mathcal{P}_p(\mathbb{R}^d\times \mathbb{R}^d)$, and for any $A,B\in \mathcal{B}(\mathbb{R}^d)$,
$$
\pi(A\times \mathbb{R}^d) = \gamma(P_{13}^{-1}(A\times \mathbb{R}^d)) = \gamma(A\times \mathbb{R}^d\times \mathbb{R}^d) = \pi_1(A\times \mathbb{R}^d) = \mu(A),
$$
$$
\pi(\mathbb{R}^d\times B) = \gamma(P_{13}^{-1}(\mathbb{R}^d\times B)) = \gamma(\mathbb{R}^d\times \mathbb{R}^d\times B) = \pi_2(\mathbb{R}^d\times B) = \lambda(B)
$$
Hence $\pi\in \Pi_p(\mu,\lambda)$. Therefore,
$$
\begin{aligned}
W_p(\mu,\lambda)&\le \left(\int_{\mathbb{R}^d\times \mathbb{R}^d}|x-z|^p\ \pi(dx,dz)\right)^{1/p}\\
& = \left(\int_{\mathbb{R}^d\times \mathbb{R}^d\times \mathbb{R}^d}|x-z|^p\ \gamma(dx,dy,dz)\right)^{1/p}\\
&\le \left(\int_{\mathbb{R}^d\times \mathbb{R}^d\times \mathbb{R}^d}|x-y|^p\ \gamma(dx,dy,dz)\right)^{1/p}+ \left(\int_{\mathbb{R}^d\times \mathbb{R}^d\times \mathbb{R}^d}|y-z|^p\ \gamma(dx,dy,dz)\right)^{1/p}\\
& = \left(\int_{\mathbb{R}^d\times \mathbb{R}^d}|x-y|^p\ \pi_1(dx,dy)\right)^{1/p}+ \left(\int_{\mathbb{R}^d\times \mathbb{R}^d}|y-z|^p\ \pi_2(dy,dz)\right)^{1/p}\\
& = W_p(\mu,\nu)+W_p(\nu,\lambda).
\end{aligned}
$$
where the third line uses Minkowski’s inequality.
At this point, we have proved that the Wasserstein distance is indeed a distance on $\mathcal{P}_p(\mathbb{R}^d)$.
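The metric axioms just proved can be checked empirically in one dimension, where SciPy provides the $W_1$ distance between empirical measures. A sketch assuming NumPy and SciPy:

```python
import numpy as np
from scipy.stats import wasserstein_distance   # 1-d W_1 between empirical measures

# Empirical check of (III) and (V) for W_1 on the real line.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 5000)       # samples of mu
b = rng.exponential(1.0, 5000)       # samples of nu
c = rng.uniform(-2.0, 2.0, 5000)     # samples of lambda

w_ab = wasserstein_distance(a, b)
w_bc = wasserstein_distance(b, c)
w_ac = wasserstein_distance(a, c)
print(wasserstein_distance(a, a) == 0.0)   # True: W_1(mu, mu) = 0
print(w_ac <= w_ab + w_bc)                 # True: triangle inequality (V)
```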
References
[1] Thorpe M. Introduction to optimal transport. Notes of Course at University of Cambridge, 2018.
[2] Chewi S, Niles-Weed J, Rigollet P. Statistical optimal transport. arXiv preprint arXiv:2407.18163, 2024.
[3] Santambrogio F. Optimal Transport for Applied Mathematicians. Basel: Birkhäuser, 2015.
[5] Liu Haoyang, Hu Jiang, Li Yongfeng, Wen Zaiwen. Optimization: Modeling, Algorithms and Theory. Beijing: Higher Education Press, 2020. (In Chinese.)
[6] Yan Jia'an. Lectures on Measure Theory. 3rd ed. Beijing: Science Press, 2021. (In Chinese.)
[7] François Baccelli, Bartłomiej Błaszczyszyn, Mohamed Karray. Random Measures, Point Processes, and Stochastic Geometry. Inria, 2024. ⟨hal-02460214v2⟩
The cover image of this article was taken in Oberammergau, Germany.