## Automatic differentiation

Just a quick post to point people in the direction of a nice blog post by Damon McDougall on automatic differentiation, which can be very useful in performing sensitivity analysis and hence uncertainty quantification for numerical models.
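For readers who have not met automatic differentiation before, the core idea fits in a few lines. Below is a minimal sketch (the class and function names are my own, not taken from any AD library) of forward-mode automatic differentiation using dual numbers: we carry a derivative alongside each value and propagate it through the sum and product rules.

```python
# A minimal sketch of forward-mode automatic differentiation via dual
# numbers; all names here are illustrative, not from any library.

class Dual:
    """Number of the form a + b*eps with eps**2 == 0; the 'der' slot
    carries the derivative alongside the value."""

    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (a + b eps)(c + d eps) = ac + (ad + bc) eps
        return Dual(self.val * other.val,
                    self.val * other.der + self.der * other.val)

    __rmul__ = __mul__


def derivative(f, x):
    """Evaluate f at x + eps and read off the derivative slot."""
    return f(Dual(x, 1.0)).der


# Sensitivity of the output of f(x) = x**3 + 2*x to its input at x = 2:
# the exact answer is 3*x**2 + 2 = 14.
print(derivative(lambda x: x * x * x + 2 * x, 2.0))  # 14.0
```

Derivatives computed this way are exact up to floating-point roundoff, which is precisely what makes the technique attractive for sensitivity analysis.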

## Article: Optimal uncertainty quantification for legacy data observations of Lipschitz functions

I’m happy to report that the article “Optimal uncertainty quantification for legacy data observations of Lipschitz functions”, jointly written with Mike McKerns, Dominik Meyer, Florian Theil, Houman Owhadi and Michael Ortiz has now appeared in ESAIM: Mathematical Modelling and Numerical Analysis, vol. 47, no. 6. The preprint version can be found at arXiv:1202.1928.

## Inverse function theorem for Lipschitz functions

Recently, while wandering the corridors of the Mathematics Department, I overheard one of the graduate students explaining the Inverse Function Theorem to some first-year undergraduates. On a non-rigorous level, the Inverse Function Theorem is one of the most accessible (or “obvious”) results in elementary calculus: if the graph of a function y = f(x) has a well-defined and non-zero slope (derivative) s at some point x0, then

1. we ought to be able to write x as a function of y, i.e. x = f−1(y) for y near f(x0),
2. and, moreover, the slope of the inverse function f−1 at f(x0) should be 1/s.

The “visual proof” of this statement amounts to sketching the graph of f, observing that the graph of f−1 (if the inverse function exists at all) is the graph of f with the x and y axes interchanged, and hence that if the slope of f is approximately Δy/Δx then the slope of f−1 is approximately Δx/Δy, i.e. the reciprocal of that of f.
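The reciprocal-slope heuristic is easy to check numerically. The following sketch (function names illustrative) approximates both slopes by central differences for the strictly increasing function f(x) = x³ + x, computing the inverse by bisection; the slope of f at x0 = 1 is s = 4, so the slope of f−1 at f(1) = 2 should be 1/4.

```python
# Numerical illustration (not a proof) of the reciprocal-slope claim
# for f(x) = x**3 + x, which is strictly increasing on the real line.

def f(x):
    return x**3 + x

def f_inv(y, tol=1e-12):
    # Invert f by bisection; valid because f is strictly increasing.
    lo, hi = -10.0, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < y else (lo, mid)
    return 0.5 * (lo + hi)

h = 1e-6
slope_f = (f(1 + h) - f(1 - h)) / (2 * h)            # approx. 4
slope_inv = (f_inv(2 + h) - f_inv(2 - h)) / (2 * h)  # approx. 1/4
print(slope_f, slope_inv, 1 / slope_f)
```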

Recall that the derivative of a function f: ℝn → ℝm is the rectangular m × n matrix of partial derivatives

$\displaystyle \mathrm{D} f(x) = \begin{bmatrix} \dfrac{\partial f_{1}}{\partial x_{1}} & \cdots & \dfrac{\partial f_{1}}{\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_{m}}{\partial x_{1}} & \cdots & \dfrac{\partial f_{m}}{\partial x_{n}} \end{bmatrix}$

whenever all these partial derivatives exist. With this notation, a more careful statement of the Inverse Function Theorem is that if f: ℝn → ℝn is continuously differentiable in a neighbourhood of x0 and the square n × n matrix of partial derivatives Df(x0) is invertible, then there exist neighbourhoods U of x0 and V of f(x0) and a continuously differentiable function g: V → ℝn (called a local inverse for f) such that

• for every u ∈ U, g(f(u)) = u, and
• for every v ∈ V, f(g(v)) = v.
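In practice, the local inverse g can be computed numerically by Newton's method, which exploits exactly the invertibility of Df. A sketch (names illustrative) for f(x, y) = (x + y², y + x²), whose derivative at the origin is the identity matrix and hence invertible:

```python
import numpy as np

# Sketch: computing the local inverse guaranteed by the theorem via
# Newton's method.  Here f(x, y) = (x + y**2, y + x**2), so Df(0, 0)
# is the 2x2 identity matrix, which is invertible.

def f(p):
    x, y = p
    return np.array([x + y**2, y + x**2])

def Df(p):
    x, y = p
    return np.array([[1.0, 2 * y], [2 * x, 1.0]])

def local_inverse(v, x_init, steps=20):
    """Solve f(u) = v by Newton iteration, starting near x0 = x_init."""
    u = np.array(x_init, dtype=float)
    for _ in range(steps):
        u = u - np.linalg.solve(Df(u), f(u) - v)
    return u

u = np.array([0.1, -0.05])
v = f(u)
print(local_inverse(v, [0.0, 0.0]))  # recovers u = [0.1, -0.05]
```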

An interesting question to ask is whether one really needs continuous differentiability of f. For example, Rademacher’s theorem says that whenever f satisfies a Lipschitz condition of the form

$\displaystyle | f(x) - f(y) | \leq C | x - y | \text{ for all } x, y \in \mathbb{R}^{n}$

for some constant C ≥ 0 it follows that f is differentiable almost everywhere in ℝn with derivative having norm at most C. Is this sufficient? It turns out, courtesy of a theorem of F. H. Clarke, that the Inverse Function Theorem does hold true for Lipschitz functions provided that one adopts the right generalized interpretation of the derivative of f.

The (set-valued) generalized derivative Df(x0) of f: ℝn → ℝm at x0 is defined to be the convex hull of the set of all matrices M ∈ ℝm×n that arise as a limit

$\displaystyle M = \lim_{k \to \infty} \mathrm{D} f(x_{k})$

for some sequence (xk) in ℝn of differentiability points of f that converges to x0. One can show that, when f satisfies a Lipschitz condition in a neighbourhood of x0, Df(x0) is a non-empty, compact, convex subset of ℝm×n. The generalized derivative Df(x0) is said to be of maximal rank if every M ∈ Df(x0) has maximal rank (i.e. has rank(M) = min(m, n)).
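The absolute value function gives the simplest illustration of this definition. The following sketch (names illustrative) follows sequences of differentiability points into the origin from each side and records the limits of the ordinary derivatives; the generalized derivative is then the convex hull of those limits.

```python
# Sketch: approximating the generalized derivative of f(x) = |x| at 0.
# Away from 0, f is differentiable with f'(x) = sign(x); we follow
# sequences of differentiability points x_k -> 0 from each side and
# take limits of the ordinary derivatives.

def fprime(x):
    return 1.0 if x > 0 else -1.0  # ordinary derivative of |x|, x != 0

right_limits = [fprime(2.0**-k) for k in range(1, 50)]
left_limits = [fprime(-(2.0**-k)) for k in range(1, 50)]

# The limiting derivatives are +1 and -1, so the generalized
# derivative at 0 is their convex hull, the interval [-1, +1].
print(set(right_limits), set(left_limits))
```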

Theorem. (Clarke, 1976) If f: ℝn → ℝn satisfies a Lipschitz condition in some neighbourhood of x0 and Df(x0) ⊆ ℝn×n is of maximal rank, then there exist neighbourhoods U of x0 and V of f(x0) and a Lipschitz function g: V → ℝn such that

• for every u ∈ U, g(f(u)) = u, and
• for every v ∈ V, f(g(v)) = v.

It’s very important to note the maximal rank condition in Clarke’s Inverse Function Theorem: we need every matrix M in the generalized derivative to be non-singular. So, for example, the absolute value function on the real line ℝ does not satisfy the hypotheses of Clarke’s theorem at x = 0, even though it is Lipschitz with Lipschitz constant 1, since its generalized derivative at 0 is

$\displaystyle \mathrm{D} |\cdot| (0) = \bigl\{ [\ell] \in \mathbb{R}^{1 \times 1} \big| -1 \leq \ell \leq 1 \bigr\},$

which contains the non-invertible derivative matrix [0]. It is hardly surprising that the Inverse Function Theorem cannot be applied here since the absolute value function is non-injective in any neighbourhood of 0: both +δ and −δ map to +δ. On the other hand, the function f defined by

$\displaystyle f(x) := 2 x + |x| = \begin{cases} x, & \text{if } x \leq 0 \\ 3x, & \text{if } x \geq 0. \end{cases}$

has generalized derivative at 0 given by

$\displaystyle \mathrm{D} f(0) = \bigl\{ [\ell] \in \mathbb{R}^{1 \times 1} \big| 1 \leq \ell \leq 3 \bigr\},$

which is of maximal rank. The local (in fact, global) Lipschitz inverse of this function f is, unsurprisingly,

$\displaystyle f^{-1}(y) := \begin{cases} y, & \text{if } y \leq 0 \\ y/3, & \text{if } y \geq 0. \end{cases}$
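The claimed inverse is easy to verify numerically; a quick sketch:

```python
# Quick numerical check of the worked example: f(x) = 2*x + abs(x) and
# its claimed global Lipschitz inverse.

def f(x):
    return 2 * x + abs(x)

def f_inv(y):
    return y if y <= 0 else y / 3

xs = [x / 10 for x in range(-50, 51)]
assert all(abs(f_inv(f(x)) - x) < 1e-12 for x in xs)
assert all(abs(f(f_inv(y)) - y) < 1e-12 for y in xs)
print("inverse verified on [-5, 5]")
```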

## Article: Optimal Uncertainty Quantification

Almost three years on from the initial submission, the article “Optimal Uncertainty Quantification”, jointly written with Houman Owhadi, Clint Scovel, Mike McKerns and Michael Ortiz, is now in print. It will appear in this year’s second-quarter issue of SIAM Review, and is already accessible online for those with SIAM subscriptions; the preprint version can be found at arXiv:1009.0679.

This paper was a real team effort, with everyone bringing different strengths to the table. Given the length of the review process, I think that our corresponding author Houman Owhadi deserves a medal for his patience (as does Ilse Ipsen, the article’s editor at SIAM Review), but, really, congratulations and thanks to all. 🙂

We propose a rigorous framework for Uncertainty Quantification (UQ) in which the UQ objectives and the assumptions/information set are brought to the forefront. This framework, which we call Optimal Uncertainty Quantification (OUQ), is based on the observation that, given a set of assumptions and information about the problem, there exist optimal bounds on uncertainties: these are obtained as values of well-defined optimization problems corresponding to extremizing probabilities of failure, or of deviations, subject to the constraints imposed by the scenarios compatible with the assumptions and information. In particular, this framework does not implicitly impose inappropriate assumptions, nor does it repudiate relevant information. Although OUQ optimization problems are extremely large, we show that under general conditions they have finite-dimensional reductions. As an application, we develop Optimal Concentration Inequalities (OCI) of Hoeffding and McDiarmid type. Surprisingly, these results show that uncertainties in input parameters, which propagate to output uncertainties in the classical sensitivity analysis paradigm, may fail to do so if the transfer functions (or probability distributions) are imperfectly known. We show how, for hierarchical structures, this phenomenon may lead to the non-propagation of uncertainties or information across scales. In addition, a general algorithmic framework is developed for OUQ and is tested on the Caltech surrogate model for hypervelocity impact and on the seismic safety assessment of truss structures, suggesting the feasibility of the framework for important complex systems. The introduction of this paper provides both an overview of the paper and a self-contained mini-tutorial about basic concepts and issues of UQ.

One of the theorems that I make frequent use of in my uncertainty quantification (UQ) research concerns probability measures on Radon spaces. Without going into details, the UQ methods that I like to work with will work fine if the spaces where your uncertain parameters / functions / models / other gubbins live are Radon spaces, and might fail to work otherwise; therefore, it’s important to know what a Radon space is, and whether being Radon is a serious restriction. (It’s also very useful to have some pithy examples to hand in response to questions about Radon spaces in talks and poster presentations!)

So… the definition. Consider a topological space (X, T) and a probability measure μ: ℬ(T) → [0, 1] defined on the Borel σ-algebra ℬ(T) (i.e. the smallest σ-algebra on X that contains all the open sets, i.e. those sets that are listed in the topology T). The measure μ is said to be inner regular if, for every ℬ(T)-measurable set E ⊆ X,

$\mu(E) = \sup \bigl\{ \mu(K) \big| K \subseteq E \mbox{ and } K \mbox{ is compact} \bigr\}.$

This is often informally read as saying that the measure of an arbitrary measurable set can be approximated from within by compact sets. The space (X, T) is called a pseudo-Radon space (my terminology) if every probability measure μ on ℬ(T) is inner regular; if the space is also separable and metrizable, then it is called a Radon space (more standard in the literature, e.g. the book of Ambrosio, Gigli & Savaré on gradient flows in metric spaces).

So, what spaces are (pseudo-)Radon spaces? It turns out that most of the “nice” spaces that one might want to consider are Radon:

• any compact subset of n-dimensional Euclidean space ℝn is Radon,
• indeed, Euclidean space itself is Radon,
• as is any Polish space (i.e. a separable and completely metrizable space),
• as indeed is any Suslin space (i.e. a continuous Hausdorff image of a Polish space).

This all seems to suggest that non-Radon spaces must be very weird beasts indeed, perhaps spaces that are topologically very large, so much so that they cannot be the image of a separable space. However, there are, in fact, “small” examples of non-Radon spaces. Since just one counterexample will suffice, it’s enough to find a single example of a non-inner-regular measure on a space to show that it is not (pseudo-)Radon.

## Three Bernstein inequalities

This post concerns three unrelated inequalities, or families of inequalities, that all go by the name of Bernstein’s inequality, after Sergei Natanovich Bernstein (Сергей Натанович Бернштейн). In no particular order, they are:

• a family of inequalities in probability theory that bound the deviations of independent or weakly dependent random variables,
• an inequality bounding the Lp norms of band-limited functions, and
• an inequality bounding the derivatives of complex polynomials.

1. Deviations of Random Variables

There are quite a few Bernstein inequalities for families of independent or weakly correlated random variables X1, …, Xn. The following inequality is one of the simplest, but gives the general idea. Let X1, …, Xn be independent (but not necessarily identically distributed) mean-zero real-valued random variables taking values in [−R, +R]. Then for any t > 0,

$\displaystyle \mathbb{P} \left[ \sum_{j=1}^{n} X_j > t \right] \leq \exp \left( - \frac12 \frac{t^2}{\sum_{j=1}^{n} \mathbb{E} \bigl[ X_j^2 \bigr] + \frac{R t}{3}} \right).$

For those readers who prefer to think in terms of the vector-valued random variable X = (X1, …, Xn), this inequality reads

$\displaystyle \mathbb{P} \left[ \bigl\| \mathbf{X} \bigr\|_{1} > t \right] \leq \exp \left( - \frac12 \frac{t^2}{\mathbb{E} \bigl[ \bigl\| \mathbf{X} \bigr\|_{2}^{2} \bigr] + \frac{t}{3} \bigl\| \mathbf{X} \bigr\|_{\infty} } \right).$

A similar inequality for negative t is easily obtained by replacing each random variable Xj by its negative. There are also many generalizations, some also known as Bernstein inequalities, and others falling under the general heading of concentration of measure.
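A Monte Carlo sanity check of the first inequality is straightforward. The sketch below (names illustrative) takes the Xj to be mean-zero uniform variables on [−1, 1], so that R = 1 and E[Xj²] = 1/3, and compares the empirical tail probability with the Bernstein bound.

```python
import numpy as np

# Monte Carlo sanity check (not a proof) of the deviation bound, with
# n = 50 independent mean-zero uniform variables on [-1, 1]:
# R = 1 and E[X_j**2] = 1/3 for each j.

rng = np.random.default_rng(0)
n, R, trials, t = 50, 1.0, 100_000, 11.0

sums = rng.uniform(-1.0, 1.0, size=(trials, n)).sum(axis=1)
empirical = np.mean(sums > t)

variance_sum = n * (1.0 / 3.0)  # sum over j of E[X_j**2]
bound = np.exp(-0.5 * t**2 / (variance_sum + R * t / 3.0))

print(empirical, bound)   # empirical tail should sit below the bound
assert empirical <= bound
```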

2. Lp Norms of Band-Limited Functions

Suppose that f: ℝn → ℂ is band-limited in the sense that its Fourier transform ℱ[f]: ℝn → ℂ is identically zero outside the ball B(0, R) of radius R centred at the origin 0 of ℝn. Suppose also that 1 ≤ p < q < ∞. Then there is a constant C, independent of f, such that

$\displaystyle \bigl\| f \bigr\|_{L^{q}(\mathbb{R}^{n})} \leq C R^{n (\frac1p - \frac1q)} \bigl\| f \bigr\|_{L^{p}(\mathbb{R}^{n})} .$

In fact, with a possibly different constant, the same inequality holds with a weak Lp quasi-norm on the right-hand side:

$\displaystyle \bigl\| f \bigr\|_{L^{q}(\mathbb{R}^{n})} \leq C R^{n (\frac1p - \frac1q)} \bigl\| f \bigr\|_{L^{p,\infty}(\mathbb{R}^{n})} .$

This inequality is not too difficult to prove using Young’s inequality for convolutions and standard interpolation between L1,∞ and L2q,∞.
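The scaling in R can be illustrated numerically in dimension n = 1. In the sketch below (names illustrative), np.sinc(x) = sin(πx)/(πx) is band-limited to [−π, π], so f_R(x) = sinc(Rx) is band-limited to radius πR, and by change of variables the ratio ‖f_R‖_q / ‖f_R‖_p should grow like R^(1/p − 1/q); we test the exponent with p = 2, q = 4 by doubling R.

```python
import numpy as np

# Numerical illustration of the R-scaling in the band-limited
# inequality (1D).  We approximate L^p norms by Riemann sums on a
# large truncated grid; sinc is in L^2 and L^4 (though not L^1).

x = np.linspace(-100.0, 100.0, 200_001)
dx = x[1] - x[0]
p, q = 2, 4

def norm_ratio(R):
    fR = np.abs(np.sinc(R * x))        # band-limited to radius pi*R
    norm_p = (np.sum(fR**p) * dx) ** (1 / p)
    norm_q = (np.sum(fR**q) * dx) ** (1 / q)
    return norm_q / norm_p

growth = norm_ratio(2.0) / norm_ratio(1.0)
print(growth, 2.0 ** (1 / p - 1 / q))  # both approximately 1.189
```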

3. Derivatives of Polynomials

Suppose that p: ℂ → ℂ is a polynomial of degree n. Then Bernstein’s inequality for the polynomial p states that, on the unit disc

$\displaystyle D := \bigl\{ z \in \mathbb{C} \big| |z| \leq 1 \bigr\},$

the degree n of p, multiplied by the maximum absolute value of p, bounds the maximum absolute value of the derivative of p:

$\displaystyle \max_{z \in D} |p'(z)| \leq n \max_{z \in D} |p(z)|.$

Similarly, for the kth derivative of p, it holds that

$\displaystyle \max_{z \in D} |p^{(k)}(z)| \leq \frac{n!}{(n-k)!} \max_{z \in D} |p(z)|.$

This Bernstein inequality is useful in the theory of polynomial approximation, since it gives some measure of approximation of derivatives “for free” once the function itself has been approximated.
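The first-derivative inequality is easy to test numerically. By the maximum modulus principle, both maxima over the closed unit disc are attained on the unit circle, so it suffices to sample there; the sketch below (names illustrative) checks random complex polynomials.

```python
import numpy as np

# Sanity check of max|p'| <= n * max|p| on the unit disc for random
# complex polynomials, evaluating on a fine grid of the unit circle.

rng = np.random.default_rng(1)
z = np.exp(1j * np.linspace(0.0, 2.0 * np.pi, 4096, endpoint=False))

for _ in range(20):
    n = int(rng.integers(1, 12))
    coeffs = rng.normal(size=n + 1) + 1j * rng.normal(size=n + 1)
    max_p = np.max(np.abs(np.polyval(coeffs, z)))
    max_dp = np.max(np.abs(np.polyval(np.polyder(coeffs), z)))
    assert max_dp <= n * max_p + 1e-9

print("Bernstein's inequality verified on 20 random polynomials")
```

For the extremal polynomial p(z) = zⁿ the inequality holds with equality, so there is no room to improve the constant n.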

## A very quick Sobolev embedding

A little while ago, while discussing scaling results in analysis, I had a conversation with a colleague who pointed out to me an elegantly brief proof of the embedding of the critical Sobolev space W1,n(ℝn) into the space BMO(ℝn) of functions of bounded mean oscillation in spatial dimension n. By definition, BMO consists of those L1 functions u such that the semi-norm

$\displaystyle \bigl\| u \bigr\|_{\mathrm{BMO}(\mathbb{R}^{n})} := \sup_B \frac1{|B|} \int_B \bigl| u(x) - u_B \bigr| \, \mathrm{d} x$

is finite, where the supremum runs over all balls B in ℝn of finite radius, |B| denotes the volume of the ball B, and uB denotes the average of u over B:

$\displaystyle u_B := \frac1{|B|} \int_B u(x) \, \mathrm{d} x .$

The usual norm on BMO is the sum of the L1 norm and the BMO semi-norm. For p < n, the Sobolev space W1,p(ℝn) embeds into a suitable Lq space; when p > n, the Sobolev space W1,p(ℝn) embeds into a space of Hölder-continuous functions. The space BMO provides a neat (although not complete — see e.g. this paper) description of what happens in the critical case p = n, namely that W1,n(ℝn) embeds into BMO(ℝn); that is, there is a constant c depending only on n such that

$\displaystyle \bigl\| u \bigr\|_{\mathrm{BMO}(\mathbb{R}^{n})} \leq c \bigl\| u \bigr\|_{W^{1,n} (\mathbb{R}^n)}.$
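To get a feel for the BMO semi-norm, consider the classic example u(x) = log|x|, which lies in BMO(ℝ) but not in L∞. The sketch below (names illustrative) approximates the mean oscillation over intervals (h, 2h) shrinking towards the singularity at 0: by scale invariance of log under dilation, the oscillation is the same for every h, even though max|u| blows up.

```python
import numpy as np

# Sketch: mean oscillation of u(x) = log|x| over intervals (a, a + h),
# approximated by a Riemann sum with m uniform sample points.

def mean_oscillation(a, h, m=100_000):
    x = np.linspace(a, a + h, m)
    u = np.log(np.abs(x))
    return np.mean(np.abs(u - np.mean(u)))

for h in [1.0, 1e-2, 1e-4, 1e-6]:
    # Roughly 0.17 for every h: dilation shifts log by a constant,
    # which the subtraction of the average u_B removes.
    print(h, mean_oscillation(h, h))
```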