# Article: Optimal uncertainty quantification for legacy data observations of Lipschitz functions

I’m happy to report that the article “Optimal uncertainty quantification for legacy data observations of Lipschitz functions”, jointly written with Mike McKerns, Dominik Meyer, Florian Theil, Houman Owhadi and Michael Ortiz, has now appeared in ESAIM: Mathematical Modelling and Numerical Analysis, vol. 47, no. 6. The preprint version can be found at arXiv:1202.1928.

Simply put, this article concerns the problem of bounding the probability of an event when we know very little about the probability distributions and other functions involved. Mathematically, we write this event as [g(X) ≤ 0], where g denotes the response function and X denotes its random inputs. In “the real world”, the output g(X) being negative might be some undesirable outcome such as a bridge collapsing under an earthquake, a power plant failing to produce enough electricity, a disease advancing beyond the initial site to become a genuine epidemic, &c. The last example is a good one for this paper, because the exact functional relationship g between the initial conditions X of the outbreak and the outcome g(X) is only partly understood: we have some historical data, but we would encounter severe legal and ethical problems if we proposed starting new pandemics just to learn more about the function g!

Following the approach begun here, this article treats the problem from a mathematical perspective: how can we bound the probability P[g(X) ≤ 0] of the event [g(X) ≤ 0], despite having significant uncertainty about the probability distribution P of X and the function g, and no reasonable way to extend our limited knowledge of them? How can we find lower and upper bounds on this partially known probability that are as tight as possible given the available information? For example, “P[g(X) ≤ 0] is between 0% and 100%” is a true statement, but far too loose to be useful, whereas “P[g(X) ≤ 0] is exactly 52.3937225%” is very precise and informative, but is likely both wrong and more precise than the available evidence really justifies.
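One way to put the question in symbols is sketched below. The notation is introduced here purely for illustration and is not quoted from the article: 𝒜 stands for the set of all pairs (f, μ) of a response function and an input distribution that are consistent with what we actually know, so that the true pair (g, P) is one (unknown) member of 𝒜.

```latex
% Assumed notation: \mathcal{A} is the set of admissible (function, distribution) pairs.
\[
  \underline{P}(\mathcal{A})
  := \inf_{(f,\mu) \in \mathcal{A}} \mu[f(X) \le 0]
  \;\le\; P[g(X) \le 0] \;\le\;
  \sup_{(f,\mu) \in \mathcal{A}} \mu[f(X) \le 0]
  =: \overline{P}(\mathcal{A})
\]
```

Under this reading, the trivial bounds “0% and 100%” correspond to an 𝒜 that contains every conceivable pair, while a single exact value corresponds to pretending that 𝒜 contains only one pair. The optimal bounds are the infimum and supremum over exactly the pairs that the available information allows.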

Abstract. We consider the problem of providing optimal uncertainty quantification (UQ) — and hence rigorous certification — for partially-observed functions. We present a UQ framework within which the observations may be small or large in number, and need not carry information about the probability distribution of the system in operation. The UQ objectives are posed as optimization problems, the solutions of which are optimal bounds on the quantities of interest; we consider two typical settings, namely parameter sensitivities (McDiarmid diameters) and output deviation (or failure) probabilities. The solutions of these optimization problems depend non-trivially (even non-monotonically and discontinuously) upon the specified legacy data. Furthermore, the extreme values are often determined by only a few members of the data set; in our principal physically-motivated example, the bounds are determined by just 2 out of 32 data points, and the remainder carry no information and could be neglected without changing the final answer. We propose an analogue of the simplex algorithm from linear programming that uses these observations to offer efficient and rigorous UQ for high-dimensional systems with high-cardinality legacy data. These findings suggest natural methods for selecting optimal (maximally informative) next experiments.
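To give a feel for why only a few legacy data points may matter, here is a minimal sketch of the standard Lipschitz envelope (the McShane–Whitney bounds). It is not the algorithm from the article. The function name `lipschitz_bounds`, the Euclidean norm, the Lipschitz constant and the three data points are all assumptions made up for this illustration.

```python
import numpy as np

def lipschitz_bounds(x, data_x, data_y, lip):
    """Tightest pointwise bounds on g(x) for any g that is lip-Lipschitz
    (Euclidean norm assumed) and satisfies g(data_x[i]) == data_y[i]."""
    # Distance from the query point to every legacy input.
    dist = np.linalg.norm(data_x - x, axis=1)
    lower_terms = data_y - lip * dist  # each observation gives a lower bound
    upper_terms = data_y + lip * dist  # ... and an upper bound
    i_lo = int(np.argmax(lower_terms))  # observation that sets the lower bound
    i_hi = int(np.argmin(upper_terms))  # observation that sets the upper bound
    return lower_terms[i_lo], upper_terms[i_hi], i_lo, i_hi

# Hypothetical legacy data: three observations of a 1-Lipschitz function.
data_x = np.array([[0.0], [1.0], [3.0]])
data_y = np.array([1.0, 0.5, 2.0])

lo, hi, i_lo, i_hi = lipschitz_bounds(np.array([2.0]), data_x, data_y, lip=1.0)
print(f"g(2) lies in [{lo}, {hi}]; set by observations {i_lo} and {i_hi}")
```

In this toy example the bounds at the query point are each set by a single observation, and the observation at 0 could be dropped without changing either bound. Since the lower bound is positive, no admissible g can fail at this particular input.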