2021-10-04

Functionals in Probability and Bayesian Inference

My work on transportation policy research in part involves conducting & analyzing surveys of people's travel behaviors & attitudes. Analyzing survey data requires an understanding of basic probability and statistics, an area in which I previously felt I had just enough knowledge to get by when learning about statistical physics but in which I now need to build more practical skills. In the process of refreshing my understanding of probability and statistics, I thought more about Bayes's theorem. In the context of hypothesis testing or inference, Bayes's theorem can be stated as follows: given a hypothesis \( \mathrm{H} \) and data \( \mathrm{D} \) such that the likelihood of measuring the data under that hypothesis is \( \operatorname{P}(\mathrm{D}|\mathrm{H}) \), and given a prior probability \( \operatorname{P}(\mathrm{H}) \) associated with that hypothesis, the posterior probability of that hypothesis given the data is \[ \operatorname{P}(\mathrm{H}|\mathrm{D}) = \frac{\operatorname{P}(\mathrm{D}|\mathrm{H})\operatorname{P}(\mathrm{H})}{\operatorname{P}(\mathrm{D})} \] The key is that the denominator is evaluated as a sum \[ \operatorname{P}(\mathrm{D}) = \sum_{\mathrm{H}'} \operatorname{P}(\mathrm{D}|\mathrm{H}')\operatorname{P}(\mathrm{H}') \] where the label \( \mathrm{H}' \) runs over all possible hypotheses.
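To make the discrete form concrete, here is a minimal Python sketch; the three-hypothesis Bernoulli family, the uniform prior, and the particular data counts are all illustrative assumptions on my part, not anything canonical.

```python
# A minimal sketch of discrete Bayesian updating over a finite set of
# hypotheses; the three Bernoulli hypotheses, the uniform prior, and the
# data counts are all made-up illustrative choices.
import numpy as np
from scipy.stats import binom

thetas = np.array([0.2, 0.5, 0.8])   # three hypotheses H', each a success rate
priors = np.array([1/3, 1/3, 1/3])   # prior probabilities P(H')
n_trials, n_successes = 10, 7        # the observed data D

likelihoods = binom.pmf(n_successes, n_trials, thetas)  # P(D|H') for each H'
evidence = np.sum(likelihoods * priors)                 # P(D) as a sum over H'
posteriors = likelihoods * priors / evidence            # P(H|D)
print(posteriors)  # roughly [0.002, 0.37, 0.63]
```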

In practice, however, the set of hypotheses doesn't literally encompass all hypotheses but encompasses only one particular type of function with one or a few free parameters, which then go into the prior probability distribution. For one free parameter: if the hypothesis specifies only the value of the (assumed continuous) parameter \( \theta \), the prior probability of that hypothesis is given by a density \( f_{\mathrm{H}}(\theta)\), and the likelihood of measuring an (assumed continuous) data vector \( D \) under that hypothesis is the density \( f(D|\theta) \), then Bayes's theorem gives \[ f_{\mathrm{H}}(\theta|D) = \frac{f(D|\theta) f_{\mathrm{H}}(\theta)}{\int f(D|\theta') f_{\mathrm{H}}(\theta')~\mathrm{d}\theta'} \] as the posterior probability density under that hypothesis given the data.
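Here too, a minimal numerical sketch may help; the Gaussian likelihood with unknown mean, the standard-normal prior, and the particular data vector are all assumed for illustration, and the integral in the denominator is approximated by a simple Riemann sum on a grid.

```python
# A minimal sketch of the one-parameter continuous form; the Gaussian
# likelihood with unknown mean theta (and known unit variance), the
# standard-normal prior f_H(theta), and the data vector D are assumptions.
import numpy as np
from scipy.stats import norm

theta = np.linspace(-5.0, 5.0, 2001)          # grid over the parameter theta
dtheta = theta[1] - theta[0]
prior = norm.pdf(theta, loc=0.0, scale=1.0)   # prior density f_H(theta)
D = np.array([1.2, 0.7, 1.9])                 # hypothetical data vector

# f(D|theta): product of independent Gaussian densities over the data points
likelihood = np.prod(norm.pdf(D[:, None], loc=theta[None, :], scale=1.0), axis=0)

evidence = np.sum(likelihood * prior) * dtheta   # the integral over theta'
posterior = likelihood * prior / evidence        # f_H(theta|D)
print(theta[np.argmax(posterior)])  # posterior mode; analytically sum(D)/(n+1) = 0.95
```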

I understood that in most cases, a single class of likelihood functions varied through a single parameter is good enough, and that especially for the purposes of pedagogy, it is useful to keep things simple & analytical. Even so, I was more broadly unsatisfied with the lack of explanation of how to more generally sum over all possible hypotheses.

This post is my attempt to address some of those issues. Follow the jump to see more explanation as well as discussion of other tangentially related philosophical points.

Functionals

I am not an expert in functional calculus by any means. I have learned about these ideas by skimming various books and papers. Additionally, as I do have some expertise with the basic formalism of linear algebra, I will attempt to analogize between discrete vectors & continuous functions as appropriate for discussing functionals. In particular, the idea is that a vector \( \mathbf{v} = (v_{1}, v_{2}, \ldots, v_{N}) \) can be represented in index notation as \( v_{i} \) for \( i \in \{1, 2, \ldots, N\} \). It is possible for a vector to be infinite-dimensional, which means \( N \to \infty \). Furthermore, it is conceptually valid to replace the index \( i \) with the index \( x \) and pretend that it is a point in real space (although this statement shouldn't be taken literally, because the cardinality of the set of integers is less than the cardinality of the set of real numbers). This is the basis for making the notational connections \( v_{i} \to v_{x} \to v(x) \to f(x) \) (where the last step is just a choice to replace the label \( v \) for the vector with \( f \) for the function).

Basic Concepts

A functional is a map from a vector, or a function, to a number. A simple example of a functional of a vector is the 2-norm squared \( \lVert \mathbf{v}\rVert_{2}^{2} = \sum_{i = 1}^{N} v_{i}^{2} \) (where I have assumed real vector spaces & real function spaces for simplicity). A more complicated nonlinear example could be \( F[\mathbf{v}] = \sum_{i} b_{i} v_{i} + \sum_{i,j,k} A_{ij} C_{jk} v_{i}^{2} v_{j}^{3} v_{k} \) where \( b_{i} \) is a specified vector and \( A_{ij} \) & \( C_{jk} \) are specified matrices. Similarly, a simple example of a functional of a function is the integral \( J[f, x] = \int_{0}^{x} f(y)~\mathrm{d}y \). A more complicated nonlinear example could be \( F[f] = \int_{0}^{1} g(x) f(x)~\mathrm{d}x + \int_{0}^{1} \int_{0}^{1} K(x, x') (f(x))^{3} (f(x'))^{5}~\mathrm{d}x'~\mathrm{d}x \) where \( g(x) \) is a specified function and \( K(x, x') \) is a specified integral kernel.
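Since a discretized function is just a vector of samples, these functionals are easy to evaluate numerically; the following sketch uses an arbitrary test function and arbitrary choices of \( g(x) \) & \( K(x, x') \) mirroring the formulas above.

```python
# A minimal sketch of evaluating functionals on a discretized function;
# the test function f, the choice g(x) = x, and K(x, x') = exp(-|x - x'|)
# are arbitrary assumptions mirroring the formulas above.
import numpy as np

N = 1000
x = np.linspace(0.0, 1.0, N)
dx = x[1] - x[0]
f = np.sin(np.pi * x)        # an arbitrary test function, sampled as a vector

# The 2-norm squared of the vector of samples.
norm_sq = np.sum(f**2)
print(norm_sq)

# J[f, 1] = \int_0^1 f(y) dy via a Riemann sum.
J = np.sum(f) * dx
print(J)   # ~ 2/pi ≈ 0.6366 for f(x) = sin(pi x)

# The nonlinear example: \int g f dx + \int\int K(x, x') f(x)^3 f(x')^5 dx' dx.
g = x
K = np.exp(-np.abs(x[:, None] - x[None, :]))
F = np.sum(g * f) * dx + np.sum(K * (f**3)[:, None] * (f**5)[None, :]) * dx**2
print(F)
```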

A functional can be differentiated (in the calculus sense). For a functional of a vector, this is defined as \[ \frac{\partial F}{\partial v_{i}} = \lim_{\epsilon \to 0} \frac{F[\mathbf{v} + \epsilon \mathbf{e}_{i}] - F[\mathbf{v}]}{\epsilon} \] where \( \mathbf{e}_{i} \) is the unit vector such that \( \mathbf{v} \cdot \mathbf{e}_{i} = v_{i} \). For a functional of a function, this is defined as \[ \frac{\delta F}{\delta f(x)} = \lim_{\epsilon \to 0} \frac{F[f(x') + \epsilon \delta(x - x')] - F[f(x')]}{\epsilon} \] where \( \delta(x - x') \) is the Dirac delta function and \( x' \) is a dummy index meant to distinguish from the free index \( x \) that defines the point of functional differentiation.

A functional can also be expressed in terms of a Taylor series. For a functional of a vector, the term at order \( n \) expanded around \( v_{i} = 0 \) for all \( i \in \{1, 2, \ldots, N\} \) is written as \( \frac{1}{n!} \sum_{i_{1}, i_{2}, \ldots, i_{n}} \left(\frac{\partial^{n} F}{\partial v_{i_{1}} \partial v_{i_{2}} \ldots \partial v_{i_{n}}}\right)\bigg|_{\mathbf{v} = 0} v_{i_{1}} v_{i_{2}} \cdot \ldots \cdot v_{i_{n}} \). For a functional of a function, the term at order \( n \) expanded for variations around the function \( f = 0 \) for all \( x \) is written as \( \frac{1}{n!} \int \left(\frac{\delta^{n} F}{\delta f(x_{1}) \delta f(x_{2}) \ldots \delta f(x_{n})}\right)\bigg|_{f = 0} f(x_{1}) f(x_{2}) \cdot \ldots \cdot f(x_{n})~\mathrm{d}x_{1}~\mathrm{d}x_{2}~\ldots~\mathrm{d}x_{n} \).
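As a sanity check on the Taylor series in the vector case, an arbitrary quadratic functional is reproduced exactly by its first- & second-order terms; the particular \( \mathbf{b} \) & \( A \) below are randomly generated for illustration.

```python
# A minimal sanity check of the Taylor series in the vector case, assuming
# an arbitrary quadratic functional F[v] = b.v + v.(A v); its expansion
# around v = 0 terminates at second order and is exact.
import numpy as np

rng = np.random.default_rng(0)
N = 5
b = rng.normal(size=N)          # specified vector
A = rng.normal(size=(N, N))     # specified matrix

def F(v):
    return b @ v + v @ (A @ v)

v = rng.normal(size=N)
grad_at_0 = b                   # first partial derivatives at v = 0
hess_at_0 = A + A.T             # second partial derivatives (constant in v)
taylor = grad_at_0 @ v + 0.5 * v @ (hess_at_0 @ v)
print(np.isclose(F(v), taylor))  # True: first- & second-order terms suffice
```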

Finally, a functional can be integrated. For a functional of a vector, this is written as \( \int F[\mathbf{v}] \prod_{i = 1}^{N} \mathrm{d}v_{i} \). A function can be seen as a vector in the limit \( N \to \infty \) with continuous labels \( x \) replacing discrete labels \( i \); while the integral is the continuous analogue of the sum, there is no neat continuous analogue of the product of differentials, but in any case, the integral of a functional of a function is written as the path integral \( \int F[f]~\mathcal{D}f \), where \( \mathcal{D}f \) is the differential element in the path integral indicating all possible function variations.
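For small \( N \), the integral of a functional of a vector can be evaluated directly; the following sketch uses an assumed Gaussian integrand for \( N = 2 \), whose exact value \( (2\pi)^{N/2} \) hints at the kind of divergent prefactor the path-integral measure must absorb as \( N \to \infty \) (a point that resurfaces below in the discussion of the functional Fourier expansion).

```python
# A minimal sketch of integrating a functional of a vector for small N,
# assuming a Gaussian integrand exp(-|v|^2 / 2) with N = 2; the exact
# answer is (2 pi)^(N/2), the sort of prefactor that grows without bound
# as N increases.
import numpy as np
from scipy.integrate import dblquad

val, err = dblquad(lambda v2, v1: np.exp(-0.5 * (v1**2 + v2**2)),
                   -10.0, 10.0, lambda v1: -10.0, lambda v1: 10.0)
print(val, 2.0 * np.pi)   # both ≈ 6.2832
```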

Dirac Delta Functional

Earlier, the Dirac delta function \( \delta(x - x') \) was introduced. This has the property that \[ \int_{-\infty}^{\infty} f(x') \delta(x - x')~\mathrm{d}x' = f(x) \] holds for any function \( f \). This in turn implies that the normalization \[ \int_{-\infty}^{\infty} \delta(x)~\mathrm{d}x = 1 \] must hold. Furthermore, the Dirac delta function has the Fourier expansion \[ \delta(x - x') = \int_{-\infty}^{\infty} \exp\left(\mathrm{i}k\left(x - x'\right)\right)~\frac{\mathrm{d}k}{2\pi} \] which can be useful for analytical evaluation of other integrals.
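The sifting property can be checked numerically by standing in for the Dirac delta function with a narrow Gaussian; the test function & evaluation point below are arbitrary choices.

```python
# A minimal numerical check of the sifting property, standing in for the
# Dirac delta function with a narrow Gaussian of width eps; the test
# function cos(x) and the point x0 = 0.5 are arbitrary choices.
import numpy as np

def delta_eps(y, eps):
    # Normalized Gaussian that tends to delta(y) as eps -> 0.
    return np.exp(-y**2 / (2.0 * eps**2)) / (eps * np.sqrt(2.0 * np.pi))

xp = np.linspace(-10.0, 10.0, 200001)   # the integration variable x'
dxp = xp[1] - xp[0]
f = np.cos(xp)                          # an arbitrary test function
x0 = 0.5                                # the point being sifted out

for eps in [0.5, 0.1, 0.02]:
    approx = np.sum(f * delta_eps(x0 - xp, eps)) * dxp
    print(eps, approx)                  # converges to cos(0.5) ≈ 0.8776
```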

This can be generalized to \( N \) dimensions as \( \delta^{N} (\mathbf{v} - \mathbf{v}') = \prod_{i = 1}^{N} \delta(v_{i} - v'_{i}) \). This has the property that \[ \int f(\mathbf{v}') \delta^{N}(\mathbf{v} - \mathbf{v}')~\mathrm{d}^{N} v' = f(\mathbf{v}) \] holds for any function \( f \). This in turn implies that the normalization \[ \int \delta^{N}(\mathbf{v})~\mathrm{d}^{N} v = 1 \] must hold. Furthermore, the Dirac delta function has the Fourier expansion \[ \delta^{N}(\mathbf{v} - \mathbf{v}') = \int \exp\left(\mathrm{i}\sum_{j} k_{j} \left(v_{j} - v'_{j}\right)\right)~\frac{\mathrm{d}^{N} k}{(2\pi)^{N}} \] which can be useful for analytical evaluation of other integrals.

This suggests a generalization to functions as the Dirac delta functional \( \delta[f - f'] \). This would have the property that \[ \int F[f'] \delta[f - f']~\mathcal{D}f' = F[f] \] holds for any functional \( F \). This in turn implies that the normalization \[ \int \delta[f]~\mathcal{D}f = 1 \] must hold. Furthermore, the Dirac delta functional has the functional Fourier expansion \[ \delta[f - f'] \sim \int \exp\left(\mathrm{i}\int \kappa(x) \left(f(x) - f'(x)\right)~\mathrm{d}x\right)~\mathcal{D}\kappa \] which can be useful for analytical evaluation of other integrals. The functional Fourier expansion can be found in the paper (with an open access version here) "Fluctuating surface currents: An algorithm for efficient prediction of Casimir interactions among arbitrary materials in arbitrary geometries" by Reid et al., Phys. Rev. A 88, 022514 (2013), which in turn cites older papers stating similar things without proof. It should be noted that \( \sim \) is used instead of \( = \) on either side of the functional Fourier expansion of the Dirac delta functional because the right-hand side would have a denominator of \( (2\pi)^{N} \) in the limit \( N \to \infty \); in the cited paper, equality is used because this factor is absorbed into or canceled by appropriate definitions of the partition function and in any case does not ultimately affect measurable quantities in quantum electrodynamics.

Note that the Dirac delta functional \( \delta[f] \) is not the same as using a function as the argument of the standard Dirac delta function \( \delta(f(x)) \); to reduce ambiguity, I am consistently using square brackets for the Dirac delta functional and parentheses for the standard Dirac delta function. The Dirac delta function with a function as its argument, under certain conditions, can be written as \( \delta(f(x)) = \sum_{\alpha} \frac{\delta(x - x_{\alpha})}{\left|\frac{\partial f}{\partial x}\big|_{x = x_{\alpha}}\right|} \), where the \( x_{\alpha} \) are the simple roots of \( f \), though this will play no further role in this post.

A Few Examples

An example of a functional derivative can be seen in the equation \[ \frac{\delta}{\delta f(x)} \int (f(x'))^{2}~\mathrm{d}x' = 2f(x) \] which can be derived from the definition of a functional derivative. This can also be seen using the following argument: if matrices and vectors are used instead, then this yields \( \frac{\partial}{\partial v_{i}} \sum_{j} v_{j}^{2} = \sum_{j} 2v_{j} \frac{\partial v_{j}}{\partial v_{i}} = 2\sum_{j} v_{j} \delta_{ij} = 2v_{i} \).
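This example can be checked numerically from the definition of the functional derivative, using a discrete stand-in for the Dirac delta function of height \( 1/\Delta x \) at a single grid point; the test function & grid point below are arbitrary.

```python
# A minimal numerical check of delta/delta f(x) of \int f(x')^2 dx' = 2 f(x),
# using the limit definition with a discrete stand-in for the Dirac delta
# function (a spike of height 1/dx at one grid point); the test function
# and grid point are arbitrary choices.
import numpy as np

N = 1000
x = np.linspace(0.0, 1.0, N, endpoint=False)
dx = x[1] - x[0]
f = np.exp(-x)                           # an arbitrary test function

def F(f):
    # The functional \int f(x')^2 dx' as a Riemann sum.
    return np.sum(f**2) * dx

i = 300                                  # grid point at which to differentiate
eps = 1e-6
spike = np.zeros(N)
spike[i] = 1.0 / dx                      # discrete stand-in for delta(x - x_i)
deriv = (F(f + eps * spike) - F(f)) / eps
print(deriv, 2.0 * f[i])                 # the two agree to ~1e-3
```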

Another example of a functional derivative can be seen in the equation \[ \frac{\delta}{\delta f(x)} \int \left(\frac{\partial f(x')}{\partial x'}\right)^{2}~\mathrm{d}x' = -2\frac{\partial^{2} f(x)}{\partial x^{2}} \] which can be derived from the definition of a functional derivative. However, this didn't seem so satisfying to me, as it would intuitively seem like the sign in front should be positive based on the previous example. One way to see this is to again use the analogy to matrices and vectors. In particular, if \( u_{i} = \sum_{j} D_{ij} v_{j} \) where \( D_{ij} \) represents the matrix elements of the derivative operator, which is anti-symmetric (given suitable boundary conditions, like periodic or decaying functions, so that boundary terms from integration by parts vanish) so that \( D_{ij} = -D_{ji} \), then \( \frac{\partial}{\partial v_{i}} \sum_{j} u_{j}^{2} = 2\sum_{j} u_{j} \frac{\partial u_{j}}{\partial v_{i}} = 2\sum_{j,k,l} D_{jk} v_{k} D_{jl} \delta_{li} = 2\sum_{j,k} D_{ji} D_{jk} v_{k} \). For the last step, one can see that \( D_{ji} = -D_{ij} \) and therefore the result really is \( -2 \) (and not \( +2 \) from the previous example) multiplied by the square of the derivative operator (which is the second derivative operator) acting on the vector.
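The sign argument can also be verified numerically; the sketch below assumes a periodic grid, so that the central-difference derivative matrix \( D \) is exactly anti-symmetric, and confirms that the gradient of \( \sum_{j} u_{j}^{2} \) equals \( -2 \) times the second-derivative matrix acting on the vector.

```python
# A minimal numerical check of the sign argument, assuming a periodic grid
# so that the central-difference derivative matrix D is exactly
# anti-symmetric; the gradient of sum_j u_j^2 with u = D v is then
# 2 D^T D v = -2 D^2 v, i.e. -2 times the second derivative acting on v.
import numpy as np

N = 200
dx = 2.0 * np.pi / N
x = np.arange(N) * dx
v = np.sin(x)                            # samples of an arbitrary test function

# Periodic central differences: (D v)_i = (v_{i+1} - v_{i-1}) / (2 dx).
D = (np.roll(np.eye(N), 1, axis=1) - np.roll(np.eye(N), -1, axis=1)) / (2.0 * dx)
assert np.allclose(D, -D.T)              # D is anti-symmetric on a periodic grid

grad = 2.0 * D.T @ (D @ v)               # gradient of sum_j (D v)_j^2
minus_2_D2v = -2.0 * D @ (D @ v)         # -2 times the second-derivative matrix
print(np.max(np.abs(grad - minus_2_D2v)))    # ~ 0: the identity is exact
print(np.max(np.abs(grad - 2.0 * v)))        # small: -2 (sin x)'' = 2 sin x
```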

Functionals in Bayes's Theorem

I am now ready to lay out how I imagine functionals being used to generalize Bayes's theorem to account for summation over all hypotheses. Suppose a generic hypothesis \( \mathrm{H} \) defines a likelihood in the form of a probability density \( f(x) \) where data \( x \) may live in an arbitrarily high but finite dimensional space. Suppose that the prior probability of that hypothesis is given by the probability density functional \( \operatorname{p}[f] \). If data points \( x_{\mathrm{D}} \) are observed, then Bayes's theorem should be written as \[ \operatorname{p}[f|x_{\mathrm{D}}] = \frac{f(x_{\mathrm{D}}) \operatorname{p}[f]}{\int g(x_{\mathrm{D}}) \operatorname{p}[g]~\mathcal{D}g} \] where in the denominator \( g \) is a dummy label for the functional integration over all probability density functions defining the likelihood to distinguish from \( f \) in the numerator; note that \( g(x_{\mathrm{D}}) = \int g(x)\delta(x - x_{\mathrm{D}})~\mathrm{d}x \) is a functional of \( g \), namely a map from the function \( g \) to a number obtained by evaluating \( g \) at \( x_{\mathrm{D}} \), so the numerator is the product of two functionals and the denominator is a path integral of the product of two functionals. It can be observed that the posterior probability density functional \( \operatorname{p}[f|x_{\mathrm{D}}] \) is properly normalized, as \( \int \operatorname{p}[f|x_{\mathrm{D}}]~\mathcal{D}f = 1 \).
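A crude way to get a feel for this formula is to stand in for the path integral with a sum over a finite reference family of candidate densities; everything specific in the sketch below (the Gaussian family, the uniform prior weights, the data points) is an illustrative assumption.

```python
# A minimal sketch of the functional form of Bayes's theorem, standing in
# for the path integral over densities g with a sum over a finite reference
# family (Gaussians of assorted means & widths, each with equal prior
# weight); every specific choice here is an illustrative assumption.
import numpy as np
from scipy.stats import norm

means = np.linspace(-2.0, 2.0, 21)
widths = np.array([0.5, 1.0, 2.0])
candidates = [(m, s) for m in means for s in widths]   # stand-ins for "all" f
prior_weight = 1.0 / len(candidates)                   # uniform p[g]

x_D = np.array([0.3, 0.9, 0.5])                        # observed data points

def f_at_data(m, s):
    # f(x_D) for this candidate density: product over the observed points.
    return np.prod(norm.pdf(x_D, loc=m, scale=s))

evidence = sum(f_at_data(m, s) * prior_weight for m, s in candidates)
posterior = {(m, s): f_at_data(m, s) * prior_weight / evidence
             for m, s in candidates}
best = max(posterior, key=posterior.get)
print(best, posterior[best])   # the candidate density the data favor most
```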

It should be noted that in the denominator of the functional form of Bayes's theorem as well as in the normalization of the prior & posterior probabilities, any path integrals over likelihood densities \( f \) (or \( g \), with replacements in the rest of this paragraph made appropriately) must satisfy the constraints that \( f(x) \geq 0 \) for all \( x \) and that \( \int f(x)~\mathrm{d}x = 1 \). For the former constraint, I'm not exactly sure what the form would be. Perhaps if the Heaviside step function \( \Theta(x) \) were extended to vectors such that \( \Theta^{N} (\mathbf{v}) = \prod_{i = 1}^{N} \Theta(v_{i}) \), then there could be a further extension to a Heaviside step functional \( \Theta[f] \) using the analogy between vectors and functions. That said, the Heaviside step function for vectors can be expressed through \( N \) Cartesian integrals over the Dirac delta function as \( \Theta^{N} (\mathbf{v}) = \int_{-\infty}^{v_{1}} \int_{-\infty}^{v_{2}} \ldots \int_{-\infty}^{v_{N}} \delta^{N} (\mathbf{v}')~\mathrm{d}v'_{N}~\ldots~\mathrm{d}v'_{2}~\mathrm{d}v'_{1} \), but the generalization to the functional case \( \Theta[f] = \int \delta[g]~\mathcal{D}g \) is less straightforward because of difficulty in specifying the limits for the space of functions, so I won't pursue this further. For the latter constraint, in principle, enforcement could come by including the Dirac delta function term \( \delta\left(\int f(x)~\mathrm{d}x - 1\right) \) (where this is a standard Dirac delta function, not a Dirac delta functional, even though the argument involves an integral over \( f \) and is therefore a functional of \( f \)). In practice, it isn't clear whether this leads to any meaningful analytical clarity or simplification, so I won't pursue this further.

Recovering the One-Parameter Formula

It would be useful to see how to recover the formula for \( f_{\mathrm{H}}(\theta|D) \) (stated near the beginning of this post) from the more general functional representation of Bayes's theorem. However, I confess that in this instance, I am somewhat unsure of how to proceed in the most careful way possible. The following derivation seems to me to be a bit hacked together.

I think the way to do it is to say that if the hypothesis \( \mathrm{H} \) specifies a class of likelihood densities \( h(x|\theta) \) and the prior probability over the parameter \( \theta \) is \( f_{\mathrm{H}}(\theta) \), then the prior probability density functional is \[ \operatorname{p}[f|\theta] = f_{\mathrm{H}}(\theta) \delta[f - h(x|\theta)] \] where already it can be seen that the prior probability density functional is written as being conditioned on \( \theta \). This also requires rewriting the denominator to take into account the integral over \( \theta \), so that the posterior probability density functional is \[ \operatorname{p}[f|x_{\mathrm{D}}, \theta] = \frac{f(x_{\mathrm{D}}) f_{\mathrm{H}}(\theta) \delta[f - h(x|\theta)]}{\int g(x_{\mathrm{D}}) f_{\mathrm{H}}(\theta') \delta[g - h(x|\theta')]~\mathcal{D}g~\mathrm{d}\theta'} \] although the inclusion of integration over \( \theta \) in the denominator, though conceptually consistent with integrating over all possible functions, seems to be done in an ad hoc way instead of in a way that clearly falls out of the integral of the functional. Finally, integrating both sides over the function \( f \) yields the expression \[ \int \operatorname{p}[f|x_{\mathrm{D}}, \theta]~\mathcal{D}f = \frac{h(x_{\mathrm{D}}|\theta) f_{\mathrm{H}}(\theta)}{\int h(x_{\mathrm{D}}|\theta') f_{\mathrm{H}}(\theta')~\mathrm{d}\theta'} \] as desired.
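For completeness, the key sifting step in that last integration can be written out explicitly: by the defining property of the Dirac delta functional, \[ \int f(x_{\mathrm{D}}) \delta[f - h(x|\theta)]~\mathcal{D}f = h(x_{\mathrm{D}}|\theta) \] and similarly in the denominator with \( g \) & \( \theta' \), so each path integral collapses onto the parametric family and leaves only the ordinary integral over \( \theta' \).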

Other Thoughts on Bayesian Inference and Societal Issues

In the process of learning about Bayesian inference, I read the Wikipedia articles about Bayesian inference, jurimetrics, and statistical criticisms of claims of group identity-based discrimination, and I reread the paper "An Empirical Analysis of Racial Differences in Police Use of Force" by Fryer, NBER Working Paper 22399 (2016). From these articles, I learned a few things.

First, rereading the paper about police brutality was a useful reminder of the necessity to control for different factors when making claims about discrimination. At the same time, the author of that paper clearly explains that data about police brutality as provided by police departments is sparse, inconsistently organized, and may be beset by self-selection biases. Furthermore, all statistics, most notably including those statistics claiming that there are no significant differences with respect to the victim's race in the rates of lethal force used by police officers, are conditioned upon the existence of an interaction with the police. This means that among those who are stopped, there are no significant differences with respect to the victim's race in police use of lethal force, but there may be overall significant differences with respect to the victim's race simply because of significant differences in the rates at which police officers stop people in the first place.

Second, that paper got me thinking about how there is an infinite set of hypotheses for any phenomenon, and it is impossible for humans to fairly search the space, so humans often rely on pre-existing biases & heuristics to search. For example, racial differences in the incidence of police brutality could be explained by discrimination by police officers, racial differences in compliance of people who are stopped by police officers, something as absurd as racial differences in the fart odors of people who are stopped by police officers (implying that sufficiently offensive fart odors would instinctively drive a police officer to violence), or something else entirely. In any of these cases, likelihood densities could be constructed to ensure that the hypothesis can't easily be disproved, and prior probabilities for the hypothesis could be equal to or close to 1, so the posterior probabilities would never disprove the hypothesis.

One response could be to give up entirely on any attempt to apply statistics to controversial social issues. Another response could be to acknowledge that Bayesian inference does not specifically favor certain hypotheses or interpretations and that what one measures in part reflects what one cares about; the latter point is illustrated even in the arcane context of the thermodynamics of colloids, per the paper "Celebrating Soft Matter's 10th anniversary: Testing the foundations of classical entropy: colloid experiments" by Cates & Manoharan, Soft Matter 11, 6538-6546 (2015). Related to this, the Wikipedia article about Bayesian inference has the useful reminder, courtesy of statements by Karl Popper, that Bayesian interpretations of probability fundamentally center the notion of probability as a subjective degree of belief, so it is nonsensical to claim that Bayesian inference is somehow fundamentally rational or bias-free; this can be seen in the fact that a prior probability of exactly 0 or 1 for a hypothesis implies the same posterior probability irrespective of the data (and this holds in limiting senses too). Thus, I believe it is OK to start with axiomatic moral values as grounding points for formulating hypotheses, as long as one is fair & honest enough to ensure that prior probabilities for those hypotheses aren't so close to 0 or 1 that no data could ever overturn them.