My work on transportation policy research in part involves conducting & analyzing surveys of people's travel behaviors & attitudes. Analyzing survey data requires an understanding of basic probability and statistics, an area in which I previously felt I had just enough knowledge to get by when learning about statistical physics, but in which I now need to build more practical skills. In the process of refreshing my understanding of probability and statistics, I thought more about Bayes's theorem.

In the context of hypothesis testing or inference, Bayes's theorem can be stated as follows: given a hypothesis \( \mathrm{H} \) and data \( \mathrm{D} \) such that the likelihood of measuring the data under that hypothesis is \( \operatorname{P}(\mathrm{D}|\mathrm{H}) \), and given a prior probability \( \operatorname{P}(\mathrm{H}) \) associated with that hypothesis, the posterior probability of that hypothesis given the data is \[ \operatorname{P}(\mathrm{H}|\mathrm{D}) = \frac{\operatorname{P}(\mathrm{D}|\mathrm{H})\operatorname{P}(\mathrm{H})}{\operatorname{P}(\mathrm{D})}. \] The key is that the denominator is evaluated as a sum \[ \operatorname{P}(\mathrm{D}) = \sum_{\mathrm{H}'} \operatorname{P}(\mathrm{D}|\mathrm{H}')\operatorname{P}(\mathrm{H}') \] where the label \( \mathrm{H}' \) runs over all possible hypotheses.
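To make the discrete form concrete, here is a minimal Python sketch of this computation; the hypotheses and all of the numbers in it are invented purely for illustration (imagine asking which travel mode a surveyed commuter primarily uses, given that they reported a long commute).

```python
# Minimal sketch of Bayes's theorem over a finite set of hypotheses.
# The hypotheses and numbers are invented purely for illustration:
# which travel mode does a surveyed commuter primarily use, given
# that they reported a commute of 45 minutes or more?

# Prior probabilities P(H) for each hypothesis (these must sum to 1).
priors = {"drives": 0.6, "transit": 0.3, "bikes": 0.1}

# Likelihoods P(D|H): probability of observing the data (a 45+ minute
# commute) under each hypothesis. These need not sum to 1.
likelihoods = {"drives": 0.2, "transit": 0.5, "bikes": 0.1}

# Evidence P(D) = sum over H' of P(D|H') P(H').
evidence = sum(likelihoods[h] * priors[h] for h in priors)

# Posteriors P(H|D) = P(D|H) P(H) / P(D).
posteriors = {h: likelihoods[h] * priors[h] / evidence for h in priors}

for h, p in posteriors.items():
    print(f"P({h} | long commute) = {p:.3f}")
```

By construction the posteriors sum to 1, because the same sum over hypotheses that appears in the numerators is what sits in the denominator.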
In practice, however, the set of hypotheses doesn't literally encompass all hypotheses but encompasses only one particular type of function with one or a few free parameters, which then go into the prior probability distribution. For one free parameter, if each hypothesis specifies only the value of the (assumed continuous) parameter \( \theta \), the prior probability of that hypothesis is given by a density \( f_{\mathrm{H}}(\theta)\), and the likelihood of measuring an (assumed continuous) data vector \( D \) under that hypothesis is the density \( f(D|\theta) \), then Bayes's theorem gives \[ f_{\mathrm{H}}(\theta|D) = \frac{f(D|\theta) f_{\mathrm{H}}(\theta)}{\int f(D|\theta') f_{\mathrm{H}}(\theta')~\mathrm{d}\theta'} \] as the posterior probability density of the parameter \( \theta \) given the data.
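As an equally contrived one-parameter example, the following Python sketch evaluates this posterior density numerically on a grid: the parameter \( \theta \) is taken to be the proportion of transit riders in a population, the data are \( k \) "yes" answers out of \( n \) survey responses under a Bernoulli likelihood, the prior is uniform on \( [0, 1] \), and the normalizing integral in the denominator is approximated by a simple Riemann sum.

```python
import numpy as np

# Sketch of the one-parameter continuous case, evaluated on a grid.
# The scenario is invented for illustration: estimate the proportion
# theta of transit riders from n survey responses, k of which are
# "yes", with a Bernoulli likelihood and a uniform prior on [0, 1].

n, k = 50, 18                          # hypothetical survey counts
theta = np.linspace(0.0, 1.0, 1001)    # grid over the parameter
dtheta = theta[1] - theta[0]

prior = np.ones_like(theta)            # uniform prior density f_H(theta)

# Likelihood f(D|theta) of k "yes" answers out of n responses under
# each theta (the binomial coefficient cancels in the normalization).
likelihood = theta**k * (1.0 - theta)**(n - k)

# Denominator: the integral of f(D|theta') f_H(theta') d(theta'),
# approximated here by a simple Riemann sum over the grid.
evidence = np.sum(likelihood * prior) * dtheta

posterior = likelihood * prior / evidence   # f_H(theta|D)

print("posterior mean:", np.sum(theta * posterior) * dtheta)
print("posterior integrates to:", np.sum(posterior) * dtheta)
```

In this particular case the posterior is also known analytically (a uniform prior is conjugate to the Bernoulli likelihood, giving a beta distribution), which provides a useful check on the numerical normalization.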
I understood that in most cases, a single class of likelihood functions, varied through a single parameter, is good enough, and especially for the purposes of pedagogy, it is useful to keep things simple & analytical. Even so, I remained unsatisfied by the lack of explanation of how one might more generally sum over all possible hypotheses.
This post is my attempt to address some of those issues. Follow the jump to see more explanation as well as discussion of other tangentially related philosophical points.