Simple STAN model¶
My attempt at a proper STAN model.
Parameters¶
Cosmological parameters:
- \(\Omega_m\): matter density
- \(w\): dark energy equation of state
- \(\alpha\): Phillips correction for stretch
- \(\beta\): Tripp correction for colour
Population parameters:
- \(\langle M_B \rangle\): mean absolute magnitude of supernovae
- \(\sigma_{M_B}\): standard deviation of absolute magnitudes
- \(\langle c_i \rangle\): mean colour, as a function of redshift
- \(\sigma_c\): standard deviation of colour
- \(\langle x_{i1} \rangle\): mean stretch, as a function of redshift
- \(\sigma_{x_1}\): standard deviation of stretch
- \(\rho\): correlation (matrix) between absolute magnitude, colour and stretch
Marginalised parameters:
- \(\delta(0)\) and \(\delta(\infty)\): The magnitude-mass relationship
- \(\delta \mathcal{Z}_b\): Zeropoint uncertainty for each of the g,r,i,z bands.
Per supernova parameters:
- \(m_B\): the true (latent) apparent magnitude
- \(x_1\): the true (latent) stretch
- \(c\): the true (latent) colour
- \(z\): the true redshift of the supernova
- \(m\): the true mass of the host galaxy
Model Overview¶
We wish to model our posterior, given our observations, our model \(\theta\), and selection effects \(S\). Our specific observations \(D\) are the light curves themselves, the summary statistics that result from them \(\lbrace \hat{m_B}, \hat{c}, \hat{x_1} \rbrace\), the covariance for the summary statistics \(\hat{C}\), the redshifts of the objects \(\hat{z}\) and a normalised mass estimate \(\hat{m}\). We thus signify observed variables with the hat operator. In this work we will be modelling \(\lbrace \hat{m_B}, \hat{c}, \hat{x_1} \rbrace\) as having true underlying values, however we assume that \(\hat{z}\) and \(\hat{m}\) are known \((\hat{z} = z,\ \hat{m}=m)\).
For simplicity, we adopt the commonly used notation that \(\eta\equiv \lbrace \hat{m_B}, \hat{c}, \hat{x_1} \rbrace\).
\[P(\theta|D, S) \propto P(D|\theta, S)\, P(\theta),\]
with
\[P(D|\theta, S) = \frac{P(S|D, \theta)\, P(D|\theta)}{\int P(S|R, \theta)\, P(R|\theta)\, dR},\]
where \(R\) represents all possible data. To simplify notation in the future I define \(w \equiv \int P(S|R, \theta) P(R|\theta) \ dR\).
STAN Model¶
Let us examine only the numerator for the time being. The numerator is the model which ends up implemented in STAN, whilst the denominator can be implemented differently. For simplicity, let us denote the population parameters \(\lbrace \langle M_B \rangle, \langle x_1 \rangle, \langle c \rangle, \sigma_{M_B}, \sigma_{x_1}, \sigma_c, \rho \rbrace\) shown under the Population header as \(\gamma\).
Furthermore, in the interests of simplicity, let us examine only a single supernova for the time being. Let us denote the unnormalised likelihood for a single supernova as \(P (D_i|\theta)\).
Now, let us quickly deal with the priors so I don't have to type them out again and again. We will treat \(\sigma_{M_B},\ \sigma_{x_1},\ \sigma_c\) with Cauchy priors, \(\rho\) with an LKJ prior, \(\delta \mathcal{Z}_b\) as constrained by the zero point uncertainty from photometry (currently just set to a 0.01 mag normal uncertainty), and all other parameters with flat priors. The prior distributions on redshift and host mass do not matter in this likelihood (without bias corrections), as we assume redshift and mass are precisely known. So now we can focus on the likelihood, and introduce latent variables \(\eta\) to represent the true values from which the observations are drawn.
\[\begin{split}P (D_i|\theta) &= P(\hat{m_B}, \hat{x_1}, \hat{c}, \hat{z}, \hat{m} | \Omega_m, w, \alpha, \beta, \gamma, \delta \mathcal{Z}_b, z, m) \\[10pt] &= \int dm_B \int dx_1 \int dc \ P(\hat{m_B}, \hat{x_1}, \hat{c}, \hat{z}, \hat{m}, m_B, x_1, c | \Omega_m, w, \alpha, \beta, \gamma, \delta \mathcal{Z}_b, z, m) \\[10pt] &= \iiint d\eta \ P(\hat{\eta}, \hat{z}, \hat{m}, \eta | \Omega_m, w, \alpha, \beta, \gamma, \delta \mathcal{Z}_b, z, m) \\[10pt] &= \iiint d\eta \ \delta(\hat{z} - z) \delta(\hat{m}-m) P(\hat{\eta}, z, m, \eta | \Omega_m, w, \alpha, \beta, \gamma, \delta \mathcal{Z}_b, z, m) \\[10pt] &= \iiint d\eta \ P(\hat{\eta}, \eta | z, m, \Omega_m, w, \alpha, \beta, \gamma, \delta \mathcal{Z}_b ) \delta(\hat{z} - z) \delta(\hat{m}-m) \\[10pt]\end{split}\]
Where I have used the fact that we assume mass and redshift are precisely known (\(\hat{z}=z\) and \(\hat{m}=m\)), and therefore do not need to be modelled with latent parameters. With precise measurements, we do not need to consider the underlying redshift and mass distributions \(P(z|\theta)\) and \(P(m|\theta)\) in this part of our model, as they will simply modify the constant of proportionality, and thus I do not write them out.
We take zeropoint uncertainty into account by computing \(\frac{\partial\hat{\eta}}{\partial\mathcal{Z}_b}\) for each supernova light curve. We thus model the summary statistics that would have been observed as \(\hat{\eta}_{\rm True} = \hat{\eta} + \delta\mathcal{Z}_b \frac{\partial\hat{\eta}}{\partial\mathcal{Z}_b}\). Assuming that these true observed summary statistics \(\hat{\eta}_{\rm True}\) are normally distributed around the true values \(\eta\), we can separate them out.
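As a concrete illustration, here is a minimal numpy sketch of this first-order zeropoint adjustment; the numbers and array shapes below are hypothetical stand-ins, not values from the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numbers for one supernova: eta_hat = (m_B, x_1, c) summary
# statistics, deta_dZ[b, j] = sensitivity of summary statistic j to the
# zeropoint of band b (g, r, i, z), dZ_b = sampled zeropoint offsets.
eta_hat = np.array([23.1, 0.5, -0.05])
deta_dZ = rng.normal(scale=0.1, size=(4, 3))
dZ_b = rng.normal(scale=0.01, size=4)   # ~0.01 mag normal prior

# First-order adjustment giving the "true observed" summary statistics
eta_true_obs = eta_hat + dZ_b @ deta_dZ
```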
\[\begin{split}P (D_i|\theta) &= \iiint d\eta \ P(\hat{\eta} | \eta, z, m, \Omega_m, w, \alpha, \beta, \gamma, \delta \mathcal{Z}_b ) P(\eta| z, m, \Omega_m, w, \alpha, \beta, \gamma, \delta \mathcal{Z}_b ) \delta(\hat{z} - z) \delta(\hat{m}-m) \\[10pt] &= \iiint d\eta \ P(\hat{\eta} | \eta, \delta \mathcal{Z}_b) P(\eta | z, m, \Omega_m, w, \alpha, \beta, \gamma) \delta(\hat{z} - z) \delta(\hat{m}-m) \\[10pt] &= \iiint d\eta \ \mathcal{N}\left( \hat{\eta} + \delta\mathcal{Z}_b \frac{\partial\hat{\eta}}{\partial\mathcal{Z}_b} |\eta, C \right) P(\eta| z, m, \Omega_m, w, \alpha, \beta, \gamma) \delta(\hat{z} - z) \delta(\hat{m}-m) \\\end{split}\]
Now, in order to calculate \(P(\eta| \Omega_m, w, \alpha, \beta, \gamma, z, m, \delta\mathcal{Z}_b)\), we need to transform from \(m_B\) to \(M_B\). We transform using the following relationship:
\[M_B = m_B - \mu + \alpha x_1 - \beta c + k(z) m,\]
where we define \(\mu\) as
\[\mu = 5 \log_{10} \left[ \frac{(1+z)\, c}{H_0 \times 10\,{\rm pc}} \int_0^z \frac{dz'}{\sqrt{\Omega_m (1+z')^3 + (1-\Omega_m)(1+z')^{3(1+w)}}} \right]\]
and \(k(z)\) as
\[k(z) = \delta(0) \left[ \frac{1.9\left(1 - \frac{\delta(\infty)}{\delta(0)}\right)}{0.9 + 10^{0.95z}} + \frac{\delta(\infty)}{\delta(0)} \right].\]
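For reference, a short Python sketch of the distance modulus under this flat \(w\)CDM parameterisation; the function name and the fixed \(H_0\) are my own choices for illustration (in this model \(H_0\) is degenerate with \(\langle M_B \rangle\)):

```python
import numpy as np
from scipy.integrate import quad

C_KM_S = 299792.458  # speed of light [km/s]

def distance_modulus(z, om, w, H0=70.0):
    """mu(z) for a flat w-CDM cosmology (sketch; H0 assumed)."""
    E = lambda zp: np.sqrt(om * (1 + zp) ** 3 + (1 - om) * (1 + zp) ** (3 * (1 + w)))
    comoving, _ = quad(lambda zp: 1.0 / E(zp), 0.0, z)
    d_l = (1 + z) * (C_KM_S / H0) * comoving   # luminosity distance [Mpc]
    return 5 * np.log10(d_l) + 25              # 10 pc zeropoint

print(distance_modulus(0.5, om=0.3, w=-1.0))   # roughly 42.3 mag
```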
We note that \(\mu\) is a function of \(\hat{z},\Omega_m,w\), however we will simply denote it \(\mu\) to keep the notation from spreading over too many lines.
From the above, \(M_B\) is a function of \(\Omega_m, w, \alpha, \beta, x_1, c, z, m\). Or, more probabilistically,
\[P(M_B | m_B, x_1, c, z, m, \Omega_m, w, \alpha, \beta) = \delta\left(M_B - \left[ m_B - \mu + \alpha x_1 - \beta c + k(z) m\right]\right).\]
We can thus introduce a latent variable \(M_B\) and immediately remove the \(m_B\) integral via the delta function:
\[P(D_i|\theta) = \iiint d\eta \int dM_B \ \mathcal{N}\left( \hat{\eta} + \delta\mathcal{Z}_b \frac{\partial\hat{\eta}}{\partial\mathcal{Z}_b} \,\middle|\, \eta, C \right) P(\eta, M_B | \theta),\]
where
(11)¶\[\begin{split}P(\eta, M_B| \theta) &= P(m_B | M_B, x_1, c, z, m, \Omega_m, w, \alpha, \beta, \gamma, \delta\mathcal{Z}_b ) P(M_B, x_1, c | z, m, \Omega_m, w, \alpha, \beta, \gamma, \delta\mathcal{Z}_b )\delta(\hat{z} - z) \delta(\hat{m}-m) \\[10pt] &= \delta\left(M_B - \left[ m_B - \mu + \alpha x_1 - \beta c + k(z) m\right]\right) P(M_B, x_1, c | z, m,\Omega_m, w, \alpha, \beta, \gamma, \delta\mathcal{Z}_b ) \delta(\hat{z} - z) \delta(\hat{m}-m) \\[10pt] &= \delta\left(M_B - \left[ m_B - \mu + \alpha x_1 - \beta c + k(z) m\right]\right) P(M_B, x_1, c | \gamma) \delta(\hat{z} - z) \delta(\hat{m}-m)\\[10pt] &= \delta\left(M_B - \left[ m_B - \mu + \alpha x_1 - \beta c + k(z) m\right]\right) \mathcal{N}\left( \lbrace M_B, x_1, c \rbrace | \lbrace \langle M_B \rangle, \langle x_1 \rangle, \langle c \rangle \rbrace, V \right) \delta(\hat{z} - z) \delta(\hat{m}-m) \\[10pt]\end{split}\]
with
\[V = \begin{pmatrix} \sigma_{M_B}^2 & \rho_{12}\, \sigma_{M_B}\sigma_{x_1} & \rho_{13}\, \sigma_{M_B}\sigma_{c} \\ \rho_{12}\, \sigma_{M_B}\sigma_{x_1} & \sigma_{x_1}^2 & \rho_{23}\, \sigma_{x_1}\sigma_{c} \\ \rho_{13}\, \sigma_{M_B}\sigma_{c} & \rho_{23}\, \sigma_{x_1}\sigma_{c} & \sigma_{c}^2 \end{pmatrix}\]
giving the population covariance.
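A one-liner in numpy makes the construction explicit; the numerical values for \(\sigma\) and \(\rho\) below are purely illustrative:

```python
import numpy as np

# Population standard deviations and correlation matrix rho (hypothetical
# values); V = diag(sigma) @ rho @ diag(sigma) is the covariance of the
# multivariate normal population above.
sigma = np.array([0.1, 1.0, 0.1])          # sigma_MB, sigma_x1, sigma_c
rho = np.array([[1.0,  0.2, -0.3],
                [0.2,  1.0,  0.1],
                [-0.3, 0.1,  1.0]])
V = np.diag(sigma) @ rho @ np.diag(sigma)
```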
Note
In this implementation there is no skewness in the colour distribution. As we do not require normalised probabilities, we can simply add in correcting factors that emulate skewness. This has been done in the simple_skew model, where we add in a CDF probability for the colour to turn our normal into a skew normal.
Putting this back together, we now have a simple hierarchical multi-normal model. Adding in the priors, and taking into account that we observe multiple supernovae, the final numerator is
\[P(D|\theta)\, P(\theta) \propto P(\theta) \prod_{i=1}^{N} \iint dx_1\, dc \int dM_B \ \mathcal{N}\left( \hat{\eta}_i + \delta\mathcal{Z}_b \frac{\partial\hat{\eta}_i}{\partial\mathcal{Z}_b} \,\middle|\, \eta_i, C_i \right) \mathcal{N}\left( \lbrace M_B, x_1, c \rbrace \,\middle|\, \lbrace \langle M_B \rangle, \langle x_1 \rangle, \langle c \rangle \rbrace, V \right),\]
with the priors as described above.
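To make the hierarchy concrete, here is a hedged forward-simulation sketch of a single supernova under this model; all hyperparameter values, the toy mass step, and the summary-statistic covariance are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical hyperparameters for a forward draw of one supernova.
means = np.array([-19.3, 0.0, 0.0])   # <M_B>, <x_1>, <c>
V = np.diag([0.1, 1.0, 0.1]) ** 2     # population covariance (uncorrelated here)
alpha, beta = 0.14, 3.1               # stretch and colour coefficients
mu, k_z, mass = 42.27, 0.05, 1.0      # mu(z=0.5) from the earlier sketch; toy mass step

# Draw latent intrinsic properties, then invert the M_B relation for m_B.
M_B, x1, c = rng.multivariate_normal(means, V)
m_B = M_B + mu - alpha * x1 + beta * c - k_z * mass
eta = np.array([m_B, x1, c])

# Observed summary statistics scatter about the latents with covariance C.
C = np.diag([0.05, 0.3, 0.03]) ** 2
eta_hat = rng.multivariate_normal(eta, C)
```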
Selection Effects¶
Now that the easy part of the model is done, we need to move on to the real issue - our data are biased. As the bias correction is not data dependent but model parameter dependent (cosmology dependent), the correction for each individual supernova is identical. I also note that, unlike the previous section, here we have to care about the redshift and mass distributions, and so I will write them out.
We assume, for any given supernova, that the selection effect can be determined as a function of apparent magnitude, colour, stretch, redshift and mass. We might expect the zero points to affect the selection efficiency, but only because we normally consider zero points and photon counts hand in hand. As we have a fixed experiment (fixed photon counts and statistics) with different zero points, the selection efficiency is actually independent of the zero points. Thus, we can compute the bias correction as
(15)¶\[\begin{split}w &= \iiint d\hat{\eta} \iiint d\eta \int dM_B \int d\hat{z} \int d\hat{m} \int dz \int dm \, P(\hat{\eta},\eta, \hat{z},z, \hat{m},m, M_B|\theta) P(S|m_B, x_1, c, z, m) \\[10pt] &= \idotsint d\hat{\eta} \, d\eta \, d\hat{z} \, dz\, d\hat{m}\, dm\, dM_B\ \mathcal{N}\left( \hat{\eta} + \delta\mathcal{Z}_b \frac{\partial\hat{\eta}}{\partial\mathcal{Z}_b} | \eta, C \right)\ P(S|m_B, x_1, c, z, m) \\ &\quad\quad\quad \delta\left(M_B - \left[ m_B - \mu + \alpha x_1 - \beta c + k(z) m\right]\right)\ \mathcal{N}\left( \lbrace M_B, x_1, c \rbrace | \lbrace \langle M_B \rangle, \langle x_1 \rangle, \langle c \rangle \rbrace, V \right)\delta(\hat{z} - z) \delta(\hat{m}-m) P(z|\theta) \\[10pt] &= \idotsint d\hat{\eta} \, d\eta \, dz\, dm\, dM_B\ \mathcal{N}\left( \hat{\eta} + \delta\mathcal{Z}_b \frac{\partial\hat{\eta}}{\partial\mathcal{Z}_b} | \eta, C \right)\ P(S|m_B, x_1, c, z, m) \\ &\quad\quad\quad \delta\left(M_B - \left[ m_B - \mu + \alpha x_1 - \beta c + k(z) m\right]\right)\ \mathcal{N}\left( \lbrace M_B, x_1, c \rbrace | \lbrace \langle M_B \rangle, \langle x_1 \rangle, \langle c \rangle \rbrace, V \right) P(z|\theta) \\\end{split}\]
Again, we assume redshift and mass are perfectly known, so the relationship between the actual (latent) redshift and mass and the observed quantities is a delta function, which is why they each appear only once in the equation above. The important assumption is that the detection efficiency is, to good approximation, captured by the apparent magnitude, colour, stretch, mass and redshift of the supernova.
As we integrate over all possible realisations, we have that over all space
\[\iiint d\hat{\eta}\ \mathcal{N}\left( \hat{\eta} + \delta\mathcal{Z}_b \frac{\partial\hat{\eta}}{\partial\mathcal{Z}_b} \,\middle|\, \eta, C \right) = 1,\]
and as such we can remove it from the integral. As is expected, the final weight looks exactly like our likelihood, except with some extra integral signs that marginalise over all possible experimental realisations:
\[w = \int dz \int dm \iint dx_1 \, dc \int dM_B \ \mathcal{N}\left( \lbrace M_B, x_1, c \rbrace \,\middle|\, \lbrace \langle M_B \rangle, \langle x_1 \rangle, \langle c \rangle \rbrace, V \right) P(S|m_B, x_1, c, z, m)\, P(z|\theta),\]
where \(m_B\) is shorthand for \(M_B + \mu - \alpha x_1 + \beta c - k(z) m\).
Each of these components can then be addressed individually.
Finally, we note that, having \(N\) supernovae instead of one, we need only normalise the likelihood at each new point in parameter space, not at each individual data point (because the normalisation is independent of the data point). Thus our final posterior takes the following form, where I explicitly take into account the number of supernovae we have:
\[P(\theta|D, S) \propto \frac{P(\theta) \prod_{i=1}^{N} P(D_i|\theta)}{w^N}.\]
Note
Technical aside: Calculating \(P(S|m_B, x_1, c, z, m)\) is not an analytic task. It has complications not just in the distance modulus being the result of an integral, but also in that the colour and stretch correction factors make use of supernova-specific values. An efficient way to determine the efficiency is as follows:
- Initially run a large DES-like simulation, recording all generated SN parameters and whether they pass the cuts.
- Use the input cosmology to translate the \(m_B, x_1, c\) distribution to an \(M_B, x_1, c\) distribution.
- Perform Monte-Carlo integration using the distribution.
This gives our correction \(w\) as follows.
To go into the math: our Monte Carlo integration sample of simulated supernovae is drawn from the multivariate normal distribution \(\mathcal{N}_{\rm sim}\), so the importance-sampling estimate is
\[w \propto \sum_{{\rm passed}\ i} \frac{\mathcal{N}\left( \lbrace M_B, x_1, c \rbrace_i \,\middle|\, \lbrace \langle M_B \rangle, \langle x_1 \rangle, \langle c \rangle \rbrace, V \right)}{\mathcal{N}_{\rm sim}\left( \lbrace M_B, x_1, c \rbrace_i \right)}.\]
As the weights do not have to be normalised, we can discard the constant factor out front. We also note that determining whether a simulated supernova has passed the cut now means converting light curve counts to flux and checking that the new fluxes pass signal-to-noise cuts.
Given a set of points to use in the integration, we can see that subtracting the above term from our log-likelihood provides an implementation of our bias correction.
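A sketch of that correction in Python, assuming we already have the simulated points that passed the cuts and the log-density of the simulation distribution at those points (all names here are mine, not the package's):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_w_mc(passed_MBx1c, log_q_sim, means, V):
    """Unnormalised log of the Monte-Carlo bias correction w.

    passed_MBx1c : (n, 3) simulated (M_B, x_1, c) that passed the cuts,
                   already translated with the input cosmology.
    log_q_sim    : (n,) log-density of the simulation distribution N_sim
                   at those points (importance-sampling denominator).
    """
    log_p = multivariate_normal(means, V).logpdf(passed_MBx1c)
    return logsumexp(log_p - log_q_sim)   # constant factor discarded

# Subtracting N * log_w_mc(...) from the summed log-likelihood applies the
# correction for N supernovae.
```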
Warning
A primary concern with selection effects is that they grow exponentially worse with more data. To intuitively understand this, if you have an increased number of (biased) data points, the posterior maximum becomes better constrained and you need an increased re-weighting (bias correction) to shift the posterior maximum to the correct location. Because of this, we will need to implement an approximate bias correction in Stan.
To recap: we have a full bias correction that can be computed using Monte-Carlo integration. However, Monte-Carlo integration cannot be put inside the Stan framework, and having no bias correction at all in Stan means that our sampling efficiency drops to close to zero, making it very difficult to adequately sample the posterior. As such, we need an approximate bias correction inside Stan to improve our efficiency.
We can do this by looking at the selection efficiency simply as a function of the apparent magnitude of the supernova. There are two possibilities. The first is to approximate the selection efficiency as a normal CDF, as was done in Rubin (2015). However, when simulating the DES data, low spectroscopic efficiency at brighter magnitudes makes a CDF an inappropriate choice. Instead, the most general analytic form we can prescribe for the approximate correction is a skew normal, as (depending on the value of the skew parameter \(\alpha\)) we can smoothly transition from a normal CDF to a normal PDF. Thus the approximate bias function is described by
\[P(S|m_B) \approx S(m_B) = \mathcal{N}_{\rm skew}\left( m_B \,\middle|\, m_{B,{\rm eff}}, \sigma_{\rm eff}, \alpha_{\rm eff} \right).\]
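Using scipy's skewnorm, a minimal sketch of such an efficiency function; the effective parameters below are hypothetical fitted values, and the overall normalisation differs from the text's by a constant, which does not matter for unnormalised weights:

```python
from scipy.stats import skewnorm

# Approximate selection efficiency as a skew normal in apparent magnitude.
m_eff, sigma_eff, alpha_eff = 23.5, 0.5, -3.0   # hypothetical values

def selection_efficiency(m_B):
    return skewnorm.pdf(m_B, alpha_eff, loc=m_eff, scale=sigma_eff)

# alpha_eff -> 0 recovers a normal PDF; per the text, suitable skew values
# let the shape approach a normal-CDF-like roll-off instead.
```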
(24)¶\[\begin{split}w_{\rm approx} &= \int d\hat{z} \int d\hat{m_B} \, P(\hat{z},\hat{m_B}|\theta) S(m_B) \\[10pt] &= \int d\hat{z} \int d\hat{m_B} \int dz \int dm_B \, P(\hat{z},\hat{m_B},z,m_B|\theta) S(m_B) \\[10pt] &= \int d\hat{z} \int d\hat{m_B} \int dz \int dm_B \, P(\hat{z}|z) P(\hat{m_B}|m_B) P(m_B|z,\theta) P(z|\theta) S(m_B) \\[10pt] &= \int d\hat{z} \int d\hat{m_B} \int dz \int dm_B \, \delta(\hat{z}-z) \mathcal{N}(\hat{m_B}|m_B,\hat{\sigma_{m_B}}) P(m_B|z,\theta) P(z|\theta) S(m_B) \\[10pt] &= \int dz \int dm_B \, \left[ \int d\hat{m_B} \mathcal{N}(\hat{m_B}|m_B,\hat{\sigma_{m_B}}) \right] P(m_B|z,\theta) P(z|\theta) S(m_B) \\[10pt] &= \int dz \left[ \int dm_B \, S(m_B) P(m_B|z,\theta) \right] P(z|\theta) \\[10pt]\end{split}\]
As such, we have our efficiency function
\[S(m_B) = 2\, \mathcal{N}\left( \frac{m_B - m_{B,{\rm eff}}}{\sigma_{\rm eff}} \right) \Phi\left( \alpha_{\rm eff}\, \frac{m_B - m_{B,{\rm eff}}}{\sigma_{\rm eff}} \right).\]
With our survey efficiency thus defined, we need to describe our supernova model as a population in apparent magnitude (and redshift). This will be given by a normal distribution with mean \(m_B^*(z) = \langle M_B \rangle + \mu(z) - \alpha \langle x_1 \rangle + \beta \langle c \rangle\). The width of this normal is given by \((\sigma_{m_B}^*)^2 = \sigma_{M_B}^2 + (\alpha \sigma_{x_1})^2 + (\beta \sigma_c)^2 + 2(\beta \sigma_{M_B,c} -\alpha \sigma_{M_B,x_1} - \alpha\beta\sigma_{x_1,c})\), such that we formally have
\[P(m_B|z,\theta) = \mathcal{N}\left( m_B \,\middle|\, m_B^*(z), \sigma_{m_B}^* \right).\]
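A small sketch of these two population quantities in Python; the params dictionary and the mu() helper (the distance modulus from the earlier sketch) are assumptions for illustration:

```python
import numpy as np

def population_mB(z, params, mu):
    """Mean and width of the apparent-magnitude population at redshift z.

    params is a dict of hypothetical hyperparameters; mu is a callable
    returning the distance modulus at z.
    """
    p = params
    mean = p["MB_mean"] + mu(z) - p["alpha"] * p["x1_mean"] + p["beta"] * p["c_mean"]
    var = (p["sigma_MB"] ** 2
           + (p["alpha"] * p["sigma_x1"]) ** 2
           + (p["beta"] * p["sigma_c"]) ** 2
           + 2 * (p["beta"] * p["cov_MB_c"]
                  - p["alpha"] * p["cov_MB_x1"]
                  - p["alpha"] * p["beta"] * p["cov_x1_c"]))
    return mean, np.sqrt(var)
```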
From this, we can derive an approximate weight \(w_{\rm approx}\):
(28)¶\[\begin{split}w_{\rm approx} &= \int dz \left[ \int dm_B \, S(m_B) P(m_B|z,\theta) \right] P(z|\theta) \\[10pt] &= \int dz \left[ \int dm_B \int dz \int dm_B \, P(\hat{z},\hat{m_B},z,m_B|\theta) S(m_B) \right] \\[10pt] &= \int dz \left[ \int dm_B \, \mathcal{N}_{\rm skew} (m_B | m_{B,{\rm eff}}, \sigma_{{\rm eff}}, \alpha_{\rm eff}) \mathcal{N}(m_B | m_B^*(z), \sigma_{m_B}^*) \right] P(z|\theta) \\[10pt] &= 2 \int dz \left[ \int dm_B \, \mathcal{N} \left(\frac{m_B - m_{B,{\rm eff}}}{\sigma_{{\rm eff}}}\right) \Phi\left(\alpha_{\rm eff} \frac{m_B - m_{B,{\rm eff}}}{\sigma_{{\rm eff}}}\right) \mathcal{N}\left(\frac{m_B - m_B^*(z)}{\sigma_{m_B}^*}\right) \right] P(z|\theta) \\[10pt] &= 2 \int dz \left[ \int dm_B \, \mathcal{N} \left( \frac{ m_{B,{\rm eff}} - m_B^*(z) }{ \sqrt{ \sigma_{{\rm eff}}^2 + \sigma_{m_B}^{*2} }} \right) \mathcal{N} \left( \frac{ m_B - \bar{m_B} }{ \bar{\sigma}_{m_B} }\right) \Phi\left(\alpha_{\rm eff} \frac{m_B - m_{B,{\rm eff}}}{\sigma_{{\rm eff}}}\right) \right] P(z|\theta) \\[10pt] & {\rm where }\ \ \bar{m_B} = \left( m_{B,{\rm eff}} \sigma_{m_B}^{*2} + m_B^*(z) \sigma_{{\rm eff}}^2 \right) / \left( \sigma_{m_B}^{*2} + \sigma_{{\rm eff}}^2 \right) \\[10pt] & {\rm where }\ \ \bar{\sigma}_{m_B}^2 = \left( \sigma_{m_B}^{*2} \sigma_{{\rm eff}}^2 \right) / \left( \sigma_{m_B}^{*2} + \sigma_{{\rm eff}}^2 \right) \\[10pt] &= 2 \int dz \, \mathcal{N} \left( \frac{ m_{B,{\rm eff}} - m_B^*(z) }{ \sqrt{ \sigma_{{\rm eff}}^2 + \sigma_{m_B}^{*2} }} \right) \left[ \int dm_B \, \mathcal{N} \left( \frac{ m_B - \bar{m_B} }{ \bar{\sigma}_{m_B} }\right) \Phi\left(\alpha_{\rm eff} \frac{m_B - m_{B,{\rm eff}}}{\sigma_{{\rm eff}}}\right) \right] P(z|\theta) \\[10pt] &= 2 \int dz \, \mathcal{N} \left( \frac{ m_{B,{\rm eff}} - m_B^*(z) }{ \sqrt{ \sigma_{{\rm eff}}^2 + \sigma_{m_B}^{*2} }} \right) \Phi\left( \frac{{\rm sign}(\alpha) \left( \bar{m_B} - m_{B,{\rm eff}} \right)}{ \sqrt{ \left( \frac{ \sigma_{{\rm eff}} }{ \alpha_{\rm eff} } \right)^2 + \bar{\sigma}_{m_B}^2 } } \right) P(z|\theta) \\[10pt] &= 2 \int dz \, \mathcal{N} \left( \frac{ m_{B,{\rm eff}} - m_B^*(z) }{ \sqrt{ \sigma_{{\rm eff}}^2 + \sigma_{m_B}^{*2} }} \right) \Phi\left( \frac{ {\rm sign}(\alpha) \left( m_B^*(z) - m_{B,{\rm eff}} \right) }{ \frac{\sigma_{m_B}^{*2} + \sigma_{{\rm eff}}^2}{\sigma_{{\rm eff}}^2} \sqrt{ \left( \frac{ \sigma_{{\rm eff}} }{ \alpha_{\rm eff} } \right)^2 + \frac{ \sigma_{m_B}^{*2} \sigma_{{\rm eff}}^2 }{ \sigma_{m_B}^{*2} + \sigma_{{\rm eff}}^2 } } } \right) P(z|\theta) \\[10pt]\end{split}\]
Thank you Wikipedia for laying the second-to-last line out so nicely.
We can see here that as our skew normal approaches a normal (\(\alpha \rightarrow 0\)), the CDF function tends to \(\frac{1}{2}\) and gives us only the expected normal residual.
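Rather than transcribing the closed form above, a quick numerical cross-check of \(w_{\rm approx}\) can be done by direct quadrature; everything here (the toy \(m_B^*(z)\), the toy \(P(z|\theta)\), the effective parameters) is hypothetical:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, skewnorm

m_eff, sigma_eff, alpha_eff = 23.5, 0.5, -3.0   # hypothetical values

def inner(z):
    # Toy population in apparent magnitude at this redshift.
    m_star, s_star = 20.0 + 5.0 * z, 0.15
    f = lambda m: (skewnorm.pdf(m, alpha_eff, m_eff, sigma_eff)
                   * norm.pdf(m, m_star, s_star))
    return quad(f, m_star - 8, m_star + 8)[0]

zs = np.linspace(0.05, 1.0, 40)
pz = zs ** 2 / np.trapz(zs ** 2, zs)            # toy volumetric-ish P(z|theta)
w_approx = np.trapz(np.array([inner(z) for z in zs]) * pz, zs)
```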
Note
If we wanted to use the original complementary CDF approximation for the selection efficiency, we would get the integral of the complementary CDF function.
Now here we depart from Rubin (2015). Rubin (2015) formulate their likelihood in terms of a combinatorial problem, taking into account the number of observed events and an unknown number of missed events. As detailed in their Appendix B, they also make “the counterintuitive approximation that the redshift of each missed SN is exactly equal to the redshift of a detected SN. This approximation is accurate because the SN samples have, on average, enough SNe that the redshift distribution is reasonable sampled.”
Unfortunately, I must disagree that this approximation is valid, because whilst the SN surveys may be able to reasonably sample the observed redshift distribution of SN, they do not adequately sample the underlying redshift distribution, which is important in my formulation.
Now, the underlying redshift distribution extends to very high redshift; however, we would not have to integrate over all of it, because the contribution to the integral quickly drops to zero above the observed redshift distribution. Sampling the high redshift tail is still necessary, though.
It is of interest that the difference in methodology (between my integral and Rubin’s combinatorics/redshift approximation/geometric series) leads to the following difference in bias corrections.
Note that I use capital W below, to denote a correction for the entire model, not a single supernova.
where the last line utilises a small \(\epsilon\) to aid convergence, and we discard the numerator, as Rubin states that with \(\epsilon > 0\) it did not turn out to be important.
To try and compare these different methods, I’ve also tried a similar exact redshift approximation to reduce my integral down to a product, however it does not work well.
After fitting the above posterior surface, we can remove the approximate correction and implement the actual Monte Carlo correction by assigning each point in the chain a weight based on the difference between the approximate weight and the actual weight.
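A sketch of this reweighting, assuming per-sample approximate and Monte Carlo log-weights have already been computed for the chain (function and argument names are mine):

```python
import numpy as np

def chain_weights(log_w_approx, log_w_mc, n_sn):
    """Post-hoc importance weights for a chain sampled with w_approx^-N.

    The target posterior uses w_mc^-N, so each sample gets weight
    (w_approx / w_mc)^N = exp(N * (log_w_approx - log_w_mc)).
    """
    log_wt = n_sn * (np.asarray(log_w_approx) - np.asarray(log_w_mc))
    log_wt -= log_wt.max()          # stabilise before exponentiating
    return np.exp(log_wt)
```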
Final Model¶
From the mathematics laid out before, I test three models.
For the first model, as the host mass distribution and redshift distribution are not well known, I keep mass and redshift as top-level model parameters. With this, I adopt the assumption that the redshift distribution is well sampled, and apply only the approximate correction. As such, this will give me a reference for the other two models.
The second and third models have mass and redshift coming from a parent population; the second applies the full Monte-Carlo correction, whilst the third keeps only the approximate correction.
To test the models, I have multiple datasets. The simple dataset is constructed by hand with simple draws and a perfect selection effect. The SNANA datasets use SNANA and thus have proper selection effects which are not perfectly skew normal. There are multiple realisations of \(\Omega_m\), and a simulation which introduces a skewed colour distribution (bifurcated Gaussian population).
Appendix 1 - MC inside Stan¶
Warning
Given the concerns with the importance sampling methods, I also decided to implement the bias corrections within Stan itself - that is, have Stan perform rough Monte-Carlo integration to get the bias correction in explicitly. Inserting the relevant data and structures into Stan so that I can perform Monte Carlo integration in a BHM framework significantly slows down the fits; however, I believed it would at least give good results.
In addition to the odd contours, I also see in the walk itself that we have sampling issues, with some walkers sampling some areas of posterior space more than others. Stan's lack of convergence here is a big issue, indicating that the surfaces created by adding MC integration are intractable for Stan.
Appendix 2 - Gaussian Processes¶
Warning
As performing Monte Carlo integration inside Stan hit a dead end, I decided to investigate whether I could achieve an approximate solution by using Gaussian Processes. Unfortunately, what we can see happening in the plots below is that a single point in the GP has a substantially higher weight than the others (a presumed combination of stochastic randomness and being in an unusual area of parameter space). However, this meant that any walker randomly initialised near this point would be stuck to it. Even when this did not happen, viewing the other walkers revealed issues with them preferring matter densities as high as possible, which is obviously not what is wanted. It seems that without a huge number of points (which Stan cannot handle) our model is too high-dimensional for a Gaussian Process to work.
Appendix 3 - Nearest Point GP¶
Warning
Building off a regular Gaussian Process, the high dimensionality of our model may be causing difficulties due to the GP kernel - if we are averaging or blending over too many points (too great a volume in parameter space), we would not expect accurate results. To test if this was the issue, I increased the number of points in the GP to a high value (in the thousands), and then modified the distance metric used to calculate the kernel - raising it to a power and then normalising, such that the distance to the closest point approached one, and the distance to all other points approached infinity (or really a number much larger than one).
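A minimal numpy sketch of one way to realise this modified kernel; the power p and the inverse-power weighting are my assumptions about how such a normalised distance metric could be implemented:

```python
import numpy as np

def nearest_point_weights(x, train_x, p=20):
    """Interpolation weights where the nearest training point dominates.

    Squared distances are normalised by the minimum (nearest -> 1, all
    others > 1), then raised to a large negative power so non-nearest
    weights vanish, giving a smooth nearest-point interpolator.
    """
    d2 = np.sum((train_x - x) ** 2, axis=1)
    d2 = d2 / max(d2.min(), 1e-12)   # guard against x on a training point
    w = d2 ** -p
    return w / w.sum()

def interpolate(x, train_x, train_y, p=20):
    return nearest_point_weights(x, train_x, p) @ train_y
```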
By doing this - instead of fitting the GP hyperparameters - I have essentially created a smooth, infinitely differentiable nearest-point interpolator. But, looking at the results below, apparently that is exactly what I don't want!
What is actually happening is that Stan uses a Hamiltonian Monte Carlo algorithm, which takes into account the gradient of the posterior surface. The nearest-point GP setup I am using has extreme gradients, because the GP values shift quickly when you cross the equidistant threshold between two points. These extreme gradients, and the convoluted and chaotic boundaries given by the equidistant constraint on a thousand points in a high dimensional volume, completely break Stan's HMC algorithm.