Introduction

In this vignette, we provide a brief introduction to the estimation methods in ngme2. We begin with a description of the model.

A popular and flexible covariance function for random fields on $\mathbb{R}^d$ is the MatΓ©rn covariance function: $$c(\mathbf{s}, \mathbf{s}') = \frac{\sigma^2}{\Gamma(\nu)2^{\nu-1}}(\kappa \|\mathbf{s}-\mathbf{s}'\|)^\nu K_\nu(\kappa\|\mathbf{s}-\mathbf{s}'\|),$$ where $\Gamma(\cdot)$ is the Gamma function, $K_\nu(\cdot)$ is the modified Bessel function of the second kind, $\kappa>0$ controls the correlation range and $\sigma^2$ is the variance. Finally, $\nu>0$ determines the smoothness of the field.
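
As a quick illustration (this helper is not part of ngme2; the function name and default parameter values are chosen only for this sketch), the MatΓ©rn covariance can be evaluated directly in R:

# Matern covariance for a vector of distances d
matern_cov <- function(d, kappa = 1, nu = 1, sigma = 1) {
  x <- kappa * d
  out <- sigma^2 / (gamma(nu) * 2^(nu - 1)) * x^nu * besselK(x, nu)
  out[d == 0] <- sigma^2  # the covariance equals the variance at distance zero
  out
}
matern_cov(c(0, 0.1, 0.5, 1))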

It is well known (Whittle, 1963) that a Gaussian process $u(\mathbf{s})$ with MatΓ©rn covariance function solves the stochastic partial differential equation (SPDE) $$(\kappa^2 -\Delta)^\beta u = \mathcal{W}\quad \hbox{in } \mathcal{D},$$ where $\Delta = \sum_{i=1}^d \frac{\partial^2}{\partial x_i^2}$ is the Laplacian operator, $\mathcal{W}$ is Gaussian spatial white noise on $\mathcal{D}=\mathbb{R}^d$, and $4\beta = 2\nu + d$.

Inspired by this relation between Gaussian processes with MatΓ©rn covariance functions and solutions of the above SPDE, Lindgren et al. (2011) constructed computationally efficient Gaussian Markov random field approximations of $u(\mathbf{s})$, where the domain $\mathcal{D}\subsetneq \mathbb{R}^d$ is bounded and $2\beta\in\mathbb{N}$.

In order to model departures from Gaussian behaviour, we consider the following extension due to Bolin (2014): $$(\kappa^2 - \Delta)^\beta X(\mathbf{s}) = \dot{\mathcal{M}}(\mathbf{s}),\quad \mathbf{s}\in\mathcal{D},$$ where $\dot{\mathcal{M}}$ is a non-Gaussian white noise. More specifically, we assume $\mathcal{M}$ to be a type-G LΓ©vy process.

We say that a LΓ©vy process is of type G if its increments can be represented as location-scale mixtures: $$\gamma + \mu V + \sigma \sqrt{V}Z,$$ where $\gamma$, $\mu$ and $\sigma$ are parameters, $Z\sim N(0,1)$ is independent of $V$, and $V$ is a positive infinitely divisible random variable. In the SPDE we will assume $\gamma = -\mu E(V)$ and $\sigma = 1$.
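
For intuition, the sketch below simulates type-G increments with an inverse-Gaussian mixing variable $V$ (the NIG case used later in this vignette); the parameter values are arbitrary and chosen only for illustration.

library(ngme2)
set.seed(1)
n <- 1000; mu <- 1; sigma <- 1; nu <- 0.5; h <- 1
# V is positive and infinitely divisible; here V ~ IG(nu, nu * h^2), so E(V) = h
V <- ngme2::rig(n, nu, nu * h^2, seed = 1)
Z <- rnorm(n)
# gamma = -mu * E(V) centres the increments
increments <- -mu * h + mu * V + sigma * sqrt(V) * Z
hist(increments, breaks = 50)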

Finally, we assume we have observations $Y_1,\ldots,Y_N$ at locations $\mathbf{s}_1,\ldots,\mathbf{s}_N\in\mathcal{D}$, satisfying $$Y_i = X(\mathbf{s}_i) + \varepsilon_i, \quad i=1,\ldots,N,$$ where $\varepsilon_1,\ldots,\varepsilon_N$ are i.i.d. with $\varepsilon_i\sim N(0, \sigma_\varepsilon^2)$.

Finite element approximations

In this vignette we assume a basic understanding of Galerkin's finite element method. For further details we refer the reader to the Sampling from processes given by solutions of SPDEs driven by non-Gaussian noise vignette.

From now on we assume $\beta=1$, so our SPDE is given by $$(\kappa^2 - \Delta) X(\mathbf{s}) = \dot{\mathcal{M}}(\mathbf{s}),\quad \mathbf{s}\in\mathcal{D}.$$

Let us introduce some notation regarding the finite element method (FEM). Let $V_n = {\rm span}\{\varphi_1,\ldots,\varphi_n\}$, where $\varphi_i(\mathbf{s})$, $i=1,\ldots, n$, are piecewise linear basis functions obtained from a triangulation of $\mathcal{D}$.

An approximate solution $X_n$ of $X$ is written in terms of the finite element basis functions as $$X_n(\mathbf{s}) = \sum_{i=1}^n w_i \varphi_i(\mathbf{s}),$$ where $w_i$ are the FEM weights. Let, also, $$\mathbf{f} = (\dot{\mathcal{M}}(\varphi_1),\ldots, \dot{\mathcal{M}}(\varphi_n)).$$ Then, given a vector $\mathbf{V} = (V_1,\ldots,V_n)$ of independent stochastic variances (in our case, positive infinitely divisible random variables), we obtain that $$\mathbf{f}|\mathbf{V} \sim N(\gamma + \mu\mathbf{V}, \sigma^2{\rm diag}(\mathbf{V})).$$

Let us now introduce some useful notation. Let $\mathbf{C}$ be the $n\times n$ matrix with $(i,j)$th entry given by $$\mathbf{C}_{i,j} = \int_{\mathcal{D}} \varphi_i(\mathbf{s})\varphi_j(\mathbf{s})\, d\mathbf{s}.$$ The matrix $\mathbf{C}$ is known as the mass matrix in FEM theory. Let, also, $\mathbf{G}$ be the $n\times n$ matrix with $(i,j)$th entry given by $$\mathbf{G}_{i,j} = \int_{\mathcal{D}} \nabla \varphi_i(\mathbf{s})\cdot\nabla\varphi_j(\mathbf{s})\,d\mathbf{s}.$$ The matrix $\mathbf{G}$ is known in FEM theory as the stiffness matrix. Finally, let $$h_i = \int_{\mathcal{D}} \varphi_i(\mathbf{s})\, d\mathbf{s}, \quad i=1,\ldots,n.$$ Recall that $\gamma = -\mu E(V)$; $\mathbf{V}$ is chosen such that $E[V_i] = h_i$ to ensure parameter identifiability. Then the FEM weights $\mathbf{w} = (w_1,\ldots,w_n)$ satisfy $$\mathbf{w}|\mathbf{V} \sim N(\mathbf{K}^{-1}(-\mu \mathbf{h}+\mu\mathbf{V}), \sigma^2\mathbf{K}^{-1}{\rm diag}(\mathbf{V})\mathbf{K}^{-1}),$$ where $\mathbf{K} = \kappa^2\mathbf{C}+\mathbf{G}$ is the discretization of the differential operator and $\mathbf{h} = (h_1,\ldots,h_n)$.
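
As a sketch of how these objects can be assembled in R (assuming that fmesher::fm_fem() returns the mass matrix c1, the lumped mass matrix c0 and the stiffness matrix g1, as its INLA counterpart does):

library(fmesher)
library(Matrix)
mesh <- fm_mesh_1d(seq(0, 1, length.out = 10))
fem  <- fm_fem(mesh)
C <- fem$c1           # mass matrix
G <- fem$g1           # stiffness matrix
h <- diag(fem$c0)     # h_i = integral of the i-th basis function
kappa <- 1
K <- kappa^2 * C + G  # discretisation of the operator kappa^2 - Delta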

Prediction

In order to illustrate predictions we will assume that the parameters $\kappa$, $\sigma$ and $\sigma_\varepsilon$ are known. The case of unknown parameters will be treated in the next section.

Our goal in this section is to perform prediction of the latent field $X$ at locations where there are no observations. Usually, when doing such predictions, one provides the mean and variance of the predictive distribution. We will now describe how one can compute these quantities under the model defined in the previous sections.

Let us assume we want to obtain predictions at locations $\widetilde{\mathbf{s}}_1, \ldots, \widetilde{\mathbf{s}}_p \in \mathcal{D}$, where $p\in \mathbb{N}$.

Notice that for $j=1,\ldots,p$, $$X_n(\widetilde{\mathbf{s}}_j) = \sum_{i=1}^n w_i \varphi_i(\widetilde{\mathbf{s}}_j).$$ Therefore, if we let $\mathbf{A}_p$ be the $p\times n$ matrix whose $(i,j)$th entry is given by $\mathbf{A}_{p,ij} = \varphi_j(\widetilde{\mathbf{s}}_i)$, then $$(X_n(\widetilde{\mathbf{s}}_1),\ldots, X_n(\widetilde{\mathbf{s}}_p)) = \mathbf{A}_p\mathbf{w}.$$ Thus, to perform prediction, the desired means and variances are $$E[\mathbf{A}_p \mathbf{w} | \mathbf{Y}]\quad\hbox{and}\quad V[\mathbf{A}_p\mathbf{w}|\mathbf{Y}],$$ where $\mathbf{Y} = (Y_1,\ldots,Y_N)$.

Now, observe that the density of $\mathbf{w}|\mathbf{Y}$ is not known, so the mean and variance cannot be computed analytically.

There are two ways to circumvent this. Both of them are based on the fact that even though we do not know the density of $\mathbf{w}|\mathbf{Y}$, we do know the density of $\mathbf{V}|\mathbf{w},\mathbf{Y}$ and the density of $\mathbf{w}|\mathbf{V},\mathbf{Y}$. Therefore we can use a Gibbs sampler to sample from $(\mathbf{w},\mathbf{V})|\mathbf{Y}$. From this we obtain, as a byproduct, marginal samples from $\mathbf{w}|\mathbf{Y}$ and $\mathbf{V}|\mathbf{Y}$.

We will now give a brief presentation of the Gibbs sampler and then provide the approximations of the means and variances.

Gibbs sampler

We will now briefly describe the Gibbs sampler algorithm that will be used here.

Let $\mathbf{A}$ be the $N\times n$ matrix whose $(i,j)$th entry is given by $\mathbf{A}_{ij} = \varphi_j(\mathbf{s}_i)$. Then $$(X_n(\mathbf{s}_1),\ldots,X_n(\mathbf{s}_N)) = \mathbf{A}\mathbf{w},$$ so that $$\mathbf{Y} \approx \mathbf{A}\mathbf{w} + \boldsymbol{\varepsilon},$$ where $\boldsymbol{\varepsilon} = (\varepsilon_1,\ldots,\varepsilon_N)$. We will work with this representation, i.e., we will assume that $$\mathbf{Y} = \mathbf{A}\mathbf{w} + \boldsymbol{\varepsilon},$$ and that any error from the approximation of $X(\cdot)$ by $X_n(\cdot)$ is captured by the measurement noise.

Therefore, under this assumption we have that $$\mathbf{Y}|\mathbf{w} \sim N(\mathbf{A}\mathbf{w}, \sigma_\varepsilon^{2} \mathbf{I}).$$ Also recall that $$\mathbf{w}|\mathbf{V} \sim N(\mathbf{K}^{-1}(-\mu \mathbf{h}+\mu\mathbf{V}), \sigma^2\mathbf{K}^{-1}{\rm diag}(\mathbf{V})\mathbf{K}^{-1}).$$ Let $$\mathbf{m} = \mathbf{K}^{-1}(-\mu \mathbf{h}+\mu\mathbf{V})\quad \hbox{and}\quad \mathbf{Q} = \frac{1}{\sigma^2}\mathbf{K}\,{\rm diag}(\mathbf{V})^{-1}\mathbf{K}.$$

It thus follows (see, also, Wallin and Bolin (2015) or Asar et al. (2020)) that $$\mathbf{w} | \mathbf{V}, \mathbf{Y} \sim N\big(\widetilde{\mathbf{m}}, \widetilde{\mathbf{Q}}^{-1}\big),$$ where $$\widetilde{\mathbf{Q}} = \mathbf{Q} + \sigma_\varepsilon^{-2} \mathbf{A}^\top\mathbf{A}\quad\hbox{and}\quad \widetilde{\mathbf{m}} = \widetilde{\mathbf{Q}}^{-1}\big(\mathbf{Q}\mathbf{m}+\sigma_\varepsilon^{-2}\mathbf{A}^\top\mathbf{Y}\big).$$
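
A small helper, sketched here for reference (the arguments K, h, A and Y are assumed to come from the discretization and data described above; this is not ngme2 internals), computes $\widetilde{\mathbf{m}}$ and $\widetilde{\mathbf{Q}}$ for a given $\mathbf{V}$:

library(Matrix)
# Conditional mean and precision of w | V, Y
cond_w <- function(V, K, h, A, Y, mu, sigma, sigma_eps) {
  Q <- K %*% diag(1 / V) %*% K / sigma^2                         # prior precision of w | V
  m <- solve(K, mu * (V - h))                                    # prior mean K^{-1}(-mu h + mu V)
  Q_tilde <- Q + t(A) %*% A / sigma_eps^2                        # conditional precision
  m_tilde <- solve(Q_tilde, Q %*% m + t(A) %*% Y / sigma_eps^2)  # conditional mean
  list(m_tilde = m_tilde, Q_tilde = Q_tilde)
}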

To compute the conditional distribution $\mathbf{V}|\mathbf{w}, \mathbf{Y}$, one can see from Wallin and Bolin (2015), p. 879, that $V_1,\ldots,V_n$ are conditionally independent given $\mathbf{w}$. Furthermore, we also have from Proposition 1 of Asar et al. (2020) that if $V\sim GIG(p,a,b)$, where $GIG$ stands for the generalized inverse Gaussian distribution with parameters $p$, $a$ and $b$, then, for every $j=1,\ldots,n$, $$V_j|\mathbf{w},\mathbf{Y} \sim GIG\Bigg(p-0.5,\; a+\frac{\mu^2}{\sigma^2},\; b + \frac{(\mathbf{K}\mathbf{w}+\mathbf{h}\mu)_j^2}{\sigma^2}\Bigg).$$

We are now in a position to use the Gibbs sampling algorithm:

  • Provide initial values 𝐕(0)\mathbf{V}^{(0)};
  • Sample 𝐰(1)|𝐕(0),𝐘\mathbf{w}^{(1)} | \mathbf{V}^{(0)},\mathbf{Y};
  • Sample 𝐕(1)|𝐰(1),𝐘\mathbf{V}^{(1)} | \mathbf{w}^{(1)}, \mathbf{Y};
  • Continue by sequentially sampling 𝐰(i)|𝐕(iβˆ’1),𝐘\mathbf{w}^{(i)}|\mathbf{V}^{(i-1)},\mathbf{Y}, and then 𝐕(i)|𝐰(i),𝐘\mathbf{V}^{(i)}|\mathbf{w}^{(i)}, \mathbf{Y} for i=1,…,ki=1,\ldots,k.

One should stop when equilibrium is reached. To obtain evidence that equilibrium has been reached, it is best to run more than one chain, started from different values, and check that they mix well. Autocorrelation plots can also be useful.

Depending on the starting values, one might also use a burn-in: run the chain for some iterations, keep the last state, discard the earlier samples, and use that state as the starting value.

It is important to observe that the samples $\{\mathbf{w}^{(i)},\mathbf{V}^{(i)}\}_{i=1}^k$ will not be independent. However, under very general assumptions, the Gibbs sampler provides samples satisfying the law of large numbers for functionals of the sample. Therefore, one can use these samples to compute means and variances.

Standard MC estimates

Suppose we have a sample $\mathbf{w}^{(1)},\ldots, \mathbf{w}^{(k)}$ generated using the Gibbs sampler algorithm. We then approximate the mean as $$E[\mathbf{A}_p \mathbf{w} | \mathbf{Y}] \approx \frac{1}{k} \sum_{i=1}^k \mathbf{A}_p\mathbf{w}^{(i)}$$ and the variance as $$V[\mathbf{A}_p \mathbf{w}|\mathbf{Y}] \approx \frac{1}{k} \sum_{i=1}^k \big(\mathbf{A}_p\mathbf{w}^{(i)} - E[\mathbf{A}_p\mathbf{w}|\mathbf{Y}]\big)^2,$$ where the expectation inside the variance expression is replaced by its Monte Carlo estimate above.
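
In R this amounts to column means and variances over the Gibbs output. The sketch below assumes W_samples is a $k\times n$ matrix with one sampled $\mathbf{w}^{(i)}$ per row and A_p the prediction matrix.

# Standard MC estimates of E[A_p w | Y] and V[A_p w | Y]
mc_pred <- function(A_p, W_samples) {
  AW <- as.matrix(W_samples %*% t(A_p))  # k x p matrix, row i is A_p w^(i)
  list(mean = colMeans(AW),
       var  = apply(AW, 2, var))         # var() uses the usual k - 1 denominator
}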

Rao-Blackwellization for means and variances

The second way consists in performing a Rao-Blackwellization (Robert and Casella, 2004) for the means and variances. Following Wallin and Bolin (2015), we have that $$\begin{array}{ccl} E[\mathbf{A}_p\mathbf{w}|\mathbf{Y}] &=& \int_\mathbf{w} \mathbf{A}_p \mathbf{w}\, \pi(\mathbf{w}|\mathbf{Y})\,d\mathbf{w}\\ &=& \int_\mathbf{w} \int_{\mathbf{V}} \mathbf{A}_p \mathbf{w}\, \pi(\mathbf{w}|\mathbf{V},\mathbf{Y})\pi(\mathbf{V}|\mathbf{Y})\,d\mathbf{V}\,d\mathbf{w}\\ &=& \int_\mathbf{V} \mathbf{A}_p \widetilde{\mathbf{m}}\, \pi(\mathbf{V}|\mathbf{Y})\,d\mathbf{V}, \end{array}$$ where $\widetilde{\mathbf{m}} = E[\mathbf{w}|\mathbf{V},\mathbf{Y}]$, whose expression was given in the description of the Gibbs sampler.

Thus, we have the approximation $$E[\mathbf{A}_p\mathbf{w}|\mathbf{Y}] \approx \frac{1}{k}\sum_{i=1}^k \mathbf{A}_p \widetilde{\mathbf{m}}^{(i)},$$ where $\widetilde{\mathbf{m}}^{(i)} = \widetilde{\mathbf{m}}(\mathbf{V}^{(i)})$, that is, $\widetilde{\mathbf{m}}^{(i)}$ is computed based on $\mathbf{V}^{(i)}$.

Notice, also, that $\widetilde{\mathbf{m}}$ is already computed during the Gibbs sampling, so this estimator incurs no additional cost.

By a similar reasoning we have that $$V(\mathbf{A}_p \mathbf{w}|\mathbf{Y}) \approx \frac{1}{k}\sum_{i=1}^k\mathbf{A}_p (\widetilde{\mathbf{Q}}^{(i)})^{-1}\mathbf{A}_p^\top,$$ where $(\widetilde{\mathbf{Q}}^{(i)})^{-1}$ is the conditional covariance matrix of $\mathbf{w}$, given $\mathbf{V}$ and $\mathbf{Y}$, evaluated at $\mathbf{V}^{(i)}$.
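
A sketch of the corresponding computation, assuming Q_tilde_list stores the conditional precision matrices $\widetilde{\mathbf{Q}}^{(i)}$ saved during the Gibbs sweep:

# Rao-Blackwellized prediction variance: average A_p Q_tilde^{-1} A_p^T over the samples of V
rb_var <- function(A_p, Q_tilde_list) {
  k <- length(Q_tilde_list)
  Reduce(`+`, lapply(Q_tilde_list, function(Q) A_p %*% solve(Q, t(A_p)))) / k
}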

Examples in R

SPDE based model driven by NIG noise with Gaussian measurement errors in 1D

For this example we will consider the latent process $X$ solving the equation $$(\kappa^2 - \partial^2/\partial t^2)X(t) = \dot{\mathcal{M}}(t),$$ where $\dot{\mathcal{M}}$ is a NIG-distributed white noise. For more details on the NIG distribution, as well as on how to sample from such a process, we refer the reader to the Sampling from processes given by solutions of SPDEs driven by non-Gaussian noise vignette. We will also be using the notation from that vignette.

We will thus assume $V_i$ to follow an inverse Gaussian distribution with parameters $\nu$ and $\nu h_i^2$.
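
As a quick sanity check of this parameterisation (the sample size and the values of $h_i$ below are arbitrary), the empirical mean of draws from $IG(\nu, \nu h_i^2)$ should be close to $h_i$:

library(ngme2)
nu <- 0.5
h_check <- c(0.5, 1, 2)
# E[V_i] = h_i under V_i ~ IG(nu, nu * h_i^2)
emp_mean <- sapply(h_check, function(hi) mean(ngme2::rig(10000, nu, nu * hi^2, seed = 1)))
rbind(h = h_check, empirical_mean = emp_mean)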

We will take $\kappa=1$, $\sigma_\varepsilon=1$, $\sigma=1$, $\mu = 1$, $\mathcal{D}=[0,1]$ and $\nu=0.5$. We will also assume that we have 9 observations $$Y_i = X(t_i) + \varepsilon_i,\quad i=1,\ldots,9,$$ where $t_1,\ldots,t_{9}\in [0,1]$. Notice that $\varepsilon_i \sim N(0,1)$.

Let us now build the matrix $K$, generate $V$ and then sample from the NIG model:

We will consider a mesh with 10 equally spaced nodes. The code to sample from such a process is given below.

library(fmesher)
library(ngme2)
library(ggplot2)
library(tidyr)
library(Matrix)
set.seed(123)

loc_1d <- 0:9/9
mesh_1d <- fm_mesh_1d(loc_1d)

# specify the model we use
mu <- 1; sigma <- 1; nu <- 0.5
spde_1d <- ngme2::f(
  loc_1d,
  model = "matern",
  theta_K = log(1),
  mesh = mesh_1d,
  noise = noise_nig(mu=mu, sigma=sigma, nu=nu)
)
K <- spde_1d$operator$K

W <- simulate(spde_1d, seed = 10)[[1]]
plot(loc_1d, W, type = "l")

str(W)
#>  num [1:10] 0.03 0.048 0.0594 0.0722 0.045 ...

V <- attr(W, "noise")$V
plot(loc_1d, V, type = "l")

We will now generate 9 uniformly distributed random locations for the observations. Once we determine the locations, we will need to build the $A$ matrix to obtain the values of the process $X$ at those locations. We can build the $A$ matrix by using the function fm_basis.

new_loc_1d <- sort(runif(9))

A_matrix <- fmesher::fm_basis(
  mesh = fm_mesh_1d(loc_1d),
  loc = new_loc_1d
)
A_matrix
#> 9 x 10 sparse Matrix of class "dgCMatrix"
#>                                                                  
#>  [1,] . . . 0.8416668 0.1583332 .         .         .         . .
#>  [2,] . . . 0.5365185 0.4634815 .         .         .         . .
#>  [3,] . . . 0.2234306 0.7765694 .         .         .         . .
#>  [4,] . . . .         0.2899858 0.7100142 .         .         . .
#>  [5,] . . . .         .         0.4407827 0.5592173 .         . .
#>  [6,] . . . .         .         0.4294092 0.5705908 .         . .
#>  [7,] . . . .         .         .         0.8148042 0.1851958 . .
#>  [8,] . . . .         .         .         0.7875746 0.2124254 . .
#>  [9,] . . . .         .         .         0.2507151 0.7492849 . .
sigma_eps = 1

Y <- A_matrix %*% W + sigma_eps * rnorm(9)

plot(new_loc_1d, Y, type="h")
abline(0,0)

Now, let us assume we want to obtain a prediction for $X$ at the point $t^\ast = 5/9$. To this end we will obtain predictions using both methods, namely, the standard MC and the Rao-Blackwellization.

For both of these methods we need Gibbs samples, so let us generate them. We will run 2 chains with different starting values and no burn-in samples.

# First step - Starting values for V
# Considering a sample of inv-Gaussian as starting values for both chains
h_nig_1d <- spde_1d$operator$h
V_1 <- matrix(ngme2::rig(10, nu, nu*h_nig_1d^2, seed = 1), nrow = 1)
V_2 <- matrix(ngme2::rig(10, nu, nu*h_nig_1d^2, seed = 2), nrow = 1)

Recall that $V_j \sim IG(\nu, h_j^2 \cdot \nu)$, $j=1,\ldots,10$, $\mu=1$ and $\sigma=1$, so that $$V_j|\mathbf{w},\mathbf{Y} \sim GIG\big(-1,\,\nu + 1,\, \nu\cdot h_j^2 + (\mathbf{K}\mathbf{w}+\mathbf{h})_j^2\big)$$ and $$\mathbf{w} | \mathbf{V}, \mathbf{Y} \sim N\big( (\mathbf{K}{\rm diag}(\mathbf{V})^{-1}\mathbf{K} + \mathbf{A}^\top\mathbf{A})^{-1}(\mathbf{K}{\rm diag}(\mathbf{V})^{-1}(\mathbf{V}-\mathbf{h})+\mathbf{A}^\top \mathbf{Y}),\; (\mathbf{K}{\rm diag}(\mathbf{V})^{-1}\mathbf{K} + \mathbf{A}^\top\mathbf{A})^{-1}\big).$$

Let us sample 1000 times for each chain. Let us also record the values of $E[\mathbf{w} | \mathbf{V}, \mathbf{Y}]$ during the sampling.

W_1 <- matrix(0, nrow=1, ncol=10)
W_2 <- matrix(0, nrow=1, ncol=10)
N_sim = 1000

# Vector of conditional means E[w|V,Y]
m_W_1 <- matrix(0, nrow=1, ncol=10)
m_W_2 <- matrix(0, nrow=1, ncol=10)

# Recall that sigma_eps = 1

Asq <- t(A_matrix)%*%A_matrix / sigma_eps^2

# Recall that mu = 1 and sigma = 1
# Gibbs sampling
for(i in 1:N_sim){
  # Conditional precision of w | V for each chain: Q = K diag(1/V) K / sigma^2
  Q_1 <- K%*%diag(1/V_1[i,])%*%K/sigma^2
  Q_2 <- K%*%diag(1/V_2[i,])%*%K/sigma^2

  # Right-hand side Q m + A^T Y / sigma_eps^2, with m = K^{-1} mu (V - h)
  resp_1 <- Q_1%*%solve(K,(-h_nig_1d + V_1[i,])*mu) + t(A_matrix)%*%Y/sigma_eps^2
  resp_2 <- Q_2%*%solve(K,(-h_nig_1d + V_2[i,])*mu) + t(A_matrix)%*%Y/sigma_eps^2

  # Conditional mean E[w | V, Y] = (Q + A^T A / sigma_eps^2)^{-1} (Q m + A^T Y / sigma_eps^2)
  m_W_1 <- rbind(m_W_1, t(solve(Q_1 + Asq, resp_1)))
  m_W_2 <- rbind(m_W_2, t(solve(Q_2 + Asq, resp_2)))

  # Sample w | V, Y using the Cholesky factor of the conditional precision
  Chol_1 <- chol(Q_1 + Asq)
  Chol_2 <- chol(Q_2 + Asq)

  W_1 <- rbind(W_1, m_W_1[i+1,] + t(solve(Chol_1, rnorm(10))))
  W_2 <- rbind(W_2, m_W_2[i+1,] + t(solve(Chol_2, rnorm(10))))

  # Sample V | w, Y from the GIG full conditional
  V_1 <- rbind(V_1, ngme2::rgig(10,
                           -1,
                           nu + (mu/sigma)^2,
                           nu*h_nig_1d^2 + as.vector((K%*%W_1[i+1,] +h_nig_1d)^2)/sigma^2))
  V_2 <- rbind(V_2, ngme2::rgig(10,
                                -1,
                                nu + (mu/sigma)^2,
                                nu*h_nig_1d^2 + as.vector((K%*%W_2[i+1,] +h_nig_1d)^2)/sigma^2))
}

Let us organize the data to build traceplots.

df_V <- data.frame(V = V_1[,1], chain = "1", coord = 1, idx = 1:(N_sim+1))
temp <- data.frame(V = V_2[,1], chain = "2", coord = 1, idx = 1:(N_sim+1))
df_V <- rbind(df_V, temp)
for(i in 2:10){
  temp_1 <- data.frame(V = V_1[,i], chain = "1", coord = i, idx = 1:(N_sim+1))
  temp_2 <- data.frame(V = V_2[,i], chain = "2", coord = i, idx = 1:(N_sim+1))
  df_V <- rbind(df_V, temp_1, temp_2)
}

df_W <- data.frame(W = W_1[,1], chain = "1", coord = 1, idx = 1:(N_sim+1))
temp <- data.frame(W = W_2[,1], chain = "2", coord = 1, idx = 1:(N_sim+1))
df_W <- rbind(df_W, temp)
for(i in 2:10){
  temp_1 <- data.frame(W = W_1[,i], chain = "1", coord = i, idx = 1:(N_sim+1))
  temp_2 <- data.frame(W = W_2[,i], chain = "2", coord = i, idx = 1:(N_sim+1))
  df_W <- rbind(df_W, temp_1, temp_2)
}

Let us compare the posterior means with the simulated values:

V
#> NULL
colMeans(V_1)
#>  [1] 0.02773557 0.05786363 0.05905532 0.05825626 0.08313278 0.05708960
#>  [7] 0.08074217 0.08163200 0.08393297 0.03030863
colMeans(V_2)
#>  [1] 0.02180928 0.06761954 0.06702965 0.07914965 0.05850489 0.07764963
#>  [7] 0.08838390 0.08274400 0.08892969 0.04137508

as.vector(W)
#>  [1]  0.029985516  0.048025393  0.059415769  0.072248733  0.045016944
#>  [6]  0.030076087  0.011708335  0.005539407 -0.012462080 -0.016644480
colMeans(W_1)
#>  [1] -0.3255726 -0.3237233 -0.3200719 -0.3142525 -0.3074558 -0.3003404
#>  [7] -0.2923173 -0.2875614 -0.2844100 -0.2834330
colMeans(W_2)
#>  [1] -0.3310375 -0.3284457 -0.3243567 -0.3186586 -0.3126783 -0.3051744
#>  [7] -0.2979881 -0.2928110 -0.2901328 -0.2894928

Let us begin by building traceplots for $V$:

ggplot(df_V, aes(x = idx, y = V, col = chain)) +
  geom_line() + facet_wrap(~ coord, ncol=2)

and for $W$:

ggplot(df_W, aes(x = idx, y = W, col = chain)) +
  geom_line() + facet_wrap(~ coord, ncol=2)

The traceplots appear healthy: the chains do not get stuck anywhere and seem to mix well.

We can also build autocorrelation plots. To such an end let us prepare the data frames.

acf_V <- as.vector(acf(V_1[,1], plot=FALSE)$acf)
df_V_acf <- data.frame(acf = acf_V,
                       chain = "1", coord = 1, lag = 0:(length(acf_V)-1))
acf_V <- as.vector(acf(V_2[,1], plot=FALSE)$acf)
temp <- data.frame(acf = as.vector(acf(V_2[,1], plot=FALSE)$acf),
                   chain = "2", coord = 1, lag = 0:(length(acf_V)-1))
df_V_acf <- rbind(df_V_acf, temp)
for(i in 2:10){
  acf_V <- as.vector(acf(V_1[,i], plot=FALSE)$acf)
  temp_1 <- data.frame(acf = acf_V,
                       chain = "1", coord = i, lag = 0:(length(acf_V)-1))
  acf_V <- as.vector(acf(V_2[,i], plot=FALSE)$acf)
  temp_2 <- data.frame(acf = acf_V,
                       chain = "2", coord = i, lag = 0:(length(acf_V)-1))
  df_V_acf <- rbind(df_V_acf, temp_1, temp_2)
}

acf_W <- as.vector(acf(W_1[,1], plot=FALSE)$acf)
df_W_acf <- data.frame(acf = acf_W,
                       chain = "1", coord = 1, lag = 0:(length(acf_W)-1))
acf_W <- as.vector(acf(W_2[,1], plot=FALSE)$acf)
temp <- data.frame(acf = acf_W, chain = "2",
                   coord = 1, lag = 0:(length(acf_W)-1))
df_W_acf <- rbind(df_W_acf, temp)
for(i in 2:10){
  acf_W <- as.vector(acf(W_1[,i], plot=FALSE)$acf)
  temp_1 <- data.frame(acf = acf_W,
                       chain = "1", coord = i, lag = 0:(length(acf_W)-1))
  acf_W <- as.vector(acf(W_2[,i], plot=FALSE)$acf)
  temp_2 <- data.frame(acf = acf_W,
                       chain = "2",
                   coord = i, lag = 0:(length(acf_W)-1))
  df_W_acf <- rbind(df_W_acf, temp_1, temp_2)
}

Now, let us plot the autocorrelations for $V$:

ggplot(df_V_acf, aes(x=lag,y=acf, col=chain)) +
          geom_bar(stat = "identity", position = "identity") +
  xlab('Lag') + ylab('ACF') + facet_wrap(~coord, ncol=2)

Now, the autocorrelation plots for $W$:

ggplot(df_W_acf, aes(x=lag,y=acf, col=chain)) +
          geom_bar(stat = "identity", position = "identity") +
  xlab('Lag') + ylab('ACF') + facet_wrap(~coord, ncol=2)

We can see that the autocorrelation is very low.

We will now move forward to obtain the predictions at $t^\ast = 5/9$. We will compute both the standard MC and the Rao-Blackwellized predictions.

# Computing A matrix at t^\ast

A_pred <- fm_basis(mesh = mesh_1d, loc = 5/9)

# MC_estimate:

AW_1 <- A_pred%*%t(W_1)
MC_pred_1 <- cumsum(AW_1)/(1:length(AW_1))

RB_1 <- A_pred%*%t(m_W_1)
RB_pred_1 <- cumsum(RB_1)/(1:length(AW_1))

df_pred <- data.frame(idx = 1:length(AW_1), MC = MC_pred_1,
                      RB = RB_pred_1, chain = "1")

AW_2 <- A_pred%*%t(W_2)
MC_pred_2 <- cumsum(AW_2)/(1:length(AW_2))

RB_2 <- A_pred%*%t(m_W_2)
RB_pred_2 <- cumsum(RB_2)/(1:length(AW_2))

temp <- data.frame(idx = 1:length(AW_2), MC = MC_pred_2,
                      RB = RB_pred_2, chain = "2")

df_pred <- rbind(df_pred, temp)

df_pred <- tidyr::pivot_longer(df_pred,
     cols = c("MC", "RB"),
     names_to = "Method",
     values_to = "Prediction")

ggplot(df_pred, aes(x = idx, y = Prediction, col=Method)) +
  facet_wrap(~chain) + geom_line()

We notice that the Rao-Blackwellization approach converges much faster than the standard MC approach.

Model estimation

In ngme2, we employ maximum likelihood estimation combined with stochastic gradient descent to estimate the parameters.

Stochastic gradient descent and maximum likelihood

For maximum likelihood estimation the goal is to minimize $f({\boldsymbol{\theta}}) = -L({\boldsymbol{\theta}}; \mathbf{Y})$, where $L$ is the log-likelihood function of $\mathbf{Y}$.
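
Schematically (this is a generic sketch, not ngme2's actual optimiser), a stochastic gradient step moves the parameter along a noisy estimate of the gradient of the log-likelihood:

# Generic stochastic gradient ascent on the log-likelihood
# grad_fn(theta) returns a noisy (e.g. Monte Carlo) estimate of the gradient
sgd <- function(theta0, grad_fn, n_iter = 100, step = 0.01) {
  theta <- theta0
  for (i in seq_len(n_iter)) {
    theta <- theta + step * grad_fn(theta)  # ascent on L, i.e. descent on f = -L
  }
  theta
}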

For non-Gaussian models there is an additional complication that the log-likelihood function LL is not known in explicit form. A way to solve this problem is to use Fisher’s identity (Fisher, 1925). See also Douc et al.Β (2014) for further details.

Let $\mathbf{U} = (U_1,\ldots, U_n)$ be a sequence of observed random variables with latent variables $\mathbf{Z} = (Z_1,\ldots,Z_n)$, each $Z_i$ being a random variable in $\mathbb{R}^p$. Assume that the joint distribution of $\mathbf{U}$ and $\mathbf{Z}$ is parameterized by some ${\boldsymbol{\theta}} \in \Theta$, where $\Theta\subset\mathbb{R}^p$. Assume also that the complete log-likelihood $L({\boldsymbol{\theta}}; \mathbf{U},\mathbf{Z})$ (with respect to some reference $\sigma$-finite measure) is differentiable with respect to ${\boldsymbol{\theta}}$ and regular, in the sense that one may differentiate through the integral sign. Then the marginal log-likelihood with respect to $\mathbf{U}$ satisfies $$\nabla_{\boldsymbol{\theta}} L({\boldsymbol{\theta}}; \mathbf{U}) = E_{\mathbf{Z}}[\nabla_{\boldsymbol{\theta}} L({\boldsymbol{\theta}}; \mathbf{U}, \mathbf{Z})|\mathbf{U}].$$
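
To see why this identity holds, write $p(\mathbf{u},\mathbf{z};{\boldsymbol{\theta}})$ for the joint density and note that, under the stated regularity conditions, $$\nabla_{\boldsymbol{\theta}} L({\boldsymbol{\theta}}; \mathbf{U}) = \frac{\nabla_{\boldsymbol{\theta}}\int p(\mathbf{U},\mathbf{z};{\boldsymbol{\theta}})\,d\mathbf{z}}{p(\mathbf{U};{\boldsymbol{\theta}})} = \int \nabla_{\boldsymbol{\theta}} \log p(\mathbf{U},\mathbf{z};{\boldsymbol{\theta}})\, \frac{p(\mathbf{U},\mathbf{z};{\boldsymbol{\theta}})}{p(\mathbf{U};{\boldsymbol{\theta}})}\,d\mathbf{z} = E_{\mathbf{Z}}[\nabla_{\boldsymbol{\theta}} L({\boldsymbol{\theta}}; \mathbf{U}, \mathbf{Z})|\mathbf{U}].$$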

Standard MC approximation of the gradient

In our context, we assume $\mathbf{w}$ and $\mathbf{V}$ to be hidden. Therefore, we may apply Fisher's identity above to the latent variable $(\mathbf{V},\mathbf{w})$ to obtain that $$\nabla_{\boldsymbol{\theta}} L({\boldsymbol{\theta}}; \mathbf{Y}) = E_{\mathbf{V},\mathbf{w}}[\nabla_{\boldsymbol{\theta}} L({\boldsymbol{\theta}}; \mathbf{Y}, \mathbf{V}, \mathbf{w})|\mathbf{Y}].$$ Thus, the idea here is to use the samples of $\mathbf{V}$ and $\mathbf{w}$ obtained from the Gibbs sampler to approximate the gradient as

βˆ‡π›‰L(𝛉;𝐘)β‰ˆ1kβˆ‘j=1kβˆ‡π›‰L(𝛉;𝐘,𝐕(j),𝐰(j)).\nabla_{{\boldsymbol{\theta}}}L({\boldsymbol{\theta}};\mathbf{Y}) \approx \frac{1}{k} \sum_{j=1}^k \nabla_{{\boldsymbol{\theta}}} L({\boldsymbol{\theta}};\mathbf{Y},\mathbf{V}^{(j)}, \mathbf{w}^{(j)}). To this end, we will compute the gradients βˆ‡π›‰L(𝛉;𝐘,𝐕,𝐰)\nabla_{{\boldsymbol{\theta}}} L({\boldsymbol{\theta}};\mathbf{Y},\mathbf{V}, \mathbf{w}).

We have that $$\mathbf{Y}|\mathbf{w} \sim N(\mathbf{A}\mathbf{w}, \sigma_\varepsilon^{2} \mathbf{I}),$$ $$\mathbf{w}|\mathbf{V} \sim N(\mathbf{K}^{-1}(-\mu \mathbf{h}+\mu\mathbf{V}), \sigma^2\mathbf{K}^{-1}{\rm diag}(\mathbf{V})\mathbf{K}^{-1}),$$ and $\mathbf{V}$ follows a GIG distribution such that for every $i$, $E[V_i]=h_i$.

Therefore, we have that $$\begin{array}{ccl} L((\mu,\sigma_\varepsilon); \mathbf{w}, \mathbf{V},\mathbf{Y}) &=& -n\log(\sigma_\varepsilon)-0.5\sigma_\varepsilon^{-2} (\mathbf{Y} - \mathbf{A}\mathbf{K}^{-1}(-\mu \mathbf{h}+\mu\mathbf{V})-\mathbf{A}(\mathbf{w}-\mathbf{m}))^\top\\ && \quad\times\,{\rm diag} (1/V_i)\, (\mathbf{Y} - \mathbf{A}\mathbf{K}^{-1}(-\mu \mathbf{h}+\mu\mathbf{V})-\mathbf{A}(\mathbf{w}-\mathbf{m})) + const, \end{array}$$ where $const$ does not depend on $(\mu,\sigma_\varepsilon)$. Thus, we have that $$\nabla_\mu L((\mu,\sigma_\varepsilon); \mathbf{w}, \mathbf{V},\mathbf{Y}) = \sigma_\varepsilon^{-2}\mathbf{A}\mathbf{K}^{-1}(-\mathbf{h}+\mathbf{V})\, {\rm diag} (1/V_i)(\mathbf{Y} - \mathbf{A}\mathbf{K}^{-1}(-\mu \mathbf{h}+\mu\mathbf{V}) - \mathbf{A}(\mathbf{w}-\mathbf{m})).$$ Now, with respect to $\sigma_\varepsilon$ we have that $$\nabla_{\sigma_\varepsilon} L((\mu,\sigma_\varepsilon); \mathbf{w}, \mathbf{V},\mathbf{Y}) = -\frac{n}{\sigma_\varepsilon} + \frac{1}{\sigma_\varepsilon^3} (\mathbf{Y} - \mathbf{A}\mathbf{K}^{-1}(-\mu \mathbf{h}+\mu\mathbf{V})-\mathbf{A}(\mathbf{w}-\mathbf{m}))^\top{\rm diag} (1/V_i) (\mathbf{Y} - \mathbf{A}\mathbf{K}^{-1}(-\mu \mathbf{h}+\mu\mathbf{V})-\mathbf{A}(\mathbf{w}-\mathbf{m})).$$ By proceeding analogously, we obtain that the gradient with respect to $\kappa^2$ is given by $$\nabla_{\kappa^2} L(\kappa^2; \mathbf{Y}, \mathbf{w}, \mathbf{V}) = tr(\mathbf{C}\mathbf{K}^{-1})- \mathbf{w}^\top \mathbf{C}^\top{\rm diag} (1/V_i)(\mathbf{K}\mathbf{w}+(\mathbf{h}-\mathbf{V})\mu).$$ Finally, for the gradient of the parameter of the distribution of $\mathbf{V}$, we use the Rao-Blackwellized version; see the next subsection.

Rao-Blackwellized approximation of the gradient

Now, observe that we can compute the gradient of the log-likelihood $L({\boldsymbol{\theta}}; \mathbf{Y}, \mathbf{V})$. Indeed, we apply Fisher's identity again to find that $$\nabla_{\boldsymbol{\theta}} L({\boldsymbol{\theta}}; \mathbf{Y}, \mathbf{V}) = E_\mathbf{w}[\nabla_{\boldsymbol{\theta}} L({\boldsymbol{\theta}}; \mathbf{Y}, \mathbf{V}, \mathbf{w})|\mathbf{Y},\mathbf{V}].$$ So, with the above gradients, we can approximate the gradient $\nabla_{\boldsymbol{\theta}} L({\boldsymbol{\theta}}; \mathbf{Y})$ by taking the mean over the samples of $\mathbf{V}$ obtained by the Gibbs sampling: $$\nabla_{{\boldsymbol{\theta}}}L({\boldsymbol{\theta}};\mathbf{Y}) \approx \frac{1}{k} \sum_{j=1}^k \nabla_{{\boldsymbol{\theta}}} L({\boldsymbol{\theta}};\mathbf{Y},\mathbf{V}^{(j)}).$$

Let us now compute the gradients $\nabla_{{\boldsymbol{\theta}}} L({\boldsymbol{\theta}};\mathbf{Y},\mathbf{V})$. We begin by computing $\nabla_\mu L((\mu,\sigma_\varepsilon); \mathbf{V},\mathbf{Y})$. To this end, we use the expression for $\nabla_\mu L((\mu,\sigma_\varepsilon); \mathbf{w}, \mathbf{V},\mathbf{Y})$ given in the previous subsection, together with $E[\mathbf{w}|\mathbf{V},\mathbf{Y}] = \widetilde{\mathbf{m}}$, to conclude that $$\nabla_\mu L((\mu,\sigma_\varepsilon); \mathbf{V},\mathbf{Y}) = \sigma_\varepsilon^{-2}\mathbf{A}\mathbf{K}^{-1}(-\mathbf{h}+\mathbf{V})\, {\rm diag} (1/V_i)(\mathbf{Y} - \mathbf{A}\mathbf{K}^{-1}(-\mu \mathbf{h}+\mu\mathbf{V}) - \mathbf{A}(\widetilde{\mathbf{m}}-\mathbf{m})).$$ Analogously, we also obtain that $$\nabla_{\sigma_\varepsilon} L((\mu,\sigma_\varepsilon); \mathbf{V},\mathbf{Y}) = -\frac{n}{\sigma_\varepsilon} + \frac{1}{\sigma_\varepsilon^3} (\mathbf{Y} - \mathbf{A}\mathbf{K}^{-1}(-\mu \mathbf{h}+\mu\mathbf{V})-\mathbf{A}(\widetilde{\mathbf{m}}-\mathbf{m}))^\top{\rm diag} (1/V_i) (\mathbf{Y} - \mathbf{A}\mathbf{K}^{-1}(-\mu \mathbf{h}+\mu\mathbf{V})-\mathbf{A}(\widetilde{\mathbf{m}}-\mathbf{m})).$$

Now, notice that $$\nabla_{\kappa^2} L(\kappa^2; \mathbf{Y}, \mathbf{w}, \mathbf{V}) = tr(\mathbf{C}\mathbf{K}^{-1})- \mathbf{w}^\top \mathbf{C}^\top{\rm diag} (1/V_i)\mathbf{K}\mathbf{w}-\mathbf{w}^\top \mathbf{C}^\top{\rm diag} (1/V_i)(\mathbf{h}-\mathbf{V})\mu,$$ that $E[\mathbf{w}|\mathbf{V},\mathbf{Y}] = \widetilde{\mathbf{m}}$ and that $$E[\mathbf{w}^\top \mathbf{C}^\top{\rm diag} (1/V_i)\mathbf{K}\mathbf{w}|\mathbf{V},\mathbf{Y}] = tr(\mathbf{C}^\top{\rm diag} (1/V_i)\mathbf{K}\widetilde{\mathbf{Q}}^{-1}) + \widetilde{\mathbf{m}}^\top\mathbf{C}^\top{\rm diag} (1/V_i)\mathbf{K}\widetilde{\mathbf{m}},$$ to conclude that $$\nabla_{\kappa^2} L(\kappa^2; \mathbf{Y}, \mathbf{V}) = tr(\mathbf{C}\mathbf{K}^{-1})- tr(\mathbf{C}^\top{\rm diag} (1/V_i)\mathbf{K}\widetilde{\mathbf{Q}}^{-1}) - \widetilde{\mathbf{m}}^\top\mathbf{C}^\top{\rm diag} (1/V_i)\mathbf{K}\widetilde{\mathbf{m}}-\widetilde{\mathbf{m}}^\top \mathbf{C}^\top{\rm diag} (1/V_i)(\mathbf{h}-\mathbf{V})\mu.$$

Finally, the gradient for the parameter of the distribution of $\mathbf{V}$ depends on the distribution of $\mathbf{V}$. To illustrate, we present the gradient with respect to the parameter $\nu$ when $\mathbf{V}$ follows an inverse Gaussian distribution, which is the situation in which we have NIG noise.

For this case we have that $$\nabla_\nu L(\nu; \mathbf{Y},\mathbf{V}) = -\sum_{j=1}^n \frac{1}{2}\Bigg(\nu^{-1} -\frac{h_{j}^2}{V_j} +V_j -h_j\Bigg).$$
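
A Monte Carlo estimate of this gradient, averaging the expression above over the Gibbs samples of $\mathbf{V}$, can be sketched as follows (V_samples is assumed to be a $k\times n$ matrix with one sampled $\mathbf{V}^{(j)}$ per row and h the vector of the $h_j$):

# MC estimate of the nu-gradient for NIG noise, using the formula above
grad_nu <- function(nu, V_samples, h) {
  per_sample <- apply(V_samples, 1, function(V) -0.5 * sum(1 / nu - h^2 / V + V - h))
  mean(per_sample)
}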

A remark on traces

In the gradients $\nabla_{\kappa^2} L(\kappa^2; \mathbf{Y}, \mathbf{V})$ and $\nabla_{\kappa^2} L(\kappa^2; \mathbf{Y}, \mathbf{w}, \mathbf{V})$ above, we can see the traces $tr(\mathbf{C}\mathbf{K}^{-1})$ and $tr(\mathbf{C}^\top{\rm diag} (1/V_i)\mathbf{K}\widetilde{\mathbf{Q}}^{-1})$. These traces contain the inverses $\mathbf{K}^{-1}$ and $\widetilde{\mathbf{Q}}^{-1}$.

There are efficient alternatives for handling these traces. For instance, to compute $tr(AB^{-1})$ when $B$ is symmetric and has the same sparsity pattern as $A$, we only need the entries of $B^{-1}$ at the positions where $A$ is non-zero. This is the case, for instance, for $tr(\mathbf{C}\mathbf{K}^{-1})$, so this trace can be computed efficiently. This approach is implemented in the ngme2 package.
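
For the small example in this vignette the traces can simply be computed directly via the identity $tr(AB^{-1}) = tr(B^{-1}A)$; the helper below is only a sketch and does not reflect the partial-inverse computation used internally.

# Direct computation of tr(A B^{-1}) for small sparse matrices
tr_ABinv <- function(A, B) sum(diag(solve(B, A)))
# e.g. tr(C K^{-1}) would be tr_ABinv(C, K)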

References

  • Lindgren, F., Rue, H., and Lindstrom, J. (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423–498.

  • Robert, C., G. Casella (2004). Monte Carlo statistical methods, Springer Texts in Statistics, Springer, New York, USA.

  • Wallin, J., Bollin, D. (2015). Geostatistical Modelling Using Non-Gaussian MatΓ©rn Fields. Scandinavian Journal of Statistics. 42(3):872-890.

  • Whittle, P. (1963). Stochastic-processes in several dimensions. Bulletin of the International Statistical Institute, 40(2):974–994.