Introduction

Cross-validation is a statistical method for evaluating and comparing machine learning models by testing them on different subsets of the data. The goal is to check that a model generalizes well to unseen data rather than just performing well on the training set.

Since the ngme2 package can fit highly flexible models driven by Gaussian noise, non-Gaussian noise, or a mixture of the two, we need a principled way to evaluate and compare these models, and cross-validation provides it. In this vignette, we show how to perform cross-validation for ngme2 models.

Doing cross-validation in the ngme2 package is straightforward: the cross_validation function performs it for fitted ngme2 models. Its key arguments are:

1. model: a fitted ngme2 model (or a named list of fitted models), as returned by the ngme function.
2. type: the type of cross-validation to perform; one of k-fold, loo (leave-one-out), lpo (leave-percent-out), or custom (user-defined training and testing sets; see the sketch after this list).
3. k (optional): the number of folds for k-fold cross-validation.
4. percent (optional): the percentage of data to leave out for lpo cross-validation.
5. times (optional): the number of repetitions for lpo cross-validation.
6. train_idx (optional): the training indices for custom cross-validation.
7. test_idx (optional): the testing indices for custom cross-validation.
8. n_burnin: the number of burn-in samples to discard.
9. n_gibbs_samples: the number of posterior Gibbs samples to draw after the burn-in period; predictions are based on these samples.
10. N: the number of times to repeat the whole procedure, to obtain more stable results.
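For instance, a custom split might look like the sketch below. Here fit is a placeholder for any model already fitted with ngme, and it is assumed that train_idx and test_idx each take a list of index vectors (one entry per split); check ?cross_validation for the exact expected format.

# sketch only: 'fit' is a placeholder for a fitted ngme model
cv_custom <- cross_validation(
  list(model = fit),
  type = "custom",
  train_idx = list(1:80),   # observations used for fitting (assumed format)
  test_idx = list(81:100),  # observations used for scoring (assumed format)
  n_gibbs_samples = 100,
  seed = 10
)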

AR(1) Example

Let’s run a simple example using the AR(1) model. We first generate some data and fit an AR(1) model with the ngme function, and then perform cross-validation using the cross_validation function.

library(ngme2)
set.seed(10)
n_obs <- 100
# AR(1) latent model on 1..n_obs with autocorrelation rho = 0.5
ar1_model <- f(1:n_obs, model = "ar1", rho = 0.5)
# simulate one realization of the latent process
sim_fields <- simulate(ar1_model)[[1]]
plot(sim_fields, type = "l", col = "blue", lwd = 2, xlab = "Time", ylab = "Value")

# observations: latent process plus Gaussian measurement noise (sd = 0.5)
y <- sim_fields + rnorm(n_obs, 0, 0.5)

# Let's estimate the model in 3 ways

# 1. Poor estimation (only 10 iterations)
fit_ar_poor <- ngme(
  y ~ 0 + f(1:n_obs, model="ar1"),
  data = data.frame(y),
  family = "normal",
  control_opt = control_opt(
    iterations = 10
  )
)
#> Starting estimation... 
#> 
#> Starting posterior sampling... 
#> Posterior sampling done! 
#> Note:
#>       1. Use ngme_post_samples(..) to access the posterior samples.
#>       2. Use ngme_result(..) to access different latent models.
fit_ar_poor
#> *** Ngme object ***
#> 
#> Fixed effects: 
#>   None
#> 
#> Models: 
#> $field1
#>   Model type: AR(1)
#>       rho = 0.238
#>   Noise type: NORMAL
#>   Noise parameters: 
#>       sigma = 0.714
#> 
#> Measurement noise: 
#>   Noise type: NORMAL
#>   Noise parameters: 
#>       sigma = 0.677
#> 
#> 
#> Number of replicates is  1

# 2. Good estimation (500 iterations)
fit_ar <- ngme(
  y ~ 0 + f(1:n_obs, model="ar1"),
  data = data.frame(y),
  family = "normal",
  control_opt = control_opt(
    iterations = 500
  )
)
#> Starting estimation... 
#> 
#> Starting posterior sampling... 
#> Posterior sampling done! 
#> Note:
#>       1. Use ngme_post_samples(..) to access the posterior samples.
#>       2. Use ngme_result(..) to access different latent models.
fit_ar
#> *** Ngme object ***
#> 
#> Fixed effects: 
#>   None
#> 
#> Models: 
#> $field1
#>   Model type: AR(1)
#>       rho = 0.573
#>   Noise type: NORMAL
#>   Noise parameters: 
#>       sigma = 0.748
#> 
#> Measurement noise: 
#>   Noise type: NORMAL
#>   Noise parameters: 
#>       sigma = 0.548
#> 
#> 
#> Number of replicates is  1

# 3. Linear model
fit_linear <- ngme(
  y ~ 1 + x,
  data = data.frame(y=y, x=1:n_obs),
  family = "normal",
  control_opt = control_opt(
    iterations = 10
  )
)
#> Starting estimation... 
#> 
#> Starting posterior sampling... 
#> Posterior sampling done! 
#> Note:
#>       1. Use ngme_post_samples(..) to access the posterior samples.
#>       2. Use ngme_result(..) to access different latent models.
fit_linear
#> *** Ngme object ***
#> 
#> Fixed effects: 
#> (Intercept)           x 
#>    -0.16948     0.00919 
#> 
#> Models: 
#> Measurement noise: 
#>   Noise type: NORMAL
#>   Noise parameters: 
#>       sigma = 1.07
#> 
#> 
#> Number of replicates is  1


Doing Cross-validation

# 10-fold cross-validation
cv1 <- cross_validation(
  list(
    linear = fit_linear,
    ar1_poor = fit_ar_poor,
    ar1 = fit_ar
  ),
  type = "k-fold",
  k = 10,
  n_gibbs_samples = 500,
  N = 5,
  seed = 10
)
cv1$mean.scores
#>                MAE       MSE  neg.CRPS neg.sCRPS
#> linear   0.8432742 1.0809549 0.7974937       NaN
#> ar1_poor 0.8072278 1.0392176 0.6847734  1.220535
#> ar1      0.7779488 0.9160218 0.5903707  1.115080
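Because mean.scores is returned as a table with one row per fitted model (as shown above), the scores can be manipulated with ordinary R indexing; for example, to rank the models by mean squared error (lower is better):

# order the rows of the score table by MSE, best model first
cv1$mean.scores[order(cv1$mean.scores[, "MSE"]), ]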

# Leave-percent-out cross-validation (20%) with 5 repetitions
cv2 <- cross_validation(
  list(
    linear = fit_linear,
    ar1_poor = fit_ar_poor,
    ar1 = fit_ar
  ),
  type = "lpo",
  percent = 0.2,
  times = 5,
  n_gibbs_samples = 500,
  seed = 10
)
cv2$mean.scores
#>                MAE      MSE  neg.CRPS neg.sCRPS
#> linear   0.8626743 1.163210 1.0252977       NaN
#> ar1_poor 0.8633346 1.164464 0.7290992  1.267883
#> ar1      0.8605305 1.156011 0.7258408  1.273118
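Leave-one-out cross-validation follows the same pattern, with one held-out observation per split, so it takes noticeably longer to run. A sketch (not evaluated here) would be:

# leave-one-out CV for the well-estimated AR(1) model (not run in this vignette)
cv3 <- cross_validation(
  list(ar1 = fit_ar),
  type = "loo",
  n_gibbs_samples = 100,
  seed = 10
)
cv3$mean.scores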