Blog posts

2021

Spatial Cross-Validation: Estimating the Generalized Error in the Worst Scenario

Recently, there has been increasing discussion about the use of Spatial Cross-Validation (SCV) techniques to assess models learned from data with spatial dependence. While some authors advocate always using SCV, others do not recommend it because it can produce pessimistic results. In my view, the answer to whether you should use this validation technique lies in your sample distribution, as discussed in (cite), and in how confident you are that the spatial dependence structure observed in the sample will also hold for the out-of-sample data.

To make this concrete, imagine a sample dataset S with a distribution D that presents a spatial dependence structure. If S is representative enough that D is the same for the out-of-sample data, then traditional Cross-Validation (CV) will not give an overly optimistic estimate of the generalization error. On the other hand, if you are not confident that the sample distribution D will be observed in the out-of-sample data, or you know that your sample is biased (e.g., you collected data from spatially clustered areas), then SCV is a better option than traditional CV.

However, keep in mind that SCV evaluates your model under the worst-case scenario, in which the dependence structure of the test set is not the same as that of the training set. Under SCV, the best models are therefore those that generalize regardless of the spatial structure of the data. In summary, SCV is not the definitive validation technique for data with spatial dependence, but it is a theoretically established method for evaluating your models in worst-case situations.
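To illustrate the difference between the two splitting strategies, here is a minimal sketch in plain Python. It assumes block-based splitting, which is only one common way to build SCV folds and not necessarily the variant discussed in (cite). The function names (`random_folds`, `spatial_block_folds`, `mean_nearest_train_distance`), the block size, and the synthetic coordinates are all hypothetical choices made for this example.

```python
import math
import random
from collections import defaultdict


def random_folds(n_points, k, seed=0):
    """Traditional CV: shuffle the indices and deal them into k folds."""
    idx = list(range(n_points))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def spatial_block_folds(coords, block_size, k):
    """Block-based SCV (an assumed variant): points that fall in the same
    square grid cell always end up in the same fold."""
    blocks = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        blocks[(int(x // block_size), int(y // block_size))].append(i)
    folds = [[] for _ in range(k)]
    # Deal whole blocks, not single points, into the k folds.
    for j, key in enumerate(sorted(blocks)):
        folds[j % k].extend(blocks[key])
    return folds


def mean_nearest_train_distance(coords, folds):
    """Average distance from each test point to its closest training point."""
    total, count = 0.0, 0
    for test in folds:
        test_set = set(test)
        train = [i for i in range(len(coords)) if i not in test_set]
        for i in test:
            total += min(math.dist(coords[i], coords[j]) for j in train)
            count += 1
    return total / count


# Synthetic point locations on a 100 x 100 area (illustrative only).
rng = random.Random(42)
coords = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(200)]

cv = random_folds(len(coords), k=4)
scv = spatial_block_folds(coords, block_size=25.0, k=4)

print("CV :", mean_nearest_train_distance(coords, cv))
print("SCV:", mean_nearest_train_distance(coords, scv))
```

With random folds, a test point usually has a training point very close by, so spatial dependence lets the model "see" nearly the same information at training time. With block folds, whole regions are held out together, which tends to push test points farther from the training data and is what makes SCV behave like a worst-case evaluation.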