Cross-Validation Methodology in Materials Science
2018-09-01
Cross-validation is a critical part of statistical methodology. For ad hoc models it may be the only indication of model performance, and without a sound cross-validation methodology serious over-fitting can go undetected. This issue is particularly relevant to domains where small data-sets with a comparatively large number of features are common, for example Materials Science or Genomics. If the cross-validation method does not account for feature selection (that is, if feature selection is not treated as part of model selection and repeated inside every fold), a significant selection bias can occur; see this paper for an example with gene-expression data. In general, a reasonably robust validation methodology should be chosen before model selection, and a final hold-out set should be used when possible.
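To make the selection bias concrete, here is a minimal sketch in plain Python. Everything in it is an assumption chosen for illustration and none of it comes from the paper mentioned above: the data are pure Gaussian noise (40 samples, 1000 features, labels that carry no signal), features are ranked by the absolute difference in class means, and the model is a simple nearest-centroid classifier. The only thing that changes between the two runs is whether the features are selected once on the full data-set or re-selected inside each fold using the training rows only.

```python
import random

def top_k_features(X, y, rows, k):
    """Rank features by |difference in class means|, using only `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    scores = []
    for j in range(len(X[0])):
        mean_pos = sum(X[i][j] for i in pos) / len(pos)
        mean_neg = sum(X[i][j] for i in neg) / len(neg)
        scores.append((abs(mean_pos - mean_neg), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

def centroid(X, rows, feats):
    """Mean of the selected features over the given rows."""
    return [sum(X[i][j] for i in rows) / len(rows) for j in feats]

def nearest_centroid_accuracy(X, y, train, test, feats):
    """Fit class centroids on `train`, return accuracy on `test`."""
    c1 = centroid(X, [i for i in train if y[i] == 1], feats)
    c0 = centroid(X, [i for i in train if y[i] == 0], feats)
    correct = 0
    for i in test:
        d1 = sum((X[i][j] - c) ** 2 for j, c in zip(feats, c1))
        d0 = sum((X[i][j] - c) ** 2 for j, c in zip(feats, c0))
        predicted = 1 if d1 < d0 else 0
        correct += int(predicted == y[i])
    return correct / len(test)

def cross_validate(X, y, folds, k, select_inside_folds):
    """Mean accuracy over folds, with selection inside or outside the loop."""
    all_rows = list(range(len(X)))
    # Selecting on all rows lets the held-out labels leak into the model.
    global_feats = None if select_inside_folds else top_k_features(X, y, all_rows, k)
    accuracies = []
    for test in folds:
        held_out = set(test)
        train = [i for i in all_rows if i not in held_out]
        feats = top_k_features(X, y, train, k) if select_inside_folds else global_feats
        accuracies.append(nearest_centroid_accuracy(X, y, train, test, feats))
    return sum(accuracies) / len(accuracies)

# Illustrative sizes only: few samples, many features, no real signal.
n_samples, n_features, n_selected, n_folds = 40, 1000, 10, 5
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]  # alternating labels, unrelated to X

# Contiguous blocks of 8 rows; alternating labels keep each fold balanced.
fold_size = n_samples // n_folds
folds = [list(range(f * fold_size, (f + 1) * fold_size)) for f in range(n_folds)]

biased = cross_validate(X, y, folds, n_selected, select_inside_folds=False)
honest = cross_validate(X, y, folds, n_selected, select_inside_folds=True)
print(f"selection outside CV: {biased:.2f}")
print(f"selection inside CV:  {honest:.2f}")
```

Because the labels are unrelated to the features, any accuracy well above chance in the first run is an artefact of choosing features with the held-out labels in view. Re-selecting inside each fold removes that leak, and the estimate should fall back towards chance. In practice a library pipeline that bundles selection and model fitting into one estimator achieves the same thing with less code.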