
Cross-validation




Cross-validation is important in guarding against testing hypotheses suggested by the data, especially where further samples are hazardous, costly or impossible to collect (uncomfortable science).


NOMENCLATURE


In cross-validation, the subsample of data that is used for the initial analysis is called the training data. In systems where models are constructed autonomously from the training data (such as in data mining and artificial intelligence), the process of constructing the model is called training. The subsample or subsamples of data used to validate the initial analysis (by acting as "blind" data) are called validation data, or test data.


COMMON TYPES OF CROSS-VALIDATION



Holdout cross-validation


Holdout cross-validation is the simplest kind of cross-validation. Observations are chosen randomly from the initial sample to form the validation data, and the remaining observations are retained as the training data. Normally, less than a third of the initial sample is used for validation data.
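
The random split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `holdout_split`, the `validation_fraction` parameter and the fixed `seed` are hypothetical choices made for the example, and the fraction of 0.25 is simply one value below a third.

```python
import random


def holdout_split(observations, validation_fraction=0.25, seed=0):
    """Randomly split observations into (training, validation) lists."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must lie strictly between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(observations)))
    rng.shuffle(indices)
    n_validation = int(len(observations) * validation_fraction)
    validation_idx = set(indices[:n_validation])
    # Every observation lands in exactly one of the two subsamples.
    training = [x for i, x in enumerate(observations) if i not in validation_idx]
    validation = [x for i, x in enumerate(observations) if i in validation_idx]
    return training, validation


# 20 observations with a fraction of 0.25: 15 for training, 5 for validation.
training, validation = holdout_split(list(range(20)))
```

The model would then be fitted on `training` only and assessed once on `validation`.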


''K''-fold cross-validation


In ''K''-fold cross-validation, the original sample is partitioned into ''K'' subsamples. Of the ''K'' subsamples, a single subsample is retained as the validation data for testing the model, and the remaining ''K'' − 1 subsamples are used as training data. The cross-validation process is then repeated ''K'' times (the ''folds''), with each of the ''K'' subsamples used exactly once as the validation data. The ''K'' results from the folds can then be averaged (or otherwise combined) to produce a single estimate.
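
The procedure above can be sketched as follows. All names here (`k_fold_indices`, `cross_validate`, `fit_mean`, `mean_squared_error`) are hypothetical helpers written for this illustration, and the "model" is deliberately trivial (it predicts the mean of the training responses) so that the mechanics of the folds stay in view.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Partition the indices 0..n-1 into k disjoint folds of near-equal size."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, k, fit, score, seed=0):
    """Return (combined estimate, per-fold results) for K-fold cross-validation."""
    results = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        # Train on the K - 1 subsamples that are not held out.
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        # Validate on the single retained subsample.
        results.append(score(model, [xs[i] for i in fold], [ys[i] for i in fold]))
    return mean(results), results


def fit_mean(train_x, train_y):
    """Toy model: always predict the mean of the training responses."""
    return mean(train_y)


def mean_squared_error(model, val_x, val_y):
    return mean((y - model) ** 2 for y in val_y)


xs = list(range(10))
ys = [2.0] * 10  # constant response, so the toy model is exact
average, per_fold = cross_validate(xs, ys, k=5, fit=fit_mean, score=mean_squared_error)
```

Averaging is used here to combine the fold results; another combination rule could be substituted in the last line of `cross_validate`.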


Leave-one-out cross-validation


As the name suggests, leave-one-out cross-validation involves using a single observation from the original sample as the validation data, and the remaining observations as the training data. This is repeated such that each observation in the sample is used once as the validation data. This is the same as ''K''-fold cross-validation where ''K'' is equal to the number of observations in the original sample. Because the model must in general be refitted once per observation, the procedure can be expensive; in the case of Tikhonov regularization, an efficient algorithm that avoids the repeated refitting has been found.
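
As an illustration, the sketch below computes leave-one-out residuals in two ways for a one-parameter Tikhonov-regularized (ridge) model y ≈ w·x without intercept. The first routine refits the model once per observation, as described above. The second uses a well-known closed-form identity that divides each full-sample residual by one minus the corresponding diagonal entry of the hat matrix; this is one shortcut of the kind mentioned, not necessarily the specific algorithm the text has in mind. The function names and the toy data are assumptions made for the example.

```python
import math


def ridge_fit(xs, ys, lam):
    """Fit y ~ w * x (no intercept) with Tikhonov penalty lam * w**2."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)


def loo_residuals_naive(xs, ys, lam):
    """Refit the model n times, each time leaving one observation out."""
    residuals = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        w = ridge_fit(train_x, train_y, lam)
        residuals.append(ys[i] - w * xs[i])
    return residuals


def loo_residuals_shortcut(xs, ys, lam):
    """Obtain the same residuals from a single fit on all observations."""
    sxx = sum(x * x for x in xs)
    w = ridge_fit(xs, ys, lam)
    # x * x / (sxx + lam) is the diagonal hat-matrix entry for this model;
    # with lam > 0 it is strictly below 1, so the division is safe.
    return [(y - w * x) / (1 - x * x / (sxx + lam)) for x, y in zip(xs, ys)]


# Toy data, invented for the example.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]
naive = loo_residuals_naive(xs, ys, lam=0.5)
fast = loo_residuals_shortcut(xs, ys, lam=0.5)
agree = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(naive, fast))
```

The naive routine performs one fit per observation, whereas the shortcut needs only a single fit, which is what makes leave-one-out practical for this family of models.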