Split Training Set and Test Set
In order to use the test error on a test set as an approximation of the generalization error, we need to split the initial data set into a mutually exclusive training set and test set. The following are some common methods.
Hold-out
Divide the data set $D$ into two mutually exclusive sets, one as the training set $S$ and the other as the test set $T$, satisfying $D=S{\cup}T$ and $S{\cap}T=\emptyset$. A common split uses about 2/3 to 4/5 of the samples for training and the rest for testing.
It is notable that the training/test split should keep the data distribution as consistent as possible to avoid introducing additional bias; stratified sampling is commonly used to achieve this.
At the same time, the result of a single hold-out split is often not stable enough because of the randomness of the division, so in practice we repeat the random split several times and report the average of the results.
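A minimal sketch of the repeated stratified hold-out described above, assuming scikit-learn is available; the toy dataset, classifier, and split ratio are placeholders chosen only for illustration, not part of the original text:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

scores = []
for seed in range(10):                       # repeat the random split 10 times
    # stratified hold-out: ~2/3 for training, ~1/3 for testing,
    # keeping class proportions consistent in S and T
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1 / 3, stratify=y, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

print("mean test accuracy over 10 hold-out splits:", np.mean(scores))
```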
Cross Validation
Divide the data set $D$ into $k$ mutually exclusive subsets of equal size, satisfying $D=D_1{\cup}D_2{\cup}\dots{\cup}D_k$ and $D_i{\cap}D_j=\emptyset$ $(i{\neq}j)$; stratified sampling is again used so that each subset keeps the data distribution as consistent as possible.
The idea of cross-validation is that, in each round, the union of $k-1$ subsets is used as the training set and the remaining subset is used as the test set. This yields $k$ training/test splits, so $k$ rounds of training and testing are performed, and the mean of the $k$ test results is returned.
K-fold Cross Validation
Cross-validation is therefore also called K-fold Cross Validation; the most common value of $k$ is 10. The following gives a diagram of 10-fold cross-validation.
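A minimal sketch of 10-fold cross-validation, again assuming scikit-learn and using a toy dataset and classifier purely as placeholders:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 10 mutually exclusive, stratified folds; each fold serves as the test set once
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("per-fold accuracy:", np.round(scores, 3))
print("mean of the 10 test results:", scores.mean())
```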
Similar to the hold-out, the division of $D$ into $k$ subsets is itself random. Therefore K-fold Cross Validation is usually repeated $p$ times with different random partitions, giving p-times k-fold Cross Validation; 10-times 10-fold Cross Validation, which performs 100 training/testing sessions in total, is common.
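This repetition could be sketched as below; scikit-learn's RepeatedStratifiedKFold performs exactly such a p-times k-fold scheme (the dataset and model are again placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 10-times 10-fold cross-validation: 10 different random partitions,
# each evaluated with 10-fold CV, i.e. 100 training/testing sessions in total
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("number of training/testing sessions:", len(scores))   # 100
print("mean accuracy:", scores.mean())
```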
Leave-One-Out
In particular, when each of the $k$ subsets contains only one sample (i.e., $k$ equals the number of samples), the method is known as Leave-One-Out. Its results are often considered more accurate, but the computational cost is significant.
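A minimal Leave-One-Out sketch under the same assumptions (scikit-learn, toy data); note that it trains one model per sample, which is where the cost comes from:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# one fold per sample: each round trains on m-1 samples and tests on the one left out
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())

print("number of models trained:", len(scores))   # equals the number of samples
print("leave-one-out accuracy:", scores.mean())
```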
Bootstrapping
What we want to evaluate is the model trained on the whole of $D$. However, in the hold-out and cross-validation methods, the model actually evaluated uses a training set smaller than $D$ because a portion of the samples is held out for testing, which inevitably introduces some estimation bias due to the difference in training set size. Leave-One-Out is less affected by this, but its computational complexity is too high. Bootstrapping addresses precisely this problem.
The basic idea of bootstrapping is: given a dataset $D$ containing $m$ samples, randomly select one sample from $D$ at a time, copy it into $D'$, and then put it back into $D$ so that it may be picked again in the next draw. Repeating this $m$ times yields a dataset $D'$ containing $m$ samples.
Since each draw misses a given sample with probability $1-\frac{1}{m}$, the probability that a sample is never picked in $m$ draws is $(1-\frac{1}{m})^m$, whose limit is:
$\lim\limits_{m\to\infty}\left(1-\frac{1}{m}\right)^m=\frac{1}{e}\approx0.368$
Thus, approximately 36.8% of the samples in the initial data set $D$ never appear in $D'$, so $D'$ can be used as the training set and $D-D'$ as the test set. Bootstrapping is useful when the data set is small and it is difficult to split an effective training/test set; however, it introduces estimation bias because the data set generated by bootstrap sampling (random sampling with replacement) alters the distribution of the initial data set. When the initial data set is sufficient, hold-out and cross-validation are more commonly used.
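As a sketch of the sampling step itself (NumPy only; the size $m$ and the index-based stand-in for the data are assumptions made for illustration), drawing $m$ indices with replacement yields $D'$, and the indices never drawn form the test set $D-D'$, whose fraction is close to 36.8% for large $m$:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000                                   # number of samples in D (arbitrary)
D = np.arange(m)                             # index stand-in for the real dataset

# draw m samples with replacement -> D'; the untouched indices form D - D'
picked = rng.integers(0, m, size=m)
train_idx = picked
test_idx = np.setdiff1d(D, picked)

print("out-of-bag fraction:", len(test_idx) / m)   # close to 1/e ≈ 0.368
```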