Method of Evaluation
In realistic tasks, we often have multiple algorithms to choose from, so how do chonse the best one for us? As mentioned in last chapter, we want the learner with the ‘smallest generalization error’, and the ideal solution is to evaluate the generalization error of the model and select the smallest one. However, the generalization error refers to the ability of the model to be applied to all new samples that we do not have direct access to the it.
Thus, we usually use a ‘test set’ to test the learner’s ability to discriminate on new samples, and then use the test error on the test set as an approximation of the generalization error. Obviously the test set which we select should be as mutually exclusive as possible with the training set, and here’s a little story to explain why.
Suppose the teacher has 10 questions for the students to practice, and uses the same 10 questions for the test, however some children may could only do these 10 questions and get a high score. It is clear that the score does not reflect the real level effectively.
In our task, we would like to have a well generalized models, as the teacher would like the students not only learned the course well but also gained the ability to think about what they have learned.
Training samples are equivalent to the exercises for students to practice, and the testing process is equivalent to an exam. If the test sample had been used for training, it would have been an over-optimistic estimate.