Terms

Data

Suppose we collect data on a batch of watermelons, e.g.:

(color=green; rootstock=crumpled; knock=muddy)
(color=black; rootstock=slightly crumpled; knock=dull)
(color=light white; rootstock=hard; knock=crisp)…

Each pair of parentheses holds the record of one watermelon. Definitions:

Data Set: The collection of all records is the data set.
Sample: Each record is an instance, or sample.
Feature or Attribute: A single property of a sample, e.g. color or knock, is a feature, or attribute.
Vector: If each attribute is treated as a coordinate axis, each record corresponds to a point in that space and can be represented by a vector, e.g. (green, crumpled, muddy). Each watermelon is thus a feature vector.
Dimensionality: The number of features of a sample is its dimensionality.
Curse of Dimensionality: The watermelon example has dimensionality 3. When the dimensionality is very large, learning becomes much harder; this is called the curse of dimensionality (sometimes translated literally as "dimensional disaster").
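As a minimal sketch of these terms, the first two records from the example can be written as Python dictionaries and turned into feature vectors. The names `attributes` and `to_feature_vector` are illustrative choices, not part of any library.

```python
# Two records from the example above, one dict per watermelon (sample).
data_set = [
    {"color": "green", "rootstock": "crumpled", "knock": "muddy"},
    {"color": "black", "rootstock": "slightly crumpled", "knock": "dull"},
]

# A fixed attribute order turns each record into a feature vector.
attributes = ["color", "rootstock", "knock"]


def to_feature_vector(record):
    """Return the record's attribute values in the fixed attribute order."""
    return tuple(record[name] for name in attributes)


feature_vectors = [to_feature_vector(record) for record in data_set]

# The dimensionality is the number of attributes per sample.
dimensionality = len(attributes)

print(feature_vectors[0])  # ('green', 'crumpled', 'muddy')
print(dimensionality)      # 3
```

Here the list is the data set, each dictionary is a sample, each key is an attribute, and each tuple is a feature vector.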

Data Set

When a computer program learns a model from empirical data, each record used is called a training sample. Once the model is trained and we want to measure its performance on new samples, each new sample is called a test sample. Definitions:

Training Set: The set of all training samples is the training set [the special].
Test Set: The set of all test samples is the test set [the general].
Generalization: The ability of a learned model to apply to new samples is generalization, i.e. going from the special to the general.
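One common way to obtain the two sets is to shuffle the available samples and hold some of them back for testing. The sketch below assumes a simple random hold-out; the function name `split_data_set`, the `test_fraction` and `seed` parameters, and the placeholder sample names are all hypothetical.

```python
import random


def split_data_set(samples, test_fraction=0.25, seed=0):
    """Shuffle the samples and split them into (training_set, test_set)."""
    shuffled = list(samples)               # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder sample identifiers, not real data.
samples = [f"watermelon_{i}" for i in range(8)]
training_set, test_set = split_data_set(samples)

print(len(training_set), len(test_set))  # 6 2
```

The model only sees `training_set` while learning; how well it does on `test_set` is an estimate of its generalization.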

Classification

In the watermelon example, we want the computer to learn from data about watermelon features and train a decision model that determines whether a new watermelon is good or not. What is predicted here is whether the watermelon is good or bad, which is a discrete value. By contrast, projecting future population numbers from the population data of previous years predicts continuous values. Definitions:

Classification: The problem where the predicted values are discrete is classification.
Regression: The problem where the predicted values are continuous is regression.
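The difference shows up in the type of the prediction. In the sketch below, the classifier is a hand-written rule (an assumption for illustration, not something learned from data), and the regressor is an ordinary least-squares line fitted to made-up population figures.

```python
def classify_watermelon(knock):
    """Toy classifier: returns a discrete label (hypothetical hand-written rule)."""
    return "good" if knock == "muddy" else "bad"


def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    b = mean_y - a * mean_x
    return a, b


# Classification: the prediction is one of a finite set of labels.
label = classify_watermelon("muddy")

# Regression: the prediction is a real number.
years = [2015, 2016, 2017, 2018]
population = [100.0, 102.0, 104.0, 106.0]  # made-up figures for illustration
a, b = fit_line(years, population)
predicted_2019 = a * 2019 + b

print(label)           # good
print(predicted_2019)  # 108.0
```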

Method of Learning

In the prediction task above, we know in advance whether each melon in the training set is good or bad; the learner studies the features of these melons and induces the rule that separates them. The watermelons in the training set have been labeled, and this is called label information.

But there are also cases where the samples carry no labels. For example, we may want to divide a pile of watermelons into two smaller piles according to their features, so that the watermelons within a pile are as similar as possible. Here we do not know beforehand how good or bad the watermelons are; the samples have no label information. Definitions:

Supervised Learning: A learning task whose training data has label information is supervised learning. Both classification and regression, described above, are supervised learning.
Unsupervised Learning: A learning task whose training data has no label information is unsupervised learning. Common examples are clustering and association rule mining.
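The "two piles" task can be sketched without any labels. The code below performs a single assignment step: each sample goes to whichever of two reference samples it differs from on fewer attributes. This is a deliberately simplified stand-in for a real clustering algorithm, and the function names, the choice of reference samples, and the sample values are all made up for illustration.

```python
def mismatches(a, b):
    """Number of attributes on which two feature vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def split_into_two_piles(samples, center_a, center_b):
    """Assign each unlabeled sample to the reference sample it resembles more."""
    pile_a, pile_b = [], []
    for sample in samples:
        if mismatches(sample, center_a) <= mismatches(sample, center_b):
            pile_a.append(sample)
        else:
            pile_b.append(sample)
    return pile_a, pile_b


# Unlabeled feature vectors (color, rootstock, knock): no "good"/"bad" is given.
unlabeled = [
    ("green", "crumpled", "muddy"),
    ("green", "crumpled", "dull"),
    ("black", "hard", "crisp"),
    ("black", "hard", "dull"),
]

pile_a, pile_b = split_into_two_piles(unlabeled, unlabeled[0], unlabeled[2])
print(pile_a)
print(pile_b)
```

A supervised version of the same data would pair each feature vector with a label, e.g. `(("green", "crumpled", "muddy"), "good")`, and the learner would be judged on predicting that label.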

Last updated on 2019-05-05
