Performance Measure
Performance measures are evaluation criteria that quantify a model's ability to generalize. When comparing different models, different performance measures often lead to different judgments.
The Most Common Performance Measures
Mean Squared Error
In regression tasks, which predict continuous values, the most commonly used measure is the Mean Squared Error (MSE); many classical algorithms use MSE as their evaluation function.
$E(f;D)=\frac{1}{m}\sum\limits_{i=1}^{m}(f(x_i)-y_i)^2$
More generally, for a data distribution $\mathcal{D}$ and probability density function $p(\cdot)$, the MSE can be written as
$E(f;\mathcal{D})=\int_{x\sim\mathcal{D}}(f(x)-y)^2p(x)dx$
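As a concrete illustration, here is a minimal Python sketch of the empirical MSE from the first formula; the toy labels and predictions are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Empirical MSE: average squared difference between predictions and labels."""
    assert len(y_true) == len(y_pred)
    m = len(y_true)
    return sum((p - y) ** 2 for p, y in zip(y_pred, y_true)) / m

# Hypothetical toy data: three regression predictions versus ground truth.
y_true = [3.0, -0.5, 2.0]
y_pred = [2.5, 0.0, 2.0]
print(mean_squared_error(y_true, y_pred))  # 0.1666...
```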
Error Rate & Accuracy
In classification tasks, which predict discrete values, the most commonly used measures are error rate and accuracy. The error rate is the proportion of incorrectly classified samples out of the total number of samples, and accuracy is the proportion of correctly classified samples, so error rate + accuracy = 1.
Error rate is defined as:
$E(f;D)=\frac{1}{m}\sum\limits_{i=1}^{m}\mathbb I(f(x_i)\neq{y_i})$
Precision is defined as:
$acc(f;D)=\frac{1}{m}\sum\limits_{i=1}^{m}\mathbb I(f(x_i)=y_i)=1-E(f;D)$
More generally, for a data distribution $\mathcal{D}$ and probability density function $p(\cdot)$, the error rate and accuracy can be written as:
$E(f;\mathcal{D})=\int_{x\sim\mathcal{D}}\mathbb I(f(x)\neq{y})p(x)dx$
$acc(f;\mathcal{D})=\int_{x\sim\mathcal{D}}\mathbb I(f(x)=y)p(x)dx=1-E(f;\mathcal{D})$
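For concreteness, a minimal Python sketch of the empirical error rate and accuracy, again on hypothetical toy labels:

```python
def error_rate(y_true, y_pred):
    """Fraction of samples whose predicted class differs from the true class."""
    m = len(y_true)
    return sum(1 for p, y in zip(y_pred, y_true) if p != y) / m

def accuracy(y_true, y_pred):
    """Fraction of correctly classified samples; equals 1 - error_rate."""
    return 1 - error_rate(y_true, y_pred)

# Hypothetical toy labels for a binary classifier.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
print(error_rate(y_true, y_pred))  # 0.4
print(accuracy(y_true, y_pred))    # 0.6
```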
Precision/Recall/F1
Error rate and accuracy are not always what we care about. For example, in a recommendation system, we may care about how much of the content pushed to the user is actually of interest to the user (i.e. precision), or how much of all the content the user is interested in was actually pushed (i.e. recall). Precision and recall are therefore better suited to describing such problems. For binary classification, the confusion matrix of classification results and the definitions of precision and recall are as follows:
| Fact \ Prediction | Positive | Negative |
| --- | --- | --- |
| Positive | TP (True Positive) | FN (False Negative) |
| Negative | FP (False Positive) | TN (True Negative) |
Precision $P$ and recall $R$ are defined as:
$P=\frac{TP}{TP+FP}$
$R=\frac{TP}{TP+FN}$
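To illustrate the definitions, here is a minimal Python sketch that counts the confusion-matrix entries and computes $P$ and $R$; the labels are hypothetical toy data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary problem."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y != positive and p == positive)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y != positive and p != positive)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == positive and p != positive)
    return tp, fp, tn, fn

def precision_recall(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)  # P = TP / (TP + FP)
    recall = tp / (tp + fn)     # R = TP / (TP + FN)
    return precision, recall

# Hypothetical toy labels.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
print(precision_recall(y_true, y_pred))  # (0.666..., 0.666...)
```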
Precision and recall are a pair of contradictory measures. For example, if we want the pushed content to be of interest to users as much as possible, we can only push content we are very confident about, so some content users are interested in will be missed, leading to low recall; if we want all the content users are interested in to be pushed, we have to push almost everything, so precision becomes very low.
The P-R curve
describes precisely how precision and recall change together. It is drawn as follows: according to the learner's prediction results (generally a real value or probability), the test samples are sorted so that the samples most likely to be positive come first and the least likely come last; the samples are then predicted as positive one by one in this order, and the current $P$ and $R$ values are computed at each step, as shown in the figure below.
How is a P-R curve evaluated? If the P-R curve of one learner $A$ is completely enclosed by the P-R curve of another learner $B$, then $B$ performs better than $A$. If the curves of $A$ and $B$ intersect, the learner with the larger area under the curve generally performs better. But in practice the area under the curve is difficult to estimate, so the “Break-Even Point” (BEP) is used instead: the value at which $P=R$; the higher the Break-Even Point, the better the performance.
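To make the construction concrete, here is a minimal Python sketch of the P-R sweep: sort by predicted score, mark samples as positive one by one, and record $(R, P)$ at each step. The scores and labels are hypothetical toy data.

```python
# Hypothetical prediction scores and true labels (1 = positive example).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,   0]

order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
total_pos = sum(labels)

curve = []
tp = fp = 0
for i in order:                  # predict samples as positive one by one
    if labels[i] == 1:
        tp += 1
    else:
        fp += 1
    precision = tp / (tp + fp)
    recall = tp / total_pos
    curve.append((recall, precision))

print(curve)
# The Break-Even Point is roughly where precision == recall along this curve.
```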
The $P$ and $R$ indicators are sometimes contradictory, so they need to be considered together. The most common way to do this is the F-Measure, also known as the F-Score, which is a weighted harmonic mean of $P$ and $R$:
$\frac{1}{F_\beta}=\frac{1}{1+\beta^2}\cdot(\frac{1}{P}+\frac{\beta^2}{R})$
$F_\beta=\frac{(1+\beta^2)\times{P}\times{R}}{(\beta^2\times{P})+R}$
In particular, when $\beta=1$ it becomes the common $F1$ measure, the harmonic mean of $P$ and $R$; the higher $F1$ is, the better the model performs.
$\frac{1}{F1}=\frac{1}{2}\cdot(\frac{1}{P}+\frac{1}{R})$
$F1=\frac{2\times{P}\times{R}}{P+R}=\frac{2\times{TP}}{ALL+TP-TN}$, where $ALL$ is the total number of samples.
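A minimal Python sketch of the $F_\beta$ computation, reusing the hypothetical $P$ and $R$ values from the earlier example:

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.
    beta > 1 weights recall more heavily; beta < 1 weights precision more."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical values P = 2/3, R = 2/3 from the earlier example.
print(f_beta(2/3, 2/3))          # F1 = 0.666...
print(f_beta(2/3, 2/3, beta=2))  # recall-weighted F2 (same here since P == R)
```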
Sometimes we have multiple binary confusion matrices, e.g., from multiple training runs or from training on multiple datasets. There are then two ways to estimate global performance: macro and micro. Macro averaging first computes the $P$ and $R$ values of each confusion matrix, then averages them to obtain $macroP$ and $macroR$, from which $macroF_\beta$ or $macroF1$ is computed; micro averaging first averages the TP, FP, TN, FN counts of the confusion matrices, then computes $P$, $R$, and thus $microF_\beta$ or $microF1$ from those averages.
$macroP=\frac{1}{n}\sum\limits_{i=1}^{n}P_i$
$macroR=\frac{1}{n}\sum\limits_{i=1}^{n}R_i$
$macroF1=\frac{2\times{macroP}\times{macroR}}{macroP+macroR}$
$microP=\frac{\overline{TP}}{\overline{TP}+\overline{FP}}$
$microR=\frac{\overline{TP}}{\overline{TP}+\overline{FN}}$
$microF1=\frac{2\times{microP}\times{microR}}{microP+microR}$
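To illustrate the difference between the two averaging schemes, here is a minimal Python sketch over several hypothetical confusion matrices, each given as (TP, FP, TN, FN):

```python
# Hypothetical confusion matrices as (TP, FP, TN, FN) tuples.
matrices = [(8, 2, 85, 5), (20, 10, 60, 10), (5, 1, 90, 4)]

# Macro: average the per-matrix P and R, then combine into macro-F1.
Ps = [tp / (tp + fp) for tp, fp, tn, fn in matrices]
Rs = [tp / (tp + fn) for tp, fp, tn, fn in matrices]
macro_p = sum(Ps) / len(Ps)
macro_r = sum(Rs) / len(Rs)
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)

# Micro: average the raw counts first, then compute P, R, and micro-F1.
# (TN is averaged the same way but is not needed for P and R.)
avg_tp = sum(m[0] for m in matrices) / len(matrices)
avg_fp = sum(m[1] for m in matrices) / len(matrices)
avg_fn = sum(m[3] for m in matrices) / len(matrices)
micro_p = avg_tp / (avg_tp + avg_fp)
micro_r = avg_tp / (avg_tp + avg_fn)
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)

print(macro_f1, micro_f1)
```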
ROC & AUC
The ROC curve is very similar to the P-R curve: both are drawn by sorting the test samples according to the learner's predicted values and predicting them as positive one by one. The difference is that the ROC curve takes the “False Positive Rate” (FPR) as the horizontal axis and the “True Positive Rate” (TPR) as the vertical axis; the ROC focuses on the quality of the ordering of the test samples induced by the predicted values.
$TPR=\frac{TP}{TP+FN}$
$FPR=\frac{FP}{TN+FP}$
A simple analysis of the curve's endpoints: at (1,1) every sample is predicted as positive, so FN and TN are both 0; at (0,0) every sample is predicted as negative, so TP and FP are both 0. We can imagine the sorted queue of samples and split it with different cut-off points (i.e. thresholds) to analyze the shape of the curve: (0,0) means all samples are predicted as negative, (1,1) means all samples are predicted as positive, (0,1) is the ideal case in which all positive examples appear before all negative examples, and (1,0) is the worst case in which all negative examples appear before all positive examples.
Similarly, if the ROC curve of one learner $A$ is completely enclosed by that of another learner $B$, the performance of $B$ is superior to that of $A$. If the curves of $A$ and $B$ intersect, the learner whose curve has the larger area under it generally performs better. This area is called the AUC (Area Under the ROC Curve). Unlike the P-R case, the AUC here is easy to estimate: it is the sum of the areas of the small trapezoids under the ROC curve. It is easy to see that the larger the AUC, the better the quality of the ordering. An AUC of 1 means all positive examples are ranked before all negative ones, and an AUC of 0 means all negative examples are ranked before all positive ones.
$AUC=\frac{1}{2}\sum\limits_{i=1}^{m-1}(x_{i+1}-x_i)\cdot(y_i+y_{i+1})$
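A minimal Python sketch of the ROC sweep and the AUC summation above, assuming no tied scores for simplicity; the scores and labels are hypothetical toy data.

```python
# Hypothetical prediction scores and true labels (1 = positive example).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,   0]

order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
n_pos = sum(labels)
n_neg = len(labels) - n_pos

points = [(0.0, 0.0)]            # (FPR, TPR), starting with everything negative
tp = fp = 0
for i in order:                  # predict samples as positive one by one
    if labels[i] == 1:
        tp += 1
    else:
        fp += 1
    points.append((fp / n_neg, tp / n_pos))

# AUC = 1/2 * sum over i of (x_{i+1} - x_i) * (y_i + y_{i+1})
auc = sum(0.5 * (x2 - x1) * (y1 + y2)
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(auc)  # 0.888... for this toy data
```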
Cost-sensitive Error Rate & Cost Curve
In the approaches above, all of the learner's mistakes are treated equally, but in reality the cost of misclassifying a positive sample as negative is often not the same as the cost of misclassifying a negative sample as positive. For example, diagnosing a healthy patient as ill merely leads to additional tests, while diagnosing an ill patient as healthy puts their life at risk. Taking binary classification as an example, this introduces a cost matrix.
| Fact \ Prediction | class 0 | class 1 |
| --- | --- | --- |
| class 0 | 0 | $cost_{01}$ |
| class 1 | $cost_{10}$ | 0 |

Here $cost_{ij}$ denotes the cost of predicting a class-$i$ sample as class $j$.
Under unequal error costs, we want to minimize the overall cost, so the cost-sensitive error rate is:
$E(f;D;cost)=\frac{1}{m}(\sum\limits_{x_i\in{D}^+}\mathbb I(f(x_i)\neq{y_i})\times{cost_{01}}+\sum\limits_{x_i\in{D^-}}{\mathbb I}(f(x_i)\neq{y_i})\times{cost_{10}})$
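For concreteness, a minimal Python sketch of the cost-sensitive error rate, following the formula above (misclassified positive samples weighted by $cost_{01}$, misclassified negative samples by $cost_{10}$); the labels and costs are hypothetical.

```python
def cost_sensitive_error(y_true, y_pred, cost01, cost10, positive=1):
    """Cost-sensitive error rate: misclassified positives weighted by cost01,
    misclassified negatives weighted by cost10, averaged over all samples."""
    m = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_pred):
        if y == p:
            continue
        total += cost01 if y == positive else cost10
    return total / m

# Hypothetical example: missing a positive (disease) case costs 10x more
# than a false alarm.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0]
print(cost_sensitive_error(y_true, y_pred, cost01=10, cost10=1))  # (10 + 1) / 5 = 2.2
```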
Similarly, under unequal error costs, the ROC curve evolves into the cost curve. The horizontal axis of the cost curve is the probability cost of a positive example, which takes values in $[0,1]$ (where $p$ is the probability of a sample being positive), and the vertical axis is the normalized cost, which also takes values in $[0,1]$:
$P(+)cost=\frac{p\times{cost_{01}}}{p\times{cost_{01}}+(1-p)\times{cost_{10}}}$
$cost_{norm}=\frac{FNR\times{p}\times{cost_{01}}+FPR\times(1-p)\times{cost_{10}}}{p\times{cost_{01}}+(1-p)\times{cost_{10}}}$
Plotting the cost curve is simple: for a point (FPR, TPR) on the ROC curve, compute FNR = 1 − TPR, then draw a line segment from (0, FPR) to (1, FNR) in the cost plane; the area under that segment is the expected overall cost under the corresponding condition. Transforming every point of the ROC curve into a line segment in the cost plane in this way and taking the lower boundary of all the segments, the enclosed area is the expected overall cost of the learner over all conditions, as shown in the figure.
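To make the construction concrete, here is a minimal Python sketch that turns each ROC point (FPR, TPR) into its cost-plane segment from (0, FPR) to (1, FNR), takes the lower envelope of all segments, and numerically approximates the area under it; the ROC points and the grid resolution are hypothetical.

```python
# Hypothetical ROC points as (FPR, TPR) pairs.
roc_points = [(0.0, 0.0), (0.0, 0.5), (0.2, 0.75), (0.5, 1.0), (1.0, 1.0)]

def normalized_cost(p_cost, fpr, fnr):
    """Value of one ROC point's segment at positive-probability cost p_cost."""
    return fnr * p_cost + fpr * (1 - p_cost)

# Evaluate the lower envelope of all segments on a grid over [0, 1].
steps = 100
grid = [i / steps for i in range(steps + 1)]
envelope = [min(normalized_cost(p, fpr, 1 - tpr) for fpr, tpr in roc_points)
            for p in grid]

# Approximate the area under the lower envelope (the expected overall cost).
expected_cost = sum(0.5 * (envelope[i] + envelope[i + 1]) / steps
                    for i in range(steps))
print(expected_cost)
```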