This post summarizes some of the loss functions used in training a neural network.
Loss Function
This post presumes that you are familiar with the basic pipeline of training a neural network.
The loss function guides the training process of a neural network. For classification problems, mean squared error (L2 loss) and cross entropy loss are widely used; for regression problems, L1 loss is sometimes used. For object detection there is a loss function named focal loss, which is believed to ease the training of CNN-based dense detectors, and for Mask R-CNN segmentation there is a corresponding loss function, average binary cross entropy, that can improve segmentation results.
Finally, quite a lot of complicated applications combine several of the loss functions mentioned above.
Mean Squared Error, or L2 Loss
Mean squared error is the most common loss function in machine learning, and I believe it is also the most intuitive one for a machine learning beginner.
The formula is:
$$MSE := \frac{1}{n}\sum_{t=1}^{n}e_t^2$$
In a classification problem, n is usually the number of classes and e is the difference between the ground truth and the inference result of your model. For example, if there are 3 classes in total, then for an image with label 0 the ground truth can be represented by the vector [1, 0, 0]. If the output of the neural network is [0.5, 0.2, 0.3], the MSE can be calculated as:
$\frac{sumOfDifference}{numberOfClasses} = \frac{(0.5 - 1)^2 + (0.2 - 0)^2 + (0.3 - 0)^2}{3} = \frac{0.38}{3} \approx 0.1267$
MSE corresponds to the squared straight-line (Euclidean) distance between two points, averaged over the dimensions. In a neural network, we apply the back-propagation algorithm to iteratively minimize the MSE so the network can learn from your data; the next time the network sees similar data, the inference result should be similar to the training output.
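To make the arithmetic concrete, here is a minimal NumPy sketch of the calculation above; the variable names are my own and not tied to any particular framework.

```python
# A minimal sketch reproducing the MSE example above.
import numpy as np

ground_truth = np.array([1.0, 0.0, 0.0])   # one-hot label for class 0
prediction   = np.array([0.5, 0.2, 0.3])   # network output

mse = np.mean((prediction - ground_truth) ** 2)
print(mse)  # (0.25 + 0.04 + 0.09) / 3 = 0.12666...
```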
Cross Entropy Loss
Cross entropy loss is another common loss function, mainly used in classification problems. Cross entropy is more advanced than mean squared error; its derivation comes from maximum likelihood estimation in statistics.
In my personal engineering experience, cross entropy usually gives a better result than MSE. If you want more intuition, you can read this article from James D. McCaffrey; it has an example showing that two sets of classification results can have the same accuracy yet predictions of very different quality, and cross entropy, being sensitive to the confidence of each prediction, separates them much more sharply than MSE does.
The formula of cross entropy is:
$$H_{y’} (y) := - \sum_{i} y_{i}’ \log (y_i)$$
Where $y_{i}'$ is the ground truth (one-hot) label for class $i$ and $y_i$ is the predicted probability your classifier assigns to class $i$. During training, the cross entropy $H_{y'}(y)$ is minimized by the back-propagation algorithm.
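Here is a minimal NumPy sketch of the same formula, reusing the illustrative numbers from the MSE example; the natural logarithm is used (see the note below), and the small epsilon is just my own guard against log(0).

```python
# A minimal sketch of the cross entropy loss for a single training instance.
import numpy as np

y_true = np.array([1.0, 0.0, 0.0])   # ground truth, one-hot
y_pred = np.array([0.5, 0.2, 0.3])   # predicted class probabilities

eps = 1e-12  # avoid taking log of exactly zero
cross_entropy = -np.sum(y_true * np.log(y_pred + eps))
print(cross_entropy)  # -log(0.5) ~= 0.6931
```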
Even though I use cross entropy almost every day, I should admit that until now I haven't built a really good intuition for it. The reason is that there is a long road from its origin to the formula we use today, and the physical meaning behind it has largely vanished through successive substitutions of its basic components. Originally, cross entropy was a binary quantity in information theory, describing the minimum number of bits required to represent a signal space, so the log function should be base 2. Today, however, in TensorFlow I believe the base of the log function is e, the natural logarithm.
Hopefully this article, A Friendly Introduction to Cross-Entropy Loss by Rob DiPietro, can give you some intuition about where cross entropy comes from.
Cross entropy is probably the most important loss function in deep learning; you can see it almost everywhere, but its usage varies a lot from task to task.
L1 Loss for a position regressor
L1 loss is a very intuitive loss function; the formula is:
$$ S := \sum_{i=0}^n|y_i - h(x_i)| $$
Where $S$ is the L1 loss, $y_i$ is the ground truth and $h(x_i)$ is the inference output of your model.
People think of this as almost the most naive loss function, but it has good aspects: first, it indeed gives you a reasonable measure of the difference between the model output and the ground truth; second, it is simple and computationally efficient.
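As a quick illustration, here is a minimal NumPy sketch of the L1 loss for a made-up box regression example; the coordinates are invented purely for demonstration.

```python
# A minimal sketch of the L1 loss between a predicted and a ground-truth box.
import numpy as np

y   = np.array([12.0, 30.0, 96.0, 140.0])   # ground-truth box (x1, y1, x2, y2)
h_x = np.array([10.0, 33.0, 90.0, 150.0])   # predicted box

l1_loss = np.sum(np.abs(y - h_x))
print(l1_loss)  # |12-10| + |30-33| + |96-90| + |140-150| = 21.0
```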
If you are interested in CNN-based object detection, you will find there is a region proposal network (RPN) in two-stage object detection models (RCNN, Fast-RCNN, Faster-RCNN, etc.). The RPN is used to generate position candidates that may contain a target object; a classifier then judges which class the object belongs to.
I remember someone asked the author of Fast-RCNN why L1 loss is used for the RPN; the author, RBG, replied that they simply used it and it worked, and that using L2 loss should give a similar result, if not a better one.
Average Binary Cross Entropy for segmentation
The formula could not be displayed properly; I tried many ways but didn't succeed in making it render correctly, so I think I should stop here for today. Sad~
TBC…
Focal Loss: reducing the pain of training a dense object detector
A combination of Loss functions to reach your destination
Take a break
Recently, I've been feeling blue, but never mind, nobody cares.
Cheers~
License
The content of this blog itself is licensed under the Creative Commons Attribution 4.0 International License.
The containing source code (if applicable) and the source code used to format and display that content is licensed under the Apache License 2.0.
Copyright [2017] [yeephycho]
Licensed under the Apache License, Version 2.0 (the “License”);
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
Apache License 2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an “AS IS” BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied. See the License for the specific language
governing permissions and limitations under the License.