Overview of Loss Functions in Machine Learning
In machine learning, the loss function measures the difference between the actual output and the output predicted by the model for a single training example, while the average of the loss over all training examples is termed the cost function. The value computed by a loss function (such as a regression loss, a binary classification loss, or a multi-class classification loss) is called the error value; the larger the gap between the actual and predicted values, the larger this error.
How Do Loss Functions Work?
The word ‘loss’ refers to the penalty for failing to achieve the expected output. If the value predicted by our model deviates strongly from the expected value, the loss function returns a large number; if the deviation is small and the prediction is close to the expected value, it returns a small number.
Here’s an example in which we try to predict house sale prices in metro cities.
| Predicted Sales Price (in lakh) | Actual Sales Price (in lakh) | Deviation (Loss) |
| --- | --- | --- |
| Bangalore: 45, Pune: 35, Chennai: 40 | Bangalore: 45, Pune: 35, Chennai: 40 | 0 (all predictions are correct) |
| Bangalore: 40, Pune: 35, Chennai: 38 | Bangalore: 45, Pune: 35, Chennai: 40 | 5 lakh for Bangalore, 2 lakh for Chennai |
| Bangalore: 43, Pune: 30, Chennai: 45 | Bangalore: 45, Pune: 35, Chennai: 40 | 2 lakh for Bangalore, 5 lakh for Pune, 5 lakh for Chennai |
It is important to note that for classification problems the amount of deviation does not matter; what matters is whether the value predicted by the model is right or wrong. The appropriate loss function depends on the problem statement to which machine learning is being applied. The term cost function is often used interchangeably with loss function, but it has a slightly different meaning: a loss function is defined for a single training example, while a cost function is the average loss over the complete training dataset.
Types of Loss Functions in Machine Learning
The different types of loss functions in machine learning are as follows:
1. Regression Loss Functions
Linear regression is the fundamental model behind these loss functions. It assumes a linear relationship between a dependent variable (Y) and an independent variable (X), so we try to fit the best possible line through the data points, where:
- X = Independent variables
- Y = Dependent variable
Mean Squared Error Loss
MSE (the L2 error) measures the average squared difference between the actual values and the values predicted by the model. The output is a single number for a set of predictions. Our aim is to reduce the MSE to improve the accuracy of the model.
For the linear model y = mx + b, the MSE can be written as:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - (mx_i + b)\bigr)^2$$

Here, N is the total number of data points, the factor $\frac{1}{N}\sum_{i=1}^{N}$ takes the mean, $y_i$ is the actual value, and $mx_i + b$ is its predicted value.
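As a quick illustration, here is a minimal NumPy sketch of MSE (the function name `mse_loss` and the example values are ours, not from any particular library):

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# House-price example from the table above (values in lakh).
actual = [45, 35, 40]
predicted = [40, 35, 38]
print(mse_loss(actual, predicted))  # (25 + 0 + 4) / 3 ≈ 9.67
```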
Mean Squared Logarithmic Error Loss (MSLE)
MSLE measures the ratio between the actual and predicted values rather than their absolute difference, which introduces an asymmetry into the error curve; it effectively cares only about the relative (percentage) difference between actual and predicted values. It can be a good choice of loss function when predicting continuous quantities such as house or bakery sale prices.
Here, the loss is the mean, over the observed data, of the squared differences between the log-transformed actual and predicted values:

$$\mathrm{MSLE} = \frac{1}{N}\sum_{i=1}^{N}\bigl(\log(y_i + 1) - \log(\hat{y}_i + 1)\bigr)^2$$
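A minimal sketch of MSLE along the same lines, using NumPy's `log1p` for the log(1 + x) transform (the function name and sample values are illustrative):

```python
import numpy as np

def msle_loss(y_true, y_pred):
    """Mean squared logarithmic error on log(1 + x) transformed values."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

# MSLE penalises relative error: predicting 38 instead of 40 costs far less
# than predicting 2 instead of 4, even though both are off by 2.
print(msle_loss([40], [38]))  # ≈ 0.0025
print(msle_loss([4], [2]))    # ≈ 0.26
```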
Mean Absolute Error (MAE)
MAE calculates the average of the absolute differences between the actual and predicted values; that is, it measures the average magnitude of the errors in a set of predictions. The mean squared error is easier to optimise, but the absolute error is more robust to outliers. Outliers are values that deviate extremely from the other observed data points.
MAE can be calculated as:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\,y_i - \hat{y}_i\,\right|$$
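And a corresponding sketch of MAE (again, the helper name is just for this example):

```python
import numpy as np

def mae_loss(y_true, y_pred):
    """Mean absolute error: average magnitude of the errors."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Third row of the house-price table above (values in lakh).
actual = [45, 35, 40]
predicted = [43, 30, 45]
print(mae_loss(actual, predicted))  # (2 + 5 + 5) / 3 = 4.0
```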
2. Binary Classification Loss Functions
These loss functions measure the performance of a classification model in which each data point is assigned one of two labels, either 0 or 1. They can be further classified as:
Binary Cross-Entropy
Binary cross-entropy is the default loss function for binary classification problems. It measures the performance of a classification model whose output is a probability value between 0 and 1, and the loss increases as the predicted probability deviates from the actual label.
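To make this behaviour concrete, here is a minimal NumPy sketch of binary cross-entropy (the clipping constant `eps` and the function name are our own choices to keep the example self-contained):

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p_pred) + (1.0 - y_true) * np.log(1.0 - p_pred))

# Confident correct predictions give a small loss; wrong ones give a large loss.
print(binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8]))  # ≈ 0.14
print(binary_cross_entropy([1, 0, 1], [0.2, 0.9, 0.3]))  # ≈ 1.71
```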
Hinge Loss
Hinge loss can be used as an alternative to cross-entropy; it was originally developed for use with the support vector machine algorithm. Hinge loss works best for classification problems in which the target values are in the set {-1, 1}. It assigns a larger error when the sign of the predicted value differs from the sign of the actual value, which can result in better performance than cross-entropy on some problems.
Squared Hinge Loss
Squared hinge loss is an extension of hinge loss that simply squares the hinge loss score. It smooths the error surface and makes it numerically easier to work with, while still finding the classification boundary that gives the maximum margin between the data points of the different classes. Squared hinge loss fits well for yes/no decision problems where the deviation in probability is not the concern.
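The following sketch covers both hinge loss and squared hinge loss for {-1, +1} labels and raw model scores (the function names and sample scores are illustrative):

```python
import numpy as np

def hinge_loss(y_true, scores):
    """Hinge loss: max(0, 1 - y * score), averaged over the samples."""
    y_true, scores = np.asarray(y_true, dtype=float), np.asarray(scores, dtype=float)
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

def squared_hinge_loss(y_true, scores):
    """Squared hinge loss: the per-sample hinge term squared, then averaged."""
    y_true, scores = np.asarray(y_true, dtype=float), np.asarray(scores, dtype=float)
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores) ** 2)

y = [1, -1, 1]
scores = [0.8, -0.5, -0.3]  # the last prediction has the wrong sign
print(hinge_loss(y, scores))          # (0.2 + 0.5 + 1.3) / 3 ≈ 0.67
print(squared_hinge_loss(y, scores))  # (0.04 + 0.25 + 1.69) / 3 = 0.66
```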
3. Multi-class Classification Loss Functions
Multi-class classification covers predictive models in which data points are assigned to one of more than two classes. Each class is assigned a unique integer value from 0 to (Number_of_classes – 1). It is widely used for image or text classification problems, for example assigning a document to one of several possible topics.
Multi-class Cross-Entropy
In this case, the target values are in the set {0, 1, 2, …, Number_of_classes – 1}. The loss is the average cross-entropy between the actual and predicted probability distributions over the classes, and this score is minimised to reach the best possible accuracy. Multi-class cross-entropy is the default loss function for text classification problems.
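A minimal sketch of multi-class cross-entropy with one-hot targets (the helper name and the two-sample, three-class data are purely illustrative):

```python
import numpy as np

def categorical_cross_entropy(y_onehot, p_pred, eps=1e-12):
    """Cross-entropy between one-hot targets and predicted class probabilities."""
    y_onehot = np.asarray(y_onehot, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_onehot * np.log(p_pred), axis=1))

# Two samples, three classes; each row of p_pred sums to 1.
y_onehot = [[0, 1, 0],
            [1, 0, 0]]
p_pred = [[0.1, 0.8, 0.1],
          [0.6, 0.3, 0.1]]
print(categorical_cross_entropy(y_onehot, p_pred))  # ≈ (0.22 + 0.51) / 2 ≈ 0.37
```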
Sparse Multi-class Cross-Entropy
When the number of classes is large, the one-hot encoding of the targets makes multi-class cross-entropy expensive to compute and store. Sparse cross-entropy solves this problem by calculating the error directly from the integer class labels, without one-hot encoding.
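For comparison, here is a sketch of the sparse variant, which takes integer labels directly instead of one-hot vectors (same illustrative data as above):

```python
import numpy as np

def sparse_categorical_cross_entropy(labels, p_pred, eps=1e-12):
    """Cross-entropy computed from integer class labels, without one-hot encoding."""
    labels = np.asarray(labels, dtype=int)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0)
    # Pick the predicted probability of the true class for each sample.
    picked = p_pred[np.arange(len(labels)), labels]
    return -np.mean(np.log(picked))

labels = [1, 0]  # integer labels instead of one-hot vectors
p_pred = [[0.1, 0.8, 0.1],
          [0.6, 0.3, 0.1]]
print(sparse_categorical_cross_entropy(labels, p_pred))  # ≈ 0.37, same as above
```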
Kullback Leibler Divergence Loss
KL divergence loss calculates the divergence between a predicted probability distribution and a baseline (reference) distribution, quantifying how much information is lost (in bits or nats) when the predicted distribution is used to approximate the baseline. The output is a non-negative value that indicates how close the two probability distributions are. From a probabilistic viewpoint, KL divergence is the expected value of the log likelihood ratio between the two distributions.
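A small sketch of KL divergence between two discrete distributions (the distributions here are made up for illustration; `np.log` gives the result in nats, while `np.log2` would give bits):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(P || Q) for discrete distributions, in nats."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return np.sum(p * np.log(p / q))

baseline  = [0.5, 0.3, 0.2]   # reference distribution P
predicted = [0.4, 0.4, 0.2]   # model distribution Q
print(kl_divergence(baseline, predicted))  # ≈ 0.025 (non-negative)
print(kl_divergence(baseline, baseline))   # 0.0 for identical distributions
```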
In this article, we first looked at how loss functions work and then explored a comprehensive list of loss functions with use-case examples. However, understanding them in practice is even more beneficial, so try to read further and implement them yourself; doing so will clear up your doubts thoroughly.