Understanding Loss Functions

5 minute read

Published:

Understanding Loss Functions in Machine Learning

Loss functions play a critical role in machine learning, measuring the discrepancy between the model’s predictions and the desired outcomes. This document explores the components, effects, and trade-offs of loss functions, including task-specific, regularization, and auxiliary terms.


1. Understanding the Loss Function

We are going to measure our unhappiness with outcomes such as this one with a loss function (or sometimes also referred to as the cost function or the objective). Intuitively, the loss will be high if we’re doing a poor job of classifying the training data, and it will be low if we’re doing well.

The loss function measures how far the model’s predictions are from the desired outcomes. During training, the optimizer minimizes this loss by adjusting the model’s weights.

A standard loss function, ( L ), may consist of multiple terms:

[ L = L_{\text{task}} + \lambda_1 L_{\text{reg}} + \lambda_2 L_{\text{aux}} ]

  • ( L_{\text{task}} ): The primary task-specific loss (e.g., Cross-Entropy for classification, MSE for regression).
  • ( L_{\text{reg}} ): A regularization term to control model complexity (e.g., weight decay or sparsity).
  • ( L_{\text{aux}} ): Auxiliary terms for specific behaviors or constraints (e.g., fairness, disentanglement).

Classification

binary and multi class cross entropy:

first of all lets talk about entropy, entopy measures the randomness or uncertenty of an event. when H is large, it means the uncertainy is high, we can’t expect what will happen, if H is small, it means we have pretty of certainty about occurance of event.

Information theory view. The cross-entropy between a “true” distribution p and an estimated distribution q is defined as: H(p,q)=−∑p(x)logq(x) p interpretation is the distribution where all probability mass is on the correct class, ie p = [0, 0 … 1, …0] when its in the right place.

this is also equivalent to minimizing the KL divergence between the two distributions (a measure of distance). In other words, the cross-entropy objective wants the predicted distribution to have all of its mass on the correct answer.

In the probabilistic interpretation, we are therefore minimizing the negative log likelihood of the correct class, which can be interpreted as performing Maximum Likelihood Estimation (MLE). A nice feature of this view is that we can now also interpret the regularization term R(W) in the full loss function as coming from a Gaussian prior over the weight matrix W , where instead of MLE we are performing the Maximum a posteriori (MAP) estimation.

lets be reminded of MLE and MAP. Maximum likelihood estimation choose value that maximizes the probability of observed data

P(D|theta) likelihood P(theta) prior P(theta|D) posterior

theta_MLE = arg max P(Dtheta)

maximum a posteriori estimation choose value that is most probable give observed data and prior belief.

theta_MAP = arg max P(thetaD) = arg max P(Dtheta)P(theta)
remember we dont have a model for P(thetaD) or P(theta) only for P(Dtheta)

theta is the training data, D is the labels?

To be precise, the SVM classifier uses the hinge loss, or also sometimes called the max-margin loss. The Softmax classifier uses the cross-entropy loss.

2. Effects of Adding a Term

2.1. Regularization

  • Purpose: Prevent overfitting by constraining the model’s weights or outputs.
  • Common Terms:
    1. L1 Regularization: Promotes sparsity in weights by penalizing their absolute values: [ L_{\text{reg}} = \alpha \sum |w| ]
    2. L2 Regularization (Weight Decay): Penalizes large weights, ensuring smoother predictions: [ L_{\text{reg}} = \beta \sum w^2 ]
  • Effect on the Network:
    • L1 encourages sparse weights (e.g., feature selection).
    • L2 reduces large weight magnitudes, making the model less sensitive to noise.

2.2. Auxiliary Tasks

  • Purpose: Improve the model’s learning by introducing auxiliary objectives that provide additional supervision or encourage specific behavior.
  • Example Terms:
    • Multi-task learning: [ L = L_{\text{main}} + \lambda L_{\text{aux}} ] In segmentation, an auxiliary loss can help learn edges explicitly.
  • Effect on the Network:
    • Enhances learning of shared representations.
    • Stabilizes training, especially in cases of data scarcity or imbalance.

2.3. Adversarial Loss

  • Purpose: Encourage outputs to align with specific distributions.
  • Example:
    • In GANs: [ L = L_{\text{gen}} + \lambda L_{\text{adv}} ] where ( L_{\text{adv}} ) enforces realism in generated outputs.
  • Effect on the Network:
    • Drives the generator to produce outputs indistinguishable from real data.

2.4. Fairness or Constraint Terms

  • Purpose: Impose constraints for fairness, interpretability, or other properties.
  • Example Terms:
    1. A fairness loss to ensure equal accuracy across demographic groups: [ L_{\text{fair}} = \text{disparity metric} ]
    2. A disentanglement loss to separate latent factors.
  • Effect on the Network:
    • Alters the optimization trajectory to prioritize fairness or disentanglement, potentially sacrificing some predictive performance.

2.5. Domain-Specific Losses

  • Purpose: Solve specific challenges in tasks like generative modeling, segmentation, or video processing.
  • Example Terms:
    1. Perceptual Loss in style transfer or super-resolution: [ L_{\text{perceptual}} = ||\phi(\hat{y}) - \phi(y)||^2 ] where ( \phi ) is a pre-trained feature extractor (e.g., VGG).
    2. Temporal Consistency Loss in video tasks: [ L_{\text{temp}} = ||\hat{y}_{t+1} - \hat{y}_t|| ]
  • Effect on the Network:
    • Improves task-specific performance (e.g., better texture matching, smoother videos).

3. Trade-Offs When Adding Terms

3.1. Balancing Terms

  • The relative weight of the added term (e.g., ( \lambda )) controls its importance.
  • Setting ( \lambda ) too high can overshadow the primary task loss, leading to suboptimal performance.

3.2. Increased Complexity

  • Introducing additional terms may require:
    • More computational resources.
    • Careful tuning of hyperparameters.
    • Additional labeled data for auxiliary tasks.

3.3. Risk of Conflicting Objectives

  • If the added term conflicts with the primary task (e.g., over-regularization), it can hinder learning.

This guide outlines the key aspects of loss functions in machine learning, emphasizing how additional terms can shape the model’s performance. Understanding these mechanisms enables better task-specific design and improved results.

refrences: https://cs231n.github.io/linear-classify/