Exploring Segmentation Methods and Dealing with Noisy Labels in Segmentation

13 minute read

Published:

Table of Contents

  1. Dealing with Noisy Labels in Segmentation
    • Confidence in Segmentation
    • Post-Processing
    • Leveraging a Small Clean Dataset with a Larger Noisy Dataset
    • Identifying Outliers Using Clustering
    • MixUp in Segmentation
  2. Classic Segmentation Approaches
  3. Deep Learning Approaches
  4. Choosing Loss Functions for Segmentation
  5. Conclusion

Dealing with Noisy Labels in Segmentation

Noisy labels in segmentation tasks can significantly impact model performance. This blog post explores strategies to mitigate the impact of label noise, from preprocessing techniques to advanced learning methods.


1. Confidence in Segmentation

How Confidence is Used:

  • Confidence is derived from the softmax probabilities for each pixel, representing the model’s certainty about its predictions.
  • It helps filter noise by identifying pixels with low confidence and treating them differently during training.

Applications in Segmentation:

  • Noise Filtering: Ignore or down-weight pixels with low confidence.
  • Pseudo-Labeling: Use high-confidence predictions to generate pseudo-labels for noisy or unlabeled data.
  • Uncertainty Estimation: Techniques like Monte Carlo Dropout provide pixel-wise uncertainty estimates.

2. Post-Processing

Post-processing techniques refine segmentation masks, particularly useful for noisy labels.

Steps for Post-Processing:

  1. Edge Detection:
    • Use methods like the Canny Edge Detector to identify object boundaries.
  2. Morphological Operations:
    • Apply erosion and dilation to clean noisy regions or connect disjointed segments.
  3. CRFs (Conditional Random Fields):
    • Align segmentation masks with object boundaries for smoother transitions.
  4. Contour-Based Refinement:
    • Fit contours to segmented regions to smooth out irregularities caused by noise.

3. Leveraging a Small Clean Dataset with a Larger Noisy Dataset

When a small portion of the dataset has clean labels, these strategies help extract the best from both datasets:

3.1. Pretraining on the Clean Dataset:

  • Train on Clean Data: Pretrain the model on the clean dataset to learn accurate features.
  • Fine-Tune on Noisy Data: Use robust loss functions (e.g., GCE, Symmetric Loss) or confidence-based filtering to minimize the impact of noisy labels.

3.2. Semi-Supervised Learning:

  • Pseudo-Labeling: Generate pseudo-labels for the noisy dataset using high-confidence predictions.
  • Consistency Regularization: Ensure consistent predictions for augmented versions of noisy data.
  • MixMatch: Blend supervised and unsupervised learning for improved performance.

3.3. Curriculum Learning:

  • Start with easy, clean examples.
  • Gradually introduce noisy or uncertain samples as the model becomes more robust.

4. Identifying Outliers Using Clustering

Clustering algorithms help detect noisy labels by grouping similar pixels or regions and flagging outliers.

Steps for Clustering-Based Outlier Detection:

  1. Feature Extraction:
    • Extract features from the model’s intermediate layers or raw data.
  2. Clustering:
    • Use algorithms like K-Means or DBSCAN to group pixels or regions.
  3. Flagging Outliers:
    • Identify outliers as points that deviate from cluster centers or belong to small clusters.
  4. Label Correction:
    • Use flagged regions to refine noisy segmentation masks.

5. MixUp in Segmentation

MixUp, a data augmentation technique, is gaining popularity for segmentation tasks.

How MixUp Works:

  • Combine two images and their masks:
    • ( x_{\text{mix}} = \lambda x_1 + (1 - \lambda) x_2 )
    • ( y_{\text{mix}} = \lambda y_1 + (1 - \lambda) y_2 )
  • ( \lambda ): Random mixing coefficient sampled from ([0, 1]).

Benefits:

  • Improves generalization by creating new training samples.
  • Reduces overfitting in small datasets.
  • Increases robustness to label noise.

Challenges:

  • Blending masks introduces ambiguity at boundaries, requiring careful adjustments.

Summary

  1. Confidence in Segmentation: Use softmax probabilities for noise filtering and pseudo-labeling.
  2. Post-Processing: Refine masks with edge detection, morphological operations, or CRFs.
  3. Small Clean Dataset with Noisy Dataset: Pretrain on clean data, use semi-supervised learning, or adopt curriculum learning.
  4. Clustering for Outlier Detection: Use clustering algorithms to identify and correct noisy labels.
  5. MixUp in Segmentation: A growing technique to mitigate noise, with challenges in handling blended masks.

By combining these approaches, you can significantly improve segmentation results even in the presence of noisy labels.

Segmentation Methods: Classic and Deep Learning Approaches

Segmentation is a crucial task in computer vision, with applications ranging from medical imaging to autonomous driving. In this post, I explore two classic approaches to segmentation—clustering and graph-based methods—and two popular deep learning models, Mask R-CNN and U-Net. Additionally, I provide insights into choosing loss functions for segmentation tasks.


Classic Segmentation Approaches

1. Clustering-Based Segmentation

Segmentation as Clustering

  • Pro:
    • Generally simple.
    • Can handle most data distributions with sufficient effort.
  • Con:
    • Hard to capture global structure.
    • Performance is limited by simplicity.

Mean-Shift Clustering for Segmentation

  • How It Works:
    • Cluster: Groups data points in the attraction basin of a mode.
    • Attraction Basin: The region where all trajectories lead to the same mode.
    • Key steps:
      1. Extract features (color, gradients, texture).
      2. Initialize windows at individual pixel locations.
      3. Perform mean shift for each window until convergence.
      4. Merge windows ending near the same mode.
    • more specifically:
    • given distrinution of N pixels in feature spae find modes (clusters) of distribution.
    • set mi = fi as initial mean for each pixel.
    • repeat the foloowing for each mean:
      • place windown of size w around mi
      • computer centroid m with the window , set mi = m
      • stop if shift in mean mi is less than a threshold epsilon mi is the mode.
    • label all pixels that hame same mode as belonging to the same cluster.

In statistics, the mode is the value that appears most often in a set of data values.[1] If X is a discrete random variable, the mode is the value x at which the probability mass function takes its maximum value (i.e., x=argmaxxi P(X = xi)). In other words, it is the value that is most likely to be sampled. Like the statistical mean and median, the mode is a way of expressing, in a (usually) single number, important information about a random variable or a population.

i used this video to understand better: https://www.youtube.com/watch?v=PCNz_zttmtA

if we map the pixels to feature space we can get the image pixel feature distribusion, then we can add axis that tell us how dense the distribution is. each hill represents a cluster. the peak\mode of hill represents center of cluster. each pixel climb the steepest hill withim is neighborhood. pixel assigner to the hill that it climbs. it will “end” at the mode of the hill. we end up with clusters. the algorithm go ike this: (assuming the feature space is 2D) for each sample pixel we want to find the hill that that pixel belongs to (the cluster). for each pixel in the feature space, we will take a window, assume a circele with radius w around that pixel. and we look at the other pixels in that window and computer their centroid (by calculating the mean or weighted mean). then you’re moving the circle window to that calculated mean. we can imaigine the vector connecting from the initial pixel to the calculated mean. we repeat this process. we’re going to end up climbing the steepest in some point you’re going to stop moving, the calculated mean will be the same initial pixel. that is the pick or mode of that hill. the initial pixel is going to be clustered to that mode. pixels that assigned to the same mode are belong to the same cluster.

drawing

  • Pro:
    • No assumption about the number of clusters. ( in contrust to k-means the num of segments also sensitive to initialization).
    • Handles unusual distributions. robust to outliers
    • Simple to implement.
  • Con:
    • Choice of window size affects performance.
    • Computationally expensive for large datasets.

2. Graph-Based Segmentation

Images as Graphs

  • Representation:
    • Node (vertex) for each pixel.
    • Edges between pairs of pixels, with an affinity weight (w_{pq}).
      • (w_{pq}): Measures similarity (e.g., inversely proportional to differences in color or position).

Types of Graph Construction

  1. Fully Connected:
    • Captures all pairwise similarities.
    • Computationally infeasible for large images.
  2. Neighboring Pixels:
    • Very fast but captures only local interactions.
  3. Local Neighborhood:
    • Balances speed and interaction range.

Graph Cut for Segmentation

  • Breaking the Graph:
    • Delete edges between segments.
    • Use low similarity (low weight) edges to separate regions.
  • Cost of a Cut:
    • Sum of weights of edges being cut.
  • Goal: Find the minimum cut that divides the graph into meaningful segments.
  • Drawbacks:
    • Biased toward cutting small, isolated components.
    • Weight of the cut is proportional to the number of edges.

Deep Learning Approaches

1. Mask R-CNN

Overview:

Mask R-CNN extends Faster R-CNN by adding a branch to predict binary masks for objects alongside class labels and bounding boxes. It outputs segmentation masks in a pixel-to-pixel manner.

built upon Faster RCNN in parallel to predicting the class and box offset, Mask R-CNN also outputs a binary mask for each RoI. The mask branch is a small FCN applied to each RoI, predicting a segmentation mask in a pixel-topixel manner

we rely on the dedicated classification branch to predict the class label used to select the output mask.extracting the spatial structure of masks can be addressed naturally by the pixel-to-pixel correspondence provided by convolutions.we predict an m × m mask from each RoI using an FCN [30]. This allows each layer in the mask branch to maintain the explicit m × m object spatial layout without collapsing it into a vector representation that lacks spatial dimensions

This pixel-to-pixel behavior requires our RoI features, which themselves are small feature maps, to be well aligned to faithfully preserve the explicit per-pixel spatial correspondence. This motivated us to develop the RoIAlign layer.

during training, we define a multi-task loss on each sampled RoI as L = Lcls + Lbox + Lmask. The mask branch has a Km2 dimensional output for each RoI, which encodes K binary masks of resolution m × m, one for each of the K classes.

To this we apply a per-pixel sigmoid, and define Lmask as the average binary cross-entropy loss.

For an RoI associated with ground-truth class k, Lmask is only defined on the k-th mask (other mask outputs do not contribute to the loss).

Our definition of Lmask allows the network to generate masks for every class without competition among classes;

drawing

in the paper they also mentioned that Our framework can easily be extended to human pose estimation. We model a keypoint’s location as a one-hot mask, and adopt Mask R-CNN to predict K masks, one for each of K keypoint types (e.g., left shoulder, right elbow). This task helps demonstrate the flexibility of Mask R-CNN.

Key Features:

  • RoIAlign Layer:
    • Ensures spatial alignment of features for precise per-pixel predictions.
  • Multi-Task Loss:
    • (L = L_{cls} + L_{box} + L_{mask}), where (L_{mask}) is a binary cross-entropy loss applied to the predicted mask.
  • Flexibility:
    • Easily extends to tasks like human pose estimation by predicting keypoint masks.

Mask R-CNN Architecture


2. U-Net

Overview:

U-Net is a convolutional neural network architecture designed for biomedical image segmentation. It features a symmetric encoder-decoder structure with skip connections.

Key Features:

  • Skip Connections:
    • Preserve spatial information by connecting encoder features directly to corresponding decoder layers.
  • Multi-Scale Context:
    • Combines local and global context for accurate segmentation.
  • Wide Applications:
    • Works well for medical imaging, remote sensing, and other pixel-level tasks.

Choosing Loss Functions for Segmentation

1. Binary Cross-Entropy (BCE):

  • Usage:
    • Measures differences between actual and predicted masks.
    • Assumes balanced classes, struggles with heavy class imbalances.
  • Limitations:
    • Poor performance on rare or small objects.

2. Dice Coefficient and Dice Loss:

dice coefficient is calculated from the precision and recall of a prediction. then is scores the overlap between predicted segmentation and the ground truth, Dice coefficient = F1 score: a harmonic mean of precision and recall. In other words.

good segmentation will have big area of overlap, so 2100/(100+100) results in 1, bad segmentation will have small area of overlap, so 250/(100+100) results in 0.5.

  • Usage:
    • Calculates overlap between predicted and true masks.
    • Adapted to loss as: [ L_{\text{dice}} = 1 - \frac{2 \cdot |P \cap T|}{|P| + |T|} ] where (P) is the predicted mask and (T) is the ground truth mask.
  • Advantages:
    • Handles class imbalances better than BCE.\

Binary Cross-Entropy Cross-entropy is used to measure the difference between two probability distributions. It is used as a similarity metric to tell how close one distribution of random events are to another, and is used for both classification (in the more general sense) as well as segmentation.

The binary cross-entropy (BCE) loss therefore attempts to measure the differences of information content between the actual and predicted image masks. It is more generally based on the Bernoulli distribution, and works best with equal data-distribution amongst classes. In other terms, image masks with very heavy class imbalance may (such as in finding very small, rare tumors from X-ray images) may not be adequately evaluated by BCE.

This is due to the fact that the BCE treats both positive (1) and negative (0) samples in the image mask equally. Since there may be an unequal distribution of pixels that represent a given object (say, a car from the first image above) and the rest of the image, the BCE loss may not effectively represent the performance of the deep-learning model.

Dice Coefficient This is a widely-used loss to calculate the similarity between images and is similar to the Intersection-over-Union heuristic. The Dice Coefficient has as such, been adapted to a loss function as the Dice Loss


Conclusion

Segmentation can be approached through classical methods like clustering and graph cuts or through advanced deep learning models like Mask R-CNN and U-Net. Each method has its advantages, with the choice depending on the task, dataset, and computational constraints. Loss functions such as Dice Loss or Binary Cross-Entropy further refine performance, ensuring robust and accurate segmentation results.