Image Processing in Embedded Systems

10 minute read

Published: August 02, 2025

Draft Mode

image processing in embedded systems

what am i trying to learn in this post?:

skills as such pointer arithmetic (for buffer manipulation), memory layout intuition (stride, alignment, padding), efficient code (avoid unnecessary copies), thinking about memory saftey (manage allocations, prevent buffer overflows), edge handling and robustness, transition between high to low level.

implemententaion for practice:

crop convulotion max pooling sobel filter downsampling resize rgb to grayscale transpose box filter gaussian filter resize nearest neighbor flipping integral image

theory:

Why do we have a stride S > N This happens very often in real-world image and matrix representation. the reasons are: 1) Memory Alignment: GPUs, SIMD Instructions, and cache lines prefer that rows start at certain byte boundaries. so even if your row is 3 bytes the actual row size might be padded to 4/8. 2) Efficient row access: stride allows jumping to the start of the next row using row_start += stride, without needing to caculate lengths dynamically. 3) Sub-matrix views or crops: a library may store a larger image but let you process a region, and your vie may be a window within that buffer but rows are still padded.

pointers notes:

when you pass int* ptr to a function the pointer is passed byt value, the function can access and modify the memory it points to. this line is example for pointer arithmetic: *(ptrOut++) = *(ptrIn + j * S + i); pay attention, trying to reassign the pointer inside the function copyMatrix won’t affect the caller and can be dangourous. why? if you allocated memory inside and didnt return it to the caller it can cause memory leak. the caller doesnt know where the output is stored. you can do it like this but it is better to avoid hidden allocation and surprises, better design is to let the caller allocate memory and pass pointers into function.

uint8_t* copyMatrix(uint8_t* ptrIn, int M, int N, int S) {
    uint8_t* ptrOut = new uint8_t[M * N];
    // fill it
    return ptrOut;
}

// Usage:

uint8_t* output = copyMatrix(ptrIn, M, N, S);

remember when you pass an array to a function it decays to a pointer to its first element, we can run both and they are equivalet: copyMatrix(matIn, ptrOut, M, N, S); or copyMatrix(&matIn[0], ptrOut, M, N, S);

what to do with borders?

for example the box filter: 1) Only divide by the actual number of valid pixels 2) replicate padding (copy the borders valies). instead of skipping invalid neighbors clamp them to the nearest valid pixel. 3) Zero padding (not good for average pixes, border pixels darken heavily. ofthen used in cnn but not in image smoothing. 4) Mirror Padding, reflect the image accross the border.

Integers Division constraints:

there are operations that can be expensive or unsupported in DSP. like division, or floating point. also modulo and square root and trigonometric function and alot of if-else trees. it is usually prefreeed to use integer math, bit shifts, table lookups, fixed point math and linear approximations.

when a processor doesnt have a floating point unit we work with integers. lets say we want to represent the value 1.5 but we use only integer so we will use scaling factor of 256 1.5 * 256 = 384, then then when needed shift right by 8 bits to interpret them. we do it cuz id were doing division with integers we will get 7/2=3 so we lose precision. so convert the some DSPs/NPUs don’t have efficient or any division instructions. division by constant is typically replaced by bit shifts (if power-of-two) and multiply by revirpcal then shift (Quantization). for example, const uint16_t scale = (1 « 14) / 9; out = (sum * scale) » 14; this simulates sum/9 with fixed point math.

instead of x/d we are doing (x * reciprocal) » shift_amount, the reciprocal is : reciprocal = (1 « 14) / 9 = 1820 // fixed-point scale with 14-bit shift and then we’re doing the shift output = (sum * 1820) » 14;.

considerations will be overflow, tuning by trying different fixed point scales based on precision needs. consider rounding bias before shifting.

what would we do if the division factor is a parameter? precompute fixed point approximations for expected counts: const uint16_t fixedPointMult[10] = {0, 0, 8192, 5461, …};

stack and heap:

if we declate int matIn[MN] = {…} in main its a fixed size array and its allocated in the main so its stored on the stack, once main() ends its disappears. using int ptrOut = new int[M*N] IT IS A DYNAMIC MEMORY ALOCATION, ITS ON THE HEAP AND IT STAYS ALLOCATED UNTIL YOU DELETED[] ptrOut. Stack memory is automatically managed, heap is manual Use heap when matrix is large, its size is dynamic, you need the matrix to outlive the function, you want to buit framework and reusable code. use stack when matrix is small, local, short lived buffer and you want less manual memory managment. in my use case image operations like transpose, box filter etc, its more realistic and scalable to use ptrIn and ptrOut on heap.

syntax notes:

dont forget to declare the type vector<vector> use const for constant number for example const int rows = 3; allocate pointer using uint8_t* ptrPut = new uint8_t[] or new uint8_t[SIZE] or malloc remember to free memory properly delete[] for example delete[] ptr; we can do for (int i = 0; i < 5; ++i) { or for (int i = 0; i < 5; i++) { in a for loop, ++i does not skip the first iteration. It’s just a slightly more efficient and preferred style when used with complex types like iterators.

///

## full code for copying without the strides
motivation: stride aware memory access is critical in image processing and embedded systems where you work with DMA transfers, padded buffers or hw aligned memory. 
also you often process images in flat buffers to optimize cache and minimize memory usage. you implement custom layers or operation in C for DSP/NPU and need to control memory layout explicity.

#include <iostream>
#include <cstdint>   // For uint8_t
#include <iomanip>   // For formatted output

void copyMatrix(uint8_t* ptrIn, uint8_t* ptrOut, int M, int N, int S) {
    for (int j = 0; j < M; ++j) {
        for (int i = 0; i < N; ++i) {
            *(ptrOut++) = *(ptrIn + j * S + i);
        }
    }
}

int main() {
    const int M = 3;  // rows
    const int N = 3;  // actual data columns
    const int S = 5;  // stride: 3 data + 2 garbage per row

    // Heap-allocate input with stride
    uint8_t* ptrIn = new uint8_t[M * S];

    // Initialize input matrix: fill meaningful data and garbage
    for (int j = 0; j < M; ++j) {
        for (int i = 0; i < N; ++i)
            ptrIn[j * S + i] = j * N + i + 1;  // e.g., 1..9
        for (int i = N; i < S; ++i)
            ptrIn[j * S + i] = 255;  // garbage (e.g., 255)
    }

    // Heap-allocate output buffer (no stride, compact)
    uint8_t* ptrOut = new uint8_t[M * N];

    // Perform the copy
    copyMatrix(ptrIn, ptrOut, M, N, S);

    // Free memory
    delete[] ptrIn;
    delete[] ptrOut;

    return 0;
}

now i will do for box filter

void applyBoxFilter(uint8_t* ptrIn, uint8_t* ptrOut, int M, int N, int S, int filterSize=3){
	uint16_t sum = 0;
	int center_pixel_pos = 0;
	int neighbor_pos = 0;
	for (int i=0; i<M; i++){
		for (int j=0; j<N; j++){
			sum = 0;
			center_pixel_pos = ptrIn + i*S + j;
			for (int k_row = -1; k_row<2; k_row++){
				for (int k_col = -1; k_col<2; k_col++){
					neighbor_pos = central_pixel_pos + k*S + kcol;
					sum + = *neighbor_pos;
				}
			}
			*(ptrOut + i*S + j) = 28*(sum/255);
			}
		}
	}
}

———— from chatgpt —————————–

#include <cstdint>
#include <algorithm> // for std::clamp (optional)

void applyBoxFilter(uint8_t* ptrIn, uint8_t* ptrOut, int M, int N, int S, int filterSize = 3) {
    int half = filterSize / 2;

    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            uint16_t sum = 0;
            int count = 0;

            // Center pixel pointer
            uint8_t* center_pixel_ptr = ptrIn + i * S + j;

            // Loop over neighbors
            for (int k_row = -half; k_row <= half; ++k_row) {
                for (int k_col = -half; k_col <= half; ++k_col) {
                    int ni = i + k_row;
                    int nj = j + k_col;

                    // Bounds check
                    if (ni >= 0 && ni < M && nj >= 0 && nj < N) {
                        uint8_t* neighbor_ptr = ptrIn + ni * S + nj;
                        sum += *neighbor_ptr;
                        count++;
                    }
                }
            }

            // Option 1: true division
            // *(ptrOut + i * S + j) = static_cast<uint8_t>(sum / count);

            // Option 2: fixed-point approximation of /9 (your logic)
            // sum/255 is roughly sum >> 8
            // then *28 approximates /9
            uint8_t value = static_cast<uint8_t>(std::clamp((sum * 28) >> 8, 0, 255));
            *(ptrOut + i * S + j) = value;
        }
    }
}

RGB

#include <cstdint>

void rgbToGrayscale(
    const uint8_t* ptrRGB,  // input RGB image (flat)
    uint8_t* ptrGray,       // output grayscale buffer
    int rows,               // number of rows (M)
    int cols,               // number of columns (N)
    int strideRGB           // stride in input buffer (in bytes, e.g., 3*N if tightly packed)
) {
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            const uint8_t* pixel = ptrRGB + i * strideRGB + j * 3;

            uint8_t R = pixel[0];
            uint8_t G = pixel[1];
            uint8_t B = pixel[2];

            // Use integer approximation: (77*R + 150*G + 29*B) >> 8 ≈ Gray
            uint8_t gray = (77 * R + 150 * G + 29 * B) >> 8;

            ptrGray[i * cols + j] = gray;
        }
    }
}

Share on

Twitter Facebook LinkedIn

Shani Israelov

Image Processing in Embedded Systems

Draft Mode

image processing in embedded systems

what am i trying to learn in this post?:

implemententaion for practice:

theory:

pointers notes:

what to do with borders?

Integers Division constraints:

stack and heap:

syntax notes:

now i will do for box filter

———— from chatgpt —————————–

RGB

Share on

You May Also Enjoy

Understanding Loss Functions

Understanding Loss Functions in Machine Learning

Diffusion Models Extensions

Diffusion Models Extensions

Exploring Segmentation Methods and Dealing with Noisy Labels in Segmentation

Table of Contents

Dealing with Noisy Labels in Segmentation

Diffusion Models for Virtual Try-On