Here is the source that explains the implementation of the convolution operation using a large matrix multiplication:
https://sgugger.github.io/convolution-in-depth.html

Convolution in Depth

The intuition behind this approach is to build a new matrix in which each row is one flattened receptive field. Multiplying that matrix by the kernel, flattened in the same way, gives a flattened output matrix, so all that is left is to reshape it. The image shows an example of such a matrix. Let's go over a code sample taken from the source (slightly changed for simplicity) that creates this matrix for the case of stride 1 and no padding.
```python
import numpy as np
k1,k2 = k.shape # Kernel size(kernel k)
mb, ch, n1, n2 = x.shape # batches, channels, height and width (input x)
```
Here we just define the dimension variables.
```python
start_idx = np.array([j + (n2)*i for i in range(0,n1-k1+1) for j in range(0,n2-k2+1) ])
# Output, e.g. for a 7 x 7 input and a 3 x 3 kernel: np.array([0, 1, 2, 3, 4, 7, ..., 32])
```
These are the indices of all positions in the flattened input that can be the top-left (0, 0) element of a receptive field.
```python
grid = np.array([j + (n2)*i + (n1) * (n2) * k for k in range(0,ch) for i in range(k1) for j in range(k2)])
# Output: np.array([0, 1, 2, 7, 8, 9, 14, 15, 16])
```
These are the indices of the elements of one receptive field in the flattened input, relative to the field's top-left element.

```python
to_take = start_idx[:,None] + grid[None,:]
```
Here we add every starting index to every relative index of the flattened receptive field. The result is a matrix in which each row holds the indices, now relative to x, of one receptive field. We can then use the NumPy function take, which takes an array and a collection of indices and replaces each index with the value stored at that position of the flattened array.

```python
batch = np.array(range(0,mb)) * ch * (n1) * (n2)
```
These are the offsets of each sample of the batch in the flattened input.

```python
x.take(batch[:,None,None] + to_take[None,:,:])
```
Applying take with these indices returns the matrix we need for the convolution.
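To check that the pieces above fit together, here is a minimal sketch that wraps them in a function and compares the matrix-multiplication result with a brute-force loop. The function name `im2col`, the random test data, and the weight array `w` of shape (ch, k1, k2) are assumptions made for illustration and are not part of the source.

```python
import numpy as np

def im2col(x, k1, k2):
    """Return one flattened receptive field per row (stride 1, no padding)."""
    mb, ch, n1, n2 = x.shape
    # Flat index of the top-left corner of every receptive field
    start_idx = np.array([j + n2 * i
                          for i in range(n1 - k1 + 1)
                          for j in range(n2 - k2 + 1)])
    # Offsets of all elements of one receptive field, over all channels
    grid = np.array([j + n2 * i + n1 * n2 * c
                     for c in range(ch)
                     for i in range(k1)
                     for j in range(k2)])
    to_take = start_idx[:, None] + grid[None, :]
    # Offset of each sample of the batch in the flattened input
    batch = np.arange(mb) * ch * n1 * n2
    return x.take(batch[:, None, None] + to_take[None, :, :])

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 7, 7))   # batch, channels, height, width
w = rng.standard_normal((3, 3, 3))      # channels, k1, k2 (illustrative)

cols = im2col(x, 3, 3)                  # shape (2, 25, 27)
out = (cols @ w.reshape(-1)).reshape(2, 5, 5)

# Brute-force reference for comparison
ref = np.zeros((2, 5, 5))
for b in range(2):
    for i in range(5):
        for j in range(5):
            ref[b, i, j] = np.sum(x[b, :, i:i + 3, j:j + 3] * w)

print(np.allclose(out, ref))
```

The kernel is flattened in channel, row, column order because that is the order in which `grid` enumerates the elements of a receptive field.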

Matrix Multiplication Implementation for the forward prop

Mathieu, M., Henaff, M., & LeCun, Y. (2013). Fast training of convolutional networks through ffts. arXiv preprint arXiv:1312.5851.

Fast Training of Convolutional Networks through FFTs

The FFT implementation of the convolution is taken from a Siraj Raval video:
https://www.youtube.com/watch?v=FTr3n7uBIuE

Convolutional Neural Networks - The Math of Intelligence (Week 4)

The convolution theorem states that the Fourier transform of the convolution of two signals is the pointwise product of their Fourier transforms. So if we take the inverse Fourier transform of the product of the two Fourier transforms, we get the convolution of the two signals. Here is how it can be done in Python:
```python
fft_result = np.fft.fft2(image, target_dim) * np.fft.fft2(feature, target_dim)
target = np.fft.ifft2(fft_result).real
```
For large inputs and kernels this approach can be much faster than the direct implementation of the convolution.
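The snippet above leaves `image`, `feature` and `target_dim` undefined. The sketch below fills them in under one assumption: `target_dim` is taken to be the size of the full linear convolution, so that zero-padding avoids the wrap-around of a circular convolution. The function names and the random test data are illustrative only.

```python
import numpy as np

def fft_convolve2d(image, feature):
    # Zero-pad both arrays to the size of the full linear convolution
    target_dim = (image.shape[0] + feature.shape[0] - 1,
                  image.shape[1] + feature.shape[1] - 1)
    fft_result = np.fft.fft2(image, target_dim) * np.fft.fft2(feature, target_dim)
    return np.fft.ifft2(fft_result).real

def direct_convolve2d(image, feature):
    # Reference: every input pixel scatters a scaled copy of the kernel
    h, w = image.shape
    kh, kw = feature.shape
    out = np.zeros((h + kh - 1, w + kw - 1))
    for i in range(h):
        for j in range(w):
            out[i:i + kh, j:j + kw] += image[i, j] * feature
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6))
feature = rng.standard_normal((3, 3))

full = fft_convolve2d(image, feature)
print(np.allclose(full, direct_convolve2d(image, feature)))

# The "valid" cross-correlation used in CNNs is the central part of the
# convolution with the kernel flipped along both axes.
valid = fft_convolve2d(image, feature[::-1, ::-1])[2:6, 2:6]
ref = np.array([[np.sum(image[i:i + 3, j:j + 3] * feature) for j in range(4)]
                for i in range(4)])
print(np.allclose(valid, ref))
```

Note that the FFT product computes a true convolution, so the kernel has to be flipped to reproduce the cross-correlation that deep learning libraries call convolution.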

Implementation of the FFT on the convolution

The previous node explained how it works; now we need to understand how to write the operation mathematically and, more importantly, how to code it efficiently. Suppose we have the input I with the shape B x H x H x C, where B is the batch size, H is the height and width of the image (a square image, for the sake of the example), and C is the number of channels. One kernel K has a size of k x k x C (also square, for the sake of the example), and b is the bias for that kernel. For a single image of the batch:

$( I * K)_{i,j} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} \sum_{c=0}^{C-1} K_{m,n,c} I_{i + m, j + n, c} + b$

$( I * K)$ is the output matrix for the current kernel, with a size of (H-k+1) x (H-k+1). This is the variant of the convolution with no padding and stride one. Most libraries call it convolution, but it is actually an operation called cross-correlation. There are several ways to code the formula: straightforward brute force with nested loops, turning everything into one big matrix multiplication, or using Fourier transforms. While the brute-force solution is straightforward, the other two are not as easy to visualize and understand.
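As a reference point, the brute-force variant can be sketched as a direct transcription of the formula, one loop per index. The function name `cross_correlate` and the random test data are assumptions for illustration; the input is a single channels-last image, as in the formula.

```python
import numpy as np

def cross_correlate(I, K, b=0.0):
    """Brute-force cross-correlation: no padding, stride one, channels last."""
    H, W, C = I.shape
    k = K.shape[0]                      # square k x k x C kernel
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            total = 0.0
            for m in range(k):
                for n in range(k):
                    for c in range(C):
                        total += K[m, n, c] * I[i + m, j + n, c]
            out[i, j] = total + b       # the bias is added once per output entry
    return out

rng = np.random.default_rng(0)
I = rng.standard_normal((6, 6, 3))
K = rng.standard_normal((3, 3, 3))
result = cross_correlate(I, K, b=0.5)
print(result.shape)  # (4, 4)
```

The three innermost loops are equivalent to `np.sum(I[i:i+k, j:j+k, :] * K)`, which is usually the first optimization to apply.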


University of California, Berkeley

The cross-correlation operation involves mapping a filter matrix (or kernel) to every possible position on an input matrix. At each position, the corresponding elements of the input and the kernel are multiplied together, and the products are summed to calculate a single entry in the resulting output matrix. While a strict mathematical convolution requires flipping the two-dimensional kernel both horizontally and vertically prior to this process, deep learning implementations typically omit this mirroring step and compute the cross-correlation directly. Despite this technical distinction, it is standard terminology in deep learning literature to refer to the cross-correlation operation simply as a convolution.

The Cross-Correlation (Convolution) Operation

A very good source on how Convolutional Neural Networks work mathematically:
https://www.jefkine.com/general/2016/09/05/backpropagation-in-convolutional-neural-networks/


Backpropagation In Convolutional Neural Networks

Mathematical Implementation of Forward Propagation

Here is a visualization of the convolution operation where you can play with the different hyperparameters of a CNN:
https://ezyang.github.io/convolution-visualizer/index.html

Convolution Visualizer

On the left, we have a 6 x 6 grayscale image, so we represent it using a 6 x 6 x 1 matrix because there are no separate RGB channels. To detect vertical edges, we construct a 3 x 3 matrix, called filter or kernel, shown in the middle. We convolve the 6 x 6 matrix with the 3 x 3 filter. The output of this convolution operator will be a 4 x 4 matrix.

To compute the upper-left element of this 4 x 4 matrix, we take the 3 x 3 filter, paste it on top of the upper-left 3 x 3 region of the original input image, and sum up all 9 element-wise products. So, the result would be 3 x 1 + 1 x 1 + 2 x 1 + 0 x 0 + 5 x 0 + 7 x 0 + 1 x (-1) + 8 x (-1) + 2 x (-1) = -5. Next, to compute the second element, we shift the pasted filter matrix one step to the right on the 6 x 6 image, do the same element-wise products, and then add them up, which will give us -4, and so on.

To compute the next rows, we shift the pasted filter matrix one step down on the 6 x 6 image. Then, we repeat the same element-wise products, and additions for the second row, and so on.
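The arithmetic for the upper-left entry can be reproduced in a few lines. The 3 x 3 patch below is read off the products listed above (columns 3, 1, 2 / 0, 5, 7 / 1, 8, 2), and the vertical edge filter has columns of 1, 0 and -1; the variable names are illustrative.

```python
import numpy as np

# Upper-left 3 x 3 region of the input image, taken from the worked example
patch = np.array([[3, 0, 1],
                  [1, 5, 8],
                  [2, 7, 2]])

# Vertical edge detection filter
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

# Element-wise products, then one sum
top_left = int(np.sum(patch * kernel))
print(top_left)  # -5
```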

Calculating Cross-Correlation (Convolution) Operation Example

In a standard cross-correlation operation, the convolution window slides over the input tensor one element at a time. However, to increase computational efficiency or to downsample the representation, we can traverse multiple elements per slide, skipping intermediate locations. The number of rows and columns traversed per slide is called the stride. Strided convolution is particularly useful when the convolution kernel is large, as it efficiently captures a broad area of the underlying image.
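A minimal sketch of a strided cross-correlation, assuming a single-channel input, no padding, and the same stride in both directions. The function name `corr2d_strided` and the test data are illustrative.

```python
import numpy as np

def corr2d_strided(x, k, stride=1):
    """2D cross-correlation that moves the window `stride` elements per step."""
    n_h, n_w = x.shape
    k_h, k_w = k.shape
    out_h = (n_h - k_h) // stride + 1
    out_w = (n_w - k_w) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride   # top-left corner of the window
            out[i, j] = np.sum(x[r:r + k_h, c:c + k_w] * k)
    return out

x = np.arange(36, dtype=float).reshape(6, 6)
k = np.ones((2, 2))

dense = corr2d_strided(x, k, stride=1)      # shape (5, 5)
strided = corr2d_strided(x, k, stride=2)    # shape (3, 3)
print(dense.shape, strided.shape)
```

A stride of 2 gives the same values as keeping every second row and column of the stride-1 output, which is why striding acts as downsampling while skipping the intermediate computations.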

Strided Convolution

A 2D convolution layer performs an entry-by-entry multiplication between the input and the filter, where the input and the filter are 2D matrices.

In a 3D convolution layer, the same operation is performed on 3D inputs and the filters, such that:
- We multiply corresponding entries in the input and filter cubes and add up the results to determine the output entry.
- In addition to shifting (striding) the filter on two dimensions of the input matrix to perform the convolution operation, we also shift it in the third dimension.

For example, if we have a 4 x 4 x 4 input and a 3 x 3 x 3 filter, because we can shift the filter for only one entry per dimension, the output will be 2 x 2 x 2.
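The 4 x 4 x 4 example can be sketched with three nested loops, one per dimension in which the filter is shifted. The function name `corr3d` and the all-ones test data are assumptions chosen so that the result is easy to check by hand.

```python
import numpy as np

def corr3d(x, k):
    """3D cross-correlation: stride one, no padding."""
    d, h, w = x.shape
    k_d, k_h, k_w = k.shape
    out = np.zeros((d - k_d + 1, h - k_h + 1, w - k_w + 1))
    for a in range(out.shape[0]):          # shift along the third dimension
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[a, i, j] = np.sum(x[a:a + k_d, i:i + k_h, j:j + k_w] * k)
    return out

x = np.ones((4, 4, 4))
k = np.ones((3, 3, 3))
out = corr3d(x, k)
print(out.shape)  # (2, 2, 2)
```

With all-ones input and filter, every output entry is the number of filter elements, 27.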

3D Convolution Layers

For example, if we have a 3 x 3 x 3 RGB filter, to perform the convolution operation, we start with placing the filter in the upper left most position of the image. We can consider this 3 x 3 x 3 filter as a cube with 27 parameters, where the first dimension is called "width," the second is called height, and the third indicates the number of channels. We multiply each of these 27 numbers with the corresponding numbers from the red, green, and blue channels of the image and sum up the results to calculate the corresponding output number. Then to compute the next output, we take this cube and slide it over by one, do the 27 multiplications, and add up the results to calculate the next output number, and so on. This way, convolving a 6 x 6 x 3 image with a 3 x 3 x 3 filter will give us a 4 x 4 output.
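One possible vectorized sketch of this computation uses `sliding_window_view` (available in NumPy 1.20 and later) to extract every 3 x 3 x 3 cube and `einsum` to do the 27 multiplications and the sum for each of them. The random image and filter are illustrative data, stored channels-last.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6, 3))   # height, width, RGB channels
filt = rng.standard_normal((3, 3, 3))    # one 3 x 3 x 3 filter, 27 parameters

# All 3 x 3 x 3 cubes of the image: shape (4, 4, 1, 3, 3, 3)
windows = sliding_window_view(image, filt.shape)

# Multiply each cube by the filter and sum the 27 products
out = np.einsum('ijzmnc,mnc->ij', windows, filt)
print(out.shape)  # (4, 4)
```

Because the filter spans all three channels, it cannot slide along the channel axis, which is why the output is a flat 4 x 4 matrix.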

Convolutions Over Volumes

A convolutional layer performs a cross-correlation operation between its input and a kernel, and subsequently adds a scalar bias to the result to generate an output. During model training, the convolutional layer learns two key parameters: the kernel (weights) and the scalar bias. Prior to training, these kernel weights are generally initialized with random values, analogous to the initialization process in fully connected layers.

Convolutional Layer

To execute a two-dimensional cross-correlation, the convolution window starts at the upper-left of the input tensor and systematically translates across it, moving rightwards and downwards. At each valid position, the kernel tensor undergoes an elementwise multiplication with the currently overlaid input subtensor. The sum of these individual products produces a single scalar, which becomes the corresponding element in the resulting output tensor.

Two-Dimensional Convolution Operation Procedure

When applying a two-dimensional cross-correlation without padding, the resulting output tensor is dimensionally smaller than the input. Since the kernel must fit completely within the boundaries of the input tensor, an input of size $$n_\textrm{h} \times n_\textrm{w}$$ convolved with a kernel of size $$k_\textrm{h} \times k_\textrm{w}$$ will yield an output shape calculated by the formula: $$(n_\textrm{h}-k_\textrm{h}+1) \times (n_\textrm{w}-k_\textrm{w}+1)$$.
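The formula can be checked by counting the positions at which the kernel fits completely inside the input. Both helper names below are hypothetical.

```python
def conv_output_shape(n_h, n_w, k_h, k_w):
    """Output shape of a cross-correlation with no padding and stride one."""
    return (n_h - k_h + 1, n_w - k_w + 1)

def count_valid_positions(n, k):
    # Top-left coordinates for which a window of length k stays inside n
    return len([i for i in range(n) if i + k <= n])

shape = conv_output_shape(6, 6, 3, 3)
print(shape)  # (4, 4)
```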

Computation of Convolution Output Size

A convolution kernel can act as a finite difference operator to locate pixel changes for edge detection. For example, a $$1 \times 2$$ kernel like $$[1, -1]$$ computes the difference between horizontally adjacent pixels, $$x_{i,j} - x_{(i+1),j}$$. This cross-correlation operation serves as a discrete approximation of the first derivative in the horizontal direction. Mathematically, for an image function $$f(i,j)$$, its derivative is $$-\partial_i f(i,j) = \lim_{\epsilon \to 0} \frac{f(i,j) - f(i+\epsilon,j)}{\epsilon}$$. By applying this kernel, the output is zero where adjacent pixels are identical and non-zero at boundaries, effectively detecting edges.
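A small sketch of this edge detector, assuming an illustrative 6 x 8 image that is white (1) on both sides and black (0) in the middle. The helper `corr2d` and the image are assumptions for illustration; in NumPy indexing the kernel computes the difference between each pixel and its right-hand neighbor in the same row.

```python
import numpy as np

def corr2d(x, k):
    """2D cross-correlation: stride one, no padding."""
    k_h, k_w = k.shape
    out = np.zeros((x.shape[0] - k_h + 1, x.shape[1] - k_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k_h, j:j + k_w] * k)
    return out

image = np.ones((6, 8))
image[:, 2:6] = 0                  # black band in columns 2 to 5

kernel = np.array([[1.0, -1.0]])   # 1 x 2 finite difference kernel
edges = corr2d(image, kernel)
print(edges[0])
```

The output is 1 at the white-to-black boundary, -1 at the black-to-white boundary, and 0 wherever neighboring pixels are identical.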

Convolution Kernel as a Finite Difference Operator

As an example of padding in a convolutional layer, consider a $$3 \times 3$$ input tensor padded with zero values to increase its size to $$5 \times 5$$. If we apply a two-dimensional cross-correlation using a $$2 \times 2$$ kernel on this padded input, the resulting output will be a $$4 \times 4$$ matrix. The added zero pixels do not change the sum of the products when the convolution window covers them.
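This example can be sketched with `np.pad`. The concrete input and kernel values below are illustrative assumptions, since the text only gives the shapes.

```python
import numpy as np

def corr2d(x, k):
    """2D cross-correlation: stride one, no padding."""
    k_h, k_w = k.shape
    out = np.zeros((x.shape[0] - k_h + 1, x.shape[1] - k_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k_h, j:j + k_w] * k)
    return out

x = np.arange(9, dtype=float).reshape(3, 3)   # illustrative 3 x 3 input
k = np.array([[0.0, 1.0], [2.0, 3.0]])        # illustrative 2 x 2 kernel

padded = np.pad(x, 1)                 # one ring of zeros: 5 x 5
out_padded = corr2d(padded, k)        # 4 x 4
out_plain = corr2d(x, k)              # 2 x 2, [[19, 25], [37, 43]]
print(padded.shape, out_padded.shape)
```

The central 2 x 2 block of the padded output equals the unpadded result, because in those positions the window does not touch any of the added zeros.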

Example of Two-Dimensional Cross-Correlation with Padding

In convolutional operations, there are standard baseline configurations that are assumed unless specified otherwise. By default, the padding is set to $$0$$ (meaning no extra border pixels are added) and the stride is set to $$1$$ (meaning the convolution window slides exactly one element at a time).

Default Padding and Stride Values

When the input data contains multiple channels, denoted as $$c_\textrm{i}$$, the convolution kernel must have the same number of input channels to perform cross-correlation. If the kernel's two-dimensional spatial window shape is $$k_\textrm{h} \times k_\textrm{w}$$, a kernel tensor of shape $$k_\textrm{h} \times k_\textrm{w}$$ is required for every input channel. Concatenating these $$c_\textrm{i}$$ tensors yields a convolution kernel with an overall shape of $$c_\textrm{i} \times k_\textrm{h} \times k_\textrm{w}$$.
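A minimal sketch of the multi-channel case: each input channel is cross-correlated with its own two-dimensional kernel slice and the per-channel results are summed. The helper names and the two-channel example values are assumptions for illustration.

```python
import numpy as np

def corr2d(x, k):
    """2D cross-correlation: stride one, no padding."""
    k_h, k_w = k.shape
    out = np.zeros((x.shape[0] - k_h + 1, x.shape[1] - k_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k_h, j:j + k_w] * k)
    return out

def corr2d_multi_in(x, k):
    # x has shape (c_i, n_h, n_w); k has shape (c_i, k_h, k_w)
    return sum(corr2d(x_c, k_c) for x_c, k_c in zip(x, k))

x = np.stack([np.arange(9.0).reshape(3, 3),
              np.arange(1.0, 10.0).reshape(3, 3)])   # c_i = 2
k = np.array([[[0.0, 1.0], [2.0, 3.0]],
              [[1.0, 2.0], [3.0, 4.0]]])             # shape (2, 2, 2)

out = corr2d_multi_in(x, k)
print(out)  # [[ 56.  72.] [104. 120.]]
```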

Multi-Channel Convolution Kernel Structure

In contrast to a regular convolution that reduces input elements via a kernel, the transposed convolution broadcasts input elements via the kernel, producing an output that is generally larger than the input. For a basic transposed convolution with a stride of $$1$$ and no padding, an $$n_h \times n_w$$ input tensor and a $$k_h \times k_w$$ kernel are used to produce an $$(n_h + k_h - 1) \times (n_w + k_w - 1)$$ output tensor. For each element in the input tensor, it is multiplied by the entire kernel to produce a $$k_h \times k_w$$ tensor. This resulting tensor replaces a corresponding portion of an intermediate tensor initialized with zeros, based on the element's original position. The final output is formed by summing all these intermediate results. For example, if both the input tensor and the kernel are $$2 \times 2$$ matrices with values $$[[0, 1], [2, 3]]$$, the basic transposed convolution yields a $$3 \times 3$$ output tensor with values $$[[0, 0, 1], [0, 4, 6], [4, 12, 9]]$$.
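The procedure described above can be sketched directly: each input element scales the whole kernel, and the scaled kernel is accumulated into the output at the element's position. The function name `trans_conv` is an assumption; the input and kernel values are the ones from the example.

```python
import numpy as np

def trans_conv(x, k):
    """Basic transposed convolution: stride one, no padding."""
    n_h, n_w = x.shape
    k_h, k_w = k.shape
    out = np.zeros((n_h + k_h - 1, n_w + k_w - 1))
    for i in range(n_h):
        for j in range(n_w):
            # Accumulating in place is the same as summing the
            # zero-initialized intermediate tensors
            out[i:i + k_h, j:j + k_w] += x[i, j] * k
    return out

x = np.array([[0.0, 1.0], [2.0, 3.0]])
k = np.array([[0.0, 1.0], [2.0, 3.0]])
out = trans_conv(x, k)
print(out)  # [[ 0.  0.  1.] [ 0.  4.  6.] [ 4. 12.  9.]]
```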
