Convolutional neural network

A convolutional neural network (CNN) is a type of neural network that uses “convolution” to detect features in an image

images are represented as a 2dimensional array of pixels

pixels are represented as an array of “channels”, with each channel often corresponding to R, G, or B (red, green, blue)

convolution is the process of sliding (translating) a “filter” over the image and performing dot multiplication between the filter and each portion of the image that it covers in order to derive a new version of the image
 the “filter” is a smaller 2dimensional array with “weights”, meaning that it contains higher values in certain sections that correspond to various features
 the “weights” will ensure that dotmultiplication with the original image will accentuate the features of that image that match the shape of the weights in the filter
 the better the weights fit the image, the higher the values of the resulting array will be
 “stride” is the rate at which the filter moves across the image
 a stride of 2 means that the filter skips a column/row of pixels each time it moves across/down the image
 a larger stride will result in a smaller resulting image, since the dot product of each convolution is what composes the new array
 to calculate the resulting image dimensions
result = floor((input_image_dimensions  filter_dimensions) / stride) + 1
 to calculate the resulting image dimensions
 a problem:
 the left and right edges of the arrays will get “less attention” from the convolution because they will be multiplied fewer times
 to solve this, pad the left and right edges of the image with zeroes
 padding which results in an “output” array of the same dimensions as the original image is called “same padding”
 the left and right edges of the arrays will get “less attention” from the convolution because they will be multiplied fewer times
 the filter should have the same “depth” as the input image, meaning that the filter should have “pixels” with the same number of channels as the original image
 the “output array” of a convolution layer is known as an “activation map”
 the “filter” is a smaller 2dimensional array with “weights”, meaning that it contains higher values in certain sections that correspond to various features

in addition to the convolution layer, there is another type of layer called a “pooling layer”
 the purpose of the pooling layer is to reduce the size of the image, effectively reducing the number of trainable parameters
 a common form of pooling layer is “max pooling”, which simply takes the maximum value from squares of a certain size within the original array and creates a new 2D array with those maximum values
 the result is a lower resolution form of the same array
e.g. original:  40 3  1 12   21 9  14 0        9 71  20 5   3 2  10 9  stride = 2, pooling size = 2 result:  40  14        71  20 
 the result is a lower resolution form of the same array

CNNs are composed of various convolution and pooling layers
 We pass an input image to the first convolution layer. Output is an activation map. Filters extract relevant features which are accentuated in the activation map.
 Each filter should have a different feature to aid correct class prediction. If size needs to be retained, use same padding. Otherwise, valid padding helps reduce the number of parameters.
2a. Pooling helps with this too.  Several convolution and pooling layers help to make prediction more precise. As we go deeper, more specific features are extracted. Shallow, more generic.
 Output layer is “fully connected”, meaning that the layer is flattened into an output that represents the prediction classes.
 Output layers are compared and an error (loss) gradient (curve) is generated.
 Error gradient is then backpropagated across the layers in order to recalculate the weights and bias values accordingly. This effectively “shows the network how bad it did” so that it can do better next time.
 One forward and one backward pass completes one training cycle.
 Once the weights are tuned, the model can be used for more accurate prediction.