Convolutional neural network

Keywords: machine learning, linear algebra, calculus, neural networks, object detection, computer vision
  • A convolutional neural network (CNN) is a type of neural network that uses "convolution" to detect features in an image

  • images are represented as a 2-dimensional array of pixels

  • pixels are represented as an array of "channels", with each channel often corresponding to R, G, or B (red, green, blue)

  • convolution is the process of sliding (translating) a "filter" over the image and, at each position, multiplying the filter element-wise with the portion of the image it covers and summing the products (a dot product); each sum becomes one value in a new version of the image

    • the "filter" (also called a kernel) is a smaller 2-dimensional array of "weights", with higher values in the positions that correspond to the feature the filter is meant to detect
      • the "weights" ensure that the dot product with a patch of the original image is large where that patch matches the pattern of the weights in the filter, which accentuates the feature
      • the better the weights fit a patch of the image, the higher the corresponding value in the resulting array will be
      • "stride" is the rate at which the filter moves across the image
        • a stride of 2 means that the filter moves 2 pixels at a time, skipping every other column/row position as it moves across/down the image
        • a larger stride will result in a smaller resulting image, since each filter position contributes exactly one value (one dot product) to the new array
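          • to calculate the resulting image dimensions (applied separately to height and width, with no padding)
            result = floor((input_image_dimensions - filter_dimensions) / stride) + 1
          • with "padding" pixels added to each side, input_image_dimensions becomes input_image_dimensions + 2 * padding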
      • a problem:
        • the pixels at the edges of the image will get "less attention" from the convolution because fewer filter positions cover them
          • to solve this, pad all four edges of the image with zeroes
          • padding which results in an "output" array of the same dimensions as the original image (at a stride of 1) is called "same padding"; using no padding at all is called "valid padding"
      • the filter should have the same "depth" as the input image, meaning that the filter should have "pixels" with the same number of channels as the original image
      • the "output array" of a convolution layer is known as an "activation map"
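  • a minimal sketch of the convolution, stride, and padding notes above, written in plain Python with nested lists (an assumption made for readability; real code would normally use an array library). The names `zero_pad` and `convolve` are illustrative and not taken from any library, and the all-ones image and filter are made-up data

```python
def zero_pad(image, p):
    """Surround an H x W x C image (nested lists) with p zero pixels on all four edges."""
    channels = len(image[0][0])
    width = len(image[0]) + 2 * p

    def zeros(n):
        return [[0.0] * channels for _ in range(n)]

    padded = [zeros(width) for _ in range(p)]           # top rows
    for row in image:
        padded.append(zeros(p) + [list(px) for px in row] + zeros(p))
    padded += [zeros(width) for _ in range(p)]          # bottom rows
    return padded


def convolve(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image` and return the 2D activation map."""
    if padding:
        image = zero_pad(image, padding)
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    # result = floor((input - filter) / stride) + 1, per axis
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    pixel = image[i * stride + di][j * stride + dj]
                    weights = kernel[di][dj]
                    # the filter has the same depth (channel count) as the image
                    total += sum(x * wt for x, wt in zip(pixel, weights))
            row.append(total)
        out.append(row)
    return out


# Hypothetical 6 x 6 RGB image and 3 x 3 x 3 filter, all ones.
image = [[[1.0] * 3 for _ in range(6)] for _ in range(6)]
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]

valid = convolve(image, kernel)               # no padding: 4 x 4
strided = convolve(image, kernel, stride=2)   # stride 2:   2 x 2
same = convolve(image, kernel, padding=1)     # padding 1:  6 x 6
```

  • here the unpadded output is 4 x 4 and the stride-2 output is 2 x 2, matching floor((6 - 3) / stride) + 1, while a padding of 1 keeps the output at 6 x 6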
  • in addition to the convolution layer, there is another type of layer called a "pooling layer"

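    • the purpose of the pooling layer is to reduce the size of the activation map, which reduces computation and the number of trainable parameters needed in the later fully connected layers (the pooling layer itself has no trainable weights)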
    • a common form of pooling layer is "max pooling", which simply takes the maximum value from squares of a certain size within the original array and creates a new 2D array with those maximum values
      • the result is a lower resolution form of the same array
          | 40 3 | 1 12 |
          | 21 9 | 14 0 |
          | ---- | ---- |
          | 9 71 | 20 5 |
          | 3 2  | 10 9 |
        stride = 2, pooling size = 2
          | 40 | 14 |
          | -- | -- |
          | 71 | 20 |
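  • the max pooling example above can be reproduced with a short sketch in plain Python; the function name `max_pool` is illustrative and not from any library

```python
def max_pool(grid, size=2, stride=2):
    """Return a smaller 2D list holding the maximum of each size x size window."""
    out_h = (len(grid) - size) // stride + 1
    out_w = (len(grid[0]) - size) // stride + 1
    return [
        [
            max(
                grid[i * stride + di][j * stride + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# The 4 x 4 array from the example above.
grid = [
    [40, 3, 1, 12],
    [21, 9, 14, 0],
    [9, 71, 20, 5],
    [3, 2, 10, 9],
]

pooled = max_pool(grid, size=2, stride=2)
print(pooled)  # [[40, 14], [71, 20]]
```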
  • CNNs are composed of various convolution and pooling layers

    1. We pass an input image to the first convolution layer. Output is an activation map. Filters extract relevant features which are accentuated in the activation map.
    2. Each filter should detect a different feature to aid correct class prediction. If size needs to be retained, use same padding. Otherwise, valid padding (no padding) shrinks the output, which helps reduce the number of parameters. Pooling helps with this too.
    3. Several convolution and pooling layers help to make prediction more precise. Shallow layers extract more generic features; as we go deeper, more specific features are extracted.
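    4. The output layer is "fully connected", meaning that the final activation maps are flattened into a single vector and every value in it is connected to every output, with one output per prediction class.
    5. The output is compared against the expected (true) class using a loss function, and the gradient of that error (loss) with respect to the weights is computed.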
    6. The error gradient is then backpropagated across the layers in order to update the weights and bias values accordingly. This effectively "shows the network how badly it did" so that it can do better next time.
    7. One forward pass and one backward pass together complete one training cycle.
    8. Once the weights are tuned, the model can be used for more accurate prediction.
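
  • a minimal sketch of steps 4 to 7, under several assumptions not in the notes: only a single fully connected layer is trained (backpropagation through the convolution layers is left out), the output uses softmax with a cross-entropy loss, and the weights are updated by plain gradient descent. The names `forward` and `train_step`, the learning rate, and the feature values are all illustrative

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(features, weights, biases):
    """Fully connected layer plus softmax: one probability per class."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


def train_step(features, label, weights, biases, lr=0.1):
    """One forward and one backward pass; updates weights in place, returns the loss."""
    probs = forward(features, weights, biases)
    loss = -math.log(probs[label])        # cross-entropy against the true class
    for k in range(len(weights)):
        # For softmax + cross-entropy, d(loss)/d(score_k) = prob_k - 1 if k is the label
        grad = probs[k] - (1.0 if k == label else 0.0)
        for i, x in enumerate(features):
            weights[k][i] -= lr * grad * x
        biases[k] -= lr * grad
    return loss


# Hypothetical flattened activation map with 4 values, and 3 prediction classes.
features = [0.5, -1.0, 2.0, 0.0]
label = 1
weights = [[0.0] * 4 for _ in range(3)]
biases = [0.0] * 3

losses = [train_step(features, label, weights, biases) for _ in range(20)]
probs = forward(features, weights, biases)
```

  • each call to `train_step` is one training cycle; on this single example the loss shrinks from cycle to cycle and the probability assigned to the true class grows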