Hello and welcome! Let's talk about Convolutional Neural Networks, a specialized kind
of neural network that has been very successful particularly at computer vision tasks, such
as recognizing objects, scenes, and faces, among many other applications.
First, let's take a step back and talk about what convolution is. Convolution is a mathematical
operation that combines two signals and is usually denoted with an asterisk. Let's say
we have a time series signal a, and we want to convolve it with an array of 3 elements.
What we do is simple: we multiply the overlapping elements, sum the products, and shift the
second array. Then we do the same thing again for the other positions, moving the second
array over the first one like a sliding window.
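To make the sliding-window idea concrete, here is a minimal NumPy sketch of the operation just described; the signal and kernel values are made up for illustration.

```python
import numpy as np

a = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 2.0, 1.0])  # example signal
b = np.array([0.25, 0.5, 0.25])                          # 3-element kernel

# Slide b over a: multiply the overlapping elements, sum, then shift by one.
out = np.array([np.sum(a[i:i + len(b)] * b)
                for i in range(len(a) - len(b) + 1)])
print(out)  # 6 output values for an 8-element signal and a 3-element kernel
```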
Technically, what we do here is cross-correlation rather than convolution. Mathematically speaking,
the second signal needs to be flipped in order for this operation to be considered a convolution.
But in the context of neural networks, the terms convolution and cross-correlation are
used pretty much interchangeably. That's a little off topic, but you might ask why anyone
would want to flip one of the inputs. One reason is that doing so makes the convolution
operation commutative. When you flip the second signal, a * b becomes equal to b * a. This
property isn't really useful in neural networks, so there is no need to flip any of the inputs.
In digital signal processing, this operation is also called filtering a signal a with a
kernel b, which is also called a filter. As you may have noticed, this particular kernel
computes a local average of the values within a window. If we plot this signal
and the result when it's convolved with this averaging filter, we can see that the result
is basically a smoothed version of the input.
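As a rough illustration of that smoothing effect, here is a short sketch using NumPy's built-in routine on a synthetic noisy signal; the exact values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
noisy = np.sin(np.linspace(0, 4 * np.pi, 100)) + 0.3 * rng.standard_normal(100)

# A 5-tap averaging kernel. np.convolve flips the kernel internally, but a
# symmetric averaging kernel is unchanged by the flip.
smoothed = np.convolve(noisy, np.ones(5) / 5, mode='same')
```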
We can easily extend this operation to two dimensions. Let's convolve this 8x8 image
with this 3x3 filter for example. Just like the previous example, we overlay the kernel
on the image, multiply the elements, sum the products, and move to the next tile. This
specific kernel is actually an edge detector that detects the edges in one direction. It
has a weak response over the smooth areas in an image, and a strong response to the
edges. If we apply the same kernel to a larger grayscale image like this one, the output
image would look like this where the vertical edges are highlighted. If we transpose the
kernel, then it detects the horizontal edges.
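Here is a small sketch of that 2D operation with a Sobel-style vertical-edge kernel; the 8x8 image is just a toy array with a dark left half and a bright right half.

```python
import numpy as np

image = np.zeros((8, 8))
image[:, 4:] = 1.0                      # dark left half, bright right half

kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])         # responds strongly to vertical edges

h, w = image.shape
kh, kw = kernel.shape
out = np.zeros((h - kh + 1, w - kw + 1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)

print(out)  # strong responses where the window straddles the dark/bright step
```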
The filter in the previous example smoothed its input, whereas the filter in this example
does the opposite and makes local changes, such as edges, more pronounced. The idea
is that kernels can be used to extract certain features from input signals.
The input doesn't have to be a grayscale image. It can be an RGB color image for example,
and we can learn 3-dimensional filters to extract features from these inputs. The inputs
don't even have to be images. They can be any type of data that has a grid-like structure,
such as audio signals, video, and even electroencephalogram signals. Both the inputs and the filters can
be n-dimensional.
There's a lot that can be said about convolutions and filter design. But since the focus of
this video is not digital signal processing, I think this is enough background to understand
what happens inside a convolutional neural network.
In the earlier examples, we convolved the input signals with kernels having hardcoded
parameters. What if we could learn these parameters from data and let the model discover what
kind of feature extractors would be useful to accomplish a task? Let's talk about that
now.
Let's say we have an 8x8 input image. In a traditional neural network, each one of the
hidden units would be connected to all pixels in the input. Now imagine if this were a 300x300
RGB image. Then we would have 300 × 300 × 3 = 270,000 weights for a single neuron. That's a lot
of connections. If we built a model that had many fully connected units at every layer
like this, the model would be big, slow, and prone to overfitting.
One thing we can do here is to connect each neuron to only a local region of the input
volume. Next, we can make an assumption that if one feature is useful in one part of the
input it's likely that it would be useful in the other parts too. Therefore, we can
share the same weights across the input.
Look familiar? Yes, what this unit does here is basically convolution.
A layer that consists of convolutional units like these is called a convolutional layer.
Convolutional networks, also called ConvNets and CNNs, are simply neural networks that
use convolutional layers rather than only fully connected layers.
The parameters learned by each unit in a convolutional layer can be thought of as a filter. The outputs
of these units are simply the filtered versions of their inputs. Passing these outputs through
an activation function, such as a ReLU, gives us the activations at these units, each one
of which responds to one kind of feature.
Compared to traditional fully connected layers, convolutional layers have fewer parameters,
and the same parameters are reused in more than one place. This makes the model more
efficient, both statistically and computationally.
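To put rough numbers on that efficiency claim, here is a quick comparison sketched with PyTorch; the layer sizes mirror the 300x300 RGB example above and are otherwise illustrative.

```python
import torch.nn as nn

# A fully connected layer mapping a flattened 300x300 RGB image to 64 units.
fc = nn.Linear(300 * 300 * 3, 64)

# A convolutional layer producing 64 feature maps with 3x3 kernels.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc))    # 17,280,064 weights and biases
print(count(conv))  # 1,792 -- shared across all spatial positions
```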
Although convolutional layers are visualized as running sliding windows over the inputs
and multiplying the elements, they aren't usually implemented that way. As compared
to for loops, matrix multiplications are faster and scale better. So instead of sliding a
window using for loops, many libraries implement convolution as a matrix multiplication.
Let's assume that we have an RGB image as input and have four 3x3x3 kernels. We can
reshape these kernels into 1x27 arrays each. Together, they would make a 4x27 matrix, where
each row represents a single kernel. Similarly, we can divide the input into image blocks
that are the same size as the kernels and rearrange these blocks into columns. This
would produce a 27xN matrix, where N is the number of blocks. By multiplying the matrices,
we can compute all these convolutions at once. Each row in the resulting 4xN matrix gives
us one filter's output when reshaped back to the spatial dimensions of the output feature map.
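A minimal im2col-style sketch of this idea in NumPy, assuming a 3-channel 8x8 input and four 3x3x3 kernels as in the example; real libraries use heavily optimized variants of the same trick.

```python
import numpy as np

H = W = 8
x = np.random.rand(3, H, W)           # RGB input, channels first
kernels = np.random.rand(4, 3, 3, 3)  # four 3x3x3 kernels

K = kernels.reshape(4, 27)            # each row is one flattened kernel (4x27)

# Rearrange every 3x3x3 input block into a column (27xN, N = number of blocks).
cols = []
for i in range(H - 2):
    for j in range(W - 2):
        cols.append(x[:, i:i + 3, j:j + 3].reshape(27))
cols = np.stack(cols, axis=1)         # shape (27, 36)

out = K @ cols                        # (4, 36): all four convolutions at once
out = out.reshape(4, H - 2, W - 2)    # each row becomes a 6x6 feature map
```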
Another type of layer that is commonly used in convolutional neural nets is the pooling
layer. A pooling layer downsamples its input by summarizing it locally. Max pooling,
for example, subsamples its input by picking the maximum value within a neighborhood.
Alternatively, average pooling takes the average.
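A quick NumPy sketch of 2x2 max pooling on a single feature map, assuming the spatial dimensions divide evenly by the pooling size.

```python
import numpy as np

feature_map = np.random.rand(8, 8)

# 2x2 max pooling with stride 2: group pixels into 2x2 tiles and take the max.
pooled = feature_map.reshape(4, 2, 4, 2).max(axis=(1, 3))
print(pooled.shape)  # (4, 4)

# Average pooling is the same idea with .mean() instead of .max().
```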
In many cases, we care about whether some features exist in the input regardless of their exact
position. Pooling layers make this easier by making the outputs invariant to small translations
in the input: even if the input is off by a few pixels, the local maxima still
make it to the next layers. Another obvious advantage of pooling is that it reduces
the size of the activations that are fed to the next layer, which reduces the memory footprint
and improves the overall computational efficiency.
A typical convolutional neural network stacks convolutional and pooling layers on
top of each other and sometimes uses traditional fully connected layers at the end of the network.
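As a sketch of that typical layout, here is a small, hypothetical PyTorch model; the layer widths, the 32x32 input assumption, and the 10-class output are arbitrary choices for illustration.

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                  # 32x32 -> 16x16 for a 32x32 input
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                  # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),        # fully connected classifier head
)
```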
An interesting property of convolutional neural networks is that they learn to extract features.
Early convolutional layers, for example, learn primitive features such as oriented edges.
After training a model, the filters in the first layer usually look like Gabor-like filters,
edge detectors, and color-contrast sensitive filters.
As we move towards the output layer, the features become more complex and neurons start to respond
to more abstract, more specific concepts. We can observe neurons that respond to cat
faces, human faces, printed text, and so on.
The dots you see in the activations of this convolutional layer can be a result of neurons
that respond to cats, pets, or animals in general. One of them, for example, can be
a neuron that activates only if there is a cat in the input picture. The following layers
make use of this information to produce an output such as a class label with some probability.
An interesting thing is that the concepts learned by the intermediate layers don't
have to be a part of our target classes. For example, a scene classifier can learn a neuron
that responds to printed text even if that's not one of the target scene types. The model
can learn such units if they help detect books and classify a scene as a library.
This is somewhat similar to how visual information is processed in the primary visual cortex
in the brain, which consists of many simple and complex cells. The simple cells respond
primarily to edges and bars at particular orientations, similar to early convolutional
layers.
The complex cells receive inputs from simple cells and respond to similar features but
have a higher degree of spatial invariance, somewhat like the convolutional layers after
the pooling layers. As the signal moves deeper into the brain, it's postulated that it might
reach specialized neurons that fire selectively to specific concepts such as faces and hands.
An advantage of using pooling layers in our network is that they increase the receptive
field of the subsequent units, helping them see a bigger picture. The term receptive field
comes from neuroscience and refers to a particular region that can affect the response of a neuron.
Similarly, the receptive field of an artificial neuron refers to the spatial extent of its
connectivity. For example, the convolutional unit in the earlier example had a receptive
field of 3x3. Units in the deeper layers have a greater receptive field since they indirectly
have access to a larger portion of the input. Let's look at another example, and for
simplicity, let's assume both the input and the filter are one-dimensional. This unit has
access to three pixels at a time. If we add a pooling layer followed by another convolutional
layer on top of that, a single unit at the end of the network gains access to all 8 pixels
in the input.
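A small helper that tracks how the receptive field grows layer by layer, using the standard recurrence where each layer adds (kernel size - 1) times the product of all earlier strides; the layer list below mirrors the 1D example.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, input to output."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) * jump
        jump *= s              # strides compound the step between units
    return rf

# conv(3, stride 1) -> pool(2, stride 2) -> conv(3, stride 1), as in the 1D example
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8
```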
Of course, pooling is not the only factor that increases the receptive field. The size
of the kernel obviously has an impact. A larger kernel would mean that a neuron sees a larger
portion of its input.
A larger receptive field can also be achieved by stacking convolutions. In fact, it is usually
preferable to use smaller kernels stacked on top of one another rather than a single larger
kernel, since doing so usually reduces the number of parameters and increases non-linearity
when a non-linear activation function is used at the output of each unit. For example, a
stack of two 3x3 convolutions would have the same receptive field as a single 5x5 convolution,
while having fewer mathematical operations and more non-linearities.
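A quick back-of-the-envelope check of that parameter claim, assuming C input channels, C output channels, and ignoring biases.

```python
C = 64
two_3x3 = 2 * (3 * 3 * C * C)   # 73,728 weights
one_5x5 = 5 * 5 * C * C         # 102,400 weights
print(two_3x3, one_5x5)          # same 5x5 receptive field, ~28% fewer weights
```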
One thing to pay attention to when stacking convolutional layers is how the size of the input volume
changes before and after a layer. Without any padding, the spatial dimensions of the
input shrink by one pixel less than the kernel dimensions. For example, if we have an 8x8
input and a 3x3 kernel, the output of the convolution would be 6x6. Many frameworks call this type
of convolution a 'valid' convolution, or a convolution with valid padding. Valid convolution
can cause problems: especially if we use larger kernels or stack many layers on
top of each other, the amount of information that gets thrown out at the borders might be critical.
There is an easy hack that helps improve the performance by keeping information at the
borders. What it does is pad the input with zeros so that the spatial dimensions
of the input are preserved after the convolutions. This type of zero padding is called 'SAME'
padding by many frameworks. Zero padding is commonly used and works fine in practice, although
it's not ideal from a digital signal processing perspective since it creates artificial discontinuities
at the borders.
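A sketch of both padding options using PyTorch; the single-channel 8x8 input is just an example.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)                        # batch, channels, height, width

valid = nn.Conv2d(1, 1, kernel_size=3, padding=0)  # 'valid': no padding
same = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # 'same': one ring of zeros

print(valid(x).shape)  # torch.Size([1, 1, 6, 6]) -- shrinks by kernel size minus one
print(same(x).shape)   # torch.Size([1, 1, 8, 8]) -- spatial size preserved
```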
Another hyperparameter that has an impact on the receptive field is the stride of the
sliding window. So far, we used a stride of one in the examples. This is usually the default
behavior of a convolutional layer. If we set it to two, for example, the sliding window
moves by two pixels instead of one, leading to a larger receptive field. Using a stride
larger than one has a downsampling effect that is similar to pooling layers and some
models use it as an alternative to pooling.
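In general, the output size is floor((n + 2p - k) / s) + 1 for input size n, padding p, kernel size k, and stride s. Here is a stride-2 sketch in PyTorch, with illustrative sizes.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)
strided = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
print(strided(x).shape)  # torch.Size([1, 16, 16, 16]) -- halves the resolution, like pooling
```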
One thing that is sometimes confused with stride is the dilation rate. A dilated convolution,
also known as atrous convolution or à trous convolution, uses filters with holes. Just
like pooling and strided convolutions, dilated convolutions also learn multi-scale features.
But instead of downsampling the activations, dilated convolutions expand the filters without
increasing the number of parameters. This type of convolution can be useful if a task
requires the spatial resolution to be preserved. For example, if we are doing pixel-wise image
segmentation, pooling layers might lead to a loss in detail. Using dilated convolutions
preserves spatial resolution while increasing the receptive field. However, this approach demands
more memory and comes at a computational cost since the activations need to be kept in memory
at full resolution.
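A sketch of a dilated convolution in PyTorch; with dilation 2 and matching padding, a 3x3 kernel spans 5 pixels in each direction while the 64x64 spatial resolution (an arbitrary example size) is preserved.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)

# dilation=2 inserts one 'hole' between kernel taps: a 3x3 kernel spans 5x5 pixels
# but still has only 9 weights per channel. padding=2 keeps the 64x64 resolution.
dilated = nn.Conv2d(16, 16, kernel_size=3, dilation=2, padding=2)
print(dilated(x).shape)  # torch.Size([1, 16, 64, 64])
```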
In this video, we talked about the building blocks of convolutional neural networks. We
also covered what some of the hyperparameters in convolutional networks are and what they
do.
In the next video, we will talk about how to choose these hyperparameters and how to
design our own convolutional neural network. We will also cover some of the architectures
that have been widely successful at a variety of tasks and have gone mainstream.
Ok, that's all for today. It's already been a little longer than usual. As always, thanks
for watching, stay tuned, and see you next time.