A problem arises when you want information from the edge of an image given the nature of filters.
This is where padding comes in, where an image is padded on each side with empty pixels.
So, finally, how does this all come together?
After all of the convolutional and pooling layers, standard fully connected layers are usually used to produce a result, very similar to that of a traditional neural network.
As a result:
The first layers of a CNN learn basic features such as edges, corners, etc.
The middle layers learn to detect parts of objects.
The final layers learn to recognize entire objects.