Convolutional Neural Networks | Deep Learning Specialization | Coursera

Course Planning

Week 1: Foundations of convolutional neural networks

Learn to implement the foundational layers of CNNs (pooling, convolutions) and to stack them properly in a deep network to solve multi-class image classification problems.

  • Convolutional neural networks
    • C4W1L01 Computer vision
    • C4W1L02 Edge detection example
    • C4W1L03 More edge detection
    • C4W1L04 Padding
    • C4W1L05 Strided convolutions
    • C4W1L06 Convolutions over volume
    • C4W1L07 One layer of a convolutional network
    • C4W1L08 Simple convolutional network example
    • C4W1L09 Pooling layers
    • C4W1L10 CNN example
    • C4W1L11 Why convolutions?
  • Practice questions
    • C4W1Q1 Quiz: The basics of ConvNets
  • Programming assignments
    • C4W1P1 Convolutional model: Step by step
    • C4W1P2 Convolutional model: Application
Week 2: Deep convolutional models: Case studies

Learn about the practical tricks and methods used in deep CNNs straight from the research papers.

  • Case studies
    • C4W2L01 Why look at case studies?
    • C4W2L02 Classic networks
    • C4W2L03 ResNets
    • C4W2L04 Why ResNets work
    • C4W2L05 Networks in networks and 1×1 convolutions
    • C4W2L06 Inception network motivation
    • C4W2L07 Inception network
  • Practical advice for using ConvNets
    • C4W2L08 Using open-source implementation
    • C4W2L09 Transfer learning
    • C4W2L10 Data augmentation
    • C4W2L11 State of computer vision
  • Practice questions
    • C4W2Q1 Deep convolutional models
  • Programming assignments
    • C4W2P1 Keras tutorial – The happy house (not graded)
    • C4W2P2 Residual networks
Week 3: Object detection

Learn how to apply your knowledge of CNNs to one of the toughest but hottest fields of computer vision: object detection.

  • Detection algorithms
    • C4W3L01 Object localization
    • C4W3L02 Landmark detection
    • C4W3L03 Object detection
    • C4W3L04 Convolutional implementation of sliding windows
    • C4W3L05 Bounding box predictions
    • C4W3L06 Intersection over union
    • C4W3L07 Non-max suppression
    • C4W3L08 Anchor boxes
    • C4W3L09 YOLO algorithm
    • C4W3L10 (Optional) Region proposals
  • Practice questions
    • C4W3Q1 Detection algorithms
  • Programming assignments
    • C4W3P1 Car detection with YOLOv2
Week 4: Special applications: Face recognition & neural style transfer

Discover how CNNs can be applied to multiple fields, including art generation and face recognition. Implement your own algorithm to generate art and recognize faces!

  • Face recognition
    • C4W4L01 What is face recognition?
    • C4W4L02 One shot learning
    • C4W4L03 Siamese network
    • C4W4L04 Triplet loss
    • C4W4L05 Face verification and binary classification
  • Neural style transfer
    • C4W4L06 What is neural style transfer?
    • C4W4L07 What are deep ConvNets learning?
    • C4W4L08 Cost function
    • C4W4L09 Content cost function
    • C4W4L10 Style cost function
    • C4W4L11 1D and 3D generalizations
  • Practice questions
    • C4W4Q1 Special applications: Face recognition & neural style transfer
  • Programming assignments
    • C4W4P1 Art generation with neural style transfer
    • C4W4P2 Face recognition for the happy house

C4W1L11 Why convolutions?

  • Why do we use CNNs over DNNs?
    • Parameter sharing
      • A feature detector that is useful in one part of the image is probably useful in other parts of the image as well.
      • Thus, parameters are shared across different parts of the image.
      • Parameter sharing lets a CNN use far fewer parameters than a DNN does.
    • Sparsity of connections
      • In each layer, each output value depends only on a small number of inputs.
    • For these two reasons, CNNs need relatively few parameters and are less prone to overfitting.
    • CNNs also have translation invariance, which arises from the parameter sharing and sparsity of connections in the convolution structure.
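
The parameter-sharing argument can be made concrete with the lecture's own numbers: a 32×32×3 input mapped to a 28×28×6 output, either by a fully connected layer or by six 5×5 filters.

```python
# Parameter-count comparison: 32x32x3 input -> 28x28x6 output.
n_in = 32 * 32 * 3          # 3072 input values
n_out = 28 * 28 * 6         # 4704 output values

# Fully connected layer: one weight per (input, output) pair, plus biases.
fc_params = n_in * n_out + n_out

# Convolutional layer: six 5x5x3 filters, each with one bias.
conv_params = (5 * 5 * 3 + 1) * 6

print(fc_params)    # 14,455,392
print(conv_params)  # 456
```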

C4W1P1 Convolutional model: Step by step

  • In this assignment, you will implement convolutional (CONV) and pooling (POOL) layers in numpy, including both forward propagation and (optionally) backward propagation.

C4W2L02 Classic networks

  • Classic neural network architectures
    • LeNet-5
    • AlexNet
    • VGGNet, VGG-16
  • LeNet-5 [Lecun et al., 1998. Gradient-based learning applied to document recognition]
    • ~60K parameters
    • Valid convolution
    • As a layer gets deeper, n_H and n_W get smaller, and n_C gets bigger.
    • Average pooling with stride 2 was used (max pooling was not common yet); a non-linearity was applied after pooling.
    • Sigmoid and tanh activations were used.
  • AlexNet [Krizhevsky et al., 2012. ImageNet Classification with Deep Convolutional Neural Networks]
    • 60M parameters
    • Max pooling
    • ReLU
    • Multiple GPUs
    • Local response normalization: later researchers found that this did not help much.
    • This network had a huge impact and convinced the computer vision community to seriously consider deep learning for computer vision.
  • VGG-16 [Simonyan & Zisserman, 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition]
    • 138M parameters
    • Remarkably simple, uniform architecture
    • 16 layers that have learnable parameters

C4W2L03 ResNets

  • ResNet [He et al., 2016. Deep residual networks for image recognition]
  • Residual block
  • short cut = skip connection
  • Skip connections allow you to train much deeper networks.

Residual Block

Residual Network

C4W2L04 Why ResNets work

  • Activation functions should be ReLUs.
  • z^{[l+2]} and a^{[l]} have the same dimension or shape.
    • a^{[l]} is the layer passed to z^{[l+2]} by the skip connection.
    • Thus, a ResNet uses many “same” convolutions inside its residual blocks.
  • If the dimensions differ, a^{[l]} can be zero-padded or multiplied by a projection matrix W_s so that the shapes match.

 \begin{aligned} a^{[l+2]} &= \textup{ReLU}(z^{[l+2]}+a^{[l]}) \\ &=\textup{ReLU}(W^{[l+2]}a^{[l+1]}+b^{[l+2]}+a^{[l]}) \\ &\approx \textup{ReLU}(a^{[l]}) \\ &= a^{[l]} \end{aligned}

when weight decay drives W^{[l+2]} \approx 0 and b^{[l+2]} \approx 0. (The last step holds because a^{[l]} is itself a ReLU output, so a^{[l]} \geq 0.)

  • It is easy for a residual block to learn the identity function because of the skip connection.
  • Thus, adding a residual block does not hurt performance but results in the same or improved performance.
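
A minimal numpy sketch of a (fully connected) residual block illustrates why learning the identity is easy: when weight decay drives the block's second-layer parameters to zero, the output collapses back to the input. The layer sizes here are made up for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_prev, W1, b1, W2, b2):
    """a[l+2] = ReLU(W2 @ ReLU(W1 @ a[l] + b1) + b2 + a[l]); the '+ a_prev' is the skip connection."""
    z1 = W1 @ a_prev + b1
    a1 = relu(z1)
    z2 = W2 @ a1 + b2
    return relu(z2 + a_prev)             # skip connection adds a[l]

rng = np.random.default_rng(0)
n = 4
a_prev = np.abs(rng.normal(size=n))      # non-negative, like a ReLU output

# If weight decay has driven W2 and b2 to zero, the block computes the identity.
W1, b1 = rng.normal(size=(n, n)), rng.normal(size=n)
W2, b2 = np.zeros((n, n)), np.zeros(n)
out = residual_block(a_prev, W1, b1, W2, b2)
print(np.allclose(out, a_prev))          # True
```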

C4W2L05 Networks in networks and 1×1 convolutions

  • 1×1 convolution = network in network [Lin et al., 2013. Network in network]
  • [My thought] A 1×1 convolution can be understood as a fully-connected layer applied across the channels at each spatial position.
  • When are 1×1 convolutions useful?
    • They are used to shrink the number of channels, much as pooling shrinks height and width (n_c^{[l]}>n_c^{[l+1]}).
    • They are used to learn a new representation of the channels (n_c^{[l]}=n_c^{[l+1]}).
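
As a sketch of the fully-connected-across-channels view (shapes are made up), a 1×1 convolution is just one c_in × c_out matrix applied at every spatial position, which is a single matmul:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, c_in, c_out = 6, 6, 32, 8          # shrink 32 channels down to 8
x = rng.normal(size=(H, W, c_in))        # one activation volume
kernel = rng.normal(size=(c_in, c_out))  # a 1x1 conv is just a c_in x c_out matrix

# Applying the same matrix at every spatial position == one big matmul.
y = (x.reshape(-1, c_in) @ kernel).reshape(H, W, c_out)
print(y.shape)  # (6, 6, 8)
```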

C4W2L06 Inception network motivation

[Szegedy el al., 2014. Going deeper with convolutions]

Inception Module

  • The 1×1, 3×3, and 5×5 convolutions are all “same” convolutions.
  • Max pooling needs “same” padding (and stride 1) to preserve the spatial size.
  • A 1×1 convolution can cut the computational cost of a large convolution by first reducing the number of channels (a “bottleneck” layer).
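
The bottleneck saving can be checked with the lecture's example: a 28×28×192 volume mapped to 28×28×32 by a 5×5 “same” convolution, versus the same mapping through a 1×1 bottleneck down to 16 channels.

```python
# Multiplication counts for 28x28x192 -> 28x28x32 via a 5x5 "same" convolution.
direct = 28 * 28 * 32 * (5 * 5 * 192)

# Same mapping through a 1x1 bottleneck that first reduces 192 -> 16 channels.
bottleneck_1x1 = 28 * 28 * 16 * (1 * 1 * 192)
conv_5x5 = 28 * 28 * 32 * (5 * 5 * 16)
with_bottleneck = bottleneck_1x1 + conv_5x5

print(direct)           # 120,422,400  (~120M multiplications)
print(with_bottleneck)  # 12,443,648   (~12.4M, roughly a 10x reduction)
```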

C4W2L07 Inception network

Inception Module Used in the Inception Network

  • The auxiliary softmax prediction branches have a regularizing effect on the Inception network.
  • GoogLeNet
  • The Inception network is essentially many inception modules stacked together, which makes the network deeper.

C4W2L08 Using open-source implementation

  • Search for open-source implementations on GitHub.

C4W2L09 Transfer learning

  • One trick that can speed up training: since the early layers are frozen, pre-compute their activations for all the training examples once and save them to disk; then train just the softmax classifier on top of those cached features. The advantage of this pre-compute method is that you don't recompute those activations on every epoch, on every pass through the training set.
  • Suppose you reuse learned parameters to other tasks.
    • If your training set is small, freeze all the pretrained parameters and train only a new softmax output layer.
    • If your training set is larger, make the parameters of only the last few layers trainable, or replace those layers with new ones.
      • The more training data you have, the fewer early layers you freeze.
      • If your training set is very large, you can treat the learned parameters as an initialization and make all of them trainable.
  • In computer vision, transfer learning is almost always worth doing, unless you have an exceptionally large dataset and plenty of computing resources.
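
The precompute-and-cache trick can be sketched in numpy. Everything here is a toy stand-in (random “frozen” features, random labels, made-up sizes); the point is only that the frozen activations are computed once and the softmax classifier trains on the cached features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen pretrained layers (hypothetical; in practice this
# would be a pretrained ConvNet whose parameters are held fixed).
W_frozen = rng.normal(size=(64, 20))

def frozen_features(X):
    return np.maximum(0, X @ W_frozen)   # fixed, never updated

X_train = rng.normal(size=(200, 64))
y_train = rng.integers(0, 3, size=200)   # 3 toy classes

# Pre-compute the frozen activations ONCE (could be saved to disk) and
# reuse them every epoch instead of re-running the frozen layers.
F = frozen_features(X_train)
F = (F - F.mean(axis=0)) / (F.std(axis=0) + 1e-8)  # standardize for stable training

# Train only the softmax classifier on top of the cached features.
W, b = np.zeros((20, 3)), np.zeros(3)
Y = np.eye(3)[y_train]
for _ in range(300):
    logits = F @ W + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = (p - Y) / len(F)              # softmax cross-entropy gradient
    W -= 0.5 * (F.T @ grad)
    b -= 0.5 * grad.sum(axis=0)

train_acc = (np.argmax(F @ W + b, axis=1) == y_train).mean()
print(train_acc)                         # better than the 1/3 chance level
```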

C4W2L10 Data augmentation

  • Data augmentation is one of the techniques that is often used to improve the performance of computer vision systems.
  • For the majority of computer vision problems, we feel like we just can’t get enough data. And this is not true for all applications of machine learning, but it does feel like it’s true for computer vision.
  • Common augmentation methods
    • More common
      • Mirroring
      • Random cropping
      • Color shifting: This can make your model more robust to color variation
        • PCA color augmentation
    • Less common (because these require more complex computation)
      • Rotation
      • Shearing
      • Local warping
  • Each data augmentation method has hyperparameters that control how much the data is distorted.
  • Start from an existing open-source data augmentation implementation rather than writing your own.
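
For illustration, mirroring, random cropping, and a simple color shift are each a line or two of numpy on a toy image (the image size, crop size, and shift values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # toy RGB image

# Mirroring: flip the image along the horizontal (width) axis.
mirrored = img[:, ::-1, :]

# Random cropping: take a random 24x24 patch of the 32x32 image.
crop = 24
top = rng.integers(0, img.shape[0] - crop + 1)
left = rng.integers(0, img.shape[1] - crop + 1)
cropped = img[top:top + crop, left:left + crop, :]

# Color shifting: add a per-channel offset; clipping keeps valid pixel values.
shift = np.array([10, -10, 5])
shifted = np.clip(img.astype(int) + shift, 0, 255).astype(np.uint8)

print(mirrored.shape, cropped.shape, shifted.shape)
```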

C4W2L11 State of computer vision

  • Two sources of knowledge in machine learning
    • Source 1: labeled data – data
    • Source 2: hand engineered features, network architectures – information processing systems
  • Less of Source 1 (labeled data) requires more of Source 2 (hand-engineering).
  • More of Source 1 requires less of Source 2.
Tips for doing well on benchmarks and winning competitions
  • You wouldn’t really use the following tips in a production or a system that you deploy in an actual application
  • Method 1: Ensembling
    • Train several networks independently and average their outputs.
      • Training 3~15 networks for ensembling is quite typical.
    • [-] Downside: all the ensembled networks must be kept in memory, and test-time computation multiplies.
  • Method 2: Multi-crop at test time
    • Run the classifier on multiple crops of each test image and average the results.
    • Used more for doing well on benchmarks than in actual production systems.
    • [-] Multi-crop consumes more computational resources at test time.
Use open source code!
  • Use architectures of networks published in the literature.
  • Use open source implementation if possible.
  • Use pretrained models and fine-tune on your dataset.

C4W2Q1 Deep convolutional models

  • In theory, making a plain network deeper lets it fit more complex functions. In practice, a very deep plain network becomes hard to train because of vanishing and exploding gradients; skip connections address this.
  • A skip connection helps the gradient backpropagate, and thus helps you train much deeper networks.
  • The skip-connection makes it easy for the network to learn an identity mapping between the input and the output within the ResNet block.
  • A single inception block allows the network to use a combination of 1×1, 3×3, and 5×5 convolutions and pooling.
  • Inception blocks usually use 1×1 convolutions to reduce the input volume's channel count before applying the 3×3 and 5×5 convolutions.

C4W3L01 Object localization

C4W3L02 Landmark detection

  • Landmarks are points to characterize some visual object.
    • A face has landmarks such as points of eyes, lips, noses, a jaw, and so on.
    • A human body pose has landmarks such as joints of the body.
  • For landmark detection with k landmarks, the target vector is designed as follows.

 \textup{target vector} = (\textup{object},l_{1x},l_{1y},l_{2x},l_{2y},...,l_{kx},l_{ky})

  •  The vector has 1+2k dimensions: one element detects whether the object is present, and 2k elements predict the coordinates of the k landmarks.
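
As a tiny illustration, building the 1+2k target vector for k = 3 hypothetical landmarks (the coordinates are made up):

```python
import numpy as np

# Toy target vector for k = 3 landmarks, as (l_x, l_y) pairs.
landmarks = [(0.4, 0.3), (0.6, 0.3), (0.5, 0.7)]
object_present = 1.0

target = np.array([object_present] + [c for xy in landmarks for c in xy])
print(target.shape)  # (7,) == 1 + 2k for k = 3
```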

C4W3L03 Object detection

  • Sliding window detection
    • Crop images by sliding windows of different sizes across the input.
    • Feed each cropped image to a ConvNet.
    • Train the network to determine whether an input image contains a certain object.
  • Sliding-window detection produces a huge number of crops to classify, which makes it computationally expensive for ConvNets.

C4W3L04 Convolutional implementation of sliding windows

[Sermanet et al., 2014, OverFeat: Integrated recognition, localization and detection using convolutional networks]

  • Convolutional implementation of sliding windows 
    • Key idea: to reuse repeated convolutional computation
    • Benefit: to compute object regions in an image in parallel
    • Drawback: the predicted bounding-box positions are not very accurate.

  • In the last layer, whose size is 8×8×4, each 1×1×4 vector acts as a softmax output of 4-class classification for one window position.

C4W3L05 Bounding box predictions

  • How to make bounding box predictions more accurate
  • Solutions
    • YOLO algorithm
      • [Redmon et al., 2014, You only look once: Unified, real-time object detection]
      • Labels for training
        • (p_c, b_x, b_y, b_w, b_h, c_1, c_2, c_3)
          • p_c:
            • p_c = 1: an object is in this grid cell. The center of the bounding box is in the grid cell.
            • p_c = 0: no object is in the grid cell.
          • b_x, b_y, b_w, b_h: (Regression or classification?)
            • b_x, b_y: the location of a bounding box
            • b_w, b_h: width and height of the bounding box
          • c_1, c_2, c_3: a one-hot vector classifying the object into one of the three classes.
            • (c_1, c_2, c_3) can only be (1, 0, 0), (0, 1, 0), or (0, 0, 1).
  • YOLO directly predicts the locations of bounding boxes, so they are more accurate than those obtained from sliding windows.
  • YOLO is built on the convolutional implementation of sliding windows.
  • The YOLO paper is hard to read in full detail.

C4W3L06 Intersection over union

  • Intersection over union
    • Intersection over union (IoU) is a measure to evaluate how accurate object localization is.
    • IoU(prediction) = Intersection(groundTruth, prediction) / Union(groundTruth, prediction)
    • If IoU(prediction) \geq 0.5, the prediction is considered correct. Otherwise, it is considered incorrect.
      • 0.5 is just a convention threshold. There is no theoretical threshold.
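
A straightforward IoU implementation for axis-aligned boxes, assuming an (x1, y1, x2, y2) corner-coordinate convention:

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # 0 if the boxes are disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 (identical boxes)
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0 (disjoint boxes)
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, approximately 0.143
```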

C4W3L07 Non-max suppression

  • One problem of object detection is that an object can be detected multiple times.
  • Non-max suppression is an algorithm to detect an object only once.
Non-max suppression algorithm for 1-class object detection
  1. Predict (p_c, b_x, b_y, b_w, b_h) in an image. Put all boxes into a set S_1.
  2. Discard all boxes with p_c \leq 0.6 from the set S_1.
  3. Pick the box with the largest p_c in S_1. Set the box B_{\textup{max}}.
  4. Discard every box B with \textup{IoU}(B, B_{\textup{max}}) \geq 0.5 from S_1.
  5. Move B_{\textup{max}} from S_1 to the set S_2.
  6. Go to step 3 if S_1 is not empty. Otherwise, end the algorithm.

Finally, the set S_2 is the result of the non-max suppression algorithm. S_2 has less overlapped boxes than before.
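
The steps above can be implemented directly. This sketch represents each box as a (p_c, x1, y1, x2, y2) tuple (the coordinate convention is an assumption) and re-implements IoU locally so it is self-contained:

```python
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0

def non_max_suppression(boxes, p_threshold=0.6, iou_threshold=0.5):
    """boxes: list of (p_c, x1, y1, x2, y2). Returns the surviving set S_2."""
    s1 = [b for b in boxes if b[0] > p_threshold]      # step 2: discard low p_c
    s2 = []
    while s1:                                          # steps 3-6
        b_max = max(s1, key=lambda b: b[0])            # highest-confidence box
        s1 = [b for b in s1
              if b is not b_max and iou(b[1:], b_max[1:]) < iou_threshold]
        s2.append(b_max)
    return s2

# Three detections of one object, plus one low-confidence box elsewhere.
detections = [
    (0.9, 0, 0, 10, 10),    # kept: highest p_c
    (0.8, 1, 1, 11, 11),    # suppressed: overlaps the 0.9 box heavily
    (0.7, 0, 1, 10, 11),    # suppressed: same reason
    (0.5, 50, 50, 60, 60),  # discarded: p_c below 0.6
]
kept = non_max_suppression(detections)
print(kept)  # [(0.9, 0, 0, 10, 10)]
```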

  • If there are multiple classes, apply the non-max suppression algorithm to the boxes for each class.
  • Q: Should the non-max suppression algorithm be applied separately to the boxes predicted for each anchor box?
    • In the lectures that follow, all anchor boxes are handled together.

C4W3L08 Anchor boxes

  • Anchor boxes were introduced to handle the case where the midpoints of two objects fall in the same grid cell; the basic YOLO algorithm assumes only one object midpoint per grid cell.
  • Each anchor box has a different height-to-width ratio.
  • Anchor boxes are used to classify bounding boxes.
How to classify bounding boxes?
  1. Set the midpoints of anchor boxes to be same as the midpoint of a bounding box.
  2. The bounding box is assigned to the anchor box that has the highest IoU with it.
    1. Q: How large should the anchor boxes be?
The shape of the prediction vector

With two anchor boxes, the prediction vector is as follows. p^{(1)}_c indicates whether an object of anchor box 1 is in the grid cell; p^{(2)}_c does the same for anchor box 2.

 (p^{(1)}_c, b^{(1)}_x, b^{(1)}_y, b^{(1)}_w, b^{(1)}_h, c^{(1)}_1, c^{(1)}_2, c^{(1)}_3,p^{(2)}_c, b^{(2)}_x, b^{(2)}_y, b^{(2)}_w, b^{(2)}_h, c^{(2)}_1, c^{(2)}_2, c^{(2)}_3)

  • The ConvNet for YOLO can be specialized for objects of each anchor box.
If the midpoints of 2 objects of the same anchor box are in the same grid cell
  • the YOLO algorithm can detect only one of the 2 objects.
  • This is a limitation of the algorithm.
  • In practice this rarely happens, because there are many grid cells and several anchor boxes per cell.
How to define anchor boxes?
  1. Manually define anchor boxes by following researchers’ intuition.
  2. Use the k-means algorithm on the shapes of the ground-truth bounding boxes to choose k anchor boxes (a more advanced technique).

C4W3L09 YOLO algorithm

ConvNet in the YOLO algorithm
ConvNet Structure

Input image → ConvNet → Output with the size of (#grid cells)×(1 + 4 + #classes)×(#anchor boxes)

  • (1 + 4 + #classes)
    • 1: p_c, the predicted probability that an object of a certain anchor box is present in the grid cell
    • 4: b_x, b_y, b_w, b_h. The location and size of the bounding box of the object
    • #classes: the size of one-hot encoded class representation.
  • Q: For the objective function, is it okay that when p_c = 0 the other elements are treated as “don't cares”?
Outputting the non-max suppressed outputs
  1. For each grid cell, get 2 predicted bounding boxes. (2 is the number of anchor boxes per grid cell.)
  2. Get rid of low probability predictions (low p_c).
  3. For each class (pedestrian, car, motorcycle),
    1. Use the non-max suppression algorithm to generate final prediction.
In step 3
  • If 2 predicted bounding boxes in a grid cell are of the same class, only 1 box survives.
  • If 2 predicted bounding boxes in a grid cell are of different classes, both boxes survive.

C4W3L10 (Optional) Region proposals


[Girshick et al., 2013, Rich feature hierarchies for accurate object detection and semantic segmentation]

  • R-CNN stands for Regions with Convolutional Neural Networks.
  • Process:
    1. Image input
    2. → Segmentation. Propose regions
    3. → Apply a CNN classifier on the bounding boxes of segmented regions to predict a class and a bounding box.
  • The bounding box the classifier predicts can fit an object more accurately than the proposed region does.
  • #(proposed regions) < #(sliding windows)
    • ⇒ Reduced computational cost
  • The R-CNN is quite slow. To overcome its slowness, Fast R-CNN and Faster R-CNN were introduced.
Fast R-CNN

[Girshick, 2015, Fast R-CNN]

  • Improves the 3rd step of the process with a convolutional implementation of sliding windows.
Faster R-CNN

[Ren, 2015, Faster R-CNN: Towards real-time object detection with region proposal networks]

  • Improves the 2nd step of the process with a convolutional network (a region proposal network).

C4W4L01 What is face recognition?

Face verification vs Face recognition
Face verification
  • Input: Face image, human ID
  • Output: Decide whether the face image is of the human ID.
  • Face image → Yes/No
Face recognition
  • Input: Face image
  • Output: Decide which of the K people in the database, if any, the face image belongs to.
  • Face image → Face ID

C4W4L02 One shot learning

One-shot learning
  • Def. Learning from a single example of a person in order to recognize that person again.
One idea to implement one-shot learning: Similarity function
  • One idea to implement one-shot learning is to calculate the distance between the vector representations, so called embeddings, of two faces.
  • Let \textup{En(img1)} and \textup{En(img2)} be the embeddings of the two images \textup{img1} and \textup{img2}.

d(\textup{En(img1)},\textup{En(img2)})<\textup{threshold} ⇒ img1 and img2 show the same face.

d(\textup{En(img1)},\textup{En(img2)})\ge\textup{threshold} ⇒ img1 and img2 show different faces.

  • This idea is very similar to clustering.

C4W4L03 Siamese network

[Taigman et al., 2014, DeepFace: Closing the gap to human-level performance in face verification][LINK]

  • Train a ConvNet for face recognition that classifies many different faces.
  • Remove the top classification layer. The remaining network then encodes a face: its highest layer outputs the encoding \textup{En(img)} of a given face image \textup{img}.
  • Figure out a threshold that clusters the faces of the same people.
  • Using the threshold, perform face recognition as follows.

d(\textup{En(img1)},\textup{En(img2)})<\textup{threshold} ⇒ img1 and img2 show the same face.

d(\textup{En(img1)},\textup{En(img2)})\ge\textup{threshold} ⇒ img1 and img2 show different faces.

d(\textup{En(img1)},\textup{En(img2)})=\left \| \textup{En(img1)} - \textup{En(img2)} \right \|^2
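
A toy verification check based on this squared-distance rule. The 128-dimensional embeddings, the noise level, and the threshold 0.7 are all made-up values; in practice the threshold is tuned on a validation set.

```python
import numpy as np

def same_face(en1, en2, threshold=0.7):
    """Verify by thresholding the squared L2 distance of two embeddings."""
    d = np.sum((en1 - en2) ** 2)
    return d < threshold

rng = np.random.default_rng(0)
anchor = rng.normal(size=128)                 # toy 128-d embedding
same = anchor + 0.01 * rng.normal(size=128)   # nearly identical embedding
different = rng.normal(size=128)              # unrelated embedding

print(same_face(anchor, same))        # True
print(same_face(anchor, different))   # False
```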

C4W4L04 Triplet loss

[Schroff et al., 2015, FaceNet: A unified embedding for face recognition and clustering][LINK]

What is the triplet loss?

FaceNet used an objective function called a triplet loss.

Choose one image and call it an anchor A.

Let P be a face image of the same person as A; call P the positive.

Let N be a face image of a different person from A; call N the negative.

In FaceNet, the learning objective is as follows. d is the same as the previous one.

d(A,P) < d(A,N)

d(A,P) - d(A,N) < 0

The learning objective becomes to minimize d(A,P) - d(A,N).

In practice, we need a margin \alpha > 0 that forces a sufficient gap between d(A,P) and d(A,N). So the learning objective becomes d(A,P) + \alpha < d(A,N), and the objective function is defined as follows.

To minimize J'(A,P,N) = d(A,P) - d(A,N) + \alpha

However, if we simply minimized d(A,P) - d(A,N) + \alpha, the network could keep increasing d(A,N) even after the margin is satisfied, which is unnecessary. Thus, we clip the objective at zero and define it as follows.

To minimize J(A,P,N) = max(0, d(A,P) - d(A,N) + \alpha)
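
The final clipped objective is one line of numpy. The example triplets below are made up to show an “easy” case (loss 0) and a “hard” case where d(A,P) ≈ d(A,N) and the margin term dominates.

```python
import numpy as np

def triplet_loss(en_a, en_p, en_n, alpha=0.2):
    """J(A,P,N) = max(0, d(A,P) - d(A,N) + alpha), with d = squared L2 distance."""
    d_ap = np.sum((en_a - en_p) ** 2)
    d_an = np.sum((en_a - en_n) ** 2)
    return max(0.0, d_ap - d_an + alpha)

# Easy triplet: the positive is close to the anchor, the negative is far away.
a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # d(A,P) = 0.01
n = np.array([2.0, 0.0])   # d(A,N) = 4.0
print(triplet_loss(a, p, n))        # 0.0 -- the margin is already satisfied

# Hard triplet: d(A,P) is roughly d(A,N), so the loss is near alpha.
n_hard = np.array([0.1, 0.05])
print(triplet_loss(a, p, n_hard))   # about 0.1975
```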

Choosing the triplets A, P, N during training

During training, mostly it is easy to satisfy d(A,P) + \alpha < d(A,N). So, try to train on the triplets (A,P,N) such that d(A,P) \approx d(A,N).

C4W4L05 Face verification and binary classification

[Taigman et al., 2014, DeepFace: Closing the gap to human-level performance in face verification][LINK]

  • An alternative to the triplet loss

\hat{y}=\sigma\left(\sum_{k=1}^{n_{\textup{En}}} w_k \left| \textup{En}(\textup{img1})_{k}-\textup{En}(\textup{img2})_{k} \right| +b\right), \quad n_{\textup{En}} = \textup{embedding dimension}

  • The elementwise absolute difference can be replaced by other distances that measure the difference between the two embeddings, such as the chi-square similarity.
  • This is logistic regression predicting whether the two images show the same person (y=1) or different people (y=0).

Learning the similarity function

  • Precomputation: pre-compute the embeddings of the database images ahead of time to make prediction faster.

C4W4L06 What is neural style transfer?

[Gatys et al., 2015. A neural algorithm of artistic style][LINK]

Neural style transfer
  • Key concepts: content (C), style (S), generated image (G)
  • C + S ⇒ G

C4W4L07 What are deep ConvNets learning?

[Zeiler and Fergus, 2013, Visualizing and understanding convolutional networks][LINK]

  • Key word: Visualizing each deep ConvNet layers
  • As the layer gets higher, it determines more abstract visual features.
  • Q: How to visualize each units in each layer?

C4W4L08 Cost function

The cost function for neural style transfer

 J(G)= \alpha \cdot J_{\textup{content}}(C,G) + \beta \cdot J_{\textup{style}}(S,G)

or if reducing the parameter redundancy

 J(G)= \alpha \cdot J_{\textup{content}}(C,G) + (1-\alpha) \cdot J_{\textup{style}}(S,G)

Find the generated image G
  1. Initialize G randomly. (G: 100×100×3 = w×h×c)
    • Parameter initialization
    • Each pixel is a parameter to train.
  2. Use gradient to minimize J(G)
    • Gradient-based learning of parameters
    • G:=G-\eta\frac{\partial}{\partial{G}}J(G), where \eta is the learning rate

C4W4L09 Content cost function

  • Say you use a hidden layer l to compute content cost.
    • The content is compared at a single hidden layer l, usually chosen somewhere in the middle of the network.
    • l is generally not too shallow and not too deep.
  • Use pre-trained ConvNet (E.g., VGG network).
  • Let a^{[l](C)} be the activation of layer l on the content image.
  • Let a^{[l](G)} be the activation of layer l on the generated image.
  • We presume that if a^{[l](C)} and a^{[l](G)} are similar, then both image C and G have similar content. So, J_{\textup{content}}(C,G) is defined as follows.

 J_{\textup{content}}(C,G)=\frac{1}{2} \left \| a^{[l](C)} - a^{[l](G)} \right \|^2

  • The choice of layer l determines the level of content detail that is matched.
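
The content cost is a direct translation of the formula above; the activation shapes here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a_C = rng.normal(size=(14, 14, 256))           # layer-l activation, content image
a_G = a_C + 0.1 * rng.normal(size=a_C.shape)   # activation on the generated image

# J_content(C, G) = 1/2 * || a[l](C) - a[l](G) ||^2
J_content = 0.5 * np.sum((a_C - a_G) ** 2)
print(J_content > 0)                    # True; the cost is 0 only for a perfect match
print(0.5 * np.sum((a_C - a_C) ** 2))   # 0.0
```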

C4W4L10 Style cost function

  • Say you are using the activation of the layer l to measure “style.”
  • Define style as correlation between activations across (high-level) channels.
  • How correlated are the activations across different channels? In the paper, the style matrix G^{[l]} \in \mathbb{R}^{n^{[l]}_{c} \times n^{[l]}_{c}} measures the correlations between channels at layer l. Its entry G^{[l]}_{kk'} measures the correlation between channel k and channel k' at layer l.


  • G^{[l](S)}: the style matrix of the style image
  • G^{[l](G)}: the style matrix of the generated image
  • The following functions are the style cost functions. J_{\textup{style}}^{[l]} is the style cost function at layer l, and J_{\textup{style}} is the overall style cost function.

 J_{\textup{style}}^{[l]}(S,G)=\frac{1}{(2n^{[l]}_{H}n^{[l]}_{W}n^{[l]}_{C})^2} \sum_{k} \sum_{k'} (G_{kk'}^{[l](S)} - G_{kk'}^{[l](G)})^2

 J_{\textup{style}}(S,G)=\sum_{l} \lambda^{[l]} J_{\textup{style}}^{[l]}(S,G)

  • \lambda^{[l]} is a hyperparameter.
  • Summing over several layers l lets the style cost capture style at multiple levels of abstraction.
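
The style matrix and per-layer style cost translate directly into numpy. The activation shapes are made up, and the Gram matrix here is unnormalized, matching the formula above (normalization happens in the cost's denominator).

```python
import numpy as np

def gram_matrix(a):
    """Style matrix G[l]: channel-by-channel correlations of activation a (H, W, C)."""
    H, W, C = a.shape
    flat = a.reshape(H * W, C)
    return flat.T @ flat                  # (C, C): entry (k, k') sums a_k * a_k'

def style_cost_layer(a_S, a_G):
    """J_style^[l] = ||G(S) - G(G)||_F^2 / (2 * H * W * C)^2."""
    H, W, C = a_S.shape
    G_S, G_G = gram_matrix(a_S), gram_matrix(a_G)
    return np.sum((G_S - G_G) ** 2) / (2 * H * W * C) ** 2

rng = np.random.default_rng(0)
a_S = rng.normal(size=(8, 8, 16))         # layer-l activation on the style image
a_G = rng.normal(size=(8, 8, 16))         # layer-l activation on the generated image

print(gram_matrix(a_S).shape)             # (16, 16)
print(style_cost_layer(a_S, a_S))         # 0.0 when the styles match exactly
print(style_cost_layer(a_S, a_G) > 0)     # True
```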

C4W4L11 1D and 3D generalizations

  • Where 1D ConvNets are used.
    • Heartbeat signals
  • Where 3D ConvNets are used.
    • CT scan images
    • Movie videos

