Lecture Planning
Week 1: Introduction to Deep Learning
 Welcome to the Deep Learning Specialization
 C1W1L01 Welcome
 Introduction to Deep Learning
 C1W1L02 Welcome
 C1W1L03 What is a neural network?
 C1W1L04 Supervised Learning with Neural Networks
 C1W1L05 Why is Deep Learning taking off?
 C1W1L06 About this Course
 C1W1R1 Frequently Asked Questions
 C1W1L07 Course Resources
 C1W1R2 How to use Discussion Forums
 C1W1L08 Geoffrey Hinton interview
 Practice Questions
 C1W1Q1 Introduction to deep learning
Week 2: Neural Networks Basics
 Logistic Regression as a Neural Network
 C1W2L01 Binary Classification
 C1W2L02 Logistic Regression
 C1W2L03 Logistic Regression Cost Function
 C1W2L04 Gradient Descent
 C1W2L05 Derivatives
 C1W2L06 More Derivative Examples
 C1W2L07 Computation Graph
 C1W2L08 Derivatives with a Computation Graph
 C1W2L09 Logistic Regression Gradient Descent
 C1W2L10 Gradient Descent on m examples
 Python and Vectorization
 C1W2L11 Vectorization
 C1W2L12 More Vectorization Examples
 C1W2L13 Vectorizing Logistic Regression
 C1W2L14 Vectorizing Logistic Regression’s Gradient Output
 C1W2L15 Broadcasting in Python
 C1W2L16 A note on python/numpy vectors
 C1W2L17 Quick tour of Jupyter/iPython Notebooks
 C1W2L18 Explanation of logistic regression cost function (optional)
 Practice Questions
 C1W2Q1 Neural Network Basics
 Programming Assignments
 C1W2P1 Practice Programming Assignment: Python Basics with numpy (optional)
 C1W2P2 Programming Assignment: Logistic Regression with a Neural Network mindset
Week 3: Shallow Neural Networks
 Shallow Neural Networks
 C1W3L01 Neural Networks Overview
 C1W3L02 Neural Network Representation
 C1W3L03 Computing a Neural Network’s Output
 C1W3L04 Vectorizing across multiple examples
 C1W3L05 Explanation for Vectorized Implementation
 C1W3L06 Activation functions
 C1W3L07 Why do you need nonlinear activation functions?
 C1W3L08 Derivatives of activation functions
 C1W3L09 Gradient descent for Neural Networks
 C1W3L10 Backpropagation intuition (optional)
 C1W3L11 Random Initialization
 Practice Questions
 C1W3Q1 Shallow Neural Networks
 Programming Assignment
 C1W3P1 Planar data classification with a hidden layer
Week 4: Deep Neural Networks
 Deep Neural Network
 C1W4L01 Deep L-layer neural network
 C1W4L02 Forward Propagation in a Deep Network
 C1W4L03 Getting your matrix dimensions right
 C1W4L04 Why deep representations?
 C1W4L05 Building blocks of deep neural networks
 C1W4L06 Forward and Backward Propagation
 C1W4L07 Parameters vs Hyperparameters
 C1W4L08 What does this have to do with the brain?
 Practice Questions
 C1W4Q1 Key concepts on Deep Neural Networks
 Programming Assignments
 C1W4P1 Building your deep neural network: Step by Step
 C1W4P2 Deep Neural Network Application
C1W2L03 Logistic Regression Cost Function
Loss function: measures how bad the prediction for a single example is.
 Loss function = error function
 Defined for a single example
Cost function: the average of the loss function over all examples.
 Defined over the entire training set
The loss function computes the error for a single training example; the cost function is the average of the loss function over the entire training set.
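 For reference, the formulas from the lecture: L(ŷ, y) = −(y·log(ŷ) + (1−y)·log(1−ŷ)) and J(w, b) = (1/m)·Σᵢ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾). A minimal numpy sketch of the cost (function and variable names are my own):
  import numpy as np

  def compute_cost(A, Y):
      # A: predictions ŷ, shape (1, m); Y: labels, shape (1, m)
      m = Y.shape[1]
      losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))  # loss per example
      return np.sum(losses) / m                            # cost = average loss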
Logistic regression can be viewed as a small neural network.
C1W2L04 Gradient Descent
dw = ∂J(w, b)/∂w
db = ∂J(w, b)/∂b
Update rule: w := w − α·dw, b := b − α·db, where α is the learning rate
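 A minimal numpy sketch of one gradient-descent step for logistic regression, using the vectorized forms derived later this week (names are my own):
  import numpy as np

  def sigmoid(z):
      return 1 / (1 + np.exp(-z))

  def gradient_step(w, b, X, Y, alpha=0.01):
      # X: inputs (n, m); Y: labels (1, m); w: weights (n, 1); b: scalar; alpha: learning rate
      m = X.shape[1]
      A = sigmoid(np.dot(w.T, X) + b)        # predictions ŷ
      dZ = A - Y                             # derivative of the loss w.r.t. z
      dw = np.dot(X, dZ.T) / m               # dJ/dw
      db = np.sum(dZ) / m                    # dJ/db
      return w - alpha * dw, b - alpha * db  # one update step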
C1W2L05 Derivatives
Intuitive understanding of derivatives
If you already understand derivatives, you can skip this video.
C1W2L10 Gradient Descent on m Examples
 For loop: sequential processing
 Vectorization (matrix computation): parallel processing; see the timing sketch below
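 A sketch in the spirit of the timing demo from the vectorization videos:
  import numpy as np
  import time

  a = np.random.rand(1000000)
  b = np.random.rand(1000000)

  tic = time.time()
  c = 0.0
  for i in range(1000000):       # for loop: one multiply-add at a time
      c += a[i] * b[i]
  print("loop:", time.time() - tic)

  tic = time.time()
  c = np.dot(a, b)               # vectorized: runs on parallel (SIMD) instructions
  print("np.dot:", time.time() - tic)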
Practice Programming Assignment (Optional)
 Actually, we rarely use the built-in “math” library in deep learning because its functions only take real numbers (scalars) as inputs. In deep learning we mostly work with matrices and vectors, which is why numpy is more useful.
 np.linalg.norm (with axis=1, keepdims=True): computes the norm of each row
 np.reshape is widely used. In the future, you’ll see that keeping your matrix/vector dimensions straight will go toward eliminating a lot of bugs.
 np.dot(): matrix multiplication
 np.multiply(), * operator: elementwise multiplication
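 A quick toy illustration of the difference between the two:
  import numpy as np

  A = np.array([[1, 2], [3, 4]])
  B = np.array([[5, 6], [7, 8]])
  np.dot(A, B)                              # matrix product: [[19, 22], [43, 50]]
  A * B                                     # elementwise (same as np.multiply): [[5, 12], [21, 32]]
  np.linalg.norm(A, axis=1, keepdims=True)  # norm of each row: [[2.236...], [5.]]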
Programming Assignment: Logistic Regression with a Neural Network mindset
 Many software bugs in deep learning come from having matrix/vector dimensions that don’t fit.
 Flattening technique
 X_flatten = X.reshape(X.shape[0], -1).T  # -1 lets numpy infer the size; each column is one flattened example
 Common steps for preprocessing a new dataset are:
 Figure out the dimensions and shapes of the problem (m_train, m_test, num_px, …)
 Reshape the datasets such that each example is now a vector of size (num_px * num_px * 3, 1)
 “Standardize” the data
 Preprocessing the dataset is important.
 You implemented each function separately: initialize(), propagate(), optimize(). Then you built a model().
 Tuning the learning rate (which is an example of a “hyperparameter”) can make a big difference to the algorithm.
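 Those preprocessing steps as a small sketch (the image shapes are placeholders; dividing by 255, the max pixel value, is how the assignment standardizes):
  import numpy as np

  train_x_orig = np.random.randint(0, 256, size=(209, 64, 64, 3))  # placeholder images
  m_train = train_x_orig.shape[0]

  train_x_flatten = train_x_orig.reshape(m_train, -1).T  # shape (num_px*num_px*3, m_train)
  train_x = train_x_flatten / 255.0                      # standardize pixels to [0, 1]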
C1W3L06 Activation Functions
 Activation functions introduced: sigmoid, tanh, ReLU, LeakyReLU
 Sigmoid
 Rarely used in practice, except as the output unit for binary classification.
 If |z| is large, the gradient at z is nearly 0, so gradient descent makes almost no update.
 tanh
 A shifted and rescaled version of sigmoid.
 Activations from tanh can have zero mean, while sigmoid activations cannot (they are always positive). Zero-mean activations help learning go faster, an effect similar to centering/normalizing the data.
 ReLU
 If z > 0, its parameters can learn without the vanishing-gradient problem.
 If z < 0, its parameters are not able to learn, i.e., they are not updated.
 It is reported that if a layer with ReLU has sufficiently many units, the problem at z < 0 is not a big deal.
 A unit can escape from z < 0 through updates of W (driven by other examples) that push z > 0.
 LeakyReLU
 In the range z < 0, the gradient is a small positive constant, typically 0.01.
 ReLU is the default choice. Use LeakyReLU only if you have a particular reason to.
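 The four activations as a numpy sketch (slope=0.01 for LeakyReLU is the common default, not a requirement):
  import numpy as np

  def sigmoid(z):
      return 1 / (1 + np.exp(-z))

  def tanh(z):
      return np.tanh(z)

  def relu(z):
      return np.maximum(0, z)

  def leaky_relu(z, slope=0.01):
      return np.where(z > 0, z, slope * z)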
C1W3L11 Random Initialization
 W = np.random.randn(a, b) * 0.01  (np.random.randn takes the dimensions as separate arguments, not a tuple)
 Why not W = np.random.randn(a, b) * 100 ?
 [If activation function g is sigmoid or tanh]
 W is small → z is small → the derivative of a = g(z) is not too small → W will be updated!
 W is large → z is large → the derivative of a = g(z) is nearly 0 → W will barely be updated!
 [If the activation function is symmetric around z = 0 – my thought]
 W is small → z is small → a gradient update can easily flip z between z > 0 and z < 0 → able to learn nonlinear decision boundaries
 W is large → z is large → a gradient update can hardly flip z between z > 0 and z < 0 → rarely able to learn nonlinear decision boundaries
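 A minimal sketch of the initialization for one layer (the layer sizes are made-up):
  import numpy as np

  n_x, n_h = 4, 3                        # made-up layer sizes
  W1 = np.random.randn(n_h, n_x) * 0.01  # small random values break symmetry
  b1 = np.zeros((n_h, 1))                # bias can safely start at zero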
C1W4L04 Why deep representations?
 To approximate some functions, a deep network needs far fewer hidden units than a shallow network does – circuit theory applied to deep learning.
 A shallow network can require exponentially more hidden units than a deep one: computing the XOR (parity) of n inputs takes O(log n) layers of small size, but a single hidden layer needs on the order of 2^n units.
 Andrew Ng thinks this result is less useful for gaining intuition about deep representations.
C1W4L06 Forward and Backward Propagation
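 A minimal numpy sketch of the per-layer forward and backward rules from this lecture (ReLU assumed as the activation; caching conventions simplified):
  import numpy as np

  def layer_forward(A_prev, W, b):
      Z = np.dot(W, A_prev) + b     # Z[l] = W[l] A[l-1] + b[l]
      A = np.maximum(0, Z)          # A[l] = g(Z[l]), here g = ReLU
      return Z, A

  def layer_backward(dA, Z, A_prev, W):
      m = A_prev.shape[1]
      dZ = dA * (Z > 0)                           # dZ[l] = dA[l] * g'(Z[l])
      dW = np.dot(dZ, A_prev.T) / m               # dW[l] = (1/m) dZ[l] A[l-1]^T
      db = np.sum(dZ, axis=1, keepdims=True) / m  # db[l] = (1/m) Σ dZ[l]
      dA_prev = np.dot(W.T, dZ)                   # dA[l-1] = W[l]^T dZ[l]
      return dA_prev, dW, db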
C1W4L08 What does this (deep learning) have to do with the brain?
 Artificial neural networks borrow a highly simplified structure from biological neurons; real biological neurons are far more complicated systems.
 Learning methods of artificial neural networks, such as backpropagation, do not seem to be used in real biological neural networks. Biological neurons may use different learning algorithms.
 The analogy to biological neurons is not very useful for understanding deep learning.