Brief Information
 Instructor: Vincent Vanhoucke (Principal Scientist at Google Brain)
 Platform: Udacity
 Course homepage: https://www.udacity.com/course/deep-learning--ud730
 Duration
 20170824~25: Took Lessons 1, 3~7 without programming assignments.
Course Overview
 Lesson 1: From Machine Learning to Deep Learning
 Lesson 2: Assignment: notMNIST
 Lesson 3: Deep Neural Networks
 Lesson 4: Convolutional Neural Networks
 Lesson 5: Deep Models for Text and Sequences
 Lesson 6: Software and Tools
 Lesson 7: Build a Live Camera App
Lesson 1: From Machine Learning to Deep Learning
Key words
softmax, one-hot encoding, cross-entropy
One-hot Encoding
 One-hot encoding is the idea that categories or classes are represented as an N-dimensional vector whose elements are 0 or 1, with exactly one element equal to 1.
 For example, if there are three classes: red, green, and blue, then we encode red as [1, 0, 0], green as [0, 1, 0], and blue as [0, 0, 1].
 The reason why only one element is 1 and the others are 0 is that each element, and the sum of all elements, can be interpreted as a probability. Then we can use the cross-entropy between the softmax output and the one-hot encoding. Suppose the label of some example is red, i.e., L = [1, 0, 0]. L represents the true probability distribution of the example over the three colors: P(red) = 1, P(green) = 0, and P(blue) = 0.
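The encoding above can be sketched in a few lines of numpy (not from the course; a minimal illustration):

```python
import numpy as np

def one_hot(index, num_classes):
    # A vector of zeros with a single 1 at position `index`.
    v = np.zeros(num_classes)
    v[index] = 1.0
    return v

# Three classes: red, green, blue.
red, green, blue = (one_hot(i, 3) for i in range(3))
```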
Cross-entropy
 Cross-entropy is a measure of the difference between two probability distributions.
 D(S, L) = - Σ_i L_i log(S_i)
 S: softmax output; the predicted probability distribution.
 L: one-hot encoded label; the true probability distribution.
 Read D(S, L) as "the cross-entropy of L and S." Note that D is not symmetric.
 Suppose L_i = 1. Then D(S, L) = -log(S_i). As S_i gets closer to 1, D(S, L) decreases.
 For S_i to get closer to 1 means that the sum of the other elements decreases.
 Taking the class with the largest softmax output (argmax) gives the model's answer.
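A minimal sketch of the softmax and cross-entropy definitions above, using numpy (not the course's code; the logit values are made up for illustration):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def cross_entropy(S, L):
    # D(S, L) = -sum_i L_i * log(S_i)
    return -np.sum(L * np.log(S))

L = np.array([1.0, 0.0, 0.0])           # one-hot label: "red"
S = softmax(np.array([2.0, 1.0, 0.1]))  # predicted distribution
d = cross_entropy(S, L)
```

As the logit of the labeled class grows, S_0 approaches 1 and d approaches 0.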
Loss function: average cross-entropy
 Loss = average cross-entropy over the training examples: L = (1/N) Σ_j D(S(w x_j + b), L_j)
 x_j: a training example
 N: the number of training examples
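The averaged loss can be sketched as follows (a numpy illustration with made-up logits, not the course's code):

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax; subtract the max for numerical stability.
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def average_cross_entropy(logits, labels):
    # (1/N) * sum_j D(S(logits_j), L_j)
    S = softmax(logits)
    return -np.mean(np.sum(labels * np.log(S), axis=1))

logits = np.array([[2.0, 0.1, 0.1],
                   [0.1, 3.0, 0.1]])
labels = np.array([[1.0, 0.0, 0.0],    # one-hot labels for 2 examples
                   [0.0, 1.0, 0.0]])
loss = average_cross_entropy(logits, labels)
```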
Numerical Stability
In [1]: val = 1000000000
In [2]: for i in xrange(1000000):
   ...:     val = val + 0.000001
   ...:
In [3]: val = val - 1000000000
In [4]: print val
0.953674316406
 The result should be 1 but is not.
What if val = 1?
In [5]: val = 1
In [6]: for i in xrange(1000000):
   ...:     val = val + 0.000001
   ...:
In [7]: val = val - 1
In [8]: print val
0.999999999918
 The result is very close to 1.
Normalized Inputs and Initial Weights
 Methods to normalize input
 Subtract the mean → divide by the standard deviation → normalization is done
 Subtract the mean → divide by (max - min)/2 → normalization is done
 Initialize weights
 Zero mean: mean(w) = 0
 Equal variance: Var(w) = σ², with a small σ
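Both normalization methods and the weight initialization can be sketched with numpy (random data for illustration; σ = 0.01 is an assumed small value, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))  # raw inputs

# Method 1: subtract the mean, divide by the standard deviation.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Method 2: subtract the mean, divide by (max - min) / 2.
X_alt = (X - X.mean(axis=0)) / ((X.max(axis=0) - X.min(axis=0)) / 2)

# Weight initialization: zero mean, small equal variance.
sigma = 0.01
W = rng.normal(loc=0.0, scale=sigma, size=(4, 3))
```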
Measuring Performance
 Training
 Validation
 Testing
Lesson 3: Deep Neural Networks
Linear Models
 Advantages
 Linear models are stable: their outputs are bounded and do not blow up to very large or very small values.
 Derivatives of linear models are very simple to compute.
 Disadvantages
 Two stacked linear layers perform the same as one linear layer. Nonlinear models are required to make a model do complex tasks.
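The collapse of stacked linear layers can be verified directly, since matrix multiplication composes (a numpy illustration with random weights):

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 5))   # first linear layer
W2 = rng.normal(size=(3, 4))   # second linear layer stacked on top
x = rng.normal(size=5)

two_layers = W2 @ (W1 @ x)     # output of the two stacked layers
one_layer = (W2 @ W1) @ x      # a single linear layer with W = W2 @ W1
```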
Rectified Linear Units (ReLUs)
 ReLUs are suggested because of the limit of linear models.
Training a Deep Learning Network
 A deep network model works well with a large amount of data.
Generalization
 Choose a model whose capacity is large enough to describe the given dataset.
 → Such a model tends to overfit the dataset.
 → We have to discourage overfitting, which is called generalization.
 Methods of generalization
 Early stopping (early termination)
 Choose the parameters of a model from the step where its validation error is lowest.
 Regularization
Regularization
 Regularization is a process of introducing additional information in order to prevent overfitting. – From Wikipedia(en)
 Methods of regularization
 L2 regularization
 Dropout
L2 regularization
 L2 regularization adds a penalty on large weights to the loss: L' = L + (β/2)‖w‖²
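A minimal sketch of the L2 penalty L' = L + (β/2)‖w‖² (β = 0.1 and the weight values are made-up numbers for illustration):

```python
import numpy as np

def l2_regularized_loss(data_loss, W, beta=0.1):
    # loss' = loss + (beta / 2) * ||W||^2 penalizes large weights.
    return data_loss + 0.5 * beta * np.sum(W ** 2)

W = np.array([1.0, -2.0, 3.0])        # ||W||^2 = 14
loss = l2_regularized_loss(0.5, W)    # 0.5 + 0.05 * 14 = 1.2
```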
Dropout
 At each training step, a randomly chosen fraction (typically half) of the activations in a layer are set to 0.
 Effect: the learned parameters do not become strongly dependent on the training dataset. This helps prevent overfitting.
 If dropout does not work for you, you should use a bigger network.
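The dropout step above can be sketched as follows (a common "inverted dropout" formulation, not necessarily the course's exact code; keep_prob = 0.5 is an assumed rate):

```python
import numpy as np

def dropout(activations, keep_prob=0.5, rng=None):
    # "Inverted" dropout: zero each activation with probability
    # (1 - keep_prob) and scale the survivors by 1 / keep_prob, so the
    # expected value of each activation is unchanged.
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

a = np.ones(10000)
dropped = dropout(a, keep_prob=0.5, rng=np.random.default_rng(0))
```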
Validation Set Size
 If the validation set was not well sampled, then
 apply the cross-validation method, or
 get a lot more data.
 Cross-validation takes a long time to compute.
Stochastic Gradient Descent (SGD)
 Batch gradient descent
 computes the gradient using all training examples at each update iteration.
 is not suitable for deep learning architectures, which require a lot of training examples.
 is computationally heavy if the training dataset is very large.
 Stochastic gradient descent
 computes the gradient using one training example at each update iteration.
 is computationally light.
 is able to find a plausible solution.
 is suitable for deep learning architectures.
 The direction of the batch gradient is the direction of steepest descent.
 The direction of a stochastic gradient is NOT necessarily the direction of steepest descent.
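The contrast between the two methods can be sketched on a toy one-parameter regression problem (made-up data; the learning rates and step counts are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=100)   # toy inputs
y = 3.0 * X                # targets generated by the true weight w* = 3

# Batch gradient descent: every update uses ALL examples.
w_batch = 0.0
for _ in range(100):
    grad = np.mean(2 * (w_batch * X - y) * X)   # gradient of the mean squared error
    w_batch -= 0.1 * grad

# Stochastic gradient descent: every update uses ONE example.
w_sgd = 0.0
for i in rng.integers(0, len(X), size=1000):
    grad = 2 * (w_sgd * X[i] - y[i]) * X[i]     # noisy single-example gradient
    w_sgd -= 0.01 * grad
```

Each stochastic step is much cheaper, and over many noisy steps w_sgd still reaches a plausible solution near 3.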
Helping SGD
 Input
 Zero mean: mean(x_i) = 0
 Equal variance: var(x_i) = var(x_j) for all i, j
 Weight initialization
 Zero mean: mean(w) = 0
 Equal variance: Var(w) = σ², with a small σ
Improving Gradient Descent Optimization
 Two methods are introduced to improve gradient descent optimization.
 Momentum
 Learning rate decay
 Momentum
 Momentum uses an accumulated gradient, a weighted running average of previous gradients (e.g., M ← 0.9 M + current gradient), instead of only the current gradient.
 Learning rate decay
 Learning rate decay decreases the learning rate as the number of update steps increases.
 Learning rate decay is a simple form of an adaptive learning rate.
 Sometimes increasing the learning rate is required, for example, to escape a plateau.
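Both techniques can be combined in a small sketch (the coefficients 0.9, 0.999 and the quadratic objective are illustrative assumptions, not the course's settings):

```python
def momentum_descent(grad_fn, w0, lr=0.5, mu=0.9, decay=0.999, steps=300):
    # M accumulates a weighted average of previous gradients (momentum);
    # lr shrinks a little each step (learning rate decay).
    w, M = w0, 0.0
    for _ in range(steps):
        M = mu * M + grad_fn(w)
        w = w - lr * M
        lr = lr * decay
    return w

# Minimize f(w) = (w - 2)^2, whose gradient is 2 * (w - 2); minimum at w = 2.
w = momentum_descent(lambda w: 2 * (w - 2.0), w0=10.0)
```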
Learning Rate Tuning
 Advice: Always start with a small learning rate.
Hyperparameters
 Initial learning rate
 Parameters for learning rate decay
 Parameters for momentum
 Batch size
 Parameter for weight initialization
ADAGRAD (ADAptive subGRADient method)
 ADAGRAD is a gradient descent optimization method that handles some of the hyperparameters above implicitly: it determines momentum and the learning rate adaptively from the loss/error surface.
 The remaining hyperparameters to tune are the initial learning rate, the batch size, and the weight initialization.
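The core ADAGRAD update can be sketched as follows: each parameter's step is scaled by the inverse square root of its accumulated squared gradients (the learning rate, objective, and step count are illustrative assumptions):

```python
import numpy as np

def adagrad(grad_fn, w0, lr=0.5, eps=1e-8, steps=500):
    # G accumulates squared past gradients per parameter; dividing by
    # sqrt(G) gives each parameter its own decaying learning rate.
    w = np.array(w0, dtype=float)
    G = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        G += g ** 2
        w -= lr * g / (np.sqrt(G) + eps)
    return w

# Minimize f(w) = ||w - 1||^2; gradient 2 * (w - 1), minimum at [1, 1].
w = adagrad(lambda w: 2 * (w - 1.0), [5.0, -3.0])
```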
Lesson 5: Deep Models for Text and Sequences
Semantic Ambiguity
 Important idea: Similar words tend to occur in similar contexts.
Word2vec
 Context
 window
 word → one-hot encoding → embedding → linear layer → softmax → word in the window
 Use sampled softmax to compute the cross-entropy, because the length of the softmax output vector is the number of words in the vocabulary.
 Sampled softmax: keep the element that is 1 and randomly sample a few of the elements that are 0. Then use only the chosen elements for learning.
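The sampling step can be sketched as follows (a rough illustration of the idea only; the vocabulary size, target index, and sample count are made-up values):

```python
import numpy as np

rng = np.random.default_rng(3)
vocab_size = 10000
target = 42        # hypothetical index of the element that is 1
num_sampled = 5    # how many of the 0-elements to sample

# Sample a few distinct negative (0) indices, excluding the target.
negatives = []
while len(negatives) < num_sampled:
    i = int(rng.integers(vocab_size))
    if i != target and i not in negatives:
        negatives.append(i)

# Only these indices take part in the softmax / cross-entropy update,
# instead of all vocab_size outputs.
candidate_indices = [target] + negatives
```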
Recurrent Neural Networks, RNNs
 To learn the weights of an RNN, gradients are backpropagated through previous time steps. However, all of these gradients are with respect to the same shared weights, so the updates are correlated.
 These correlated updates in stochastic gradient descent cause the exploding or vanishing gradients problem.
The Exploding Gradients Problem
[picture]
 The deeper the backpropagation goes, the more the gradients explode.
 Solution: Gradient clipping
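Gradient clipping can be sketched in a few lines (a common norm-based variant; max_norm = 5.0 is an assumed threshold, not from the course):

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    # Rescale the gradient if its norm exceeds max_norm;
    # the direction is kept, only the magnitude is capped.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = clip_by_norm(np.array([30.0, 40.0]))  # norm 50, rescaled to norm 5
```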
The Vanishing Gradient Problem
[picture]
 The weights are only learned from the end of the sequence; information from early time steps is lost.
 Solution: LSTM (Long Short-Term Memory)
 The weight matrix in the RNN is replaced with an LSTM cell. The rest of the structure remains the same.
Unsolved Questions
 How to backpropagate in a recurrent neural network.