Deep Learning | Udacity

Brief Information

Course Overview

  • Lesson 1: From Machine Learning to Deep Learning
  • Lesson 2: Assignment: notMNIST
  • Lesson 3: Deep Neural Networks
  • Lesson 4: Convolutional Neural Networks
  • Lesson 5: Deep Models for Text and Sequences
  • Lesson 6: Software and Tools
  • Lesson 7: Build a Live Camera App

Lesson 1: From Machine Learning to Deep Learning

Key words

softmax, one-hot encoding, cross-entropy

One-hot Encoding
  • One-hot encoding represents n categories or classes as an n-dimensional vector v whose elements are 0 or 1, with exactly one element equal to 1.
    • v \in \{0,1\}^n and \sum_{i} v_i = 1
  • For example, if there are three classes (red, green, and blue), then we encode red as (1,0,0), green as (0,1,0), and blue as (0,0,1).
  • Exactly one element is 1 and the others are 0 so that the vector can be read as a probability distribution: each element is a probability and the elements sum to 1. This lets us compute the cross-entropy between the softmax output and the one-hot label. Suppose the label of some example is red. Then (1,0,0) is the true probability distribution of the example over the three colors: P(red)=1, P(green)=0, and P(blue)=0.
  • Cross-entropy is a measure of the difference between two probability distributions.
  • H(L,S)=D(S,L)=\sum_{i}^{nclass}{L_i \cdot(-\textup{log}(s_i))}=-\sum_{i}^{nclass}{L_i \cdot\textup{log}(s_i)}
    • S: Softmax output. The predicted probability distribution.
    • L: One-hot encoded Label. True probability distribution.
    • “The cross-entropy of {L} and {S}”
  • Suppose L_k = 1. Then, D(S,L)=-L_k \cdot\textup{log}(s_k)=-\textup{log}(s_k). As s_k gets closer to 1, D(S,L) decreases toward 0.
  • For s_k to get closer to 1, the sum of the other s_i (i\neq k) must decrease, because the softmax outputs sum to 1.
    • Once s_k > 0.5, class k is guaranteed to be the answer of the softmax (the class with the largest output).
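The definitions above can be sketched in a few lines of NumPy. This is only an illustration: the function names and the example scores (2.0, 1.0, 0.1) are assumptions made for the sketch, not values from the course.

```python
import numpy as np

def one_hot(index, n_classes):
    """Return a vector of zeros with a single 1 at position `index`."""
    v = np.zeros(n_classes)
    v[index] = 1.0
    return v

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # shifting does not change the result
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(S, L):
    """D(S, L) = -sum_i L_i * log(s_i)."""
    return -np.sum(L * np.log(S))

classes = ["red", "green", "blue"]
L = one_hot(classes.index("red"), len(classes))  # label: red
S = softmax(np.array([2.0, 1.0, 0.1]))           # hypothetical scores
print(L, S, cross_entropy(S, L))
```

Because L is one-hot, the sum reduces to -log(s_k) for the labeled class k, so a more confident correct prediction gives a smaller value.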
Loss function of average cross-entropy
  • Loss = Average cross-entropy over training examples: L(w,b)=\frac{1}{N}\sum_{j}^{N}D(S(wx^{(j)}+b),L^{(j)})
    • (x^{(j)},L^{(j)}): a training example j (j=1,2,...,N)
    • N: the number of training examples
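A minimal sketch of this loss, assuming the N examples are stored as the rows of a matrix `X` (so the score is computed as `X @ w + b` rather than wx + b) and the labels are one-hot rows. The shapes and the toy data are assumptions of the sketch.

```python
import numpy as np

def softmax_rows(scores):
    """Apply softmax to every row of a (N, n_class) score matrix."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def average_cross_entropy(w, b, X, labels):
    """Mean of D(S(x w + b), L) over all training examples.

    X: (N, d) inputs, w: (d, n_class), b: (n_class,),
    labels: (N, n_class) one-hot rows.
    """
    S = softmax_rows(X @ w + b)
    per_example = -np.sum(labels * np.log(S), axis=1)
    return per_example.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))          # four toy examples, two features
labels = np.eye(3)[[0, 1, 2, 0]]     # one-hot labels for three classes
w = np.zeros((2, 3))
b = np.zeros(3)
loss = average_cross_entropy(w, b, X, labels)
print(loss)
```

With all-zero weights every class gets probability 1/3, so the loss equals log(3); training should push it below that.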
Numerical Stability


  • The result should be 1 but is not.

What if val = 1?

  • The result is very close to 1.
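The kind of experiment these remarks describe can be sketched as follows. The concrete numbers (a starting value of one billion, an increment of 1e-6, one million additions) are assumptions chosen for the sketch; `val` is the starting value.

```python
def add_small_values(val, small=1e-6, times=1_000_000):
    """Add `small` to `val` many times, then subtract `val` again.

    In exact arithmetic the result is small * times = 1.
    """
    total = val
    for _ in range(times):
        total += small
    return total - val

print(add_small_values(1_000_000_000.0))  # noticeably different from 1
print(add_small_values(1.0))              # very close to 1
```

When `val` is huge, each tiny increment is rounded to the nearest value that floating point can represent near `val`, and the rounding errors accumulate. This is one reason to keep the values involved in the loss computation from growing too large or too small.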
Normalized Inputs and Initial Weights
  • Methods to normalize input
    • Subtract the mean -> divide by the standard deviation -> normalization is done
    • Subtract the mean -> divide by [(max-min) / 2] -> normalization is done
  • Initialize weights
    • Zero mean: Mean(w_1)=Mean(w_2)=...=Mean(w_m)=0
    • Equal variance: \sigma^2=Var(w_1)=Var(w_2)=...=Var(w_m)
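Both normalization methods and the weight initialization can be sketched like this. Drawing the weights from a Gaussian with a small `sigma` is an assumption of the sketch (the notes only ask for zero mean and equal variance), and the pixel values are made up.

```python
import numpy as np

def normalize_by_std(X):
    """Subtract the per-feature mean, then divide by the standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def normalize_by_half_range(X):
    """Subtract the per-feature mean, then divide by (max - min) / 2."""
    half_range = (X.max(axis=0) - X.min(axis=0)) / 2.0
    return (X - X.mean(axis=0)) / half_range

def init_weights(n_in, n_out, sigma=0.1, seed=0):
    """Zero-mean weights that all share the same variance sigma**2."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=sigma, size=(n_in, n_out))

pixels = np.array([[0.0, 255.0], [128.0, 64.0], [255.0, 0.0]])
by_std = normalize_by_std(pixels)
by_range = normalize_by_half_range(pixels)
weights = init_weights(2, 3)
print(by_std, by_range, weights, sep="\n")
```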
Measuring Performance
  • Training
  • Validation
  • Testing

Lesson 3: Deep Neural Networks

Linear Models
  • Advantages
    • Linear models are stable: a small change in the input can only produce a bounded, proportionally small change in the output.
    • Derivatives of linear models are very simple to compute.
  • Disadvantages
    • Two stacked linear layers are equivalent to a single linear layer, so stacking alone adds no expressive power. A nonlinearity is required for a model to perform complex tasks.
Rectified Linear Units (ReLUs)
  • ReLUs are introduced to overcome the limits of linear models. A ReLU outputs max(0, x), which is about the simplest possible nonlinearity.
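The claim about stacked linear layers can be checked numerically. In this sketch the layer sizes and random weights are arbitrary assumptions; the point is only that W2(W1 x + b1) + b2 equals a single layer with weights W2 W1 and bias W2 b1 + b2.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two linear layers applied one after the other.
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping written as one linear layer.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b
assert np.allclose(two_layers, one_layer)

def relu(z):
    """Rectified linear unit: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

# With a ReLU in between, the network is no longer a single linear map.
with_relu = W2 @ relu(W1 @ x + b1) + b2
print(two_layers, with_relu)
```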
Training a Deep Learning Network
  • A deep network model works well with a large amount of data.
  • Choose a model with more than enough capacity to describe the given dataset.
    • → The model tends to overfit the dataset.
    • → We have to discourage overfitting so that the model generalizes.
  • Methods of generalization
    • Early stopping (early termination)
      • Choose parameters of a model when its validation error is lowest.
    • Regularization
  • Regularization is a process of introducing additional information in order to prevent overfitting. – From Wikipedia(en)
  • Methods of regularization
    • L2 regularization
    • Dropout
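Early stopping, as listed above, can be sketched as a scan over the validation errors recorded during training. The `patience` parameter (how many steps without improvement to tolerate) and the error values are assumptions of this sketch, not part of the notes.

```python
def early_stopping_step(validation_errors, patience=3):
    """Return (step, error) of the lowest validation error seen before
    training is stopped for lack of improvement."""
    best_step, best_error = 0, float("inf")
    for step, error in enumerate(validation_errors):
        if error < best_error:
            best_step, best_error = step, error
        elif step - best_step >= patience:
            break  # no improvement for `patience` steps: stop training
    return best_step, best_error

# Hypothetical validation errors, one per evaluation step.
errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55, 0.30]
print(early_stopping_step(errors))
```

In practice the model parameters saved at the returned step would be the ones kept. Note that stopping early means the later value 0.30 is never observed, which is the usual trade-off when choosing the patience.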
L2 regularization
  • Loss'=Loss+\frac{1}{2}\lambda ||w||_2^2
Dropout
  • At each training step, a randomly chosen p% of the activations in a layer are set to 0.
  • Effect: The network cannot rely on any single activation, so the learned parameters depend less strongly on the particulars of the training dataset. This helps prevent overfitting.
  • If dropout does not work for you, you should use a bigger network.
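Both regularizers can be sketched briefly. Two assumptions are made here: `p` is given as a fraction (0.5 for 50%) rather than a percentage, and the surviving activations are rescaled by 1 / (1 - p) during training so that their expected value is unchanged (often called inverted dropout).

```python
import numpy as np

def l2_regularized_loss(loss, weights, lam):
    """Loss' = Loss + (1/2) * lambda * ||w||_2^2."""
    return loss + 0.5 * lam * np.sum(weights ** 2)

def dropout(activations, p, rng, training=True):
    """Zero out roughly a fraction p of the activations during training."""
    if not training:
        return activations  # at evaluation time nothing is dropped
    keep_mask = rng.random(activations.shape) >= p
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
penalized = l2_regularized_loss(1.0, np.array([3.0, 4.0]), lam=0.1)
dropped = dropout(np.ones(10), p=0.5, rng=rng)
print(penalized, dropped)
```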
Validation Set Size
  • If the validation set was not well sampled, then
    • apply the cross-validation method, or
    • get a lot more data.
  • Cross-validation takes a long time to compute.
Stochastic Gradient Descent (SGD)
  • Batch gradient descent
    • computes the gradient using all training examples at each update iteration.
    • is not suitable for deep learning architectures, which require a lot of training examples.
    • is computationally heavy if the training data set is very large.
  • Stochastic gradient descent
    • computes the gradient using one training example (or a small random sample) at each update iteration.
    • is computationally light.
    • is able to find a plausible solution.
    • is suitable for deep learning architectures.
  • A batch gradient step moves in the direction of steepest descent of the training loss.
  • A stochastic gradient step is only a noisy estimate of that direction, so it is generally NOT the direction of steepest descent.
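The contrast can be sketched on a toy least-squares problem. Everything concrete here (the noiseless linear data, the squared-error loss, the learning rates and step counts) is an assumption chosen so the example runs quickly; it is not the course's setup.

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of 0.5 * mean((X w - y)^2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

def batch_gd(X, y, lr=0.1, steps=100):
    """Every update uses all training examples."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * gradient(w, X, y)
    return w

def sgd(X, y, rng, lr=0.05, steps=2000):
    """Every update uses one randomly chosen training example."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        j = rng.integers(len(y))
        w -= lr * gradient(w, X[j:j + 1], y[j:j + 1])
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w                      # noiseless targets

batch_w = batch_gd(X, y)
sgd_w = sgd(X, y, rng)
print(batch_w, sgd_w)
```

Each SGD step touches a single row of `X`, so it is far cheaper than a batch step, at the price of needing many more (noisier) steps.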
Helping SGD
  • Input
    • Zero mean: Mean(x_1)=Mean(x_2)=...=Mean(x_m)=0
    • Equal variance: \sigma^2=Var(x_1)=Var(x_2)=...=Var(x_m)
  • Weight initialization
    • Zero mean: Mean(w_1)=Mean(w_2)=...=Mean(w_m)=0
    • Equal variance: \sigma^2=Var(w_1)=Var(w_2)=...=Var(w_m)
Improving Gradient Descent Optimization
  • Two methods are introduced to improve gradient descent optimization.
    • Momentum
    • Learning rate decay
  • Momentum
    • Momentum uses the accumulated gradient, a running weighted average of previous gradients, instead of the current gradient alone.
  • Learning rate decay
    • Learning rate decay decreases the learning rate as the number of update steps increases.
    • Learning rate decay is one kind of adaptive learning rate.
      • Sometimes increasing the learning rate is required, for example, when we encounter a plateau that we need to escape.
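A minimal sketch of both ideas on the toy function f(w) = w^2. The exponential decay schedule, the momentum coefficient `beta=0.9`, and all other constants are assumptions of the sketch; other schedules and momentum variants exist.

```python
def exponential_decay(initial_lr, step, decay_rate=0.96, decay_steps=100):
    """Learning rate that shrinks smoothly as the step count grows."""
    return initial_lr * decay_rate ** (step / decay_steps)

def momentum_step(w, grad, velocity, lr, beta=0.9):
    """Update w with a running accumulation of past gradients."""
    velocity = beta * velocity + grad
    w = w - lr * velocity
    return w, velocity

# Minimize f(w) = w^2, whose gradient is 2w, starting from w = 5.
w, velocity = 5.0, 0.0
for step in range(200):
    lr = exponential_decay(0.1, step)
    w, velocity = momentum_step(w, 2.0 * w, velocity, lr)
print(w)
```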
Learning Rate Tuning

  • Advice: Always start with a small learning rate.
  • Hyperparameters to tune
    • Initial learning rate
    • Parameters for learning rate decay
    • Parameters for momentum
    • Batch size
    • Parameter for weight initialization
ADAGRAD (ADAptive subGRADient method)
  • ADAGRAD is a gradient descent optimization method that uses parameters such as:
    • Initial learning rate
    • Parameters for learning rate decay
    • Parameters for momentum
    • Batch size
    • Parameter for weight initialization
  • ADAGRAD adapts the learning rate (and, implicitly, momentum) to the shape of the loss/error surface, which makes training less sensitive to these hyperparameters.
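A sketch of the basic ADAGRAD update, assuming the common formulation in which each parameter's step is divided by the square root of its accumulated squared gradients. The toy loss, `lr`, and `eps` are assumptions of the sketch.

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.5, eps=1e-8):
    """One ADAGRAD update with a per-parameter effective learning rate."""
    accum = accum + grad ** 2                    # running sum of squared gradients
    w = w - lr * grad / (np.sqrt(accum) + eps)   # larger history -> smaller step
    return w, accum

# Toy loss f(w) = w0^2 + 100 * w1^2: very different curvature per coordinate.
w = np.array([1.0, 1.0])
accum = np.zeros_like(w)
for _ in range(200):
    grad = np.array([2.0 * w[0], 200.0 * w[1]])
    w, accum = adagrad_step(w, grad, accum)
print(w)
```

The division rescales each coordinate by its own gradient history, so the steep and the shallow direction make similar progress without a hand-tuned learning rate for each.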

Lesson 5: Deep Models for Text and Sequences

Semantic Ambiguity
  • Important idea: Similar words tend to occur in similar contexts.
  • Context
  • window
  • word -> embedded word (one-hot encoding) -> linear layer -> softmax -> embedded word in window
    • Use sampled softmax to compute the cross-entropy, because the softmax output vector has one element per word in the vocabulary, which makes the full computation expensive.
    • Sampled softmax: Pick the element of the label that is 1. Randomly sample some of the elements that are 0. Then use only the chosen elements for learning.
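The sampling idea can be sketched as follows. This is a simplified illustration: it samples the negative words uniformly and applies no correction for the sampling probabilities, both of which are assumptions of the sketch, and the vocabulary size and logits are made up.

```python
import numpy as np

def sampled_softmax_loss(logits, target, n_sampled, rng):
    """Cross-entropy over the target word plus a few sampled other words."""
    vocab_size = logits.shape[0]
    negatives = np.array([i for i in range(vocab_size) if i != target])
    sampled = rng.choice(negatives, size=n_sampled, replace=False)
    chosen = np.concatenate(([target], sampled))  # target is at position 0
    z = logits[chosen]
    z = z - z.max()                               # for numerical stability
    log_probs = z - np.log(np.exp(z).sum())       # log-softmax over the subset
    return -log_probs[0]

rng = np.random.default_rng(0)
logits = rng.normal(size=1000)   # hypothetical scores for a 1000-word vocabulary
loss = sampled_softmax_loss(logits, target=3, n_sampled=10, rng=rng)
print(loss)
```

Only 11 of the 1000 scores enter the computation here, which is where the savings come from when the vocabulary is large.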
Recurrent Neural Networks, RNNs
  • To learn the weights of an RNN, gradients are back-propagated through the previous time steps. However, all of these gradients are taken with respect to the same shared weights, which is called a correlated update.
  • Correlated updates are bad for stochastic gradient descent and cause the exploding or vanishing gradients problem.
The Exploding Gradients Problem


  • The deeper back-propagation goes through time, the more the gradient explodes.
  • Solution: Gradient clipping
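A minimal sketch of gradient clipping, assuming the common variant that rescales the gradient whenever its norm exceeds a threshold. The threshold `max_norm` and the example vectors are hypothetical.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Shrink grad so its L2 norm is at most max_norm; keep its direction."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

big = clip_by_norm(np.array([3.0, 4.0]), max_norm=1.0)    # norm 5 -> norm 1
small = clip_by_norm(np.array([0.3, 0.4]), max_norm=1.0)  # left unchanged
print(big, small)
```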

The Vanishing Gradient Problem


  • Gradients from distant time steps shrink toward zero, so the weights are effectively learned only from the end of the sequence.
  • Solution: LSTM (Long Short-Term Memory)
    • The weight in the RNN is replaced with an LSTM cell. The rest of the structure remains the same.

Unsolved Questions

  • How to back-propagate in a recurrent neural network.


