##### Lecture 6 | Training Neural Networks I

###### Sigmoid

- Problems of the sigmoid activation function
- Problem 1: Saturated neurons kill the gradients.
- Problem 2: Sigmoid outputs are not zero-centered.
- Suppose a feed-forward network has hidden layers and every activation function is sigmoid.
- Then every layer except the first receives only positive inputs.
- If all of a neuron's inputs are positive, the gradients on its weights all share one sign (all positive or all negative), because each weight gradient is the upstream gradient times a positive input.
- With the gradient signs constrained like this, the possible update directions are severely restricted (zig-zag updates). Problems 1 and 2 are illustrated in the sketch after this list.

- Problem 3: Computing exp() is somewhat expensive. – (a minor problem)
- Modern numerical routines compute exp() efficiently, so this is only a minor concern in practice.
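
A minimal NumPy sketch (my own illustration, not code from the lecture) of problems 1 and 2: the sigmoid gradient is at most 0.25 and vanishes as |x| grows, and the output always lies in (0, 1), so it is never zero-centered.

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + exp(-x)); the output is in (0, 1), never zero-centered.
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); at most 0.25 (at x = 0).
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}   sigmoid = {sigmoid(x):.4f}   grad = {sigmoid_grad(x):.6f}")
# The local gradient shrinks toward 0 as |x| grows: a saturated neuron "kills"
# the gradient flowing back through it.
```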

###### tanh (hyperbolic tangent)

- Zero-centered
- Problem 2 is solved.

- Problems 1 and 3 remain: tanh still saturates, and it still uses exp() (see the sketch below).
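
A quick check (my own sketch) that tanh fixes the zero-centering issue but still saturates:

```python
import numpy as np

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
t = np.tanh(x)        # outputs lie in (-1, 1) and are zero-centered
grad = 1.0 - t ** 2   # d/dx tanh(x) = 1 - tanh(x)^2; still ~0 when |x| is large
print(t)              # approx [-1.00, -0.76, 0.00, 0.76, 1.00]
print(grad)           # approx [ 0.00,  0.42, 1.00, 0.42, 0.00]
```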

###### ReLU (rectified linear unit)

- Problem 1 is solved in the positive region: ReLU does not saturate for positive inputs.
- It is also argued to be more biologically plausible than sigmoid; the details were not covered in this lecture.
- AlexNet used ReLU.
- Problems
- Problem 1: Not zero-centered
- Because ReLU outputs are non-negative, each downstream weight gradient is either zero or shares the sign of the upstream gradient.
- The update direction is therefore restricted, which makes optimization inefficient, just as in the sigmoid case.

- Problem 2: dead ReLUs
- Up to roughly 20% of units can end up never activating and therefore never updating; these are called dead ReLUs (see the sketch below).

- Initialization
- People like to initialize ReLU neurons with a slightly positive bias (e.g., 0.01) to make them more likely to be active at the start of training.
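
A small sketch (my own, not from the lecture) of the ReLU forward pass and its local gradient, showing why a unit whose pre-activation stays negative never updates:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Local gradient: 1 where x > 0, exactly 0 elsewhere (no saturation for x > 0).
    return (x > 0).astype(x.dtype)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # approx [0.0, 0.0, 0.0, 0.5, 3.0]
print(relu_grad(x))  # [0.0, 0.0, 0.0, 1.0, 1.0]
# A unit whose pre-activation is negative for every training example receives
# zero gradient on every example and never updates: a "dead ReLU".
```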

- Leaky ReLU
- PReLU (Parametric Rectifier)
- ELU (Exponential Linear Unit)
- Roughly in between ReLU and leaky ReLU: it has a negative regime like leaky ReLU, but that regime saturates instead of staying linear. (The three variants are sketched below.)
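
The three variants, as a NumPy sketch (my own summary; the default alpha values are common choices, not values prescribed in the lecture):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # f(x) = x for x > 0, alpha * x otherwise; the small fixed slope keeps a
    # nonzero gradient in the negative region, so units cannot "die".
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    # Same form as leaky ReLU, but alpha is a learned parameter.
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # f(x) = x for x > 0, alpha * (exp(x) - 1) otherwise; the negative regime
    # saturates smoothly instead of staying linear.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```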

###### Maxout

- Nonlinear
- A generalized form of ReLU and leaky ReLU: it outputs the max of two linear functions of the input.
- Benefits
- Linear regimes
- Its output does not saturate.
- Its gradient does not die.

- Drawback
- Doubles the number of parameters per neuron (see the sketch below).
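
A minimal sketch (my own) of a maxout unit, showing both the generalization of ReLU and the doubled parameter count:

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    # Elementwise max of two affine maps: max(x W1 + b1, x W2 + b2).
    # ReLU is the special case W2 = 0, b2 = 0; with zero biases,
    # leaky ReLU corresponds to W2 = alpha * W1.
    return np.maximum(x @ W1 + b1, x @ W2 + b2)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10))           # batch of 4, input dimension 10
W1, W2 = rng.standard_normal((2, 10, 5))   # two weight matrices: double the parameters
b1, b2 = np.zeros(5), np.zeros(5)
print(maxout(x, W1, b1, W2, b2).shape)     # (4, 5)
```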

###### In practice

- Use ReLU first.
- Try out Leaky ReLU, Maxout, and ELU.
- Try out tanh but don’t expect much.
- Don’t use sigmoid.