MLE and MAP
A good video here explains the topic.
Assume is a sample follow independent and identical distribution
Maximum Likelihood Estimation:
,
- Deep Learning uses MLE for we do not know the real distribution of the dataset
Maximum A Posteriori
,
For is not related to ,
- MAP usually used in Bayisan Machine Learning
The difference is the prior probability
Gradient Descent
Let's say your hypothesis function contains multiple parameters, defined as , , .
Your cost function takes both your model output and your ground truth, e.g.:
Then we need to calculate the partial deriative of the cost function with respect to the each , from 0 to n.
Then we can simultaneously update the
derivative of a matrix-matrix product
Backpropagation
https://ml-cheatsheet.readthedocs.io/en/latest/backpropagation.html
A neural network propagates the signal of the input data forward through its parameters towards the moment of decision, and then backpropagates information about the error through the network so that it can alter the parameters one step at a time.
Chain Rule
As seen above, foward propagation can be viewed as a long series of nested equations. If you think of feed forward this way, then backpropagation is merely an application the Chain rule to find the Derivatives of cost with respect to any variable in the nested equation. Given a forward propagation function:
A, B, and C are activation functions at different layers. Using the chain rule we easily calculate the derivative of with respect to :
How about the derivative with respect to B? To find the derivative with respect to B you can pretend is a constant, replace it with a placeholder variable B, and proceed to find the derivative normally with respect to B.
This simple technique extends to any variable within a function and allows us to precisely pinpoint the exact impact each variable has on the total output.
Applying the chain rule
Let's use the chain rule to calculate the derivative of cost with respect to any weight in the network. The chain rule will help us identify how much each weight contributes to our overall error and the direction to update each weight to reduce our error. Here are the equations we need to make a prediction and calculate total error, or cost:
Given a network consisting of a single neuron, total cost could be calculated as:
Using the chain rule we can easily find the derivative of Cost with respect to weight W.
Example
NOTE:
Batch gradient descent
Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function to the parameters for the entire training dataset.
for i in range(nb_epochs):
params_grad = evaluate_gradient(loss_function, data, params)
params = params - learning_rate * params_grad
Stochastic gradient descent
Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example and label . Note that we shuffle the training data at every epoch.
SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily as in Image 1.
for i in range(nb_epochs):
np.random.shuffle(data)
for example in data:
params_grad = evaluate_gradient(loss_function, example, params)
params = params - learning_rate * params_grad
- Efficient when update gradient
- Slow converge speed, need good learning rate
Mini-batch gradient descent
Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of n training examples. Common mini-batch sizes range between 50 and 256, but can vary for different applications.
for i in range(nb_epochs):
np.random.shuffle(data)
for batch in get_batches(data, batch_size=50):
params_grad = evaluate_gradient(loss_function, batch, params)
params = params - learning_rate * params_grad
SGD momentum
Momentum is a mehod that helps accelerate SGD in the relevant direction and dampends oscillations seen in the image below.
Essentially, when using momentum, we push a ball down a hill. The ballaccumulates momentum as it rolls downhill, becoming faster and faster onthe way (until it reaches its terminal velocity if there is air resistance, i.e. ). The same thing happens to our parameter updates: The momentumterm increases for dimensions whose gradients point in the same directionsand reduces updates for dimensions whose gradients change directions. Asa result, we gain faster convergence and reduced oscillation.
ADAM
Adaptive Moment Estimation (ADAM) is a method that computes adaptive learning rate for each parameter. In addition to storing an exponentially decaying averagae of past squared gradients like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients , similar to momentum. Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.
- and are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively.
- Compute bias-corrected first moment estimate and bias-corrected second raw moment estimate.
The authors propose default values of 0.9 for , 0.999 for , and for .