Derivative, Gradient and Jacobian¶
Run Jupyter Notebook
You can run the code for this section in this Jupyter notebook link.
Simplified Equation¶
- This is the simplified equation we have been using to describe how we update our parameters to reach good values (a good local or global minimum)
- \(\theta = \theta - \eta \cdot \nabla_\theta\)
- \(\theta\): parameters (our tensors with gradient accumulation abilities)
- \(\eta\): learning rate (how fast we want to learn)
- \(\nabla_\theta\): gradients of loss with respect to the model's parameters
- An even simpler equation in plain English (see the sketch after this list):
parameters = parameters - learning_rate * parameters_gradients
- This process can be broken down into 2 sequential parts
- Backpropagation
- Gradient descent
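
To make the plain-English version concrete, here is a minimal sketch of this update rule in NumPy; the parameter values, gradients, and learning rate are made-up numbers purely for illustration.

```python
import numpy as np

# Made-up values purely for illustration
parameters = np.array([0.5, -1.2, 3.0])            # our current parameters
parameters_gradients = np.array([0.1, -0.4, 2.0])  # gradients of the loss w.r.t. each parameter
learning_rate = 0.01                               # eta: how fast we want to learn

# parameters = parameters - learning_rate * parameters_gradients
parameters = parameters - learning_rate * parameters_gradients
print(parameters)  # approximately [0.499, -1.196, 2.98]
```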
Simplified Equation Breakdown¶
- Our simplified equation can be broken down into 2 parts
- Backpropagation: getting our gradients
- Our partial derivatives of loss (scalar number) with respect to (w.r.t.) our model's parameters and w.r.t. our input
- Backpropagation gets us \(\nabla_\theta\) which is our gradient
- Gradient descent: using our gradients to update our parameters
- The terms backpropagation and gradient descent are often used interchangeably, but they are two distinct steps (see the sketch after this list)
- Gradient descent uses the gradients obtained from backpropagation to update our weights
- Gradient descent: \(\theta = \theta - \eta \cdot \nabla_\theta\)
- Backpropagation: computing our gradients \(\nabla_\theta\)
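
Here is a minimal PyTorch sketch of these two steps on a toy scalar loss; the loss \(L(\theta) = \theta^2\) and the numbers are assumptions chosen only to illustrate the separation between backpropagation and gradient descent.

```python
import torch

# A single parameter tensor with gradient accumulation abilities (theta)
theta = torch.tensor([2.0], requires_grad=True)
eta = 0.1  # learning rate

# A toy scalar loss: L(theta) = theta^2
loss = (theta ** 2).sum()

# 1. Backpropagation: compute gradients of the loss w.r.t. theta
loss.backward()
print(theta.grad)  # tensor([4.]) since dL/dtheta = 2 * theta

# 2. Gradient descent: use the gradients to update theta
with torch.no_grad():
    theta -= eta * theta.grad
print(theta)  # tensor([1.6000], requires_grad=True)
```

In practice, optimizers such as `torch.optim.SGD` wrap this update step for us, but conceptually they perform exactly this subtraction.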
Steps¶
- Derivatives
- Partial Derivatives
- Gradients
- Gradient, Jacobian and Generalized Jacobian Differences
- Backpropagation: computing gradients
- Gradient descent: using gradients to update parameters
Derivative¶
- Given a simple cubic equation: \(f(x) = 2x^3 + 5\)
- Calculating the derivative \(\frac{df(x)}{dx}\) is simply calculating the change in the value of \(y\) for an extremely small (infinitesimal) change in the value of \(x\), where that small change is frequently labelled \(h\)
- \(\frac{df(x)}{dx} = \displaystyle{\lim_{h \to 0}} \frac{f(x + h) - f(x)}{h}\)
- \(\frac{f(x + h) - f(x)}{h}\) is the slope formula similar to what you may be familiar with:
- Change in \(y\) over change in \(x\): \(\frac{\Delta y}{\Delta x}\)
- And the derivative is the slope when \(h \rightarrow 0\), in essence a super teeny small \(h\)
- Let's break down \(\frac{df}{dx} = \displaystyle{\lim_{h \to 0}} \frac{f(x + h) - f(x)}{h}\)
- \(\displaystyle{\lim_{h \to 0} \frac{f(x + h) - f(x)}{h}}\)
- \(\displaystyle{\lim_{h \to 0} \frac{(2(x+h)^3 + 5) - (2x^3 + 5)}{h}}\)
- \(\displaystyle{\lim_{h \to 0} \frac{2(x^2 + 2xh + h^2)(x+h) - 2x^3}{h}}\)
- \(\displaystyle{\lim_{h \to 0} \frac{2(x^3 + 3x^2h + 3xh^2 + h^3) - 2x^3}{h}}\)
- \(\displaystyle{\lim_{h \to 0} \frac{6x^2h + 6xh^2 + 2h^3}{h}}\)
- \(\displaystyle{\lim_{h \to 0} \left(6x^2 + 6xh + 2h^2\right)} = 6x^2\)
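
We can sanity-check this result numerically by plugging a small \(h\) into the slope formula; the choice of \(x = 2\) and \(h = 10^{-6}\) below is arbitrary and purely illustrative.

```python
def f(x):
    return 2 * x ** 3 + 5

x = 2.0
h = 1e-6

# Slope formula (f(x + h) - f(x)) / h with a very small h
numerical_derivative = (f(x + h) - f(x)) / h

# Analytical derivative 6x^2
analytical_derivative = 6 * x ** 2

print(numerical_derivative)   # ~24.000012, approaches 24 as h -> 0
print(analytical_derivative)  # 24.0
```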
Partial Derivative¶
- Ok, it is simple to calculate our derivative when we have only one variable in our function.
- If we have more than one (as with the parameters in our models), we need to calculate the partial derivatives of our function with respect to each variable
- Given a simple equation \(f(x, z) = 4x^4z^3\), let us get our partial derivatives
- 2 parts: partial derivative of our function w.r.t. x and z
- Partial derivative of our function w.r.t. \(x\): \(\frac{\partial f(x, z)}{\partial x}\)
- Let the \(z\) term be a constant, \(a\)
- \(f(x, z) = 4x^4a\)
- \(\frac{\partial f(x, z)}{\partial x} = 16x^3a\)
- Now we substitute \(a\) with our \(z\) term, \(a = z^3\)
- \(\frac{\partial f(x, z)}{\partial x} = 16x^3z^3\)
- Partial derivative of our function w.r.t. \(z\): \(\frac{\partial f(x, z)}{\partial z}\)
- Let the \(x\) term be a constant, \(a\)
- \(f(x, z) = 4az^3\)
- \(\frac{\partial f(x, z)}{\partial z} = 12az^2\)
- Now we substitute \(a\) with our \(x\) term, \(a = x^4\)
- \(\frac{\partial f(x, z)}{\partial z} = 12x^4z^2\)
- Ta da! We made it: we calculated the partial derivatives of our function w.r.t. the different variables (a quick check in code follows this list)
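
As a quick check, a computer algebra system gives the same partial derivatives; this is a minimal sketch assuming SymPy is installed.

```python
import sympy as sp

# Symbolic variables and the function f(x, z) = 4x^4 z^3
x, z = sp.symbols('x z')
f = 4 * x**4 * z**3

# Partial derivatives w.r.t. x and z
print(sp.diff(f, x))  # 16*x**3*z**3
print(sp.diff(f, z))  # 12*x**4*z**2
```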
Gradient¶
- We can now put all our partial derivatives into a vector of partial derivatives
- Also called "gradient"
- Represented by \(\nabla_{(x,z)}\)
- \(\nabla_{(x,z)} = \begin{bmatrix} \frac{\partial f(x,z)}{\partial x} \\ \frac{\partial f(x,z)}{\partial z} \end{bmatrix} = \begin{bmatrix} 16x^3z^3 \\ 12x^4z^2 \end{bmatrix}\)
- It is critical to note that the term gradient applies for \(f : \mathbb{R}^N \rightarrow \mathbb{R}\)
- Where our function maps a vector input to a scalar output; in deep learning, this is our loss function, which produces a scalar loss (see the sketch below)
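
A minimal PyTorch sketch that recovers this gradient with autograd, using the arbitrary point \(x = 1\), \(z = 2\):

```python
import torch

# Arbitrary example point: x = 1, z = 2
x = torch.tensor(1.0, requires_grad=True)
z = torch.tensor(2.0, requires_grad=True)

# f(x, z) = 4 * x^4 * z^3, a scalar output
f = 4 * x**4 * z**3

# Backpropagation fills x.grad and z.grad with the partial derivatives
f.backward()

print(x.grad)  # tensor(128.) since 16 * x^3 * z^3 = 16 * 1 * 8
print(z.grad)  # tensor(48.)  since 12 * x^4 * z^2 = 12 * 1 * 4
```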
Gradient, Jacobian, and Generalized Jacobian¶
In the case where we have non-scalar outputs, these are the right terms for the matrices or tensors containing our partial derivatives
- Gradient: vector input to scalar output
- \(f : \mathbb{R}^N \rightarrow \mathbb{R}\)
- Jacobian: vector input to vector output
- \(f : \mathbb{R}^N \rightarrow \mathbb{R}^M\)
- Generalized Jacobian: tensor input to tensor output
- In this case, a tensor can have any number of dimensions
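
For the vector-to-vector case, here is a minimal sketch assuming a PyTorch version that provides `torch.autograd.functional.jacobian`; the function \(f : \mathbb{R}^3 \rightarrow \mathbb{R}^2\) below is an arbitrary example.

```python
import torch
from torch.autograd.functional import jacobian

# f: R^3 -> R^2, a vector input to vector output function
def f(v):
    return torch.stack([v[0] * v[1], v[1] + v[2]])

v = torch.tensor([1.0, 2.0, 3.0])

# The Jacobian is a 2 x 3 matrix of partial derivatives
J = jacobian(f, v)
print(J)
# tensor([[2., 1., 0.],
#         [0., 1., 1.]])
```

When the output is a scalar, this matrix collapses to a single row, which is just the gradient from the previous section.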
Summary¶
We've learnt to...
Success
- Calculate derivatives
- Calculate partial derivatives
- Get gradients
- Differentiate the concepts amongst gradients, Jacobian and Generalized Jacobian
Now it is time to move on to backpropagation and gradient descent for a simple 1 hidden layer FNN with all these concepts in mind.
Citation¶
If you have found these useful in your research, presentations, school work, projects or workshops, feel free to cite using this DOI.