Automatic differentiation

Tora · November 19, 2020, 12:18am

[Work in progress]

Automatic differentiation (AD) is the act of differentiating a function with respect to one or multiple of its dependencies without the use of an approximation such as finite differences, complex step derivatives or quaternionic step derivatives. Given the derivatives of elementary functions (addition, multiplication, polynomials, exponentials, trigonometric functions, …) derivatives for composite functions can be obtained with AD.

Some make a distinction between symbolic differentiation and AD, however AD can be used symbolically too by using symbols instead of numerical values so such a distinction is not necessary.

Algorithms

There are two main ways of doing AD. In both ways we desire the derivative of an output of a function with respect to some of its inputs. In reverse mode one works backward using the chain rule as one usually does when calculating derivatives by hand. In forward mode the derivative of a result with respect to inputs is dragged along in the forward calculation. Both of them have advantages that will be explained below, however the main focus will lie on forward mode as it is most connected to geometric algebra.

Reverse mode

In reverse mode we start from the outer function and repeatedly apply the chain \frac{d y}{d x} = \frac{d y}{d z} \frac{d z}{d x} until we have the desired derivatives.

Example: \frac{d sin(x^2)}{dx} = \frac{d sin(x^2)}{d x^2} \frac{d x^2}{d x} = cos(x^2) 2x

For higher order derivatives, the procedure can be applied repeatedly to the obtained expression.

When used numerically instead of symbolically, storing the intermediate results is necessary to calculate all derivatives. This method is also often called backpropagation.

Forward mode

In forward mode we go the other direction and start from the inner function.

Single variable, first order derivative

Instead of just using the usual input x we add a dual number \epsilon that is defined to square to zero (ie. \epsilon^2 = 0) but can’t otherwise be simplified. The result of using such an input in the function will be that the real part is the usual result, and the part containing \epsilon is the derivative with respect to the element that we added \epsilon to in the input.

Example for f(x) = x^2

f(x + \epsilon) = (x + \epsilon)^2 = x^2 + 2 x \epsilon + \epsilon^2 = x^2 + 2 x \epsilon

and we can verify that the \epsilon part is 2x which is indeed \frac{d f(x)}{dx} = 2x.

Like before we can introduce definitions for the primitives which we can use to obtain the derivative of expressions involving them (eg. sin(x + x' \epsilon) = sin(x) + x' \epsilon cos(x))

Multiple variables

For multiple variables we can just introduce an \epsilon for each of the variables which are independent from each other but each square to 0. The output will contain a scalar part which is again the usual result, as well as terms containing different \epsilon that correspond to the different derivatives. The output will then also have terms that have mixed \epsilon from different variables which is the derivative with respect to the different variables (ie. not strictly speaking first order, but first order with respect to each variable individually).

Example for f(x, y) = x y^2, define \epsilon_x^2 = 0, \epsilon_y^2 = 0

\begin{aligned} f(x + \epsilon_x, y + \epsilon_y) & = (x + \epsilon_x)(y + \epsilon_y)^2 \\ & = (x + \epsilon_x)(y^2 + 2 y \epsilon_y) \\ & = x y^2 + 2 x y \epsilon_y + y^2 \epsilon_x + 2 y \epsilon_x \epsilon_y \end{aligned}

And we can read off the different derivatives
\frac{df}{dx} = y^2, \frac{df}{dy} = 2xy, \frac{d^2f}{dx dy} = 2y

Higher order derivatives

For n-th order derivatives, instead of defining \epsilon^2 = 0, we define \epsilon^{n+1} = 0. We still pass x + \epsilon as the input, but the output will then contain different integer powers of \epsilon which are the different orders of derivatives scaled by \frac{1}{order!}. That is, the first order is scaled by \frac{1}{1!} = 1 so we don’t need to rescale it, but the second order is scaled by \frac{1}{2!} and hence we need to multiply it by 2! = 2.

Example f(x) = x^2 up to second order:

f(x + \epsilon) = (x + \epsilon)^2 = x^2 + 2 x \epsilon + 1 \epsilon^2

The first order derivative is still just 2x, but then second order derivative is 1 * 2! = 2 which matches with our expectation of \frac{d^2}{d x^2} x^2 = 2.

Why this works

If we look at the Taylor expansion of a function f(x) around a we have

f(x) = f(a) + \frac{f'(a)}{1!} (x - a) + \frac{f''(a)}{2!} (x - a)^2 + ...

We can expand f around x (ie. a=x) and evaluate f at x + \epsilon for an arbitrary \epsilon to get

\begin{aligned} f(x + \epsilon) & = f(x) + \frac{f'(x)}{1!} (x + \epsilon - x) + \frac{f''(x)}{2!} (x + \epsilon - x)^2 + ... \\ & = f(x) + \frac{f'(x)}{1!} \epsilon + \frac{f''(x)}{2!} \epsilon^2 + ... \end{aligned}

Now we can define \epsilon^2 = 0 and all the terms with higher order derivatives will vanish leaving us with

f(x + \epsilon) = f(x) + \frac{f'(x)}{1!} \epsilon

If we defined \epsilon to vanish for different powers we would obtain derivatives of different orders (eg. \epsilon^5 = 0 would give all derivatives up to fourth order with respect to x). We can also see why we need to scale the derivatives by order! as the different terms are scaled by \frac{1}{order!} in the taylor series.

Implementation using geometric algebra

So far we defined \epsilon as something that when raised to a certain power becomes 0. In geometric algebra we also deal with basis vectors that square to something.
For \epsilon^2 = 0 we can simply use a new basis vector e_1^2 = 0. For higher order derivatives we need to introduce new basis vectors that square to each other. For example if we are interested in up to third order derivatives we would usually define \epsilon^4 = 0. We then introduce the basis vectors e_1^2 = e_2, e_2^2 = 0 and we can achieve the same thing:

f(x) = x^2, f(x + e_1) = (x + e_1)^2 = x^2 + 2 x e_1 + 1 e_1^2 = x^2 + 2 x e_1 + 1 e_2

Just like before we need to rescale the n-th order derivatives by n!, and we get the expected result. The different powers of e_1 are then stored as coefficients of the basis vectors e_1 squares to. Another observation we can make is that in the current form, we would keep around third powers e_1^3 = e_1 e_2. If we are not interested in these (eg. we only want first and second order derivatives), we can define the product e_1 e_2 to be 0 too.

Geometric algebra implementations can thus implement forward mode automatic differentiation such as this ganja.js example if they support custom multiplication tables. If custom multiplication tables are not supported but degenerate metrics are, it is still possible to obtain first order derivatives though.

Comparison of methods

[Section needs clarification and examples]

Numerically, reverse mode requires evaluating the function first (“forward pass”) and then works backwards using derivatives on the output with the chain rule (“backward pass”). The backward pass requires the intermediate function values in the forward pass so these need to be stored (or only some of them can be stored and we can computer the rest of them when they are required again, a method called checkpointing). Reverse mode will give the derivatives of the output with respect to all its inputs in the backward pass.

Forward mode using multiple variables only needs a single pass where we directly obtain the derivatives of the output with respect to all its inputs. The downside compared to reverse mode is that we need to store a number of additional values for each variable (eg. if we want only first order derivatives and no mixed derivatives, we need to store an additional number for each variable).

The memory usage of forward mode can be reduced by using \epsilon for one or a subset of all variables in a pass and doing multiple passes, which of course will result in a higher runtime.

Examples

[Add more complex examples here?]