Expand description
Floating point division routines.
This module documentation gives an overview of the method used. More documentation is inline.
§Relevant notation
m_a: the mantissa ofa, in base 2p_a: the exponent ofa, in base 2. I.e.a = m_a * 2^p_auqN(e.g.uq1): this refers to Q notation for fixed-point numbers. UQ1.31 is an unsigned fixed-point number with 1 integral bit, and 31 decimal bits. AuqNvariable of typeuMwill have N bits of integer and M-N bits of fraction.hw: half width, i.e. forf64this will be au32.xis the best estimate of1/m_b
§Method Overview
Division routines must solve for a / b, which is res = m_a*2^p_a / m_b*2^p_b. The basic
process is as follows:
- Rearange the exponent and significand to simplify the operations:
res = (m_a / m_b) * 2^{p_a - p_b}. - Check for early exits (infinity, zero, etc).
- If
aorbare subnormal, normalize by shifting the mantissa and adjusting the exponent. - Set the implicit bit so math is correct.
- Shift mantissa significant digits (with implicit bit) fully left such that fixed-point UQ1 or UQ0 numbers can be used for mantissa math. These will have greater precision than the actual mantissa, which is important for correct rounding.
- Calculate the reciprocal of
m_b,x. - Use the reciprocal to multiply rather than divide:
res = m_a * x_b * 2^{p_a - p_b}. - Reapply rounding.
§Reciprocal calculation
Calculating the reciprocal is the most complicated part of this process. It uses the Newton-Raphson method, which picks an initial estimation (of the reciprocal) and performs a number of iterations to increase its precision.
In general, Newton’s method takes the following form:
`x_n` is a guess or the result of a previous iteration. Increasing `n` converges to the
desired result.
The result approaches a zero of `f(x)` by applying a correction to the previous gues.
x_{n+1} = x_n - f(x_n) / f'(x_n)Applying this to find the reciprocal:
1 / x = b
Rearrange so we can solve by finding a zero
0 = (1 / x) - b = f(x)
f'(x) = -x^{-2}
x_{n+1} = 2*x_n - b*x_n^2This is a process that can be repeated to calculate the reciprocal with enough precision to achieve a correctly rounded result for the overall division operation. The maximum required number of iterations is known since precision doubles with each iteration.
§Half-width operations
Calculating the reciprocal requires widening multiplication and performing arithmetic on the
results, meaning that emulated integer arithmetic on u128 (for f64) and u256 (for f128)
gets used instead of native math.
To make this more efficient, all but the final operation can be computed using half-width
integers. For example, rather than computing four iterations using 128-bit integers for f64,
we can instead perform three iterations using native 64-bit integers and only one final
iteration using the full 128 bits.
This works because of precision doubling. Some leeway is allowed here because the fixed-point number has more bits than the final mantissa will.