(impl)=
# Implementation Notes

## Multiplication

(karatsuba)=
```{todo}
Discuss Karatsuba algorithm.
```

## Division

Wikipedia [discusses](https://en.wikipedia.org/wiki/Division_algorithm#Slow_division_methods)
division algorithms, and the code works just fine. However, I found their
definition of the recurrence relation that governs "slow" (one digit per cycle)
division algorithms confusing. These are my notes for future-me.

(derive-res)=
### Deriving Restoring Division

Consider the following long division, meant to portray any "generic" divide
operation:

(sample-div)=
```{math}
1362 / 14 = 97\bmod{4} \Leftrightarrow N / D = q\bmod{R}
```

If you do this using long division[^1], you can derive the following
intermediate steps:

(derive1)=
```{math}
\begin{array}{rll}
R := 1362 && (initial) \\
R \mathrel{-}= 0 && (0 = (0 * 14) * 1000, R = 1362) \\
R \mathrel{-}= 0 && (0 = (0 * 14) * 100, R = 1362) \\
R \mathrel{-}= 1260 && (1260 = (9 * 14) * 10, R = 102) \\
R \mathrel{-}= 98 && (98 = (7 * 14) * 1, R = 4) \\
\end{array}
```

You can annotate the above to make it generic to any 4 digit base-10
multiplication.

(derive2)=
```{math}
R_0 := N \\
R_1 := R_0 - (q_3 * D * 10^3) \\
R_2 := R_1 - (q_2 * D * 10^2) \\
R_3 := R_2 - (q_1 * D * 10^1) \\
R_4 := R_3 - (q_0 * D * 10^0) \\
```

Mapping the [second](derive2) set of assignments to the [first](derive1):

* {math}`N = 1362`
* {math}`D = 14`
* {math}`q_j` is the {math}`j`th digit of the quotient, where {math}`j` ranges
  from {math}`0` up to {math}`n-1`. _{math}`q_j` must be between `0` and `9`._
  In the above example {math}`q_3 = 0`, {math}`q_2 = 0`, {math}`q_1 = 9`, and
  {math}`q_0 = 7`.
* {math}`R_n` is the remainder after each step, starting at {math}`R_0 = N`
  before division begins. {math}`R_4` is the remainder we want, after the 4th and
  final digit of the dividend. _{math}`R_n` must always be positive._

It turns out[^2] there's a generic [recurrence relation](https://en.wikipedia.org/wiki/Recurrence_relation)
for finding the {math}`j`th remainder of a division where the dividend has {math}`n`
digits:

(rr1)=
```{math}
R_0 := N \\
R_{j+1} := R_j - q_{n - (j + 1)} * 10^{n - (j + 1)} * D \\
```

This formula doesn't actually match Wikipedia's, even after accounting for
arbitrary base ({math}`10 \Rightarrow B`). Somewhere along the line, someone
realized that you can modify the [above equation](rr1) to make it friendlier
to implement on hardware, by making the following substitution:

```{math}
S_j = (10^j) * R_j
```

Note that {math}`S_0 = R_0`. If you substitute[^3] {math}`10^{-j} * S_j` for
{math}`R_j` into the [above equation](rr1), and then multiply both sides by
{math}`10^{-j - 1}` you get a new recurrence relation:

(rr2)=
```{math}
S_0 := N \\
S_{j+1} := 10 * S_j - q_{n - (j + 1)} * 10^n * D \\
```

The {math}`10^n` multipler on {math}`D` doesn't match Wikipedia's definition,
but it _does_ match the accompanying psuedocode. As alluded to earlier, these
recurrence relations work for arbitrary bases. In base-2, since multiplying
by powers of two is equivalent to left-shifting, the relation looks like:

(rr3)=
```{math}
S_0 := N \\
S_{j+1} := (S_j << 1) - q_{n - (j + 1)} * (D << n) \\
```

{math}`q_j` can only be either {math}`0` or {math}`1`, which makes multiplication trivial
(either subtract {math}`0` or {math}`D << n`). Since {math}`S_j` must be positive, we've
turned finding the quotient and remainder into the following Python code:

(resdiv-py)=
```{doctest}
>>> def restoring_div(N, D, n):
...    S = N
...    D = D << n
...    q = 0
...
...    for j in range(n):
...        S = (S << 1) - D
...        if S < 0:
...            S += D
...        else:
...            q |= 1
...        q <<=1
...
...    return (q >> 1, S >> n)
>>>
>>> restoring_div(1362, 14, 12)
(97, 4)

```

This is the core of restoring division. It's called restoring because if
`S < 0` during any step, we have to restore the `D << n` that we subtracted.
We can get rid of this step using non-restoring divison.

(derive-nr)=
### Deriving Non-restoring Division

In restoring division, {math}`q_0` through {math}`q_{n-1}` map precisely to
quotient digits in base-2[^4], where each {math}`q_j` is either {math}`0` or
{math}`1`:

```{math}
q = 2^{n - 1} * q_{n - 1} + 2^{n - 2} * q_{n - 2} + ... + 2^1 * q_1 + 2^0 * q_0
```

If we instead map each {math}`q_j` to either {math}`-1` or {math}`1`, we can
still represent odd numbers:

```{math}
{-1}{-1} = 2^1 * -1 + 2^0 * -1 = -3 \\
{-1}1 = 2^1 * -1 + 2^0 * -1 = -1 \\
1{-1} = 2^1 * -1 + 2^0 * -1 = 1 \\
11 = 2^1 * -1 + 2^0 * -1 = 3 \\
```

The quotient {math}`q` and remainder {math}`r` results one gets from dividing
{math}`N / D` aren't exactly unique. _Without additional constraints_, there
are infinite integers {math}`q` and {math}`r` that satisfy:

(diveq)=
```{math}
N = D * q + r
```

(constraint)=
Our additional constrait is that we want {math}`0 \leq r \lt D`; this [uniquely](https://en.wikipedia.org/wiki/Euclidean_division#Division_theorem) constrains {math}`q`. We can temporarily
relax this constraint however. Given {math}`N` and {math}`D`, when {math}`q`
is even, we can get a new mathematically valid result with an odd {math}`q` if
we set {math}`q := q + 1` and {math}`r := r - D`. Note that {math}`r` will be
negative, which means that for arbitrary {math}`q`, {math}`-D \leq r \lt D`:

```{math}
N = D * (q + 1) + (r - D)
```

Our equation we [used](rr3) for restoring division, reprinted below, still
works[^5] when {math}`q_j` is mapped to either {math}`-1` or {math}`1`:

(rr3-copy)=
```{math}
S_0 := N \\
S_{j+1} := (S_j << 1) - q_{n - (j + 1)} * (D << n) \\
```

For each iteration {math}`j` of the [above](rr3-copy) equation:

* If {math}`S_j` goes/remains positive, we're trying to get successive {math}`S_{j + 1}`
  closer to 0, much the same with restoring division.
* On the other hand, we will tolerate when {math}`S_j` goes/remains negative.
  When calculating {math}`S_{j + 1}` for the next iteration, we attempt to
  get closer to 0 by setting {math}`q_{n - (j + 1)} := -1` the next
  iteration and adding {math}`D`.

(nrdiv-restore)=
If the final {math}`S_n` is negative, we do a final restoring step by setting
{math}`q := q - 1` and {math}`S := S + D`. This will bring the shifted
remainder {math}`S` positive, and make our quotient {math}`q` even, satisfying
our [constraints](constraint). A working implementation looks like such:

(nrdiv-py)=
```{doctest}
>>> def nonrestoring_div(N, D, n):
...     S = N
...     D = D << n
...     q = 0
...     
...     for j in range(n):
...         if S < 0:
...             S = (S << 1) + D
...             q -= 2**(n - j - 1)
...         else:
...             S = (S << 1) - D
...             q += 2**(n - j - 1)
...     
...     if S < 0:
...         S += D
...         q -= 1
...     
...     return (q, S >> n)
>>>
>>> nonrestoring_div(1362, 14, 12)
(97, 4)

```

This technique is called non-restoring division, thanks to the lack of
restoring step to bring {math}`S_j` positive _except possibly at the final
step_.

### Benchmarking

```{note}
This section is still technically correct. However, I originally wrote it
before I made a restoring divider version of {class}`~smolarith.div.MulticycleDiv`.
For general use-cases, a restoring divider (the default as of v0.1.1) wins for
size over a non-restoring divider. Both a restoring and non-restoring
implementation are kept for future testing.

It seems that  the restoring step each iteration doesn't have a perf penalty
when done in parallel in hardware. In fact, the restoring version finishes one
cycle sooner due to the lack of a final restoring step! Additionally, the size
impact of restoring once per iteration seems negligible compared to the
calculations required per iteration of a non-restoring divider (coupled with
the possible final restore step).

At this time, I have not done any experiments on making a signed restoring
divider. There may or may not be the same complications as with a signed
non-restoring divider to satisfy RISC-V semantics.
```

Why would you ever go through all this trouble? Well, here's a benchmark...

```{prompt}
:language: powershell
:prompts: PS C:\\msys64\\home\\William\\Projects\\FPGA\\amaranth\\smolarith>,>>>,...
:modifiers: auto
PS C:\\msys64\\home\\William\\Projects\\FPGA\\amaranth\\smolarith> pdm run
No command is given, default to the Python REPL.
Python 3.11.8 (main, Feb 13 2024, 07:18:52)  [GCC 13.2.0 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from amaranth.back.verilog import convert
>>> from smolarith.div import MulticycleDiv, LongDivider
>>>
>>> nrdiv_v = convert(MulticycleDiv(32))
>>> with open("nrdiv.v", "w") as fp:  # doctest: +SKIP
...     fp.write(nrdiv_v)
...
53043
>>>
>>> longdiv_v = convert(LongDivider(32))
>>> with open("longdiv.v", "w") as fp:    # doctest: +SKIP
...     fp.write(longdiv_v)
...
76614
>>> exit()  # doctest: +SKIP
PS C:\\msys64\\home\\William\\Projects\\FPGA\\amaranth\\smolarith> yosys -QTp 'tee -q synth_ice40; stat' longdiv.v

-- Parsing `longdiv.v' using frontend ` -vlog2k' --

1. Executing Verilog-2005 frontend: longdiv.v
Parsing Verilog input from `longdiv.v' to AST representation.
Storing AST representation for module `$abstract\top'.
Storing AST representation for module `$abstract\top.mag'.
Successfully finished Verilog frontend.

-- Running command `tee -q synth_ice40; stat' --

3. Printing statistics.

=== top ===

   Number of wires:               2576
   Number of wire bits:          12802
   Number of public wire bits:   12802
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:               4146
     $scopeinfo                      1
     SB_CARRY                      547
     SB_DFFESR                       9
     SB_DFFSR                       97
     SB_LUT4                      3492

PS C:\\msys64\\home\\William\\Projects\\FPGA\\amaranth\\smolarith> yosys -QTp 'tee -q synth_ice40; stat' nrdiv.v

-- Parsing `nrdiv.v' using frontend ` -vlog2k' --

1. Executing Verilog-2005 frontend: nrdiv.v
Parsing Verilog input from `nrdiv.v' to AST representation.
Storing AST representation for module `$abstract\top'.
Storing AST representation for module `$abstract\top.from_u'.
Storing AST representation for module `$abstract\top.nrdiv'.
Storing AST representation for module `$abstract\top.to_u'.
Successfully finished Verilog frontend.

-- Running command `tee -q synth_ice40; stat' --

3. Printing statistics.

=== top ===

   Number of wires:                894
   Number of wire bits:           2865
   Number of public wires:         894
   Number of public wire bits:    2865
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:               1162
     $scopeinfo                      3
     SB_CARRY                      251
     SB_DFFESR                     177
     SB_DFFSR                       34
     SB_LUT4                       697

PS C:\\msys64\\home\\William\\Projects\\FPGA\\amaranth\\smolarith> yosys -V
Yosys 0.38+92 (git sha1 84116c9a3, sccache x86_64-w64-mingw32-g++ 13.2.0 -Os)
PS C:\\msys64\\home\\William\\Projects\\FPGA\\amaranth\\smolarith>
```

While I haven't spent much effort optimizing either {class}`~smolarith.div.LongDivider`
or {class}`~smolarith.div.MulticycleDiv`, it's clear that the latter non-restoring
implementation wins on LUT usage alone, with a modest increase in storage
elements (FFs). The extra storage elements should also help with timing
closure. So _without further testing_ for now, I'd recommend using
{class}`~smolarith.div.MulticycleDiv` for your division needs until I can
investigate.

(signedness)=
### Truncating Division And Signedness

You can implement [various types](https://en.wikipedia.org/wiki/Modulo#Variants_of_the_definition)
of division by choosing to constrain {math}`r` in the equation [above](diveq)
relating {math}`N`, {math}`D`, {math}`q`, and {math}`r`.

I initially wrote the long divider implementation with RISC-V in mind. Page
44 of the [RISC-V Unprivileged ISA, V20191213](https://github.com/riscv/riscv-isa-manual/releases/tag/Ratified-IMAFDQC),
states:

> For REM, the sign of the result equals the sign of the dividend.

Wikipedia [calls this](https://en.wikipedia.org/wiki/Modulo#Variants_of_the_definition)
truncating division, and in my experience it is the "natural" version of
division to implement when doing a signed long divider because regardless of
the sign of {math}`N`, you converge toward 0 remainder by subtracting or adding
shifted copies of `D`.

_While the above sections are focused on division with positive numbers_, with
a bit of effort to deal with the restoring step, you can also make
non-restoring division work with signed numbers and just like truncating
division:

```
if ((S < 0) and (N >= 0) and (D > 0)):
    q -= 1
    S += D
elif (S > 0) and (N < 0) and (D > 0):
    q += 1
    S -= D
elif (S > 0) and (N < 0) and (D < 0):
    q -= 1
    S += D
elif (S < 0) and (N >= 0) and (D < 0):
    q += 1
    S -= D
```

However, RISC-V has specific semantics for edge case divisons, list on page 45
of the 2019 manual:

* Division by 0 for arbitrary {math}`x` should return all bits set for the
  quotient, and {math}`x` for the remainder, regardless of signed or unsigned
  division.
* Division of the most negative value by {math}`-1` should return the dividend
  unchanged.

The former of these is easy enough to implement for free in a signed divider,
but the latter is a provision meant for common signed divider implementations:

> Signed division is often implemented using an unsigned division circuit and
> specifying the same overflow result simplifies the hardware.

I started with a signed divider under the assumption that I could spread out
the cost of signed/unsigned conversion. But the poor resource usage of my
long divider, coupled with signed division _indeed_ [commonly](https://www.righto.com/2023/04/)
being implemented using an unsigned division circuit, are making me rethink my
strategy for a non-restoring divider.

## Footnotes

[^1]: I did this using actual pen and paper :). Sadly, trying to format a
traditional long division in MathJax didn't work as well as I'd like.

[^2]: I didn't actually prove this, I just assume the math works out for
arbitrary-widths.

[^3]: Just trust me, I did it by hand.

[^4]: This applies for arbitrary base {math}`B`, but let's stick with base-2.

[^5]: I didn't actually prove this, but I can visualize why the math would
work out if I did a base {math}`-1`/{math}`1` division by hand analogous to
the [one](#deriving-restoring-division) in the restoring section.
   
    I believe non-restoring division in fact works for base 10 as well, where 
{math}`-9 \leq q \lt 10, q \neq 0`. However, I didn't work out the details on paper
about how close to {math}`0` each successive {math}`S_{j+1}` should be.