`smolarith` API

smolarith does not export anything from the top-level; you must import the module you want to use, perhaps with an alias, e.g.

>>> import smolarith.mul as mul

Multiplication

Soft-core multiplier components.

class smolarith.mul.Inputs(width)

Tagged union representing the signedness of multiplier inputs.

When sign is UNSIGNED, both a and b are treated by the multiplier as Values with an unsigned Shape.
When sign is SIGNED, both a and b are treated as Values with a signed Shape.
When sign is SIGNED_UNSIGNED, a is treated as a Value with a signed Shape, while b is treated as a Value with an unsigned Shape.

Parameters:: width (int) – Width in bits of both inputs a and b. For signed multiplies, this includes the sign bit.

sign

Controls the interpretation of the bit patterns of a and b during multiplication.

Type:: Sign

a

The multiplicand; i.e. the ‘\(a\)’ in \(a * b\).

Type:: Signal(width)

b

The multiplier; i.e. the ‘\(b\)’ in \(a * b\).

Type:: Signal(width)

class smolarith.mul.MulticycleMul(*args, src_loc_at=0, **kwargs)

Multicycle multiplier soft-core.

This multiplier core is a gateware implementation of shift-add multiplication.

A multiply starts on the current cycle when both inp.valid and inp.ready are asserted.
A multiply result is available when outp.valid is asserted. The result is read/overwritten once a downstream core asserts outp.ready.
Latency: Multiply Results for a will be available width clock cycles after assertion of both inp.valid and inp.ready.
Throughput: One multiply maximum is finished every width clock cycles.

Parameters:: width (int) – Width in bits of both inputs a and b. For signed multiplies, this includes the sign bit. Output o width will be \(2*n\).

width

Bit width of the inputs a and b. Output o width will be \(2*n\).

Type:: int

inp

Input interface to the multiplier.

Type:: In(multiplier_input_signature(width))

outp

Output interface of the multiplier.

Type:: Out(multiplier_output_signature(width))

Notes

This multiplier is a naive shift-add implementation, similar to how pen-and-pad multiplication in base-2/10 works. Internally, the multiplier treats the multiplier a as signed and the multiplicand b as unsigned.

a’s signedness only matters for the most-negative value possible for a given bit-width \(n\), where twos-complementing would not change the bit-pattern. Therefore, a is automatically sign-extended to \(n + 1\) bits in the input stage before any further processing.

If b is negative, both b and a are twos-complemented in the input stage; since \(a * b = -a * -b\), no inverse transformation on the output stage is needed.
For an \(n\)-bit multiply, this multiplier requires \(O(3*n)\) storage elements (to store copies of the input, output, and intermediate results).
The output product and (possibly-inverted) b input share a backing store. This works because only b’s LSb is used each cycle and needs to be shifted out at the end of the cycle. Consequently, upper bits of o can be shifted in while lower bits of b are shifted out.

elaborate(platform)

class smolarith.mul.Outputs(width)

Tagged union representing the signedness of multiplier output.

When sign is UNSIGNED, o should be treated as a Value with an as_unsigned Shape.
When sign is SIGNED or SIGNED_UNSIGNED, o should be treated as a Value with a as_signed Shape.

Parameters:: width (int) – Width in bits of the output o. For signed multiplies, this includes the sign bit.

sign

Indicates whether the multiply that produced this product was signed, unsigned, or signed-unsigned.

Type:: Sign

o

The product of \(a * b\).

Type:: Signal(width)

class smolarith.mul.PipelinedMul(*args, src_loc_at=0, **kwargs)

Multiplier soft-core which pipelines inputs.

This multiplier core has pipeline registers that stores intermediate results for up to width multiplies at once. Basic control flow is implemented:

A multiply starts on the current cycle when both inp.valid and inp.ready are asserted.
A multiply result is available when outp.valid is asserted. The result is read/overwritten once a downstream core asserts outp.ready.
inp.ready de-asserts on any cycle where outp.valid is asserted but outp.ready is not asserted. This is a pipeline stall.
Latency: Multiply Results for a given multiply will be available width clock cycles after the multiplier has seen those inputs, assuming no stalls.

If stalls occur (outp.valid is asserted while outp.ready is unasserted), latency increases by the length of the stalls in clock cycles while the given multiply was in the pipeline.
Throughput: One multiply maximum is finished per clock cycle.

Parameters:

width (int) – Width in bits of both inputs a and b. For signed multiplies, this includes the sign bit. Output o width will be \(2*n\).
debug (int, optional) – Enable debugging signals.

width

Bit width of the inputs a and b. Output o width will be \(2*n\).

Type:: int

inp

Input interface to the multiplier.

Type:: In(multiplier_input_signature(width))

outp

Output interface of the multiplier.

Type:: Out(multiplier_output_signature(width))

debug

Flag which indicates whether internal debugging Signals are enabled or not.

Type:: bool

Notes

This multiplier is a naive shift-add implementation, similar to how pen-and-pad multiplication in base-2/10 works. Internally, the multiplier treats the multiplier a as signed and the multiplicand b as unsigned.

a’s signedness only matters for the most-negative value possible for a given bit-width \(n\), where twos-complementing would not change the bit-pattern. Therefore, a is automatically sign-extended to \(n + 1\) bits in the input stage before any further processing.

If b is negative, both b and a are twos-complemented in the input stage; since \(a * b = -a * -b\), no inverse transformation on the output stage is needed.
For an \(n\)-bit multiply, this multiplier requires \(O(n^2)\) storage elements (to store intermediate results).
The pipeline will happily perform multiplies on inputs where inp.valid is not asserted; the core will not assert outp.valid when such multiplies reach the output interface (and thus they will be discarded).
For simplicity of implementation, and under the assumption that stalls will be rare, a pipeline stall stops the entire core. The core does not attempt to fill pipeline stages if the output interface isn’t ready.

Future Directions

It is possible to implement a \(2*n\)-bit multiply using 3 \(n\)-bit multiplies. Since this library is about smol arithmetic, it may be worthwhile to create classes for Karatsuba multipliers.

Larger multpliers using a smaller PipelinedMul will still potentially be quite fast :).
Stalls stop the entire pipeline because validity information is not forwarded to earlier pipeline stages. Latency in the presence of stalls could be reduced by quashing invalid multiplies at other points in the pipeline besides the output. It is not much code complexity to add this, but impact on timing and size is not clear.

elaborate(platform)

class smolarith.mul.Sign(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Indicate the type of multiply to be performed.

UNSIGNED

Both inputs a and b are unsigned.

The output is unsigned.

Type:: int

SIGNED

Both inputs a and b are signed.

The output is signed.

Type:: int

SIGNED_UNSIGNED

Input a is signed and input b is unsigned.

The output is signed.

Note that for an \(n\)-bit multiply with given bit patterns for a and b, the bottom \(n/2\) bits will be identical in an UNSIGNED or SIGNED_UNSIGNED multiply.

Type:: int

smolarith.mul.multiplier_input_signature(width)

Create a parametric multiplier input port.

This function returns a Signature that’s usable as a transfer initiator to a multiplier. A multiply starts on the current cycle when both valid and rdy are asserted.

Parameters:: width (int) – Width in bits of the inputs a and b. For signed multiplies, this includes the sign bit.
Returns:: Signature(Inputs)
Return type:: amaranth.lib.stream.Signature

smolarith.mul.multiplier_output_signature(width)

Create a parametric multiplier output port.

This function returns a Signature that’s usable as a transfer initiator from a multiplier.

Note

For a core responding to a multiplier, which is the typical use case, you will want to use this Signature with the In flow, like so:

>>> from smolarith.mul import multiplier_output_signature
>>> from amaranth.lib.wiring import Signature, In
>>> my_receiver_sig = Signature({
...     "inp": In(multiplier_output_signature(width=8))
... })

Parameters:: width (int) – Width in bits of output o. For signed multiplies, this includes the sign bit.
Returns:: Signature(Outputs)
Return type:: amaranth.lib.stream.Signature

Division

Soft-core divider components.

class smolarith.div.Impl(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Indicate the division algorithm to use inside a dividing component.

RESTORING

Use restoring division for the internal divider. This is a good default.

Type:: int

NON_RESTORING

Use non-restoring division for the internal divider.

Type:: int

class smolarith.div.Inputs(width)

Tagged union representing the signedness of divider inputs.

When sign is UNSIGNED, both n and d are treated by the divider as Values with an unsigned Shape.
When sign is SIGNED, both n and d are treated as Values with a signed Shape.

Parameters:: width (int) – Width in bits of both inputs n and d. For signed divides, this includes the sign bit.

sign

Controls the interpretation of the bit patterns of n and d during division.

Type:: Sign

n

The dividend; i.e. the ‘\(n\)’ in \(n / d\).

Type:: Signal(width)

d

The divisor; i.e. the ‘\(d\)’ in \(n / d\).

Type:: Signal(width)

class smolarith.div.LongDivider(*args, src_loc_at=0, **kwargs)

Long-division soft-core, used as a reference.

This divider core is a gateware implementation of classic long division.

A divide starts on the current cycle when both inp.valid and inp.ready are asserted.
A divide result is available when outp.valid is asserted. The result is read/overwritten once a downstream core asserts outp.ready.

Warning

This core is not intended to be used in user designs due to poor resource usage. It is mainly kept as a known-to-work reference design for possible equivalence checking later. Use MulticycleDiv instead.

Latency: Divide Results for a will be available width clock cycles after assertion of both inp.valid and inp.ready.
Throughput: One divide maximum is finished every width clock cycles.

Parameters:: width (int) – Width in bits of both inputs n and d. For signed multiplies, this includes the sign bit. Outputs q and r width will be the same width.

width

Bit width of the inputs n and d, and outputs q and r.

Type:: int

inp

Input interface to the divider.

Type:: In(divider_input_signature(width))

outp

Output interface of the divider.

Type:: Out(divider_output_signature(width))

Notes

This divider is a naive long-division implementation, similar to how pen-and-pad division is done in base-2/10.
Internally, the divider is aware of the sign of its inputs as well as the sign of the divide operation.

The divider will dispatch to one of the 4 possible combinations of signs during the division loop. The four combinations are:
- Add shifted copies of positive/negative powers of two together to form the quotient.
- Add/subtract shifted copies of the divisor from the dividend towards zero to form the remainder.
This divider is compliant with RISC-V semantics for divide-by-zero and overflow when width=32 or width=64[rv]. Specifically:
- Signed/unsigned divide-by-zero returns “all bits set” for the quotient and the dividend as the remainder.
- Signed overflow (\(-2^{\{31,63\}}/-1\)) returns \(-2^{\{31,63\}}\) for the quotient and \(0\) as the remainder.

[rv]

https://github.com/riscv/riscv-isa-manual/releases/tag/Ratified-IMAFDQC

elaborate(platform)

class smolarith.div.MulticycleDiv(*args, src_loc_at=0, **kwargs)

Restoring/Non-restoring divider soft-core.

This divider core is a gateware implementation of restoring division or non-restoring division. It works for both signed and unsigned divides, and should be preferred to LongDivider in almost all circumstances due to better resource usage.

A divide starts on the current cycle when both inp.valid and inp.ready are asserted.
A divide result is available when outp.valid is asserted. The result is read/overwritten once a downstream core asserts outp.ready.
Latency: Divide Results for a restoring divider will be available width + 3 clock cycles after assertion of both inp.valid and inp.ready. For a non-restoring divider, results are available after width + 4 clock cycles.
Throughput: One divide maximum is finished every width + 3 clock cycles for a restoring divider, and every width + 4 for a non-restoring divider.

Parameters:

width (int) – Width in bits of both inputs n and d. For signed multiplies, this includes the sign bit. Outputs q and r width will be the same width.
impl (Impl) – Choose whether to use a restoring divider or non-restoring divider internally. The default of RESTORING is fine for most cases.

width

Bit width of the inputs n and d, and outputs q and r.

Type:: int

inp

Input interface to the divider.

Type:: In(divider_input_signature(width))

outp

Output interface of the divider.

Type:: Out(divider_output_signature(width))

Notes

This divider is implemented using either restoring or non-restoring division. It is basically a gateware translation of the Python implementations shown in Implementation Notes.
Unlike the multipliers, where I could eyeball approximate storage element usage, divider size was benchmarked using yosys‘s synth_ice40 pass. For an \(n\)-bit divide:
- impl=Impl.RESTORING requires approximately \(O(7*n)\) storage elements and \(O(12.75*n)\) LUT4s.
- impl=Impl.NON_RESTORING requires approximately \(O(8*n)\) storage elements and \(O(14.5*n)\) LUT4s.
Internally, the divider converts its operands from signed to unsigned if necessary, performs the division, and the converts back from unsigned to signed if necessary. I ran into some issues with making a signed-aware non-restoring divider such that eating the conversion cost latency is probably justifiable for implementation simplicity of both a restoring and non-restoring divider.

The combination of signed-to-unsigned conversion, using a cycle to latching inputs to the internal divider, and unsigned-to-signed conversion adds three cycles of latency beyond the expected width number of cycles to perform a division.

Additionally, when using a non-restoring divider, the quotient and remainder require a possible final restore step. Remainder restore is implemented as shown in the Python code. Quotient restore is implemented by subtracting the ones complement of the raw quotient from the raw quotient itself.

The quotient and remainder share a backing store; new bits are shifted into the lower-half quotient portion as the remainder is shifted into the upper half. This works because each iteration of the restoring/non-restoring loops check the sign of the remainder, and the lower quotient bits won’t affect the sign bit.
This divider is compliant with RISC-V semantics for divide-by-zero and overflow when width=32 or width=64[rv]. Specifically:
- Signed/unsigned divide-by-zero returns “all bits set” for the quotient and the dividend as the remainder.
- Signed overflow (\(-2^{\{31,63\}}/-1\)) returns \(-2^{\{31,63\}}\) for the quotient and \(0\) as the remainder.

Future Directions

It may be worthwhile to make a pipelined divider?
Current latency is worse than the LongDivider, which is just about the only advantage to LongDivider. We can provide an option to reduce latency/resources required due to the signed/unsigned and restore stages at the cost of some timing closure.

elaborate(plat)

class smolarith.div.Outputs(width)

Tagged union representing the signedness of divider outputs.

When sign is UNSIGNED, q and r should be treated as Values with an as_unsigned Shape.
When sign is SIGNED, q and r should be treated as Values with a as_signed Shape.

Parameters:: width (int) – Width in bits of the outputs q and r. For signed dividers, this includes the sign bit.

sign

Indicates whether the divider that produced this quotient/remainder was signed or unsigned.

Type:: Sign

q

The quotient of \(n / d\).

Type:: Signal(width)

r

The remainder of \(n / d\), i.e. \(n \bmod d\).

Type:: Signal(width)

class smolarith.div.Sign(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Indicate the type of divide to be performed.

UNSIGNED

Both inputs n and d are unsigned.

The output is unsigned.

Type:: int

SIGNED

Both inputs n and d are signed.

The quotient and remainder are signed. The remainder takes the sign of the dividend.

Type:: int

smolarith.div.divider_input_signature(width)

Create a parametric divider input port.

This function returns a Signature that’s usable as a transfer initiator to a divider. A divider starts on the current cycle when both valid and rdy are asserted.

Parameters:: width (int) – Width in bits of the inputs n and d. For signed divides, this includes the sign bit.
Returns:: Signature(Inputs)
Return type:: amaranth.lib.stream.Signature

smolarith.div.divider_output_signature(width)

Create a parametric divider output port.

This function returns a Signature that’s usable as a transfer initiator from a divider.

Note

For a core responding to a divider, which is the typical use case, you will want to use this Signature with the In flow, like so:

>>> from smolarith.div import divider_output_signature
>>> from amaranth.lib.wiring import Signature, In
>>> my_receiver_sig = Signature({
...     "inp": In(divider_output_signature(width=8))
... })

Parameters:: width (int) – Width in bits of the outputs q and r. For signed divides, this includes the sign bit.
Returns:: Signature(Outputs)
Return type:: amaranth.lib.stream.Signature

smolarith API

Multiplication

Division

`smolarith` API