smolarith API

smolarith does not export anything from the top-level; you must import the module you want to use, perhaps with an alias, e.g.

>>> import smolarith.mul as mul

Multiplication

Soft-core multiplier components.

class smolarith.mul.Inputs(width)

Tagged union representing the signedness of multiplier inputs.

  • When sign is UNSIGNED, both a and b are treated by the multiplier as Values with an unsigned Shape.

  • When sign is SIGNED, both a and b are treated as Values with a signed Shape.

  • When sign is SIGNED_UNSIGNED, a is treated as a Value with a signed Shape, while b is treated as a Value with an unsigned Shape.

Parameters:

width (int) – Width in bits of both inputs a and b. For signed multiplies, this includes the sign bit.

sign

Controls the interpretation of the bit patterns of a and b during multiplication.

Type:

Sign

a

The multiplicand; i.e. the ‘\(a\)’ in \(a * b\).

Type:

Signal(width)

b

The multiplier; i.e. the ‘\(b\)’ in \(a * b\).

Type:

Signal(width)

class smolarith.mul.MulticycleMul(*args, src_loc_at=0, **kwargs)

Multicycle multiplier soft-core.

This multiplier core is a gateware implementation of shift-add multiplication.

  • A multiply starts on the current cycle when both inp.valid and inp.ready are asserted.

  • A multiply result is available when outp.valid is asserted. The result is read/overwritten once a downstream core asserts outp.ready.

  • Latency: Multiply Results for a will be available width clock cycles after assertion of both inp.valid and inp.ready.

  • Throughput: One multiply maximum is finished every width clock cycles.

Parameters:

width (int) – Width in bits of both inputs a and b. For signed multiplies, this includes the sign bit. Output o width will be \(2*n\).

width

Bit width of the inputs a and b. Output o width will be \(2*n\).

Type:

int

inp

Input interface to the multiplier.

Type:

In(multiplier_input_signature(width))

outp

Output interface of the multiplier.

Type:

Out(multiplier_output_signature(width))

Notes

  • This multiplier is a naive shift-add implementation, similar to how pen-and-pad multiplication in base-2/10 works. Internally, the multiplier treats the multiplier a as signed and the multiplicand b as unsigned.

    a’s signedness only matters for the most-negative value possible for a given bit-width \(n\), where twos-complementing would not change the bit-pattern. Therefore, a is automatically sign-extended to \(n + 1\) bits in the input stage before any further processing.

    If b is negative, both b and a are twos-complemented in the input stage; since \(a * b = -a * -b\), no inverse transformation on the output stage is needed.

  • For an \(n\)-bit multiply, this multiplier requires \(O(3*n)\) storage elements (to store copies of the input, output, and intermediate results).

  • The output product and (possibly-inverted) b input share a backing store. This works because only b’s LSb is used each cycle and needs to be shifted out at the end of the cycle. Consequently, upper bits of o can be shifted in while lower bits of b are shifted out.

elaborate(platform)
class smolarith.mul.Outputs(width)

Tagged union representing the signedness of multiplier output.

  • When sign is UNSIGNED, o should be treated as a Value with an as_unsigned Shape.

  • When sign is SIGNED or SIGNED_UNSIGNED, o should be treated as a Value with a as_signed Shape.

Parameters:

width (int) – Width in bits of the output o. For signed multiplies, this includes the sign bit.

sign

Indicates whether the multiply that produced this product was signed, unsigned, or signed-unsigned.

Type:

Sign

o

The product of \(a * b\).

Type:

Signal(width)

class smolarith.mul.PipelinedMul(*args, src_loc_at=0, **kwargs)

Multiplier soft-core which pipelines inputs.

This multiplier core has pipeline registers that stores intermediate results for up to width multiplies at once. Basic control flow is implemented:

  • A multiply starts on the current cycle when both inp.valid and inp.ready are asserted.

  • A multiply result is available when outp.valid is asserted. The result is read/overwritten once a downstream core asserts outp.ready.

  • inp.ready de-asserts on any cycle where outp.valid is asserted but outp.ready is not asserted. This is a pipeline stall.

  • Latency: Multiply Results for a given multiply will be available width clock cycles after the multiplier has seen those inputs, assuming no stalls.

    If stalls occur (outp.valid is asserted while outp.ready is unasserted), latency increases by the length of the stalls in clock cycles while the given multiply was in the pipeline.

  • Throughput: One multiply maximum is finished per clock cycle.

Parameters:
  • width (int) – Width in bits of both inputs a and b. For signed multiplies, this includes the sign bit. Output o width will be \(2*n\).

  • debug (int, optional) – Enable debugging signals.

width

Bit width of the inputs a and b. Output o width will be \(2*n\).

Type:

int

inp

Input interface to the multiplier.

Type:

In(multiplier_input_signature(width))

outp

Output interface of the multiplier.

Type:

Out(multiplier_output_signature(width))

debug

Flag which indicates whether internal debugging Signals are enabled or not.

Type:

bool

Notes

  • This multiplier is a naive shift-add implementation, similar to how pen-and-pad multiplication in base-2/10 works. Internally, the multiplier treats the multiplier a as signed and the multiplicand b as unsigned.

    a’s signedness only matters for the most-negative value possible for a given bit-width \(n\), where twos-complementing would not change the bit-pattern. Therefore, a is automatically sign-extended to \(n + 1\) bits in the input stage before any further processing.

    If b is negative, both b and a are twos-complemented in the input stage; since \(a * b = -a * -b\), no inverse transformation on the output stage is needed.

  • For an \(n\)-bit multiply, this multiplier requires \(O(n^2)\) storage elements (to store intermediate results).

  • The pipeline will happily perform multiplies on inputs where inp.valid is not asserted; the core will not assert outp.valid when such multiplies reach the output interface (and thus they will be discarded).

  • For simplicity of implementation, and under the assumption that stalls will be rare, a pipeline stall stops the entire core. The core does not attempt to fill pipeline stages if the output interface isn’t ready.

Future Directions

  • It is possible to implement a \(2*n\)-bit multiply using 3 \(n\)-bit multiplies. Since this library is about smol arithmetic, it may be worthwhile to create classes for Karatsuba multipliers.

    Larger multpliers using a smaller PipelinedMul will still potentially be quite fast :).

  • Stalls stop the entire pipeline because validity information is not forwarded to earlier pipeline stages. Latency in the presence of stalls could be reduced by quashing invalid multiplies at other points in the pipeline besides the output. It is not much code complexity to add this, but impact on timing and size is not clear.

elaborate(platform)
class smolarith.mul.Sign(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Indicate the type of multiply to be performed.

  • UNSIGNED: Both inputs a and b are unsigned.

    The output is unsigned.

  • SIGNED: Both inputs a and b are unsigned.

    The output is signed.

  • SIGNED_UNSIGNED: Input a is signed and input b is unsigned.

    The output is signed.

    Note that for \(n\)-bit multiply with given bit patterns for a and b, the bottom \(n/2\) bits will be identical in an UNSIGNED or SIGNED_UNSIGNED multiply.

SIGNED = 2
SIGNED_UNSIGNED = 3
UNSIGNED = 1
smolarith.mul.multiplier_input_signature(width)

Create a parametric multiplier input port.

This function returns a Signature that’s usable as a transfer initiator to a multiplier. A multiply starts on the current cycle when both valid and rdy are asserted.

Parameters:

width (int) – Width in bits of the inputs a and b. For signed multiplies, this includes the sign bit.

Returns:

Signature containing the following members:

.payload: Out(Inputs)

Data input to multiplier.

.ready: In(1)

When 1, indicates that multiplier is ready.

.valid: Out(1)

When 1, indicates that multiplier data input is valid.

Return type:

Signature

smolarith.mul.multiplier_output_signature(width)

Create a parametric multiplier output port.

This function returns a Signature that’s usable as a transfer initiator from a multiplier.

Note

For a core responding to a multiplier, which is the typical use case, you will want to use this Signature with the In flow, like so:

>>> from smolarith.mul import multiplier_output_signature
>>> from amaranth.lib.wiring import Signature, In
>>> my_receiver_sig = Signature({
...     "inp": In(multiplier_output_signature(width=8))
... })
Parameters:

width (int) – Width in bits of output o. For signed multiplies, this includes the sign bit.

Returns:

Signature containing the following members:

.payload: Out(Outputs)

Data output from multiplier.

.ready: In(1)

When 1, indicates that responder is ready to receive results from multiplier.

.valid: Out(1)

When 1, indicates that multiplier output data input is valid.

Return type:

Signature

Division

Soft-core divider components.

class smolarith.div.Impl(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Indicate the division algorithm to use inside a dividing component.

NON_RESTORING = 2
RESTORING = 1
class smolarith.div.Inputs(width)

Tagged union representing the signedness of divider inputs.

  • When sign is UNSIGNED, both n and d are treated by the divider as Values with an unsigned Shape.

  • When sign is SIGNED, both n and d are treated as Values with a signed Shape.

Parameters:

width (int) – Width in bits of both inputs n and d. For signed divides, this includes the sign bit.

sign

Controls the interpretation of the bit patterns of n and d during division.

Type:

Sign

n

The dividend; i.e. the ‘\(n\)’ in \(n / d\).

Type:

Signal(width)

d

The divisor; i.e. the ‘\(d\)’ in \(n / d\).

Type:

Signal(width)

class smolarith.div.LongDivider(*args, src_loc_at=0, **kwargs)

Long-division soft-core, used as a reference.

This divider core is a gateware implementation of classic long division.

  • A divide starts on the current cycle when both inp.valid and inp.ready are asserted.

  • A divide result is available when outp.valid is asserted. The result is read/overwritten once a downstream core asserts outp.ready.

Warning

This core is not intended to be used in user designs due to poor resource usage. It is mainly kept as a known-to-work reference design for possible equivalence checking later. Use MulticycleDiv instead.

  • Latency: Divide Results for a will be available width clock cycles after assertion of both inp.valid and inp.ready.

  • Throughput: One divide maximum is finished every width clock cycles.

Parameters:

width (int) – Width in bits of both inputs n and d. For signed multiplies, this includes the sign bit. Outputs q and r width will be the same width.

width

Bit width of the inputs n and d, and outputs q and r.

Type:

int

inp

Input interface to the divider.

Type:

In(divider_input_signature(width))

outp

Output interface of the divider.

Type:

Out(divider_output_signature(width))

Notes

  • This divider is a naive long-division implementation, similar to how pen-and-pad division is done in base-2/10.

  • Internally, the divider is aware of the sign of its inputs as well as the sign of the divide operation.

    The divider will dispatch to one of the 4 possible combinations of signs during the division loop. The four combinations are:

    • Add shifted copies of positive/negative powers of two together to form the quotient.

    • Add/subtract shifted copies of the divisor from the dividend towards zero to form the remainder.

  • This divider is compliant with RISC-V semantics for divide-by-zero and overflow when width=32 or width=64[rv]. Specifically:

    • Signed/unsigned divide-by-zero returns “all bits set” for the quotient and the dividend as the remainder.

    • Signed overflow (\(-2^{\{31,63\}}/-1\)) returns \(-2^{\{31,63\}}\) for the quotient and \(0\) as the remainder.

elaborate(platform)
class smolarith.div.MulticycleDiv(*args, src_loc_at=0, **kwargs)

Restoring/Non-restoring divider soft-core.

This divider core is a gateware implementation of restoring division or non-restoring division. It works for both signed and unsigned divides, and should be preferred to LongDivider in almost all circumstances due to better resource usage.

  • A divide starts on the current cycle when both inp.valid and inp.ready are asserted.

  • A divide result is available when outp.valid is asserted. The result is read/overwritten once a downstream core asserts outp.ready.

  • Latency: Divide Results for a restoring divider will be available width + 3 clock cycles after assertion of both inp.valid and inp.ready. For a non-restoring divider, results are available after width + 4 clock cycles.

  • Throughput: One divide maximum is finished every width + 3 clock cycles for a restoring divider, and every width + 4 for a non-restoring divider.

Parameters:
  • width (int) – Width in bits of both inputs n and d. For signed multiplies, this includes the sign bit. Outputs q and r width will be the same width.

  • impl (Impl) – Choose whether to use a restoring divider or non-restoring divider internally. The default of RESTORING is fine for most cases.

width

Bit width of the inputs n and d, and outputs q and r.

Type:

int

inp

Input interface to the divider.

Type:

In(divider_input_signature(width))

outp

Output interface of the divider.

Type:

Out(divider_output_signature(width))

Notes

  • This divider is implemented using either restoring or non-restoring division. It is basically a gateware translation of the Python implementations shown in Implementation Notes.

  • Unlike the multipliers, where I could eyeball approximate storage element usage, divider size was benchmarked using yosys‘s synth_ice40 pass. For an \(n\)-bit divide:

    • impl=Impl.RESTORING requires approximately \(O(7*n)\) storage elements and \(O(12.75*n)\) LUT4s.

    • impl=Impl.NON_RESTORING requires approximately \(O(8*n)\) storage elements and \(O(14.5*n)\) LUT4s.

  • Internally, the divider converts its operands from signed to unsigned if necessary, performs the division, and the converts back from unsigned to signed if necessary. I ran into some issues with making a signed-aware non-restoring divider such that eating the conversion cost latency is probably justifiable for implementation simplicity of both a restoring and non-restoring divider.

The combination of signed-to-unsigned conversion, using a cycle to latching inputs to the internal divider, and unsigned-to-signed conversion adds three cycles of latency beyond the expected width number of cycles to perform a division.

Additionally, when using a non-restoring divider, the quotient and remainder require a possible final restore step. Remainder restore is implemented as shown in the Python code. Quotient restore is implemented by subtracting the ones complement of the raw quotient from the raw quotient itself.

  • The quotient and remainder share a backing store; new bits are shifted into the lower-half quotient portion as the remainder is shifted into the upper half. This works because each iteration of the restoring/non-restoring loops check the sign of the remainder, and the lower quotient bits won’t affect the sign bit.

  • This divider is compliant with RISC-V semantics for divide-by-zero and overflow when width=32 or width=64[rv]. Specifically:

    • Signed/unsigned divide-by-zero returns “all bits set” for the quotient and the dividend as the remainder.

    • Signed overflow (\(-2^{\{31,63\}}/-1\)) returns \(-2^{\{31,63\}}\) for the quotient and \(0\) as the remainder.

Future Directions

  • It may be worthwhile to make a pipelined divider?

  • Current latency is worse than the LongDivider, which is just about the only advantage to LongDivider. We can provide an option to reduce latency/resources required due to the signed/unsigned and restore stages at the cost of some timing closure.

elaborate(plat)
class smolarith.div.Outputs(width)

Tagged union representing the signedness of divider outputs.

  • When sign is UNSIGNED, q and r should be treated as Values with an as_unsigned Shape.

  • When sign is SIGNED, q and r should be treated as Values with a as_signed Shape.

Parameters:

width (int) – Width in bits of the outputs q and r. For signed dividers, this includes the sign bit.

sign

Indicates whether the divider that produced this quotient/remainder was signed or unsigned.

Type:

Sign

q

The quotient of \(n / d\).

Type:

Signal(width)

r

The remainder of \(n / d\), i.e. \(n \bmod d\).

Type:

Signal(width)

class smolarith.div.Sign(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Indicate the type of divide to be performed.

  • UNSIGNED: Both inputs n and d are unsigned.

    The output is unsigned.

  • SIGNED: Both inputs n and d are unsigned.

    The quotient and remainder are signed. The remainder takes the sign of the dividend.

SIGNED = 2
UNSIGNED = 1
smolarith.div.divider_input_signature(width)

Create a parametric divider input port.

This function returns a Signature that’s usable as a transfer initiator to a divider. A divider starts on the current cycle when both valid and rdy are asserted.

Parameters:

width (int) – Width in bits of the inputs n and d. For signed divides, this includes the sign bit.

Returns:

Signature containing the following members:

.payload: Out(Inputs)

Data input to divider.

.ready: In(1)

When 1, indicates that divider is ready.

.valid: Out(1)

When 1, indicates that divider data input is valid.

Return type:

Signature

smolarith.div.divider_output_signature(width)

Create a parametric divider output port.

This function returns a Signature that’s usable as a transfer initiator from a divider.

Note

For a core responding to a divider, which is the typical use case, you will want to use this Signature with the In flow, like so:

>>> from smolarith.div import divider_output_signature
>>> from amaranth.lib.wiring import Signature, In
>>> my_receiver_sig = Signature({
...     "inp": In(divider_output_signature(width=8))
... })
Parameters:

width (int) – Width in bits of the outputs q and r. For signed divides, this includes the sign bit.

Returns:

Signature containing the following members:

.payload: Out(Outputs)

Data output from divider.

.ready: In(1)

When 1, indicates that responder is ready to receive results from divider.

.valid: Out(1)

When 1, indicates that divider output data input is valid.

Return type:

Signature