smolarith
API
smolarith
does not export anything from the top-level; you must import the
module you want to use, perhaps with an alias, e.g.
>>> import smolarith.mul as mul
Multiplication
Soft-core multiplier components.
- class smolarith.mul.Inputs(width)
Tagged union representing the signedness of multiplier inputs.
When
sign
isUNSIGNED
, botha
andb
are treated by the multiplier asValue
s with anunsigned
Shape
.When
sign
isSIGNED
, botha
andb
are treated as Values with asigned
Shape.When
sign
isSIGNED_UNSIGNED
,a
is treated as a Value with a signed Shape, whileb
is treated as a Value with an unsigned Shape.
- Parameters:
width (int) – Width in bits of both inputs
a
andb
. For signed multiplies, this includes the sign bit.
- a
The multiplicand; i.e. the ‘\(a\)’ in \(a * b\).
- Type:
Signal(width)
- b
The multiplier; i.e. the ‘\(b\)’ in \(a * b\).
- Type:
Signal(width)
- class smolarith.mul.MulticycleMul(*args, src_loc_at=0, **kwargs)
Multicycle multiplier soft-core.
This multiplier core is a gateware implementation of shift-add multiplication.
A multiply starts on the current cycle when both
inp.valid
andinp.ready
are asserted.A multiply result is available when
outp.valid
is asserted. The result is read/overwritten once a downstream core assertsoutp.ready
.Latency: Multiply Results for a will be available
width
clock cycles after assertion of bothinp.valid
andinp.ready
.Throughput: One multiply maximum is finished every
width
clock cycles.
- Parameters:
width (int) – Width in bits of both inputs
a
andb
. For signed multiplies, this includes the sign bit. Outputo
width will be \(2*n\).
- inp
Input interface to the multiplier.
- Type:
In(multiplier_input_signature(width))
- outp
Output interface of the multiplier.
- Type:
Out(multiplier_output_signature(width))
Notes
This multiplier is a naive shift-add implementation, similar to how pen-and-pad multiplication in base-2/10 works. Internally, the multiplier treats the multiplier
a
as signed and the multiplicandb
as unsigned.a
’s signedness only matters for the most-negative value possible for a given bit-width \(n\), where twos-complementing would not change the bit-pattern. Therefore,a
is automatically sign-extended to \(n + 1\) bits in the input stage before any further processing.If
b
is negative, bothb
anda
are twos-complemented in the input stage; since \(a * b = -a * -b\), no inverse transformation on the output stage is needed.For an \(n\)-bit multiply, this multiplier requires \(O(3*n)\) storage elements (to store copies of the input, output, and intermediate results).
The output product and (possibly-inverted)
b
input share a backing store. This works because onlyb
’s LSb is used each cycle and needs to be shifted out at the end of the cycle. Consequently, upper bits ofo
can be shifted in while lower bits ofb
are shifted out.
- elaborate(platform)
- class smolarith.mul.Outputs(width)
Tagged union representing the signedness of multiplier output.
When
sign
isUNSIGNED
,o
should be treated as aValue
with anas_unsigned
Shape
.When
sign
isSIGNED
orSIGNED_UNSIGNED
,o
should be treated as a Value with aas_signed
Shape.
- Parameters:
width (int) – Width in bits of the output
o
. For signed multiplies, this includes the sign bit.
- sign
Indicates whether the multiply that produced this product was signed, unsigned, or signed-unsigned.
- Type:
- o
The product of \(a * b\).
- Type:
Signal(width)
- class smolarith.mul.PipelinedMul(*args, src_loc_at=0, **kwargs)
Multiplier soft-core which pipelines inputs.
This multiplier core has pipeline registers that stores intermediate results for up to
width
multiplies at once. Basic control flow is implemented:A multiply starts on the current cycle when both
inp.valid
andinp.ready
are asserted.A multiply result is available when
outp.valid
is asserted. The result is read/overwritten once a downstream core assertsoutp.ready
.inp.ready
de-asserts on any cycle whereoutp.valid
is asserted butoutp.ready
is not asserted. This is a pipeline stall.Latency: Multiply Results for a given multiply will be available
width
clock cycles after the multiplier has seen those inputs, assuming no stalls.If stalls occur (
outp.valid
is asserted whileoutp.ready
is unasserted), latency increases by the length of the stalls in clock cycles while the given multiply was in the pipeline.Throughput: One multiply maximum is finished per clock cycle.
- Parameters:
- inp
Input interface to the multiplier.
- Type:
In(multiplier_input_signature(width))
- outp
Output interface of the multiplier.
- Type:
Out(multiplier_output_signature(width))
Notes
This multiplier is a naive shift-add implementation, similar to how pen-and-pad multiplication in base-2/10 works. Internally, the multiplier treats the multiplier
a
as signed and the multiplicandb
as unsigned.a
’s signedness only matters for the most-negative value possible for a given bit-width \(n\), where twos-complementing would not change the bit-pattern. Therefore,a
is automatically sign-extended to \(n + 1\) bits in the input stage before any further processing.If
b
is negative, bothb
anda
are twos-complemented in the input stage; since \(a * b = -a * -b\), no inverse transformation on the output stage is needed.For an \(n\)-bit multiply, this multiplier requires \(O(n^2)\) storage elements (to store intermediate results).
The pipeline will happily perform multiplies on inputs where
inp.valid
is not asserted; the core will not assertoutp.valid
when such multiplies reach the output interface (and thus they will be discarded).For simplicity of implementation, and under the assumption that stalls will be rare, a pipeline stall stops the entire core. The core does not attempt to fill pipeline stages if the output interface isn’t ready.
Future Directions
It is possible to implement a \(2*n\)-bit multiply using 3 \(n\)-bit multiplies. Since this library is about smol arithmetic, it may be worthwhile to create classes for Karatsuba multipliers.
Larger multpliers using a smaller
PipelinedMul
will still potentially be quite fast :).Stalls stop the entire pipeline because validity information is not forwarded to earlier pipeline stages. Latency in the presence of stalls could be reduced by quashing invalid multiplies at other points in the pipeline besides the output. It is not much code complexity to add this, but impact on timing and size is not clear.
- elaborate(platform)
- class smolarith.mul.Sign(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Indicate the type of multiply to be performed.
UNSIGNED
: Both inputsa
andb
are unsigned.The output is unsigned.
SIGNED
: Both inputsa
andb
are unsigned.The output is signed.
SIGNED_UNSIGNED
: Inputa
is signed and inputb
is unsigned.The output is signed.
Note that for \(n\)-bit multiply with given bit patterns for
a
andb
, the bottom \(n/2\) bits will be identical in anUNSIGNED
orSIGNED_UNSIGNED
multiply.
- SIGNED = 2
- SIGNED_UNSIGNED = 3
- UNSIGNED = 1
- smolarith.mul.multiplier_input_signature(width)
Create a parametric multiplier input port.
This function returns a
Signature
that’s usable as a transfer initiator to a multiplier. A multiply starts on the current cycle when bothvalid
andrdy
are asserted.- Parameters:
width (int) – Width in bits of the inputs
a
andb
. For signed multiplies, this includes the sign bit.- Returns:
Signature
containing the following members:- .payload: Out(Inputs)
Data input to multiplier.
- .ready: In(1)
When
1
, indicates that multiplier is ready.
- .valid: Out(1)
When
1
, indicates that multiplier data input is valid.
- Return type:
- smolarith.mul.multiplier_output_signature(width)
Create a parametric multiplier output port.
This function returns a
Signature
that’s usable as a transfer initiator from a multiplier.Note
For a core responding to a multiplier, which is the typical use case, you will want to use this Signature with the
In
flow, like so:>>> from smolarith.mul import multiplier_output_signature >>> from amaranth.lib.wiring import Signature, In >>> my_receiver_sig = Signature({ ... "inp": In(multiplier_output_signature(width=8)) ... })
- Parameters:
width (int) – Width in bits of output
o
. For signed multiplies, this includes the sign bit.- Returns:
Signature
containing the following members:- .payload: Out(Outputs)
Data output from multiplier.
- .ready: In(1)
When
1
, indicates that responder is ready to receive results from multiplier.
- .valid: Out(1)
When
1
, indicates that multiplier output data input is valid.
- Return type:
Division
Soft-core divider components.
- class smolarith.div.Impl(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Indicate the division algorithm to use inside a dividing component.
RESTORING
: Use restoring division for the internal divider. This is a good default.NON_RESTORING
: Use non-restoring division for the internal divider.
- NON_RESTORING = 2
- RESTORING = 1
- class smolarith.div.Inputs(width)
Tagged union representing the signedness of divider inputs.
When
sign
isUNSIGNED
, bothn
andd
are treated by the divider asValue
s with anunsigned
Shape
.When
sign
isSIGNED
, bothn
andd
are treated as Values with asigned
Shape.
- Parameters:
width (int) – Width in bits of both inputs
n
andd
. For signed divides, this includes the sign bit.
- n
The dividend; i.e. the ‘\(n\)’ in \(n / d\).
- Type:
Signal(width)
- d
The divisor; i.e. the ‘\(d\)’ in \(n / d\).
- Type:
Signal(width)
- class smolarith.div.LongDivider(*args, src_loc_at=0, **kwargs)
Long-division soft-core, used as a reference.
This divider core is a gateware implementation of classic long division.
A divide starts on the current cycle when both
inp.valid
andinp.ready
are asserted.A divide result is available when
outp.valid
is asserted. The result is read/overwritten once a downstream core assertsoutp.ready
.
Warning
This core is not intended to be used in user designs due to poor resource usage. It is mainly kept as a known-to-work reference design for possible equivalence checking later. Use
MulticycleDiv
instead.Latency: Divide Results for a will be available
width
clock cycles after assertion of bothinp.valid
andinp.ready
.Throughput: One divide maximum is finished every
width
clock cycles.
- Parameters:
width (int) – Width in bits of both inputs
n
andd
. For signed multiplies, this includes the sign bit. Outputsq
andr
width will be the same width.
- inp
Input interface to the divider.
- Type:
In(divider_input_signature(width))
- outp
Output interface of the divider.
- Type:
Out(divider_output_signature(width))
Notes
This divider is a naive long-division implementation, similar to how pen-and-pad division is done in base-2/10.
Internally, the divider is aware of the sign of its inputs as well as the sign of the divide operation.
The divider will dispatch to one of the 4 possible combinations of signs during the division loop. The four combinations are:
Add shifted copies of positive/negative powers of two together to form the quotient.
Add/subtract shifted copies of the divisor from the dividend towards zero to form the remainder.
This divider is compliant with RISC-V semantics for divide-by-zero and overflow when
width=32
orwidth=64
[rv]. Specifically:Signed/unsigned divide-by-zero returns “all bits set” for the quotient and the dividend as the remainder.
Signed overflow (\(-2^{\{31,63\}}/-1\)) returns \(-2^{\{31,63\}}\) for the quotient and \(0\) as the remainder.
- elaborate(platform)
- class smolarith.div.MulticycleDiv(*args, src_loc_at=0, **kwargs)
Restoring/Non-restoring divider soft-core.
This divider core is a gateware implementation of restoring division or non-restoring division. It works for both signed and unsigned divides, and should be preferred to
LongDivider
in almost all circumstances due to better resource usage.A divide starts on the current cycle when both
inp.valid
andinp.ready
are asserted.A divide result is available when
outp.valid
is asserted. The result is read/overwritten once a downstream core assertsoutp.ready
.Latency: Divide Results for a restoring divider will be available
width + 3
clock cycles after assertion of bothinp.valid
andinp.ready
. For a non-restoring divider, results are available afterwidth + 4
clock cycles.Throughput: One divide maximum is finished every
width + 3
clock cycles for a restoring divider, and everywidth + 4
for a non-restoring divider.
- Parameters:
- inp
Input interface to the divider.
- Type:
In(divider_input_signature(width))
- outp
Output interface of the divider.
- Type:
Out(divider_output_signature(width))
Notes
This divider is implemented using either restoring or non-restoring division. It is basically a gateware translation of the Python implementations shown in Implementation Notes.
Unlike the multipliers, where I could eyeball approximate storage element usage, divider size was benchmarked using yosys‘s
synth_ice40
pass. For an \(n\)-bit divide:impl=Impl.RESTORING
requires approximately \(O(7*n)\) storage elements and \(O(12.75*n)\) LUT4s.impl=Impl.NON_RESTORING
requires approximately \(O(8*n)\) storage elements and \(O(14.5*n)\) LUT4s.
Internally, the divider converts its operands from signed to unsigned if necessary, performs the division, and the converts back from unsigned to signed if necessary. I ran into some issues with making a signed-aware non-restoring divider such that eating the conversion cost latency is probably justifiable for implementation simplicity of both a restoring and non-restoring divider.
The combination of signed-to-unsigned conversion, using a cycle to latching inputs to the internal divider, and unsigned-to-signed conversion adds three cycles of latency beyond the expected
width
number of cycles to perform a division.Additionally, when using a non-restoring divider, the quotient and remainder require a possible final restore step. Remainder restore is implemented as shown in the Python code. Quotient restore is implemented by subtracting the ones complement of the raw quotient from the raw quotient itself.
The quotient and remainder share a backing store; new bits are shifted into the lower-half quotient portion as the remainder is shifted into the upper half. This works because each iteration of the restoring/non-restoring loops check the sign of the remainder, and the lower quotient bits won’t affect the sign bit.
This divider is compliant with RISC-V semantics for divide-by-zero and overflow when
width=32
orwidth=64
[rv]. Specifically:Signed/unsigned divide-by-zero returns “all bits set” for the quotient and the dividend as the remainder.
Signed overflow (\(-2^{\{31,63\}}/-1\)) returns \(-2^{\{31,63\}}\) for the quotient and \(0\) as the remainder.
Future Directions
It may be worthwhile to make a pipelined divider?
Current latency is worse than the
LongDivider
, which is just about the only advantage toLongDivider
. We can provide an option to reduce latency/resources required due to the signed/unsigned and restore stages at the cost of some timing closure.
- elaborate(plat)
- class smolarith.div.Outputs(width)
Tagged union representing the signedness of divider outputs.
When
sign
isUNSIGNED
,q
andr
should be treated asValue
s with anas_unsigned
Shape
.When
sign
isSIGNED
,q
andr
should be treated as Values with aas_signed
Shape.
- Parameters:
width (int) – Width in bits of the outputs
q
andr
. For signed dividers, this includes the sign bit.
- sign
Indicates whether the divider that produced this quotient/remainder was signed or unsigned.
- Type:
- q
The quotient of \(n / d\).
- Type:
Signal(width)
- r
The remainder of \(n / d\), i.e. \(n \bmod d\).
- Type:
Signal(width)
- class smolarith.div.Sign(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Indicate the type of divide to be performed.
UNSIGNED
: Both inputsn
andd
are unsigned.The output is unsigned.
SIGNED
: Both inputsn
andd
are unsigned.The quotient and remainder are signed. The remainder takes the sign of the dividend.
- SIGNED = 2
- UNSIGNED = 1
- smolarith.div.divider_input_signature(width)
Create a parametric divider input port.
This function returns a
Signature
that’s usable as a transfer initiator to a divider. A divider starts on the current cycle when bothvalid
andrdy
are asserted.- Parameters:
width (int) – Width in bits of the inputs
n
andd
. For signed divides, this includes the sign bit.- Returns:
Signature
containing the following members:- .payload: Out(Inputs)
Data input to divider.
- .ready: In(1)
When
1
, indicates that divider is ready.
- .valid: Out(1)
When
1
, indicates that divider data input is valid.
- Return type:
- smolarith.div.divider_output_signature(width)
Create a parametric divider output port.
This function returns a
Signature
that’s usable as a transfer initiator from a divider.Note
For a core responding to a divider, which is the typical use case, you will want to use this Signature with the
In
flow, like so:>>> from smolarith.div import divider_output_signature >>> from amaranth.lib.wiring import Signature, In >>> my_receiver_sig = Signature({ ... "inp": In(divider_output_signature(width=8)) ... })
- Parameters:
width (int) – Width in bits of the outputs
q
andr
. For signed divides, this includes the sign bit.- Returns:
Signature
containing the following members:- .payload: Out(Outputs)
Data output from divider.
- .ready: In(1)
When
1
, indicates that responder is ready to receive results from divider.
- .valid: Out(1)
When
1
, indicates that divider output data input is valid.
- Return type: