smolarith API
smolarith does not export anything from the top-level; you must import the
module you want to use, perhaps with an alias, e.g.
>>> import smolarith.mul as mul
Multiplication
Soft-core multiplier components.
- class smolarith.mul.Inputs(width)
Tagged union representing the signedness of multiplier inputs.
When
signisUNSIGNED, bothaandbare treated by the multiplier asValues with anunsignedShape.When
signisSIGNED, bothaandbare treated as Values with asignedShape.When
signisSIGNED_UNSIGNED,ais treated as a Value with a signed Shape, whilebis treated as a Value with an unsigned Shape.
- Parameters:
width (int) – Width in bits of both inputs
aandb. For signed multiplies, this includes the sign bit.
- a
The multiplicand; i.e. the ‘\(a\)’ in \(a * b\).
- Type:
Signal(width)
- b
The multiplier; i.e. the ‘\(b\)’ in \(a * b\).
- Type:
Signal(width)
- class smolarith.mul.MulticycleMul(*args, src_loc_at=0, **kwargs)
Multicycle multiplier soft-core.
This multiplier core is a gateware implementation of shift-add multiplication.
A multiply starts on the current cycle when both
inp.validandinp.readyare asserted.A multiply result is available when
outp.validis asserted. The result is read/overwritten once a downstream core assertsoutp.ready.Latency: Multiply Results for a will be available
widthclock cycles after assertion of bothinp.validandinp.ready.Throughput: One multiply maximum is finished every
widthclock cycles.
- Parameters:
width (int) – Width in bits of both inputs
aandb. For signed multiplies, this includes the sign bit. Outputowidth will be \(2*n\).
- inp
Input interface to the multiplier.
- Type:
In(multiplier_input_signature(width))
- outp
Output interface of the multiplier.
- Type:
Out(multiplier_output_signature(width))
Notes
This multiplier is a naive shift-add implementation, similar to how pen-and-pad multiplication in base-2/10 works. Internally, the multiplier treats the multiplier
aas signed and the multiplicandbas unsigned.a’s signedness only matters for the most-negative value possible for a given bit-width \(n\), where twos-complementing would not change the bit-pattern. Therefore,ais automatically sign-extended to \(n + 1\) bits in the input stage before any further processing.If
bis negative, bothbandaare twos-complemented in the input stage; since \(a * b = -a * -b\), no inverse transformation on the output stage is needed.For an \(n\)-bit multiply, this multiplier requires \(O(3*n)\) storage elements (to store copies of the input, output, and intermediate results).
The output product and (possibly-inverted)
binput share a backing store. This works because onlyb’s LSb is used each cycle and needs to be shifted out at the end of the cycle. Consequently, upper bits ofocan be shifted in while lower bits ofbare shifted out.
- elaborate(platform)
- class smolarith.mul.Outputs(width)
Tagged union representing the signedness of multiplier output.
When
signisUNSIGNED,oshould be treated as aValuewith anas_unsignedShape.When
signisSIGNEDorSIGNED_UNSIGNED,oshould be treated as a Value with aas_signedShape.
- Parameters:
width (int) – Width in bits of the output
o. For signed multiplies, this includes the sign bit.
- sign
Indicates whether the multiply that produced this product was signed, unsigned, or signed-unsigned.
- Type:
- o
The product of \(a * b\).
- Type:
Signal(width)
- class smolarith.mul.PipelinedMul(*args, src_loc_at=0, **kwargs)
Multiplier soft-core which pipelines inputs.
This multiplier core has pipeline registers that stores intermediate results for up to
widthmultiplies at once. Basic control flow is implemented:A multiply starts on the current cycle when both
inp.validandinp.readyare asserted.A multiply result is available when
outp.validis asserted. The result is read/overwritten once a downstream core assertsoutp.ready.inp.readyde-asserts on any cycle whereoutp.validis asserted butoutp.readyis not asserted. This is a pipeline stall.Latency: Multiply Results for a given multiply will be available
widthclock cycles after the multiplier has seen those inputs, assuming no stalls.If stalls occur (
outp.validis asserted whileoutp.readyis unasserted), latency increases by the length of the stalls in clock cycles while the given multiply was in the pipeline.Throughput: One multiply maximum is finished per clock cycle.
- Parameters:
- inp
Input interface to the multiplier.
- Type:
In(multiplier_input_signature(width))
- outp
Output interface of the multiplier.
- Type:
Out(multiplier_output_signature(width))
Notes
This multiplier is a naive shift-add implementation, similar to how pen-and-pad multiplication in base-2/10 works. Internally, the multiplier treats the multiplier
aas signed and the multiplicandbas unsigned.a’s signedness only matters for the most-negative value possible for a given bit-width \(n\), where twos-complementing would not change the bit-pattern. Therefore,ais automatically sign-extended to \(n + 1\) bits in the input stage before any further processing.If
bis negative, bothbandaare twos-complemented in the input stage; since \(a * b = -a * -b\), no inverse transformation on the output stage is needed.For an \(n\)-bit multiply, this multiplier requires \(O(n^2)\) storage elements (to store intermediate results).
The pipeline will happily perform multiplies on inputs where
inp.validis not asserted; the core will not assertoutp.validwhen such multiplies reach the output interface (and thus they will be discarded).For simplicity of implementation, and under the assumption that stalls will be rare, a pipeline stall stops the entire core. The core does not attempt to fill pipeline stages if the output interface isn’t ready.
Future Directions
It is possible to implement a \(2*n\)-bit multiply using 3 \(n\)-bit multiplies. Since this library is about smol arithmetic, it may be worthwhile to create classes for Karatsuba multipliers.
Larger multpliers using a smaller
PipelinedMulwill still potentially be quite fast :).Stalls stop the entire pipeline because validity information is not forwarded to earlier pipeline stages. Latency in the presence of stalls could be reduced by quashing invalid multiplies at other points in the pipeline besides the output. It is not much code complexity to add this, but impact on timing and size is not clear.
- elaborate(platform)
- class smolarith.mul.Sign(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Indicate the type of multiply to be performed.
- smolarith.mul.multiplier_input_signature(width)
Create a parametric multiplier input port.
This function returns a
Signaturethat’s usable as a transfer initiator to a multiplier. A multiply starts on the current cycle when bothvalidandrdyare asserted.- Parameters:
width (int) – Width in bits of the inputs
aandb. For signed multiplies, this includes the sign bit.- Returns:
Signature(Inputs)- Return type:
- smolarith.mul.multiplier_output_signature(width)
Create a parametric multiplier output port.
This function returns a
Signaturethat’s usable as a transfer initiator from a multiplier.Note
For a core responding to a multiplier, which is the typical use case, you will want to use this Signature with the
Inflow, like so:>>> from smolarith.mul import multiplier_output_signature >>> from amaranth.lib.wiring import Signature, In >>> my_receiver_sig = Signature({ ... "inp": In(multiplier_output_signature(width=8)) ... })
- Parameters:
width (int) – Width in bits of output
o. For signed multiplies, this includes the sign bit.- Returns:
Signature(Outputs)- Return type:
Division
Soft-core divider components.
- class smolarith.div.Impl(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Indicate the division algorithm to use inside a dividing component.
- RESTORING
Use restoring division for the internal divider. This is a good default.
- Type:
- NON_RESTORING
Use non-restoring division for the internal divider.
- Type:
- class smolarith.div.Inputs(width)
Tagged union representing the signedness of divider inputs.
When
signisUNSIGNED, bothnanddare treated by the divider asValues with anunsignedShape.When
signisSIGNED, bothnanddare treated as Values with asignedShape.
- Parameters:
width (int) – Width in bits of both inputs
nandd. For signed divides, this includes the sign bit.
- n
The dividend; i.e. the ‘\(n\)’ in \(n / d\).
- Type:
Signal(width)
- d
The divisor; i.e. the ‘\(d\)’ in \(n / d\).
- Type:
Signal(width)
- class smolarith.div.LongDivider(*args, src_loc_at=0, **kwargs)
Long-division soft-core, used as a reference.
This divider core is a gateware implementation of classic long division.
A divide starts on the current cycle when both
inp.validandinp.readyare asserted.A divide result is available when
outp.validis asserted. The result is read/overwritten once a downstream core assertsoutp.ready.
Warning
This core is not intended to be used in user designs due to poor resource usage. It is mainly kept as a known-to-work reference design for possible equivalence checking later. Use
MulticycleDivinstead.Latency: Divide Results for a will be available
widthclock cycles after assertion of bothinp.validandinp.ready.Throughput: One divide maximum is finished every
widthclock cycles.
- Parameters:
width (int) – Width in bits of both inputs
nandd. For signed multiplies, this includes the sign bit. Outputsqandrwidth will be the same width.
- inp
Input interface to the divider.
- Type:
In(divider_input_signature(width))
- outp
Output interface of the divider.
- Type:
Out(divider_output_signature(width))
Notes
This divider is a naive long-division implementation, similar to how pen-and-pad division is done in base-2/10.
Internally, the divider is aware of the sign of its inputs as well as the sign of the divide operation.
The divider will dispatch to one of the 4 possible combinations of signs during the division loop. The four combinations are:
Add shifted copies of positive/negative powers of two together to form the quotient.
Add/subtract shifted copies of the divisor from the dividend towards zero to form the remainder.
This divider is compliant with RISC-V semantics for divide-by-zero and overflow when
width=32orwidth=64[rv]. Specifically:Signed/unsigned divide-by-zero returns “all bits set” for the quotient and the dividend as the remainder.
Signed overflow (\(-2^{\{31,63\}}/-1\)) returns \(-2^{\{31,63\}}\) for the quotient and \(0\) as the remainder.
- elaborate(platform)
- class smolarith.div.MulticycleDiv(*args, src_loc_at=0, **kwargs)
Restoring/Non-restoring divider soft-core.
This divider core is a gateware implementation of restoring division or non-restoring division. It works for both signed and unsigned divides, and should be preferred to
LongDividerin almost all circumstances due to better resource usage.A divide starts on the current cycle when both
inp.validandinp.readyare asserted.A divide result is available when
outp.validis asserted. The result is read/overwritten once a downstream core assertsoutp.ready.Latency: Divide Results for a restoring divider will be available
width + 3clock cycles after assertion of bothinp.validandinp.ready. For a non-restoring divider, results are available afterwidth + 4clock cycles.Throughput: One divide maximum is finished every
width + 3clock cycles for a restoring divider, and everywidth + 4for a non-restoring divider.
- Parameters:
- inp
Input interface to the divider.
- Type:
In(divider_input_signature(width))
- outp
Output interface of the divider.
- Type:
Out(divider_output_signature(width))
Notes
This divider is implemented using either restoring or non-restoring division. It is basically a gateware translation of the Python implementations shown in Implementation Notes.
Unlike the multipliers, where I could eyeball approximate storage element usage, divider size was benchmarked using yosys‘s
synth_ice40pass. For an \(n\)-bit divide:impl=Impl.RESTORINGrequires approximately \(O(7*n)\) storage elements and \(O(12.75*n)\) LUT4s.impl=Impl.NON_RESTORINGrequires approximately \(O(8*n)\) storage elements and \(O(14.5*n)\) LUT4s.
Internally, the divider converts its operands from signed to unsigned if necessary, performs the division, and the converts back from unsigned to signed if necessary. I ran into some issues with making a signed-aware non-restoring divider such that eating the conversion cost latency is probably justifiable for implementation simplicity of both a restoring and non-restoring divider.
The combination of signed-to-unsigned conversion, using a cycle to latching inputs to the internal divider, and unsigned-to-signed conversion adds three cycles of latency beyond the expected
widthnumber of cycles to perform a division.Additionally, when using a non-restoring divider, the quotient and remainder require a possible final restore step. Remainder restore is implemented as shown in the Python code. Quotient restore is implemented by subtracting the ones complement of the raw quotient from the raw quotient itself.
The quotient and remainder share a backing store; new bits are shifted into the lower-half quotient portion as the remainder is shifted into the upper half. This works because each iteration of the restoring/non-restoring loops check the sign of the remainder, and the lower quotient bits won’t affect the sign bit.
This divider is compliant with RISC-V semantics for divide-by-zero and overflow when
width=32orwidth=64[rv]. Specifically:Signed/unsigned divide-by-zero returns “all bits set” for the quotient and the dividend as the remainder.
Signed overflow (\(-2^{\{31,63\}}/-1\)) returns \(-2^{\{31,63\}}\) for the quotient and \(0\) as the remainder.
Future Directions
It may be worthwhile to make a pipelined divider?
Current latency is worse than the
LongDivider, which is just about the only advantage toLongDivider. We can provide an option to reduce latency/resources required due to the signed/unsigned and restore stages at the cost of some timing closure.
- elaborate(plat)
- class smolarith.div.Outputs(width)
Tagged union representing the signedness of divider outputs.
When
signisUNSIGNED,qandrshould be treated asValues with anas_unsignedShape.When
signisSIGNED,qandrshould be treated as Values with aas_signedShape.
- Parameters:
width (int) – Width in bits of the outputs
qandr. For signed dividers, this includes the sign bit.
- sign
Indicates whether the divider that produced this quotient/remainder was signed or unsigned.
- Type:
- q
The quotient of \(n / d\).
- Type:
Signal(width)
- r
The remainder of \(n / d\), i.e. \(n \bmod d\).
- Type:
Signal(width)
- class smolarith.div.Sign(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Indicate the type of divide to be performed.
- smolarith.div.divider_input_signature(width)
Create a parametric divider input port.
This function returns a
Signaturethat’s usable as a transfer initiator to a divider. A divider starts on the current cycle when bothvalidandrdyare asserted.- Parameters:
width (int) – Width in bits of the inputs
nandd. For signed divides, this includes the sign bit.- Returns:
Signature(Inputs)- Return type:
- smolarith.div.divider_output_signature(width)
Create a parametric divider output port.
This function returns a
Signaturethat’s usable as a transfer initiator from a divider.Note
For a core responding to a divider, which is the typical use case, you will want to use this Signature with the
Inflow, like so:>>> from smolarith.div import divider_output_signature >>> from amaranth.lib.wiring import Signature, In >>> my_receiver_sig = Signature({ ... "inp": In(divider_output_signature(width=8)) ... })
- Parameters:
width (int) – Width in bits of the outputs
qandr. For signed divides, this includes the sign bit.- Returns:
Signature(Outputs)- Return type: