Updates to Introduction and edits to bit locations #15
I added some more changes to the floating-point semantics. I expect those to be somewhat more contentious. Looking forward to hearing from @ptomsich and @efocht-oct. I am reviewing the micro-scaling next.
where the roundings are performed with the rounding mode from `frm`.
The rounding of partial sum S _before_ it is accumulated to the running value of C[m,n] is optional.
* After each group, the accumulated partial sum is rounded to C's precision (SEW) using an _implementation-defined_ rounding mode and _added to the running value of C[m, n]_.
Why specify this intermediate rounding for partial sums, if it's implementation defined?
"Implementation defined" is too vague and cannot be tested for compliance. I have restricted things a bit, so that implementations can be more easily tested and match industry practice. If there are more things we want to license, we can add that. But the way it is there now enables pretty much anything one wants to do.
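To make the testability concern concrete, here is a minimal sketch of the grouped accumulation discussed in this thread. It is not the specification's semantics: the names `round_to_f32` and `grouped_dot` and the `group_size` parameter are hypothetical, C is assumed to be binary32, partial sums are assumed to be held in binary64, and round-to-nearest-even stands in for whatever mode `frm` or the implementation would select.

```python
import struct


def round_to_f32(x: float) -> float:
    """Round a binary64 value to binary32 (round-to-nearest-even), returned as a float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def grouped_dot(a, b, c0, group_size, round_partial=True):
    """Accumulate sum(a[k] * b[k]) into c0, one group of products at a time.

    Assumptions (not taken from the specification): C is binary32, the partial
    sum S is kept in binary64, and every rounding is round-to-nearest-even.
    """
    c = round_to_f32(c0)
    for start in range(0, len(a), group_size):
        s = 0.0  # partial sum S for this group, wider than C
        for x, y in zip(a[start:start + group_size], b[start:start + group_size]):
            s += x * y
        if round_partial:
            # Optional step: round S to C's precision before accumulating.
            s = round_to_f32(s)
        # The accumulation into the running value of C is always rounded.
        c = round_to_f32(c + s)
    return c


a = [2.0 ** -24, 2.0 ** -50]
b = [1.0, 1.0]
with_rounding = grouped_dot(a, b, 1.0, group_size=2, round_partial=True)
without_rounding = grouped_dot(a, b, 1.0, group_size=2, round_partial=False)
print(with_rounding, without_rounding)
```

With these inputs the two variants differ by one binary32 ulp: rounding S first produces an exact tie that resolves to 1.0, while skipping it yields 1.0 + 2^-23. A compliance test therefore has to know which behaviours are licensed.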
There are additional edits for tile loads/stores, including a special case for (rs2) = 0 so that hardware can optimize for the micro-kernel.
Just a quick note on the bit locations: we inserted the (original) editorial note, as we couldn't make a world-breaking change across our QEMU, testsuite, etc. and keep the timeline for this document drop…
I am merging this pull request after a few edits for clarity. In particular, I simplified the guidelines for VLEN-portable code to include the "dynamic code path" form that we previously discussed as the preferred approach.
Merged commit daae96e into riscv:integrated-matrix-extension
* Updates to the introduction
* Editorial notes on bit locations
* Revised floating-point rounding rules
* Revised floating-point rounding rules
* Revised floating-point rounding rules
* Special case for tile loads/stores when (rs2) = 0
* Inputs don't have their own SEW, just EEW
* Added arithmetic considerations to mixed-format inputs
* Added arithmetic considerations to mixed-format inputs
* Made semantics of micro-scaling computations clearer
* Used byte addresses in the definitions of tile load/store
* Used byte addresses in the definitions of tile load/store
* Clarify valid values of VL
* Clarify that tile loads must use target SEW
* Clarify guidelines for portable IME code
Process all 28 items from the IME TG internal review feedback tracker.

Subextension dependencies (#3): Replace blanket Zve64d dependency with the minimum Zve subset per subextension: Zve32x for integer accumulators ≤ 32-bit, Zve64x for Int64 accumulators, Zve32f for FP accumulators ≤ 32-bit, and Zve64d only for FP64 accumulators.

8× widening instructions (#7, #8, #9, #24): Add v8wmmacc.vv (funct6=0x3b, OPIVV), vf8wmmacc.vv (funct6=0x17, OPFVV), and vf8wimmacc.vv (integer-input MX variant, vm=0 of v8wmmacc) with full instruction definitions, SAIL pseudocode, encoding diagrams, and exception tables. Update encoding maps (FP, integer, integer MX) with W=8 entries. Add Zvvxi4fp32mm and Zvvxni4fp32mm to the MX subextension table. Replace the informative NOTE about reserved W=8 encoding space with normative text. Remove the undefined term "octal-widening".

MXINT4 clarification and OCP citation (#14): Define MXINT4 as analogous to OCP MX's MXINT8 but with 4-bit signed elements. Add proper citation of the OCP Microscaling Formats (MX) v1.0 Specification with URL. Update microscaling applicability to include vf8wmmacc.vv.

vfmmacc.vv vm=0 cleanup (#13, #28): Remove contradictory "When vm=0" exception bullets (vm=0 is reserved for non-widening FP). Replace dead microscaling SAIL code with a straightforward non-widening FP GEMM loop. Add explicit note that microscaling is not supported for non-widening multiply-accumulate.

Terminology fixes (#15, #21): Add forward cross-reference at first use of altfmt_A/altfmt_B. Correct two occurrences where λ was described as "the K dimension" to "tile-layout parameter", clarifying that K_eff = λ × W × LMUL is the derived effective K dimension.
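Two of the rules in this commit message can be written down as small functions: the minimum Zve subset per accumulator type, and the derived effective K dimension. The sketch below restates only what the message says. The function names `min_zve` and `k_eff` are hypothetical, and the argument values in the usage lines are illustrative rather than taken from the specification.

```python
from fractions import Fraction


def min_zve(acc_is_fp: bool, acc_bits: int) -> str:
    """Return the minimum Zve subset for an accumulator type, per the message above."""
    if acc_bits <= 32:
        return "Zve32f" if acc_is_fp else "Zve32x"
    if acc_bits == 64:
        return "Zve64d" if acc_is_fp else "Zve64x"
    # Widths other than <= 32 or exactly 64 are not covered by the message.
    raise ValueError(f"unsupported accumulator width: {acc_bits}")


def k_eff(lam, w, lmul) -> Fraction:
    """Effective K dimension, K_eff = lambda * W * LMUL.

    lam is the tile-layout parameter (not the K dimension itself), w is the
    widening factor, and lmul may be fractional, hence the use of Fraction.
    """
    return Fraction(lam) * Fraction(w) * Fraction(lmul)


print(min_zve(acc_is_fp=False, acc_bits=32))
print(min_zve(acc_is_fp=True, acc_bits=64))
print(k_eff(4, 2, 1))
print(k_eff(2, 8, Fraction(1, 2)))
```

Using `Fraction` keeps the result exact when LMUL is fractional; whether every combination of λ, W, and LMUL is legal is a question for the specification, not this sketch.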
I know I am late to this, but getting there. Everything I read so far (up to Storage Formats) looks great to me! I just made some minor changes that I hope are non-controversial.