Updates to Introduction and edits to bit locations #15

Merged
joseemoreira merged 15 commits into riscv:integrated-matrix-extension from joseemoreira:integrated-matrix-extension
Mar 8, 2026
Conversation

@joseemoreira
Collaborator

I know I am late to this, but getting there. Everything I read so far (up to Storage Formats) looks great to me! I just made some minor changes that I hope are non-controversial.

@joseemoreira joseemoreira requested a review from ptomsich March 7, 2026 12:32
@joseemoreira joseemoreira requested a review from efocht-oct March 7, 2026 14:04
@joseemoreira
Collaborator Author

I added some more changes to the floating-point semantics. I expect those to be somewhat more contentious. Looking forward to hearing from @ptomsich and @efocht-oct.

I am reviewing the micro-scaling next.

where the roundings are performed with the rounding mode from `frm`.
The rounding of partial sum S _before_ it is accumulated to the running value of C[m,n] is optional.

* After each group, the accumulated partial sum is rounded to C's precision (SEW) using an _implementation-defined_ rounding mode and _added to the running value of C[m, n]_.
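
The hunk above leaves one rounding optional: whether the partial sum S is rounded to C's precision before it is added to the running value of C[m, n]. The following is a minimal sketch of why that choice is observable in a compliance test. It assumes binary32 for C, uses Python's binary64 arithmetic as a stand-in for a wider internal accumulator, and uses round-to-nearest-even in place of the `frm` rounding mode. The helper names `round_f32` and `grouped_dot` are hypothetical and do not come from the specification.

```python
import struct


def round_f32(x: float) -> float:
    """Round a binary64 value to binary32 (round-to-nearest-even)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def grouped_dot(a, b, c0, group, round_partial):
    """Accumulate sum(a[k] * b[k]) into c0, one group of `group` terms at a time.

    Assumption: binary64 models the wider internal precision of a group.
    If round_partial is True, the partial sum S is rounded to binary32
    before it is added to the running value; otherwise only the updated
    running value is rounded.
    """
    c = round_f32(c0)
    for start in range(0, len(a), group):
        s = 0.0
        for x, y in zip(a[start:start + group], b[start:start + group]):
            s += x * y  # accumulated in binary64 within the group
        if round_partial:
            s = round_f32(s)  # the optional rounding of S
        c = round_f32(c + s)  # rounding of the running value of C
    return c


a = [2.0 ** -24, 2.0 ** -50]
b = [1.0, 1.0]
with_rounding = grouped_dot(a, b, 1.0, group=2, round_partial=True)
without_rounding = grouped_dot(a, b, 1.0, group=2, round_partial=False)
print(with_rounding, without_rounding)
```

With these inputs the two variants disagree in the last bit: rounding S first discards the 2^-50 term, so the final addition is an exact tie that rounds to 1.0, whereas skipping that rounding yields 1.0 + 2^-23. A test suite that permits both behaviours therefore has to accept either result.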
Collaborator

Why specify this intermediate rounding for partial sums, if it's implementation defined?

Collaborator Author

"Implementation defined" is too vague and cannot be tested for compliance. I have restricted things a bit, so that implementations can be tested more easily and match industry practice. If there are more things we want to license, we can add them. But the way it is written now enables pretty much anything one wants to do.

@joseemoreira
Collaborator Author

There are additional edits for tile loads/stores. Special case for (rs2) = 0 so that hardware can optimize for the micro-kernel.

@ptomsich
Collaborator

ptomsich commented Mar 7, 2026

Just a quick note on the bit locations: we inserted the (original) editorial note, as we couldn't make a world-breaking change across our QEMU, testsuite, etc. and keep the timeline for this document drop…

@joseemoreira
Collaborator Author

I am merging this pull request after a few edits for clarity. In particular, I simplified the guidelines for VLEN-portable code to include the "dynamic code path" form that we previously discussed as the preferred approach.

@joseemoreira joseemoreira merged commit daae96e into riscv:integrated-matrix-extension Mar 8, 2026
3 checks passed
ptomsich pushed a commit that referenced this pull request Mar 9, 2026
* Updates to the introduction

* Editorial notes on bit locations

* Revised floating-point rounding rules

* Revised floating-point rounding rules

* Revised floating-point rounding rules

* Special case for tile loads/stores when (rs2) = 0

* Inputs don't have their own SEW, just EEW

* Added arithmetic considerations to mixed-format inputs

* Added arithmetic considerations to mixed-format inputs

* Made semantics of micro-scaling computations clearer

* Used byte addresses in the definitions of tile load/store

* Used byte addresses in the definitions of tile load/store

* Clarify valid values of VL

* Clarify that tile loads must use target SEW

* Clarify guidelines for portable IME code
ptomsich added a commit that referenced this pull request Mar 26, 2026
Process all 28 items from the IME TG internal review feedback tracker.

Subextension dependencies (#3):
  Replace blanket Zve64d dependency with the minimum Zve subset per
  subextension: Zve32x for integer accumulators ≤ 32-bit, Zve64x for
  Int64 accumulators, Zve32f for FP accumulators ≤ 32-bit, and Zve64d
  only for FP64 accumulators.
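
The dependency rule above amounts to a small lookup keyed on accumulator kind and width. The following is a sketch of that mapping only; the function name `min_zve` and the `"int"`/`"fp"` labels are hypothetical and do not appear in the specification.

```python
def min_zve(acc_kind: str, acc_bits: int) -> str:
    """Return the minimum Zve subset for an accumulator, per the rule above.

    acc_kind is "int" or "fp" (hypothetical labels); acc_bits is the
    accumulator width in bits.
    """
    if acc_bits <= 0 or acc_bits > 64:
        raise ValueError(f"unsupported accumulator width: {acc_bits}")
    if acc_kind == "int":
        return "Zve32x" if acc_bits <= 32 else "Zve64x"
    if acc_kind == "fp":
        return "Zve32f" if acc_bits <= 32 else "Zve64d"
    raise ValueError(f"unknown accumulator kind: {acc_kind}")


for kind, bits in [("int", 32), ("int", 64), ("fp", 32), ("fp", 64)]:
    print(kind, bits, min_zve(kind, bits))
```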

8× widening instructions (#7, #8, #9, #24):
  Add v8wmmacc.vv (funct6=0x3b, OPIVV), vf8wmmacc.vv (funct6=0x17,
  OPFVV), and vf8wimmacc.vv (integer-input MX variant, vm=0 of
  v8wmmacc) with full instruction definitions, SAIL pseudocode,
  encoding diagrams, and exception tables.  Update encoding maps (FP,
  integer, integer MX) with W=8 entries.  Add Zvvxi4fp32mm and
  Zvvxni4fp32mm to the MX subextension table.  Replace the informative
  NOTE about reserved W=8 encoding space with normative text.  Remove
  the undefined term "octal-widening".

MXINT4 clarification and OCP citation (#14):
  Define MXINT4 as analogous to OCP MX's MXINT8 but with 4-bit signed
  elements.  Add proper citation of the OCP Microscaling Formats (MX)
  v1.0 Specification with URL.  Update microscaling applicability to
  include vf8wmmacc.vv.

vfmmacc.vv vm=0 cleanup (#13, #28):
  Remove contradictory "When vm=0" exception bullets (vm=0 is reserved
  for non-widening FP).  Replace dead microscaling SAIL code with a
  straightforward non-widening FP GEMM loop.  Add explicit note that
  microscaling is not supported for non-widening multiply-accumulate.

Terminology fixes (#15, #21):
  Add forward cross-reference at first use of altfmt_A/altfmt_B.
  Correct two occurrences where λ was described as "the K dimension"
  to "tile-layout parameter", clarifying that K_eff = λ × W × LMUL is
  the derived effective K dimension.
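
The relation K_eff = λ × W × LMUL from the terminology fix can be checked with a short worked example. The sketch below assumes only that formula. The function name `k_eff` and the sample parameter values are illustrative and are not taken from the specification; `Fraction` is used so that a fractional LMUL, if one were passed, would not lose precision.

```python
from fractions import Fraction


def k_eff(lam: int, w: int, lmul) -> Fraction:
    """Effective K dimension derived from the tile-layout parameter λ,
    the widening factor W, and LMUL: K_eff = λ × W × LMUL."""
    return Fraction(lam) * Fraction(w) * Fraction(lmul)


# Illustrative values only: λ = 4, W = 2, LMUL = 1 gives K_eff = 8.
print(k_eff(4, 2, 1))
# With W = 8 (the 8× widening instructions) and λ = 2, LMUL = 2, K_eff = 32.
print(k_eff(2, 8, 2))
```

This makes the distinction in the commit message concrete: λ on its own is a layout parameter, and the K dimension that a kernel actually iterates over scales with W and LMUL as well.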
2 participants