Commit dfc8994

Added 4 figures. Edited overview. Added latex stem for proper math formulas. (#16)
Signed-off-by: Jose Moreira <jmoreira@us.ibm.com>
Co-authored-by: Jose Moreira <jmoreira@us.ibm.com>
1 parent daae96e commit dfc8994

5 files changed

Lines changed: 44 additions & 16 deletions


src/images/png/ime-geometry.png


src/integrated-matrix.adoc

Lines changed: 44 additions & 16 deletions
@@ -1,14 +1,16 @@
11
[[IME]]
22
== Zvvm Family of Integrated Matrix Extensions
33

4+
:stem: latexmath
5+
46
=== Introduction
57

68
High-performance computing and machine learning workloads depend critically on general matrix multiplication (GEMM) over a wide range of data types and precisions.
79
Dedicated matrix-multiply accelerators often require new register state—separate matrix register files—to achieve competitive throughput, introducing substantial architectural complexity and binary interface disruption.
810

911
The Zvvm family of Integrated Matrix extensions (Zvvmm, Zvvfmm, Zvvmtls) takes a different approach: it accelerates matrix multiplication using _only_ the 32 × VLEN architected vector registers already defined by the RISC-V "V" vector extension.
1012
By interpreting groups of existing vector registers as two-dimensional matrix tiles, the Zvvm family of Integrated Matrix extensions delivers high arithmetic density without introducing any new architected state.
11-
We focus, in particular, on the computation of C ← A × B^T^ + C, where A (μ × λ) and B (ν × λ) are row-major matrix panels and C (μ × ν) is row-major.
13+
We focus, in particular, on the computation of C ← A × B^T^ + C, where A (μ × λ), B (ν × λ) and C (μ × ν) are row-major matrix panels.
1214

1315
The extensions are designed to support implementations spanning a wide range of microarchitectures and performance points: from small, embedded in-order cores targeting low-power and area-constrained applications, to large, high-performance out-of-order implementations targeting HPC and AI workloads.
1416
A key design goal is that the same binary executes correctly—and achieves near-peak arithmetic throughput—across this entire range without recompilation.
@@ -21,14 +23,20 @@ The Zvvm family of Integrated Matrix extensions (Zvvmm, Zvvfmm, Zvvmtls) provide
2123
* <<#zvvfmm>>: multiply-accumulate instructions for floating-point matrix tiles
2224
* <<#zvvmtls>>: two-dimensional load/store instructions for moving data between memory and vector registers interpreted as matrix tiles
2325

24-
==== Matrix tile geometry
26+
==== Matrix tile multiplication geometry
27+
28+
The geometry of the multiplier and the tiles is defined by the new parameter `lambda` (λ), which is encoded in 3 bits in the `vtype` CSR, together with the vector operation parameters: the widening factor `W` of the multiplication (encoded in the instruction), `LMUL`, `SEW`, and `VLEN`.
29+
30+
[#ime-geometry-fig]
31+
.Geometry of matrix tiles and element ordering for 32-element vector registers and λ=4. VRs are interpreted as 2D tiles. Vector element indices show the tile element order. (a) Non-widening case with A, B, C having the same SEW. (b) Widening case with A and B having half the SEW of C (double-packing).
32+
image::images/png/ime-geometry.png[width=100%, align=center, alt="Diagram of matrix tile geometry and multiplier configuration parameters."]
2533

2634
Matrix tiles are represented using the existing RISC-V V register file and its configuration state.
2735
The three matrices in the multiply-accumulate operation C ← A × B^T^ + C are stored as follows:
2836

2937
* The _accumulator_ C is stored in a vector register group with element width SEW.
3038
Its register group multiplier MUL_C is determined by the tile geometry:
31-
MUL_C = (VLEN / SEW) / λ², where λ is the K dimension given by the `lambda[2:0]` field in `vtype`.
39+
MUL_C = (VLEN / SEW) / λ^2^, where λ is the K dimension given by the `lambda[2:0]` field in `vtype`.
3240
The C register group may start at any vector register index that is MUL_C-aligned.
3341
MUL_C ∈ {1, 2, 4, 8, 16}.
3442
If MUL_C = 16, the only allowed vector register indices are 0 and 16.
@@ -40,11 +48,23 @@ The three matrices in the multiply-accumulate operation C ← A × B^T^ + C are
4048
Only integer values of LMUL are supported by the Zvvm family of Integrated Matrix extensions: LMUL ∈ {1, 2, 4, 8}.
4149
Fractional LMUL settings (LMUL < 1) are reserved and shall raise an illegal-instruction exception when used with any IME instruction.
4250

51+
The elements in the vector registers are contiguous in the λ direction, as depicted in
52+
<<ime-geometry-fig>>. Tile A and C elements are sorted in row-major order while tile B^T^
53+
elements are sorted in column-major order. This choice allows the implementation of the
54+
matrix tile multiplication as an inner product (e.g., a systolic array) or an outer product and
55+
simplifies the implementation of high-rank updates based on outer products.
56+
57+
[#ime-tile-lmul-fig]
58+
.LMUL scaling of matrix tiles for LMUL=2 (left) and LMUL=4 (right). LMUL scales the tile along the K dimension only, increasing the effective K dimension of the tile by a factor of LMUL. The blue arrows indicate the contiguous elements in the vector register groups that form the tiles.
59+
image::images/png/ime-tile-lmul_2_4.png[width=100%, align=center, alt="Diagram showing LMUL scaling of matrix tiles."]
60+
61+
==== Subextensions of Zvvm
62+
4363
Table <<tbl-subextensions>> lists all computational subextensions in the Zvvm family.
4464

4565
[#tbl-subextensions]
4666
.Computational subextensions in the Zvvm family of Integrated Matrix extensions.
47-
[cols="1,1,3,3", options="header"]
67+
[cols="1,1,4,2", options="header"]
4868
|===
4969
|Extension | Dependencies | Multiplicand Types | Accumulator Type
5070
|Zvvmmi4b ^| Zve64d | [U]Int4, [U]Int4 | Int8
@@ -425,6 +445,10 @@ tile load instructions always transfer data at SEW granularity, every loaded
425445
SEW-bit position contains W contiguous narrow elements that the
426446
multiply-accumulate instruction consumes as a sub-dot-product.
427447

448+
[#ime-tile-widening-fig]
449+
.Element distribution and tile geometry example for L=32, with SEW-wide elements (left), two SEW/2-wide elements packed per SEW (middle), and four SEW/4-wide elements packed per SEW (right). Packing/widening by W increases the effective K dimension of the tile by a factor of W.
450+
image::images/png/ime-tile-widening.png[align="center"]
451+
428452
===== Byte-sized and wider elements (EEW ≥ 8)
429453

430454
When the input element width is 8 bits or wider (EEW ∈ {8, 16, 32, 64}), the
@@ -457,9 +481,9 @@ accordingly.
457481
[#arithmetic-considerations]
458482
==== Arithmetic considerations
459483

460-
Each multiply-accumulate instruction computes, for every output element C[m, n]:
484+
Each multiply-accumulate instruction computes, for every output element stem:[C_{m,n}]:
461485

462-
C[m, n] ← C[m, n] + Σ_{k=0}^{K_eff−1} A[m, k] × B[k, n]
486+
stem:[C_{m,n} \leftarrow C_{m,n} + \sum_{k=0}^{K_{\text{eff}}-1} A_{m,k} \times B_{k,n}]
463487

464488
where K_eff = λ × W × LMUL.
465489
This section specifies how the K_eff product terms may be grouped and when intermediate rounding is permitted.
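
As an illustration only, the accumulation formula above can be sketched as an exact reference model in Python. The function names `k_eff` and `mac_reference` are hypothetical and not part of the specification. Tiles are assumed to be plain nested lists, with B indexed as `B[k][n]` to match the formula. Python integers are unbounded, so the sketch models neither the grouping of product terms nor any intermediate rounding.

```python
def k_eff(lam, w, lmul):
    """Effective K dimension: K_eff = lambda * W * LMUL."""
    return lam * w * lmul


def mac_reference(C, A, B):
    """Exact reference: C[m][n] += sum over k of A[m][k] * B[k][n].

    A is M x K_eff, B is K_eff x N (indexed as in the formula), C is M x N.
    The accumulator C is updated in place and also returned.
    """
    depth = len(B)  # number of product terms per output element
    for m in range(len(C)):
        for n in range(len(C[0])):
            C[m][n] += sum(A[m][k] * B[k][n] for k in range(depth))
    return C


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C = [[1, 0], [0, 1]]
    print(k_eff(4, 2, 1))
    print(mac_reference(C, A, B))
```

A model of this kind may be useful as a golden reference when checking that a particular grouping of the K_eff product terms yields the expected result for integer inputs.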
@@ -540,17 +564,17 @@ The `Zvvmm` family of extensions provides instructions that perform matrix multi
540564

541565
The K-dimension of the multiplication (shared inner dimension of A and B^T^) is determined by λ from `vtype`, scaled by a per-instruction widening factor W and further multiplied by LMUL:
542566

543-
K_effective = λ × W × LMUL
567+
K_eff = λ × W × LMUL
544568

545569
A and B tiles have elements of width SEW÷W; the accumulator C has elements of width SEW (see table below).
546570
The signedness of each input is controlled independently by `altfmt_A` and `altfmt_B` (0 = signed, 1 = unsigned); the accumulator C is always signed.
547571
Integer accumulation wraps modulo 2^SEW^.
548572

549573
The tile dimensions follow from the widening and LMUL settings:
550574

551-
L_A = LMUL × VLEN ÷ (SEW ÷ W) (total A or B tile elements across LMUL registers)
552-
K_effective = λ × W × LMUL
553-
M = N = L_A ÷ K_effective (rows = columns of the square accumulator tile)
575+
L_A = LMUL × VLEN ÷ (SEW ÷ W) (total A or B tile elements across LMUL registers)
576+
K_eff = λ × W × LMUL
577+
M = N = L_A ÷ K_eff (rows = columns of the square accumulator tile)
554578

555579
The accumulator register group has cardinality MUL_C = VLEN ÷ (SEW × λ²), independent of both W and LMUL.
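
The relations above can be checked numerically with a short Python sketch. The function name `tile_geometry` is an assumption made for illustration. The sketch assumes all divisions are exact and does not validate that the chosen parameters form a legal configuration.

```python
def tile_geometry(vlen, sew, lam, w=1, lmul=1):
    """Evaluate the tile-dimension formulas for one configuration.

    vlen and sew are in bits, lam is the lambda value, w is the widening
    factor, and lmul is the integer LMUL. Divisions are assumed exact.
    """
    l_a = lmul * vlen // (sew // w)     # total A or B elements across LMUL registers
    k = lam * w * lmul                  # K_eff
    m = l_a // k                        # M = N for the square accumulator tile
    mul_c = vlen // (sew * lam * lam)   # accumulator register group size
    return {"L_A": l_a, "K_eff": k, "M": m, "N": m, "MUL_C": mul_c}


if __name__ == "__main__":
    # Example values chosen for illustration: VLEN=512, SEW=16, lambda=4, W=4.
    print(tile_geometry(512, 16, 4, w=4, lmul=1))
```

For VLEN=512, SEW=16, λ=4, W=4, and LMUL=1 this gives L_A = 128, K_eff = 16, M = N = 8, and MUL_C = 2. The 8 × 8 accumulator tile of 16-bit elements occupies 1024 bits, which matches two 512-bit registers.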
556580

@@ -714,6 +738,10 @@ OCP MX block size is 32 elements; however, recent research (Lee et al.,
714738
demonstrates that a block size of 16 significantly improves accuracy for
715739
activation tensors with outliers.
716740

741+
[#ime-block-scaling-mul]
742+
.Block scaling multiplication example with SEW=16, λ=4, W=4, LMUL=1. Block scales are applied along A rows and B columns.
743+
image::images/png/ime-block-scaling.png[align="center", width="50%"]
744+
717745
The Integrated Matrix extensions support microscaling on all floating-point
718746
multiply-accumulate instructions by encoding `vm=0` in the instruction and
719747
supplying paired E8M0 block-scale factors through `v0`. The `bs` field in
@@ -734,21 +762,21 @@ applied, `v0` is not read, and the `bs` field is ignored.
734762

735763
When microscaling is active (`vm=0`), each output element is computed as:
736764

737-
C[m, n] ← C[m, n] + Σ_{s=0}^{S1} (scale_A[m][s] × scale_B[n][s] × (Σ_{k block s} A[m, k] × B[k, n]))
765+
stem:[C_{m,n} \leftarrow C_{m,n} + \sum_{s=0}^{S-1} \left( \text{scale_A}_{m,s} \times \text{scale_B}_{n,s} \times \left( \sum_{k \in \text{block } s} A_{m,k} \times B_{k,n} \right) \right)]
738766

739767
where S = ⌈K_eff / block_size⌉ is the number of scale blocks per
740768
row/column, and block _s_ covers elements
741769
[s × block_size, min((s+1) × block_size, K_eff) − 1].
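
The block partition described above can be sketched as follows. The helper name `scale_blocks` is hypothetical. It returns the inclusive index range covered by each scale block and assumes positive integer inputs.

```python
def scale_blocks(k_eff, block_size):
    """Return a list of inclusive (first, last) k-index ranges, one per block.

    The number of blocks is S = ceil(K_eff / block_size); the last block is
    truncated when K_eff is not a multiple of block_size.
    """
    s_count = -(-k_eff // block_size)  # integer ceiling division
    return [
        (s * block_size, min((s + 1) * block_size, k_eff) - 1)
        for s in range(s_count)
    ]


if __name__ == "__main__":
    print(scale_blocks(64, 32))
    print(scale_blocks(48, 32))
    print(scale_blocks(16, 32))
```

When K_eff is smaller than the block size, as in the last call, a single partial block covers the whole K dimension.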
742770

743-
`scale_A[m][s]` and `scale_B[n][s]` are power-of-two values decoded from
771+
stem:[\text{scale_A}_{m,s}] and stem:[\text{scale_B}_{n,s}] are power-of-two values decoded from
744772
8-bit E8M0 fields in `v0`. The E8M0 format uses an 8-bit biased exponent
745773
(bias 127) representing values from 2^−127^ to 2^127^; the bit pattern
746774
0xFF encodes NaN.
747775

748776
The per-block scales for A and B are applied as exact power-of-two
749777
multiplications (implemented as exponent additions) and therefore introduce
750-
no rounding error. If any `scale_A[m][s]` or `scale_B[n][s]` is NaN
751-
(0xFF), the corresponding output element C[m, n] is set to the default NaN
778+
no rounding error. If any stem:[\text{scale_A}_{m,s}] or stem:[\text{scale_B}_{n,s}] is NaN
779+
(0xFF), the corresponding output element stem:[C_{m,n}] is set to the default NaN
752780
regardless of the input element values.
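
A minimal sketch of the E8M0 decoding described above, assuming the scale is handled as a Python float (IEEE 754 double precision), which can represent every value from 2^−127^ to 2^127^ exactly. The name `decode_e8m0` is hypothetical.

```python
import math

E8M0_BIAS = 127   # exponent bias stated in the text
E8M0_NAN = 0xFF   # bit pattern that encodes NaN


def decode_e8m0(byte):
    """Decode one 8-bit E8M0 field into a power-of-two scale or NaN."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("E8M0 field must fit in 8 bits")
    if byte == E8M0_NAN:
        return math.nan
    # ldexp(1.0, e) computes 1.0 * 2**e exactly for these exponents.
    return math.ldexp(1.0, byte - E8M0_BIAS)


if __name__ == "__main__":
    print(decode_e8m0(127))
    print(decode_e8m0(0) == 2.0 ** -127)
    print(math.isnan(decode_e8m0(0xFF)))
```

Because each decoded scale is an exact power of two, multiplying by it in this model only shifts the exponent, consistent with the statement that scaling introduces no rounding error.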
753781

754782
When a non-NaN E8M0 value represents a power-of-two that overflows the
@@ -775,8 +803,8 @@ Within each row, S = ⌈K_eff / block_size⌉ elements are active. Element
775803
index _p_ = _m_ × R + _s_ addresses the scale pair for row (or column) _m_
776804
at block _s_. Specifically:
777805

778-
* `scale_A[m][s]` is the lower byte of the 16-bit element at index (m × R + s)
779-
* `scale_B[n][s]` is the upper byte of the 16-bit element at index (n × R + s)
806+
* stem:[\text{scale_A}_{m,s}] is the lower byte of the 16-bit element at index (m × R + s)
807+
* stem:[\text{scale_B}_{n,s}] is the upper byte of the 16-bit element at index (n × R + s)
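
The byte selection in the two bullets above can be sketched in Python. This is an illustration under stated assumptions: `v0` is modeled as a list of unsigned 16-bit integers, `r` is passed in as a parameter standing for R, and the function names are hypothetical.

```python
def scale_a_byte(v0, m, s, r):
    """Raw E8M0 byte for scale_A[m][s]: lower byte of element m * r + s."""
    return v0[m * r + s] & 0xFF


def scale_b_byte(v0, n, s, r):
    """Raw E8M0 byte for scale_B[n][s]: upper byte of element n * r + s."""
    return (v0[n * r + s] >> 8) & 0xFF


if __name__ == "__main__":
    # Four 16-bit elements with r = 2: two rows (or columns), two blocks each.
    v0 = [0x807F, 0x7E81, 0x0001, 0xFF00]
    print(hex(scale_a_byte(v0, 0, 0, 2)))
    print(hex(scale_b_byte(v0, 1, 1, 2)))
```

In this example the lower byte of element 0 is 0x7F, and the upper byte of element 3 is 0xFF, which is the NaN encoding.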
780808

781809
The total number of 16-bit elements spanned per tile dimension is M × R,
782810
where M = N = VLEN ÷ (SEW × λ). Because M × R = (VLEN ÷ (SEW × λ)) ×
