src/integrated-matrix.adoc
+44 -16 (44 additions & 16 deletions)
Original file line number
Diff line number
Diff line change
@@ -1,14 +1,16 @@
1
1
[[IME]]
2
2
== Zvvm Family of Integrated Matrix Extensions
3
3
4
+
:stem: latexmath
5
+
4
6
=== Introduction
5
7
6
8
High-performance computing and machine learning workloads depend critically on general matrix multiplication (GEMM) over a wide range of data types and precisions.
7
9
Dedicated matrix-multiply accelerators often require new register state—separate matrix register files—to achieve competitive throughput, introducing substantial architectural complexity and binary interface disruption.
8
10
9
11
The Zvvm family of Integrated Matrix extensions (Zvvmm, Zvvfmm, Zvvmtls) takes a different approach: it accelerates matrix multiplication using _only_ the 32 × VLEN architected vector registers already defined by the RISC-V "V" vector extension.
10
12
By interpreting groups of existing vector registers as two-dimensional matrix tiles, the Zvvm family of Integrated Matrix extensions delivers high arithmetic density without introducing any new architected state.
11
-
We focus, in particular, on the computation of C ← A × B^T^ + C, where A (μ × λ) and B (ν × λ) are row-major matrix panels and C (μ × ν) is row-major.
13
+
We focus, in particular, on the computation of C ← A × B^T^ + C, where A (μ × λ), B (ν × λ) and C (μ × ν) are row-major matrix panels.
12
14
13
15
The extensions are designed to support implementations spanning a wide range of microarchitectures and performance points: from small, embedded in-order cores targeting low-power and area-constrained applications, to large, high-performance out-of-order implementations targeting HPC and AI workloads.
14
16
A key design goal is that the same binary executes correctly—and achieves near-peak arithmetic throughput—across this entire range without recompilation.
@@ -21,14 +23,20 @@ The Zvvm family of Integrated Matrix extensions (Zvvmm, Zvvfmm, Zvvmtls) provide
21
23
* <<#zvvfmm>>: multiply-accumulate instructions for floating-point matrix tiles
22
24
* <<#zvvmtls>>: two-dimensional load/store instructions for moving data between memory and vector registers interpreted as matrix tiles
23
25
24
-
==== Matrix tile geometry
26
+
==== Matrix tile multiplication geometry
27
+
28
+
The geometry of the multiplier and the tiles is defined by the new parameter `lambda` (λ), which is encoded in 3 bits of the `vtype` CSR, together with the vector operation parameters: the widening factor `W` of the multiplication (encoded in the instruction), `LMUL`, `SEW`, and `VLEN`.
29
+
30
+
[#ime-geometry-fig]
31
+
.Geometry of matrix tiles and element ordering for 32-element vector registers and λ=4. VRs are interpreted as 2D tiles. Vector element indices show the tile element order. (a) Non-widening case with A, B, C having the same SEW. (b) Widening case with A and B having half the SEW of C (double-packing).
32
+
image::images/png/ime-geometry.png[width=100%, align=center, alt="Diagram of matrix tile geometry and multiplier configuration parameters."]
25
33
26
34
Matrix tiles are represented using the existing RISC-V V register file and its configuration state.
27
35
The three matrices in the multiply-accumulate operation C ← A × B^T^ + C are stored as follows:
28
36
29
37
* The _accumulator_ C is stored in a vector register group with element width SEW.
30
38
Its register group multiplier MUL_C is determined by the tile geometry:
31
-
MUL_C = (VLEN / SEW) / λ², where λ is the K dimension given by the `lambda[2:0]` field in `vtype`.
39
+
MUL_C = (VLEN / SEW) / λ^2^, where λ is the K dimension given by the `lambda[2:0]` field in `vtype`.
32
40
The C register group may start at any vector register index that is MUL_C-aligned.
33
41
MUL_C ∈ {1, 2, 4, 8, 16}.
34
42
If MUL_C = 16, the only allowed vector register indices are 0 and 16.
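The MUL_C rule above can be checked with a small calculation. The following is only a sketch: the function name `accumulator_group` is hypothetical, and `lam` is assumed to be the already decoded λ value rather than the raw `lambda[2:0]` field, whose encoding is not shown in this hunk.

```python
def accumulator_group(vlen: int, sew: int, lam: int) -> tuple[int, list[int]]:
    """Return (MUL_C, legal base register indices) for the accumulator C.

    Hypothetical helper; `lam` is the decoded value of λ (an assumption).
    """
    elems = vlen // sew                      # elements per vector register
    if elems % (lam * lam) != 0:
        raise ValueError("this sketch requires λ² to divide VLEN/SEW")
    mul_c = elems // (lam * lam)             # MUL_C = (VLEN / SEW) / λ²
    if mul_c not in (1, 2, 4, 8, 16):
        raise ValueError(f"MUL_C={mul_c} is outside the set 1, 2, 4, 8, 16")
    # The C group may start at any MUL_C-aligned register among v0..v31.
    bases = list(range(0, 32, mul_c))
    return mul_c, bases


if __name__ == "__main__":
    # 32 elements per register and λ=4, as in the geometry figure caption.
    print(accumulator_group(1024, 32, 4)[0])
    # MUL_C = 16 leaves only two legal base registers.
    print(accumulator_group(512, 32, 1))
```

With MUL_C = 16 the list of base registers reduces to 0 and 16, which matches the restriction stated in the text.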
@@ -40,11 +48,23 @@ The three matrices in the multiply-accumulate operation C ← A × B^T^ + C are
40
48
Only integer values of LMUL are supported by the Zvvm family of Integrated Matrix extensions: LMUL ∈ {1, 2, 4, 8}.
41
49
Fractional LMUL settings (LMUL < 1) are reserved and shall raise an illegal-instruction exception when used with any IME instruction.
42
50
51
+
The elements in the vector registers are contiguous in the λ direction, as depicted in
52
+
<<ime-geometry-fig>>. Tile A and C elements are stored in row-major order while tile B^T^
53
+
elements are stored in column-major order. This choice allows the implementation of the
54
+
matrix tile multiplication as an inner product (e.g., a systolic array) or an outer product and
55
+
simplifies the implementation of high-rank updates based on outer products.
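As a rough illustration of this element ordering, the sketch below models C ← A × B^T^ + C on flat element lists. It assumes the simplest reading of the text: A (μ × λ) and B (ν × λ) are both indexed as `row * lam + k`, so that elements are contiguous along λ, and C (μ × ν) is indexed as `i * nu + j`. Register-group boundaries, widening, and LMUL scaling are ignored, and the name `tile_mac` is hypothetical.

```python
def tile_mac(a, b, c, mu, nu, lam):
    """Reference model of C <- A x B^T + C on flat element lists.

    a: mu*lam elements of A, row-major (contiguous along lam)
    b: nu*lam elements of B, row-major (so B^T is column-major)
    c: mu*nu elements of C, row-major
    The flat indexing is an assumption made for this sketch.
    """
    assert len(a) == mu * lam and len(b) == nu * lam and len(c) == mu * nu
    out = list(c)
    for i in range(mu):
        for j in range(nu):
            acc = out[i * nu + j]
            # Inner product over the shared dimension: both operands are
            # read with unit stride because each is contiguous along lam.
            for k in range(lam):
                acc += a[i * lam + k] * b[j * lam + k]
            out[i * nu + j] = acc
    return out


if __name__ == "__main__":
    print(tile_mac([1, 2, 3, 4], [5, 6, 7, 8], [1, 1, 1, 1], 2, 2, 2))
```

Because both operands are contiguous along λ, the same data can be consumed row by row (inner product) or one k-slice at a time (outer product) without reordering.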
56
+
57
+
[#ime-tile-lmul-fig]
58
+
.LMUL scaling of matrix tiles for LMUL=2 (left) and LMUL=4 (right). LMUL scales the tile along the K dimension only, increasing the effective K dimension of the tile by a factor of LMUL. The blue arrows indicate the contiguous elements in the vector register groups that form the tiles.
59
+
image::images/png/ime-tile-lmul_2_4.png[width=100%, align=center, alt="Diagram showing LMUL scaling of matrix tiles."]
60
+
61
+
==== Subextensions of Zvvm
62
+
43
63
Table <<tbl-subextensions>> lists all computational subextensions in the Zvvm family.
44
64
45
65
[#tbl-subextensions]
46
66
.Computational subextensions in the Zvvm family of Integrated Matrix extensions.
47
-
[cols="1,1,3,3", options="header"]
67
+
[cols="1,1,4,2", options="header"]
48
68
|===
49
69
|Extension | Dependencies | Multiplicand Types | Accumulator Type
50
70
|Zvvmmi4b ^| Zve64d | [U]Int4, [U]Int4 | Int8
@@ -425,6 +445,10 @@ tile load instructions always transfer data at SEW granularity, every loaded
425
445
SEW-bit position contains W contiguous narrow elements that the
426
446
multiply-accumulate instruction consumes as a sub-dot-product.
427
447
448
+
[#ime-tile-widening-fig]
449
+
.Element distribution and tile geometry example for L=32: SEW-wide elements (left), two SEW/2-wide elements packed per SEW (middle), and four SEW/4-wide elements packed per SEW (right). Packing/widening by W increases the effective K dimension of the tile by a factor of W.
This section specifies how the K_eff product terms may be grouped and when intermediate rounding is permitted.
@@ -540,17 +564,17 @@ The `Zvvmm` family of extensions provides instructions that perform matrix multi
540
564
541
565
The K-dimension of the multiplication (shared inner dimension of A and B^T^) is determined by λ from `vtype`, scaled by a per-instruction widening factor W and further multiplied by LMUL:
542
566
543
-
K_effective = λ × W × LMUL
567
+
K_eff = λ × W × LMUL
544
568
545
569
A and B tiles have elements of width SEW÷W; the accumulator C has elements of width SEW (see table below).
546
570
The signedness of each input is controlled independently by `altfmt_A` and `altfmt_B` (0 = signed, 1 = unsigned); the accumulator C is always signed.
547
571
Integer accumulation wraps modulo 2^SEW^.
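The signedness and wrap-around rules can be sketched for a single accumulator element as follows. This is an illustrative model only: the helper names are hypothetical, and the operands are passed as raw bit patterns of width SEW÷W.

```python
def wrap_signed(x: int, sew: int) -> int:
    """Reduce x modulo 2**sew and reinterpret it as a signed sew-bit value."""
    x &= (1 << sew) - 1
    return x - (1 << sew) if x >> (sew - 1) else x


def as_operand(bits: int, width: int, altfmt: int) -> int:
    """Decode a narrow input: altfmt 0 = signed, 1 = unsigned."""
    bits &= (1 << width) - 1
    return bits if altfmt == 1 else wrap_signed(bits, width)


def int_mac(c, a_bits, b_bits, sew, w, altfmt_a, altfmt_b):
    """Accumulate the products of paired narrow inputs into one C element."""
    width = sew // w                       # A and B elements are SEW/W wide
    total = c
    for x, y in zip(a_bits, b_bits):
        total += as_operand(x, width, altfmt_a) * as_operand(y, width, altfmt_b)
    # Wrapping once at the end equals wrapping after every step, because
    # addition is compatible with reduction modulo 2**sew.
    return wrap_signed(total, sew)


if __name__ == "__main__":
    print(int_mac(0, [0xFF], [0x02], 32, 4, 0, 0))   # A read as signed
    print(int_mac(0, [0xFF], [0x02], 32, 4, 1, 0))   # A read as unsigned
```

The two calls differ only in `altfmt_a`, which shows how the same bit pattern contributes a different product depending on the selected signedness.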
548
572
549
573
The tile dimensions follow from the widening and LMUL settings:
550
574
551
-
L_A = LMUL × VLEN ÷ (SEW ÷ W) (total A or B tile elements across LMUL registers)
552
-
K_effective = λ × W × LMUL
553
-
M = N = L_A ÷ K_effective (rows = columns of the square accumulator tile)
575
+
L_A = LMUL × VLEN ÷ (SEW ÷ W) (total A or B tile elements across LMUL registers)
576
+
K_eff = λ × W × LMUL
577
+
M = N = L_A ÷ K_eff (rows = columns of the square accumulator tile)
554
578
555
579
The accumulator register group has cardinality MUL_C = VLEN ÷ (SEW × λ²), independent of both W and LMUL.
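The relations above can be evaluated directly. The sketch below uses hypothetical names (`TileShape`, `tile_shape`) and takes λ as an already decoded integer; it also cross-checks that an M × N tile of SEW-bit elements fills exactly MUL_C registers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileShape:
    l_a: int     # total A or B tile elements across LMUL registers
    k_eff: int   # effective K dimension
    m: int       # rows of the accumulator tile
    n: int       # columns of the accumulator tile
    mul_c: int   # registers in the accumulator group


def tile_shape(vlen: int, sew: int, lam: int, w: int, lmul: int) -> TileShape:
    """Evaluate the tile-dimension formulas (illustrative helper)."""
    if lmul not in (1, 2, 4, 8):
        raise ValueError("only integer LMUL in 1, 2, 4, 8 is supported")
    if sew % w != 0:
        raise ValueError("W must divide SEW")
    l_a = lmul * vlen // (sew // w)          # L_A = LMUL × VLEN ÷ (SEW ÷ W)
    k_eff = lam * w * lmul                   # K_eff = λ × W × LMUL
    m = n = l_a // k_eff                     # M = N = L_A ÷ K_eff
    mul_c = vlen // (sew * lam * lam)        # MUL_C = VLEN ÷ (SEW × λ²)
    # Consistency check: the C tile occupies exactly MUL_C registers.
    assert m * n * sew == mul_c * vlen
    return TileShape(l_a, k_eff, m, n, mul_c)


if __name__ == "__main__":
    print(tile_shape(1024, 32, 4, 1, 1))
    print(tile_shape(1024, 32, 4, 2, 2))
```

In both calls M, N and MUL_C come out the same while K_eff grows with W and LMUL, which illustrates that the accumulator group size is independent of both.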
556
580
@@ -714,6 +738,10 @@ OCP MX block size is 32 elements; however, recent research (Lee et al.,
714
738
demonstrates that a block size of 16 significantly improves accuracy for
715
739
activation tensors with outliers.
716
740
741
+
[#ime-block-scaling-mul]
742
+
.Block scaling multiplication example with SEW=16, λ=4, W=4, LMUL=1. Block scales are applied along A rows and B columns.