diff --git a/src/images/png/ime-load-store-geometry.png b/src/images/png/ime-load-store-geometry.png
193 KB
diff --git a/src/images/png/ime-vmtls-lmul.png b/src/images/png/ime-vmtls-lmul.png
256 KB
diff --git a/src/integrated-matrix.adoc b/src/integrated-matrix.adoc
Lines changed: 17 additions & 1 deletion
@@ -2,6 +2,7 @@
 == Zvvm Family of Integrated Matrix Extensions
 
 :stem: latexmath
+:imagesdir: ../docs-resources/images
 :imagesdir: images
 
 === Introduction
@@ -1116,8 +1117,14 @@ The tile load and store instructions make use of the following parameters from t
 * LMUL — vector length multiplier
 * λ — selected lambda, read from `lambda[2:0]` in `vtype`
 
-The resulting tile dimensions are μ = ν = VL/λ, with the accumulator tile C occupying MUL = LMUL/λ² vector registers.
 When loading A or B input tiles, `vmtl.v` and `vmttl.v` shall be used with SEW equal to the element width of the C accumulator tile.
+<<#ime-load-store-geometry>> illustrates the memory-to-VR load in both row-major and column-major order for a tile with LMUL=1.
+Physically, both transfers are identical: they move contiguous segments of _linesize_ = λ × LMUL elements, with a stride of LD between the starts of consecutive segments.
+The tile load/store instructions interpret the memory layout according to the specified leading dimension, but the resulting data layout in the VR is the same regardless of whether the source/destination matrix is stored in row-major or column-major order.
+
+[#ime-load-store-geometry]
+.Loading a matrix tile from memory for LMUL=1. The matrix is laid out linearly in memory; the leading dimension LD specifies its row size (a) or column size (b). Element indices represent the offset of the elements in memory. Blue arrows indicate the data ordering in memory/VR.
+image::png/ime-load-store-geometry.png[align="center"]
 
 If (rs2) = 0, then the leading dimension LD is set to the _natural dimension_ of λ × LMUL.
 That is, the memory layout, with elements contiguous to each other, matches the layout of the register group being loaded/stored.
@@ -1188,6 +1195,14 @@ For each element index `i` in the body `[vstart, VL)` where the mask is enabled:
 
     M[rs1 + (SEW ÷ 8) × ((i / linesize) × LD + (i % linesize))] = VS[i]
 
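The address computation above can be sketched in Python as follows. This is an illustrative model only, not part of the specification: the function name `tile_element_addresses` and all parameter values are hypothetical, LD is taken in units of elements (as the scaling by SEW ÷ 8 in the formula implies), and masking is ignored.

```python
def tile_element_addresses(rs1, sew_bits, linesize, ld, vl, vstart=0):
    """Byte address of each element i in [vstart, vl), following the formula
    rs1 + (SEW / 8) * ((i / linesize) * LD + (i % linesize))."""
    elem_bytes = sew_bits // 8
    return [
        rs1 + elem_bytes * ((i // linesize) * ld + (i % linesize))
        for i in range(vstart, vl)
    ]

# Hypothetical example: segments of 4 elements, leading dimension of 10 elements.
strided = tile_element_addresses(rs1=0x1000, sew_bits=32, linesize=4, ld=10, vl=8)

# Leading dimension equal to linesize (the natural dimension): contiguous layout.
natural = tile_element_addresses(rs1=0x1000, sew_bits=32, linesize=4, ld=4, vl=8)
```

In this sketch, `strided` covers two segments of four contiguous 32-bit elements whose starts are 40 bytes apart, while `natural` is a single contiguous run, which corresponds to the natural-dimension case.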
+[NOTE]
+====
+Order-preserving tile load/store with LMUL > 1 offers optimization opportunities. While vmtl/vmts are very similar to vector constant-stride segment operations, the segment sizes are potentially larger. Choosing λ × LMUL × SEW to match the cache line size allows full cache line transfers.
+====
+[#ime-vmtls-lmul]
+.Order-preserving tile load/store with LMUL > 1 for row-major (a) and column-major (b) ordering in memory.
+image::png/ime-vmtls-lmul.png[align="center", width="90%"]
+
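As a rough illustration of the cache line remark in the note, the size of one contiguous segment follows from λ × LMUL × SEW. The helper name `segment_bytes`, the 64-byte cache line, and the parameter values below are assumptions chosen for illustration; they are not values mandated by the specification.

```python
def segment_bytes(lam, lmul, sew_bits):
    """Size in bytes of one contiguous segment of lam * lmul elements."""
    return lam * lmul * sew_bits // 8

CACHE_LINE_BYTES = 64  # assumed cache line size, for illustration only

# Hypothetical parameter choices with 32-bit elements.
full_line = segment_bytes(lam=4, lmul=4, sew_bits=32)  # fills one assumed cache line
partial = segment_bytes(lam=4, lmul=1, sew_bits=32)    # a quarter of the assumed line
```

Under these assumed values, raising LMUL from 1 to 4 grows each segment from 16 to 64 bytes, so each segment transfer would span a full cache line.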
 ===== `vmttl.v` — Transposing Tile Load
 
     vmttl.v vd, (rs1), rs2 [, Lλ] [, vm]
@@ -3410,3 +3425,4 @@ Included in::
 |0.1
 |Draft
 |===
+