Commit 5e49724

fix: fig link in cute docs (#2216)
1 parent b3f3c77 commit 5e49724

4 files changed: +25 −25 lines changed

media/docs/cpp/cute/02_layout_algebra.md

Lines changed: 11 additions & 11 deletions
@@ -250,7 +250,7 @@ We often use the `<LayoutA, LayoutB, ...>` notation to distinguish `Tiler`s from

The `result` in the above code can be depicted as the 3x8 sublayout of the original layout highlighted in the figure below.
<p align="center">
-<img src="../../images/cute/composition1.png" alt="composition1.png" height="250"/>
+<img src="../../../images/cute/composition1.png" alt="composition1.png" height="250"/>
</p>

For convenience, CuTe interprets `Shape`s as tilers as well. A `Shape` is interpreted as a tuple-of-layouts-with-stride-1:
@@ -269,7 +269,7 @@ auto result = composition(a, tiler);
```
where `result` can be depicted as the 3x8 sublayout of the original layout highlighted in the figure below.
<p align="center">
-<img src="../../images/cute/composition2.png" alt="composition2.png" height="250"/>
+<img src="../../../images/cute/composition2.png" alt="composition2.png" height="250"/>
</p>

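As a runnable sketch of this call pattern (the 6x8 layout and 3x8 tiler here are illustrative, not taken from the figure):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // A 6x8 column-major layout and a Shape used as a tiler.
  auto a     = make_layout(make_shape(Int<6>{}, Int<8>{}));
  auto tiler = make_shape(Int<3>{}, Int<8>{});   // interpreted as <3:1, 8:1>

  // Select the top-left 3x8 sublayout of a.
  auto result = composition(a, tiler);
  print(result); print("\n");
}
```
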
## Composition Tilers
@@ -324,7 +324,7 @@ The `cotarget` parameter above is most commonly an integer -- you can see we onl
* `complement((2,2):(1,6), 24)` is `(3,2):(2,12)`. Note that `((2,2),(3,2)):((1,6),(2,12))` has cosize `24` and produces unique indices.

<p align="center">
-<img src="../../images/cute/complement1.png" alt="complement1.png" height="75"/>
+<img src="../../../images/cute/complement1.png" alt="complement1.png" height="75"/>
</p>
As a visualization, the above figure depicts the codomain of the last example. The image of the original layout `(2,2):(1,6)` is colored in gray. The complement effectively "repeats" the original layout (displayed in the other colors) such that the codomain size of the result is `24`. The complement `(3,2):(2,12)` can be viewed as the "layout of the repetition."

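The bullet above can be checked directly; a minimal sketch using the values from the text:

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  auto layout = Layout<Shape<_2,_2>, Stride<_1,_6>>{};   // (2,2):(1,6)
  auto comp   = complement(layout, Int<24>{});           // expect (3,2):(2,12)
  print(comp); print("\n");
}
```
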
@@ -372,7 +372,7 @@ This is computed in the three steps described in the implementation above.
* Composition of `A = (4,2,3):(2,1,8)` with `(B,B*)` is then `((2,2),(2,3)):((4,1),(2,8))`.

<p align="center">
-<img src="../../images/cute/divide1.png" alt="divide1.png" height="150"/>
+<img src="../../../images/cute/divide1.png" alt="divide1.png" height="150"/>
</p>

The above figure depicts `A` as a 1-D layout with the elements pointed to by `B` highlighted in gray. The layout `B` describes our "tile" of data, and there are six of those tiles in `A`, shown in distinct colors. After the divide, the first mode of the result is the tile of data and the second mode of the result iterates over each tile.
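A sketch of this example in code. The tile `B = 4:2` is an assumption, but it is consistent with the worked result quoted above:

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  auto a = Layout<Shape<_4,_2,_3>, Stride<_2,_1,_8>>{};  // A = (4,2,3):(2,1,8)
  auto b = Layout<_4,_2>{};                              // B = 4:2, the "tile" (assumed)
  auto result = logical_divide(a, b);                    // ((2,2),(2,3)):((4,1),(2,8))
  print(result); print("\n");
}
```
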
@@ -384,7 +384,7 @@ Using the `Tiler` concept defined above, this immediately generalizes to multidi
Similar to the 2-D composition example above, consider a 2-D layout `A = (9,(4,8)):(59,(13,1))` to which we want to apply `3:3` down the columns (mode-0) and `(2,4):(1,8)` across the rows (mode-1). This means the tiler can be written as `B = <3:3, (2,4):(1,8)>`.

<p align="center">
-<img src="../../images/cute/divide2.png" alt="divide2.png" height="450"/>
+<img src="../../../images/cute/divide2.png" alt="divide2.png" height="450"/>
</p>

The above figure depicts `A` as a 2-D layout with the elements pointed to by `B` highlighted in gray. The layout `B` describes our "tile" of data, and there are twelve of those tiles in `A`, shown in distinct colors. After the divide, the first mode of each mode of the result is the tile of data and the second mode of each mode iterates over each tile. In that sense, this operation can be viewed as a kind of `gather` operation or simply as a permutation on the rows and columns.
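In code, using the `A` and by-mode tiler `B` quoted above:

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // A = (9,(4,8)):(59,(13,1))
  auto a = make_layout(make_shape (Int< 9>{}, make_shape (Int< 4>{}, Int<8>{})),
                       make_stride(Int<59>{}, make_stride(Int<13>{}, Int<1>{})));
  auto tiler = make_tile(Layout<_3,_3>{},                        // 3:3 applied to mode-0
                         Layout<Shape<_2,_4>, Stride<_1,_8>>{}); // (2,4):(1,8) applied to mode-1
  auto result = logical_divide(a, tiler);
  print(result); print("\n");
}
```
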
@@ -430,7 +430,7 @@ We note that `logical_divide` preserves the *semantics* of the modes while permu
This is not the case with `zipped_divide`. Mode-0 of the `zipped_divide` result is the `Tile` itself (of whatever rank the `Tiler` was) and mode-1 is the layout of those tiles. It doesn't always make sense to plot these as 2-D layouts, because the `M`-mode is now more aptly the "tile-mode" and the `N`-mode is more aptly the "rest-mode". Regardless, we can still plot the resulting layout as 2-D, as shown below.

<p align="center">
-<img src="../../images/cute/divide3.png" alt="divide3.png" height="450"/>
+<img src="../../../images/cute/divide3.png" alt="divide3.png" height="450"/>
</p>

We've kept each tile's color from the previous images for clarity. Iterating across tiles is now equivalent to iterating across a row of this layout, and iterating over elements within a tile is equivalent to iterating down a column. As we'll see in the `Tensor` section, this can be used to great effect in partitioning within or across tiles of data.
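Continuing the sketch above with the same `a` and `tiler`, the zipped form is one call:

```cpp
auto zipped = zipped_divide(a, tiler);  // ((TileM,TileN), (RestM,RestN))
print(zipped);
```
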
@@ -477,7 +477,7 @@ This is computed in the three steps described in the implementation above.
* Concatenation of `(A,A* o B) = ((2,2),(2,3)):((4,1),(2,8))`.

<p align="center">
-<img src="../../images/cute/product1.png" alt="product1.png" height="175"/>
+<img src="../../../images/cute/product1.png" alt="product1.png" height="175"/>
</p>

The above figure depicts `A` and `B` as 1-D layouts. The layout `B` describes the number and order of repetitions of `A`, and they are colored for clarity. After the product, the first mode of the result is the tile of data and the second mode of the result iterates over each tile.
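A sketch in code. The tile `A = (2,2):(4,1)` is read off mode-0 of the concatenation above; the 2x3 column-major `B` is one consistent assumption for the repetition layout:

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  auto a = Layout<Shape<_2,_2>, Stride<_4,_1>>{};         // the tile A = (2,2):(4,1)
  auto b = make_layout(make_shape(Int<2>{}, Int<3>{}));   // B: 2x3 repetitions (assumed)
  auto result = logical_product(a, b);                    // (A, A* o B)
  print(result); print("\n");
}
```
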
@@ -487,7 +487,7 @@ Note that the result is identical to the result of the 1-D Logical Divide exampl
Of course, we can change the number and order of the tiles in the product by changing `B`.

<p align="center">
-<img src="../../images/cute/product2.png" alt="product2.png" height="175"/>
+<img src="../../../images/cute/product2.png" alt="product2.png" height="175"/>
</p>

For example, in the above image with `B = (4,2):(2,1)`, there are 8 repeated tiles instead of 6 and the tiles are in a different order.
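Continuing the previous sketch with the `B` quoted above:

```cpp
auto b2 = Layout<Shape<_4,_2>, Stride<_2,_1>>{};  // B = (4,2):(2,1): 8 tiles, reordered
auto result2 = logical_product(a, b2);
print(result2);
```
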
@@ -497,7 +497,7 @@ For example, in the above image with `B = (4,2):(2,1)`, there are 8 repeated til
We can use the by-mode `tiler` strategies previously developed to write multidimensional products as well.

<p align="center">
-<img src="../../images/cute/product2d.png" alt="product2d.png" height="250"/>
+<img src="../../../images/cute/product2d.png" alt="product2d.png" height="250"/>
</p>

The above image demonstrates the use of a `tiler` to apply `logical_product` by-mode. Despite this **not being the recommended approach**, the result is a rank-2 layout consisting of a 2x5 row-major block tiled across a 3x4 column-major arrangement.
@@ -520,15 +520,15 @@ Because `A` is always compatible with mode-0 of the result and `B` is always com
This is exactly what `blocked_product` and `raked_product` do, and it is why they are called rank-sensitive. Unlike other CuTe functions that take `Layout` arguments, these care about the top-level rank of the arguments so that each mode can be reassociated after the `logical_product`.

<p align="center">
-<img src="../../images/cute/productblocked2d.png" alt="productblocked2d.png" height="250"/>
+<img src="../../../images/cute/productblocked2d.png" alt="productblocked2d.png" height="250"/>
</p>

The above image shows the same result as the `tiler` approach, but with much more intuitive arguments. A 2x5 row-major layout is arranged as a tile in a 3x4 column-major arrangement. Also note that `blocked_product` applied `coalesce` to mode-0 for us.

By contrast, `raked_product` combines the modes slightly differently. Instead of the resulting "column" mode being constructed from the `A` "column" mode followed by the `B` "column" mode, it is constructed from the `B` "column" mode followed by the `A` "column" mode.

<p align="center">
-<img src="../../images/cute/productraked2d.png" alt="productraked2d.png" height="250"/>
+<img src="../../../images/cute/productraked2d.png" alt="productraked2d.png" height="250"/>
</p>

This results in the "tile" `A` being interleaved, or "raked", with the "layout-of-tiles" `B` instead of appearing as contiguous blocks. Other references call this a "cyclic distribution."
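Both variants can be compared side by side; a minimal sketch using the 2x5 block and 3x4 arrangement described above:

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  auto block  = Layout<Shape<_2,_5>, Stride<_5,_1>>{};  // 2x5 row-major tile
  auto layout = Layout<Shape<_3,_4>, Stride<_1,_3>>{};  // 3x4 column-major arrangement of tiles

  auto blocked = blocked_product(block, layout);  // tiles appear as contiguous blocks
  auto raked   = raked_product  (block, layout);  // tiles are interleaved ("cyclic")
  print(blocked); print("\n");
  print(raked);   print("\n");
}
```
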

media/docs/cpp/cute/03_tensor.md

Lines changed: 2 additions & 2 deletions
@@ -270,7 +270,7 @@ Tensor F = A(make_coord(2,_),make_coord(_,3,_));
```

<p align="center">
-<img src="../../images/cute/slice.png" alt="slice.png" height="300"/>
+<img src="../../../images/cute/slice.png" alt="slice.png" height="300"/>
</p>

In the image above, a `Tensor` is sliced in various ways and the subtensors generated by those slices are highlighted within the original tensor. Note that tensors `C` and `D` contain the same elements but have different ranks and shapes due to the use of `_` versus `make_coord(_,_)`. In each case, the rank of the result equals the number of `Underscore`s in the slicing coordinate.
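A self-contained sketch echoing the `C` versus `D` distinction (shapes here are illustrative, not the figure's):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  float data[4 * 8];
  // A rank-2 tensor whose mode-1 is itself nested: (4,(2,4)).
  Tensor A = make_tensor(&data[0], make_shape(Int<4>{}, make_shape(Int<2>{}, Int<4>{})));

  Tensor C = A(2, _);                // rank-1: one underscore keeps mode-1 whole
  Tensor D = A(2, make_coord(_,_));  // rank-2: same elements, nested profile preserved
  print(C.layout()); print("\n");
  print(D.layout()); print("\n");
}
```
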
@@ -328,7 +328,7 @@ Tensor v = tv(threadIdx.x, _); // (4)
```

<p align="center">
-<img src="../../images/cute/tv_layout.png" alt="tv_layout.png" height="300"/>
+<img src="../../../images/cute/tv_layout.png" alt="tv_layout.png" height="300"/>
</p>

The above image is a visual representation of the above code. An arbitrary 4x8 layout of data is composed with a specific 8x4 TV-layout that represents a partitioning pattern. The result of the composition is on the right, where each thread's values are arranged across its row. The bottom layout depicts the inverse TV-layout, which shows the mapping of 4x8 logical coordinates to the thread id and value id that own them.
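The same pattern can be sketched end to end. The TV-layout below is an illustrative 8-thread x 4-value layout over a 4x8 column-major tile, not the one in the figure:

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  float smem[4 * 8];
  auto data_layout = make_layout(make_shape(Int<4>{}, Int<8>{}));  // a 4x8 data layout

  // (thr,val) = ((4,2),4):((1,16),4) maps thread t and value v to a 1-D coord of the tile.
  auto tv_layout = Layout<Shape<Shape<_4,_2>, _4>, Stride<Stride<_1,_16>, _4>>{};

  Tensor tv = make_tensor(&smem[0], composition(data_layout, tv_layout)); // (thr,val) => element
  Tensor v  = tv(Int<5>{}, _);  // the 4 values owned by logical thread 5
  print(v.layout()); print("\n");
}
```
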

media/docs/cpp/cute/0t_mma_atom.md

Lines changed: 9 additions & 9 deletions
@@ -209,7 +209,7 @@ Volta architecture implements an HMMA instruction where a group of 8 threads cal
We first take a look at how to encode the ISA semantics of thread and data partitioning for the HMMA instruction in a Traits struct. The HMMA NT instruction has the thread-data layout:

<p align="center">
-<img src="../../images/cute/HMMA.8x8x4.NT.png" alt="HMMA.8x8x4.NT.png" height="400"/>
+<img src="../../../images/cute/HMMA.8x8x4.NT.png" alt="HMMA.8x8x4.NT.png" height="400"/>
</p>

### Types
@@ -251,7 +251,7 @@ Again, this layout function maps the logical thread id [0,8) of the MMA operatio
Let us look at exactly how the 8 threads within a QP are mapped to the A, B, and C matrices. For the C and D matrices, the above image is broken down a bit more below. On the left is the whole QP-level view, and on the right are the values owned by just thread 0.

<p align="center">
-<img src="../../images/cute/HMMA.8x8x4.quadpair.C.png" alt="HMMA.8x8x4.quadpair.C.png" height="400"/>
+<img src="../../../images/cute/HMMA.8x8x4.quadpair.C.png" alt="HMMA.8x8x4.quadpair.C.png" height="400"/>
</p>

The metainformation in this single-instruction-level view is what we want to encode in CuTe. Specifically, the QP-level view in this diagram corresponds to the four MMA traits for [SM70_F32F16F16F32](https://github.com/NVIDIA/cutlass/tree/main/include/cute/arch/mma_sm70.hpp). These structs contain the `Element` types, the `Shape_MNK`, and the `ThrID` mapping we constructed above. Now, let us take a look at the definition of `CLayout`, the thread-data layout of accumulators. The job of `CLayout` is to construct a mapping between the `(logical_thr_id, logical_val_id)` and the `(m, n)` coordinate in the C matrix, which can then be used to build up more complicated layouts and operations like the 16x16x4 WMMA.
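Schematically, such a traits struct collects exactly these pieces. The following is a sketch, not the verbatim CUTLASS definition (the real one lives in `include/cute/atom/mma_traits_sm70.hpp`, and the `ThrID` quadpair layout shown is our reading of it):

```cpp
// Sketch only: value types, the MNK shape of one instruction,
// the logical-thread mapping, and the (thr,val) -> coordinate layouts.
template <>
struct MMA_Traits<SM70_8x8x4_F32F16F16F32_NT>
{
  using ValTypeD = float;   // D = A * B + C element types
  using ValTypeA = half_t;
  using ValTypeB = half_t;
  using ValTypeC = float;

  using Shape_MNK = Shape<_8,_8,_4>;                       // one 8x8x4 MMA
  using ThrID     = Layout<Shape<_4,_2>, Stride<_1,_16>>;  // 8 logical threads: a quadpair

  // ALayout, BLayout, CLayout map (logical_thr_id, logical_val_id)
  // to (m,k), (n,k), and (m,n) respectively, read off the diagrams above.
};
```
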
@@ -321,7 +321,7 @@ In the case of F16 accumulators, the layout is way less complex. Each row of acc
A and B matrix layouts depend on whether the sources are transposed or not. The diagram below shows the thread ID to data ownership map for the A and B matrices in the NT and TN transpose cases.

<p align="center">
-<img src="../../images/cute/HMMA.8x8x4.quadpair.AB.png" alt="HMMA.8x8x4.quadpair.AB.png" height="400"/>
+<img src="../../../images/cute/HMMA.8x8x4.quadpair.AB.png" alt="HMMA.8x8x4.quadpair.AB.png" height="400"/>
</p>

Let's look at the TN layout for the A matrix first (right side in the diagram). Again, there are the same 8 logical threads, but each thread owns only 4 elements this time. The shape of `ALayout` will then be `Shape<_8, _4>`. As for the strides, we again need a similar mapping `(m, k) -> m + k * M`. Looking down the `M` mode, we go from `(T0, V0)` to `(T1, V0)`, which is a stride of 1 for all 8 threads. For the `K` mode, as we go across, we go from `(T0, V0)` to `(T0, V1)`, which makes a stride of 8 for all 4 values. Therefore, the A layout is:
@@ -376,15 +376,15 @@ Accumulators are mapped hierarchically in GMMA, starting from the concept of a c

Each core matrix has the layout shown in the diagram below.
<p align="center">
-<img src="../../images/cute/gmma_coremat_cd_fp16.png" alt="gmma_coremat_cd_fp16.png" height="600"/>
+<img src="../../../images/cute/gmma_coremat_cd_fp16.png" alt="gmma_coremat_cd_fp16.png" height="600"/>
</p>

As in the Volta examples, the thread IDs are logical only; which of the four warps in the warpgroup they belong to is not important.

GMMA then tiles this core matrix first vertically along the M mode, and then repeats that column of core matrices along the N mode to construct the full MxN tile. This tiling is shown in the image below.

<p align="center">
-<img src="../../images/cute/gmma_wg_n_slice.png" alt="gmma_wg_n_slice.png" height="600"/>
+<img src="../../../images/cute/gmma_wg_n_slice.png" alt="gmma_wg_n_slice.png" height="600"/>
</p>

With this image, we are again ready to start building the `CLayout` for the `SM90_64x128x16_F16F16F16F16_TN` atom. Same as before, we are constructing a mapping between the `(logical_thr_id, logical_val_id) -> (m, n)` coordinate spaces.
@@ -453,7 +453,7 @@ MMA_Atom mma = MMA_Atom<SM70_8x8x4_F32F16F16F32_NT>{};
print_latex(mma);
```
<p align="center">
-<img src="../../images/cute/HMMA.8x8x4.NT_Atom.png" alt="HMMA.8x8x4.NT_Atom.png" height="400"/>
+<img src="../../../images/cute/HMMA.8x8x4.NT_Atom.png" alt="HMMA.8x8x4.NT_Atom.png" height="400"/>
</p>

The above is equivalent to
@@ -473,7 +473,7 @@ We can create an object akin to a WMMA by using four of these quadpair MMAs:
print_latex(mma);
```
<p align="center">
-<img src="../../images/cute/HMMA.8x8x4.NT_2x2.png" alt="HMMA.8x8x4.NT_2x2.png" height="400"/>
+<img src="../../../images/cute/HMMA.8x8x4.NT_2x2.png" alt="HMMA.8x8x4.NT_2x2.png" height="400"/>
</p>
This `TiledMMA` replicates the `MMA_Atom` across threads: we can see threads `T4`, `T8`, and `T12` in the `C`-matrix that were not used before. Each quadrant of the `C`-matrix is a replica of the atom's partitioning pattern for a new quadpair, and this replication follows a `(2,2):(2,1)` layout.

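One way to spell this construction, a sketch using the `(2,2):(2,1)` atom layout named above:

```cpp
TiledMMA mma = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{},
                              Layout<Shape<_2,_2>, Stride<_2,_1>>{});  // 2x2 quadpairs
print_latex(mma);
```
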
@@ -486,7 +486,7 @@ The above represents a 16x16x4 MMA now, but we can immediately expand this "tile
print_latex(mma);
```
<p align="center">
-<img src="../../images/cute/HMMA.8x8x4.NT_2x2_32x32x4.png" alt="HMMA.8x8x4.NT_2x2_32x32x4.png" height="400"/>
+<img src="../../../images/cute/HMMA.8x8x4.NT_2x2_32x32x4.png" alt="HMMA.8x8x4.NT_2x2_32x32x4.png" height="400"/>
</p>
This `TiledMMA` replicates the previous `TiledMMA` across values instead of threads. We can see values `T0V8`, `T8V8`, and `T16V8` in the `C`-matrix that were not used before. Each quadrant of the `C`-matrix is a replica of the previous `TiledMMA`'s partitioning pattern for a new set of values.

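A sketch of one way to request the 32x32x4 expansion via the tile argument (assumed usage of `make_tiled_mma`'s third parameter):

```cpp
TiledMMA mma = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{},
                              Layout<Shape<_2,_2>, Stride<_2,_1>>{},
                              Tile<_32,_32,_4>{});   // tile the 16x16x4 pattern up to 32x32x4
print_latex(mma);
```
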
@@ -514,7 +514,7 @@ which are separate, but we might prefer them to be next to each other. That is w
print_latex(mma);
```
<p align="center">
-<img src="../../images/cute/HMMA.8x8x4.NT_2x2_32Mx32x4.png" alt="HMMA.8x8x4.NT_2x2_32Mx32x4.png" height="400"/>
+<img src="../../../images/cute/HMMA.8x8x4.NT_2x2_32Mx32x4.png" alt="HMMA.8x8x4.NT_2x2_32Mx32x4.png" height="400"/>
</p>

The layout `(4,4,2):(1,8,4)` is read like a scatter permutation, telling each m-coordinate of the original image where to go in the new image.
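A sketch of how that permutation can be supplied, replacing the plain `_32` on the M-mode with the `(4,4,2):(1,8,4)` layout quoted above (assumed usage):

```cpp
TiledMMA mma = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{},
                              Layout<Shape<_2,_2>, Stride<_2,_1>>{},
                              Tile<Layout<Shape<_4,_4,_2>, Stride<_1,_8,_4>>,  // permute M coords
                                   _32, _4>{});
print_latex(mma);
```
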

media/docs/cpp/cute/0x_gemm_tutorial.md

Lines changed: 3 additions & 3 deletions
@@ -335,7 +335,7 @@ These thread layouts are then used to partition the tiles of data in global memo
where we've used the same projection-style interface to avoid applying the `N`-mode of `tC` to the `(BLK_M,BLK_K)` shape of `sA` and to avoid applying the `M`-mode of `tC` to the `(BLK_N,BLK_K)` shape of `sB`.

<p align="center">
-<img src="../../images/cute/tC_partitioning.png" alt="tC_partitioning.png" height="300"/>
+<img src="../../../images/cute/tC_partitioning.png" alt="tC_partitioning.png" height="300"/>
</p>
This diagram shows the `tC` layout, highlights two threads in green and blue, shows the projections of the `tC` layout, and finally highlights the subtensors within `sA`, `sB`, and `gC` that `tCsA`, `tCsB`, and `tCgC` represent.

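In code, the projection-style calls look like the following sketch (cf. `examples/cute/tutorial`); `tC`, `thread_idx`, and the tensor names are as in the surrounding text:

```cpp
// Inside the kernel: tC is the (THR_M,THR_N) thread layout.
Tensor tCsA = local_partition(sA, tC, thread_idx, Step<_1, X>{});  // project away the N-mode
Tensor tCsB = local_partition(sB, tC, thread_idx, Step< X,_1>{});  // project away the M-mode
Tensor tCgC = local_partition(gC, tC, thread_idx, Step<_1,_1>{});  // use both modes
```
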
@@ -391,7 +391,7 @@ As a first example, lets look at the `TiledCopy` that `gemm_nt` generates.
```
The easiest way to see what this `TiledCopy` does is to look at the partition pattern in LaTeX.
<p align="center">
-<img src="../../images/cute/TiledCopyA.png" alt="TiledCopyA.png" height="300"/>
+<img src="../../../images/cute/TiledCopyA.png" alt="TiledCopyA.png" height="300"/>
</p>
On the left is the source-tensor partitioning and on the right is the destination-tensor partitioning. The partition patterns are the same in this case, but there exist PTX instructions that require different patterns in the source and destination. The diagram shows that each thread reads 4x1 `TA` elements and there are 32x8 threads. The `UniversalCopy<uint128_t>` forces the instruction to use a 128-bit copy instruction. If the partition (of `sA` or `gA` in this case) does not produce 4 `TA` elements that can be vectorized into a single 128-bit load/store, then CuTe will fail statically with an error message to that effect.

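A sketch of building such a `TiledCopy` from the quantities in the text (32x8 threads, 4x1 values, 128-bit `UniversalCopy`); `TA` is the element type:

```cpp
TiledCopy copyA = make_tiled_copy(Copy_Atom<UniversalCopy<uint128_t>, TA>{},
                                  Layout<Shape<_32,_8>>{},   // thread layout: 32x8
                                  Layout<Shape< _4,_1>>{});  // value layout: 4x1 per thread
print_latex(copyA);
```
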
@@ -422,7 +422,7 @@ As a first example, lets look at the `TiledMMA` that `gemm_nt` generates.
```
The easiest way to see what this `TiledMMA` does is to look at the partition pattern in LaTeX.
<p align="center">
-<img src="../../images/cute/TiledMmaC.png" alt="TiledMmaC.png" height="300"/>
+<img src="../../../images/cute/TiledMmaC.png" alt="TiledMmaC.png" height="300"/>
</p>
On the left is the A-tensor partitioning, on the top is the B-tensor partitioning, and in the middle is the C-tensor partitioning. Because `UniversalFMA` is a 1x1x1 MMA instruction, a 16x16x1 tiling of them results in a 16x16x1 `TiledMMA`. Other MMA instructions will involve different threads and have different instruction sizes. In this case, each thread reads a single element from `A`, `B`, and `C`.

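A sketch of that construction, using the 16x16x1 tiling named above; `TA`, `TB`, and `TC` are the element types:

```cpp
TiledMMA mmaC = make_tiled_mma(UniversalFMA<TC,TA,TB>{},
                               Layout<Shape<_16,_16,_1>>{});  // 16x16x1 tiling of 1x1x1 FMAs
print_latex(mmaC);
```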