[BACKEND] Fix regression in pipeliner pre-checks. #4196

Merged
merged 2 commits into triton-lang:main on Jun 24, 2024

Conversation

ThomasRaoux (Collaborator)

During some previous refactoring we changed the logic and started pipelining cases that had incompatible shared encodings. This was missed because one of the lit tests had not been updated :(

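A minimal sketch of the kind of pre-check involved (illustrative Python only; the actual check lives in the C++ TritonGPU pipeliner, and `can_pipeline` is a hypothetical name):

```python
# Illustrative only: a pipeliner pre-check should refuse to pipeline a
# load whose users require incompatible shared-memory encodings.
def can_pipeline(use_encodings: list[str]) -> bool:
    # All users must agree on a single shared encoding; otherwise the
    # load cannot be staged through shared memory safely.
    return len(set(use_encodings)) == 1

assert can_pipeline(["swizzled<vec=8>", "swizzled<vec=8>"])
assert not can_pipeline(["swizzled<vec=8>", "swizzled<vec=4>"])
```
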
@ThomasRaoux ThomasRaoux requested a review from ptillet as a code owner June 24, 2024 05:51
@ThomasRaoux ThomasRaoux requested a review from Jokeren June 24, 2024 15:22
@pawelszczerbuk (Contributor)

Thank you! Good catch!

@ThomasRaoux ThomasRaoux merged commit d0cd1c0 into triton-lang:main Jun 24, 2024
6 checks passed
Jokeren pushed a commit that referenced this pull request Jul 1, 2024
During some previous refactoring we changed the logic and started
pipelining cases that had incompatible shared encodings. This was missed
because one of the lit tests had not been updated :(
Jokeren added a commit that referenced this pull request Jul 3, 2024

Add a more meaningful check to make sure we are not merging blocks (#4186)

This is a follow-up to
#4176 (comment)

I am now counting the number of blocks with (17) and without (31) block
merging. I double-checked to make sure this does not pass when we use an
aggressive region simplification strategy.

[AMD] Skip mfma layout in maybeDuplicate (#4170)

The workaround introduced in
#4048 "forgot" to skip mfma
layout.

[TEST] Merge duplicate `max_num_imprecise_acc` tests and improve code (#4191)

[DOCS][NFC] Fix doc formatting problems (#4195)

1. f-strings cannot be used as docstrings in Python.
2. URLs should follow the reStructuredText format.
3. Code snippets in a code block should be indented.

Tested and passed on a local machine.
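
To illustrate the first two points (a minimal example, not from the PR):

```python
def bad():
    f"""An f-string here is NOT a docstring; Python discards it."""

def good():
    """A plain string literal is a real docstring.

    URLs should use the reStructuredText link form:
    `Triton <https://triton-lang.org>`_
    """

assert bad.__doc__ is None
assert good.__doc__ is not None
```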

[BACKEND] Fix regression in pipeliner pre-checks. (#4196)

During some previous refactoring we changed the logic and started
pipelining cases that had incompatible shared encodings. This was missed
because one of the lit tests had not been updated :(

Remove tl.multiple_of call from tma persistent kernel (#4198)

[AMD] Guard against null in `BypassEpilogueSMEM` (#4203)

`val.getDefiningOp()` can return `nullptr`. In this case, we must fail
the `BypassEpilogueSMEM` rewrite pass for the given op. This prevents
run-time crashes.
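
A rough Python analogue of the guard (the real fix is in the C++ pass; `defining_op` below is a hypothetical stand-in):

```python
# Rough analogue: treat a missing defining op as a pattern failure
# instead of dereferencing null. `defining_op` is a hypothetical stand-in.
def try_bypass_epilogue_smem(val: dict) -> bool:
    op = val.get("defining_op")   # None for e.g. block arguments
    if op is None:
        return False              # fail the rewrite, don't crash
    return True

assert try_bypass_epilogue_smem({"defining_op": "tt.load"})
assert not try_bypass_epilogue_smem({})
```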

[FRONTEND][NFC] Fix type checking, conditional logic, and loop structures for improved readability and performance (#4208)

Document TRITON_HOME (#4210)

Document the existence of `TRITON_HOME` environment variable.

The `TRITON_HOME` variable controls the location of the `.triton`
directory that stores, among other things, the files downloaded during a
`pip install -e python` virtualenv build. By default, this is located in
the user's home directory, at `~/.triton`.

I was trying to build Triton on my system on a large local disk, but
with limited network home directory space, and the `pip` command kept
failing with out-of-disk-space errors. It turned out that during
installation, large files were downloaded to the `~/.triton` directory,
causing the failure.

After checking that it was not `pip` doing this, I found the
`TRITON_HOME` variable, which allowed me to work around the issue and
build Triton successfully. After seconding #4007, I decided to
contribute this documentation fix.

Co-authored-by: sree <sree@buckyball>
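
As an illustration, a sketch of how a build might resolve the cache location (the helper below is hypothetical; Triton's internal lookup may differ):

```python
import os

# Hypothetical helper: TRITON_HOME redirects the .triton cache directory,
# which otherwise defaults to the user's home directory.
def triton_cache_dir() -> str:
    home = os.environ.get("TRITON_HOME", os.path.expanduser("~"))
    return os.path.join(home, ".triton")

# TRITON_HOME=/mnt/bigdisk would place downloads under /mnt/bigdisk/.triton
print(triton_cache_dir())
```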

[BACKEND] Fix regression in i1 reduction (#4215)

Recent refactoring broke i1 shared memory load.

[BUILD] update URL for LLVM tarballs (#4216)

[BACKEND] Fix divisibility analysis for shift ops (#4221)

Divisibility does not ensure that a value is nonzero, so we cannot use
the divisibility of the shift amount as a minimum shift value.
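
A concrete counterexample (not from the PR): 0 is divisible by any value, so a shift amount with known divisibility can still be 0:

```python
# y is divisible by 4, yet y == 0, so x >> y shifts out no bits at all;
# divisibility of y tells us nothing about a minimum shift.
x, y = 16, 0
assert y % 4 == 0
assert (x >> y) == x
```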

Support FP8 constant (#4222)

This unblocks the compilation of kernels like the one below, which
don't operate arithmetically on FP8.

```python
import triton
import triton.language as tl

@triton.jit
def triton_poi_fused__scaled_mm__to_copy_constant_pad_nd_lift_fresh_2(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 400624
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 784
    x1 = (xindex // 784)
    x2 = xindex
    tmp0 = x0
    tmp1 = tl.full([1], 769, tl.int64)
    tmp2 = tmp0 < tmp1
    tmp3 = tl.load(in_ptr0 + (x0 + (769*x1)), tmp2 & xmask, other=0.0)
    tmp4 = tmp3.to(tl.float8e4nv)
    tmp5 = tl.full(tmp4.shape, 0.0, tmp4.dtype)
    tmp6 = tl.where(tmp2, tmp4, tmp5)
    tl.store(out_ptr0 + (x2), tmp6, xmask)
```
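
A hypothetical launch for this kernel (shapes derived from the constants in the kernel body; assumes a recent PyTorch with `float8_e4m3fn` support and the imports from the snippet above):

```python
import torch

xnumel = 400624                       # 784 * 511, from the kernel body
# Input holds 769 valid columns per row before padding to 784.
in_ptr0 = torch.randn(769 * (xnumel // 784), device="cuda")
out_ptr0 = torch.empty(xnumel, device="cuda", dtype=torch.float8_e4m3fn)
XBLOCK = 1024
grid = (triton.cdiv(xnumel, XBLOCK),)
triton_poi_fused__scaled_mm__to_copy_constant_pad_nd_lift_fresh_2[grid](
    in_ptr0, out_ptr0, xnumel, XBLOCK=XBLOCK
)
```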

[INTERPRETER] Implement implicit tensor conversion for assignment operators (#4214)
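
Roughly the class of code this covers (an illustrative kernel; run with TRITON_INTERPRET=1 to exercise the interpreter path):

```python
import triton
import triton.language as tl

@triton.jit
def add_one(ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    x = tl.load(ptr + offs, mask=offs < n)
    x += 1  # scalar RHS is implicitly converted to a tensor matching x
    tl.store(ptr + offs, x, mask=offs < n)
```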

bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024
During some previous refactoring we changed the logic and started
pipelining cases that had incompatible shared encodings. This was missed
because one of the lit tests had not been updated :(
plognjen pushed a commit to ROCm/triton that referenced this pull request Jan 5, 2025
During some previous refactoring we changed the logic and started
pipelining cases that had incompatible shared encodings. This was missed
because one of the lit tests had not been updated :(
jataylo pushed a commit to ROCm/triton that referenced this pull request Jan 9, 2025
* Add blocked to dot shortcut

* pack tensors in vectors instead of structures

* fix

* add moe bypass option

* initial commit

* fix

* fix

* add missing configurations and add more checks in passes

* adjust global load layout for vllm swizzling format

* Remove debug print

* make load width dependent on data type

* fix int8 logic

* generalize load analysis: return the last load in a dependent load chain instead of 2

* Add message for assert failure

So that people know what the problem is when this compiler error shows up

* add k=512/1024 cases

* [BACKEND] Add memory space to memdesc type. (triton-lang#4027)

Currently only shared memory is supported, but this will allow supporting
different kinds of local memory (such as private) in the future.

* [BACKEND] Fix memory side effects of `tt.dot` (triton-lang#4033)

1. Replaced `triton_nvidia_gpu.async_dot` with
`triton_nvidia_gpu.group_dot` which has a `isAsync` attribute. Maybe
`warp_group_dot` is a better name?
2. Removed `memdesc` from `tt.dot` because `tt.dot` should be pure,
without any side effects
3. Removed hacks in Membar analysis.
4. Unified wgmma code generation in the backend.
5. Introduced the `DotLike` trait for `tt.dot` and
`triton_nvidia_gpu.group_dot`.
6. Updated comments in matmul loop pipeline (maybe incomplete).
7. Removed the `ConvertDotConvert` pattern

* remove streamPipelinev2

* [TEST] NFC: Drop irrelevant NVIDIA specific attributes (triton-lang#4384)

Software pipelining should not be using them. This makes it cleaner and
prepares for reusing the same test inputs on the AMD side.

* [Pipeliner] NFC: Expose Pipeliner infrastructure for use by other target backends (triton-lang#4155)

Non-functional changes to expose
`lib/Dialect/TritonGPU/Transforms/Pipeliner` infrastructure for use by
other target backends.

* [BACKEND] Fix regression in pipeliner pre-checks. (triton-lang#4196)

During some previous refactoring we changed the logic and started
pipelining cases that had incompatible shared encodings. This was missed
because one of the lit tests had not been updated :(

* [Backend][AMD] Introduce stream pipeliner v2 (triton-lang#4148)

This PR first promotes common infrastructure in
`lib/Dialect/TritonGPU/Transforms/Pipeliner` to enable inclusion by
other target backends. No other changes have been made to the
lib/include directories.

Second, the `tritonamdgpu-stream-pipeline` pass has been completely
revamped based on code from
`lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp`
using similar scheduling passes to compute multi-stage pipelines. Some
of this code could be consolidated further in the CoarseSchedule class
(or perhaps a derived LoopScheduler class). This modulo scheduler
collects `tt.load` ops and generates local_storage and management ops
for the ramp-up stage (stage-0), then collects all uses of the loads
for stage-1. Multi-buffering is introduced when num_stages exceeds the
max distance between load and uses. Buffering may be in shared memory
for `tt.dot` uses or in registers for all other uses. The current
implementation does not support peeling the last iteration if the loop
is dynamic.

Lastly, the `tritonamdgpu-reorder-instructions` pass has been enhanced
to move `tt.load` ops as early as possible in its region. This includes
loop bodies as well as func entry blocks for the case of ramp-up. This
pass will also move `triton_gpu.local_store` ops as early as possible if
their source is not directly from a `tt.load`. In this way, a
multi-buffered pipeline will overlap in this order (see the sketch after
the list):
1. tt.load buffer+2
2. tg.local_store buffer+1
3. tt.dot buffer+0
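
Schematically (a simplified Python model of the steady-state overlap, not the actual pass):

```python
# With three buffers in flight, iteration i loads data for i+2, stores
# data for i+1 into shared memory, and computes the dot for i.
NUM_BUFFERS = 3

def steady_state(i: int) -> dict:
    return {
        "tt.load":         (i + 2) % NUM_BUFFERS,
        "ttg.local_store": (i + 1) % NUM_BUFFERS,
        "tt.dot":          i % NUM_BUFFERS,
    }

print(steady_state(0))  # three different buffers active simultaneously
```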

---------

Co-authored-by: Lei Zhang <[email protected]>

* [AMD] Prefetch loads and independent local_stores (triton-lang#4429)

This pass is enhanced to move tt.loads as early as possible. This
enables buffering in registers for global loads while computing previous
tiles (stream-pipelining), but may increase register pressure.

If ttg.local_stores are independent of loads in the loop (i.e. double
buffering in shared memory), then this pass will also move those early
to overlap with global loads and compute.

* [Pipeliner] Implement dynamic loop peeling
 - enabled for tritonamdgpu-stream-pipeline

* * disabled for num_stages > 2
* updated tests

* * guard each stage of ramp-down in epilogue
* enable peeling for any num_stages

* * pipeline reg buffers

* [AMD] Fixed bug with tritonamdgpu-reorder-instructions
    - blindly moving local_loads can violate memory access order
    - also fixed case when moving instructions to top of loop

* * only move ops early

* Fix in streamPipelinerV2

* Fix lit tests

* [Backend][AMD] Add temporary environment variable for pipeliner v2 (triton-lang#4430)

This commit adds a new environment variable to enable pipeliner v2. It
is expected to be temporary while we enable the new pipeliner and get
all cases covered.

Co-authored-by: SJW <[email protected]>

---------

Co-authored-by: Ognjen Plavsic <[email protected]>
Co-authored-by: Alexander Efimov <[email protected]>
Co-authored-by: Ognjen Plavsic <[email protected]>
Co-authored-by: Vinayak Gokhale <[email protected]>
Co-authored-by: Lixun Zhang <[email protected]>
Co-authored-by: Thomas Raoux <[email protected]>
Co-authored-by: Keren Zhou <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: SJW <[email protected]>
Co-authored-by: SJW <[email protected]>