Call cost floor + FP16 speed boost / old GPUs #245
blefaudeux started this conversation in General
- Hmm, how are you measuring this? I tried a vector addition with tiny vectors and saw that torch and triton had the same overhead.
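  Something along these lines (a minimal sketch, adapted from the vector-add tutorial and written with the current `tl.constexpr` style; `do_bench` reports timings in ms and the exact numbers will depend on the GPU and Triton version):

  ```python
  import torch
  import triton
  import triton.language as tl


  @triton.jit
  def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
      # Each program handles one BLOCK_SIZE-wide slice of the vectors.
      pid = tl.program_id(axis=0)
      offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
      mask = offsets < n_elements
      x = tl.load(x_ptr + offsets, mask=mask)
      y = tl.load(y_ptr + offsets, mask=mask)
      tl.store(out_ptr + offsets, x + y, mask=mask)


  def triton_add(x, y):
      out = torch.empty_like(x)
      n = x.numel()
      grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
      add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
      return out


  # Tiny vectors: the timing is dominated by launch overhead, not bandwidth.
  x = torch.rand(128, device="cuda")
  y = torch.rand(128, device="cuda")
  print("triton:", triton.testing.do_bench(lambda: triton_add(x, y)))
  print("torch :", triton.testing.do_bench(lambda: x + y))
  ```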
- Changing the benchmark method (moving outside of torch.utils, and confirmed by real-life measurements) resolves both observations: fp16 gets a significant boost even on old hardware, and there is no extra call cost.
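  For reference, a raw CUDA-event timer is one way to measure outside of torch.utils.benchmark, amortizing the launch over many calls (a minimal sketch, not necessarily the exact setup used here):

  ```python
  import torch


  def cuda_time_us(fn, iters=1000, warmup=50):
      """Average wall time per call in microseconds, measured with CUDA events."""
      for _ in range(warmup):
          fn()
      torch.cuda.synchronize()
      start = torch.cuda.Event(enable_timing=True)
      end = torch.cuda.Event(enable_timing=True)
      start.record()
      for _ in range(iters):
          fn()
      end.record()
      torch.cuda.synchronize()
      return start.elapsed_time(end) * 1e3 / iters  # elapsed_time() is in ms


  # Example: compare fp32 vs fp16 softmax on the same data.
  x32 = torch.randn(4096, 1024, device="cuda")
  x16 = x32.half()
  print("fp32:", cuda_time_us(lambda: torch.softmax(x32, dim=-1)), "us")
  print("fp16:", cuda_time_us(lambda: torch.softmax(x16, dim=-1)), "us")
  ```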
- Poking around the fused softmax tutorial, and extending it a little to make it comparable to torch.softmax() (mostly by adding autograd; rough sketch below), I'm trying to understand how the performance compares to the heavily optimized PyTorch CUDA kernels across a couple of axes. I'm curious to get some context on two observations, which come from micro-benchmarks on a P100 (arguably a little old):
  - there seems to be a time floor on any call to a Triton kernel: I'm seeing around 70us, even if the kernel was called just before and there is no autotuning. Is that expected? The call time for PyTorch kernels can go much lower, towards a couple of us.
  - using the same softmax kernel on fp16 data requires very little change (see tl.exp() and torch.float16 crash unceremoniously #241), but it does not bring any speed benefit. This is somewhat expected compute-wise on a P100, but I would expect some bandwidth benefit, and testing against PyTorch does show one (i.e. torch.softmax() on fp16 data is a little faster than on fp32 on a P100). Is that expected, or could it be that some memory allocations are hardcoded to fp32 within the IR, for instance?
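  The autograd extension is roughly the following (a minimal sketch; `softmax_fwd` is a placeholder name for the tutorial's Triton forward kernel, assumed to softmax over the last dimension, and the backward is computed analytically from the saved output):

  ```python
  import torch


  class TritonSoftmax(torch.autograd.Function):
      """Wraps a Triton forward kernel so it can be compared with torch.softmax
      in a training-like setting."""

      @staticmethod
      def forward(ctx, x):
          y = softmax_fwd(x)  # Triton kernel launch (fused-softmax tutorial)
          ctx.save_for_backward(y)
          return y

      @staticmethod
      def backward(ctx, grad_out):
          (y,) = ctx.saved_tensors
          # dL/dx = y * (dL/dy - sum(dL/dy * y) over the softmax dimension)
          return y * (grad_out - (grad_out * y).sum(dim=-1, keepdim=True))


  def triton_softmax(x):
      return TritonSoftmax.apply(x)
  ```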
  Thanks! cc @ptillet