
memory_efficient_attention fails with half precision tensors, with pre-compiled xformers 0.0.30 on RTX 5090 #1251

@cdancette

Description

🐛 Bug

I am encountering an error with memory_efficient_attention from the pre-compiled xformers 0.0.30 wheel on an RTX 5090, when using half-precision tensors. I am using torch 2.7 with CUDA 12.8.

Calling the function raises the following error:

CUDA error (/__w/xformers/xformers/third_party/flash-attention/hopper/flash_fwd_launch_template.h:175): no kernel image is available for execution on the device

Note that it works in full precision.

Also, I tried installing the latest version from source (commit 8fc8ec5), and I cannot reproduce the issue there; it works as expected. I installed it with poetry run pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
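
For context, a "no kernel image is available" error usually means the wheel's compiled kernels do not target the GPU's compute capability (the RTX 5090 is a Blackwell card). A minimal check, using only standard torch/xformers introspection (the expected (12, 0) output is an assumption about this card):

import torch
import xformers

print(torch.version.cuda)                   # CUDA toolkit torch was built against
print(torch.cuda.get_device_capability(0))  # expected (12, 0) on an RTX 5090
print(xformers.__version__)                 # 0.0.30 for the failing wheel

Running python -m xformers.info also lists which attention operators the installed build considers usable.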

Command to reproduce

import torch
from xformers.ops import memory_efficient_attention

# Half-precision inputs on the GPU, shape (batch, seq_len, embed_dim)
q, k, v = (torch.randn(5, 10, 128, dtype=torch.half, device="cuda") for _ in range(3))
memory_efficient_attention(q, k, v)

Raises

CUDA error (/__w/xformers/xformers/third_party/flash-attention/hopper/flash_fwd_launch_template.h:175): no kernel image is available for execution on the device
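
As a possible workaround until a wheel built for this architecture is available, memory_efficient_attention accepts an op argument that selects a specific backend, which should bypass the flash-attention kernels. A sketch, assuming the cutlass operator in xformers.ops.fmha supports this shape and dtype on the card:

import torch
from xformers.ops import fmha, memory_efficient_attention

q, k, v = (torch.randn(5, 10, 128, dtype=torch.half, device="cuda") for _ in range(3))

# Force the cutlass forward operator instead of the flash-attention one;
# None lets xformers choose the backward operator (unused in this forward-only call).
out = memory_efficient_attention(q, k, v, op=(fmha.cutlass.FwOp, None))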

Environment

  • PyTorch version: 2.7
  • OS: Ubuntu 24.04.2
  • How you installed PyTorch: poetry/pip
  • Python version: 3.12
  • CUDA/cuDNN version: 12.8
  • NVIDIA driver version: 570.133.20
  • GPU models and configuration: RTX 5090
