Skip to content

[X86] some builtins generate incorrect code for shifts with large (constant) shift counts #43267

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wolfy1961 opened this issue Nov 6, 2019 · 6 comments
Labels
backend:X86 bugzilla Issues migrated from bugzilla wontfix Issue is real, but we can't or won't fix it. Not invalid

Comments

@wolfy1961
Copy link
Collaborator

wolfy1961 commented Nov 6, 2019

Bugzilla Link 43922
Version trunk
OS All
Attachments IR in question
CC @topperc,@RKSimon,@rotateright
Fixed by commit(s) 641d2e5

Extended Description

The attached IR features a call to a builtin that generates a PSRAD instruction. The shift count is large, and in such cases the instruction's definition says that the result should be equal to the operand's sign bit (or the sign bits of its respective elements) extended to the entire width of the result (i.e. 0 or -1).

The instruction, however, is emitted with a shift count of 0, yielding a result identical to the input operand.

    movq    .LCPI0_0(%rip), %mm0    # mm0 = 0x7AAAAAAA7AAAAAAA
    psrad   $0, %mm0

I do not think undefined behaviour is in play here, since I would expect that the semantics of a builtin are directly derived from the that of the instruction they represent, though I could be wrong about that.

The IR has been generated from the following C source:

// -----------------------------------------------------
typedef int __v2si __attribute__((__vector_size__(8)));
typedef long long __m64 __attribute__((__vector_size__(8)));

extern "C" int printf(const char *, ...);

int main()
{
            __m64 id17152 = {(long long)0x7aaaaaaa7aaaaaaa};
            int id17156 = 0x10000000;
            __m64 id17151 = (__m64)__builtin_ia32_psradi((__v2si)id17152, id17156);
            printf("id17151 = %llx\n", id17151[0]);
}
// ------------------------------------------------------

gcc 7.4.0 prints 0 for this code.

In a late stage IR dump the PSRAD instruction shows up with the correct shift count:

renamable $mm0 = MMX_PSRADri killed renamable $mm0(tied-def 0), 268435456

@topperc
Copy link
Collaborator

topperc commented Nov 6, 2019

The PSRAD with immediate instruction only supports 8-bits worth of shift so we masked 268435456 to 8-bits and got 0. The same issue exists for shift left and shift right for the MMX builtin. We do better with SSE shifts.

I'm inclined to fix this by just clamping the shift amount to 31 for all three shift amounts. We could do better for left/right shifts and the result to 0 explicitly, but I'm not sure we care to optimize MMX that well.

@RKSimon
Copy link
Collaborator

RKSimon commented Nov 6, 2019

The PSRAD with immediate instruction only supports 8-bits worth of shift so
we masked 268435456 to 8-bits and got 0. The same issue exists for shift
left and shift right for the MMX builtin. We do better with SSE shifts.

I'm inclined to fix this by just clamping the shift amount to 31 for all
three shift amounts. We could do better for left/right shifts and the result
to 0 explicitly, but I'm not sure we care to optimize MMX that well.

How tricky would it be to match what we do for SSE for out of bounds values?

@topperc
Copy link
Collaborator

topperc commented Nov 6, 2019

Not that tricky. We need to clamp for psrai. And for the other two emit an i32 zero and an x86isd node to move i32 to mmx. That will pattern match to the pxor zero idiom I think.

@topperc
Copy link
Collaborator

topperc commented Nov 6, 2019

I've committed the simple clamp to 255 for all 8 affected intrinsics in 641d2e5. Using 255 avoids needing to decode the element size from the intrinsic.

Simon, do you think I should optimize it to produce 0s when possible? Or should we start working on MMX->SSE

@RKSimon
Copy link
Collaborator

RKSimon commented Nov 10, 2019

I've committed the simple clamp to 255 for all 8 affected intrinsics in
641d2e5. Using 255 avoids needing to decode
the element size from the intrinsic.

Simon, do you think I should optimize it to produce 0s when possible? Or
should we start working on MMX->SSE

Returning 0s for out of range logical shifts should be a relatively small change, making a dent in [Issue #41665] is likely to be a much bigger issue.

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
@RKSimon
Copy link
Collaborator

RKSimon commented Feb 4, 2025

Won't fix - we've now removed frontend MMX support and the MMX intrinsics now wrap to the SSE2 intrinsics internally, giving matching behavior.

@RKSimon RKSimon closed this as not planned Won't fix, can't repro, duplicate, stale Feb 4, 2025
@EugeneZelenko EugeneZelenko added the wontfix Issue is real, but we can't or won't fix it. Not invalid label Feb 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:X86 bugzilla Issues migrated from bugzilla wontfix Issue is real, but we can't or won't fix it. Not invalid
Projects
None yet
Development

No branches or pull requests

4 participants