Unsigned narrowing #94
Comments
My back-of-the-envelope instruction sequence to emulate unsigned-to-unsigned narrowing in SSE would look something like the following. There are other ways as well, but I can't seem to find anything substantially shorter:
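For illustration only, and not necessarily the sequence this comment refers to, here is one way the emulation could be written with SSE2 intrinsics for the 16-bit to 8-bit case. The function name `emulated_narrow_u16_to_u8` and the array-based signature are assumptions made for this sketch. It clamps each lane to 255 using an unsigned saturating subtract followed by a plain subtract, then relies on `packuswb`, which is safe once every lane is in [0, 255]. A scalar fallback with the same per-lane result is included for non-x86 builds.

```c
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics */
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* Hypothetical helper (name is not from this thread): narrow two vectors of
   eight unsigned 16-bit lanes into sixteen unsigned 8-bit lanes, saturating
   at 255. */
void emulated_narrow_u16_to_u8(const uint16_t lo[8], const uint16_t hi[8],
                               uint8_t out[16])
{
#if SKETCH_HAVE_SSE2
    const __m128i max8 = _mm_set1_epi16(255);
    __m128i a = _mm_loadu_si128((const __m128i *)lo);
    __m128i b = _mm_loadu_si128((const __m128i *)hi);
    /* psubusw: excess = max(x - 255, 0), using unsigned saturation. */
    __m128i excess_a = _mm_subs_epu16(a, max8);
    __m128i excess_b = _mm_subs_epu16(b, max8);
    /* psubw: x - excess == min(x, 255), so every lane is now in [0, 255]. */
    a = _mm_sub_epi16(a, excess_a);
    b = _mm_sub_epi16(b, excess_b);
    /* packuswb reads its inputs as signed, which is harmless after the clamp. */
    _mm_storeu_si128((__m128i *)out, _mm_packus_epi16(a, b));
#else
    /* Portable fallback producing the same per-lane result. */
    for (int i = 0; i < 8; ++i) {
        out[i]     = (uint8_t)(lo[i] > 255 ? 255 : lo[i]);
        out[i + 8] = (uint8_t)(hi[i] > 255 ? 255 : hi[i]);
    }
#endif
}
```

In this sketch each input vector costs two extra operations before the pack. On targets with SSE4.1, `_mm_min_epu16` could replace each subtract pair with a single clamp.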
Can we have a spec PR with these recommendations and seek feedback there? Unless there are use cases that benefit from unsigned-to-unsigned narrowing, it is justifiable to drop those operations from the initial proposal.
Sorry, a longer response now. Another way to look at it is that there are two questions: the performance of unsigned narrowing with unsigned inputs, and the benefits of unsigned narrowing with signed inputs. #91 has the spec change to substitute the former with the latter. Alternatively, we can add the other variant alongside the ones that have already been added. The advantages of unsigned narrowing with signed inputs are described above, and it can be done in one instruction on both x86 and ARM.
Closing, as we merged the alternative in PR #91.
Widening and narrowing operations were proposed in #21 and added in #89. There is a discussion in #91 about what the input to unsigned narrowing instructions should be. The issue is that x86(64) SIMD narrowing instructions treat their input as signed, even when the output is not. ARM supports both signed-to-unsigned and unsigned-to-unsigned narrowing.
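To make the distinction concrete, here is a minimal scalar sketch of the two per-lane behaviours for the 16-bit to 8-bit case. The function names are hypothetical and chosen only for this illustration. The two variants agree on every input whose top bit is clear and disagree when it is set, because the same bit pattern is read as a negative number in one case and as a large positive number in the other.

```c
#include <stdint.h>

/* Signed input, unsigned output: clamp an int16_t lane to [0, 255].
   Negative inputs saturate to 0. */
uint8_t narrow_s16_to_u8(int16_t x)
{
    return (uint8_t)(x < 0 ? 0 : (x > 255 ? 255 : x));
}

/* Unsigned input, unsigned output: clamp a uint16_t lane to [0, 255].
   There is no lower clamp because the input cannot be negative. */
uint8_t narrow_u16_to_u8(uint16_t x)
{
    return (uint8_t)(x > 255 ? 255 : x);
}
```

For example, the bit pattern 0xFFFF is -1 as a signed lane and narrows to 0, while as an unsigned lane it is 65535 and narrows to 255.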
I can see value in "signed to unsigned" narrowing for things like RGBA graphics: results of signed integer arithmetic that would be packed into unsigned RGBA output. The Sobel operator is one example, and other image filters would use it as well. Would "unsigned to unsigned" narrowing be equally useful? Where would it be used?
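As a rough scalar illustration of that use case, the sketch below computes a horizontal Sobel response for one 3x3 patch and packs it into an 8-bit channel. The function names and the row-major patch layout are assumptions for this example, and clamping negative responses to 0 is just one possible choice (a real filter might take the absolute value or gradient magnitude first).

```c
#include <stdint.h>

/* Horizontal Sobel response for a 3x3 patch of 8-bit pixels, row-major.
   The result lies in [-1020, 1020], so it needs a signed 16-bit lane. */
int16_t sobel_gx(const uint8_t p[9])
{
    int right = p[2] + 2 * p[5] + p[8];
    int left  = p[0] + 2 * p[3] + p[6];
    return (int16_t)(right - left);
}

/* Signed-to-unsigned saturating narrow of the filter result into a channel. */
uint8_t pack_channel(int16_t v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}
```

The intermediate value is signed by construction, which is why the signed-input form of the narrowing operation matches this kind of pipeline.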
Another problem is that emulating "unsigned to unsigned" narrowing on x86(64) requires about 4 instructions, which is not a good value proposition for an operation that would work on 2*, 4, or 8 lanes, as only the 8-lane version would be faster than scalar.
*If 64-bit lanes are supported.