Description
#1579 notes some unfinished business:
The `Simd` and `m128i` etc. type generation should be equivalent, but they're not in terms of code; the `Simd` impls currently use `fill` to avoid more `unsafe` code here.
Notice from the above that `u32x4`, `u16x8` and `u8x16` are the same size as `u128` and `m128i` but cost about twice as much to generate here. This indicates the `fill` code may be sub-optimal.
Additionally, the `m128i` impl performed even worse when transmuting a `u128` value (~4.3ns or +%130) which, as far as I can tell, is purely because the `u128` value is returned via `rax, rdx` while the `__m128i` value is returned via `rdx, r10` (with `rax` equal to the struct address). I don't understand this.
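To make the two code paths being compared concrete, here is a minimal sketch. It is not rand's code: `SplitMix64`, `lanes_via_fill` and `lanes_via_u128` are hypothetical names, `[u32; 4]` stands in for `u32x4`, and a safe `to_le_bytes` conversion stands in for the `unsafe` transmute. It only illustrates that both routes can yield the same 128 bits while going through different code.

```rust
/// Hypothetical stand-in for an RNG core (SplitMix64); not rand's API.
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Byte-oriented path: copy each output word into the destination buffer.
    pub fn fill_bytes(&mut self, dest: &mut [u8]) {
        for chunk in dest.chunks_mut(8) {
            let bytes = self.next_u64().to_le_bytes();
            chunk.copy_from_slice(&bytes[..chunk.len()]);
        }
    }
}

/// Split 16 little-endian bytes into four u32 lanes.
fn lanes_from_bytes(bytes: [u8; 16]) -> [u32; 4] {
    let mut out = [0u32; 4];
    for (lane, chunk) in out.iter_mut().zip(bytes.chunks_exact(4)) {
        *lane = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
    }
    out
}

/// Path A (assumed shape of a fill-style impl): fill a byte buffer, then reinterpret.
pub fn lanes_via_fill(rng: &mut SplitMix64) -> [u32; 4] {
    let mut buf = [0u8; 16];
    rng.fill_bytes(&mut buf);
    lanes_from_bytes(buf)
}

/// Path B (assumed shape of a u128-style impl): build one u128, then reinterpret.
pub fn lanes_via_u128(rng: &mut SplitMix64) -> [u32; 4] {
    let lo = rng.next_u64() as u128;
    let hi = rng.next_u64() as u128;
    lanes_from_bytes((lo | (hi << 64)).to_le_bytes())
}
```

With identically seeded generators the two functions return the same lanes, so any cost difference in a benchmark of this shape would come from the generated code rather than from the amount of random data consumed.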
Optimizing Fill for such cases may not be possible without specialization, and even then it's unclear if we'd want to due to the implied value-breaking changes.
Optimizing SIMD impls would require either specialization or replacing the generic Simd<$ty, LANES> impls with a (large) number of specific impls.
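For a sense of what "a (large) number of specific impls" could look like, the following is a rough sketch under stated assumptions: `Generate` and `impl_generate_128!` are hypothetical names, plain arrays stand in for `Simd<$ty, LANES>`, and only 128-bit shapes are covered. Each supported lane type and lane count needs its own macro entry, which is where the large number of impls comes from.

```rust
/// Hypothetical trait standing in for a distribution impl; not rand's API.
pub trait Generate: Sized {
    fn generate(next_u64: &mut dyn FnMut() -> u64) -> Self;
}

/// Emits one specific impl per (lane type, lane count) pair of 128 bits total.
macro_rules! impl_generate_128 {
    ($($ty:ty, $lanes:literal);+ $(;)?) => {$(
        impl Generate for [$ty; $lanes] {
            fn generate(next_u64: &mut dyn FnMut() -> u64) -> Self {
                const WIDTH: usize = core::mem::size_of::<$ty>();
                // Draw 128 bits once, low word first.
                let wide = (next_u64() as u128) | ((next_u64() as u128) << 64);
                let bytes = wide.to_le_bytes();
                let mut out = [0 as $ty; $lanes];
                // Split the 16 bytes into lanes of WIDTH bytes each.
                for (lane, chunk) in out.iter_mut().zip(bytes.chunks_exact(WIDTH)) {
                    let mut raw = [0u8; WIDTH];
                    raw.copy_from_slice(chunk);
                    *lane = <$ty>::from_le_bytes(raw);
                }
                out
            }
        }
    )+};
}

// Every additional shape (256-bit, 512-bit, signed lanes, ...) would need its own entry.
impl_generate_128! { u32, 4; u16, 8; u8, 16 }
```

A caller would pick the shape by type, for example `let v: [u32; 4] = Generate::generate(&mut source);` where `source` is any closure returning `u64`.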