[Perf] Explore more performant Fp8 Casting #83
Comments
There is also this intrinsic:
Another point to consider: right now we only have an fp32->fp8 cast, so we could probably optimize the fp16->fp8 cast with a specialized intrinsic...
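cuda_fp8.h also exposes half-precision conversion intrinsics (e.g. __nv_cvt_halfraw_to_fp8), so a direct fp16->fp8 path could look roughly like the sketch below; the exact intrinsic name/signature is an assumption here and should be double-checked against the CUDA math API docs for the toolkit in use:

```cuda
#include <cuda_fp16.h>
#include <cuda_fp8.h>

// Sketch: direct fp16 -> fp8 (e5m2) conversion without a round-trip through fp32.
// Assumes __nv_cvt_halfraw_to_fp8 from cuda_fp8.h; verify the signature against
// the CUDA math API reference for your toolkit version.
__device__ unsigned char conv_half_to_e5m2(__half value) {
    return __nv_cvt_halfraw_to_fp8(static_cast<__half_raw>(value),
                                   __NV_SATFINITE, __NV_E5M2);
}
```

There is also a packed __nv_cvt_halfraw2_to_fp8x2 variant that converts two halves at a time, which may matter more for bandwidth-bound casts.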
Fun fact:

#include <cuda_fp8.h>
__device__ unsigned char conv_e5m2_nosat(float value) {
return __nv_cvt_float_to_fp8(value, __NV_NOSAT, __NV_E5M2);
}
__device__ unsigned char conv_e5m2(float value) {
return __nv_cvt_float_to_fp8(value, __NV_SATFINITE, __NV_E5M2);
}

Results in two very different kernels.
But perf-wise, with all the TensorIterator overhead, they seem to be taking roughly the same time; will try to write a targeted kernel right now:

// Run me as: nvcc -gencode arch=compute_90,code=sm_90 foo.cu -O3; ncu ./a.out
#include <cuda_fp8.h>
template<__nv_fp8_interpretation_t DTYPE = __NV_E5M2, __nv_saturation_t SAT = __NV_SATFINITE>
__global__
void do_conv_e5m2(const float* inp, char* out, unsigned size) {
const auto idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= size) {
return;
}
out[idx] = __nv_cvt_float_to_fp8(inp[idx], SAT, DTYPE);
}
int main() {
float *fp32_ptr = nullptr;
char* fp8_ptr = nullptr;
constexpr unsigned numElem = 1024*1024*64;
constexpr unsigned blockSize = 512;
constexpr auto numBlocks = (numElem + blockSize - 1 ) / blockSize;
cudaMalloc(&fp32_ptr, sizeof(*fp32_ptr)*numElem);
cudaMalloc(&fp8_ptr, sizeof(*fp8_ptr)*numElem);
do_conv_e5m2<__NV_E5M2, __NV_SATFINITE><<<numBlocks, blockSize>>>(fp32_ptr, fp8_ptr, numElem);
do_conv_e5m2<__NV_E5M2, __NV_NOSAT><<<numBlocks, blockSize>>>(fp32_ptr, fp8_ptr, numElem);
cudaDeviceSynchronize();
return 0;
}
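If this targeted kernel turns out to be purely bandwidth-bound, one follow-up worth trying is the packed intrinsic (__nv_cvt_float2_to_fp8x2), which converts two floats per call and halves the number of 1-byte stores. A minimal sketch, assuming __nv_cvt_float2_to_fp8x2 behaves as documented and that the element count is even:

```cuda
#include <cuda_fp8.h>

// Sketch: packed fp32x2 -> fp8x2 conversion, two elements per thread.
// Assumes __nv_cvt_float2_to_fp8x2 from cuda_fp8.h and an even element count;
// a real kernel would also need to handle an odd tail element.
__global__ void do_conv_e5m2_x2(const float2* inp, unsigned short* out, unsigned numPairs) {
    const auto idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= numPairs) {
        return;
    }
    out[idx] = __nv_cvt_float2_to_fp8x2(inp[idx], __NV_SATFINITE, __NV_E5M2);
}
```

This could be launched with the same grid math as above over numElem / 2 threads, reinterpret_cast-ing the fp32 and fp8 pointers to float2* and unsigned short*.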
Another thing to consider:
I explored this further and also created a CUDA kernel to do the casting here: Performance comparisons against inductor/triton can be found here: Note, though, that I also need to add an option to return the matrix in transposed format (which we will need to fuse for the backward). For delayed scaling we would expect the absmax calculation to be fused into the prior op, and then this fused kernel could be used based off of the historical scale. We should weigh, though, whether it is worth the added complexity to ship this in this repo.
moved to pytorch/ao#559
Summary
There are two components to this: non-saturated casting and saturated casting.
Non-Saturated Casting

Saturated Casting
float8_experimental/float8_experimental/float8_utils.py, line 19 in cdcadb5
There do appear to be PTX intrinsics for doing saturated casts; see: https://github.com/openai/triton/blob/10f59d8ce04052521c1bc0cb3a3f8b98918fc7e3/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp#L182
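For reference, the Triton lowering linked above emits cvt.rn.satfinite PTX on Hopper; a rough inline-asm sketch of the same idea (packing two fp32 values into e4m3x2) might look like the following. The operand packing order (which input lands in the low vs. high byte of the result) is an assumption and should be verified against the PTX ISA docs, so treat this as a sketch rather than a drop-in:

```cuda
// Sketch: saturated fp32x2 -> e4m3x2 cast via inline PTX (requires sm_90 and a
// recent PTX ISA). The mapping of a/b to the low/high byte of the packed result
// is assumed here; check the PTX ISA spec before relying on it.
__device__ unsigned short cvt_f32x2_to_e4m3x2_satfinite(float a, float b) {
    unsigned short packed;
    asm volatile("cvt.rn.satfinite.e4m3x2.f32 %0, %1, %2;"
                 : "=h"(packed)
                 : "f"(a), "f"(b));
    return packed;
}
```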