float8 inference weight-only quant should map to a fused kernel or explain why not

Today, `Float8WeightOnlyConfig` maps to a reference implementation of weight-only quant, which dequantizes the tensor and then runs a high-precision gemm: https://github.com/pytorch/ao/blob/ba3ac9f2f6117ba35ff28fbb8811f61ad992dfcf/torchao/quantization/quantize_/workflows/float8/float8_tensor.py#L392

Users have reported confusion about this. We should either clearly explain that no speedup is expected or map to a fast kernel.

float8 inference weight-only quant should map to a fused kernel or explain why not #3288
