
Bug: Perf Regression in PP throughput after Pull #461 (...R4 CUDA impl) #474

@usrlocalben

Description

What happened?

While testing an IQ4 quant of R1-0528, I noticed that PP throughput on my system dropped sharply, e.g. 75/s -> 12/s, making it basically equal to TG throughput. With IQ4 plus the Q8 shared tensors on the GPU I expect PP > 60/s.

Comparing with an all-Q8_0 quant, I see what I expect: PP > 50/s (on main/HEAD as of today).

I bisected and found that this problem was introduced by Pull #461 (commit 1429291).

However, my IQ4 quant doesn't have any _R4 tensors: it uses Q8 for the shared tensors and IQ4_K for the remaining ones.

The presence or absence of --run-time-repack neither causes nor avoids the problem.

CUDA device is RTX 8000 (Turing)

Glancing over the commit, I mostly see changes that seem clearly restricted to _R4-suffixed components. There are some shared parts where n_interleaved is propagated down the template stack (iqk_mmvq.cu), but at a casual glance nothing strikes me as odd, though I'm certainly not very familiar with that code. The dot product interface did change from returning the computed result to mutating an accumulator passed by pointer, which could be worth a look; see the sketch below.
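
For illustration, this is roughly the shape of the interface change I mean, as a minimal host-side C++ sketch with invented names (not the actual iqk_mmvq.cu device code):

```cpp
// Hypothetical sketch of the interface change (names invented; not the
// actual iqk_mmvq.cu code).
#include <cstdio>

// Old style: the dot-product helper returns the computed partial sum.
static float vec_dot_return(const float * x, const float * y, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}

// New style: the helper accumulates into a caller-provided pointer, so an
// interleaved variant can update several row sums in one call.
template <int n_interleaved>
static void vec_dot_accumulate(const float * x, const float * y, int n, float * acc) {
    for (int r = 0; r < n_interleaved; ++r)
        for (int i = 0; i < n; ++i)
            acc[r] += x[r * n + i] * y[i];
}

int main() {
    const float x[4] = {1, 2, 3, 4};   // two interleaved "rows": {1,2} and {3,4}
    const float y[2] = {1, 1};
    printf("old: %.1f\n", vec_dot_return(x, y, 2));   // 3.0
    float acc[2] = {0.0f, 0.0f};
    vec_dot_accumulate<2>(x, y, 2, acc);
    printf("new: %.1f %.1f\n", acc[0], acc[1]);       // 3.0 7.0
    return 0;
}
```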

An aside, but maybe related: there were recent PRs related to mla/fa with somewhat vague language regarding Turing support (Pulls #386 and #408). I say vague because #386 indicates Turing is not supported, while #408 indicates support is extended to Turing, but I'm not sure they refer to the same thing, and the changes in #408 don't seem very significant. It's not clear what the proper mla/fa settings should be on Turing at this time. I currently use -mla 2 -fa.

What operating system are you seeing the problem on?

Linux
