[Impeller] Implement framebuffer-fetch via subpasses in Vulkan without extensions. #49787
Conversation
* Subpasses are not exposed in the HAL, and the need for subpasses in Vulkan can be determined from the presence and use of input attachments in the shaders. This information is already reflected by the compiler. Because of this, all references to subpasses have been removed from APIs above the HAL.
* `RenderPassBuilderVK` is a lightweight object used to generate render passes for use either with the pipelines (compat, base, or per-subpass) or during rendering along with the framebuffer. Using the builder also sets up the right subpass dependencies. As long as the builder contains compatible attachments and subpass counts, the different render passes stamped by the builder (via the `Build` method) are guaranteed to be compatible per the rules in the spec.
* Pass attachments are now in the `eGeneral` layout. There was no observable difference in performance when manually inserting the right transitions, but a lot of transitions needed to be inserted. If we need it, we can add it back in short order. I wouldn't be averse to adding it if reviewers insist.
* Depending on where the subpass cursor is for each command, a different pipeline variant is necessary. For instance, if a command uses a pipeline at subpass 1 of 10 and that same pipeline is reused later in, say, subpass 6, the variant for subpass 1 is not suitable for subpass 6 (`VkGraphicsPipelineCreateInfo::subpass(uint32_t)` is part of the compat rules). Creation of these subpass variants from the lone-pass (subpass index 0 of count 1) prototype is done via a preload operation with jobs submitted to a concurrent worker. The preload can only happen once the number of passes needed can be determined. On the mobile and desktop devices at hand, obtaining the variants from a prototype already in a `PipelineCacheVK` was extremely fast. Even so, once a variant is obtained, it is cached in `PipelineVK`. Notably, it is not present in `PipelineCacheVK`: that top-level cache only contains prototypes for the lone-pass pipeline configuration. This allows for purging of subpass variants, of which there can theoretically be an unbounded number, and also gives a single point where subpass variant creation can be elided completely if the `rasterization_order_attachment_access` extension is present. A sketch of this variant mechanism follows this list.
* Speaking of the `rasterization_order_attachment_access` extension, its use has been removed in this patch. I am prototyping adding it back to measure the overhead introduced by manual subpass management. If the overhead is measurable, we can use the extension on devices that have it as an added optimization.
* The complexity of command encoding remains linear (in the number of commands) per pass.
* This patch only supports a single color attachment being used as an input attachment. While this is sufficient for current use cases, the Metal implementation is significantly more capable since multiple attachments and attachment types (depth) are already supported there. Rounding out support for this is in progress.
* This patch contains some test harness updates for MoltenVK that will be backed out and submitted separately.
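To make the subpass-variant mechanism concrete, here is a minimal sketch of deriving and caching a per-subpass variant from the lone-pass prototype. This is not the code in this patch; the `PipelinePrototype` struct, its members, and `GetSubpassVariant` are hypothetical stand-ins for the `PipelineVK`/`PipelineCacheVK` behavior described above.

```cpp
#include <vulkan/vulkan.h>

#include <cstdint>
#include <map>
#include <mutex>

// Hypothetical stand-in for a PipelineVK-style object that owns the lone-pass
// prototype (subpass index 0 of count 1) and caches its per-subpass variants.
struct PipelinePrototype {
  VkDevice device = VK_NULL_HANDLE;
  VkPipelineCache cache = VK_NULL_HANDLE;         // Top-level cache: prototypes only.
  VkGraphicsPipelineCreateInfo create_info = {};  // Fully populated for the prototype.

  std::mutex variants_mutex;
  std::map<uint32_t, VkPipeline> subpass_variants;  // Cached here, not in the top-level cache.

  // Returns (creating on first use) the variant compatible with `subpass_index`
  // of `pass`. Only renderPass and subpass differ from the prototype, so the
  // driver can satisfy this request almost entirely from the pipeline cache.
  VkPipeline GetSubpassVariant(VkRenderPass pass, uint32_t subpass_index) {
    std::lock_guard<std::mutex> lock(variants_mutex);
    auto found = subpass_variants.find(subpass_index);
    if (found != subpass_variants.end()) {
      return found->second;
    }
    VkGraphicsPipelineCreateInfo variant_info = create_info;
    variant_info.renderPass = pass;
    variant_info.subpass = subpass_index;
    VkPipeline variant = VK_NULL_HANDLE;
    if (vkCreateGraphicsPipelines(device, cache, 1u, &variant_info, nullptr,
                                  &variant) != VK_SUCCESS) {
      return VK_NULL_HANDLE;
    }
    subpass_variants[subpass_index] = variant;
    return variant;
  }
};
```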
#49686 and #49780 are expected to cause significant merge conflicts. So after gathering feedback here, I am going to re-apply the patches on top of those two changes when they land. cc @jonahwilliams
This patch works for the existing use-cases with no validation errors. I will work on the minor WIP items while the other changes below the HAL land, before rebasing.
Awesome, some quick thoughts.

Re: rasterization_order_attachment_access. Yeah, let's drop it for now; I'd happily take a single fast-enough system over a mess of extensions, not to mention #47131 is its own mess.

Re: command look-ahead. As you already know, my patches are removing the command recording which you're using to calculate the subpass count in this patch. However, I think we already know the exact number of advanced blends in entity_pass (https://github.com/flutter/engine/blob/main/impeller/entity/entity_pass.cc#L77), so perhaps we can recycle this information into the RenderPass constructor to correctly set up the subpasses. If that doesn't immediately work, we could probably get away with doing a lookahead in entity pass.
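For illustration, a minimal sketch of what stamping a render pass with one extra subpass per advanced blend could look like, assuming the advanced blend count from entity_pass is plumbed through. The function name and the single-color-attachment setup are illustrative, not the `RenderPassBuilderVK` from this patch.

```cpp
#include <vulkan/vulkan.h>

#include <cstdint>
#include <vector>

// Hypothetical helper: builds a render pass with 1 + advanced_blend_count
// subpasses over a single color attachment kept in the GENERAL layout, where
// every subpass after the first also reads that attachment as an input
// attachment. (Additional self-dependencies may be needed if a draw reads and
// writes the attachment within the same subpass; elided here for brevity.)
VkRenderPass MakeSelfDependentPass(VkDevice device,
                                   VkFormat color_format,
                                   uint32_t advanced_blend_count) {
  VkAttachmentDescription color = {};
  color.format = color_format;
  color.samples = VK_SAMPLE_COUNT_1_BIT;
  color.loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;
  color.storeOp = VK_ATTACHMENT_STORE_OP_STORE;
  color.initialLayout = VK_IMAGE_LAYOUT_GENERAL;  // Attachments stay in eGeneral.
  color.finalLayout = VK_IMAGE_LAYOUT_GENERAL;

  const uint32_t subpass_count = 1u + advanced_blend_count;
  VkAttachmentReference color_ref = {0u, VK_IMAGE_LAYOUT_GENERAL};
  VkAttachmentReference input_ref = {0u, VK_IMAGE_LAYOUT_GENERAL};

  std::vector<VkSubpassDescription> subpasses(subpass_count);
  std::vector<VkSubpassDependency> dependencies;
  for (uint32_t i = 0u; i < subpass_count; i++) {
    subpasses[i] = {};
    subpasses[i].pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
    subpasses[i].colorAttachmentCount = 1u;
    subpasses[i].pColorAttachments = &color_ref;
    if (i > 0u) {
      // Later subpasses read what the previous subpass wrote.
      subpasses[i].inputAttachmentCount = 1u;
      subpasses[i].pInputAttachments = &input_ref;
      VkSubpassDependency dep = {};
      dep.srcSubpass = i - 1u;
      dep.dstSubpass = i;
      dep.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
      dep.dstStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
      dep.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
      dep.dstAccessMask = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT;
      dep.dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;
      dependencies.push_back(dep);
    }
  }

  VkRenderPassCreateInfo info = {};
  info.sType = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO;
  info.attachmentCount = 1u;
  info.pAttachments = &color;
  info.subpassCount = subpass_count;
  info.pSubpasses = subpasses.data();
  info.dependencyCount = static_cast<uint32_t>(dependencies.size());
  info.pDependencies = dependencies.data();

  VkRenderPass pass = VK_NULL_HANDLE;
  vkCreateRenderPass(device, &info, nullptr, &pass);
  return pass;
}
```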
Yeah, the upcoming changes are why I wanted to get your eyes on this mechanism. Getting the subpass counts is necessary but not sufficient, as the …
It seems like the two components are:

1. Knowing the subpass count up front. This is easy; we already have it computed in entity_pass. As a starting point, you could just provide that to the RenderPass constructor.
2. Advancing the subpass during encoding. Without the deferred recording, I think all you have to do is advance the subpass count when a pipeline is recorded that samples from the subpass input. The render pass can maintain that state. All this seems like is a line like the sketch below.
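Something like this minimal sketch, perhaps; the struct and the `pipeline_uses_input_attachment` flag are made-up names, not code from this patch:

```cpp
#include <vulkan/vulkan.h>

#include <cstdint>

// Hypothetical per-render-pass state: tracks the current subpass and hops to
// the next one whenever an encoded pipeline reads from the subpass input.
struct SubpassCursor {
  VkCommandBuffer command_buffer = VK_NULL_HANDLE;
  uint32_t current_subpass = 0u;

  // Called once per command while encoding.
  void AdvanceIfNeeded(bool pipeline_uses_input_attachment) {
    if (!pipeline_uses_input_attachment) {
      return;
    }
    vkCmdNextSubpass(command_buffer, VK_SUBPASS_CONTENTS_INLINE);
    current_subpass++;
  }
};
```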
Side note: from some early experiments I think this patch is breaking clips, though it does not seem related to the actual subpass usage as far as I can tell.
I am treating (1) as a non-starter at the moment because it involves changes above the HAL and now relies on developers to divine the pass count before encoding. The cognitive load increases too, because now you have to explain what the heck a subpass is and why callers should care. Oh, and it's a Vulkan-specific thing. It fundamentally changes how render passes could be set up and used. There's got to be a way to have our cake and eat it too. I am thinking secondary command buffers ATM but need to explore some more. One moment. (2) is about right. You just need to collapse the first subpass if there is an input attachment load in the first command (just a load from clear). But yeah.
But we're the developers, and we've already figured it out! |
Following up: I'm not aware of any use cases we have today where we are exposing the Impeller rendering primitives to developers and expecting them to do something with them. It seems like we already have the subpass count computed today in entity_pass, which all Flutter users and current non-Flutter Impeller users will benefit from.
Yeah, regarding the current use-cases in the APIs above, I want to timebox an investigation to figure out if there is a way out. Though I fully suspect it will come down to a question of how much performance (if any) there is a willingness to sacrifice for a nicer API.
At the end of my timebox and signing off for the day. Conclusions: I am not going to make changes to the HAL at this time. If the key observation behind direct encoding was that all the information necessary for encoding was available at the time the call was made, that condition is not satisfied on Vulkan because of subpasses, and on OpenGL because of reactor access. I'm going to rebase on ToT this week and back out the direct encoding stuff on the Vulkan backend.
I was discussing with @bdero at some point how we shouldn't add subpasses to the HAL. But that may have been a bad call. We can add that notion to the HAL and get stuff like validation across APIs as well as direct encoding.
I believe this change would be a large improvement for advanced blends on Vulkan devices (though I've only tested on Pixels so far). I can hold off on removing deferred encoding for Vulkan if you'd like; then this should have an easier path to landing? I disagree that we should worry about an internal-facing API for now. After we've shipped on Android, we can revisit what these APIs should look like for Flutter GPU.
I also lost concurrent subpass pipeline preloading with the direct encoding. So the issues were more than API ergonomics, which I think we have flexibility on. Let's have a quick chat about this during our sync today to figure out a way forward.
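For reference, a minimal sketch of the kind of preloading being referred to, assuming the subpass count is known up front. The `create_variant` callback stands in for however a per-subpass variant is actually created (e.g., the hypothetical `GetSubpassVariant` in the earlier sketch), and `std::async` stands in for Impeller's concurrent worker.

```cpp
#include <vulkan/vulkan.h>

#include <cstdint>
#include <functional>
#include <future>
#include <vector>

// Hypothetical helper: fans out per-subpass variant creation to worker jobs so
// the PSOs are (ideally) ready before encoding reaches them.
std::vector<std::future<VkPipeline>> PreloadSubpassVariants(
    uint32_t subpass_count,
    const std::function<VkPipeline(uint32_t /*subpass_index*/)>& create_variant) {
  std::vector<std::future<VkPipeline>> pending;
  pending.reserve(subpass_count);
  for (uint32_t i = 0u; i < subpass_count; i++) {
    // Jobs may complete out of order; each variant is independent of the others.
    pending.push_back(std::async(std::launch::async, create_variant, i));
  }
  return pending;
}
```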
Do you know how much preloading is buying you? Even today we record -> encode immediately, such that I doubt there is more than a hundred microseconds between the preload start and when the PSO needs to be ready. My intuition is that if there are enough distinct pipelines, then adding a frame with multiple advanced blends (such that we end up constructing new pipeline variants for each one) is going to drop the frame, but that might be okay as long as subsequent frames continue to use the cached pipeline entries.
Although, I do see the issue: if you have to block on every new pipeline variant, you don't get any benefit from batching it up 🤔
Closing in favor of #50154 which solves the subpass PSO variants issue and addresses merge conflicts. |