Skip to content
This repository was archived by the owner on Feb 25, 2025. It is now read-only.

[Impeller] Implement framebuffer-fetch via subpasses in Vulkan without extensions. #49787

Closed
wants to merge 4 commits into from

Conversation

chinmaygarde
Copy link
Member

  • Subpasses are not exposed in the HAL and the need for subpasses in Vulkan can be determined based on the presence and use of input-attachments in the shaders. This information is already reflected by the compiler. Because of this, all references to subpasses have been removed from APIs above the HAL.
  • RenderPassBuilderVK is a lightweight object used to generate render passes to use either with the pipelines (compat, base, or per-subpass) or during rendering along with the framebuffer. Using the builder also sets up the right subpass dependencies. As long as the builder contains compatible attachments and subpass counts, different subpasses stamped by the builder (via the Build method) are guaranteed to be compatible per the rules in the spec.
  • Pass attachments are now in the eGeneral layout. There was no observable difference in performance when manually inserting the right transitions. Except, a lot of transitions needed to be inserted. If we need it, we can add it back in short order. I wouldn't be averse to adding it if reviewers insist.
  • Depending on where the subpass cursor is for each command, a different pipeline variant is necessary. For instance, if a command uses a pipeline at subpass 1 of 10 and that same pipeline is reused later in say subpass 6, the variant for subpass 1 is not suitable for subpass 6 (VkGraphicsPipelineCreateInfo::subpass(uint32_t) is part of the compat rules). Creation of these subpass variants from the lone-pass (subpass index 0 of count 1) prototype is done via a preload operation with jobs submitted to a concurrent worker. The preload can only happen once the number of passes needed can be determined. On mobile and desktop devices at hand, the observation was that the variants obtained from the prototype already in a PipelineCacheVK was extremely fast. Even so, once the variant is obtained, it is cached in PipelineVK. Notably, this is not present in PipelineCacheVK. That top-level cache only contains prototypes for the lone-pass pipeline configuration. The allows for purging of subpass variants of which there can theoretically be an unbounded number, and also a single point where subpass prototype creation can be elided completely if the rasterization_order_attachment_access extension is present.
  • Speaking of the rasterization_order_attachment_access extension, its use has been removed in this patch. I am prototyping adding it back to measure the overhead introduced by manual subpass management. If the overhead is measurable, we can use the extension on devices that have it as an added optimization.
  • The complexity of command encoding remains linear (to the number of commands) per pass.
  • This patch only works on a single color attachment being used as an input attachment. While this is sufficient for current use cases, the Metal implementation is significantly more capable since the multiple attachments and attachment types (depth) are already supported. Rounding out support for this is in progress.
  • This patch contains some test harness updates for MoltenVK that will be backed out and submitted separately.

…t extensions.

* Subpasses are not exposed in the HAL and the need for subpasses in Vulkan can
  be determined based on the presence and use of input-attachments in the
  shaders. This information is already reflected by the compiler. Because of
  this, all references to subpasses have been removed from APIs above the HAL.
* `RenderPassBuilderVK` is a lightweight object used to generate render passes
  to use either with the pipelines (compat, base, or per-subpass) or during
  rendering along with the framebuffer. Using the builder also sets up the
  right subpass dependencies. As long as the builder contains compatible
  attachments and subpass counts, different subpasses stamped by the builder
  (via the `Build` method) are guaranteed to be compatible per the rules in the
  spec.
* Pass attachments are now in the `eGeneral` layout. There was no observable
  difference in performance when manually inserting the right transitions.
  Except, a lot of transitions needed to be inserted. If we need it, we can add
  it back in short order. I wouldn't be averse to adding it if reviewers
  insist.
* Depending on where the subpass cursor is for each command, a different
  pipeline variant is necessary. For instance, if a command uses a pipeline at
  subpass 1 of 10 and that same pipeline is reused later in say subpass 6, the
  variant for subpass 1 is not suitable for subpass 6
  (`VkGraphicsPipelineCreateInfo::subpass(uint32_t)` is part of the compat
  rules). Creation of these subpass variants from the lone-pass (subpass index
  0 of count 1) prototype is done via a preload operation with jobs submitted
  to a concurrent worker. The preload can only happen once the number of passes
  needed can be determined. On mobile and desktop devices at hand, the
  observation was that the variants obtained from the prototype already in a
  `PipelineCacheVK` was extremely fast. Even so, once the variant is obtained,
  it is cached in `PipelineVK`. Notably, this is not present in
  `PipelineCacheVK`. That top-level cache only contains prototypes for the
  lone-pass pipeline configuration. The allows for purging of subpass variants
  of which there can theoretically be an unbounded number, and also a single
  point where subpass prototype creation can be elided completely if the
  `rasterization_order_attachment_access` extension is present.
* Speaking of the `rasterization_order_attachment_access` extension, its use has
  been removed in this patch. I am prototyping adding it back to measure the
  overhead introduced by manual subpass management. If the overhead is
  measurable, we can use the extension on devices that have it as an added
  optimization.
* The complexity of command encoding remains linear (to the number of commands)
  per pass.
* This patch only works on a single color attachment being used as an input
  attachment. While this is sufficient for current use cases, the Metal
  implementation is significantly more capable since the multiple attachments
  and attachment types (depth) are already supported. Rounding out support for
  this is in progress.
* This patch contains some test harness updates for MoltenVK that will be backed
  out and submitted separately.
@chinmaygarde
Copy link
Member Author

#49686 and #49780 are expected to cause significant merge conflicts. So after gathering feedback here, I am going to re-apply the patches on top of those two changes when they land. cc @jonahwilliams

@chinmaygarde
Copy link
Member Author

This patch works for the existing use-cases with no validation errors. Will work on the minor WIP items while the other changes below the HAL land before rebasing.

@chinmaygarde chinmaygarde added the Work in progress (WIP) Not ready (yet) for review! label Jan 16, 2024
@jonahwilliams
Copy link
Member

Awesome: some quick thoughts:

Re: rasterization_order_attachment_access

Yeah lets drop it for now, I'd happily take a single fast enough system over a mess of extensions. not to mention #47131 is its own mess.

Re: command look ahead.

As you already know my patches are removing the command recording which you're using to calculate subpass count in this patch. however, I think we already know the exact number of advanced blends in entity_pass (https://github.com/flutter/engine/blob/main/impeller/entity/entity_pass.cc#L77 ) so perhaps we can recycle this information into the RenderPass constructor to correctly set up the subpasses. If that doesn't immediately work we could probably get away with doing a lookahead in entity pass.

@chinmaygarde
Copy link
Member Author

Yeah, the upcoming changes is why I wanted to get your eyes on this mechanism. Getting the subpass counts is necessary but not sufficient as the vkCmdNextSubpass needs to be encoded at the right time as well (as determined by ShouldAdvanceSubpass). And that does require a lookahead. I'm confident there is a workaround though. Preferably one that doesn't involve updates from above the HAL.

@jonahwilliams
Copy link
Member

jonahwilliams commented Jan 16, 2024

It seems like the two components are:

  1. Total subpass count

This is easy, we already have it computed in entity_pass. As a starting point, you could just provide that to the RenderPass constructor.

  1. Advancing bound input subpass

Without the deferred recording, I think all you have to do is advance the subpass count when a pipeline is recorded that samples from the subpass input. The render pass can maintain that state. all this seems like is a line like

if (pipelineDesc.uses_subpass) {
  subpass_index++;  
}

Side note, but from some early experiments I think this patch is breaking clips, but not related to the actual subpass usage as far as I can tell.

@chinmaygarde
Copy link
Member Author

I am treating (1) as a non-starter at the moment because it involves changes above the HAL and now relies on developers to divine the pass count before encoding. The cognitive load increases too because now you have to explain what the heck a subpass is and why callers should care. Oh, and its a Vulkan specific thing. It fundamentally changes how renderpass could be setup and used. Theres got to be way to have our cake and eat it too. I am thinking secondary command buffers ATM but need to explore some more. Momentido.

(2) is about right. You just need to collapse the first subpass if there is an input attachment load in the first command (just a load from clear). But yeah.

@jonahwilliams
Copy link
Member

I am treating (1) as a non-starter at the moment because it involves changes above the HAL and now relies on developers to divine the pass count before encoding.

But we're the developers, and we've already figured it out!

@jonahwilliams
Copy link
Member

I am treating (1) as a non-starter at the moment because it involves changes above the HAL and now relies on developers to divine the pass count before encoding.

Following up, I'm not aware of any use cases we have today where we are exposing the impeller rendering primitives to developers and expecting them to do something with them. It seems like we already have the subpass count computed today in entity pass, which all flutter users and current non flutter impeller users will benefit from.

@chinmaygarde
Copy link
Member Author

Yeah, the current use-cases in APIs above impeller/entity are easy to reason about. But, introducing the notion of subpasses above impeller/renderer add significant cognitive overhead when using that layer and muddies the waters for stuff like Flutter GPU. After reading the Vulkan spec. so much over the past few and getting the distinct sense that the API was explicit to the point where a lot of setup was idle ceremony for no real gain (and it will be for all backends other than Vulkan), I am vary of making the HAL similarly hard to use. There are now open questions like what happens if the caller specifies the wrong subpass counts but is running on Metal, do they get validations even if Metal doesn't care? If the first command uses a subpass load, does that get collapsed? etc...

I want to timebox an investigation to figure out if there is a way out. Though, I fully suspect it will come down to a question of how much performance (if any) there is a willingness to sacrifice for a nicer API.

@chinmaygarde
Copy link
Member Author

At the end of my timebox and signing off for the day. Conclusions:

I am not going to make changes to the HAL at this time. If the key observation behind direct encoding was that that all the information necessary for encoding was available at the time of the call was made, that condition is not satisfied on Vulkan because of subpasses and reactor access on OpenGL. I'm going to rebase on ToT this week and back out the direct encoding stuff on the Vulkan backend.

@chinmaygarde
Copy link
Member Author

Was discussing with @bdero at some point how we shouldn't add subpasses to the HAL. But that may have been a bad call. We can add that notion to HAL and and get stuff like validation across APIs as well as direct encode.

@jonahwilliams
Copy link
Member

I believe this change would be a large improvement for advanced blends on Vulkan devices (though I've only tested on Pixels so far). I can hold off on removing deferred encoding for Vulkan if you'd like - then this should have an easier path to landing?

I disagree that we should worry about an internal facing API for now. After we've shipped on Android we can revisit what these APIs should look like for Flutter GPU.

@chinmaygarde
Copy link
Member Author

I also lost concurrent subpass pipeline preloading with the direct encoding. So the issues were more than API ergonomics which I think we have flexibility on. Let's have a quick chat about this during our sync today to figure out a way forward.

@jonahwilliams
Copy link
Member

Do you know how much preloading is buying you? Even today we record -> encode immediately, such that I doubt there is more than a hundred microseconds between the preload start and when the PSO needs to be ready.

My intuition is that if there are enough distinct pipelines, then adding a frame with multiple advanced blends (such that we end up constructing new pipeline variants for each one) is going to drop the frame - but that might be okay as long as subsequent frames continue to use the pipeline cached entries.

@jonahwilliams
Copy link
Member

Although I see the issue, if you have to block on every new pipeline variant you don't get any benefits from batching it up 🤔

@chinmaygarde
Copy link
Member Author

Closing in favor of #50154 which solves the subpass PSO variants issue and addresses merge conflicts.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
e: impeller Work in progress (WIP) Not ready (yet) for review!
Projects
No open projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

2 participants