[Impeller] Implement framebuffer-fetch via subpasses in Vulkan without extensions. #49787
Conversation
* Subpasses are not exposed in the HAL, and the need for subpasses in Vulkan can be determined from the presence and use of input attachments in the shaders. This information is already reflected by the compiler. Because of this, all references to subpasses have been removed from APIs above the HAL.
* `RenderPassBuilderVK` is a lightweight object used to generate render passes for use either with the pipelines (compat, base, or per-subpass) or during rendering along with the framebuffer. Using the builder also sets up the right subpass dependencies. As long as the builder contains compatible attachments and subpass counts, the different render passes stamped by the builder (via the `Build` method) are guaranteed to be compatible per the rules in the spec.
* Pass attachments are now in the `eGeneral` layout. There was no observable difference in performance when manually inserting the right transitions, but a lot of transitions needed to be inserted. If we need it, we can add it back in short order. I wouldn't be averse to adding it if reviewers insist.
* Depending on where the subpass cursor is for each command, a different pipeline variant is necessary. For instance, if a command uses a pipeline at subpass 1 of 10 and that same pipeline is reused later in, say, subpass 6, the variant for subpass 1 is not suitable for subpass 6 (`VkGraphicsPipelineCreateInfo::subpass(uint32_t)` is part of the compat rules). Creation of these subpass variants from the lone-pass (subpass index 0 of count 1) prototype is done via a preload operation with jobs submitted to a concurrent worker. The preload can only happen once the number of passes needed can be determined. On the mobile and desktop devices at hand, obtaining the variants from a prototype already in a `PipelineCacheVK` was extremely fast. Even so, once a variant is obtained, it is cached in `PipelineVK`. Notably, it is not present in `PipelineCacheVK`: that top-level cache only contains prototypes for the lone-pass pipeline configuration. This allows for purging of subpass variants, of which there can theoretically be an unbounded number, and also gives a single point where subpass variant creation can be elided completely if the `rasterization_order_attachment_access` extension is present. A sketch of this variant mechanism follows this list.
* Speaking of the `rasterization_order_attachment_access` extension, its use has been removed in this patch. I am prototyping adding it back to measure the overhead introduced by manual subpass management. If the overhead is measurable, we can use the extension on devices that have it as an added optimization.
* The complexity of command encoding remains linear (in the number of commands) per pass.
* This patch only supports a single color attachment being used as an input attachment. While this is sufficient for current use cases, the Metal implementation is significantly more capable since multiple attachments and attachment types (depth) are already supported there. Rounding out support for this is in progress.
* This patch contains some test harness updates for MoltenVK that will be backed out and submitted separately.
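To make the subpass-variant mechanism concrete, here is a minimal sketch of deriving and caching a per-subpass variant from the lone-pass prototype. This is not the code in this patch; the `PipelinePrototype` struct, its members, and `GetSubpassVariant` are hypothetical stand-ins for the `PipelineVK`/`PipelineCacheVK` behavior described above.

```cpp
#include <vulkan/vulkan.h>

#include <cstdint>
#include <map>
#include <mutex>

// Hypothetical stand-in for a PipelineVK-style object that owns the lone-pass
// prototype (subpass index 0 of count 1) and caches its per-subpass variants.
struct PipelinePrototype {
  VkDevice device = VK_NULL_HANDLE;
  VkPipelineCache cache = VK_NULL_HANDLE;         // Top-level cache: prototypes only.
  VkGraphicsPipelineCreateInfo create_info = {};  // Fully populated for the prototype.

  std::mutex variants_mutex;
  std::map<uint32_t, VkPipeline> subpass_variants;  // Cached here, not in the top-level cache.

  // Returns (creating on first use) the variant compatible with `subpass_index`
  // of `pass`. Only renderPass and subpass differ from the prototype, so the
  // driver can satisfy this request almost entirely from the pipeline cache.
  VkPipeline GetSubpassVariant(VkRenderPass pass, uint32_t subpass_index) {
    std::lock_guard<std::mutex> lock(variants_mutex);
    auto found = subpass_variants.find(subpass_index);
    if (found != subpass_variants.end()) {
      return found->second;
    }
    VkGraphicsPipelineCreateInfo variant_info = create_info;
    variant_info.renderPass = pass;
    variant_info.subpass = subpass_index;
    VkPipeline variant = VK_NULL_HANDLE;
    if (vkCreateGraphicsPipelines(device, cache, 1u, &variant_info, nullptr,
                                  &variant) != VK_SUCCESS) {
      return VK_NULL_HANDLE;
    }
    subpass_variants[subpass_index] = variant;
    return variant;
  }
};
```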
#49686 and #49780 are expected to cause significant merge conflicts. So after gathering feedback here, I am going to re-apply the patches on top of those two changes when they land. cc @jonahwilliams
This patch works for the existing use-cases with no validation errors. I will work on the minor WIP items while the other changes below the HAL land, before rebasing.
Awesome, some quick thoughts.

Re: rasterization_order_attachment_access. Yeah, let's drop it for now; I'd happily take a single fast-enough system over a mess of extensions, not to mention #47131 is its own mess.

Re: command look-ahead. As you already know, my patches are removing the command recording which you're using to calculate the subpass count in this patch. However, I think we already know the exact number of advanced blends in entity_pass (https://github.com/flutter/engine/blob/main/impeller/entity/entity_pass.cc#L77), so perhaps we can recycle this information into the RenderPass constructor to correctly set up the subpasses. If that doesn't immediately work, we could probably get away with doing a lookahead in entity pass.
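For illustration, a minimal sketch of what stamping a render pass with one extra subpass per advanced blend could look like, assuming the advanced blend count from entity_pass is plumbed through. The function name and the single-color-attachment setup are illustrative, not the `RenderPassBuilderVK` from this patch.

```cpp
#include <vulkan/vulkan.h>

#include <cstdint>
#include <vector>

// Hypothetical helper: builds a render pass with 1 + advanced_blend_count
// subpasses over a single color attachment kept in the GENERAL layout, where
// every subpass after the first also reads that attachment as an input
// attachment. (Additional self-dependencies may be needed if a draw reads and
// writes the attachment within the same subpass; elided here for brevity.)
VkRenderPass MakeSelfDependentPass(VkDevice device,
                                   VkFormat color_format,
                                   uint32_t advanced_blend_count) {
  VkAttachmentDescription color = {};
  color.format = color_format;
  color.samples = VK_SAMPLE_COUNT_1_BIT;
  color.loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;
  color.storeOp = VK_ATTACHMENT_STORE_OP_STORE;
  color.initialLayout = VK_IMAGE_LAYOUT_GENERAL;  // Attachments stay in eGeneral.
  color.finalLayout = VK_IMAGE_LAYOUT_GENERAL;

  const uint32_t subpass_count = 1u + advanced_blend_count;
  VkAttachmentReference color_ref = {0u, VK_IMAGE_LAYOUT_GENERAL};
  VkAttachmentReference input_ref = {0u, VK_IMAGE_LAYOUT_GENERAL};

  std::vector<VkSubpassDescription> subpasses(subpass_count);
  std::vector<VkSubpassDependency> dependencies;
  for (uint32_t i = 0u; i < subpass_count; i++) {
    subpasses[i] = {};
    subpasses[i].pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
    subpasses[i].colorAttachmentCount = 1u;
    subpasses[i].pColorAttachments = &color_ref;
    if (i > 0u) {
      // Later subpasses read what the previous subpass wrote.
      subpasses[i].inputAttachmentCount = 1u;
      subpasses[i].pInputAttachments = &input_ref;
      VkSubpassDependency dep = {};
      dep.srcSubpass = i - 1u;
      dep.dstSubpass = i;
      dep.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
      dep.dstStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
      dep.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
      dep.dstAccessMask = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT;
      dep.dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;
      dependencies.push_back(dep);
    }
  }

  VkRenderPassCreateInfo info = {};
  info.sType = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO;
  info.attachmentCount = 1u;
  info.pAttachments = &color;
  info.subpassCount = subpass_count;
  info.pSubpasses = subpasses.data();
  info.dependencyCount = static_cast<uint32_t>(dependencies.size());
  info.pDependencies = dependencies.data();

  VkRenderPass pass = VK_NULL_HANDLE;
  vkCreateRenderPass(device, &info, nullptr, &pass);
  return pass;
}
```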
Yeah, the upcoming changes are why I wanted to get your eyes on this mechanism. Getting the subpass counts is necessary but not sufficient, as the …
It seems like the two components are:

1. Knowing the subpass count up front. This is easy; we already have it computed in entity_pass. As a starting point, you could just provide that to the RenderPass constructor.
2. Advancing the subpass during encoding. Without the deferred recording, I think all you have to do is advance the subpass count when a pipeline is recorded that samples from the subpass input. The render pass can maintain that state. All this seems like is a line like the sketch below.
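Something like this minimal sketch, perhaps; the struct and the `pipeline_uses_input_attachment` flag are made-up names, not code from this patch:

```cpp
#include <vulkan/vulkan.h>

#include <cstdint>

// Hypothetical per-render-pass state: tracks the current subpass and hops to
// the next one whenever an encoded pipeline reads from the subpass input.
struct SubpassCursor {
  VkCommandBuffer command_buffer = VK_NULL_HANDLE;
  uint32_t current_subpass = 0u;

  // Called once per command while encoding.
  void AdvanceIfNeeded(bool pipeline_uses_input_attachment) {
    if (!pipeline_uses_input_attachment) {
      return;
    }
    vkCmdNextSubpass(command_buffer, VK_SUBPASS_CONTENTS_INLINE);
    current_subpass++;
  }
};
```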
Side note: from some early experiments I think this patch is breaking clips, though it does not seem related to the actual subpass usage as far as I can tell.
I am treating (1) as a non-starter at the moment because it involves changes above the HAL and now relies on developers to divine the pass count before encoding. The cognitive load increases too, because now you have to explain what the heck a subpass is and why callers should care. Oh, and it's a Vulkan-specific thing. It fundamentally changes how render passes could be set up and used. There's got to be a way to have our cake and eat it too. I am thinking secondary command buffers ATM but need to explore some more. One moment. (2) is about right. You just need to collapse the first subpass if there is an input attachment load in the first command (just a load from clear). But yeah.
But we're the developers, and we've already figured it out! |
Following up: I'm not aware of any use cases we have today where we are exposing the Impeller rendering primitives to developers and expecting them to do something with them. It seems like we already have the subpass count computed today in entity_pass, which all Flutter users and current non-Flutter Impeller users will benefit from.
Yeah, regarding the current use-cases in the APIs above, I want to timebox an investigation to figure out if there is a way out. Though I fully suspect it will come down to a question of how much performance (if any) there is a willingness to sacrifice for a nicer API.
At the end of my timebox and signing off for the day. Conclusions: I am not going to make changes to the HAL at this time. If the key observation behind direct encoding was that all the information necessary for encoding was available at the time the call was made, that condition is not satisfied on Vulkan because of subpasses, and on OpenGL because of reactor access. I'm going to rebase on ToT this week and back out the direct encoding stuff on the Vulkan backend.
I was discussing with @bdero at some point how we shouldn't add subpasses to the HAL. But that may have been a bad call. We can add that notion to the HAL and get stuff like validation across APIs as well as direct encoding.
I believe this change would be a large improvement for advanced blends on Vulkan devices (though I've only tested on Pixels so far). I can hold off on removing deferred encoding for Vulkan if you'd like; then this should have an easier path to landing? I disagree that we should worry about an internal-facing API for now. After we've shipped on Android, we can revisit what these APIs should look like for Flutter GPU.
I also lost concurrent subpass pipeline preloading with the direct encoding. So the issues were more than API ergonomics, which I think we have flexibility on. Let's have a quick chat about this during our sync today to figure out a way forward.
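For reference, a minimal sketch of the kind of preloading being referred to, assuming the subpass count is known up front. The `create_variant` callback stands in for however a per-subpass variant is actually created (e.g., the hypothetical `GetSubpassVariant` in the earlier sketch), and `std::async` stands in for Impeller's concurrent worker.

```cpp
#include <vulkan/vulkan.h>

#include <cstdint>
#include <functional>
#include <future>
#include <vector>

// Hypothetical helper: fans out per-subpass variant creation to worker jobs so
// the PSOs are (ideally) ready before encoding reaches them.
std::vector<std::future<VkPipeline>> PreloadSubpassVariants(
    uint32_t subpass_count,
    const std::function<VkPipeline(uint32_t /*subpass_index*/)>& create_variant) {
  std::vector<std::future<VkPipeline>> pending;
  pending.reserve(subpass_count);
  for (uint32_t i = 0u; i < subpass_count; i++) {
    // Jobs may complete out of order; each variant is independent of the others.
    pending.push_back(std::async(std::launch::async, create_variant, i));
  }
  return pending;
}
```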
Do you know how much preloading is buying you? Even today we record -> encode immediately, such that I doubt there is more than a hundred microseconds between the preload start and when the PSO needs to be ready. My intuition is that if there are enough distinct pipelines, then adding a frame with multiple advanced blends (such that we end up constructing new pipeline variants for each one) is going to drop the frame, but that might be okay as long as subsequent frames continue to use the cached pipeline entries.
Although, I do see the issue: if you have to block on every new pipeline variant, you don't get any benefit from batching it up 🤔
Closing in favor of #50154 which solves the subpass PSO variants issue and addresses merge conflicts. |