
Made Picture::toImage happen on the IO thread with no need for an onscreen surface. #9813

Merged · 10 commits · Jul 16, 2019

Conversation

gaaclarke (Member)

This fixes ShaderWarmUp for iOS (possibly Android too).
References: flutter/flutter#35923

I suspect there are a few other issues related to this bug too; this was probably a problem for Android as well. Here are a few extra benefits of this change:

  1. Now ui.Picture.toImage won't interfere with regular rendering.
  2. In the fullscreen case we should be getting a few more milliseconds of shader warm-up.
  3. It simplifies the API by removing the SnapshotDelegate.

I have a few concerns about this PR:

  1. The thread safety of PersistentCache is suspect. I'll have to audit it further. Chinmay said it should be thread-safe.
  2. There is a race condition where rendering the first frame to the onscreen surface does not wait for the shader warm-up to complete. In the worst case (assuming PersistentCache is thread-safe) we have already compiled a shader in the warm-up, but the first frame won't have access to it. To fix that we'd have to give ShaderWarmUp some ability to block the GPU thread (one possible mechanism is sketched below).
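
A minimal sketch of one such blocking mechanism, assuming fml's latch primitive; this is an illustration only, not part of this PR:

```cpp
#include "flutter/fml/synchronization/waitable_event.h"

// Hypothetical latch shared between the IO and GPU threads.
static fml::AutoResetWaitableEvent warmup_latch;

// IO thread: signal once the warm-up draws have been flushed.
void OnShaderWarmUpComplete() {
  warmup_latch.Signal();
}

// GPU thread: block the first frame until warm-up has finished, so the
// shaders compiled into PersistentCache are visible to it.
void WaitForShaderWarmUp() {
  warmup_latch.Wait();
}
```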

I'm open to any suggestions for automated testing.

liyuqian (Contributor) left a comment

Thank you Aaron!

This has been something that I wanted to do for a long time. I'd say that doing warm-up/snapshot on the IO thread without blocking is better than doing the work on the GPU thread, or blocking GPU thread in some other ways. By blocking, we're giving users no frame at all during the warm-up; by not blocking, the users can at least see some frames as soon as possible. Of course, we can always provide developers flexibility on whether to block or not.

For unit tests, I think you can add something to testing/dart/canvas_test.dart: record a Picture, call Picture.toImage, and then check that its pixels are as expected. The only challenge I know of is how to run the Dart test locally. So far, the only way I know is to run testing/run_tests.sh. You can modify the script to only run the test that you care about.

Once this gets landed, please feel free to add some comments about the performance and UX difference between this new IO-thread approach, and the old GPU-thread approach in the design doc (https://docs.google.com/document/d/1dTcjrK5Nn7roa3WftaKcoOk7-2u_2HItCahlNc73Xt8/edit#heading=h.ar902g5pm5jo) for our future reference.

}
}

sk_sp<SkImage> MakeRasterSnapshot(sk_sp<SkPicture> picture,
Contributor:

Here are some points that Chinmay and I discussed offline, and that we think would be nice to add to the code comment:

  1. MakeRasterSnapshot makes a CPU-backed image (instead of a GPU-backed image). It enables peeking at pixels from Dart code, or encoding the image for storage, but it introduces extra overhead for converting GPU images to CPU images. (Raster means CPU in Skia's jargon.) See the sketch after this list.

  2. Maybe it's nice to also have a MakeSnapshot API that generates a GPU-backed image without the GPU-to-CPU overhead. It doesn't have to be added in this PR. Maybe just leave a comment here and add it in a future PR, potentially with some benchmarks to verify the performance gain of MakeSnapshot over MakeRasterSnapshot.

  3. Shader warm-up only needs MakeSnapshot (and as Skia pointed out, we can further modify the GrContext to skip all GPU operations other than shader compilation). We can still preserve the behavior of using MakeRasterSnapshot in this PR, but eventually we'll want to switch to the cheaper version. Maybe leave a comment in the code so we don't forget this.

  4. In some cases, the developer may just want a GPU-backed snapshot and directly composite it back into a GPU surface. MakeRasterSnapshot will be quite expensive for this case, as it triggers a GPU-to-CPU copy and then a CPU-to-GPU copy. Maybe leave a comment here to suggest that the future MakeSnapshot API should be used for this case.
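
A minimal sketch of the distinction, using plain Skia calls; the function names are hypothetical and this is not the engine's actual implementation:

```cpp
#include "third_party/skia/include/core/SkImage.h"
#include "third_party/skia/include/core/SkPicture.h"
#include "third_party/skia/include/core/SkSurface.h"

// CPU-backed ("raster") snapshot: forces a device-to-host copy so the pixels
// can be peeked at or encoded, at the cost of the extra transfer.
sk_sp<SkImage> MakeRasterSnapshotSketch(sk_sp<SkPicture> picture,
                                        sk_sp<SkSurface> surface) {
  surface->getCanvas()->drawPicture(picture);
  sk_sp<SkImage> device_image = surface->makeImageSnapshot();
  return device_image ? device_image->makeRasterImage() : nullptr;
}

// GPU-backed snapshot (the proposed future MakeSnapshot): the image stays on
// the device, which is all that shader warm-up or re-compositing needs.
sk_sp<SkImage> MakeSnapshotSketch(sk_sp<SkPicture> picture,
                                  sk_sp<SkSurface> surface) {
  surface->getCanvas()->drawPicture(picture);
  return surface->makeImageSnapshot();
}
```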

gaaclarke (Member Author):

Done. I'll add an issue once this lands.


gaaclarke (Member Author):

> I'd like to add that #2 was discussed here at length:
> #8835
>
> Would it be possible to work #2 in since this section of code is related? Having a GPU-backed image and avoiding the GPU-to-CPU overhead would be beneficial.
>
> Adding @Hixie - I will soon address the concerns brought up there about increased documentation and tests in this area.

Noted. I think it's something we should do as well, but I think it should be prioritized against other issues and not serve as an impediment to this change.

}

{
TRACE_EVENT0("flutter", "DeviceHostTransfer");
Contributor:

Maybe leave a comment here like "Here device means GPU and host means CPU; this is different from use cases like Flutter driver tests where device means mobile devices and host means laptops/desktops."

gaaclarke (Member Author):

Done.

chinmaygarde (Member) left a comment

The ui_unittests target is now OpenGL ES capable on the IO thread. Please add a unit test for this. See image_decoder_unittests.cc for a harness you can use as a reference.
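
A hypothetical outline of such a test, modeled loosely on the referenced harness; the fixture and test names here are assumptions, not real engine APIs:

```cpp
#include "gtest/gtest.h"

// SnapshotTest is a hypothetical fixture standing in for the kind of
// GL-capable harness that image_decoder_unittests.cc sets up.
class SnapshotTest : public ::testing::Test {};

TEST_F(SnapshotTest, CanRasterizePictureOnIOThread) {
  // 1. Record a trivial SkPicture.
  // 2. Post MakeRasterSnapshot to the IO task runner, handing it the test's
  //    GL-capable resource context.
  // 3. Latch until the callback delivers the image, then assert on its
  //    dimensions and pixel contents.
}
```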

MakeSnapshotSurface(picture_bounds, resource_context);
sk_sp<SkImage> raster_image = MakeRasterSnapshot(picture, surface);

fml::TaskRunner::RunNowOrPostTask(
Member:

RunNowOrPostTask is only used if there is some form of latching involved; there isn't any here. Please post the task directly to the UI task runner. I realize this may be because the previous code did the same, but there really is no reason to do it here.
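
A minimal sketch of the suggested change, assuming a ui_task_runner and a raster_image from the surrounding code (both names are stand-ins):

```cpp
#include "flutter/fml/task_runner.h"
#include "third_party/skia/include/core/SkImage.h"

// Hypothetical helper: deliver the finished snapshot to the UI thread.
void DeliverSnapshot(fml::RefPtr<fml::TaskRunner> ui_task_runner,
                     sk_sp<SkImage> raster_image) {
  // RunNowOrPostTask runs inline when the caller is already on the target
  // thread, which only matters when latching on the result; a plain
  // PostTask suffices here.
  ui_task_runner->PostTask([raster_image]() {
    // Resolve the Dart-side callback with the completed image.
  });
}
```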

gaaclarke (Member Author):

Done.

gaaclarke (Member Author):

I had to revert this too; it was causing a deadlock for some tests.

/// @param[in] picture The picture that will get converted to an image.
/// @param[in] surface The surface that will be used to render the picture. This
/// will be CPU or GPU based.
/// @todo Currently this creates a RAM backed image regardless of what type of
Member:

"RAM backed" is odd terminology. Even device memory is in RAM right? Can we stick to just the device/host terminology like you have done for the traces?

gaaclarke (Member Author):

IMO device/host is very confusing. Even yuqian asked me to add a comment to his code to clarify them. Images in RAM vs VRAM is the model I've used when dealing with the discrete-GPU model. Can you tell me exactly what you want here and I'll plop it in.

sk_sp<SkSurface> MakeSnapshotSurface(const SkISize& picture_size,
fml::WeakPtr<GrContext> resource_context) {
SkImageInfo image_info = SkImageInfo::MakeN32Premul(
picture_size.width(), picture_size.height(), SkColorSpace::MakeSRGB());
Member:

If the color space is nullptr, it defaults to sRGB. We have stuck to this default behavior elsewhere. Please do the same here as well.
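
A sketch of the suggested simplification, relying on the default (nullptr) color-space argument instead of passing SkColorSpace::MakeSRGB() explicitly; the helper name is hypothetical:

```cpp
#include "third_party/skia/include/core/SkImageInfo.h"

// MakeN32Premul's color-space parameter defaults to nullptr, which the
// engine treats as sRGB per the convention described above.
SkImageInfo MakeSnapshotImageInfo(const SkISize& picture_size) {
  return SkImageInfo::MakeN32Premul(picture_size.width(),
                                    picture_size.height());
}
```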

gaaclarke (Member Author):

Done.

gaaclarke (Member Author):

I had to remove this change; it started causing tests to fail.

gaaclarke (Member Author) commented Jul 15, 2019

> The ui_unittests target is now OpenGL ES capable on the IO thread. Please add a unit test for this. See image_decoder_unittests.cc for a harness you can use as a reference.

I've tested the unit test I wrote for yuqian. It actually is running on the GPU. Maybe we can add an optional third argument that specifies that you want it to run on the GPU if available. I'm a bit worried this could mess with golden-image tests as-is, since it might move those over to GPU from CPU.

I attempted this, Chinmay, but it is non-trivial for the following reasons:
1) The function I would be testing takes Dart_Handles as input and output. I'm sure there is a way to introspect into them; I'm not sure it's worth doing from the C++ side.
2) The functions rely on a global UIDartState. We could factor that out into an argument and rework it as the subset needed for my function.
3) I'd have to rewrite the golden-image testing I made for Dart in C++.

I think the best way to test this would be to run the test on the Dart side like I did at yuqian's request. Do you have any ideas on how we could take what you did in ui_unittests and get Dart running on top of it so we can test it similarly?

chinmaygarde (Member):

> It actually is running on the GPU.

Flutter tester does not run the GPU backend. Are you sure? The ui_unittests can run with both the GPU and CPU backend on a per test basis.

chinmaygarde (Member):

For example, see image_decoder_unittests.cc for ValidImageResultsInSuccess and CanDecodeWithoutAGPUContext. That target also sets up Dart code to run in AOT as well as JIT modes.

gaaclarke (Member Author):

> > It actually is running on the GPU.
>
> Flutter tester does not run the GPU backend. Are you sure? The ui_unittests can run with both the GPU and CPU backend on a per test basis.

I was mistaken; no, they run on the CPU.

gaaclarke (Member Author):

@chinmaygarde We talked about this offline; the automated test covers the CPU render path. The GPU render path was only tested manually, and there is some future work to get Dart running (with dependencies) on the GPU backend.

chinmaygarde (Member):

Filed flutter/flutter#36224 and flutter/flutter#36225 to remind us to continue this yak shave.

jason-simmons (Member):

This affected the memory usage metrics in the performance benchmarks.

Picture.toImage is called during app startup by the ShaderWarmUp class in the framework. After this change, the total memory usage of Gallery at startup is reported as several MB higher than before in the Android dumpsys meminfo report. The difference is due to a combination of increased native heap and graphics memory consumption.

I'm not sure why moving picture rasterization to the IO thread has this effect. It may be related to implementation details of how Android's malloc handles requests from multiple threads (similar to what we were looking at in flutter/flutter#36079).

@chinmaygarde

gaaclarke (Member Author) commented Jul 17, 2019

> This affected the memory usage metrics in the performance benchmarks.
>
> Picture.toImage is called during app startup by the ShaderWarmUp class in the framework. After this change, the total memory usage of Gallery at startup is reported as several MB higher than before in the Android dumpsys meminfo report. The difference is due to a combination of increased native heap and graphics memory consumption.

Are the performance benchmarks running with the OpenGL backend? If so, a memory difference is expected, because previously ShaderWarmUp could incorrectly run on the CPU renderer, which meant no shaders were actually warmed up. The added memory should be the compiled shaders. I wouldn't expect those to be several MB in size, though; even as raw text they are maybe 1 MB.

gaaclarke (Member Author):

> This affected the memory usage metrics in the performance benchmarks.

Another thing to consider is that, depending on how you are measuring memory usage, this could produce a higher number because the code is more parallelized. For example, if it is measuring peak allocations, we are now potentially running the GPU thread at the same time as the IO thread, so peak usage could be higher. Previously everything was put on the GPU thread, so peak usage was gated on what that one thread was doing.

liyuqian (Contributor):

It seems that most memory regressions are on Android, and I believe the shader warm-up was previously using the GPU on Android (it was only using the CPU on iOS).

Having resources on the IO thread in addition to those on the GPU thread might be a good explanation. In that case, maybe we should find some way to confirm the conjecture. Finally, if the 10MB cost is unavoidable on the IO thread, then we need to figure out the trade-offs to decide whether to use the GPU or IO thread by default (the decision could be different for Android and iOS), and probably provide options for developers to choose which thread to use.

jason-simmons (Member):

You can see the effect on memory usage by starting a simple app created with `flutter create` and running `adb shell dumpsys meminfo`.

I took these samples on a Nexus 5X.

Before this PR (values in KB):

Java Heap:       4124
Native Heap:    13340
Code:           12560
Stack:            556
Graphics:       50792
Private Other:  15660
System:          5161

After:

Java Heap:       4172
Native Heap:    23224
Code:           13556
Stack:            568
Graphics:       61824
Private Other:  15692
System:          7232

The additional memory usage appears to start when `surface->getCanvas()->flush()` is called on the IO thread in `MakeRasterSnapshot`.

gaaclarke (Member Author):

Here are traces from before and after this patch of the first frame:

post.json.gz
pre.json.gz

gaaclarke (Member Author):

Here is one more trace that shows the GPU no longer being blocked by the warm-up. This results in a first frame that arrives roughly 75 ms sooner.

concurrent.json.gz

liyuqian (Contributor) commented Aug 9, 2019

Several notes as @xster and I recently discussed:

  • The iOS persistent cache only has GLSL, not binaries. Hence doing warm-up on the IO thread is going to hurt iOS performance, as the shaders covered in DefaultShaderWarmUp still need to be compiled from GLSL to binary.
  • Warm-up on the IO thread also breaks some GrContext-related resource warm-up, as the IO thread and GPU thread have different GrContexts. (I remember one particular example is the ProgramBuilder, which takes about 30ms to initialize in @xster's trace.)
  • Before this patch, our GPU-thread warm-up probably didn't work for the iOS add-to-app scenario, but it should have been working for standalone iOS apps, as the traces in #9813 (comment) show.

Hence we might want to reconsider how to do shader warm-up on iOS. It seems to be better to use the GPU thread for standalone iOS apps.

liyuqian added a commit to liyuqian/flutter that referenced this pull request Aug 9, 2019
The update is copied from an update we made to a Google-internal
client: cl/260202900

The update will save 1 shader compilation.

This should help solve our regression:
flutter#31203

More regressions on iOS might be introduced later by
flutter/engine#9813 (comment)

Unfortunately, we didn't rebase our benchmarks so such regressions
were not detected.

Hence to fully solve flutter#31203,
we might need to revert some change in
flutter/engine#9813 to make iOS shader warm-up
happen on the GPU thread again.
liyuqian added a commit to flutter/flutter that referenced this pull request Aug 15, 2019