
[RFC] Hardware-accelerated video decoding #2439

Closed
Tracked by #4392
bryandeng opened this issue Jul 9, 2020 · 16 comments · Fixed by #5019

bryandeng commented Jul 9, 2020

🚀 Feature

Hardware-accelerated video decoding

Motivation

Now that torchscriptable transforms with native GPU support have landed, hardware-accelerated video decoding could further help relieve the I/O bottleneck commonly seen in large-scale video deep learning tasks.

Pitch

This functionality would most likely be built on FFmpeg's hardware acceleration APIs, since FFmpeg is already in use and this makes it easier to support multiple hardware platforms and platform APIs.

Alternatives

Decord and NVIDIA VPF are both PyTorch-friendly video I/O libraries that support hardware-accelerated video decoding (NVIDIA only) to some extent.
NVIDIA VPF is built directly on the NVIDIA Video Codec SDK, without FFmpeg.
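For illustration, a rough sketch of NVDEC-backed decoding with Decord (assuming a Decord build with GPU support; the file name and frame indices are placeholders):

```python
# Sketch only: assumes decord was compiled with NVDEC/CUDA support.
import decord

decord.bridge.set_bridge("torch")                       # return frames as torch tensors
vr = decord.VideoReader("clip.mp4", ctx=decord.gpu(0))  # decode on GPU 0
frames = vr.get_batch(list(range(0, 64, 8)))            # (8, H, W, 3) uint8 frames
```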

Additional context

https://trac.ffmpeg.org/wiki/HWAccelIntro
https://developer.nvidia.com/video-encode-and-decode-gpu-support-matrix-new

fmassa (Member) commented Jul 9, 2020

Hi,

Thanks for opening the issue!

I agree that video decoding on the GPU would be a nice functionality to have, although it's not in the near-term plan for now (maybe in 6 months?).

About seeking in videos, we are preparing a revamp of the video-reading abstractions that will be more generic and allow more flexibility when performing video decoding. cc @bjuncek

For video transformations on the GPU, note that we are making all transforms in torchvision torchscriptable and Tensor-based, so they will natively support the GPU if needed.
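As a rough illustration (the specific transform list and shapes here are just an example):

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# Tensor-based transforms can be composed with nn.Sequential, scripted,
# and applied directly to CUDA tensors (e.g. decoded video frames).
transforms = nn.Sequential(
    T.ConvertImageDtype(torch.float32),
    T.Resize(256),
    T.CenterCrop(224),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
)
scripted = torch.jit.script(transforms)

frames = torch.randint(0, 256, (8, 3, 480, 640), dtype=torch.uint8, device="cuda")
out = scripted(frames)  # stays on the GPU end to end
```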

bryandeng changed the title from "Support hardware-accelerated video decoding" to "[RFC] Hardware-accelerated video decoding" on Nov 10, 2020
bryandeng (Author):
Updated description after the release of torchscriptable transforms.

bryandeng (Author):
We may first

  • discuss which hardware platforms to support
  • find out the constraints on input video formats and output pixel formats, etc.

JuanFMontesinos commented May 18, 2021

Hi,
I wrote this small library, which wraps ffmpeg and does exactly that:
https://github.com/JuanFMontesinos/PyNVIdeoReader
For those who need it, it can work in the meantime.

The only drawback compared to NVIDIA VPF is that the numpy arrays end up allocated on the CPU. However, it allows you to use all the FFmpeg tools, so it's far more flexible.

I think something like this could be adapted to the current video reader that torchvision has.
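For illustration, the general pattern (not my library's exact API, just a sketch of wrapping the ffmpeg CLI with NVDEC and reading raw frames back into numpy) is roughly:

```python
# Rough sketch: decode with NVDEC via the ffmpeg CLI and read raw RGB frames
# back over a pipe, so the arrays end up on the CPU (as mentioned above).
import subprocess
import numpy as np

width, height = 1280, 720  # assumed known in advance (e.g. via ffprobe)
proc = subprocess.Popen(
    ["ffmpeg", "-hwaccel", "cuda", "-i", "clip.mp4",
     "-f", "rawvideo", "-pix_fmt", "rgb24", "pipe:1"],
    stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
)
frame_size = width * height * 3
while True:
    buf = proc.stdout.read(frame_size)
    if len(buf) < frame_size:
        break
    frame = np.frombuffer(buf, dtype=np.uint8).reshape(height, width, 3)
```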

dwrodri commented Jul 13, 2021

Pinging this thread to check for interest. Nvidia has been working on a bridge between their GPUs' encoders/decoders and popular machine learning libraries (link here), but it's still in beta. Is there any interest in making this a feature in PyTorch?

I could see something like this introducing some unintuitive edge cases, since an iterator that returns CUDA Tensors may not play well with DataLoaders that leverage multiprocessing. That said, the goal of this feature addition would be to offer developers the option to leverage the Nvidia GPU decoders without installing another dependency.
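To make that edge case concrete, here's a minimal sketch (the dataset and shapes are invented): a dataset whose items are CUDA tensors breaks with fork-started workers once the parent has touched CUDA, and even with a spawn context, returning CUDA tensors from workers is something the DataLoader docs discourage.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class GpuDecodedClips(Dataset):
    """Stand-in for a dataset that returns GPU-decoded frames."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # Placeholder for frames decoded straight into a CUDA tensor.
        return torch.rand(16, 3, 224, 224, device="cuda")

# With the default "fork" start method, workers cannot re-initialize CUDA once
# the parent process has used it; a "spawn" context (or num_workers=0, or
# decoding on the GPU in the main process) sidesteps that, though sharing CUDA
# tensors across processes still has subtleties.
loader = DataLoader(GpuDecodedClips(), batch_size=2, num_workers=2,
                    multiprocessing_context="spawn")
```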

If the integration can be done through FFmpeg's C++ API, though, then it could probably be expanded fairly easily to include Intel's and AMD's decoder APIs as well, assuming they're all available from the same API.

JuanFMontesinos:

@dwrodri
In fact, Nvidia offers a C++ API (the Video Codec SDK) that implements this GPU decoding, in case anyone wants to create a customized library.
In my case I just found that calling FFmpeg was simpler for the problem I wanted to solve.

From my point of view it would be a really interesting tool. Working with videos on the CPU is painful and I've run into many issues. DALI works like a charm, but it's not very flexible.

dwrodri commented Jul 13, 2021

Most Python libraries I've seen that interface with the decoder onboard an Nvidia GPU pawn the work off to FFmpeg's CLI using subprocess. There's nothing inherently wrong with this, and it's definitely the most "quick and dirty" way to get it done. Unfortunately, I don't think this approach would suffice for PyTorch: most users would want to access the decoded frames as a CUDA Tensor as they exit the decoder.

libAV, the library form of FFmpeg, offers a wrapper around the proprietary hardware accelerators it supports. If I can find the time, I'd like to see if I can put together a PR that would allow someone to pass a torch.device to the torchvision.io.VideoReader.__init__() call and then set up the backend accordingly using libAV. That way, you could use FFmpeg's bindings for AMD's AMF/VCE, which is especially relevant because of PyTorch's recently added ROCm support. Of course, all of this is easier said than done.
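Concretely, the call shape I'm imagining is something like the following (purely hypothetical; the device argument does not exist yet):

```python
import torch
import torchvision

# Hypothetical: a device argument selecting a hardware-accelerated backend.
reader = torchvision.io.VideoReader("clip.mp4", device=torch.device("cuda:0"))
for frame in reader:
    data = frame["data"]  # ideally already a CUDA tensor
```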

DALI doesn't support variable-frame-rate video, which is unfortunate for me because I work with video containing missing frames quite often.

JuanFMontesinos:

I agree that it would be nice, but it's not straightforward. The support varies a lot across different GPU models.
Besides, the supported video formats are really specific: mp4 with YUV 4:2:0 for decoding, and mp4 with 4:2:0, 4:2:2 or 4:4:4 for encoding (again depending on the GPU model).
More info here

In short, if users want to use GPU decoding, they will very likely have to prepare the videos beforehand. I find it much simpler to pass a flag and let the user hit the codec exception if the format is wrong.
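For completeness, the kind of one-off preparation I mean is just a re-encode into a format the GPU decoder accepts, for example (file names are placeholders; the codec/pixel-format choice depends on the GPU):

```python
# Sketch: re-encode a source video to H.264 + yuv420p so it falls inside the
# formats the GPU decoder supports.
import subprocess

subprocess.run(
    ["ffmpeg", "-i", "input.mkv",
     "-c:v", "libx264", "-pix_fmt", "yuv420p",
     "-c:a", "copy",
     "prepared.mp4"],
    check=True,
)
```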

JuanFMontesinos:

Let me just add that DALI's fixed-frame-rate limitation is not that bad.
It lets you dissect a video precisely, ensuring proper seeking and audio-visual synchronization.
Different libraries handle missing frames in different ways, so you can find that they return different numbers of frames.

bjuncek (Contributor) commented Jul 19, 2021

If I understand correctly, the main drawback of GPU decoding is the lack of supported codecs? i.e. one would most likely need to prepare videos specifically for it?
IMHO this is OK, as there has to be a speed/flexibility tradeoff. CPU decoding should not be deprecated regardless.

@JuanFMontesinos have you used DALI extensively for things like audio/video training? I'm interested in whether there are some weird/unexpected failure cases there, specifically with multiple modalities (streams) - in the past NVCODEC had some major issues with that, and I've been out of touch with it for the last year or so.

JuanFMontesinos:

Hi @bjuncek
I'm just going to talk about what I know, which is Nvidia's library. I don't know whether there is anything open-source or AMD-based.

So yes, the main drawback is that Nvidia only supports H.264 (and H.265 encoding from the 30XX generation onwards).
I find that acceptable, as the performance is really nice compared to CPU decoding.

In my experience DALI works really well, as it optimizes resizing/cropping ops at the time of decoding the data (I'm not an expert, just mentioning what I read in their docs).
Let me introduce DALI a bit, just for context.
Here is a short tutorial for audiovisual loading.
In short, they conceived DALI for traditional ML tasks: video classification, image classification, audio classification, segmentation and so on.
The system is really well optimized, as it relies on a C++ loader which converts GPU data directly into PyTorch tensors.

The user doesn't have access to the streams, so I can't comment on that.
I just found that the performance is awesome and really fluid compared to CPU decoding. You save the time of allocating the tensors on the GPU, and the optimization for cropping/resizing really matters.
Note that it only works with constant frame rate. I have no idea whether the issues you found were related to variable frame rate. However, I find it convenient as it allows frame seeking by frame ID (rather than by timestamp). This lets you recover any set of frames precisely (ensuring perfect synchronization between streams).
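Just for context, a rough sketch of what a DALI GPU video pipeline feeding PyTorch looks like (written from memory of their docs, so treat the exact names and arguments as approximate):

```python
from nvidia.dali import fn, pipeline_def
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def(batch_size=2, num_threads=2, device_id=0)
def video_pipe(filenames):
    # Decode fixed-length clips directly on the GPU.
    return fn.readers.video(device="gpu", filenames=filenames,
                            sequence_length=16, name="reader")

pipe = video_pipe(filenames=["clip_a.mp4", "clip_b.mp4"])
pipe.build()
loader = DALIGenericIterator(pipe, ["frames"], reader_name="reader")

for batch in loader:
    frames = batch[0]["frames"]  # CUDA tensor, shape (batch, frames, H, W, C)
```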

So IMO it would be really awesome to adapt this to PyTorch natively. A possible drawback I see is PyTorch's DataLoader, which is Python-coded and uses multiprocessing. From what little I know, this would be problematic together with tensors allocated on the GPU. So this may force a C++ decoder?

bjuncek (Contributor) commented Sep 30, 2021

(copy from #4392)

I'm back after some time and have been doing some benchmarking:

< insert funny gif here >

For straight-up video reading, it is not obvious that GPU decoding is actually faster than CPU decoding [1].

From chats with some people who have tried it in their training pipelines, it seems there is a benefit to GPU decoding in end-to-end pipelines where decoded frames can be directly manipulated (transforms) and consumed (model) on the GPU.
This raises the question of whether we can actually support this, as our transforms are AFAIK done on the CPU, which means we'd actually have one additional memcopy to do. As @JuanFMontesinos mentions, I think a lot of DALI's performance comes from having this nicely integrated system, where everything is optimised. I'll be writing up some training code to see if I can properly test that in the following weeks.

Note that Mike from PyAV had similar thoughts and reasoning for not supporting hardware-accelerated decoding in their docs [2].


[1] Note: these comparisons were done on a machine with an above-average CPU and what used to be quite a competitive GPU. Running it on different hardware would probably give different results.

[2] https://pyav.org/docs/develop/overview/about.html : see the section "Unsupported features"

dwrodri commented Oct 6, 2021

(quoting @bjuncek's benchmarking comment above)

Thanks for sharing this info! This is good data that I'll incorporate into future discussions. You've already somewhat touched on the main case for PyTorch, so I'll elaborate a little on it here and hopefully pose some questions that move us closer to deciding on the correct approach.

You've already referenced the main case to be made in favor of GPU-side video decode: GPU-side preprocessing pipelines. The performance gains of transcoding video footage on a GPU are mainly limited by the overhead of communication over the PCIe bus.

However, the intended goal of adding this feature to PyTorch is to enable users to construct high-throughput preprocessing pipelines by handling CUDA Tensors provided straight from an iterable. DALI shows great promise, and the devs are always responsive on their issue page. That being said, they are limited by the fact that the decoding process is tightly coupled to the parameters of the preprocessing pipeline, making it difficult to support variable-frame-rate footage. Furthermore, GPU-side decode uses a minimal amount of GPU resources while freeing the CPU from performing the task, which is quite resource-intensive.

So a question follows: where does one draw the line for performance expectations of PyTorch's Python API? If you're getting to the point where things like the GIL are bottlenecking inference, there's a fantastic writeup here showing significant inference speedups when video decoding is moved from the CPU to the GPU (along with other changes). There's a strong case to be made that if you're deploying computer vision models, Python isn't the best choice for performance.

I haven't looked at Decord in a while, and it appears that they've put a lot of work into their wrapper around the video encode/decode hardware on Nvidia GPUs. I'll have to check it out.

Thanks again for sharing the useful info!

prabhat00155 (Contributor):

@bryandeng We added a GPU video decoder recently. Detailed installation instructions can be found here. It would be very helpful if you gave it a try and reported any feedback you may have.
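Roughly, usage looks like this (a sketch based on the VideoReader interface with a device argument; the file name is a placeholder, and the exact API and build requirements are in the linked instructions):

```python
# Requires a torchvision build with the GPU video decoder enabled.
import torchvision

reader = torchvision.io.VideoReader("clip.mp4", device="cuda:0")
for frame in reader:
    data = frame["data"]  # decoded frame as a CUDA tensor
```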

bryandeng (Author):

@prabhat00155 Thanks a lot! I will have a try.

JuanFMontesinos commented Mar 10, 2022

Hi, I just remembered that there is an Nvidia toolkit:
https://github.com/NVIDIA/VideoProcessingFramework

Just pasting the description, as I haven't tried it:

VPF stands for Video Processing Framework. It’s set of C++ libraries and Python bindings which provides full HW acceleration for video processing tasks such as decoding, encoding, transcoding and GPU-accelerated color space and pixel format conversions.

VPF also supports exporting GPU memory objects such as decoded video frames to PyTorch tensors without Host to Device copies. Check the Wiki page on how to build from source.

Best
