
[RFC] Hardware-accelerated video decoding #2439

Closed
Tracked by #4392
bryandeng opened this issue Jul 9, 2020 · 16 comments · Fixed by #5019

bryandeng commented Jul 9, 2020

🚀 Feature

Hardware-accelerated video decoding

Motivation

Now that torchscriptable transforms with native GPU support have landed, hardware-accelerated video decoding could further help relieve the I/O bottleneck commonly seen in large-scale video deep learning tasks.

Pitch

This functionality would most likely be built on FFmpeg's hardware acceleration APIs, since FFmpeg is already in use and this makes it easier to support multiple hardware platforms and platform APIs.

Alternatives

Decord and NVIDIA VPF are both PyTorch-friendly video I/O libraries that support hardware-accelerated video decoding (NVIDIA only) to some extent.
NVIDIA VPF is built directly on the NVIDIA Video Codec SDK, without FFmpeg.
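For illustration, a rough sketch of NVDEC-backed decoding with Decord (assuming a Decord build with GPU support; the file name and frame indices are placeholders):

```python
# Sketch only: assumes decord was compiled with NVDEC/CUDA support.
import decord

decord.bridge.set_bridge("torch")                       # return frames as torch tensors
vr = decord.VideoReader("clip.mp4", ctx=decord.gpu(0))  # decode on GPU 0
frames = vr.get_batch(list(range(0, 64, 8)))            # (8, H, W, 3) uint8 frames
```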

Additional context

https://trac.ffmpeg.org/wiki/HWAccelIntro
https://developer.nvidia.com/video-encode-and-decode-gpu-support-matrix-new

fmassa (Member) commented Jul 9, 2020

Hi,

Thanks for opening the issue!

I agree that video decoding on the GPU would be a nice functionality to have, although it's not in the near-term plan for now (maybe in 6 months?).

About seeking in videos, we are preparing a revamp of the video-reading abstractions that will be more generic and allow more flexibility when performing video decoding. cc @bjuncek

For video transformations on the GPU, note that we are making all transforms in torchvision torchscriptable and Tensor-based, so they will natively support the GPU if needed.
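As a rough illustration (the specific transform list and shapes here are just an example):

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# Tensor-based transforms can be composed with nn.Sequential, scripted,
# and applied directly to CUDA tensors (e.g. decoded video frames).
transforms = nn.Sequential(
    T.ConvertImageDtype(torch.float32),
    T.Resize(256),
    T.CenterCrop(224),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
)
scripted = torch.jit.script(transforms)

frames = torch.randint(0, 256, (8, 3, 480, 640), dtype=torch.uint8, device="cuda")
out = scripted(frames)  # stays on the GPU end to end
```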

bryandeng changed the title from "Support hardware-accelerated video decoding" to "[RFC] Hardware-accelerated video decoding" on Nov 10, 2020
bryandeng (Author):
Updated description after the release of torchscriptable transforms.

bryandeng (Author):
We may first

  • discuss which hardware platforms to support
  • find out the constraints on input video formats and output pixel formats, etc.

JuanFMontesinos commented May 18, 2021

Hi,
I wrote this small library, which wraps ffmpeg and does exactly that:
https://github.com/JuanFMontesinos/PyNVIdeoReader
For those who need it, it can work in the meantime.

The only drawback compared to NVIDIA VPF is that the numpy arrays end up allocated on the CPU. However, it allows you to use all the FFmpeg tools, so it's far more flexible.

I think something like this could be adapted to the current video reader that torchvision has.
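For illustration, the general pattern (not my library's exact API, just a sketch of wrapping the ffmpeg CLI with NVDEC and reading raw frames back into numpy) is roughly:

```python
# Rough sketch: decode with NVDEC via the ffmpeg CLI and read raw RGB frames
# back over a pipe, so the arrays end up on the CPU (as mentioned above).
import subprocess
import numpy as np

width, height = 1280, 720  # assumed known in advance (e.g. via ffprobe)
proc = subprocess.Popen(
    ["ffmpeg", "-hwaccel", "cuda", "-i", "clip.mp4",
     "-f", "rawvideo", "-pix_fmt", "rgb24", "pipe:1"],
    stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
)
frame_size = width * height * 3
while True:
    buf = proc.stdout.read(frame_size)
    if len(buf) < frame_size:
        break
    frame = np.frombuffer(buf, dtype=np.uint8).reshape(height, width, 3)
```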

dwrodri commented Jul 13, 2021

Pinging this thread to check for interest. Nvidia has been working on a bridge between their GPUs' encoders/decoders and popular machine learning libraries (link here), but it's still in beta. Is there any interest in making this a feature in PyTorch?

I could see something like this introducing some unintuitive edge cases, since an iterator that returns CUDA Tensors may not play well with DataLoaders that leverage multiprocessing. That said, the goal of this feature addition would be to offer developers the option to leverage the Nvidia GPU decoders without installing another dependency.
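To make that edge case concrete, here's a minimal sketch (the dataset and shapes are invented): a dataset whose items are CUDA tensors breaks with fork-started workers once the parent has touched CUDA, and even with a spawn context, returning CUDA tensors from workers is something the DataLoader docs discourage.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class GpuDecodedClips(Dataset):
    """Stand-in for a dataset that returns GPU-decoded frames."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # Placeholder for frames decoded straight into a CUDA tensor.
        return torch.rand(16, 3, 224, 224, device="cuda")

# With the default "fork" start method, workers cannot re-initialize CUDA once
# the parent process has used it; a "spawn" context (or num_workers=0, or
# decoding on the GPU in the main process) sidesteps that, though sharing CUDA
# tensors across processes still has subtleties.
loader = DataLoader(GpuDecodedClips(), batch_size=2, num_workers=2,
                    multiprocessing_context="spawn")
```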

If the integration can be done through FFmpeg's C++ API, though, then it could probably be expanded fairly easily to include Intel's and AMD's decoder APIs as well, assuming they're all available from the same API.

JuanFMontesinos:

@dwrodri
In fact, Nvidia offers a C++ API (the Video Codec SDK) that implements this GPU decoding, in case anyone wants to create a customized library.
In my case I just found that calling FFmpeg was simpler for the problem I wanted to solve.

From my point of view it would be a really interesting tool. Working with videos on the CPU is painful and I've run into many issues. DALI works like a charm, but it's not very flexible.

dwrodri commented Jul 13, 2021

Most Python libraries I've seen that interface with the decoder onboard an Nvidia GPU pawn the work off to FFmpeg's CLI using subprocess. There's nothing inherently wrong with this, and it's definitely the most "quick and dirty" way to get it done. Unfortunately, I don't think this approach would suffice for PyTorch: most users would want to access the decoded frames as a CUDA Tensor as they exit the decoder.

libAV, the library form of FFmpeg, offers a wrapper around the proprietary hardware accelerators it supports. If I can find the time, I'd like to see if I can put together a PR that would allow someone to pass a torch.device to the torchvision.io.VideoReader.__init__() call and then set up the backend accordingly using libAV. That way, you could use FFmpeg's bindings for AMD's AMF/VCE, which is especially relevant because of PyTorch's recently added ROCm support. Of course, all of this is easier said than done.
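Concretely, the call shape I'm imagining is something like the following (purely hypothetical; the device argument does not exist yet):

```python
import torch
import torchvision

# Hypothetical: a device argument selecting a hardware-accelerated backend.
reader = torchvision.io.VideoReader("clip.mp4", device=torch.device("cuda:0"))
for frame in reader:
    data = frame["data"]  # ideally already a CUDA tensor
```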

DALI doesn't support variable-frame-rate video, which is unfortunate for me because I work with video containing missing frames quite often.

JuanFMontesinos:

I agree that it would be nice, but it's not straightforward. The support varies a lot across different GPU models.
Besides, the supported video formats are really specific: mp4 with YUV 4:2:0 for decoding, and mp4 with 4:2:0, 4:2:2 or 4:4:4 for encoding (again depending on the GPU model).
More info here

In short, if users want to use GPU decoding, they will very likely have to prepare the videos beforehand. I find it much simpler to pass a flag and let the user hit the codec exception if the format is wrong.
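For completeness, the kind of one-off preparation I mean is just a re-encode into a format the GPU decoder accepts, for example (file names are placeholders; the codec/pixel-format choice depends on the GPU):

```python
# Sketch: re-encode a source video to H.264 + yuv420p so it falls inside the
# formats the GPU decoder supports.
import subprocess

subprocess.run(
    ["ffmpeg", "-i", "input.mkv",
     "-c:v", "libx264", "-pix_fmt", "yuv420p",
     "-c:a", "copy",
     "prepared.mp4"],
    check=True,
)
```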

JuanFMontesinos:

Let me just add that DALI's fixed-frame-rate limitation is not that bad.
It lets you dissect a video precisely, ensuring proper seeking and audio-visual synchronization.
Different libraries handle missing frames in different ways, so you can find that they return different numbers of frames.

bjuncek (Contributor) commented Jul 19, 2021

If I understand correctly, the main drawback of GPU decoding is the lack of supported codecs? i.e. one would most likely need to prepare videos specifically for it?
IMHO this is OK, as there has to be a speed/flexibility tradeoff. CPU decoding should not be deprecated regardless.

@JuanFMontesinos have you used DALI extensively for things like audio/video training? I'm interested in whether there are some weird/unexpected failure cases there, specifically with multiple modalities (streams) - in the past NVCODEC had some major issues with that, and I've been out of touch with it for the last year or so.

JuanFMontesinos:

Hi @bjuncek
I'm just going to talk about what I know, which is Nvidia's library. I don't know whether there is anything open-source or AMD-based.

So yes, the main drawback is that Nvidia only supports H.264 (and H.265 encoding from the 30XX generation onwards).
I find that acceptable, as the performance is really nice compared to CPU decoding.

In my experience DALI works really well, as it optimizes resizing/cropping ops at the time of decoding the data (I'm not an expert, just mentioning what I read in their docs).
Let me introduce DALI a bit, just for context.
Here is a short tutorial for audiovisual loading.
In short, they conceived DALI for traditional ML tasks: video classification, image classification, audio classification, segmentation and so on.
The system is really well optimized, as it relies on a C++ loader which converts GPU data directly into PyTorch tensors.

The user doesn't have access to the streams, so I can't comment on that.
I just found that the performance is awesome and really fluid compared to CPU decoding. You save the time of allocating the tensors on the GPU, and the optimization for cropping/resizing really matters.
Note that it only works with constant frame rate. I have no idea whether the issues you found were related to variable frame rate. However, I find it convenient as it allows frame seeking by frame ID (rather than by timestamp). This lets you recover any set of frames precisely (ensuring perfect synchronization between streams).
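Just for context, a rough sketch of what a DALI GPU video pipeline feeding PyTorch looks like (written from memory of their docs, so treat the exact names and arguments as approximate):

```python
from nvidia.dali import fn, pipeline_def
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def(batch_size=2, num_threads=2, device_id=0)
def video_pipe(filenames):
    # Decode fixed-length clips directly on the GPU.
    return fn.readers.video(device="gpu", filenames=filenames,
                            sequence_length=16, name="reader")

pipe = video_pipe(filenames=["clip_a.mp4", "clip_b.mp4"])
pipe.build()
loader = DALIGenericIterator(pipe, ["frames"], reader_name="reader")

for batch in loader:
    frames = batch[0]["frames"]  # CUDA tensor, shape (batch, frames, H, W, C)
```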

So IMO it would be really awesome to adapt this to PyTorch natively. A possible drawback I see is PyTorch's DataLoader, which is Python-coded and uses multiprocessing. From what little I know, this would be problematic together with tensors allocated on the GPU. So this may force a C++ decoder?

bjuncek (Contributor) commented Sep 30, 2021

(copy from #4392)

I'm back after some time and have been doing some benchmarking:

< insert funny gif here >

For straight-up video reading, it is not obvious that GPU decoding is actually faster than CPU decoding [1].

From chats with some people who have tried it in their training pipelines, it seems there is a benefit to GPU decoding in end-to-end pipelines where decoded frames can be directly manipulated (transforms) and consumed (model) on the GPU.
This raises the question of whether we can actually support this, as our transforms are AFAIK done on the CPU, which means we'd actually have one additional memcopy to do. As @JuanFMontesinos mentions, I think a lot of DALI's performance comes from having this nicely integrated system, where everything is optimised. I'll be writing up some training code to see if I can properly test that in the following weeks.

Note that Mike from PyAV had similar thoughts and reasoning for not supporting hardware-accelerated decoding in their docs [2].


[1] Note: these comparisons were done on a machine with an above-average CPU and what used to be quite a competitive GPU. Running it on different hardware would probably give different results.

[2] https://pyav.org/docs/develop/overview/about.html : see the section "Unsupported features"

dwrodri commented Oct 6, 2021

(quoting @bjuncek's benchmarking comment above)

Thanks for sharing this info! This is good data that I'll incorporate into future discussions. You've already somewhat touched on the main case for PyTorch, so I'll elaborate a little on it here and hopefully pose some questions that move us closer to deciding on the correct approach.

You've already referenced the main case to be made in favor of GPU-side video decode: GPU-side preprocessing pipelines. The performance gains of transcoding video footage on a GPU are mainly limited by the overhead of communication over the PCIe bus.

However, the intended goal of adding this feature to PyTorch is to enable users to construct high-throughput preprocessing pipelines by handling CUDA Tensors provided straight from an iterable. DALI shows great promise, and the devs are always responsive on their issue page. That being said, they are limited by the fact that the decoding process is tightly coupled to the parameters of the preprocessing pipeline, making it difficult to support variable-frame-rate footage. Furthermore, GPU-side decode uses a minimal amount of GPU resources while freeing the CPU from performing the task, which is quite resource-intensive.

So a question follows: where does one draw the line for performance expectations of PyTorch's Python API? If you're getting to the point where things like the GIL are bottlenecking inference, there's a fantastic writeup here showing significant inference speedups when video decoding is moved from the CPU to the GPU (along with other changes). There's a strong case to be made that if you're deploying computer vision models, Python isn't the best choice for performance.

I haven't looked at Decord in a while, and it appears that they've put a lot of work into their wrapper around the video encode/decode hardware on Nvidia GPUs. I'll have to check it out.

Thanks again for sharing the useful info!

prabhat00155 (Contributor):

@bryandeng We added a GPU video decoder recently. Detailed installation instructions can be found here. It would be very helpful if you gave it a try and reported any feedback you may have.
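Roughly, usage looks like this (a sketch based on the VideoReader interface with a device argument; the file name is a placeholder, and the exact API and build requirements are in the linked instructions):

```python
# Requires a torchvision build with the GPU video decoder enabled.
import torchvision

reader = torchvision.io.VideoReader("clip.mp4", device="cuda:0")
for frame in reader:
    data = frame["data"]  # decoded frame as a CUDA tensor
```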

bryandeng (Author):

@prabhat00155 Thanks a lot! I will have a try.

JuanFMontesinos commented Mar 10, 2022

Hi, I just remembered that there is an Nvidia toolkit:
https://github.com/NVIDIA/VideoProcessingFramework

Just pasting the description, as I haven't tried it:

VPF stands for Video Processing Framework. It’s set of C++ libraries and Python bindings which provides full HW acceleration for video processing tasks such as decoding, encoding, transcoding and GPU-accelerated color space and pixel format conversions.

VPF also supports exporting GPU memory objects such as decoded video frames to PyTorch tensors without Host to Device copies. Check the Wiki page on how to build from source.

Best
