
[Discussion] Decreasing binding calls #2009

@shford

Description

Prior to trying PyAV I was only accessing ffmpeg via subprocess and ffmpeg-normalize. This led to a lot of overhead. I describe the details in the next paragraph; feel free to skip it. The short version is that I had a lot of both runtime and I/O overhead.

This had the drawback that subprocess incurs a lot of overhead with every call. subprocess filtering also couldn't talk efficiently to ffmpeg-normalize, which led to some hacky workarounds and otherwise unnecessary file I/O. I found that ffmpeg-normalize with libmp3lame triggers a bug when normalizing .mp3 files, so every audio file I processed required: filtering a .mp3, converting the .mp3 to another format (I chose .wav, as it's the best supported by ffmpeg-normalize), reading the .wav, invoking ffmpeg-normalize on the .wav, converting the .wav back to .mp3, writing the .mp3 from ffmpeg-normalize, reading the .mp3, and finishing filtering.

I tried PyAV because with direct bindings I'd be able to cut out the superfluous subprocess calls and I/O operations.

I was surprised to see my PyAV implementation was significantly slower (~9 times slower without threading & ~2.5 times slower with threading) than just calling subprocess repeatedly.

From profiling, it appears the bulk of the time is spent pushing and pulling frames through the filter graph. Each individual frame object must be converted to its C object (small overhead), processed inside ffmpeg (pretty dang fast), and then converted back to Python (small overhead). The problem is that this small overhead is paid for every one of the thousands of frames in every file processed, so it multiplies.
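Something like this cProfile harness makes the split visible (a minimal sketch; `frames` and `graph` are assumed to be set up as for the function at the end of this post):

import cProfile
import pstats

profiler = cProfile.Profile()
processed = profiler.runcall(process_frames, frames, graph)  # function shown below
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)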

You can alleviate pull overhead by pushing more often and letting the results build up in the underlying C buffer; then, when you pull, you pull a batch all at once. But I suspect letting frames build up in an underlying C buffer still incurs a lot of reallocating, because there's no way to tell the underlying function how much data to expect. And this workaround only helps on the pull side; it doesn't help at all with batch pushing.
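For concreteness, here's a minimal sketch of that pull-side workaround (assuming the same Graph.push/Graph.pull behavior as the function at the end of this post): push all input up front, then drain the sink in one batch.

import av

def process_frames_batched(frames, graph):
    processed = []
    # push everything first so output accumulates in the underlying C buffers
    for frame in frames:
        graph.push(frame)
    graph.push(None)  # signal end of input
    # then pull the accumulated output in one batch
    while True:
        try:
            processed.append(graph.pull())
        except (av.BlockingIOError, av.EOFError):
            # EOF means fully drained; EAGAIN shouldn't persist once
            # input is flushed, so this sketch treats both as "done"
            return processed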

Threading will divide the overhead between threads, but the aggregate overhead (roughly (py_to_c_conversion_t + c_to_py_conversion_t) * avg_frames_per_file * files) won't be decreased, just divided. And that's not enough to offset this sizeable performance hit. Maybe if you have a Threadripper... a man can dream.
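To put illustrative numbers on that (assumed figures, not measurements):

# back-of-envelope with assumed per-frame conversion costs
py_to_c_conversion_t = 2e-6   # seconds per Py -> C frame conversion (assumed)
c_to_py_conversion_t = 2e-6   # seconds per C -> Py frame conversion (assumed)
avg_frames_per_file = 5000
files = 500
threads = 8

total = (py_to_c_conversion_t + c_to_py_conversion_t) * avg_frames_per_file * files
print(total)            # 10.0 -- seconds of pure conversion overhead
print(total / threads)  # 1.25 -- threading divides the cost, it doesn't remove it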

I think the only solution is to add a layer of C wrappers around ffmpeg's filtering functions that essentially just hold the converted data (Py to C), pass it to the underlying ffmpeg functions, poll for completion, accumulate the returned data, and return once all processed frames are available. Essentially this would enable efficient batch processing and drastically reduce overhead.
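The interface I have in mind would look something like this (hypothetical: nothing below exists in PyAV today, and the loop body would have to live in Cython/C):

class BatchFilterGraph:
    """Hypothetical wrapper around a configured av.filter.Graph."""

    def __init__(self, graph):
        self.graph = graph

    def process_batch(self, frames):
        """Would convert all `frames` Py -> C in one pass, run the entire
        push/poll/pull loop in C against the underlying ffmpeg filter
        functions, accumulate the output AVFrames, and convert them back
        to Python objects in a single pass at the end."""
        raise NotImplementedError("would require new Cython in PyAV")

That way, one Py<->C boundary crossing per batch replaces one per frame.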

I read through the parts of the PyAV documentation that looked potentially relevant, as well as the GitHub example file for audio processing (my use case). I apologize if I'm missing something and there's already a neat way to do this. This may also just be outside the scope of the project, but I wanted to try to contribute a bit in my own very limited way by bringing it up :)

It also occurred to me that some part of ffmpeg itself must handle passing frames to the underlying filter functions, so maybe that would be the place to look? But I couldn't get the project to build, and the source code was way over my head, so I'm kind of stuck on that front.

Here's how I was processing frames (link to the file):

import av


def process_frames(frames, graph):
    processed_frames = []
    frame_iter = iter(frames)
    pending = None  # frame taken from the iterator but not yet accepted by the graph
    has_frames_to_push = True
    while True:
        # try to push the next input frame, if available
        if has_frames_to_push:
            try:
                if pending is None:
                    pending = next(frame_iter)
                graph.push(pending)
                pending = None
            except StopIteration:
                has_frames_to_push = False
                graph.push(None)  # signal end of input
            except (av.BlockingIOError, av.EOFError):
                # benign: the graph just isn't ready for more input yet;
                # keep `pending` so the frame is retried, not dropped
                pass

        # poll to pull any available frames
        while True:
            try:
                f = graph.pull()
                if f is None and not has_frames_to_push:
                    # input is flushed and nothing is left to pull: done
                    return processed_frames
                elif f is not None:
                    processed_frames.append(f)
                break  # not done yet; go push or poll more frames
            except av.BlockingIOError:
                # the graph isn't ready to produce output yet;
                # go back and try to push more input
                break
            except av.EOFError:
                # some implementations signal completion via this error :')
                return processed_frames
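For context, the setup around that function looks roughly like this (a minimal sketch using PyAV's filter Graph API; the filename and the volume filter are placeholders):

import av
import av.filter

container = av.open("input.mp3")  # placeholder filename
stream = container.streams.audio[0]

graph = av.filter.Graph()
src = graph.add_abuffer(template=stream)  # source fed by decoded frames
flt = graph.add("volume", "0.5")          # placeholder filter
sink = graph.add("abuffersink")           # sink that processed frames are pulled from
src.link_to(flt)
flt.link_to(sink)
graph.configure()

processed = process_frames(container.decode(stream), graph)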
