[Community] cancelable and asynchronous pipelines #374
Comments
I agree this would be helpful. |
Cool idea @keturn indeed! I think we had similar ideas for

Also cc @anton-l @patil-suraj |
cc @anton-l @patil-suraj again |
@keturn - do you by any chance already have a design in mind that we could use to enable asynchronous pipelines? |
Not completely. My primary familiarity with asynchronous APIs in Python is through Twisted, and I get the feeling that is not a common API among your target audience in this day and age. I imagine you want something framework-agnostic, but with an eye toward being convenient in Jupyter notebooks, so I'd look to see if there's established precedent for successful async APIs in notebooks.

My searches for async + PyTorch didn't find much, but there is at least one good example of an async PyTorch pipeline. He looks like a very busy guy these days, but I bet @lantiga could offer some valuable guidance here. I also heard that your peers at @gradio-app have just released a version that does live updates for iterative outputs, so they might have some ideas too. |
We might have an issue with returning intermediate results from |
IMO we can be a bit more lenient here with the safety checker since these are intermediate results (also cc @natolambert). However, I think such a pipeline would be best implemented as a community pipeline rather than an official pipeline :-) |
Cancelable and asynchronous are not linked (as far as I understand) to generators and yielding. Cancelable and asynchronous:
Generator and yielding:
Another benefit of using generators here is when batching might be needed: https://huggingface.co/docs/transformers/v4.22.1/en/main_classes/pipelines#pipeline-batching

Feel free to correct me if I misunderstood anything! |
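For illustration, here is a rough sketch of the pattern the linked docs describe: a transformers pipeline consuming a generator, with batching handled internally. The task, inputs, and batch size below are placeholders, not anything prescribed by this issue:

```python
from transformers import pipeline

# A generator of inputs; the pipeline pulls from it lazily,
# so results stream back as they are produced.
def inputs():
    for i in range(32):
        yield f"sample input {i}"

# batch_size lets the pipeline group inputs into batches internally
# while the caller keeps iterating lazily over the outputs.
pipe = pipeline("text-classification")

for result in pipe(inputs(), batch_size=8):
    print(result)
    # Simply stopping this loop leaves the remaining work undone,
    # which is the generator-style notion of "cancellation".
```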
You are correct that we've muddied the discussion here with several different concepts, each with its own concerns. The original request was for cancellation support: the ability to interrupt work on a task that's in progress, allowing the process to move on to other work. Use cases might be things like:
async comes up because that's an interface that represents "task in progress," and async interfaces often provide a

A generator API is different, though it's somewhat mixed together in Python's history of using

That is, if you decide you are no longer interested in the results of a generator, you "cancel" it by simply not asking it for any more results.

(And you might want generators anyway, for gradio reasons.)
As I understand it, CUDA is asynchronous (as are most calls to graphics drivers), but as you say, PyTorch is not. IMHO nearly all libraries should be async, because APIs that fail to reflect the fact that they're waiting on some external event inevitably lead to programs that block far more than they need to. But that's a different rant. We're not going to rewrite PyTorch today.

And PyTorch is at least kind enough to release the GIL, which allows us to run it in a thread efficiently.

So what does all that mean for a Python API? Again, this is where my Python knowledge is at least six versions out of date, leaving me feeling very rusty for designing an API going into 2023. It might be that this doesn't call for an API change after all, and it's better addressed by
|
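To make the "run it in a thread" idea concrete, here is a minimal sketch (plain asyncio, not an existing diffusers API) of wrapping a blocking pipeline call so an async application stays responsive; `pipe` and the prompt are placeholders:

```python
import asyncio

def run_pipeline_blocking(pipe, prompt):
    # Ordinary synchronous call; PyTorch releases the GIL during the
    # heavy kernels, so other Python threads keep running meanwhile.
    return pipe(prompt)

async def generate(pipe, prompt):
    # Offload the blocking call to a worker thread (Python 3.9+),
    # keeping the event loop free for other tasks.
    return await asyncio.to_thread(run_pipeline_blocking, pipe, prompt)

async def main(pipe):
    task = asyncio.create_task(generate(pipe, "a placeholder prompt"))
    # The asyncio task can be cancelled from the caller's side, but the
    # worker thread will still run to completion unless the pipeline
    # itself checks a cancellation flag (e.g. via a step callback).
    return await task
```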
You are entirely correct! In my day-to-day I tend to see memory allocations being the issue more often than not, but you're right that it's lazy work in general.
It's more that some calls (like printing, fetching values, or moving data off the device) ARE blocking, because they need to wait on the GPU to finish before they can deliver the final data.
I tend to err on the other side, that NO library should be async and ordering work should be left to the kernel, but I think we would agree that the function-color problem is the biggest issue. (And realistically, async/not-async is here to stay no matter our opinions.)
Not a maintainer of this lib, so take my opinion with a grain of salt, but I would refrain from doing that here. Parallelism, async, threading, and multiprocessing are choices, and the best solution will probably be different for different use cases. Do you want to maximize CPU usage and hence use all cores? Do you want to minimize latency in a webserver context (hence cancelable, with CPU parallelism used for the computations themselves)? Are you using multiple nodes to do some job on GPU? Do you want 25% of your cores doing video processing and only 75% doing model inference? How does that play with your GPU usage? All of these are very different contexts, and pushing users in one direction is likely to mislead some into suboptimal courses.

I think it's impossible to be exhaustive on the matter, and it isn't even a library's responsibility. The best a library can realistically do is explain what it's doing in terms of parallelism so users can adjust, and always provide a way to DISABLE parallelism of any kind so that users can do the parallelization further up the stack. One example we had to implement: huggingface/transformers#5486

That being said, having somewhere in the docs (or just this issue) to point users to when they ask for help would definitely help. |
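As a small example of the "let users disable it" point: the Hugging Face tokenizers library exposes an environment variable for exactly this. Shown here only for illustration; it affects the tokenizers' internal thread pool, not anything in diffusers itself:

```python
import os

# Turn off the Rust tokenizers' internal parallelism so the application
# can manage its own processes/threads (e.g. when forking web-server workers).
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```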
Bump to unstale. I believe this is a very important feature; surely we're one step closer to solving this now that pipeline callbacks are here. On that topic, does interrupting the pipeline with a callback at each step incur any slowdown? |
@irgolic As for the current callback implementation, strictly and technically speaking, in terms of the number of sequential CPU instructions being executed, yes. At the very least, even with an empty callback function that does nothing, some Python bytecode will still be generated and executed. For example, as of Python 3.10, let's say we have the following code:

```python
def main():
    pass
```

Meanwhile, if we have the following code instead:

```python
def func():
    pass

def main():
    func()
```

Comparing what the two versions are compiled to, notice that at the very least there is an additional instruction sequence to load the callee, call it, and discard its result. Of course, with empty callback functions, this is negligible with modern CPUs that can execute billions of instructions per second. Most of the time/work would instead be spent on the Stable Diffusion model itself. However, I would not recommend putting heavy computation/workload in your callback functions, as it is ultimately still a synchronous operation.

Regarding the extra logic involved in implementing this feature, I have put some of my comments here about the backward-compatibility aspect that might be relevant to whoever is looking into implementing this feature. Extra care might be needed if this feature is to be implemented in the "native" pipelines.

That said, I can see GUI-heavy applications greatly benefitting from asynchronous pipelines. |
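For anyone who wants to reproduce the comparison, Python's built-in dis module shows the extra instructions an (empty) function call adds; the exact opcodes and offsets vary between CPython versions, and the comments below reflect roughly what 3.10 emits:

```python
import dis

def func():
    pass

def main_empty():
    pass

def main_with_call():
    func()

# An empty body compiles to little more than LOAD_CONST (None) + RETURN_VALUE.
dis.dis(main_empty)

# Calling an empty function adds extra instructions, roughly
# LOAD_GLOBAL (func), CALL_FUNCTION, POP_TOP on CPython 3.10.
dis.dis(main_with_call)
```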
How about something simple like #1053? Alternatively, I've found the fix for my use case: raising a custom exception through the |
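A rough sketch of that exception-based cancellation, assuming the per-step `callback`/`callback_steps` arguments that Stable Diffusion pipelines accepted around this time; the model id and prompt are placeholders, and this is not a confirmed recommendation from the maintainers:

```python
import torch
from diffusers import StableDiffusionPipeline

class PipelineCancelled(Exception):
    """Raised from the step callback to abort generation."""

cancel_requested = False  # e.g. flipped from a UI thread or request handler

def on_step(step: int, timestep: int, latents: torch.FloatTensor):
    # The pipeline invokes this between denoising steps; raising here
    # unwinds out of the sampling loop immediately.
    if cancel_requested:
        raise PipelineCancelled

# Placeholder model id; use whatever checkpoint you normally load.
pipe = StableDiffusionPipeline.from_pretrained("some/model-id")

try:
    image = pipe("a placeholder prompt", callback=on_step, callback_steps=1).images[0]
except PipelineCancelled:
    image = None  # handle the aborted run however the application needs
```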
On the topic of asynchronous pipelines, what I'd like to see is a way to yield ( IIRC, that will implicitly allow cancelling the pipeline in an async way. When |
All of our pipelines in Dream Textures use generators, and it works well for us with cancellation and step previews: https://github.com/carson-katri/dream-textures/blob/c3f3a3780c2229eb4cce7390f193b59b0569f64a/generator_process/actions/prompt_to_image.py#L509 It would be nice to have this option upstream, but I understand it may be difficult to maintain both the callback and generator methods. |
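For readers unfamiliar with the pattern, here is a stripped-down sketch (plain Python, not the diffusers API) of a generator-style step loop like the one linked above: the caller gets an intermediate result after every step and cancels simply by breaking out of the loop:

```python
import time

def fake_denoising_loop(num_steps: int = 20):
    """Stand-in for a diffusion sampling loop that yields after each step."""
    latents = 0.0
    for step in range(num_steps):
        time.sleep(0.05)   # pretend to run one denoising step
        latents += 1.0     # pretend to update the latents
        # Yield intermediate state so the caller can preview or abort.
        yield step, latents

def generate(cancel_check=lambda: False):
    result = None
    for step, latents in fake_denoising_loop():
        result = latents
        print(f"step {step}: preview of {latents}")
        if cancel_check():
            break  # "cancelling" a generator is just not asking for more results
    return result

generate()
```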
Could the callbacks help for this or is this not really sufficient? |
Also cc @williamberman |
This is a really interesting issue but also possibly really complicated! I think this very quickly gets into specific details of the Python and CUDA runtimes that we can't answer questions around or plan solutions for unless we have very specific problems we're trying to solve or inference benchmarks we're trying to hit.

I think it also pretty quickly gets into questions around standards for productionizing model serving that I'd want to get some other, more experienced people to help answer. (I'd be surprised if making model execution async from a main thread was custom-handled by every model library -- I'd assume there'd be production runtimes with standard interfaces to web-serving frameworks that would handle this. Does ONNX handle something like this?)

It sounds to me like for basic use cases, such as accessing intermediate outputs or interrupting pipeline execution (rather naively with exceptions -- I'd assume there are better ways to do so in a production environment), callbacks are sufficient.

I think an open-ended discussion about how we can assist people productionizing diffusers (more broadly than just cancelable and async pipelines) might be a good fit for a forum discussion, but I think the issue as it currently stands is a little too broad to know how to help :) |
Going to close the discussion for now, but if anyone feels the question isn't adequately answered, feel free to re-open! |
In #developers-corner we have a request for cancellation support in the pipeline API.
That could also be something to consider in the context of a more asynchronous API in general; some applications want to avoid blocking the thread if all the slow parts are on a GPU device and the CPU is available to do other work.
I don't expect this is something to fit in the next release or two, but we can plant the seed to start thinking what sort of API could provide these features.