[Community] cancelable and asynchronous pipelines #374
Comments
I agree this would be helpful. |
Cool idea @keturn indeed! I think we had similar ideas for

Also cc @anton-l @patil-suraj |
cc @anton-l @patil-suraj again |
@keturn - do you by any chance already have a design in mind that we could use to enable asynchronous pipelines? |
Not completely. My primary familiarity with asynchronous APIs in Python is through Twisted, and I get the feeling that is not a common API among your target audience in this day and age. I imagine you want something framework-agnostic, but with an eye toward being convenient in Jupyter notebooks, so I'd look to see if there's established precedent for successful async APIs in notebooks.

My searches for async + PyTorch didn't find much, but there is at least one good example of an async PyTorch pipeline. He looks like a very busy guy these days, but I bet @lantiga could offer some valuable guidance here. I also heard that your peers at @gradio-app have just released a version that does live updates for iterative outputs, so they might have some ideas too. |
We might have an issue with returning intermediate results from |
IMO we can be a bit more lenient here with the safety checker since these are intermediate results (also cc @natolambert). However, I think such a pipeline would be best implemented as a community pipeline rather than an official pipeline :-) |
Cancelable and asynchronous are not linked (as far as I understand) to generators and yielding. Cancelable and asynchronous:
Generator and yielding:
Another benefit of using generators here is when batching might be needed: https://huggingface.co/docs/transformers/v4.22.1/en/main_classes/pipelines#pipeline-batching

Feel free to correct me if I misunderstood anything! |
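For illustration, here is a rough sketch of the pattern the linked docs describe: a transformers pipeline consuming a generator, with batching handled internally. The task, inputs, and batch size below are placeholders, not anything prescribed by this issue:

```python
from transformers import pipeline

# A generator of inputs; the pipeline pulls from it lazily,
# so results stream back as they are produced.
def inputs():
    for i in range(32):
        yield f"sample input {i}"

# batch_size lets the pipeline group inputs into batches internally
# while the caller keeps iterating lazily over the outputs.
pipe = pipeline("text-classification")

for result in pipe(inputs(), batch_size=8):
    print(result)
    # Simply stopping this loop leaves the remaining work undone,
    # which is the generator-style notion of "cancellation".
```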
You are correct that we've muddied the discussion here with several different concepts, each with its own concerns. The original request was for cancellation support: the ability to interrupt work on a task that's in progress, allowing the process to move on to other work. Use cases might be things like:
async comes up because that's an interface that represents "task in progress," and async interfaces often provide a

A generator API is different, though it's somewhat mixed together in Python's history of using

That is, if you decide you are no longer interested in the results of a generator, you "cancel" it by simply not asking it for any more results.

(And you might want generators anyway, for gradio reasons.)
As I understand it, CUDA is asynchronous (as are most calls to graphics drivers), but as you say, PyTorch is not. IMHO nearly all libraries should be async, because APIs that fail to reflect the fact that they're waiting on some external event inevitably lead to programs that block far more than they need to. But that's a different rant. We're not going to rewrite PyTorch today.

And PyTorch is at least kind enough to release the GIL, which allows us to run it in a thread efficiently.

So what does all that mean for a Python API? Again, this is where my Python knowledge is at least six versions out of date, leaving me feeling very rusty for designing an API going into 2023. It might be that this doesn't call for an API change after all, and it's better addressed by
|
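To make the "run it in a thread" idea concrete, here is a minimal sketch (plain asyncio, not an existing diffusers API) of wrapping a blocking pipeline call so an async application stays responsive; `pipe` and the prompt are placeholders:

```python
import asyncio

def run_pipeline_blocking(pipe, prompt):
    # Ordinary synchronous call; PyTorch releases the GIL during the
    # heavy kernels, so other Python threads keep running meanwhile.
    return pipe(prompt)

async def generate(pipe, prompt):
    # Offload the blocking call to a worker thread (Python 3.9+),
    # keeping the event loop free for other tasks.
    return await asyncio.to_thread(run_pipeline_blocking, pipe, prompt)

async def main(pipe):
    task = asyncio.create_task(generate(pipe, "a placeholder prompt"))
    # The asyncio task can be cancelled from the caller's side, but the
    # worker thread will still run to completion unless the pipeline
    # itself checks a cancellation flag (e.g. via a step callback).
    return await task
```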
You are entirely correct! In my day-to-day I tend to see memory allocations being the issue more often than not, but you're right that it's lazy work in general.
It's more that some calls (like printing, fetching values, or moving data off the device) ARE blocking, because they need to wait on the GPU to finish before they can deliver the final data.
I tend to err on the other side, that NO library should be async and ordering work should be left to the kernel, but I think we would agree that the function-color problem is the biggest issue. (And realistically, async/not-async is here to stay no matter our opinions.)
Not a maintainer of this lib, so take my opinion with a grain of salt, but I would refrain from doing that here. Parallelism, async, threading, and multiprocessing are choices, and the best solution will probably be different for different use cases. Do you want to maximize CPU usage and hence use all cores? Do you want to minimize latency in a webserver context (hence cancelable, with CPU parallelism used for the computations themselves)? Are you using multiple nodes to do some job on GPU? Do you want 25% of your cores doing video processing and only 75% doing model inference? How does that play with your GPU usage? All of these are very different contexts, and pushing users in one direction is likely to mislead some into suboptimal courses.

I think it's impossible to be exhaustive on the matter, and it isn't even a library's responsibility. The best a library can realistically do is explain what it's doing in terms of parallelism so users can adjust, and always provide a way to DISABLE parallelism of any kind so that users can do the parallelization further up the stack. One example we had to implement: huggingface/transformers#5486

That being said, having somewhere in the docs (or just this issue) to point users to when they ask for help would definitely help. |
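As a small example of the "let users disable it" point: the Hugging Face tokenizers library exposes an environment variable for exactly this. Shown here only for illustration; it affects the tokenizers' internal thread pool, not anything in diffusers itself:

```python
import os

# Turn off the Rust tokenizers' internal parallelism so the application
# can manage its own processes/threads (e.g. when forking web-server workers).
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```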
Bump to unstale. I believe this is a very important feature; surely we're one step closer to solving this now that pipeline callbacks are here. On that topic, does interrupting the pipeline with a callback at each step incur any slowdown? |
@irgolic As for the current callback implementation, strictly and technically speaking, in terms of the number of sequential CPU instructions being executed, yes. At the very least, even with an empty callback function that does nothing, some Python bytecode will still be generated and executed. For example, as of Python 3.10, let's say we have the following code:

```python
def main():
    pass
```

Meanwhile, if we have the following code instead:

```python
def func():
    pass

def main():
    func()
```

Comparing what the two versions are compiled to, notice that at the very least there is an additional instruction sequence to load the callee, call it, and discard its result. Of course, with empty callback functions, this is negligible with modern CPUs that can execute billions of instructions per second. Most of the time/work would instead be spent on the Stable Diffusion model itself. However, I would not recommend putting heavy computation/workload in your callback functions, as it is ultimately still a synchronous operation.

Regarding the extra logic involved in implementing this feature, I have put some of my comments here about the backward-compatibility aspect that might be relevant to whoever is looking into implementing this feature. Extra care might be needed if this feature is to be implemented in the "native" pipelines.

That said, I can see GUI-heavy applications greatly benefitting from asynchronous pipelines. |
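For anyone who wants to reproduce the comparison, Python's built-in dis module shows the extra instructions an (empty) function call adds; the exact opcodes and offsets vary between CPython versions, and the comments below reflect roughly what 3.10 emits:

```python
import dis

def func():
    pass

def main_empty():
    pass

def main_with_call():
    func()

# An empty body compiles to little more than LOAD_CONST (None) + RETURN_VALUE.
dis.dis(main_empty)

# Calling an empty function adds extra instructions, roughly
# LOAD_GLOBAL (func), CALL_FUNCTION, POP_TOP on CPython 3.10.
dis.dis(main_with_call)
```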
How about something simple like #1053? Alternatively, I've found the fix for my use case: raising a custom exception through the |
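A rough sketch of that exception-based cancellation, assuming the per-step `callback`/`callback_steps` arguments that Stable Diffusion pipelines accepted around this time; the model id and prompt are placeholders, and this is not a confirmed recommendation from the maintainers:

```python
import torch
from diffusers import StableDiffusionPipeline

class PipelineCancelled(Exception):
    """Raised from the step callback to abort generation."""

cancel_requested = False  # e.g. flipped from a UI thread or request handler

def on_step(step: int, timestep: int, latents: torch.FloatTensor):
    # The pipeline invokes this between denoising steps; raising here
    # unwinds out of the sampling loop immediately.
    if cancel_requested:
        raise PipelineCancelled

# Placeholder model id; use whatever checkpoint you normally load.
pipe = StableDiffusionPipeline.from_pretrained("some/model-id")

try:
    image = pipe("a placeholder prompt", callback=on_step, callback_steps=1).images[0]
except PipelineCancelled:
    image = None  # handle the aborted run however the application needs
```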
On the topic of asynchronous pipelines, what I'd like to see is a way to yield ( IIRC, that will implicitly allow cancelling the pipeline in an async way. When |
All of our pipelines in Dream Textures use generators, and it works well for us with cancellation and step previews: https://github.com/carson-katri/dream-textures/blob/c3f3a3780c2229eb4cce7390f193b59b0569f64a/generator_process/actions/prompt_to_image.py#L509 It would be nice to have this option upstream, but I understand it may be difficult to maintain both the callback and generator methods. |
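For readers unfamiliar with the pattern, here is a stripped-down sketch (plain Python, not the diffusers API) of a generator-style step loop like the one linked above: the caller gets an intermediate result after every step and cancels simply by breaking out of the loop:

```python
import time

def fake_denoising_loop(num_steps: int = 20):
    """Stand-in for a diffusion sampling loop that yields after each step."""
    latents = 0.0
    for step in range(num_steps):
        time.sleep(0.05)   # pretend to run one denoising step
        latents += 1.0     # pretend to update the latents
        # Yield intermediate state so the caller can preview or abort.
        yield step, latents

def generate(cancel_check=lambda: False):
    result = None
    for step, latents in fake_denoising_loop():
        result = latents
        print(f"step {step}: preview of {latents}")
        if cancel_check():
            break  # "cancelling" a generator is just not asking for more results
    return result

generate()
```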
Could the callbacks help for this or is this not really sufficient? |
Also cc @williamberman |
This is a really interesting issue but also possibly really complicated! I think this very quickly gets into specific details of the Python and CUDA runtimes that we can't answer questions around or plan solutions for unless we have very specific problems we're trying to solve or inference benchmarks we're trying to hit.

I think it also pretty quickly gets into questions around standards for productionizing model serving that I'd want to get some other, more experienced people to help answer. (I'd be surprised if making model execution async from a main thread was custom-handled by every model library -- I'd assume there'd be production runtimes with standard interfaces to web-serving frameworks that would handle this. Does ONNX handle something like this?)

It sounds to me like for basic use cases, such as accessing intermediate outputs or interrupting pipeline execution (rather naively with exceptions -- I'd assume there are better ways to do so in a production environment), callbacks are sufficient.

I think an open-ended discussion about how we can assist people productionizing diffusers (more broadly than just cancelable and async pipelines) might be a good fit for a forum discussion, but I think the issue as it currently stands is a little too broad to know how to help :) |
Going to close the discussion for now, but if anyone feels the question isn't adequately answered, feel free to re-open! |
In #developers-corner we have a request for cancellation support in the pipeline API.
That could also be something to consider in the context of a more asynchronous API in general; some applications want to avoid blocking the thread if all the slow parts are on a GPU device and the CPU is available to do other work.
I don't expect this is something to fit in the next release or two, but we can plant the seed to start thinking what sort of API could provide these features.