
Cosmos Transfer2.5 inference pipeline: general/{seg, depth, blur, edge} #13066

Open
miguelmartin75 wants to merge 36 commits into huggingface:main from miguelmartin75:cosmos/transfer2.5


@miguelmartin75
Contributor

@miguelmartin75 miguelmartin75 commented Feb 2, 2026

What does this PR do?

This PR introduces the Cosmos Transfer2.5 inference pipeline, which extends the existing code in transformer_cosmos.py and adds a new controlnet class for Cosmos. The conversion script is also updated to convert the new checkpoints.

I've intentionally split the controlnet from the base predict model to match the rest of the diffusers codebase. To do this, I had to duplicate some layers/weights from the base model (those for the patch and timestep embeddings), but I believe SD3 does the same.

Similar to predict2.5, I have added documentation and unit tests.

Additional PRs will be submitted for the following features (in order of priority):

  1. Auto-regressive (AR) inference support. Currently, inference can only be applied to a fixed number of frames, whereas cosmos-transfer2.5 performs AR inference.
  2. Additional transfer2.5 variants:
    • multi-control (multiple controlnets at once)
    • auto/multiview
  3. Image reference

In addition, unfortunately, the guardrails safety model is too aggressive: it currently flags the examples we have in cosmos-transfer2.5 as "not safe" (e.g. the edge example for 93 frames is flagged). The guardrail model needs to be updated, but that work is largely orthogonal to this PR.

Who can review?

Core library:

@miguelmartin75 miguelmartin75 changed the title from "Cosmos/transfer2.5" to "Cosmos Transfer2.5 inference pipeline: general/{seg, depth, blur, edge}" on Feb 2, 2026
Collaborator

@yiyixuxu yiyixuxu left a comment


Thanks for the PR! The overall structure looks good. I left some minor comments.

One question before I can review further: Are the base transformer weights the same across the different control variants?

This helps us understand whether splitting the controlnet from the transformer makes sense (i.e., can users mix and match?), and also helps me understand whether the controlnet is required for this pipeline.

@miguelmartin75
Contributor Author

miguelmartin75 commented Feb 6, 2026

Addressed your comment about transfer2_5_forward and updated the example code.

Are the base transformer weights the same across the different control variants? ... can users mix and match?

Yes, mixing and matching controlnets is possible, but only if an image context reference is not included (see here; including an image reference is not currently supported in this PR). Additionally, using multiple controlnets at once ("multicontrol") will be possible (any base transformer can be used; cosmos-transfer2.5 always picks "edge"), but I will need to submit a separate PR for this. Note that multicontrol does not support an image reference.
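To make the compatibility rules above concrete, here is a minimal sketch in plain Python. The function name `is_supported` and the variant tuple are hypothetical and not part of diffusers or this PR. The rule applied when an image reference is present (a single controlnet that matches the base variant) is an assumption inferred from the comment above, not confirmed behavior.

```python
from itertools import product

# Control variants named in the PR title.
VARIANTS = ("edge", "depth", "seg", "blur")


def is_supported(base, controls, has_image_reference=False):
    """Hypothetical helper: is this base/controlnet combination allowed?

    base: variant the base transformer was taken from.
    controls: list of controlnet variants (more than one means multicontrol).
    """
    if base not in VARIANTS:
        raise ValueError(f"unknown base variant: {base!r}")
    if not controls or any(c not in VARIANTS for c in controls):
        raise ValueError(f"invalid controlnet list: {controls!r}")
    if has_image_reference:
        # Assumption: with an image reference the base weights differ per
        # variant, so only a single, matching controlnet is allowed, and
        # multicontrol is ruled out.
        return len(controls) == 1 and controls[0] == base
    # Without an image reference, any base can be paired with any controlnet(s).
    return True


# Every (base, controlnet) pair is allowed when no image reference is used.
pairs = [(b, c) for b, c in product(VARIANTS, repeat=2) if is_supported(b, [c])]
print(len(pairs))
```

The same enumeration of pairs is the kind of loop one could use to drive a sanity check over every base transformer and controlnet combination.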

To be more specific, the base transformer weights are almost the same. The difference lies in the weights of the cross-attention layers for an image reference (see here), i.e. attn2 in diffusers-land, across all blocks in the base transformer. Without an image reference, all base transformers are functionally the same; in this case the img_context tensor is torch.zeros. I also qualitatively verified all pairs of base transformer + controlnet as a sanity check, and they appear to output the same results.
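One way to check the "almost the same" claim is to diff the parameter names of two base transformers and confirm that every differing entry belongs to attn2. The sketch below uses toy dictionaries of plain floats instead of real checkpoints; the helper `differing_keys` and the parameter names are illustrative assumptions, not the actual keys in the converted checkpoints.

```python
def differing_keys(state_a, state_b, tol=0.0):
    """Return the sorted parameter names whose values differ by more than tol."""
    if state_a.keys() != state_b.keys():
        raise ValueError("state dicts do not share the same parameter names")
    changed = []
    for name in sorted(state_a):
        a, b = state_a[name], state_b[name]
        if len(a) != len(b) or any(abs(x - y) > tol for x, y in zip(a, b)):
            changed.append(name)
    return changed


# Toy stand-ins for two base transformers (names and values are made up).
edge_base = {
    "transformer_blocks.0.attn1.to_q.weight": [0.1, 0.2],
    "transformer_blocks.0.attn2.to_k.weight": [0.3, 0.4],
    "transformer_blocks.0.ff.net.0.proj.weight": [0.5, 0.6],
}
depth_base = dict(edge_base)
depth_base["transformer_blocks.0.attn2.to_k.weight"] = [0.3, 0.9]

changed = differing_keys(edge_base, depth_base)
only_attn2 = all(".attn2." in name for name in changed)
print(changed, only_attn2)
```

With real models, the same comparison could be run on the output of PyTorch's `state_dict()` after flattening each tensor to a list, or with a tensor-aware comparison in place of the element loop.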

I will need to document this when I have a PR up for the image reference feature, item (3) in my description.


2 participants