Conversation

@yaswanth19
Contributor

@yaswanth19 yaswanth19 commented Feb 5, 2025

What does this PR do?

Fixes #35928

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@yaswanth19 yaswanth19 marked this pull request as draft February 5, 2025 16:40
@yaswanth19 yaswanth19 changed the title Add janus model 🚧 [WiP] Add janus model Feb 5, 2025
@yaswanth19
Contributor Author

yaswanth19 commented Feb 5, 2025

@zucchini-nlp I’ve started working on Janus and would love to get some guidance. Right now, I’ve just created a skeleton and implemented the ImageProcessing class.

My first major hurdle is the CONFIG. The Janus config on the Hub is quite composite and non-standard. Standard values like hidden_size, num_attention_heads, etc., seem to be hardcoded in their implementation.

From a testing perspective, how should I approach writing the config class? Loading this config directly using AutoConfig.from_pretrained() doesn’t work.

Shall I write an ad hoc script to convert this config into a standard Hugging Face config (similar to convert_weights_to_hf.py, but for config)?

Config: https://huggingface.co/deepseek-ai/Janus-Pro-1B/blob/main/config.json
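For illustration, such a conversion helper could look roughly like the sketch below (the top-level key names "language_config" and "vision_config" are how I read the linked config and should be treated as assumptions, as are the defaults):

import json

def convert_original_config(path: str) -> dict:
    # Minimal sketch: map the composite original config.json into the nested
    # text_config / vision_config layout a standard HF composite config expects.
    # Key names and default values here are illustrative, not final.
    with open(path) as f:
        original = json.load(f)

    text_cfg = original.get("language_config", {})
    vision_cfg = original.get("vision_config", {})

    return {
        "model_type": "janus",
        "text_config": {
            "hidden_size": text_cfg.get("hidden_size", 2048),
            "num_hidden_layers": text_cfg.get("num_hidden_layers", 24),
            "num_attention_heads": text_cfg.get("num_attention_heads", 16),
        },
        "vision_config": vision_cfg,
    }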

@yaswanth19 yaswanth19 changed the title 🚧 [WiP] Add janus model 🚧 [WiP] Add Janus model Feb 5, 2025
@zucchini-nlp
Member

@yaswanth19 super cool to see a draft PR!

Yeah, that reminds me of how Molmo is modeled, also with hardcoded values in the configuration. I am not sure how you usually approach testing. If you will test by matching with the actual weights, then yes, converting the config will be helpful.

As a first step, I'd suggest getting working model code and then converting the weights/config. The vision backbone should be very similar to existing CLIP models, and for the VQ part feel free to look at the Emu3 model here. When the model is converted, we can try to match logits and see in which modules the logits start to diverge. LMK if you are stuck in any place and need help :)
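For the logit matching, a small helper like the sketch below usually does the job (just an illustration; it assumes both implementations accept the same inputs and return per-layer hidden states):

import torch

def compare_implementations(hf_model, original_model, inputs, atol=1e-4):
    # Report the per-layer max abs difference to locate where activations start to diverge.
    hf_out = hf_model(**inputs, output_hidden_states=True)
    orig_out = original_model(**inputs, output_hidden_states=True)
    for i, (a, b) in enumerate(zip(hf_out.hidden_states, orig_out.hidden_states)):
        print(f"layer {i}: max abs diff = {(a - b).abs().max().item():.3e}")
    torch.testing.assert_close(hf_out.logits, orig_out.logits, atol=atol, rtol=atol)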

Btw, I would be very interested to see if Janus can handle interleaved generation of image + text in one go 👀 If that's possible, would be super super nice

@geetu040
Contributor

geetu040 commented Feb 6, 2025

Hi @yaswanth19, this is really nice. I am also really interested in the model, do you think I can collaborate with you on this one? I can help with the implementation.

@yaswanth19
Contributor Author

@geetu040 Thanks for your interest and for offering to collaborate! This is quite an ambitious PR for me, and I'd like to take on the challenge of tackling it myself. That said, I'll definitely reach out if I get stuck or am unable to continue working on it :)

@geetu040
Contributor

geetu040 commented Feb 6, 2025

Sure, understood. I wish you the very best of luck.

@hsilva664
Contributor

Hello @yaswanth19. I understand that you want to do it by yourself. Yet, I'd like to insist on collaboration a bit further. I've spent some time this week reproducing their virtual environment and running the generative and understanding modes of Janus for the 1B and 7B models on rented GPUs, as well as reading the original Janus paper, the SigLIP paper (i.e. the "understanding" image encoder) and the LlamaGen paper (from which they take the generative image tokenizer). I was somewhat busy this week, so I did not have a lot of time for this. Still, I was about to start reading their code before trying to collaborate on the integration. As I've spent some time on this already, are you OK with trying to work together?

@yaswanth19
Contributor Author

@hsilva664 Cool! Since you have already made some progress, I guess we are on the same page. Let's start with implementing the modular version. Janus has three components: SigLIP, the VQ model, and Llama. If I'm not wrong, SigLIP and Llama are straightforward, so I can pick those two and create the structure and processor around them. You can start working on the middle part, the VQ model. The reason I am insisting on splitting it this way is that the VQ model is a bit ambiguous to me, as I am not very familiar with it (best case scenario, we can inherit all the components for this model from existing ones :) ). Hence I'm trying to code up the familiar parts first and then iterate to match logits. WDYT?

@hsilva664
Contributor

That works. Also, in the Janus paper (the original one, not Janus Pro), they state that they use DeepSeek-LLM (1.3B), not Llama; I think in Janus Pro they keep it that way but change it to 7B. It is the VQ paper that uses Llama as the autoregressive decoder.

@hsilva664
Contributor

@zucchini-nlp I've added a preliminary version of the VQ part to a clone of @yaswanth19's branch and am trying to set up simple test code that will be used for checking weights/activations consistency with the original Janus model in the future (I still have to import their weight values). From what I can gather, it seems that in HF the common practice is to have a Jinja template that converts a list of dicts such as:

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What’s shown in this image?"},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "This image shows a red stop sign."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image in more details."},
        ],
    },
]

into the processed string that will then be tokenized. This template seems to usually be stored on the Hub, in JSON files like this one. In their codebase, this conversion is done by a Python module instead (see here; they use the template called "deepseek"). Just to confirm, should I change that format to a Jinja template, or would it be "normal" to keep it like what they are doing (calling a function manually to convert the list inside the processor)? Has something similar been done somewhere, so I can base myself on it (it is OK if not)?

@yaswanth19
Contributor Author

yaswanth19 commented Feb 13, 2025

@hsilva664 We have to create the Jinja template. I have created one just for model-testing purposes, and it still needs some corrections.

{
    "chat_template": "{% for message in messages %}{{ message['role'].capitalize() + ': '}}{% for content in message['content'] %}{% if content['type'] == 'text' %}{{ content['text'] }}{% endif %}{% endfor %}{% if not loop.last %}\n\n{% endif %}{% endfor %}"
  }

My testing conversation:

conversation = [
    {"role": "User", "content": [{"type": "text", "text": "<image_placeholder>\nConvert the formula into latex code.\n"}]},
    {"role": "Assistant", "content": " "},
]

First of all, this template doesn't account for the newline characters needed before and after applying it to the conversation. That is, to match the input_ids with the original implementation we currently have to hackily build the prompt as system_prompt + '\n\n' + apply_chat_template(conv) + '\n'.
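For reference, this is roughly how I'm exercising it (a sketch only; the system prompt string is a placeholder, and I'm assuming the original repo's tokenizer loads with AutoTokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/Janus-Pro-1B")
tokenizer.chat_template = (
    "{% for message in messages %}{{ message['role'].capitalize() + ': ' }}"
    "{% for content in message['content'] %}{% if content['type'] == 'text' %}"
    "{{ content['text'] }}{% endif %}{% endfor %}"
    "{% if not loop.last %}\n\n{% endif %}{% endfor %}"
)

conversation = [
    {"role": "User", "content": [{"type": "text", "text": "<image_placeholder>\nConvert the formula into latex code.\n"}]},
    {"role": "Assistant", "content": []},  # empty list instead of " " so the template's content loop stays simple
]

system_prompt = "..."  # placeholder for the original system prompt string

# Hacky concatenation currently needed to match the original implementation's input_ids:
prompt = system_prompt + "\n\n" + tokenizer.apply_chat_template(conversation, tokenize=False) + "\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids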

@hsilva664
Contributor

About the random initialization of weights (e.g. Emu3VQVAE._init_weights or Emu3PreTrainedModel._init_weights in this file): are these meant to be used when the user wants to train from scratch (otherwise they would just be overwritten by the pretrained weights and serve no purpose)? If so, unless I missed something, the Janus implementation does nothing like that and just leaves the default torch initialization. Should I do the same?

@yaswanth19
Contributor Author

yaswanth19 commented Feb 16, 2025

In the original implementation, for image generation they generate a token and then apply Classifier-Free Guidance, after which it is processed through some layers. But for the text generation task it uses the HF implementation of the generate function. AFAIK it won't be possible to have a single generate function for both of these functionalities (Image+Text -> Text and Text -> Image).

Also, I don't think it can perform interleaved generation, as mentioned by one of the authors [LINK]. For image generation they explicitly add the <begin_of_image> tag to let the model know to generate image-related logits. So I'm not sure how to add this image generation logic (in generate and in prompt creation) without hampering the text generation functionality. One hack I can think of is to use a flag where the user explicitly asks to generate an image and we process accordingly.

@zucchini-nlp I would like your thoughts on this issue; the image generation inference file is linked below.

https://github.com/deepseek-ai/Janus/blob/1daa72fa409002d40931bd7b36a9280362469ead/generation_inference.py#L94
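From the user side, that flag idea would look something like the sketch below (purely illustrative; the class names and the generation_mode kwarg just anticipate what this PR could add and are not an existing API):

import torch
from transformers import JanusForConditionalGeneration, JanusProcessor

model_id = "deepseek-community/Janus-Pro-1B"  # hypothetical checkpoint id
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Generate an image of a snowman."
inputs = processor(text=prompt, generation_mode="image", return_tensors="pt")

# The explicit flag tells generate() to take the <begin_of_image> + CFG decoding path
# instead of plain text decoding.
image_tokens = model.generate(**inputs, generation_mode="image", do_sample=True)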

@zucchini-nlp
Member

zucchini-nlp commented Feb 17, 2025

@hsilva664

Are these meant to be used when the user wants to train from scratch (otherwise these would just be overwritten by the pretrained weights and serve no purpose...)?

Yep, only for training from scratch and for some internal tests where a dummy model is initialized. If Janus uses a different initialization, you can add it in _init_weights
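For reference, a custom override boils down to something like this minimal sketch (module types and the std value are illustrative, not Janus's actual scheme):

import torch.nn as nn
from transformers import PreTrainedModel

class JanusPreTrainedModel(PreTrainedModel):
    def _init_weights(self, module):
        # Illustrative defaults; the real values would come from the config/original repo.
        std = getattr(self.config, "initializer_range", 0.02)
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)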

@yaswanth19 really interesting point, didn't know Janus would add more layers after CFG 🙃 We have CFG in transformers, so it won't be a big problem. For the rest of the layers we can override generate() and do something like below, WDYT?

def generate(self, input_ids, **kwargs):
    # If generating text, just call super()
    if kwargs.get("generation_mode", None) == "text":
        return super().generate(input_ids, **kwargs)

    # Otherwise do custom logic, and rely only on greedy/sampling methods.
    # Beam search, etc. are not supported.

    # Do some preprocessing, same as in `generation/utils.py`. Feel free to look at how MusicGen does it.
    # Would have been nice to have self.prepare_for_generate(); it was planned but low priority.
    # I think since we have Janus planned, we can raise the priority a bit.
    generation_config = ....

    for i in range(generation_config.max_new_tokens):
        next_token_outputs = super()._sample(input_ids, max_new_tokens=1, **kwargs)
        input_ids = self.layers_after_CFG(next_token_outputs)
    return next_token_outputs

cc @gante we're having our first image generation model that doesn't work with generate() as-is. I believe refactoring generation into tiny functions will help. We need to allow users to apply any pre/post-processing before and after the next token is sampled. So after separating out prefill and decoding, we can maybe add an umbrella self.get_next_token, though it will be hard to make it work with beam search and assisted decoding 😢

@yaswanth19
Contributor Author

@zucchini-nlp WDYT of the image quality? Just wanted to confirm, as you were a bit hesitant about the ANOLE model. These images are generated using the original implementation itself (in the worst case we can match it). The upper-row images are generated using Janus-Pro 7B, whereas the bottom row uses the 1B variant.

Prompt 1: Generate an image of a snowman.
Prompt 2: A close-up high-contrast photo of Sydney Opera House sitting next to Eiffel tower, under a blue night sky of roiling energy, exploding yellow stars, and radiating swirls of blue.

[Image: generated samples for both prompts; Janus-Pro 7B (top row) and Janus-Pro 1B (bottom row)]

@yaswanth19
Contributor Author

@zucchini-nlp The PR is ready for merge. The failing tests are unrelated to this PR 🚀 🤗

Here is the deepseek-community org; both the 1B and 7B checkpoints are uploaded there.

LINK: https://huggingface.co/deepseek-community

@zucchini-nlp
Member

Super super cool, thanks for adding an org! Let's resolve merge conflicts and I'll merge

@yaswanth19
Contributor Author

@zucchini-nlp Resolved the merge conflict; let's merge it before another one appears 😅

@zucchini-nlp
Member

Let's go 🚀

@zucchini-nlp zucchini-nlp merged commit a2ef3cf into huggingface:main Apr 17, 2025
18 checks passed
cyr0930 pushed a commit to cyr0930/transformers that referenced this pull request Apr 18, 2025
* Iterative generation using input embeds

* Add Janus model

* discard changes

* Janus imports

* Refactor config and processor

* Added Vision tower of Janus

* Import Janus Image processor

* Vision tower fixes

* Refactor code

* Added VQ Model

* Complete model integration

* temp conversion script

* processor refactor

* Adding files to facilitate pulling

* Fixes after debugging

* Skip test for these models

* Add Janus Model

* discard changes

* Janus imports

* Refactor config and processor

* Added Vision tower of Janus

* Import Janus Image processor

* Vision tower fixes

* Refactor code

* Added VQ Model

* Complete model integration

* temp conversion script

* processor refactor

* Adding files to facilitate pulling

* Fixes after debugging

* Refactor to Text config

* ✨ Added generate function

* Saving intermediate convert file. Still need to read configs from the hub and convert them to our format.

* Adding version that reads from the JSON files. Still have to tweak some parameters manually.

* relative imports

* Initial tests

* Refactor image processor

* Seemingly working version of the conversion script, will need to test further.

* Adding command message

* Fixing conflicting JanusTextConfig class

* Incorporating some of the discussed changes.

* Small fix to create dir.

* Removing system from JINJA template

* Adding draft processor tests

* style fixes

* Minor fixes and enhancement

* added generation config

* Initial tests

* Small modifications, tests are now passing.

* Small changes I noticed while reading code.

* more fixes

* Added JanusModel class

* Small merge adaptations

* Small merge adaptations

* Image processing tests passing

* More tests and fixes

* Convert script updated and refactored

* Tests and cleanup

* make style

* Postprocessing for image generation

* generate refactor

* fixes

* - Passing tests that write a part of the model to cpu (e.g. test_cpu_offload)
- Passing tests of dispatching SDPA
- Only gradient checkpointing tests are left.

* Removing temporary code

* Changes

* Writing change to modular

* Added JanusVisionModel. SDPA dispatch tests pass more robustly. Gradient checkpoint tests are next

* Gradient checkpoint tests passing

* Removing debug code

* Major generate refactor 😮‍💨

* Temp changes for testing

* Green quality CI

* 2 out of 4 integration tests passing

* breadcrumbs

* Usage Examples

* Regenerate modeling after merge

* dirty code

* JanusIntegrationTest are passing

* breadcrumbs

* happy CI

* fixes

* Changing template

* nits

* Text generation logits matching original codebase at 100% precision

* Remove ./tmp from git tracking

* Remove ./tmp from git tracking

* Checkpointing changes after reviewing

* Fixing code in docstrings

* CHanging comments and small bug in convert file

* Fixing bug in image_token_id for 7B version

* Removing line that was added by both of us

* Pushing changes after discussion. Only one left is to change the key mapping for convert file.

* Updating module file

* New convert file using dict. Tested that it is equivalent to the old one by:
- comparing keys in a script
- comparing checksums of the output files between version generated with the current convert script and those generated with the old script. This is a more reliable test.

* revert changes

* mistake

* consistency change for CI

* make style

* doc fixes

* more fixes

* experimenting with masking out pad token

* checkpoint

* Batched generation with multi-images working for 1B models. Will test 7B next.

* Device fix.

* Writing changes to modular, previous ones were written to modeling just for quick testing.

* Using passed processor attention mask (only in modeling for now)

* Matching performance done in the non-standard way

* Working version of batched generation. Will change how some args are passed to make it more similar to language case

* More compliant version of the code

* Removed duplicated `_prepare_4d_causal_attention_mask_with_cache_position`

* Updating modular file, making masked filling with paddings more efficient

* Slightly more efficient version

* Modifying JanusVisionModel to be a wrapper

* Fixing test to comply with new names

* Modular overhaul

* More refactoring

* - Changing JanusVisionModel back
- Changing forward pass
- Adding boi token to the comparison

* - Removing whole context model_ids
- Using inherited implementation of prepare_inputs_for_generation

* Moving the way boi token is passed to the model

* Fixing sdpa test

* Minor changes

* testing changes

* Minor fix

* - Adding postprocessing test
- checking values of generated image on integration test

* changes

* Removing pooled attention vision module, fixing convert script as a consequence

* More changes

* Fixes

* Draft after merge

* Bug fixes

* More bug fix

* Fixing docs

* Nits

* Refactor return dict

* Moving image post processing test to main processor post process

* Passing guidance_scale as kwarg

* make style

* 🔥 refactor

* make style

* Update and green CI

* Nits and tests update

* up

* Added MID block

* fix

* Dead code

* update testcase

* update

* model_id change

* init_weight changes

---------

Co-authored-by: hsilva664 <[email protected]>
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
@Dswn888

Dswn888 commented Jun 5, 2025

Hi @yaswanth19 @zucchini-nlp @hsilva664 , thanks for your excellent work! While using Janus for text to image generation, I met the following issue:
"JanusForConditionalGeneration has no _prepare_4d_causal_attention_mask_with_cache_position method defined in its base modeling class. Compiled forward passes will be sub-optimal. If you're writing code, see Llama for an example implementation. If you're a user, please report this issue on GitHub."

Any insights or guidance on this would be greatly appreciated.

@Dswn888

Dswn888 commented Jun 5, 2025

To provide more context, I used the exact text to image generation code example from https://huggingface.co/docs/transformers/main/en/model_doc/janus, and this issue occurred while calling model.generate().

@Dswn888

Dswn888 commented Jun 5, 2025

My transformers version is 4.52.4.

@zucchini-nlp
Member

I think I know where the issue stems from. @yaswanth19 want to submit a PR, or I can do it as well? We can just add _prepare_4d_causal_attention_mask_with_cache_position, or another option is to apply the new attn mask API (as in Llama) with create_causal_mask.

@yaswanth19
Contributor Author

Thanks @Dswn888 for reporting the issue. I don't have an idea right now why this resurfaced but will look into it. Yep @zucchini-nlp, I'll raise a PR with the new masking API to be up to date 🤔

@Dswn888

Dswn888 commented Jun 6, 2025

Thanks for your prompt reply @yaswanth19 @zucchini-nlp !
By the way, could you please explain the potential impact of this issue?

@yaswanth19
Contributor Author

yaswanth19 commented Jun 6, 2025

Hey @Dswn888, I ran the code with the specified transformers version but couldn't replicate the issue. Even on main this doesn't exist 😢. Can you run transformers-cli env and share the complete output? Also LMK if this issue persists on main in your setup. I am running this example.

@yaswanth19 yaswanth19 deleted the add-janus-model branch June 13, 2025 10:46
@Dswn888

Dswn888 commented Jul 6, 2025

Hi @yaswanth19 @zucchini-nlp @geetu040 @hsilva664 @ArthurZucker , thanks for your excellent work!

I recently tried to use Janus for image generation.

In addition to providing text prompts, I also provided other images as part of the prompt. However, in the transformers implementation of Janus, I found the following code in the definition of JanusProcessor, which seems to mean that when generating images, the processor does not process the image elements in the input. So it seems that in image generation mode, the model only processes unimodal text prompts?

# Process images if pixel values are provided.
if images is not None and generation_mode != "image":
    data["pixel_values"] = self.image_processor(images=images, **output_kwargs["images_kwargs"])[
        "pixel_values"
    ]

Therefore, I would like to confirm whether the image generation part of Janus supports multimodal input.

@Dswn888

Dswn888 commented Jul 6, 2025

Hey @Dswn888, I ran the code with the specified transformers version but couldn't replicate the issue. Even on main this doesn't exist 😢. Can you run transformers-cli env and share the complete output? Also LMK if this issue persists on main in your setup. I am running this example.

Hi! Appreciate you trying to replicate the issue.

Here are my environment details from transformers-cli env:

  • transformers version: 4.52.4
  • Platform: Linux-4.19.90-2107.6.0.0192.8.oe1.bclinux.x86_64-x86_64-with-glibc2.35
  • Python version: 3.11.11
  • Huggingface_hub version: 0.32.3
  • Safetensors version: 0.5.3
  • Accelerate version: 1.7.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (GPU?): 2.7.0+cu126 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A800 80GB PCIe

@yaswanth19
Contributor Author

I would like to confirm whether the image generation part of Janus supports multimodal input.

Hey @Dswn888, yep, for image generation it's simple T2I and doesn't support multi-modal inputs or outputs. And regarding the issue: are you still facing it? 🤔 Try later versions or main, as I am unable to build the specified env in my setup due to CUDA issues 😅

@Dswn888

Dswn888 commented Jul 6, 2025

Hey, thanks for the quick reply and confirming the T2I aspect!

I was wondering a couple of things building on that:

  1. Did the original DeepSeek official Janus model support multimodal inputs for image generation?
  2. If I were to adapt the code to handle multimodal input, would the current model weights be able to understand the image-based information, or would SFT be necessary for that?

Also, about the issue we discussed earlier, it's still popping up on my end. I'll definitely try pulling a later version or from main to see if that fixes it.

@yaswanth19
Contributor Author

yaswanth19 commented Jul 6, 2025

Did the original DeepSeek official Janus model support multimodal inputs for image generation.

AFAIK, no. If it had been supported back then, I would have added that feature, so I guess it was not supported.

If I were to adapt the code to handle multimodal input, would the current model weights be able to understand

I had thought about it and tried something similar to image editing, where I pass the image and supporting text in the hope of getting an edited image, but I was getting OOM errors because the input would be dense (576 image tokens + n text tokens), and on top of that we have to generate 576 tokens for the image. If you have enough RAM for the 1B variant then it's worth a try.

To make what I said work, you might need to do some hacks (a rough sketch follows below):

  • Process the text + image in the same manner as we process them for text generation (I would prefer not to use a chat template here).
  • Next, generate the embeddings for those inputs; if I recall correctly, you have to use get_image_features on your end to adjust the image embeddings.
  • Next, you can pass the input embeds, which now contain the text and image info, to the model (comment out the image embeddings adjustment part) to generate the image tokens. Some hacks here and there and it should work.
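A very rough, untested sketch of those three steps (the generation_mode kwarg, the get_image_features helper and the image-token splicing below reflect my understanding of the current transformers Janus API, so treat all of them as assumptions):

import torch
from transformers import JanusForConditionalGeneration, JanusProcessor

model_id = "deepseek-community/Janus-Pro-1B"
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# 1) Process text + image exactly as in the text-generation path, so pixel_values are kept.
messages = [{"role": "user", "content": [
    {"type": "image", "url": "https://example.com/input.png"},  # placeholder image URL
    {"type": "text", "text": "Generate a cartoon version of this image."},
]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

# 2) Build input embeddings and splice the image features in at the image-token positions.
inputs_embeds = model.get_input_embeddings()(inputs["input_ids"])
image_embeds = model.get_image_features(inputs["pixel_values"])
image_token_id = model.config.image_token_id  # attribute name is an assumption
image_mask = (inputs["input_ids"] == image_token_id).unsqueeze(-1).expand_as(inputs_embeds)
inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds.to(inputs_embeds.dtype))

# 3) Run the image-generation decoding path on the combined embeddings.
image_tokens = model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs["attention_mask"],
    generation_mode="image",
    do_sample=True,
)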

@Dswn888

Dswn888 commented Jul 6, 2025

Thanks for the detailed explanation and the insights! I really appreciate you taking the time to explain the challenges with multimodal inputs and the suggested hacks.

I have two follow-up questions if you don't mind:

  1. Regarding processing text and images, you mentioned preferring not to use a chat template. The current chat template looks like this:

{
    "role": "user",
    "content": [
        {"type": "text", "text": "What’s the difference between"},
        {"type": "image", "url": image_urls[0]},
        {"type": "text", "text": " and "},
        {"type": "image", "url": image_urls[1]},
    ]
}

Could you elaborate a bit on why you'd advise against using this type of chat template in this specific context?

  2. Would you be open to sharing the modified code or any snippets from your earlier attempts at image editing where you were trying to pass images and supporting text? It would be incredibly helpful for my understanding and further experimentation.

Thanks again for all your help!

@yaswanth19
Contributor Author

Could you elaborate a bit on why you'd advise against using this type of chat template in this specific context?

Sorry, my bad; we can utilize the chat_template. I was of the opinion that we only use the chat_template for text generation, but we also use it for image gen, so no issues, and in fact it's better since the model is trained that way.

Regarding the scripts, I unfortunately don't have those code snippets. Just to reiterate conceptually what I was saying above: we can pass multi-modal inputs for text generation, so we want the processing from the text-gen pipeline (to get the multimodal inputs) and the generate step from the image pipeline (for the image-gen pipeline this is just like a long prompt, and hopefully it identifies the multimodal info). Whatever hacks I suggested above were just to fit these two halves of the pipeline together.

@Dswn888

Dswn888 commented Jul 7, 2025

Got it, thanks so much for the detailed explanations!
