Add Janus model #36053
Conversation
|
@zucchini-nlp I’ve started working on Janus and would love to get some guidance. Right now, I’ve just created a skeleton and implemented the ImageProcessing class. My first major hurdle is the config: the Janus config on the Hub is quite composite and non-standard, and the standard values aren’t where a Hugging Face config expects them. From a testing perspective, how should I approach writing the config class? Loading this config directly with AutoConfig.from_pretrained() doesn’t work. Shall I write an ad hoc script to convert this config into a standard Hugging Face config (similar to convert_weights_to_hf.py, but for the config)? Config: https://huggingface.co/deepseek-ai/Janus-Pro-1B/blob/main/config.json |
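A minimal sketch of the "ad hoc conversion script" idea, reading the original config.json and mapping it onto a standard HF config; the key names used here ("language_config", etc.) are assumptions and have to be checked against the file linked above:

```python
import json

from transformers import LlamaConfig

# Read the original, non-standard config.json and map it onto a standard HF config.
# The nested key names below are placeholders, not the actual Janus field names.
with open("config.json") as f:
    original = json.load(f)

text_cfg = original["language_config"]  # assumed key

language_config = LlamaConfig(
    hidden_size=text_cfg["hidden_size"],
    num_hidden_layers=text_cfg["num_hidden_layers"],
    num_attention_heads=text_cfg["num_attention_heads"],
    vocab_size=text_cfg["vocab_size"],
)
language_config.save_pretrained("converted_janus")
```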
|
@yaswanth19 super cool to see a draft PR! Yeah, that reminds me of how Molmo is modeled, also with hardcoded values for configuration. I am not sure how you usually approach testing; if you will test by matching against the actual weights, then yes, converting the config will be helpful. As a first step, I'd suggest making the model code work and then converting the weights/config. The vision backbone should be very similar to existing CLIP models, and for the VQ part feel free to look at the Emu3 model here. When the model is converted, we can try to match logits and see in which modules the logits start to diverge. LMK if you are stuck in any place and need help :) Btw, I would be very interested to see if Janus can handle interleaved generation of image + text in one go 👀 If that's possible, it would be super super nice |
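One hedged sketch of the "find where logits start to diverge" approach: record every submodule's output with forward hooks on both the ported model and the original one, run identical inputs through each, and compare activations module by module (model variables below are placeholders):

```python
def capture_module_outputs(model):
    """Register forward hooks that record each submodule's output so two
    implementations can be compared layer by layer on the same input."""
    captured = {}

    def make_hook(name):
        def hook(module, args, output):
            captured[name] = output[0] if isinstance(output, tuple) else output
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))
    return captured

# Hypothetical usage:
#   hf_acts = capture_module_outputs(hf_model)
#   ref_acts = capture_module_outputs(original_model)
#   run one forward pass on identical inputs with each model, then:
#   for name, act in hf_acts.items():
#       if name in ref_acts:
#           print(name, (act - ref_acts[name]).abs().max().item())
```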
|
Hi @yaswanth19, this is really nice. I am also really interested in the model, do you think I can collaborate with you on this one? I can help with the implementation. |
|
@geetu040 Thanks for your interest and for offering to collaborate! This is quite an ambitious PR, and I’d like to take on the challenge of tackling it myself. That said, I’ll definitely reach out if I get stuck or am unable to continue working on it :) |
Sure, understood. I wish you the best of luck.
|
Hello @yaswanth19. I understand that you want to do it by yourself. Yet, I'd like to insist on collaboration a bit further. I've spent some time this week reproducing their virtual environment and running the generative and understanding modes of Janus for the 1B and 7B models on rented GPUs, as well as reading the original Janus paper, the SigLIP paper (i.e. the "understanding" image encoder) and the LlamaGen paper (from which they take the generative image tokenizer). I was somewhat busy this particular week, so I did not have a lot of time for it; still, I was about to start reading their code before offering to collaborate on integrating it. Since I've already spent some time on this, are you OK with trying to work together? |
|
@hsilva664 Cool! Since you have already made some progress, I guess we are on the same page. Let's start with implementing the modular version. Janus has three components: SigLIP, the VQ model, and Llama. If I'm not wrong, SigLIP and Llama are straightforward, so I can pick those two and create the structure and processor around them; the middle part, the VQ model, you can start working on (see the placeholder sketch below). The reason I'm insisting on splitting it this way is that the VQ model is a bit ambiguous to me, as I am not very familiar with it (best case scenario we can inherit all the components for this model from existing ones :) ). Hence I'm trying to code up the familiar part first and then iterate to match logits. WDYT? |
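As a rough picture of that three-component split at the config level, here is a placeholder sketch; the class and field names are illustrative only, not the final transformers API:

```python
from dataclasses import dataclass, field


@dataclass
class JanusDraftConfig:
    """Placeholder composite config mirroring the SigLIP / VQ / LLM split discussed above."""
    vision_config: dict = field(default_factory=dict)  # SigLIP-style encoder for the understanding path
    vq_config: dict = field(default_factory=dict)      # LlamaGen-style VQ tokenizer for the generation path
    text_config: dict = field(default_factory=dict)    # DeepSeek-LLM / Llama-style decoder
```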
|
That works. Also, in the Janus paper (the original one, not Janus Pro), they state that they use DeepSeek-LLM (1.3B), not Llama. I think in Janus Pro they keep it that way but scale it up to 7B. It is the VQ paper (LlamaGen) that uses Llama + the autoregressive decoder. |
|
@zucchini-nlp I've added a preliminary version of the VQ part to a clone of @yaswanth19's branch and am trying to set up a simple test script that will be used for checking weight/activation consistency with the original Janus model (I still have to import their weight values). From what I can gather, the common practice in HF is to have a Jinja template that converts a list of message dicts into the processed string that is then tokenized. This template is usually stored on the Hub, in JSON files like this one. In their codebase, this conversion is done by a Python module instead (see here; they use the template called "deepseek"). Just to confirm, should I change that format to a Jinja template, or would it be "normal" to keep it like what they are doing (calling a function manually inside the processor to convert the list)? Was something similar done somewhere, so I can base mine on it (it is OK if not)? |
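For reference, a minimal sketch of the Jinja-template route; the template string below is purely illustrative (role names, separators, system-prompt handling and newlines in the real Janus template differ), and the repo id is assumed to resolve to the original tokenizer:

```python
from transformers import AutoTokenizer

# Illustrative template only; the actual Janus template lives on the Hub.
chat_template = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n\n"
    "{% endfor %}"
    "Assistant:"
)

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/Janus-Pro-1B")  # assumed repo id
messages = [{"role": "User", "content": "Describe the image."}]
prompt = tokenizer.apply_chat_template(messages, chat_template=chat_template, tokenize=False)
print(prompt)
```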
|
@hsilva664 We have to create the Jinja template. I have created one just for model testing purposes, and it still needs some corrections; my test conversation starts with Conversation = [{"role": "User", ...]. First of all, this template doesn't account for the newline chars before and after the conversation, in the sense that, to match the input_ids with the original implementation, we currently have to add them in a hacky way. |
|
About the random initialization of weights (e.g. `_init_weights`): is that only relevant when training from scratch, given that we always load pretrained weights here? |
|
In the original implementation for image generation, they generate a token and then apply CFG, followed by a few extra layers. Also, I don't think it can perform interleaved generation, as mentioned by one of the authors [LINK]. For image generation they explicitly add a <begin_of_image> tag to let the model know to generate image-related logits. So I'm not sure how to add this image generation logic (in generate and prompt creation) without hampering the text generation functionality. One hack I can think of is to use a flag where the user explicitly asks to generate an image and we process accordingly. @zucchini-nlp I would like your thoughts on this issue; I've attached the image generation inference file below. |
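For context, a minimal sketch of the classifier-free-guidance step being discussed, assuming the usual conditional/unconditional logit mixing (where exactly Janus applies its extra layers may differ):

```python
import torch


def cfg_next_token_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor, guidance_scale: float) -> torch.Tensor:
    """Standard classifier-free guidance mixing of per-step next-token logits."""
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# In practice the batch is duplicated: one half is conditioned on the real prompt and the
# other half on an "unconditional" prompt, and the two halves are recombined at every
# generation step before sampling the next image token.
```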
Yep, only for training from scratch and for some internal tests when a dummy model is initialized. If Janus uses a different initialization, you can add it in `_init_weights`. @yaswanth19 really interesting point, didn't know Janus would add more layers after CFG 🙃 We have CFG in transformers, so it won't be a big problem. For the rest of the layers we can override `generate()`:

```python
def generate(self, input_ids, **kwargs):
    # If generating text, just call super()
    if kwargs.get("generation_mode", None) == "text":
        return super().generate(input_ids, **kwargs)

    # Otherwise do custom logic, and rely only on greedy/sampling methods.
    # Beam search etc. are not supported.
    # Do some preprocessing, same as in `generation/utils.py`. Feel free to look at how MusicGen does it.
    # Would have been nice to have self.prepare_for_generate(); it was planned but low priority.
    # I think since we have Janus planned, we can raise the priority a bit.
    generation_config = ...
    for i in range(generation_config.max_new_tokens):
        next_token_outputs = super()._sample(input_ids, max_new_tokens=1, **kwargs)
        input_ids = self.layers_after_CFG(next_token_outputs)
    return next_token_outputs
```

cc @gante we're having the first image generation model that doesn't work with the standard `generate()` |
|
@zucchini-nlp WDYT on the image quality? Just wanted to confirm, as you were a bit hesitant about the ANOLE model. These images are generated using the original implementation itself (in the worst case we can match it). The upper row of images was generated using Prompt 1: "Generate an image of a snowman."
|
|
@zucchini-nlp The PR is good for merge; the failing tests are unrelated to this PR 🚀 🤗 Here is the deepseek-community org, and both the 1B and 7B checkpoints are uploaded there. |
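For anyone following along, loading the converted checkpoints from that org looks roughly like this; the class names are the ones added in this PR and should be double-checked against the merged docs:

```python
from transformers import JanusForConditionalGeneration, JanusProcessor

# Repo id from the deepseek-community org mentioned above.
model_id = "deepseek-community/Janus-Pro-1B"
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(model_id)
```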
|
Super super cool, thanks for adding an org! Let's resolve merge conflicts and I'll merge |
|
@zucchini-nlp Resolved the merge conflict, let's merge it before another one appears 😅 |
|
Let's go 🚀 |
* Iterative generation using input embeds * Add Janus model * discard changes * Janus imports * Refactor config and processor * Added Vision tower of Janus * Import Janus Image processor * Vision tower fixes * Refactor code * Added VQ Model * Complete model integration * temp conversion script * processor refactor * Adding files to facilitate pulling * Fixes after debugging * Skip test for these models * Add Janus Model * discard changes * Janus imports * Refactor config and processor * Added Vision tower of Janus * Import Janus Image processor * Vision tower fixes * Refactor code * Added VQ Model * Complete model integration * temp conversion script * processor refactor * Adding files to facilitate pulling * Fixes after debugging * Refactor to Text config * ✨ Added generate function * Saving intermediate convert file. Still need to read configs from the hub and convert them to our format. * Adding version that reads from the JSON files. Still have to tweak some parameters manually. * relative imports * Initial tests * Refactor image processor * Seemingly working version of the conversion script, will need to test further. * Adding command message * Fixing conflicting JanusTextConfig class * Incorporating some of the discussed changes. * Small fix to create dir. * Removing system from JINJA template * Adding draft processor tests * style fixes * Minor fixes and enhancement * added generation config * Initial tests * Small modifications, tests are now passing. * Small changes I noticed while reading code. * more fixes * Added JanusModel class * Small merge adaptations * Small merge adaptations * Image processing tests passing * More tests and fixes * Convert script updated and refactored * Tests and cleanup * make style * Postprocessing for image generation * generate refactor * fixes * - Passing tests that write a part of the model to cpu (e.g. test_cpu_offload) - Passing tests of dispatching SDPA - Only gradient checkpointing tests are left. * Removing temporary code * Changes * Writing change to modular * Added JanusVisionModel. SDPA dispatch tests pass more robustly. Gradient checkpoint tests are next * Gradient checkpoint tests passing * Removing debug code * Major generate refactor 😮💨 * Temp changes for testing * Green quality CI * 2 out of 4 integration tests passing * breadcrumbs * Usage Examples * Regenerate modeling after merge * dirty code * JanusIntegrationTest are passing * breadcrumbs * happy CI * fixes * Changing template * nits * Text generation logits matching original codebase at 100% precision * Remove ./tmp from git tracking * Remove ./tmp from git tracking * Checkpointing changes after reviewing * Fixing code in docstrings * CHanging comments and small bug in convert file * Fixing bug in image_token_id for 7B version * Removing line that was added by both of us * Pushing changes after discussion. Only one left is to change the key mapping for convert file. * Updating module file * New convert file using dict. Tested that it is equivalent to the old one by: - comparing keys in a script - comparing checksums of the output files between version generated with the current convert script and those generated with the old script. This is a more reliable test. * revert changes * mistake * consistency change for CI * make style * doc fixes * more fixes * experimenting with masking out pad token * checkpoint * Batched generation with multi-images working for 1B models. Will test 7B next. * Device fix. 
* Writing changes to modular, previous ones were written to modeling just for quick testing. * Using passed processor attention mask (only in modeling for now) * Matching performance done in the non-standard way * Working version of batched generation. Will change how some args are passed to make it more similar to language case * More compliant version of the code * Removed duplicated `_prepare_4d_causal_attention_mask_with_cache_position` * Updating modular file, making masked filling with paddings more efficient * Slightly more efficient version * Modifying JanusVisionModel to be a wrapper * Fixing test to comply with new names * Modular overhaul * More refactoring * - Changing JanusVisionModel back - Changing forward pass - Adding boi token to the comparison * - Removing whole context model_ids - Using inherited implementation of prepare_inputs_for_generation * Moving the way boi token is passed to the model * Fixing sdpa test * Minor changes * testing changes * Minor fix * - Adding postprocessing test - checking values of generated image on integration test * changes * Removing pooled attention vision module, fixing convert script as a consequence * More changes * Fixes * Draft after merge * Bug fixes * More bug fix * Fixing docs * Nits * Refactor return dict * Moving image post processing test to main processor post process * Passing guidance_scale as kwarg * make style * 🔥 refactor * make style * Update and green CI * Nits and tests update * up * Added MID block * fix * Dead code * update testcase * update * model_id change * init_weight changes --------- Co-authored-by: hsilva664 <[email protected]>
|
Hi @yaswanth19 @zucchini-nlp @hsilva664, thanks for your excellent work! While using Janus for text-to-image generation, I ran into the following issue. Any insights or guidance on this would be greatly appreciated. |
To provide more context, I used the exact text to image generation code example from https://huggingface.co/docs/transformers/main/en/model_doc/janus, and this issue occurred while calling |
My transformers version is 4.52.4. |
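For reference, the documented text-to-image flow being used is roughly the following paraphrase of the linked doc page; method and kwarg names such as `generation_mode`, `decode_image_tokens` and `postprocess` should be verified against the exact snippet for the installed transformers version:

```python
import torch
from transformers import JanusForConditionalGeneration, JanusProcessor

model_id = "deepseek-community/Janus-Pro-1B"
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": [{"type": "text", "text": "Generate an image of a snowman."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, generation_mode="image", return_tensors="pt").to(model.device)

# Sampled image tokens are decoded back to pixels by the VQ decoder and post-processed.
outputs = model.generate(**inputs, generation_mode="image", do_sample=True)
decoded = model.decode_image_tokens(outputs)
images = processor.postprocess(list(decoded.float()), return_tensors="PIL.Image.Image")
images["pixel_values"][0].save("snowman.png")
```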
|
I think I know where the issue stems from. @yaswanth19 do you want to submit a PR, or I can do it as well? We can just add |
|
Thanks @Dswn888 for reporting the issue. I don't have an idea right now why this resurfaced, but I will look into it. Yep @zucchini-nlp, I'll raise a PR with the new masking API to be up to date 🤔 |
|
Thanks for your prompt reply @yaswanth19 @zucchini-nlp ! |
|
Hi @yaswanth19 @zucchini-nlp @geetu040 @hsilva664 @ArthurZucker, thanks for your excellent work! I recently tried to use Janus for image generation. In addition to providing text prompts, I also provided other images as part of the prompt. However, in the Transformers implementation of Janus, I found the following code in the definition of JanusProcessor, which seems to mean that when generating images the processor does not process the image elements in the input. So it seems that in image generation mode the model only processes unimodal text prompts?
I would therefore like to confirm whether the image generation part of Janus supports multimodal input. |
Hi! Appreciate you trying to replicate the issue. Here are my environment details:
|
Hey @Dswn888, yep, image generation is simple T2I and doesn't support multimodal inputs or outputs. And regarding the issue: are you still facing it? 🤔 Try later versions or main, as I am unable to build the specified env in my setup due to CUDA issues 😅 |
Hey, thanks for the quick reply and confirming the T2I aspect! I was wondering a couple of things building on that:
Also, about the issue we discussed earlier, it's still popping up on my end. I'll definitely try pulling a later version or installing from `main`. |
AFAIK, no; if it was supported back then, I would have added that feature, so I guess it was not supported.
I had thought about it and tried something similar to image editing, where I pass an image and supporting text in the hope of getting an edited image back, but I was getting an OOM error because the input would be dense (576 image tokens + n text tokens), and on top of that we have to generate 576 tokens for the image. If you have enough RAM for the 1B variant, then it's worth a try. To make what I said work, you might need a few hacks:
|
Thanks for the detailed explanation and the insights! I really appreciate you taking the time to explain the challenges with multimodal inputs and the suggested hacks. I have two follow-up questions if you don't mind:
Could you elaborate a bit on why you'd advise against using this type of chat template in this specific context?
Thanks again for all your help! |
Sorry, my bad, we can utilize the chat_template. I was of the opinion that we only use the chat_template for text generation, but we also use it for image generation, so no issues; in fact it's better, as the model is trained that way. Regarding the scripts, I unfortunately don't have those code snippets. Just to reiterate conceptually what I was saying above: we can pass multimodal inputs for text generation, so we want the processing of the text-gen pipeline (to get multimodal inputs) and the generate of the image-gen pipeline (to the image-gen pipeline this is like a long text, and hopefully it picks up the multimodal info). |
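To make that idea concrete, here is a very rough, untested sketch. It assumes the processor accepts images together with `generation_mode="text"`, that `generate` can still be switched to the image branch, and that the manual `<begin_of_image>` hack mentioned earlier may still be needed; none of this is confirmed API behavior:

```python
import torch
from PIL import Image
from transformers import JanusForConditionalGeneration, JanusProcessor

model_id = "deepseek-community/Janus-Pro-1B"
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

reference = Image.open("reference.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Generate a similar scene, but covered in snow."},
]}]

# Process as in the text-generation path so the reference image is kept
# (~576 image tokens + text); the <begin_of_image> tag may need to be appended manually.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=reference, generation_mode="text", return_tensors="pt").to(model.device)

# Then run the image-generation branch; another 576 tokens are generated, hence the OOM risk.
outputs = model.generate(**inputs, generation_mode="image", do_sample=True)
```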
Got it, thanks so much for the detailed explanations! |
What does this PR do?
Fixes #35928
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.