Generate Stylized Art and Captions from a Single Seed
This project blends generative creativity with semantic understanding. Your goal is to build a multimodal model that outputs both:
- An image in a selected artistic style (e.g., oil painting, sculpture, engraving).
- A caption that accurately describes the generated image.
Both outputs must originate from a shared latent input — a single seed. The image and caption should reflect a unified concept, not two disconnected processes.
Dataset for reference:
Art Images Dataset — Drawing, Painting, Sculpture, Engraving
Design a system that can:
- Generate an image in a specific art style.
- Simultaneously generate a caption that reflects the content and style of the image.
- Ensure both outputs are generated from a shared latent representation or seed — not a sequential pipeline.
This is about building a joint generative system, where semantics are shared and outputs are inherently aligned.
- Choose one artistic style from the dataset (e.g., engravings, sculptures).
- Clean, resize, and normalize the dataset.
- Use a captioning model (e.g., BLIP2) to generate accurate textual descriptions for each image.
- Align each image with its corresponding caption for training.
Develop a single model or tightly integrated architecture that, given a random seed, produces:
- A stylized image.
- A grounded caption that describes that exact image.
This should not be a two-stage pipeline (e.g., image → caption). Both outputs must emerge from the same latent representation.
-
Latent Diffusion Model + Text Decoder
- Use a latent diffusion model to generate images.
- Attach a transformer decoder to generate text from the same latent.
-
GAN-based Generator + Caption Head
- Extend models like StyleGAN or VQGAN to output a text embedding.
- Train both image and caption outputs from the same noise vector.
-
Multimodal Token Stream Model
- Inspired by DALL·E or Flamingo. Decode both image and text from a single token stream using a shared transformer architecture.
-
Contrastive Latent Mapping
- Generate both outputs from the same seed and align them in shared latent space using a contrastive loss.
Bonus: Ensure that captions capture not just objects, but also artistic elements (e.g., "an oil painting of a stormy sea" vs. "a wave").
# Unified generation from a single seed
seed = get_random_seed()
image, caption = multimodal_model(seed)
return {
"image": image,
"caption": caption
}-
Working codebase for:
- Dataset preparation and caption alignment.
- Model training and inference.
-
A script or demo (e.g., Colab or Gradio) that:
- Accepts a seed.
- Generates and displays both image and caption.
-
Sample output images and captions.
-
A
README.mdincluding:- Chosen style and preprocessing methodology.
- Model architecture and reasoning.
- Training approach and trade-offs.
- Scalability plan (e.g., supporting 100,000+ generations).
Submissions will be evaluated based on the following:
- Scalability: Can your system handle large datasets or long training runs efficiently?
- Multimodal Coherence: Do the image and caption clearly correspond?
- Style Consistency: Is the selected art style reflected reliably in the outputs?
- Code Quality: Is the implementation clean, modular, and reproducible?
- Creative Architecture: Innovative design decisions are encouraged.
- Use BLIP2, OFA, or GIT for dataset captioning.
- Experiment with Stable Diffusion + LoRA for lightweight image generation.
- Use JSONL or Parquet to store structured data at scale.
- Track progress and visualizations with Weights & Biases, TensorBoard, or custom dashboards.
- Allow conditioning on both prompt and seed.
- Make captions style-aware (e.g., "baroque portrait of...").
- Implement latent space interpolation for smooth transitions.
- Package a shareable demo (Gradio or HuggingFace Space).
If you want to brainstorm approaches or share early prototypes, feel free to reach out. soumyadeep [at] dashtoon.com
- Select a visual art style.
- Build a unified model that generates both image and caption from a shared seed.
- Focus on semantic alignment, consistent style, and scalable design.
- Submit a demo, clean code, and sample results.
We look forward to seeing what you create. — Dashverse Research Team