MMaDA: Open-Source Multimodal Large Diffusion Models

Multimodal Large Diffusion Language Models
MMaDA, released on Hugging Face, is a new class of multimodal diffusion foundation models designed to achieve strong performance across several areas: textual reasoning, multimodal understanding, and text-to-image generation.
Three key innovations distinguish this method:
Unified diffusion architecture: MMaDA adopts a modality-agnostic design with a shared probabilistic formulation. By eliminating modality-specific components, it lets data of different types be integrated and processed seamlessly.
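As a rough illustration of what such a shared formulation can look like, the sketch below applies a single masked-token diffusion loss to token sequences from any modality. The identifiers (`model`, `MASK_ID`) and the loss weighting are assumptions for illustration only, not MMaDA's actual code.

```python
# A minimal sketch of a modality-agnostic masked-diffusion training step,
# assuming text and images have already been tokenized into discrete IDs
# drawn from one shared vocabulary. `model` and `MASK_ID` are illustrative
# placeholders, not MMaDA's actual identifiers or loss weighting.
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed ID of the [MASK] token

def masked_diffusion_loss(model, tokens):
    """tokens: (batch, seq_len) discrete IDs from any modality."""
    batch, seq_len = tokens.shape
    # Sample a corruption ratio t ~ U(0, 1) per sequence, as in
    # absorbing-state discrete diffusion.
    t = torch.rand(batch, 1, device=tokens.device)
    mask = torch.rand(batch, seq_len, device=tokens.device) < t
    noisy = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    # One shared network predicts the original tokens at every position,
    # regardless of whether they encode text or image content.
    logits = model(noisy)  # (batch, seq_len, vocab)
    # The loss is taken only on the masked positions.
    return F.cross_entropy(logits[mask], tokens[mask])
```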
Mixed long chain-of-thought (CoT) fine-tuning: this strategy curates a unified CoT format across modalities. By aligning reasoning traces between the textual and visual domains, it provides cold-start training for the final reinforcement learning (RL) stage and improves the model's ability to tackle difficult problems from the outset.
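A unified CoT record might look something like the hypothetical example below; the field names and the `<think>`/`<answer>` delimiters are illustrative assumptions, not the exact template from the paper.

```python
# Hypothetical example of a unified long-CoT training record. The field
# names and the <think>/<answer> delimiters are illustrative assumptions,
# not the exact format used in the MMaDA paper.
mixed_cot_record = {
    "modality": "multimodal",  # same structure for "text" and "t2i" samples
    "prompt": "<image> How many apples are on the table?",
    "response": (
        "<think> The image shows three red apples and one green apple on "
        "the table, so there are four apples in total. </think> "
        "<answer> 4 </answer>"
    ),
}
```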
Unified policy-gradient-based RL algorithm (UniGRPO): an RL method designed specifically for diffusion foundation models. It uses diversified reward modeling to unify post-training performance gains across reasoning and generation tasks.
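For intuition, the sketch below shows the group-relative advantage computation common to GRPO-style methods, combined with several reward signals. It illustrates the general recipe UniGRPO builds on; the reward names and weights are assumptions, and the diffusion-specific parts of UniGRPO's objective are omitted.

```python
# Sketch of the group-relative advantage computation behind GRPO-style
# algorithms, combined with several reward signals. Reward names and weights
# are assumptions; this is not UniGRPO's exact objective.
import torch

def combined_reward(correctness, format_ok, clip_score,
                    w_correct=1.0, w_format=0.5, w_clip=1.0):
    """Diversified reward: task correctness for reasoning, a format reward,
    and a CLIP-style image-text score for generation (weights assumed)."""
    return w_correct * correctness + w_format * format_ok + w_clip * clip_score

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (group_size,) scalar rewards for responses sampled from the
    same prompt; each advantage is the reward normalized within the group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Example: four sampled responses to one prompt.
rewards = torch.tensor([combined_reward(1.0, 1.0, 0.8),
                        combined_reward(0.0, 1.0, 0.6),
                        combined_reward(1.0, 0.0, 0.7),
                        combined_reward(0.0, 0.0, 0.5)])
advantages = group_relative_advantages(rewards)
# Each advantage then weights the policy-gradient update for its response.
```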
Experimental results show that MMaDA-8B is a unified multimodal foundation model with strong generalization, outperforming several powerful models: it surpasses LLaMA-3-7B and Qwen2-7B in textual reasoning, SDXL and Janus in text-to-image generation, and Show-o and SEED-X in multimodal understanding. These results demonstrate how effectively MMaDA bridges pre-training and post-training in unified diffusion architectures.
MMaDA provides checkpoints from several training stages:
MMaDA-8B-Base (available now): produced after pretraining and instruction tuning, it handles basic text generation, image generation, image captioning, and simple reasoning. MMaDA-8B-Base has been open-sourced on Hugging Face and has 8.08B parameters. Its training covers pre-training on ImageNet (Stage 1.1), pre-training on an image-text dataset (Stage 1.2), and text instruction following (Stage 1.3).
MMaDA-8B-MixCoT (coming soon): this version is trained with mixed long chain-of-thought (CoT) fine-tuning and is expected to handle complex textual, multimodal, and image-generation reasoning. Training adds Mix-CoT fine-tuning with text reasoning (Stage 2.1), followed by multimodal reasoning (Stage 2.2). Release was expected about two weeks after the 2025-05-22 update.
MMaDA-8B-Max (coming soon): trained with UniGRPO reinforcement learning, it is expected to excel at complex reasoning and high-quality visual generation. Stage 3, UniGRPO RL training, is slated for release after the code migrates to OpenRLHF. It was expected about one month after the 2025-05-22 update.
For inference, MMaDA supports semi-autoregressive sampling for text generation and non-autoregressive diffusion denoising for multimodal generation. Inference scripts are provided for text generation, multimodal understanding, and text-to-image generation, and the training pipeline covers pre-training and Mix-CoT fine-tuning.
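To give a feel for semi-autoregressive sampling, here is a conceptual sketch that extends the sequence block by block and fills masked tokens over a few denoising steps within each block, most-confident predictions first. Function and variable names (`model`, `mask_id`, the step schedule) are illustrative assumptions and may differ from the repository's actual inference scripts.

```python
# Conceptual sketch of semi-autoregressive text generation: the sequence is
# extended block by block, and within each block masked positions are filled
# over a few denoising steps. `model` is assumed to return logits of shape
# (1, seq_len, vocab); names and schedule are illustrative only.
import torch

@torch.no_grad()
def semi_autoregressive_generate(model, prompt_ids, mask_id,
                                 block_len=32, num_blocks=4, steps=8):
    seq = prompt_ids.clone()  # (1, prompt_len)
    for _ in range(num_blocks):
        # Append a fully masked block to be denoised non-autoregressively.
        block = torch.full((1, block_len), mask_id, device=seq.device)
        seq = torch.cat([seq, block], dim=1)
        start = seq.shape[1] - block_len
        for _ in range(steps):
            conf, pred = model(seq).softmax(-1).max(-1)  # (1, seq_len)
            block_mask = seq[:, start:] == mask_id       # still-masked slots
            if not block_mask.any():
                break
            # Reveal roughly block_len / steps of the most confident
            # still-masked positions in the current block.
            k = min(int(block_mask.sum()), max(1, block_len // steps))
            scores = torch.where(block_mask, conf[:, start:],
                                 torch.full_like(conf[:, start:], float("-inf")))
            top = scores.topk(k, dim=-1).indices[0] + start
            seq[0, top] = pred[0, top]
    return seq
```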
The research paper introducing MMaDA was submitted to arXiv on May 21, 2025, and is also available through Hugging Face Papers. The models and training code are hosted on Hugging Face and in the Gen-Verse/MMaDA GitHub repository, and an online demo is available on Hugging Face Spaces. At the time of writing, the repository has 460 stars and 11 forks and is written mostly in Python. Authors include Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang.
#MMaDA #MMaDA8B #MMaDA8BBase #MMaDA8BMixCoT #MMaDA8BMax #multimodaldiffusionfoundationmodels #multimodallargemodelMMaDA #technology #technews #technologynews #news #govindhtech