LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation

Abstract

Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multi-modal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~ 35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.

Model Architecture

Figure 2. Overview of the LightBagel architecture. Text and ViT tokens (understanding pathway) and VAE tokens (generation pathway) are processed by pre-trained VLM and DiT blocks, respectively. At each layer, a zero-initialized multimodal self-attention module enables cross-modal interactions without altering the original model architectures.

Quantitative Results

Table 1. Comparison of different models across understanding, generation, editing, and in-context Generation tasks. ^† refers to the methods using LLM rewriter. For UMMs, ★☆☆ refers to only model checkpoints and evaluation code being released; ★★☆ refers to only model checkpoints and training/evaluation code being released; ★★★ refers to the full suite of all the {model, data, code} being released.

Model	# Params	Openness	Understanding			Image Generation		Image Editing
Model	# Params	Openness	MMB	MMMU	MM-Vet	GenEval	DPG	ImgEdit	GEdit-EN
LLaVA-1.5	—	—	36.4	67.8	36.3	—	—	—	—
LLaVA-NeXT	—	—	79.3	51.1	57.4	—	—	—	—
SDXL	—	—	—	—	—	0.55	74.7	—	—
SD3-medium	—	—	—	—	—	0.62	84.1	—	—
FLUX.1-dev	—	—	—	—	—	0.66	84.0	—	—
Instruct-P2P	—	—	—	—	—	—	—	1.88	3.68
MagicBrush	—	—	—	—	—	—	—	1.90	1.86
AnyEdit	—	—	—	—	—	—	—	2.45	3.21
Step1X-Edit	—	—	—	—	—	—	—	3.06	6.70
IC-Edit	—	—	—	—	—	—	—	3.05	4.84
Unified models
Janus-Pro	—	★☆☆	75.5	36.3	39.8	0.80	84.19	—	—
Emu3	—	★☆☆	58.5	31.6	37.2	0.66†	80.60	—	—
UniPic	1.5B	★☆☆	—	—	—	0.86	85.50	3.49	5.83
UniPic 2.0	7B + 2B	★☆☆	83.5	58.6	67.1	0.90†	83.79	4.06	7.10
Ovis-U1	2.4B + 1.2B	★☆☆	77.8	51.1	66.7	0.89	83.72	4.00	6.42
MetaQuery-XL	7B + 1.6B	★★☆	83.5	58.6	66.6	0.80†	82.05	—	—
Show-o2	7B	★★☆	79.3	48.9	—	0.76	86.14	—	—
OmniGen	3.8B	★★☆	—	—	—	0.68	81.16	2.96	5.06
OmniGen2	3B + 4B	★★☆	79.1	53.1	61.8	0.86†	83.57	3.44	6.42
BAGEL	7B + 7B	★★☆	85.0	55.3	67.2	0.88†	85.07	3.20	6.52
BLIP3-o 4B	3B + 1.4B	★★★	78.6	46.6	60.1	0.81†	79.36	—	—
BLIP3-o 8B	7B + 1.4B	★★★	83.5	58.6	66.6	0.84†	81.60	—	—
UniWorld-V1	7B + 12B	★★★	83.5	58.6	67.1	0.84†	81.38	3.26	4.85
LightBagel	7B + 5B + 3B	★★★	83.5	58.6	67.1	0.91†	82.16	3.77	6.06

Qualitative Results

Figure 3. Qualitative text-to-image results from LightBagel, showcasing high-quality generations with strong fidelity to text prompts and consistent rendering across diverse aspect ratios.

Figure 4. Qualitative image editing results generated by LightBagel. The model produces high-quality outputs across a diverse range of editing tasks.

BibTeX

TBD

LightBagel : A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation

BibTeX