CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions


¹UC Santa Cruz, ²University of Edinburgh

The pipeline of our proposed CLIPS. We introduce two simple yet effective designs:
1) only a subpart of the synthetic caption is used in contrastive learning, and
2) a captioner predicts the full synthetic caption based on the web-crawled caption and the image.

Abstract

Previous works show that noisy, web-crawled image-text pairs may limit vision-language pretraining like CLIP, and propose learning with synthetic captions as a promising alternative. Our work continues this effort, introducing two simple yet effective designs to better leverage richly described synthetic captions. First, we observe a strong inverse effect when learning with synthetic captions: short synthetic captions generally lead to much higher performance than full-length ones. We therefore feed only partial synthetic captions to the text encoder. Second, we incorporate an autoregressive captioner that mimics the recaptioning process: conditioned on the paired image and its web-crawled text description, the captioner learns to predict the full-length synthetic caption generated by advanced MLLMs. Experiments show that our framework significantly improves zero-shot performance on cross-modal retrieval tasks, setting new SOTA results on MSCOCO and Flickr30K. Moreover, the resulting vision encoders enhance the visual capability of LLaVA, yielding strong improvements on a range of MLLM benchmarks.
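To make the two designs concrete, the sketch below shows one plausible form of the combined training objective: a symmetric InfoNCE loss between images and partial synthetic captions, plus a teacher-forced cross-entropy loss for the captioner that predicts the full synthetic caption. The loss weighting (lambda_cap), the logit scale, and the exact formulation are illustrative assumptions, not the released implementation.

import torch
import torch.nn.functional as F

def clips_loss(image_emb, subcap_emb, caption_logits, caption_targets,
               logit_scale=100.0, lambda_cap=1.0, pad_id=0):
    # (1) Symmetric InfoNCE between image embeddings and embeddings of the
    #     *partial* synthetic captions fed to the text encoder.
    image_emb = F.normalize(image_emb, dim=-1)
    subcap_emb = F.normalize(subcap_emb, dim=-1)
    logits = logit_scale * image_emb @ subcap_emb.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_con = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
    # (2) Teacher-forced cross-entropy for the autoregressive captioner, which
    #     predicts the full-length synthetic caption token by token.
    loss_cap = F.cross_entropy(caption_logits.flatten(0, 1),
                               caption_targets.flatten(),
                               ignore_index=pad_id)
    return loss_con + lambda_cap * loss_cap

# Dummy shapes: batch 8, embedding dim 512, caption length 64, vocab 32000.
image_emb = torch.randn(8, 512)
subcap_emb = torch.randn(8, 512)
caption_logits = torch.randn(8, 64, 32000)
caption_targets = torch.randint(0, 32000, (8, 64))
print(clips_loss(image_emb, subcap_emb, caption_logits, caption_targets))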

Inverse Effect with Synthetic Captions

Visualization of four different token reduction strategies. These strategies improve the model's learning efficiency on synthetic captions to varying degrees; among them, sub-caption and block mask perform best.
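As an illustration, the two best-performing strategies could be implemented roughly as follows. The sentence-level splitting for sub-captions and the keep ratio for block masking are assumptions made for clarity, not the paper's exact recipe.

import random

def sample_sub_caption(caption, num_sentences=1):
    # Sub-caption (sketch): split the synthetic caption into sentences and
    # keep a random contiguous span of them.
    sentences = [s.strip() for s in caption.split('.') if s.strip()]
    if not sentences:
        return caption
    start = random.randrange(len(sentences))
    return '. '.join(sentences[start:start + num_sentences]) + '.'

def block_mask(tokens, keep_ratio=0.5):
    # Block mask (sketch): keep one contiguous block of tokens and drop the
    # rest, shortening the sequence seen by the text encoder.
    if not tokens:
        return tokens
    keep = max(1, int(len(tokens) * keep_ratio))
    start = random.randrange(len(tokens) - keep + 1)
    return tokens[start:start + keep]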


The inverse effect of synthetic captions. Whereas reducing the token length of original captions degrades performance, shortening the token length of synthetic captions consistently improves model performance.

Zero-Shot Cross-Modal Retrieval

Zero-shot image-text retrieval results on MSCOCO and Flickr30K. We reproduce the CLIPA and CoCa results; both are trained with a caption mixture in which original captions account for 80% and synthetic captions for 20%. Our method consistently achieves superior performance across all benchmarks and model sizes, yielding significant improvements over the baselines.
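For reference, the 80/20 mixture used for these reproduced baselines amounts to per-sample caption sampling along the lines of the sketch below (the per-sample granularity is an assumption).

import random

def sample_training_caption(original_caption, synthetic_caption, p_synthetic=0.2):
    # Mixture training for the reproduced CLIPA/CoCa baselines: draw the
    # synthetic caption 20% of the time, the web-crawled caption otherwise.
    return synthetic_caption if random.random() < p_synthetic else original_caption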

Comparison with State-of-the-art Methods

Comparison with other SOTA vision-language pre-training methods. We report top-1 ImageNet-1K classification accuracy and zero-shot recall for image and text retrieval on MSCOCO and Flickr30K. With a ViT-L backbone, our CLIPS substantially outperforms SigLIP by 4.7 points (from 70.8% to 75.5%) on MSCOCO R@1 text retrieval, and by 3.3 points (from 52.3% to 55.6%) on MSCOCO R@1 image retrieval. With increased computational resources and scaling, our best model further achieves 76.4% and 96.6% R@1 text retrieval on MSCOCO and Flickr30K, respectively, and 57.2% and 83.9% R@1 image retrieval on the same datasets, setting new state-of-the-art (SOTA) results.

CLIPS in LLaVA

Comparison of LLaVA-1.5 performance. We directly replace the original OpenAI-CLIP-Large-14 vision encoder with CLIPS-Large-14 and use LLaMA-3 as the language model. The results demonstrate that integrating CLIPS significantly enhances LLaVA's performance across multiple metrics compared to using the original OpenAI-CLIP visual encoder.

Model Zoo

We have released CLIPS-Large-14-224/336 and CLIPS-Huge-14-224, and more models will be available soon!

Models

Model	URL
CLIPS-Large-14-224 https://huggingface.co/UCSC-VLAA/ViT-L-14-CLIPS-224-Recap-DataComp-1B
CLIPS-Large-14-336 https://huggingface.co/UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B
CLIPS-Huge-14-224 https://huggingface.co/UCSC-VLAA/ViT-H-14-CLIPS-224-Recap-DataComp-1B
CLIPS-Huge-14-336 Coming Soon...
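Below is a minimal loading sketch, assuming the released checkpoints follow the open_clip hf-hub convention; please refer to the official repository for the exact loading code.

import torch
import open_clip
from PIL import Image

model_id = 'hf-hub:UCSC-VLAA/ViT-L-14-CLIPS-224-Recap-DataComp-1B'
model, preprocess = open_clip.create_model_from_pretrained(model_id)
tokenizer = open_clip.get_tokenizer(model_id)

image = preprocess(Image.open('example.jpg')).unsqueeze(0)
text = tokenizer(['a photo of a cat', 'a photo of a dog'])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # similarity of the image to each text prompt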

Acknowledgements

We would like to thank the TPU Research Cloud (TRC) program, the Google Cloud Research Credits program, and the AWS Cloud Credit for Research program for supporting our computing needs.

BibTeX


@misc{liu2024clipsenhancedclipframework,
  title={CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions},
  author={Yanqing Liu and Xianhang Li and Zeyu Wang and Bingchen Zhao and Cihang Xie},
  year={2024},
  eprint={2411.16828},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.16828},
}