SPHINX-V Model Card
Model type:
SPHINX-V is a multimodal large language model designed for visual prompting, equipped with a novel visual prompt encoder and a two-stage training strategy. SPHINX-V supports multiple visual prompts simultaneously across various types, significantly enhancing user flexibility and achieving a fine-grained, open-world understanding of visual prompts.
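To make "multiple visual prompts across various types" concrete, below is a minimal, runnable sketch of one way such a mixed-type prompt payload could be represented and normalized before being handed to the model. The schema, field names, and `normalize` helper are illustrative assumptions for this card, not the actual format or API used by SPHINX-V; see the Draw-and-Understand code for the real interface.

```python
# Hypothetical illustration only: this schema is an assumption, not the
# format SPHINX-V actually consumes. It shows several visual prompts of
# different types (point, box, free-form sketch) in a single query.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VisualPrompt:
    kind: str                           # "point", "box", or "sketch"
    coords: List[Tuple[float, float]]   # pixel coordinates


def normalize(prompts: List[VisualPrompt], width: int, height: int) -> list:
    """Scale pixel coordinates into [0, 1] so prompts are resolution-independent."""
    return [
        {"kind": p.kind,
         "coords": [(x / width, y / height) for x, y in p.coords]}
        for p in prompts
    ]


# Multiple prompts of mixed types attached to one question:
prompts = [
    VisualPrompt("point", [(320.0, 180.0)]),
    VisualPrompt("box", [(40.0, 60.0), (200.0, 240.0)]),  # two corners
    VisualPrompt("sketch", [(10.0, 10.0), (50.0, 80.0), (90.0, 30.0)]),
]
print(normalize(prompts, width=640, height=480))
```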
Paper or resources for more information:
Project Page: Draw-and-Understand
Paper: https://arxiv.org/abs/2403.20271
Code: https://github.com/AFeng-x/Draw-and-Understand
Dataset: MDVP-Data & MDVP-Bench
Intended use
Primary intended uses: The primary use of SPHINX-V is research on visual-prompting large multimodal models and chatbots.
Primary intended users: The model is primarily designed for use by researchers and enthusiasts specializing in fields such as computer vision, natural language processing, and interactive artificial intelligence.
License
SPHINX-V is built on Llama 2, which is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
Citations
@article{lin2024draw,
  title={Draw-and-Understand: Leveraging Visual Prompts to Enable {MLLMs} to Comprehend What You Want},
  author={Lin, Weifeng and Wei, Xinyu and An, Ruichuan and Gao, Peng and Zou, Bocheng and Luo, Yulin and Huang, Siyuan and Zhang, Shanghang and Li, Hongsheng},
  journal={arXiv preprint arXiv:2403.20271},
  year={2024}
}