Paper: BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (arXiv:2201.12086)
This model is a version of bert-base-uncased fine-tuned on the Fakeddit fake news detection dataset.
It combines post text with image captions generated by Salesforce/blip-image-captioning-base, rather than using raw image features.
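As a minimal sketch of the caption-generation step (assuming the standard `transformers` BLIP API; the image path is a placeholder, not part of the actual training pipeline):

```python
# Minimal caption-generation sketch: produce a BLIP caption for a post image.
# Assumes the transformers and Pillow libraries; the image path is a placeholder.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
caption_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("post_image.jpg").convert("RGB")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")
output_ids = caption_model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```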
Each input is formatted for BERT as `[CLS] post text, BLIP image caption [SEP]`.

| Approach | Accuracy | Macro F1-Score |
|---|---|---|
| Text + Caption | 0.87 | 0.83 |
β‘οΈ Using captions instead of raw image features leads to state-of-the-art performance on Fakeddit, with simpler input and no vision backbone needed during inference.
This model builds on the following works:
- Base model: google-bert/bert-base-uncased
- Caption generator: Salesforce/blip-image-captioning-base