FLAVA Model Documentation (facebook/flava-full)
Introduction
The FLAVA model was designed by FAIR researchers to evaluate whether a single model with a unified architecture can effectively handle multiple modalities. Pretrained on publicly available datasets totaling 70 million image-text pairs, FLAVA is notable for its ability to perform zero-shot image classification and image-text retrieval, and it can be fine-tuned for various natural language understanding (NLU) tasks. The model's performance was assessed on 32 diverse tasks, showing superior results to CLIP while remaining open-source and reproducible.
Architecture
FLAVA utilizes a ViT-B/32 transformer for both image and text encoding, supplemented by a 6-layer multimodal encoder for tasks requiring integrated vision and language processing. The model components are individually accessible from the `facebook/flava-full` checkpoint, with specific classes such as `FlavaForPreTraining` and `FlavaModel` available for different use cases.
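As a sketch of how these components fit together, the snippet below instantiates a randomly initialized FLAVA model from a default configuration (no checkpoint download required) and inspects its submodules. The attribute names `image_model`, `text_model`, and `multimodal_model` are assumptions based on the transformers implementation; verify them against your installed version.

```python
from transformers import FlavaConfig, FlavaModel

# Build a randomly initialized FLAVA model from the default configuration;
# this mirrors the facebook/flava-full architecture without downloading weights.
config = FlavaConfig()
model = FlavaModel(config)

# The encoders described above are exposed as submodules (names assumed
# from the transformers implementation).
print(type(model.image_model).__name__)       # image encoder
print(type(model.text_model).__name__)        # text encoder
print(type(model.multimodal_model).__name__)  # multimodal fusion encoder
```

This can be useful for inspecting the architecture or for smoke-testing a pipeline before committing to the full pretrained checkpoint.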
Training
FLAVA was pretrained exclusively on publicly accessible datasets, including COCO, Visual Genome, and others, totaling 70 million image-text pairs. The training aimed to create a reproducible model that performs robustly across various domains, using less data than models like CLIP and SimVLM.
Guide: Running Locally
To run FLAVA locally, follow these steps:

1. Environment Setup:
   - Install the `transformers` library from Hugging Face.
   - Install PyTorch, which the code below uses as its tensor backend.
   - Ensure the `Pillow` (PIL) and `requests` libraries are available for downloading and loading images.
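Assuming a standard Python environment, the dependencies above can be installed with pip (package names assumed to be the usual PyPI ones):

```shell
pip install torch transformers pillow requests
```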
2. Load Model and Processor:

   ```python
   from transformers import FlavaModel, FlavaProcessor

   model = FlavaModel.from_pretrained("facebook/flava-full")
   processor = FlavaProcessor.from_pretrained("facebook/flava-full")
   ```
3. Process Inputs:

   ```python
   from PIL import Image
   import requests

   url = "http://images.cocodataset.org/val2017/000000039769.jpg"
   image = Image.open(requests.get(url, stream=True).raw)

   inputs = processor(text=["a photo of a cat"], images=[image], return_tensors="pt")
   ```
4. Inference:

   ```python
   outputs = model(**inputs)
   image_embeddings = outputs.image_embeddings
   text_embeddings = outputs.text_embeddings
   ```
5. Cloud GPUs:
   - For faster inference and fine-tuning, consider cloud GPU services such as AWS, GCP, or Azure, which provide the computational power needed for a model of FLAVA's size.
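A common follow-up to the inference step is scoring image-text similarity from the embeddings. The sketch below uses random tensors as stand-ins for FLAVA's outputs (shapes assumed to be batch x sequence x 768) and pools via the first token; this is one reasonable pooling choice for illustration, not necessarily the model's official contrastive head.

```python
import torch
import torch.nn.functional as F

# Stand-ins for outputs.image_embeddings / outputs.text_embeddings.
# Shapes are assumptions: 196 patches + 1 leading token for a 224x224 image.
image_embeddings = torch.randn(1, 197, 768)
text_embeddings = torch.randn(1, 7, 768)

# Pool with the first token, L2-normalize, then take the dot product,
# which yields a cosine similarity per image-text pair.
img_vec = F.normalize(image_embeddings[:, 0], dim=-1)
txt_vec = F.normalize(text_embeddings[:, 0], dim=-1)
similarity = (img_vec * txt_vec).sum(-1)  # shape: (batch,)
print(similarity)
```

With real FLAVA outputs substituted for the random tensors, higher scores indicate a closer match between the caption and the image.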
License
The FLAVA model is released under the BSD-3-Clause license, allowing for wide use and modification with minimal restrictions.