FLAVA Model Documentation

Introduction

The FLAVA model was developed by FAIR researchers to test whether a single unified architecture can effectively handle multiple modalities. Pretrained on publicly available datasets comprising 70 million image-text pairs, FLAVA performs zero-shot image classification and image-text retrieval, and can be fine-tuned for a variety of natural language understanding (NLU) tasks. Its performance was assessed on 32 diverse tasks, where it achieved results competitive with or better than CLIP while remaining open-source and reproducible.

Architecture

FLAVA uses a ViT-B/32 transformer for both image and text encoding, supplemented by a 6-layer multimodal encoder for tasks that require joint vision-and-language processing. The individual components can be loaded from the facebook/flava-full checkpoint through dedicated classes such as FlavaForPreTraining and FlavaModel, depending on the use case, as sketched below.
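
As a rough illustration of how these components map onto separate classes, here is a minimal sketch; it assumes the FLAVA classes currently exposed by transformers (FlavaModel, FlavaForPreTraining, FlavaImageModel, FlavaTextModel) and loads everything from the facebook/flava-full checkpoint.

    from transformers import (
        FlavaModel,
        FlavaForPreTraining,
        FlavaImageModel,
        FlavaTextModel,
    )
    # Dual encoders plus the 6-layer multimodal encoder (used in the guide below)
    model = FlavaModel.from_pretrained("facebook/flava-full")
    # Variant that also carries FLAVA's pretraining heads (masked modeling, contrastive, image-text matching)
    pretraining_model = FlavaForPreTraining.from_pretrained("facebook/flava-full")
    # Unimodal encoders, usable on their own when only one modality is needed
    image_encoder = FlavaImageModel.from_pretrained("facebook/flava-full")
    text_encoder = FlavaTextModel.from_pretrained("facebook/flava-full")

Loading only the encoder you need keeps memory usage down when the multimodal encoder is not required.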

Training

FLAVA was pretrained exclusively on publicly accessible datasets, including COCO, Visual Genome, and others, totaling 70 million image-text pairs. The training aimed to create a reproducible model that performs robustly across various domains, using less data than models like CLIP and SimVLM.

Guide: Running Locally

To run FLAVA locally, follow these steps:

  1. Environment Setup:

    • Install the transformers library from Hugging Face together with PyTorch, e.g. pip install torch transformers.
    • Install Pillow (which provides PIL) for opening images and requests for downloading the example image, e.g. pip install Pillow requests.
  2. Load Model and Processor:

    # Load the full FLAVA checkpoint and its paired processor
    from transformers import FlavaModel, FlavaProcessor
    model = FlavaModel.from_pretrained("facebook/flava-full")
    processor = FlavaProcessor.from_pretrained("facebook/flava-full")
    
  3. Process Inputs:

    from PIL import Image
    import requests
    # Fetch an example image from the COCO validation set
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    # Tokenize the caption and preprocess the image into PyTorch tensors
    inputs = processor(text=["a photo of a cat"], images=[image], return_tensors="pt")
    
  4. Inference:

    # Forward pass through FLAVA's image, text, and multimodal encoders
    outputs = model(**inputs)
    image_embeddings = outputs.image_embeddings
    text_embeddings = outputs.text_embeddings
    # A sketch for comparing these embeddings follows this guide.
    
  5. Cloud GPUs:

    • For faster inference and fine-tuning, consider using cloud GPU services such as AWS, GCP, or Azure; these platforms offer the computational power needed for a model of FLAVA's size. The sketch below shows how to move the model and inputs onto a GPU when one is available.
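
Putting steps 2-4 together, here is a minimal end-to-end sketch showing how the returned embeddings might be compared for image-text matching and how to place the model on a GPU when one is available. It compares only the [CLS] token of each encoder's output and skips the learned projection heads FLAVA applies during contrastive pretraining (those live in FlavaForPreTraining), so the similarity scores are illustrative rather than the model's calibrated ones.

    import torch
    import torch.nn.functional as F
    import requests
    from PIL import Image
    from transformers import FlavaModel, FlavaProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = FlavaModel.from_pretrained("facebook/flava-full").to(device).eval()
    processor = FlavaProcessor.from_pretrained("facebook/flava-full")

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    texts = ["a photo of a cat", "a photo of a dog"]

    # The image is repeated so the image and text batches line up for the multimodal encoder.
    inputs = processor(text=texts, images=[image] * len(texts), return_tensors="pt", padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)

    # Keep only the [CLS] token if the embeddings contain the full token sequence.
    img = outputs.image_embeddings
    txt = outputs.text_embeddings
    img = img[:, 0] if img.ndim == 3 else img
    txt = txt[:, 0] if txt.ndim == 3 else txt

    # Cosine similarity between each caption and the image (illustrative only).
    scores = F.cosine_similarity(img, txt, dim=-1)
    print(dict(zip(texts, scores.tolist())))

The same code falls back to CPU when no GPU is present; the cloud GPU options in step 5 mainly pay off for fine-tuning or large-batch workloads.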

License

The FLAVA model is released under the BSD-3-Clause license, allowing for wide use and modification with minimal restrictions.
