BLIP-2 OPT 2.7B
Salesforce
Introduction
BLIP-2-OPT-2.7B is a pre-trained model developed by Salesforce and hosted on Hugging Face. It is designed for image-to-text tasks, including image captioning and visual question answering. The model uses the OPT-2.7B language model, which consists of 2.7 billion parameters.
Architecture
BLIP-2 integrates three main components:
- A CLIP-like image encoder
- A Querying Transformer (Q-Former), which is a BERT-like Transformer encoder
- A large language model (OPT-2.7B)
The image encoder and large language model are initialized from pre-trained checkpoints and remain frozen during training. The Q-Former maps "query tokens" to embeddings that bridge the image encoder and language model, facilitating text prediction tasks.
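To see how these components map onto the Hugging Face implementation, the snippet below loads the checkpoint and inspects its sub-modules. This is only an illustrative sketch; the attribute names (vision_model, qformer, query_tokens, language_model) reflect the current Blip2ForConditionalGeneration class in transformers and may differ across library versions.

```python
from transformers import Blip2ForConditionalGeneration

# Load the full BLIP-2 checkpoint (image encoder + Q-Former + OPT-2.7B).
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# The three components described above, as exposed by the transformers class.
print(type(model.vision_model).__name__)    # CLIP-like ViT image encoder (frozen)
print(type(model.qformer).__name__)         # BERT-like Querying Transformer
print(type(model.language_model).__name__)  # OPT-2.7B causal language model (frozen)

# Learned query tokens that the Q-Former turns into inputs for the language model.
print(model.query_tokens.shape)             # (1, num_query_tokens, qformer_hidden_size)
```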
Training
The BLIP-2 model is trained on large image-text datasets scraped from the internet, such as LAION. Training optimizes only the Querying Transformer; the image encoder and language model remain frozen so their pre-trained knowledge is preserved (a minimal sketch of this freezing scheme follows below).
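A minimal sketch of that freezing scheme, assuming the transformers attribute names shown in the previous snippet; this is illustrative only, not Salesforce's original training code:

```python
from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# Freeze the pre-trained image encoder and language model.
for module in (model.vision_model, model.language_model):
    for param in module.parameters():
        param.requires_grad = False

# Only the Q-Former (plus the query tokens and projection layer) remains trainable.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} / {total:,}")
```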
Guide: Running Locally
To run BLIP-2-OPT-2.7B locally, follow these steps:
- Install Dependencies: Ensure you have transformers, torch, Pillow (PIL), and requests installed. You may also need accelerate and bitsandbytes for different precision modes (see the 8-bit sketch after these steps).

```bash
pip install transformers torch pillow requests
pip install accelerate bitsandbytes  # for 8-bit precision
```
- Load the Model and Processor:

```python
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
```
- Run Inference:
  - CPU Example:

```python
import requests
from PIL import Image

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```

  - GPU Example with Half Precision:

```python
import torch

model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto"
)

inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
- Cloud GPUs: For better performance, consider using cloud GPU services such as AWS, GCP, or Azure.
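The install step mentions accelerate and bitsandbytes, but the examples above only cover full and half precision. Below is a minimal 8-bit loading sketch, assuming both packages are installed; the load_in_8bit argument follows common Hugging Face examples, and newer transformers releases may prefer passing a BitsAndBytesConfig instead. It also shows that passing only the image (no question) produces a plain caption.

```python
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")

# 8-bit weights via bitsandbytes (requires accelerate as well).
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", load_in_8bit=True, device_map="auto"
)

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# With no text prompt, the model generates an image caption rather than a VQA answer.
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```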
License
BLIP-2-OPT-2.7B is released under the MIT License, allowing for wide usage and modification with proper attribution.