BLIP-2 OPT 2.7B

Salesforce

Introduction

BLIP-2-OPT-2.7B is a pre-trained vision-language model developed by Salesforce and hosted on Hugging Face. It is designed for image-to-text tasks such as image captioning and visual question answering. The model builds on Meta's OPT-2.7B language model, which has 2.7 billion parameters.

Architecture

BLIP-2 integrates three main components:

  • A CLIP-like image encoder
  • A Querying Transformer (Q-Former), which is a BERT-like Transformer encoder
  • A large language model (OPT-2.7B)

The image encoder and large language model are initialized from pre-trained checkpoints and remain frozen during training. The Q-Former maps a set of learnable "query tokens" to embeddings that bridge the image encoder's output and the language model's input space, allowing the language model to condition its text generation on the image.
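
This frozen/trainable split is also visible in the Hugging Face transformers implementation, where the three components are exposed as submodules of Blip2ForConditionalGeneration. The sketch below is a rough illustration rather than part of the official model card; the attribute names (vision_model, qformer, language_model, query_tokens) reflect the current transformers implementation and could change between versions.

    from transformers import Blip2ForConditionalGeneration
    
    # Downloads the full checkpoint on first run, so this needs plenty of disk and RAM.
    model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
    
    def count_params(module):
        return sum(p.numel() for p in module.parameters())
    
    # The three components described above, plus the learnable query tokens
    # that the Q-Former consumes.
    print("image encoder: ", count_params(model.vision_model))
    print("Q-Former:      ", count_params(model.qformer))
    print("language model:", count_params(model.language_model))
    print("query tokens:  ", model.query_tokens.numel())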

Training

BLIP-2 is trained on image-text pairs collected from the internet, such as the LAION dataset. Training optimizes only the Querying Transformer; the image encoder and language model stay frozen so that their pre-trained knowledge is preserved.
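
As an illustration of this setup, the freezing can be reproduced with plain PyTorch by disabling gradients on the image encoder and language model and leaving the Q-Former trainable. This is a minimal sketch of the idea, not the actual BLIP-2 training code, and it assumes the submodule names used by the current transformers implementation.

    from transformers import Blip2ForConditionalGeneration
    
    model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
    
    # Freeze the pre-trained image encoder and language model.
    for module in (model.vision_model, model.language_model):
        for param in module.parameters():
            param.requires_grad = False
    
    # What remains trainable: the Q-Former, its query tokens, and the projection
    # into the language model's embedding space.
    trainable = [name for name, p in model.named_parameters() if p.requires_grad]
    print(len(trainable), "trainable parameter tensors, e.g.", trainable[0])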

Guide: Running Locally

To run BLIP-2-OPT-2.7B locally, follow these steps:

  1. Install Dependencies: Ensure you have transformers, torch, Pillow (imported as PIL), and requests installed. For half-precision or 8-bit inference you will also need accelerate and bitsandbytes (an 8-bit loading sketch follows this guide).

    pip install transformers torch pillow requests
    pip install accelerate bitsandbytes  # for 8-bit precision
    
  2. Load the Model and Processor:

    from transformers import Blip2Processor, Blip2ForConditionalGeneration
    
    processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
    model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
    
  3. Run Inference:

    • CPU Example:
      import requests
      from PIL import Image
      
      img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
      raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
      
      question = "how many dogs are in the picture?"
      inputs = processor(raw_image, question, return_tensors="pt")
      
      out = model.generate(**inputs)
      print(processor.decode(out[0], skip_special_tokens=True).strip())
      
    • GPU Example with Half Precision (float16), reusing processor, raw_image, and question from the CPU example:
      import torch
      
      model = Blip2ForConditionalGeneration.from_pretrained(
          "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto"
      )
      inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)
      
      out = model.generate(**inputs)
      print(processor.decode(out[0], skip_special_tokens=True).strip())
      
  4. Cloud GPUs: For better performance, consider using cloud GPU services such as AWS, GCP, or Azure.
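
If GPU memory is tight, the model can also be loaded in 8-bit precision via bitsandbytes, as mentioned in step 1. The snippet below is a sketch based on the quantization API in recent transformers releases (BitsAndBytesConfig); it requires a CUDA GPU, and the exact arguments may differ between library versions.

    import torch
    from transformers import Blip2ForConditionalGeneration, BitsAndBytesConfig
    
    model = Blip2ForConditionalGeneration.from_pretrained(
        "Salesforce/blip2-opt-2.7b",
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
    )
    
    # Reuse the processor, raw_image, and question from step 3.
    inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)
    out = model.generate(**inputs)
    print(processor.decode(out[0], skip_special_tokens=True).strip())

For plain image captioning rather than question answering, pass only the image to the processor (for example, processor(raw_image, return_tensors="pt")) and the model generates a free-form caption.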

License

BLIP-2-OPT-2.7B is released under the MIT License, which permits broad use and modification provided the copyright and license notice are retained.
