Molmo-72B-0924

by allenai

Introduction

Molmo is a family of open vision-language models developed by the Allen Institute for AI. The Molmo models are trained on PixMo, a dataset of 1 million highly curated image-text pairs. Molmo-72B is the flagship of the family, achieving top scores on academic benchmarks and ranking among the best models in human evaluation. It uses Qwen2-72B as its language backbone and OpenAI's CLIP as its vision backbone.

Architecture

Molmo-72B builds on Qwen2-72B as its language model and incorporates OpenAI's CLIP as the vision backbone. It is designed to excel at multimodal tasks and is supported by the PixMo dataset and open-source artifacts, including training code and model checkpoints.
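
To check how these pieces are wired together, you can print the model's configuration. This is a minimal sketch: trust_remote_code is required because Molmo ships custom modeling code, and the exact config field names are defined by that remote code rather than guaranteed here.

    from transformers import AutoConfig
    
    # print the remote configuration to inspect the language-model and
    # vision-backbone settings (field names come from Molmo's remote code)
    config = AutoConfig.from_pretrained('allenai/Molmo-72B-0924', trust_remote_code=True)
    print(config)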

Training

Training uses the PixMo dataset of 1 million curated image-text pairs. The resulting model scores highly on academic benchmarks and ranks favorably in human evaluations. The training artifacts, including code and intermediate checkpoints, are slated for release to support open-source development and reproducibility.
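
For readers who want to inspect the training data, the PixMo subsets are published on the Hugging Face Hub. A minimal sketch follows, assuming allenai/pixmo-cap (the captioning subset) as the dataset ID; check the Hub for the exact subset names:

    from datasets import load_dataset
    
    # Assumption: 'allenai/pixmo-cap' is one of the released PixMo subsets;
    # streaming avoids downloading the full dataset just to peek at one record
    ds = load_dataset('allenai/pixmo-cap', split='train', streaming=True)
    print(next(iter(ds)))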

Guide: Running Locally

To run Molmo locally, follow these steps:

  1. Install Dependencies (in addition to a working transformers and PyTorch setup):

    pip install einops torchvision
    
  2. Load the Model and Processor:

    from transformers import AutoModelForCausalLM, AutoProcessor
    import torch
    
    # trust_remote_code loads Molmo's custom modeling code; torch_dtype='auto'
    # keeps the checkpoint dtype and device_map='auto' spreads the 72B weights
    # across the available GPUs
    processor = AutoProcessor.from_pretrained('allenai/Molmo-72B-0924', trust_remote_code=True, torch_dtype='auto', device_map='auto')
    model = AutoModelForCausalLM.from_pretrained('allenai/Molmo-72B-0924', trust_remote_code=True, torch_dtype='auto', device_map='auto')
    
  3. Process and Generate Text from an Image:

    from transformers import GenerationConfig
    from PIL import Image
    import requests
    
    # download an example image and preprocess it together with the text prompt
    url = "https://picsum.photos/id/237/536/354"
    image = Image.open(requests.get(url, stream=True).raw)
    inputs = processor.process(images=[image], text="Describe this image.")
    
    # move inputs to the model's device and add a batch dimension
    inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}
    
    # generate up to 200 new tokens, stopping at the end-of-text marker
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer,
    )
    
    # decode only the newly generated tokens, skipping the prompt
    generated_tokens = output[0, inputs['input_ids'].size(1):]
    generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
    print(generated_text)
    
  4. Optimize with Autocast:

    # run generation under bfloat16 autocast on CUDA to reduce memory use
    with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
        output = model.generate_from_batch(
            inputs,
            GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
            tokenizer=processor.tokenizer,
        )
    

For enhanced performance, plan for substantial hardware: in bfloat16 the 72B parameters alone occupy roughly 145 GB, so a multi-GPU node or cloud GPUs such as those provided by AWS or Google Cloud are recommended.
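
If the full-precision model does not fit in available GPU memory, quantized loading is another option. The following is a minimal sketch using bitsandbytes 4-bit quantization; whether Molmo's custom remote code is fully compatible with quantized loading is an assumption you should verify:

    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    import torch
    
    # Assumption: Molmo's custom modeling code tolerates bitsandbytes 4-bit
    # quantization; verify output quality before relying on this setup
    quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
    model = AutoModelForCausalLM.from_pretrained(
        'allenai/Molmo-72B-0924',
        trust_remote_code=True,
        quantization_config=quant_config,
        device_map='auto',
    )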

License

This model is released under the Apache 2.0 license and is intended for research and educational use. For more details, refer to the Allen Institute for AI's Responsible Use Guidelines. The base model, Qwen2-72B, is covered by the Tongyi Qianwen license.
