Molmo 7B-D 0924

allenai

Introduction

Molmo 7B-D is a vision-language model developed by the Allen Institute for AI and part of the Molmo family. It is built on the Qwen2-7B language model and uses OpenAI's CLIP as its vision backbone. Designed for multimodal tasks, it is trained on PixMo, a dataset of 1 million highly curated image-text pairs. The model achieves state-of-the-art performance among multimodal models of similar size and is fully open-source.

Architecture

Molmo 7B-D pairs Qwen2-7B as its language model with OpenAI's CLIP as the vision encoder. On both academic benchmarks and human evaluation it performs between GPT-4V and GPT-4o. The model is part of a broader family that leverages the PixMo dataset for strong performance on multimodal tasks.
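
For a quick look at how these components are wired together, the model configuration can be inspected directly with transformers. This is a minimal sketch; the exact fields printed are defined by the model's remote code:

    from transformers import AutoConfig

    # trust_remote_code is required because Molmo ships its own config/model classes
    config = AutoConfig.from_pretrained("allenai/Molmo-7B-D-0924", trust_remote_code=True)
    print(config)  # shows the language-model and vision-backbone hyperparameters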

Training

The model was trained on the PixMo dataset, with an emphasis on open-source AI development and reproducibility. The training artifacts have not yet been released, but the Molmo team intends to make them available in the future.

Guide: Running Locally

  1. Install Dependencies:
    pip install transformers torch einops torchvision pillow requests
    
  2. Load Processor and Model:
    from transformers import AutoModelForCausalLM, AutoProcessor
    processor = AutoProcessor.from_pretrained('allenai/Molmo-7B-D-0924', trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained('allenai/Molmo-7B-D-0924', trust_remote_code=True)
    
  3. Process and Generate Text from Images:
    import requests
    from PIL import Image
    from transformers import GenerationConfig
    inputs = processor.process(
        images=[Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)],
        text="Describe this image."
    )
    # move inputs to the model's device and add a batch dimension
    inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer,
    )
    # decode only the newly generated tokens, skipping the prompt
    generated_tokens = output[0, inputs["input_ids"].size(1):]
    generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
    print(generated_text)
    
  4. Optimize Inference: Use torch.autocast to run generation in bfloat16, which speeds up inference and reduces memory usage (a consolidated end-to-end sketch follows this list):
    import torch
    with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
        output = model.generate_from_batch(...)
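
Putting the steps together, here is a minimal end-to-end sketch. It assumes a single CUDA GPU is available, that the accelerate package is installed (needed for device_map="auto"), and it reuses the example image URL from step 3; adjust the prompt and generation settings as needed:

    import requests
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

    MODEL_ID = "allenai/Molmo-7B-D-0924"

    # load the processor and model; torch_dtype/device_map place the weights on the GPU
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
    )

    # download an example image and build the multimodal prompt
    image = Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)
    inputs = processor.process(images=[image], text="Describe this image.")
    inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

    # generate in bfloat16 autocast for faster, lower-memory inference
    with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
        output = model.generate_from_batch(
            inputs,
            GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
            tokenizer=processor.tokenizer,
        )

    # decode only the newly generated tokens
    print(processor.tokenizer.decode(output[0, inputs["input_ids"].size(1):], skip_special_tokens=True))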
    

Cloud GPUs

For optimal performance, consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure.

License

The Molmo 7B-D model is licensed under Apache 2.0 and is intended for research and educational purposes. For more details, refer to the Responsible Use Guidelines provided by the Allen Institute for AI.
