Molmo 7B-D 0924
Introduction
Molmo 7B-D is a vision-language model developed by the Allen Institute for AI, part of the Molmo family. It is built on the Qwen2-7B language model and uses OpenAI's CLIP as its vision backbone. Designed for multimodal tasks, it is trained on the PixMo dataset of 1 million curated image-text pairs. The model achieves state-of-the-art performance among similarly sized multimodal models and is fully open-source.
Architecture
Molmo 7B-D integrates Qwen2-7B as its core language model and OpenAI CLIP for vision processing. It performs strongly on both academic benchmarks and human evaluations, landing between GPT-4V and GPT-4o. The model is part of a broader series of models that leverage the PixMo dataset for strong performance on multimodal tasks.
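To make the design concrete, the sketch below shows the generic pattern such models follow: the CLIP backbone encodes an image into patch features, a small projector maps those features into the language model's embedding space, and the projected image tokens are consumed alongside text tokens by the Qwen2-7B decoder. This is an illustrative sketch only; the module names and dimensions are assumptions, not Molmo's actual internals.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the generic vision-language pattern (vision encoder
# -> projector -> LLM). Names and dimensions are assumptions, not Molmo's
# actual internals: 1024 is typical of CLIP ViT-L/14 patch features, 3584 of
# Qwen2-7B hidden states.
class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=3584):
        super().__init__()
        # Maps CLIP patch features into the LLM's token-embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features, text_embeddings):
        # patch_features:  (batch, num_patches, vision_dim) from the CLIP backbone
        # text_embeddings: (batch, seq_len, llm_dim) from the LLM's embedding table
        image_tokens = self.projector(patch_features)
        # The combined sequence is then processed by the Qwen2-7B decoder.
        return torch.cat([image_tokens, text_embeddings], dim=1)
```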
Training
The model was trained on the PixMo dataset, with an emphasis on open-source AI development and reproducibility. While the specific training artifacts are not yet released, the Molmo team has committed to making them available in the future.
Guide: Running Locally
- Install Dependencies:

```bash
pip install einops torchvision
```
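The code below also imports transformers, torch, Pillow, and requests; if your environment does not already have them, installing the standard PyPI packages should suffice (an assumption about your setup):

```bash
pip install transformers torch pillow requests
```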
- Load Processor and Model:
```python
from transformers import AutoModelForCausalLM, AutoProcessor

# trust_remote_code is required because Molmo ships custom modeling code;
# torch_dtype='auto' and device_map='auto' pick the dtype and device automatically.
processor = AutoProcessor.from_pretrained('allenai/Molmo-7B-D-0924', trust_remote_code=True, torch_dtype='auto', device_map='auto')
model = AutoModelForCausalLM.from_pretrained('allenai/Molmo-7B-D-0924', trust_remote_code=True, torch_dtype='auto', device_map='auto')
```
- Process and Generate Text from Images:
```python
import requests
from PIL import Image
from transformers import GenerationConfig

# Build multimodal inputs from an example image and a text prompt.
inputs = processor.process(
    images=[Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)],
    text="Describe this image."
)
# Move inputs to the model's device and add a batch dimension.
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer
)

# Decode only the newly generated tokens, skipping the prompt.
generated_text = processor.tokenizer.decode(output[0, inputs['input_ids'].size(1):], skip_special_tokens=True)
print(generated_text)
```
- Optimize Inference: Use `torch.autocast` for more efficient inference and to reduce memory usage:

```python
import torch

with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer
    )
```
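Putting the steps together, a small helper can wrap preprocessing, generation, and decoding into one call. This is a convenience sketch built only from the calls shown above; the describe_image name and default prompt are illustrative, not part of the Molmo API.

```python
import requests
import torch
from PIL import Image
from transformers import GenerationConfig

def describe_image(model, processor, url, prompt="Describe this image."):
    """Hypothetical convenience wrapper around the steps shown above."""
    image = Image.open(requests.get(url, stream=True).raw)
    inputs = processor.process(images=[image], text=prompt)
    # Batch and move to the model's device, as in the guide above.
    inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}
    with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
        output = model.generate_from_batch(
            inputs,
            GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
            tokenizer=processor.tokenizer,
        )
    # Decode only the tokens generated after the prompt.
    generated = output[0, inputs["input_ids"].size(1):]
    return processor.tokenizer.decode(generated, skip_special_tokens=True)

print(describe_image(model, processor, "https://picsum.photos/id/237/536/354"))
```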
Cloud GPUs
For optimal performance, consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
License
The Molmo 7B-D model is licensed under Apache 2.0. It is intended for research and educational purposes. For more details, refer to the Responsible Use Guidelines provided by AllenAI.