Introduction

Pixtral-12B is a multimodal model published under the mistral-community organization on Hugging Face, designed for image-text-to-text generation with the Transformers library. It loads through the LlavaForConditionalGeneration class and produces text outputs conditioned on combined image and text inputs.

Architecture

The model loads through the Transformers library, which supports Pixtral checkpoints from version 4.45 onward. The architecture integrates text and image inputs to generate coherent textual outputs, supporting both image description and conversational applications.
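Because Pixtral checkpoint support requires Transformers 4.45 or newer, a simple version guard can be sketched as follows. The helper name and the tuple-based comparison are illustrative, not part of the model card:

```python
def supports_pixtral(installed_version: str, minimum=(4, 45)) -> bool:
    """Return True when `installed_version` (e.g. "4.45.2") is new enough
    to load Pixtral checkpoints (Transformers 4.45 or newer)."""
    # Compare only major and minor components; ignore patch/dev suffixes.
    major, minor = (int(part) for part in installed_version.split(".")[:2])
    return (major, minor) >= minimum
```

In practice you would pass `transformers.__version__`; if the check fails, install the library from source as described in the guide below.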

Training

The training details of Pixtral-12B are not publicly documented, but its design for a range of image-text tasks suggests a robust dataset and methodology were used to train the model for general-purpose image captioning and conversational use.

Guide: Running Locally

  1. Environment Setup: Ensure you have Python and the Transformers library installed.
  2. Install from Source: If your installed Transformers release is older than 4.45, install the latest code from source with `pip install git+https://github.com/huggingface/transformers`.
  3. Load the Model:
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    # Download the checkpoint and its matching processor from the Hub.
    model_id = "mistral-community/pixtral-12b"
    model = LlavaForConditionalGeneration.from_pretrained(model_id)
    processor = AutoProcessor.from_pretrained(model_id)
    
  4. Prepare Inputs: Define image URLs and prompts as shown in the examples.
  5. Generate Output: Run the model and processor to generate and decode outputs.
  6. Hardware Suggestions: For large models like Pixtral-12B, consider using cloud GPUs such as those from AWS, Google Cloud, or Azure to ensure efficient processing.
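Steps 3 through 5 above can be combined into a minimal end-to-end sketch. The `<s>[INST]...[/INST]` prompt format with one `[IMG]` token per image follows the convention shown on the model card; the helper names, the example question, and the CUDA placement are assumptions for illustration:

```python
def build_prompt(question: str, num_images: int) -> str:
    """Build a Pixtral-style instruction prompt with one [IMG] token per image."""
    img_tokens = "[IMG]" * num_images
    return f"<s>[INST]{question}\n{img_tokens}[/INST]"

def describe_images(image_urls, question="Describe the images."):
    """Run a single generation pass over the given image URLs (GPU assumed)."""
    # Imported lazily so the prompt helper stays usable without the library.
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "mistral-community/pixtral-12b"
    model = LlavaForConditionalGeneration.from_pretrained(model_id).to("cuda")
    processor = AutoProcessor.from_pretrained(model_id)

    prompt = build_prompt(question, len(image_urls))
    inputs = processor(text=prompt, images=image_urls, return_tensors="pt").to("cuda")
    generate_ids = model.generate(**inputs, max_new_tokens=500)
    return processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
```

Note that the 12B weights need a GPU with substantial memory; on smaller cards, consider loading with a reduced-precision `torch_dtype` or offloading, or use one of the cloud GPU options mentioned above.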

License

Pixtral-12B is released under the Apache 2.0 License, which allows for permissive use, distribution, and modification of the software.
