Pixtral-12B-2409

mistralai

Introduction

Pixtral-12B-2409 is a multimodal model from Mistral AI that pairs a 12-billion-parameter multimodal decoder with a 400-million-parameter vision encoder. It handles both text and image inputs, supports variable image sizes, and excels at multimodal tasks while maintaining state-of-the-art performance on text-only benchmarks.

Architecture

Pixtral-12B-2409 is built with a 12B parameter multimodal decoder and a 400M parameter vision encoder, enabling it to process interleaved image and text data efficiently. It supports a sequence length of up to 128k tokens.

Training

The model is natively multimodal, trained on interleaved image and text data. It delivers leading performance in its weight class on both multimodal and text-only benchmarks, such as VQAv2 and HumanEval.

Guide: Running Locally

Basic Steps

  1. Install the Necessary Libraries:

    • Ensure you have vLLM and mistral_common installed and updated:
      pip install --upgrade vllm mistral_common
      
    • Alternatively, use the ready-to-go Docker image for deployment.
  2. Simple Example:

    • Use the following code to run a simple inference:
      from vllm import LLM
      from vllm.sampling_params import SamplingParams
      
      model_name = "mistralai/Pixtral-12B-2409"
      sampling_params = SamplingParams(max_tokens=8192)
      
      # tokenizer_mode="mistral" loads the mistral_common tokenizer required by Pixtral
      llm = LLM(model=model_name, tokenizer_mode="mistral")
      
      prompt = "Describe this image in one sentence."
      image_url = "https://picsum.photos/id/237/200/300"
      
      # OpenAI-style chat message mixing a text part and an image_url part
      messages = [
          {
              "role": "user",
              "content": [{"type": "text", "text": prompt}, {"type": "image_url", "image_url": {"url": image_url}}]
          },
      ]
      
      outputs = llm.chat(messages, sampling_params=sampling_params)
      print(outputs[0].outputs[0].text)
      
  3. Advanced Example:

    • Use multiple images and multi-turn conversations for more complex interactions; a sketch of such a setup is shown after this list. Refer to the official usage examples for further guidance.
  4. Server Setup:

    • Spin up a server with:
      vllm serve mistralai/Pixtral-12B-2409 --tokenizer_mode mistral --limit_mm_per_prompt 'image=4'
      
    • Query the server with curl or any OpenAI-compatible client; a Python client sketch follows below.
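
As referenced in step 3, here is a minimal sketch of a multi-image, multi-turn conversation. It uses the same offline LLM.chat API as the simple example; the limit_mm_per_prompt argument (which raises the per-prompt image cap) and the placeholder image URLs are assumptions that may need adjusting for your vLLM version and hardware.

      from vllm import LLM
      from vllm.sampling_params import SamplingParams
      
      model_name = "mistralai/Pixtral-12B-2409"
      sampling_params = SamplingParams(max_tokens=1024)
      
      # limit_mm_per_prompt raises the per-prompt image cap (assumed engine
      # argument; check your vLLM version if it is rejected).
      llm = LLM(model=model_name, tokenizer_mode="mistral", limit_mm_per_prompt={"image": 4})
      
      # Placeholder image URLs for illustration.
      url_1 = "https://picsum.photos/id/237/400/300"
      url_2 = "https://picsum.photos/id/1084/400/300"
      
      # First turn: one user message containing a text part and two images.
      messages = [
          {
              "role": "user",
              "content": [
                  {"type": "text", "text": "Describe the differences between these two images."},
                  {"type": "image_url", "image_url": {"url": url_1}},
                  {"type": "image_url", "image_url": {"url": url_2}},
              ],
          },
      ]
      outputs = llm.chat(messages, sampling_params=sampling_params)
      answer = outputs[0].outputs[0].text
      print(answer)
      
      # Second turn: append the assistant's reply and ask a follow-up question.
      messages += [
          {"role": "assistant", "content": answer},
          {"role": "user", "content": "Which of the two would make a better postcard, and why?"},
      ]
      outputs = llm.chat(messages, sampling_params=sampling_params)
      print(outputs[0].outputs[0].text)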
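
For step 4, the server started by vllm serve exposes an OpenAI-compatible API, so besides curl you can query it with the openai Python package. The sketch below assumes vLLM's defaults (port 8000, no authentication), so the base URL and the placeholder API key may need adjusting for your deployment.

      from openai import OpenAI
      
      # vLLM's server is OpenAI-compatible and listens on port 8000 by default;
      # the API key is a placeholder because no authentication is configured.
      client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
      
      response = client.chat.completions.create(
          model="mistralai/Pixtral-12B-2409",
          max_tokens=256,
          messages=[
              {
                  "role": "user",
                  "content": [
                      {"type": "text", "text": "Describe this image in one sentence."},
                      {"type": "image_url", "image_url": {"url": "https://picsum.photos/id/237/200/300"}},
                  ],
              },
          ],
      )
      print(response.choices[0].message.content)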

Suggested Cloud GPUs

For enhanced performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.

License

Pixtral-12B-2409 is released under the Apache 2.0 license, allowing for both personal and commercial use, subject to the terms of the license.
