Pixtral-12B-2409
Introduction
Pixtral-12B-2409 is a multimodal model developed by Mistral AI with 12 billion parameters and a 400 million parameter vision encoder. It is designed to handle both text and image data, supporting various image sizes and excelling in multimodal tasks. The model maintains state-of-the-art performance on text-only benchmarks.
Architecture
Pixtral-12B-2409 is built with a 12B parameter multimodal decoder and a 400M parameter vision encoder, enabling it to process interleaved image and text data efficiently. It supports a sequence length of up to 128k tokens.
Training
The model is natively multimodal, trained on interleaved image and text data. Within its weight class, it achieves strong results on both multimodal and text-only benchmarks, such as VQAv2 and HumanEval.
Guide: Running Locally
Basic Steps
- Install the Necessary Libraries:
  - Ensure you have vLLM and mistral_common installed and updated: pip install --upgrade vllm mistral_common
  - Alternatively, use the ready-to-go Docker image for deployment.
- Simple Example:
  - Use the following code to run a simple inference:

        from vllm import LLM
        from vllm.sampling_params import SamplingParams

        model_name = "mistralai/Pixtral-12B-2409"
        sampling_params = SamplingParams(max_tokens=8192)

        llm = LLM(model=model_name, tokenizer_mode="mistral")

        prompt = "Describe this image in one sentence."
        image_url = "https://picsum.photos/id/237/200/300"

        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ]

        outputs = llm.chat(messages, sampling_params=sampling_params)
        print(outputs[0].outputs[0].text)
- Advanced Example:
  - Use multiple images and multi-turn conversations for more complex interactions. Refer to the provided usage examples for guidance.
- Server Setup:
  - Spin up a server with: vllm serve mistralai/Pixtral-12B-2409 --tokenizer_mode mistral --limit_mm_per_prompt 'image=4'
  - Query the server using a client such as curl.
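The multi-image, multi-turn conversations mentioned in the advanced example use the same chat-message format as the simple example. A minimal sketch of such a conversation follows; the image URLs, the follow-up question, and the sample assistant reply are illustrative placeholders, not output from the model:

```python
# Sketch of a multi-image, multi-turn conversation in the chat-message
# format accepted by llm.chat(). URLs and text here are placeholders.
prompt = "Describe the differences between these two images."
url_1 = "https://picsum.photos/id/237/200/300"
url_2 = "https://picsum.photos/id/231/200/300"

messages = [
    {
        # First turn: a text instruction interleaved with two images.
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": url_1}},
            {"type": "image_url", "image_url": {"url": url_2}},
        ],
    },
    {
        # The model's earlier reply is fed back as an assistant turn.
        "role": "assistant",
        "content": "The first image shows a dog; the second shows a beach.",
    },
    {
        # Follow-up question that refers back to the earlier images.
        "role": "user",
        "content": [{"type": "text", "text": "Which image is more colorful?"}],
    },
]

# With a loaded model, this is passed exactly like the single-image case:
# outputs = llm.chat(messages, sampling_params=sampling_params)
```

Note that when serving the model, the number of images per prompt is capped by the `--limit_mm_per_prompt` setting shown in the server command.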
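Once the server is up, it can be queried over vLLM's OpenAI-compatible HTTP API rather than curl. The sketch below builds such a request with only the standard library; the host and port are assumptions (vLLM defaults to localhost:8000), and the network call itself is left commented out since it requires a running server:

```python
import json
import urllib.request

# Build an OpenAI-compatible chat-completion request for the vLLM server
# started with `vllm serve`. Endpoint path and defaults are assumptions
# based on vLLM's OpenAI-compatible server (localhost:8000).
payload = {
    "model": "mistralai/Pixtral-12B-2409",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://picsum.photos/id/237/200/300"},
                },
            ],
        }
    ],
    "max_tokens": 256,
}

request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# With the server running, uncomment to send the request and print the reply:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```

The same request can be issued from the shell with curl by POSTing the JSON payload above to the `/v1/chat/completions` endpoint.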
Suggested Cloud GPUs
For enhanced performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
License
Pixtral-12B-2409 is released under the Apache 2.0 license, allowing for both personal and commercial use, subject to the terms of the license.