fancyfeast/llama-joycaption-alpha-two-hf-llava
Introduction
JoyCaption is an open and uncensored image captioning Visual Language Model (VLM) designed to assist in training diffusion models. It aims to provide a free alternative to existing captioning tools like ChatGPT, offering broader coverage across various content styles and categories.
Architecture
The model is built upon:
- meta-llama/Llama-3.1-8B-Instruct (language model backbone)
- google/siglip-so400m-patch14-384 (vision encoder)
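As in other LLaVA-style models, the SigLIP model acts as the vision tower whose image features are fed into the Llama backbone. As a quick sanity check you can inspect the hosted configuration; the sketch below assumes it follows the standard transformers LlavaConfig layout and downloads no weights.

```python
from transformers import AutoConfig

# Load only the configuration (no weights) to inspect the model's components.
config = AutoConfig.from_pretrained("fancyfeast/llama-joycaption-alpha-two-hf-llava")

# Expected to report a SigLIP vision tower and a Llama text backbone.
print(config.vision_config.model_type)  # e.g. "siglip_vision_model"
print(config.text_config.model_type)    # e.g. "llama"
```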
Training
JoyCaption is trained on diverse datasets to ensure broad understanding of different image styles and contents, with minimal filtering except for the exclusion of illegal content. This approach is intended to enhance the performance and versatility of diffusion models.
Guide: Running Locally
- Environment Setup:
  - Install the required libraries: `pip install torch transformers pillow`
- Load Model:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_NAME = "fancyfeast/llama-joycaption-alpha-two-hf-llava"

processor = AutoProcessor.from_pretrained(MODEL_NAME)
llava_model = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME, torch_dtype="bfloat16", device_map=0)
llava_model.eval()
```
- Process Image:
  - Load and process an image, then generate a caption with the loaded model, as sketched below.
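A minimal sketch of the captioning step, reusing `processor` and `llava_model` from the Load Model step. The image path, prompt wording, and generation settings here are illustrative assumptions that you can adjust.

```python
import torch
from PIL import Image

# Hypothetical local image path; the processor handles resizing and normalization.
image = Image.open("example.jpg").convert("RGB")

# Build a chat-style prompt; the exact wording is an assumption, not a fixed requirement.
convo = [
    {"role": "system", "content": "You are a helpful image captioner."},
    {"role": "user", "content": "Write a long descriptive caption for this image."},
]
prompt = processor.apply_chat_template(convo, tokenize=False, add_generation_prompt=True)

# Tokenize the prompt, preprocess the image, and move everything to the model's device.
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(llava_model.device)
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

# Generate and decode the caption, dropping the prompt tokens from the output.
with torch.no_grad():
    output_ids = llava_model.generate(**inputs, max_new_tokens=300, do_sample=True,
                                      temperature=0.6, top_p=0.9)
caption = processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:],
                                 skip_special_tokens=True)[0]
print(caption.strip())
```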
- Inference:
  - Use vLLM for optimized performance, for example: `vllm serve fancyfeast/llama-joycaption-alpha-two-hf-llava --max-model-len 4096 --enable-prefix-caching`
  - Adjust settings based on your environment, keeping in mind that vLLM can be memory-intensive. A minimal client sketch follows below.
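`vllm serve` exposes an OpenAI-compatible endpoint (by default at http://localhost:8000/v1). The sketch below relies on that assumption and sends a local image as a base64 data URL; the image path and prompt text are illustrative.

```python
import base64
from openai import OpenAI  # pip install openai

# vLLM does not verify API keys by default, so any placeholder string works here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Encode a local image as a data URL; "example.jpg" is a hypothetical path.
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="fancyfeast/llama-joycaption-alpha-two-hf-llava",
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": "Write a descriptive caption for this image."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ]},
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```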
- Suggested Cloud GPUs:
  - Consider using cloud services such as AWS, Azure, or Google Cloud for GPU resources if local hardware is insufficient.
License
JoyCaption is released as a free and open model, allowing unrestricted use and modification within legal boundaries.