llama-joycaption-alpha-two-hf-llava

fancyfeast

Introduction

JoyCaption is a free, open, and uncensored image captioning Visual Language Model (VLM) designed to assist in training diffusion models. It aims to provide an open alternative to proprietary captioning tools such as ChatGPT, offering broad coverage across content styles and categories.

Architecture

The model is built upon:

  • meta-llama/Llama-3.1-8B-Instruct (language model)
  • google/siglip-so400m-patch14-384 (vision encoder)
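
A quick way to confirm the two backbones in a local copy of the checkpoint is to inspect its published configuration. The snippet below is a minimal sketch that assumes the standard LlavaConfig layout exposed by transformers (vision_config / text_config):

    from transformers import AutoConfig

    config = AutoConfig.from_pretrained("fancyfeast/llama-joycaption-alpha-two-hf-llava")
    print(config.model_type)                # "llava"
    print(config.vision_config.model_type)  # SigLIP vision tower
    print(config.text_config.model_type)    # Llama language model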

Training

JoyCaption is trained on diverse datasets to ensure broad understanding of different image styles and contents, with minimal filtering except for the exclusion of illegal content. This approach is intended to enhance the performance and versatility of diffusion models.

Guide: Running Locally

  1. Environment Setup:

    • Install required libraries using pip install torch transformers pillow.
  2. Load Model:

    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration
    
    MODEL_NAME = "fancyfeast/llama-joycaption-alpha-two-hf-llava"
    
    # Load the processor (tokenizer + image preprocessor) and the model in bfloat16 on GPU 0.
    processor = AutoProcessor.from_pretrained(MODEL_NAME)
    llava_model = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME, torch_dtype="bfloat16", device_map=0)
    llava_model.eval()
    
  3. Process Image:

    • Load an image with PIL, render your prompt with the processor's chat template, and run the model to generate a caption (see the captioning sketch after this list).
  4. Inference:

    • For optimized inference performance, serve the model with vLLM:
      vllm serve fancyfeast/llama-joycaption-alpha-two-hf-llava --max-model-len 4096 --enable-prefix-caching
      
    • Adjust settings to your environment, keeping in mind that vLLM can be memory-intensive. The server exposes an OpenAI-compatible API; a client request is sketched below.
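
The two sketches below flesh out steps 3 and 4. The first continues the guide with a minimal captioning example; the image path, prompt, and sampling settings are illustrative and should be adapted to your use case, and a CUDA device is assumed.

    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration
    
    MODEL_NAME = "fancyfeast/llama-joycaption-alpha-two-hf-llava"
    IMAGE_PATH = "image.jpg"  # illustrative path
    PROMPT = "Write a long descriptive caption for this image in a formal tone."
    
    # Load the processor and model as in step 2.
    processor = AutoProcessor.from_pretrained(MODEL_NAME)
    llava_model = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME, torch_dtype="bfloat16", device_map=0)
    llava_model.eval()
    
    with torch.no_grad():
        # Step 3: load the image and render the prompt with the model's chat template.
        image = Image.open(IMAGE_PATH)
        convo = [
            {"role": "system", "content": "You are a helpful image captioner."},
            {"role": "user", "content": PROMPT},
        ]
        convo_string = processor.apply_chat_template(convo, tokenize=False, add_generation_prompt=True)
    
        # Tokenize the text and preprocess the image together.
        inputs = processor(text=[convo_string], images=[image], return_tensors="pt").to("cuda")
        inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
    
        # Generate, then decode only the newly produced tokens.
        generate_ids = llava_model.generate(**inputs, max_new_tokens=300, do_sample=True, temperature=0.6, top_p=0.9)
        new_tokens = generate_ids[0, inputs["input_ids"].shape[1]:]
        caption = processor.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
        print(caption)

The second sketch queries the vLLM server started above through its OpenAI-compatible API. It assumes the openai Python package is installed and the server is listening on localhost:8000; the image path and prompt are again illustrative.

    import base64
    from openai import OpenAI
    
    # vLLM's OpenAI-compatible server does not require a real API key by default.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    
    # Send the image inline as a base64 data URL.
    with open("image.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    
    response = client.chat.completions.create(
        model="fancyfeast/llama-joycaption-alpha-two-hf-llava",
        messages=[
            {"role": "system", "content": "You are a helpful image captioner."},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                    {"type": "text", "text": "Write a long descriptive caption for this image in a formal tone."},
                ],
            },
        ],
        max_tokens=300,
    )
    print(response.choices[0].message.content)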

Suggested Cloud GPUs

  • Consider using cloud services like AWS, Azure, or Google Cloud for GPU resources if local hardware is insufficient.

License

JoyCaption is released as a free and open model, allowing unrestricted use and modification within legal boundaries.
