VLM_WebSight_finetuned

HuggingFaceM4

Introduction

The VLM_WebSight_finetuned model by Hugging Face converts screenshots of website components into HTML/CSS code. It is an early checkpoint of a vision-language foundation model still under development, fine-tuned on the WebSight dataset. The model is an alpha release intended to improve the conversion of website screenshots to code.

Architecture

This multimodal model builds on two pre-trained models, SigLIP and Mistral-7B-v0.1, which are connected by newly initialized parameters trained specifically for this task. Its language capabilities are focused on English.
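
As a rough illustration of what connecting a vision encoder to a language model through newly initialized parameters can look like, the toy sketch below projects image features into the text embedding space and concatenates them with token embeddings. The class name `ToyConnector`, the single linear layer, and all dimensions are assumptions made purely for illustration; the actual connector used in VLM_WebSight_finetuned is not described here and may differ.

```python
import torch
from torch import nn


class ToyConnector(nn.Module):
    """Hypothetical, newly initialized layer that maps vision features
    into a language model's embedding space."""

    def __init__(self, vision_dim, text_dim):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, image_features, text_embeddings):
        # image_features: (batch, num_patches, vision_dim)
        # text_embeddings: (batch, num_tokens, text_dim)
        projected = self.proj(image_features)
        # Prepend the projected image features to the text embeddings.
        return torch.cat([projected, text_embeddings], dim=1)


# Toy sizes, chosen only for illustration.
connector = ToyConnector(vision_dim=16, text_dim=32)
image_features = torch.randn(1, 4, 16)
text_embeddings = torch.randn(1, 6, 32)
fused = connector(image_features, text_embeddings)
print(fused.shape)  # torch.Size([1, 10, 32])
```

In a setup like this, only the connector starts from random weights, while the two pre-trained models supply the image features and the token embeddings.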

Training

The model was fine-tuned on the WebSight dataset, a collection of website screenshots paired with the corresponding code. During training, the model learns to generate HTML/CSS from visual inputs, with SigLIP and Mistral-7B-v0.1 serving as the base models.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Set up your environment:

    • Ensure you have Python and PyTorch installed.
    • Install the Transformers library from Hugging Face.
  2. Load the model and processor:

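    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    # A CUDA-capable GPU is assumed.
    DEVICE = torch.device("cuda")
    PROCESSOR = AutoProcessor.from_pretrained("HuggingFaceM4/VLM_WebSight_finetuned")
    # trust_remote_code=True lets Transformers run the modeling code shipped with the checkpoint.
    MODEL = AutoModelForCausalLM.from_pretrained(
        "HuggingFaceM4/VLM_WebSight_finetuned",
        trust_remote_code=True,
    ).to(DEVICE)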
    
  3. Prepare your inputs:

    • Load the screenshot as an RGB image, for example with PIL.
    • Use the processor's tokenizer for the text prompt and its image processor for the pixel values.
  4. Generate code from the image:

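    from PIL import Image

    # "screenshot.png" is a placeholder; point it at your own screenshot.
    image = Image.open("screenshot.png").convert("RGB")
    # Use the tokenizer's actual BOS token, not the literal string "<BOS_TOKEN>".
    prompt = f"{PROCESSOR.tokenizer.bos_token}<image>"
    inputs = PROCESSOR.tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    inputs["pixel_values"] = PROCESSOR.image_processor([image])
    inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
    generated_ids = MODEL.generate(**inputs, max_length=4096)
    generated_text = PROCESSOR.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(generated_text)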
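
Once step 4 has produced text, it is usually handy to write it to a file that can be opened in a browser. The sketch below uses only the Python standard library. The helper names `TagCounter` and `save_generated_html` and the output file name are hypothetical, and the tag count is only a rough signal that the text contains markup, not a validity check.

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path


class TagCounter(HTMLParser):
    """Counts opening tags as a rough signal that the text is HTML."""

    def __init__(self):
        super().__init__()
        self.tag_count = 0

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1


def save_generated_html(generated_text, out_path):
    """Write the text to out_path and return how many opening tags it contains."""
    counter = TagCounter()
    counter.feed(generated_text)
    counter.close()
    Path(out_path).write_text(generated_text, encoding="utf-8")
    return counter.tag_count


# Stand-in for the `generated_text` produced in step 4.
sample = "<html><body><h1>Hello</h1></body></html>"
out_file = Path(tempfile.gettempdir()) / "generated.html"
print(save_generated_html(sample, out_file), "opening tags written to", out_file)
```

A count of zero suggests the output contains no markup and is worth inspecting before opening it in a browser.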
    

The code above requires a CUDA-capable GPU. If none is available locally, consider a cloud GPU instance from a provider such as AWS, GCP, or Azure.
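
Before working through the steps above, it can help to confirm that the required packages are importable. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and including `PIL` (Pillow) in the list is an assumption about how the screenshot will be loaded.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# torch and transformers come from step 1; PIL (Pillow) is an assumption,
# as one common way to load the screenshot.
required = ["torch", "transformers", "PIL"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

The check only looks up top-level packages without importing them, so it runs quickly even when PyTorch is installed.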

License

The model is distributed under the Apache-2.0 license. This covers the weights trained for VLM_WebSight_finetuned as well as the underlying SigLIP and Mistral-7B-v0.1 models. Use of the model requires compliance with the terms of the Apache-2.0 license.
