VLM_WebSight_finetuned
HuggingFaceM4
Introduction
The VLM_WebSight_finetuned model from the HuggingFaceM4 team converts screenshots of website components into HTML/CSS code. It is an early checkpoint of a vision-language foundation model, fine-tuned on the WebSight dataset, and is currently released as an alpha version intended to demonstrate and improve screenshot-to-code conversion.
Architecture
This multi-modal model builds on two pre-trained backbones: SigLIP for vision and Mistral-7B-v0.1 for language. It connects them with newly initialized parameters that are trained specifically for this task. The model handles English text only.
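As a rough illustration of this design (not the model's actual implementation), the newly initialized parameters can be thought of as a learned bridge that projects vision features into the language model's embedding space. The dimensions below (1152 for a SigLIP-style encoder, 4096 for a Mistral-style decoder) are assumptions used only for this sketch.

import torch
import torch.nn as nn

class VisionLanguageBridge(nn.Module):
    """Conceptual sketch: map image-encoder features into the language model's embedding space."""
    def __init__(self, vision_dim=1152, text_dim=4096):
        super().__init__()
        # These stand in for the "newly initialized parameters" trained during fine-tuning.
        self.projection = nn.Linear(vision_dim, text_dim)

    def forward(self, image_features):
        return self.projection(image_features)

# Example: 64 patch features from a SigLIP-style encoder projected to Mistral-style embeddings.
bridge = VisionLanguageBridge()
dummy_features = torch.randn(1, 64, 1152)
print(bridge(dummy_features).shape)  # torch.Size([1, 64, 4096])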
Training
The model has been fine-tuned on the WebSight dataset, a collection of paired website screenshots and their corresponding code. During training, the model learns to generate HTML/CSS from visual inputs, building on the strengths of the SigLIP and Mistral-7B-v0.1 base models.
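For reference, the training data can be inspected directly; the sketch below assumes the dataset is published on the Hugging Face Hub under the HuggingFaceM4/WebSight identifier and streams a single example rather than downloading the full corpus.

from datasets import load_dataset

# Stream one example from the WebSight dataset (assumed Hub id: HuggingFaceM4/WebSight).
ds = load_dataset("HuggingFaceM4/WebSight", split="train", streaming=True)
example = next(iter(ds))
print(example.keys())  # expected to include a screenshot image and its HTML/CSS text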
Guide: Running Locally
To run the model locally, follow these steps:
- Set up your environment (a quick check is sketched below):
  - Ensure you have Python and PyTorch installed.
  - Install the Transformers library from Hugging Face.
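A quick, optional way to verify the setup is to check the installed versions and GPU availability from Python:

import torch
import transformers

# Confirm the core dependencies are importable and a CUDA GPU is visible.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())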
- Load the model and processor:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

DEVICE = torch.device("cuda")
PROCESSOR = AutoProcessor.from_pretrained("HuggingFaceM4/VLM_WebSight_finetuned")
MODEL = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceM4/VLM_WebSight_finetuned",
    trust_remote_code=True,  # allow the checkpoint's custom modeling code, if it ships any
).to(DEVICE)
- Prepare your inputs (a minimal loading sketch follows below):
  - Convert your image to a suitable format.
  - Use the processor to tokenize and transform the image data.
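For example, the screenshot can be loaded with Pillow before handing it to the processor; the file name below is a placeholder.

from PIL import Image

# Load the screenshot and normalize it to RGB; "screenshot.png" is a placeholder path.
image = Image.open("screenshot.png").convert("RGB")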
- Generate code from the image:

# Build the prompt: the tokenizer's BOS token followed by the image placeholder token.
inputs = PROCESSOR.tokenizer(
    f"{PROCESSOR.tokenizer.bos_token}<image>",
    return_tensors="pt",
    add_special_tokens=False,
)
# Preprocess the screenshot into pixel values and move everything to the GPU.
inputs["pixel_values"] = PROCESSOR.image_processor([image])
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

# Generate HTML/CSS tokens and decode them back to text.
generated_ids = MODEL.generate(**inputs, max_length=4096)
generated_text = PROCESSOR.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
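As a follow-up, the generated markup can be written to a file and opened in a browser to compare it with the original screenshot; the file name here is arbitrary.

# Save the generated markup so it can be opened in a browser for inspection.
with open("generated_page.html", "w", encoding="utf-8") as f:
    f.write(generated_text)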
For optimal performance, consider using a cloud GPU service such as AWS, GCP, or Azure.
License
The model is distributed under the Apache-2.0 license. This covers the weights trained for VLM_WebSight_finetuned as well as the underlying SigLIP and Mistral-7B-v0.1 base models, and any use must comply with the Apache-2.0 terms.