pixel-base
Team-PIXEL

Introduction
PIXEL (Pixel-based Encoder of Language) is a language model trained to reconstruct masked image patches containing rendered text. Pretrained on English Wikipedia and the BookCorpus, it can be fine-tuned on any language that can be rendered as text on a screen, because it operates on rendered text rather than on a fixed-vocabulary tokenizer.
Architecture
PIXEL features three main components:
- Text Renderer: Converts text into image form.
- Encoder: Processes unmasked regions of these images using a Vision Transformer (ViT).
- Decoder: Reconstructs masked image regions at the pixel level, using a lightweight design with 512 hidden units and 8 transformer layers. Post-pretraining, the decoder can be discarded, leaving an 86M parameter encoder for further use.
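Because both the encoder and decoder operate on fixed-size image patches, the rendered text must first be tiled. The sketch below illustrates this step with NumPy; the 16×16 patch size and function names are assumptions for illustration, not the repository's actual API.

```python
import numpy as np

PATCH = 16  # assumed patch size for illustration

def to_patches(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an (H, W) grayscale image into a sequence of (patch, patch)
    tiles -- the unit the ViT encoder and pixel-level decoder operate on."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tiles = image.reshape(h // patch, patch, w // patch, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(-1, patch, patch)

# A stand-in for a rendered sentence: height 16, width 160 -> 10 patches.
rendered = np.random.rand(16, 160)
patches = to_patches(rendered)
print(patches.shape)  # (10, 16, 16)
```

Each tile then becomes one token-like input to the encoder, with masked tiles withheld until the decoder's reconstruction step.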
Training
PIXEL is pretrained by rendering sentences as images, masking 25% of image patches, and using the encoder to process the unmasked portions. The decoder learns to reconstruct pixel values in the masked areas. After pretraining, the encoder can be paired with task-specific classifiers or used as a generative language model.
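The masking step described above can be sketched as follows. This is a minimal illustration of selecting a random 25% of patch indices; the function name and return layout are hypothetical, not taken from the PIXEL codebase.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(num_patches: int, ratio: float = 0.25):
    """Randomly pick `ratio` of the patch indices to mask.
    The encoder sees only the visible patches; the decoder is
    trained to reconstruct pixel values at the masked positions."""
    num_masked = int(num_patches * ratio)
    perm = rng.permutation(num_patches)
    masked = np.sort(perm[:num_masked])    # reconstructed by the decoder
    visible = np.sort(perm[num_masked:])   # processed by the encoder
    return visible, masked

visible, masked = mask_patches(100)
print(len(visible), len(masked))  # 75 25
```

The reconstruction loss is then computed only over the masked positions, as in masked-autoencoder-style pretraining.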
Guide: Running Locally
To run PIXEL locally:
- Install the required dependencies and clone the PIXEL repository from GitHub:
https://github.com/xplip/pixel
- Load the model using:

```python
from pixel import PIXELConfig, PIXELForPreTraining

config = PIXELConfig.from_pretrained("Team-PIXEL/pixel-base")
model = PIXELForPreTraining.from_pretrained("Team-PIXEL/pixel-base", config=config)
```
- Fine-tune the model on your specific task using the guidelines provided in the repository.
For efficient training and inference, a cloud GPU is recommended, such as those offered by AWS, Google Cloud, or Azure.
License
PIXEL is licensed under the Apache 2.0 License, allowing for widespread use and modification.