Reformer-Enwik8
Introduction
The Reformer-Enwik8 model is a character-level language model trained on the enwik8 dataset, the first 100 million bytes of an English Wikipedia dump. The dataset is a standard benchmark for measuring data compression, most notably through the Hutter Prize.
Architecture
The Reformer-Enwik8 model is a PyTorch implementation of the Reformer language model, designed for character-level text generation. It was pretrained on the first 90 million characters of enwik8, with the text split into sequences of 65,536 characters each. The weights were converted from the original Trax checkpoint for use with Hugging Face's transformers library.
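To see this sequence-length setup for yourself, the published configuration can be inspected. This is a brief sketch assuming the standard transformers ReformerConfig attribute names:

```python
from transformers import ReformerConfig

# Fetch the published configuration for the converted checkpoint.
config = ReformerConfig.from_pretrained("google/reformer-enwik8")

print(config.max_position_embeddings)  # maximum sequence length (trained on 65,536-character chunks)
print(config.axial_pos_shape)          # factorized shape used for axial position embeddings
print(config.attn_layers)              # ordering of local and LSH self-attention layers
```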
Training
Because it operates at the character level, the model requires no tokenizer. Encoding converts each character to its byte value and shifts it by a fixed offset, reserving the lowest IDs for special tokens such as padding; decoding reverses the shift to reconstruct the original text. Text generation is not yet fully optimized and may be slow, particularly for long sequences.
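A minimal sketch of such encode/decode helpers, assuming an offset of 2 so that IDs 0 and 1 stay reserved for special tokens (0 used for padding); the loading example in the guide below expects helpers with these signatures:

```python
import torch


def encode(list_of_strings, pad_token_id=0):
    # Work on raw bytes so each character maps directly to byte IDs.
    byte_strings = [s if isinstance(s, bytes) else s.encode() for s in list_of_strings]
    max_length = max(len(b) for b in byte_strings)

    input_ids = torch.full((len(byte_strings), max_length), pad_token_id, dtype=torch.long)
    attention_masks = torch.zeros((len(byte_strings), max_length), dtype=torch.long)

    for idx, bs in enumerate(byte_strings):
        # Shift every byte value by 2 so IDs 0 and 1 stay reserved
        # for special tokens (0 is the padding ID).
        input_ids[idx, : len(bs)] = torch.tensor([b + 2 for b in bs])
        attention_masks[idx, : len(bs)] = 1

    return input_ids, attention_masks


def decode(output_ids):
    # Undo the +2 offset; IDs below 2 (special tokens) decode to "".
    return [
        "".join(chr(x - 2) if x > 1 else "" for x in ids)
        for ids in output_ids.tolist()
    ]
```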
Guide: Running Locally
- Environment Setup: Ensure you have Python and PyTorch installed, then install the `transformers` library from Hugging Face:

  ```bash
  pip install transformers torch
  ```
- Model Loading: Use the following code to load the model and generate text (it relies on the `encode` and `decode` helpers shown above):

  ```python
  import torch
  from transformers import ReformerModelWithLMHead

  # Load the converted checkpoint.
  model = ReformerModelWithLMHead.from_pretrained("google/reformer-enwik8")

  # Encode a prompt, sample a continuation, and decode it back to text.
  encoded, attention_masks = encode(["Your input text here"])
  result = decode(model.generate(encoded, do_sample=True, max_length=150))
  print(result)
  ```
- Hardware Recommendations: For optimal performance, especially when generating longer text sequences, consider using a cloud GPU from a platform such as AWS, Google Cloud, or Azure (see the sketch after this list for running generation on a GPU).
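As a rough illustration of the GPU recommendation above, here is a sketch assuming a CUDA-capable device and the encode/decode helpers defined earlier:

```python
import torch
from transformers import ReformerModelWithLMHead

# Pick a GPU when one is available; fall back to CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = ReformerModelWithLMHead.from_pretrained("google/reformer-enwik8").to(device)

# Reuse the encode/decode helpers defined earlier.
encoded, attention_masks = encode(["Your input text here"])
output_ids = model.generate(encoded.to(device), do_sample=True, max_length=150)
print(decode(output_ids.cpu()))
```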
License
The Reformer-Enwik8 model and its associated code are subject to the licensing terms specified by Hugging Face and Google. Users should review these terms to ensure compliance with usage restrictions and conditions.