Doge-20M
JingzeShi/Doge-20M
Introduction
Doge is a research project focused on developing small language models using the Transformer framework. The aim is to create models with fewer cache states and larger knowledge capacity. Doge employs Dynamic Mask Attention for sequence transformation and utilizes either a Multi-Layer Perceptron or a Cross Domain Mixture of Experts for state transformation. This model is designed for text input and generation only.
Architecture
- Dynamic Mask Attention: Allows the model to use self-attention during training and a state-space formulation during inference (a toy sketch follows this list).
- Cross Domain Mixture of Experts: Can directly inherit the weights of the Multi-Layer Perceptron for further training.
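To make the "dynamic mask" idea more concrete, the sketch below shows a deliberately simplified, hypothetical single-head attention layer in which a small projection of the value states adds a data-dependent bias on top of the usual causal mask. This is an illustration of the general idea only, not the Doge implementation; names such as `ToyDynamicMaskAttention` and `mask_proj` are invented for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDynamicMaskAttention(nn.Module):
    """Hypothetical single-head attention with a value-dependent ("dynamic") mask.

    Simplified for intuition only; this is not the Doge source code. A small
    projection of the value states produces one score per key position, which
    is added to the causal attention logits, so the effective mask depends on
    the input data rather than being fixed.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.mask_proj = nn.Linear(dim, 1)  # invented name: per-position dynamic score
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / d ** 0.5                  # (b, t, t)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), 1)  # static causal mask
        scores = scores.masked_fill(causal, float("-inf"))
        dynamic_bias = self.mask_proj(v).transpose(-2, -1)           # (b, 1, t), data-dependent
        attn = F.softmax(scores + dynamic_bias, dim=-1)
        return self.out_proj(attn @ v)

x = torch.randn(2, 8, 64)
print(ToyDynamicMaskAttention(64)(x).shape)  # torch.Size([2, 8, 64])
```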
Training
- Data: Trained on the HuggingFaceTB/smollm-corpus.
- Training Steps: 8,000
- Training Tokens: 4 billion
- Learning Rate: 8e-3
- Batch Size: 0.5 million tokens
- Precision: bfloat16
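These figures are internally consistent: 8,000 steps at roughly 0.5 million tokens per batch works out to about 4 billion training tokens. The dictionary below simply restates them as a plain Python config for reference; the key names are illustrative and do not correspond to the actual training script's arguments.

```python
# Illustrative summary of the reported training setup (key names are invented).
doge_20m_training_config = {
    "dataset": "HuggingFaceTB/smollm-corpus",
    "training_steps": 8_000,
    "tokens_per_batch": 500_000,         # 0.5M tokens
    "training_tokens": 8_000 * 500_000,  # = 4_000_000_000 (4B)
    "learning_rate": 8e-3,
    "precision": "bfloat16",
}
```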
Evaluation Metrics
- MMLU: 25.43
- TriviaQA: 0
- ARC-E: 36.83
- ARC-C: 22.53
- PIQA: 58.38
- HellaSwag: 27.25
- OBQA: 25.60
- Winogrande: 50.20
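These are standard zero-/few-shot benchmarks; one common way to obtain comparable numbers is EleutherAI's lm-evaluation-harness. The snippet below is a hedged sketch using its Python API: the task names, batch size, and few-shot settings are assumptions and may not match the exact protocol used to produce the scores above.

```python
# Sketch: evaluating Doge-20M with lm-evaluation-harness (pip install lm_eval).
# Task names and settings are assumptions, not the documented evaluation setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=JingzeShi/Doge-20M,trust_remote_code=True",
    tasks=["mmlu", "triviaqa", "arc_easy", "arc_challenge",
           "piqa", "hellaswag", "openbookqa", "winogrande"],
    batch_size=8,
)
print(results["results"])
```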
Guide: Running Locally
- Install Dependencies: Ensure you have the transformers library installed:

```bash
pip install transformers
```
- Load Model and Tokenizer:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code=True is needed because Doge ships a custom model implementation
tokenizer = AutoTokenizer.from_pretrained("JingzeShi/Doge-20M")
model = AutoModelForCausalLM.from_pretrained("JingzeShi/Doge-20M", trust_remote_code=True)
```
- Generate Text:

```python
# Tokenize a prompt, generate up to 100 new tokens, and decode the output
inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.batch_decode(out))
```
For enhanced performance, consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
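If a GPU is available, locally or on a cloud instance, the model and inputs from the steps above can be moved onto it before generation. This is standard PyTorch/Transformers usage rather than anything Doge-specific.

```python
# Optional: run generation on a GPU when one is available.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.batch_decode(out))
```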
License
This project is licensed under the Apache-2.0 License.