1bitLLM/bitnet_b1_58-large

Introduction
This document provides an overview of BitNet b1.58, an open-source reproduction of the model described in the BitNet b1.58 paper. The models are trained on the RedPajama dataset for 100 billion tokens, following the guidelines reported in Microsoft's paper, including a two-stage learning rate (LR) schedule and weight decay.
Architecture
BitNet b1.58 uses a transformer architecture optimized for text generation. Its weights are constrained to the ternary values {-1, 0, +1} (roughly 1.58 bits per weight), which reduces memory footprint and inference cost and makes the model suitable for large-scale text generation and inference tasks.
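As a concrete illustration, the snippet below sketches the absmean-style ternary quantization described in the BitNet b1.58 paper; the function name and epsilon value are illustrative and not taken from this repository.

    import torch

    def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
        # Scale weights by their mean absolute value, then round and clip to {-1, 0, +1},
        # mirroring the absmean quantization described in the paper (names are illustrative).
        scale = w.abs().mean().clamp(min=eps)
        w_ternary = (w / scale).round().clamp_(-1, 1)
        return w_ternary, scale  # approximate reconstruction: w_ternary * scale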
Training
The models are trained on the RedPajama dataset for 100 billion tokens using the reported hyperparameters and techniques, including a two-stage learning rate schedule and weight decay. Performance is measured via perplexity (PPL) and zero-shot accuracy across various tasks; slight variance between the reported and reproduced results is expected due to differences in data processing and other factors.
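For concreteness, the sketch below shows one way a two-stage learning rate schedule can be expressed; the warmup length, peak value, split point, and decay shapes are placeholders rather than the hyperparameters used for the released checkpoints.

    import math

    def two_stage_lr(step: int, total_steps: int, peak_lr: float = 1.5e-3, warmup_steps: int = 375):
        # Hypothetical two-stage schedule: linear warmup, cosine decay during stage 1,
        # then a lower, linearly decaying rate in stage 2. All values are placeholders.
        if step < warmup_steps:
            return peak_lr * step / max(warmup_steps, 1)
        midpoint = total_steps // 2
        if step < midpoint:  # stage 1
            progress = (step - warmup_steps) / max(midpoint - warmup_steps, 1)
            return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))
        # stage 2: restart at a fraction of the peak and decay linearly to zero
        progress = (step - midpoint) / max(total_steps - midpoint, 1)
        return 0.1 * peak_lr * (1 - progress)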
Guide: Running Locally
To evaluate BitNet b1.58 locally, follow the steps below; a minimal inference sketch appears after the list.
- Install Dependencies:
  pip install lm-eval==0.3.0
- Run Evaluation:
  - Perplexity Evaluation:
    python eval_ppl.py --hf_path 1bitLLM/bitnet_b1_58-3B --seqlen 2048
  - Task Evaluation:
    python eval_task.py --hf_path 1bitLLM/bitnet_b1_58-3B \
        --batch_size 1 \
        --tasks \
        --output_path result.json \
        --num_fewshot 0 \
        --ctx_size 2048
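Once the dependencies are installed, the sketch below loads the checkpoint for quick text generation; it assumes the model is reachable through the standard transformers Auto* API and is not part of the official evaluation scripts.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumes the checkpoint loads through the standard Auto* API; the eval_* scripts
    # above remain the reference evaluation path.
    model_id = "1bitLLM/bitnet_b1_58-3B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

    prompt = "BitNet b1.58 is"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))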
Cloud GPUs
For improved performance, especially for larger models, consider using cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure.
License
The BitNet b1.58 model is released under the MIT License, allowing free use, modification, and distribution of the software.