LLaMA-2-7B-32K
Introduction
LLaMA-2-7B-32K is an open-source, long-context language model developed by Together, derived from Meta's Llama-2 7B model. It supports context lengths of up to 32K tokens, making it suitable for tasks such as multi-document question answering and long-text summarization.
Architecture
The model builds on the Llama-2-7B architecture, incorporating FlashAttention-2 and other optimizations to improve the speed and efficiency of both inference and training.
Training
LLaMA-2-7B-32K underwent a comprehensive training process involving a mix of pre-training and instruction tuning data:
- Pre-training: Data includes 25% RedPajama Book, 25% RedPajama ArXiv, 25% other RedPajama data, and 25% UL2 Oscar data. Data shorter than 2K words was excluded to improve long-context capabilities.
- Fine-tuning: Focuses on few-shot capacity under long context, using 20% Natural Instructions, 20% Public Pool of Prompts, 20% the Pile, and 40% RedPajama data. This phase leverages in-context examples by packing them into 32K-token sequences (see the sketch after this list).
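To illustrate the packing step, here is a minimal sketch assuming a greedy strategy that concatenates tokenized examples, separated by EOS, into fixed 32K-token blocks. The pack_examples helper and its inputs are hypothetical, not Together's actual data pipeline:
from transformers import AutoTokenizer

MAX_LEN = 32 * 1024  # 32K-token training sequences

def pack_examples(examples, tokenizer):
    # Greedily concatenate tokenized examples, separated by EOS,
    # into blocks of at most MAX_LEN tokens.
    packed, current = [], []
    for text in examples:
        ids = (tokenizer.encode(text) + [tokenizer.eos_token_id])[:MAX_LEN]
        if len(current) + len(ids) > MAX_LEN:
            packed.append(current)
            current = []
        current.extend(ids)
    if current:
        packed.append(current)
    return packed

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
blocks = pack_examples(["first example ...", "second example ..."], tokenizer)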
Fine-tuning Examples
- Long Context QA: Involves multi-document question answering using Wikipedia passages (a possible prompt layout is sketched after this list).
bash training/finetune_llama-2-7b-32k-mqa.sh
- Summarization: Uses BookSum for long-form narrative summarization.
bash training/finetune_llama-2-7b-32k-booksum.sh
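For the QA task, here is a minimal sketch of one way such a prompt could be assembled, with passages concatenated ahead of the question. This layout is an assumption for illustration; the exact format the model was fine-tuned on is defined by the training scripts above.
passages = [
    "Wikipedia passage about topic A ...",
    "Wikipedia passage about topic B ...",
]
question = "Which passage discusses topic B?"

# Hypothetical layout: numbered documents followed by the question.
prompt = "\n\n".join(f"Document [{i + 1}]: {p}" for i, p in enumerate(passages))
prompt += f"\n\nQuestion: {question}\nAnswer:"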
Guide: Running Locally
To run LLaMA-2-7B-32K locally:
- Install the necessary packages:
export CUDA_HOME=/usr/local/cuda-11.8
pip install transformers==4.31.0
pip install sentencepiece
pip install ninja
pip install flash-attn --no-build-isolation
pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
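After installation, a quick sanity check, assuming a CUDA-capable GPU is present (this snippet is illustrative and not part of the official instructions):
import torch
import flash_attn

# Confirm the GPU is visible and flash-attn built correctly.
print("CUDA available:", torch.cuda.is_available())
print("flash-attn version:", flash_attn.__version__)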
- Load and use the model:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the model in half precision.
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)

# Encode a prompt and generate a completion. Sampling must be enabled
# for temperature to take effect.
input_context = "Your text here"
input_ids = tokenizer.encode(input_context, return_tensors="pt")
output = model.generate(input_ids, max_length=128, do_sample=True, temperature=0.7)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)
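One caveat when working with long inputs: max_length counts prompt tokens as well as generated ones, so a near-32K prompt leaves almost no room for output. A minimal sketch using max_new_tokens instead, reusing the model and tokenizer above (long_document is a placeholder):
# Bound only the newly generated tokens; the prompt can then occupy
# most of the 32K context window.
long_document = "..."  # placeholder for a long input text
long_ids = tokenizer.encode(long_document, return_tensors="pt")
output = model.generate(long_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))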
Consider using cloud GPUs for optimal performance.
License
LLaMA-2-7B-32K is distributed under the Llama 2 license.