LLaMA-2-7B-32K

togethercomputer

Introduction

LLaMA-2-7B-32K is an open-source, long-context language model developed by Together and derived from Meta's Llama-2 7B model. It extends the supported context length to 32K tokens, making it suitable for tasks such as multi-document question answering and long-text summarization.

Architecture

The model builds on the Llama-2-7B architecture and incorporates FlashAttention-2 along with other optimizations for faster, more memory-efficient inference and training.
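
With a recent transformers release, FlashAttention-2 can also be requested through the standard loading API. The snippet below is a minimal sketch, assuming transformers >= 4.36 and a working flash-attn build; with the transformers 4.31.0 pin used later in this guide, the model's bundled remote code (loaded via trust_remote_code=True) is intended to handle this instead.

    # Minimal sketch (assumes transformers >= 4.36 and flash-attn installed);
    # with older transformers, rely on trust_remote_code=True as shown below.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "togethercomputer/LLaMA-2-7B-32K",
        torch_dtype=torch.float16,
        attn_implementation="flash_attention_2",
    )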

Training

LLaMA-2-7B-32K underwent a comprehensive training process involving a mix of pre-training and instruction tuning data:

  • Pre-training: The data mix comprises 25% RedPajama Book, 25% RedPajama ArXiv, 25% other RedPajama data, and 25% UL2 Oscar data. Documents shorter than 2K words were excluded to strengthen long-context capabilities.

  • Fine-tuning: This phase targets few-shot capability under long context, using 20% Natural Instructions, 20% Public Pool of Prompts, 20% the Pile, and 40% RedPajama data. It leverages in-context examples by packing them into 32K-token sequences (see the sketch after this list).
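
The packing step can be pictured with a short sketch. The function below is illustrative only, not Together's actual pipeline; the function name, the greedy strategy, and the separator token id are assumptions.

    # Illustrative sketch: greedily concatenate tokenized examples into
    # sequences of at most 32K tokens. The name, the EOS separator id, and the
    # greedy strategy are assumptions, not the actual training pipeline.
    def pack_examples(tokenized_examples, max_len=32768, eos_id=2):
        packed, current = [], []
        for ids in tokenized_examples:
            # flush the current sequence if the next example would overflow it
            if current and len(current) + len(ids) + 1 > max_len:
                packed.append(current)
                current = []
            current = current + ids + [eos_id]
        if current:
            packed.append(current)
        return packed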

Fine-tuning Examples

  1. Long Context QA: Involves multi-document question answering using Wikipedia passages.

    bash training/finetune_llama-2-7b-32k-mqa.sh
    
  2. Summarization: Uses BookSum for long-form narrative summarization.

    bash training/finetune_llama-2-7b-32k-booksum.sh
    

Guide: Running Locally

To run LLaMA-2-7B-32K locally:

  1. Install necessary packages:

    export CUDA_HOME=/usr/local/cuda-11.8
    pip install transformers==4.31.0
    pip install sentencepiece
    pip install ninja
    pip install flash-attn --no-build-isolation
    pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
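    # Optional sanity check (assumption: verifies the flash-attn build above imports cleanly)
    python -c "import flash_attn; print(flash_attn.__version__)"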
    
  2. Load and use the model:

    from transformers import AutoTokenizer, AutoModelForCausalLM
    import torch  # needed for torch.float16
    
    tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
    model = AutoModelForCausalLM.from_pretrained(
        "togethercomputer/LLaMA-2-7B-32K",
        trust_remote_code=True,
        torch_dtype=torch.float16,
    )
    
    input_context = "Your text here"
    input_ids = tokenizer.encode(input_context, return_tensors="pt")
    # do_sample=True is required for temperature to take effect
    output = model.generate(input_ids, max_length=128, do_sample=True, temperature=0.7)
    output_text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(output_text)
    

Consider using cloud GPUs for optimal performance; a 7B model with 32K-token contexts requires substantial GPU memory for the weights and the KV cache.

License

LLaMA-2-7B-32K is licensed under the Llama 2 license, inherited from the base Meta Llama-2 model.
