Llama 2 13b chat hf

meta-llama

Introduction

Llama 2 is a series of pretrained and fine-tuned generative text models developed by Meta. These models, ranging from 7 billion to 70 billion parameters, are designed for dialogue and conversational use cases. The 13B fine-tuned version is available in the Hugging Face Transformers format. The Llama 2 models are open for both commercial and research use in English.

Architecture

Llama 2 models use an auto-regressive language model architecture optimized with a transformer design. They employ supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) for alignment with human preferences on helpfulness and safety. The models vary in size, with 7B, 13B, and 70B parameters, and the 70B model uses Grouped-Query Attention (GQA) for enhanced inference scalability.

Training

The Llama 2 models were pretrained on 2 trillion tokens from publicly available sources, with fine-tuning data incorporating over one million new human-annotated examples. Training was conducted on Meta's Research Super Cluster using A100-80GB GPUs, with a total of 539 tCO2eq emissions offset by Meta's sustainability program. The training data has a cutoff of September 2022, with some fine-tuning data extending to July 2023.

Guide: Running Locally

  1. Set Up Environment: Ensure you have Python and PyTorch installed. Install the Hugging Face Transformers library.

    pip install transformers
    
  2. Access the Model: Visit the Llama 2 model page on Hugging Face to request access and accept the license.

  3. Load the Model: Use the following code to load the model.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    model_name = "meta-llama/Llama-2-13b-chat-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    
  4. Run Inference: Use the model to generate text.

    input_text = "Hello, how can I assist you today?"
    inputs = tokenizer(input_text, return_tensors="pt")
    outputs = model.generate(**inputs)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    
  5. Cloud GPUs: For efficient performance, consider using cloud GPUs like AWS EC2 with NVIDIA GPUs, Google Cloud Platform, or Azure.

License

Llama 2 is licensed under the LLAMA 2 Community License Agreement. Users must accept the license terms on the Meta website before accessing the model. Redistribution must include the license, and certain commercial use conditions apply. The license prohibits using Llama 2 to enhance other models and requires compliance with applicable laws and regulations.

More Related APIs in Text Generation