unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit

Introduction

Meta-Llama-3.1-8B-Instruct is a multilingual large language model (LLM) developed by Meta. It is designed for text-generation tasks, optimized for multilingual dialogue, and supports English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. The model is aligned with human preferences for helpfulness and safety through supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). This repository provides Unsloth's 4-bit (bitsandbytes) quantization of the instruct model for memory-efficient loading and fine-tuning.

Architecture

Llama 3.1 is an auto-regressive language model built on an optimized transformer architecture. The family spans 8B, 70B, and 405B parameters, all using Grouped-Query Attention (GQA) for improved inference scalability. The instruction-tuned, text-only variants are intended for assistant-like chat and other natural-language-generation tasks.
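
To see GQA reflected in the published configuration, you can inspect the attention head counts without downloading any weights. This is a minimal sketch; the 32/8 head counts noted in the comments assume the standard 8B configuration.

    from transformers import AutoConfig

    # Fetch only the model configuration; no weights are downloaded.
    config = AutoConfig.from_pretrained("unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit")

    # With GQA, groups of query heads share a key/value head, so
    # num_key_value_heads (8) is smaller than num_attention_heads (32),
    # shrinking the key/value cache during inference.
    print(config.num_attention_heads, config.num_key_value_heads)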

Training

The model was pretrained on approximately 15 trillion tokens from publicly available sources, and fine-tuning used over 25 million synthetically generated examples. Training ran on Meta's custom GPU infrastructure and consumed a cumulative 39.3 million GPU hours on H100-80GB hardware across the Llama 3.1 family. Because Meta maintains net-zero greenhouse gas emissions, market-based emissions for training were 0 tons CO2eq.

Guide: Running Locally

To run Meta-Llama-3.1-8B-Instruct locally:

  1. Install Transformers: Ensure you have transformers >= 4.43.0 by running pip install --upgrade transformers.
  2. Set Up the Model:
    import torch
    import transformers
    
    model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
    
    # Build a text-generation pipeline; bfloat16 halves the memory
    # footprint and device_map="auto" places weights on available GPUs.
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="auto",
    )
    
  3. Run Inference:
    messages = [
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ]
    outputs = pipeline(messages, max_new_tokens=256)
    # The pipeline returns the whole conversation; the final entry is
    # the assistant's reply.
    print(outputs[0]["generated_text"][-1])
    
  4. Use Cloud GPUs: For better performance, consider cloud GPUs such as those available through Google Colab or AWS; Unsloth provides a Google Colab notebook with setup instructions. A sketch for loading this repository's 4-bit checkpoint directly follows this list.
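
Because this repository hosts the 4-bit (bitsandbytes) quantized checkpoint, you can also load it directly instead of the full-precision weights. The following is a minimal sketch, assuming bitsandbytes and accelerate are installed (pip install bitsandbytes accelerate); the VRAM figure in the comment is an estimate, not a measured number.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    
    model_id = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
    
    # The checkpoint ships pre-quantized 4-bit weights; an explicit config
    # makes the quantization type and compute dtype visible. Expect roughly
    # 6 GB of VRAM (estimate).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
    
    messages = [
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ]
    # Format the chat with the model's template and generate a reply,
    # decoding only the newly generated tokens.
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(input_ids, max_new_tokens=256)
    print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))

Passing the BitsAndBytesConfig explicitly is optional here, since the quantization settings are stored with the checkpoint, but spelling them out makes the compute-dtype choice visible.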

License

Meta-Llama-3.1-8B-Instruct is released under the Llama 3.1 Community License, which permits both commercial and research use provided the guidelines and restrictions outlined in the license are followed.
