Llama3.1-8B-Chinese-Chat

shenzhi-wang/Llama3.1-8B-Chinese-Chat

Introduction

Llama3.1-8B-Chinese-Chat is an instruction-tuned language model designed for both Chinese and English users. It is built on the Meta-Llama-3.1-8B-Instruct model and fine-tuned with the ORPO algorithm. Compared with the base model, it offers improved abilities such as roleplaying and tool use.

Architecture

The model is based on the Meta-Llama-3.1-8B-Instruct architecture, with 8.03 billion parameters and a context length of 128K tokens. It supports both English and Chinese and has been fine-tuned to improve capabilities such as roleplay and mathematical proficiency.
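
As a quick sanity check, these architecture parameters can be read directly from the model's configuration. The following is a minimal sketch; it fetches only the config from the Hugging Face Hub (no weights), so it needs network access or a local copy of the repository:

    from transformers import AutoConfig

    # Load only the configuration file; no model weights are downloaded.
    config = AutoConfig.from_pretrained("shenzhi-wang/Llama3.1-8B-Chinese-Chat")

    print(config.model_type)               # "llama"
    print(config.max_position_embeddings)  # context length (~128K tokens)
    print(config.num_hidden_layers)        # number of transformer layers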

Training

The training framework used for this model is LLaMA-Factory. Key training details include:

  • Epochs: 3
  • Learning Rate: 3e-6
  • Scheduler: Cosine
  • Warmup Ratio: 0.1
  • Context Length: 8192
  • ORPO Beta: 0.05
  • Batch Size: 128
  • Fine-tuning: Full Parameters
  • Optimizer: Paged AdamW 32-bit
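
The actual training used LLaMA-Factory, whose configuration is not reproduced here. Purely for illustration, the sketch below maps the reported hyperparameters onto TRL's ORPOConfig; this is an assumption on my part, not the authors' setup, and output_dir and the per-device/accumulation split are placeholders:

    from trl import ORPOConfig

    # Hypothetical mapping of the reported hyperparameters onto TRL's ORPOConfig.
    # The actual training used LLaMA-Factory with full-parameter fine-tuning.
    orpo_args = ORPOConfig(
        output_dir="./orpo-llama3.1-8b-chinese-chat",  # placeholder path
        num_train_epochs=3,
        learning_rate=3e-6,
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,
        max_length=8192,                 # fine-tuning context length
        beta=0.05,                       # ORPO beta
        per_device_train_batch_size=8,   # choose so the global batch size is 128
        gradient_accumulation_steps=16,
        optim="paged_adamw_32bit",
        bf16=True,
    )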

Guide: Running Locally

  1. Environment Setup:
    Ensure that the transformers package is version 4.43.0 or later.
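
    A quick way to verify the installed version (a minimal check; the packaging library ships as a transformers dependency):

    import transformers
    from packaging import version

    # Llama 3.1 support requires transformers >= 4.43.0.
    assert version.parse(transformers.__version__) >= version.parse("4.43.0"), \
        f"transformers {transformers.__version__} is too old; please upgrade."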

  2. Download the Model:
    Use the Python script below to download the BF16 version of the model:

    from huggingface_hub import snapshot_download

    # Download the BF16 safetensors weights; the GGUF files are skipped.
    snapshot_download(repo_id="shenzhi-wang/Llama3.1-8B-Chinese-Chat", ignore_patterns=["*.gguf"])
    
  3. Model Inference:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Path to the locally downloaded model (or the Hub repo id).
    model_id = "/Your/Local/Path/to/Llama3.1-8B-Chinese-Chat"
    dtype = torch.bfloat16

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="cuda",
        torch_dtype=dtype,
    )

    # Example Chinese prompt: "Write a poem about machine learning."
    chat = [{"role": "user", "content": "写一首关于机器学习的诗。"}]
    input_ids = tokenizer.apply_chat_template(
        chat, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(
        input_ids,
        max_new_tokens=8192,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    # Strip the prompt tokens and decode only the newly generated text.
    response = outputs[0][input_ids.shape[-1]:]
    print(tokenizer.decode(response, skip_special_tokens=True))
    
  4. Using GGUF Models:

    • Download the GGUF models from the repository's GGUF folder (see the snippet below).
    • Use them with LM Studio, or follow the instructions in llama.cpp.
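
    A minimal sketch for fetching only the GGUF files with huggingface_hub; the allow_patterns filter and the local_dir path are assumptions, so adjust them to the filenames actually present in the repository:

    from huggingface_hub import snapshot_download

    # Download only the GGUF files, e.g. for use with LM Studio or llama.cpp.
    snapshot_download(
        repo_id="shenzhi-wang/Llama3.1-8B-Chinese-Chat",
        allow_patterns=["*.gguf"],
        local_dir="./Llama3.1-8B-Chinese-Chat-GGUF",  # placeholder destination
    )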
  5. Cloud GPU Recommendation:
    For optimal performance, consider using cloud-based GPUs such as AWS EC2 instances with NVIDIA GPUs or Google Cloud's AI Platform.

License

This model is released under the Llama 3.1 Community License. For more information, please refer to the official license document.
