Introduction

LLaMA-68M is a lightweight, LLaMA-like language model with 68 million parameters, trained primarily on Wikipedia and on selected portions of the C4-en and C4-realnewslike datasets. It was released as a base Small Speculative Model (SSM) for the SpecInfer paper, which accelerates LLM serving through speculative inference and token tree verification.

Architecture

LLaMA-68M follows the LLaMA architecture and is implemented with the Hugging Face transformers library on top of PyTorch. It supports text generation and is compatible with Hugging Face Inference Endpoints. Despite its small size, it is intended to serve as a base draft model for speculative inference research.
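
To inspect the architecture locally, a quick sketch (AutoConfig and num_parameters are standard transformers APIs; the printed fields come from the model's config.json):

    from transformers import AutoConfig, AutoModelForCausalLM

    # Print the architectural hyperparameters (hidden size, layers, heads, ...)
    config = AutoConfig.from_pretrained("JackFram/llama-68m")
    print(config)

    # Load the weights and count parameters; this should be roughly 68M
    model = AutoModelForCausalLM.from_pretrained("JackFram/llama-68m")
    print(f"{model.num_parameters():,} parameters")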

Training

The model was trained on English data, specifically Wikipedia and portions of the C4-en and C4-realnewslike datasets. No formal evaluation of the model's performance has been conducted, so users should validate its output before relying on it.

Guide: Running Locally

To run LLaMA-68M locally:

  1. Install Dependencies: Ensure you have Python and PyTorch installed. Install the transformers library via pip:

    pip install transformers
    
  2. Download the Model: Obtain the model files from the Hugging Face repository.
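
     This step is optional, since from_pretrained in step 3 downloads and caches the files automatically. To pre-fetch them explicitly, a minimal sketch using the huggingface_hub library:

    from huggingface_hub import snapshot_download

    # Download all repository files into the local Hugging Face cache
    local_path = snapshot_download(repo_id="JackFram/llama-68m")
    print(local_path)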

  3. Load the Model: Use the transformers library to load and run the model:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Download (on first use) and load the tokenizer and model weights
    tokenizer = AutoTokenizer.from_pretrained("JackFram/llama-68m")
    model = AutoModelForCausalLM.from_pretrained("JackFram/llama-68m")
    
  4. Inference: Input some text and generate output using the model:

    # Tokenize a prompt and generate a continuation; max_new_tokens caps
    # the output length (50 is an arbitrary example value; the library
    # default is otherwise very short)
    inputs = tokenizer("Your input text here", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    

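Because LLaMA-68M is meant to act as a draft model, a natural next step is assisted generation, transformers' built-in form of speculative decoding, where a larger target model verifies the draft's proposed tokens. A minimal sketch; the target checkpoint JackFram/llama-160m is only an illustrative choice, and any larger model sharing the LLaMA tokenizer works:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Draft (assistant) model: small and fast, proposes candidate tokens
    draft = AutoModelForCausalLM.from_pretrained("JackFram/llama-68m")

    # Target model: verifies and accepts/rejects the drafted tokens
    tokenizer = AutoTokenizer.from_pretrained("JackFram/llama-160m")
    target = AutoModelForCausalLM.from_pretrained("JackFram/llama-160m")

    inputs = tokenizer("Your input text here", return_tensors="pt")

    # assistant_model switches generate() into assisted generation
    outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
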
For faster inference, consider running the model on a GPU, for example via a cloud service such as AWS EC2, Google Cloud, or Azure, as sketched below.
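
A minimal device-placement sketch (assumes a CUDA-capable GPU, falling back to CPU otherwise):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"

    tokenizer = AutoTokenizer.from_pretrained("JackFram/llama-68m")
    model = AutoModelForCausalLM.from_pretrained("JackFram/llama-68m").to(device)

    # The tokenized inputs must live on the same device as the model
    inputs = tokenizer("Your input text here", return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))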

License

The model is licensed under the Apache 2.0 license, which permits both personal and commercial use, subject to compliance with its terms.
