camembert-base-oscar-4gb (ALMAnaCH)

Introduction

CamemBERT is a state-of-the-art French language model built on the RoBERTa architecture. This checkpoint, camembert-base-oscar-4gb, was pre-trained on a 4 GB subsample of the OSCAR corpus. CamemBERT is available on Hugging Face in six versions that differ in parameter count and training data.

Architecture

CamemBERT is based on the RoBERTa model and comes in different configurations:

  • camembert-base: 110M parameters, trained on OSCAR (138 GB of text).
  • camembert-large: 335M parameters, trained on CCNet (135 GB of text).
  • camembert-base-oscar-4gb (this model): 110M parameters, trained on a 4 GB subsample of OSCAR. Similar 4 GB variants exist for Wikipedia and CCNet.
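The architecture of any of these checkpoints can be inspected without downloading the full weights. A minimal sketch using Hugging Face's AutoConfig (the layer count and hidden size shown are those of the base architecture):

```python
from transformers import AutoConfig

# Fetches only the small config file, not the model weights.
config = AutoConfig.from_pretrained("camembert/camembert-base-oscar-4gb")

print(config.model_type)          # "camembert"
print(config.num_hidden_layers)   # 12 transformer layers in the base model
print(config.hidden_size)         # 768-dimensional hidden states
```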

Training

CamemBERT was pre-trained with a masked language modeling (MLM) objective, in which a fraction of input tokens is hidden and the model learns to predict them, using whole-word masking. The pre-training corpora, OSCAR and CCNet, are large web-crawled collections of French text; this checkpoint uses a 4 GB subsample of OSCAR.

Guide: Running Locally

  1. Setup Environment: Ensure Python and PyTorch are installed.
  2. Install Transformers: Use pip install transformers to install Hugging Face's transformers library.
  3. Load Model and Tokenizer:
    from transformers import CamembertModel, CamembertTokenizer
    tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base-oscar-4gb")
    camembert = CamembertModel.from_pretrained("camembert/camembert-base-oscar-4gb")
    camembert.eval()  # Set to evaluation mode
    
  4. Run Inference: Use the model for tasks such as filling in masked words or extracting embeddings.
  5. Cloud GPUs: For intensive tasks, consider using cloud GPU services like AWS EC2, Google Cloud, or Azure for faster computation.

License

CamemBERT is released under the MIT license, which permits research, educational, and commercial use. Users are still encouraged to review the license terms that accompany the model on Hugging Face.
