camembert large

almanach

Introduction

CamemBERT is a state-of-the-art language model for French, based on the RoBERTa architecture. It is available in six versions, differing in the number of parameters, pretraining data volume, and data source domains.

Architecture

CamemBERT follows the RoBERTa architecture. The models range in size from a base version with 110 million parameters to a large version with 335 million parameters, and the variants also differ in their pretraining corpus (for example OSCAR or CCNet).
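
The two sizes quoted above can be roughly reproduced with a back-of-the-envelope count. The sketch below is only an estimate: the hyperparameters (12 layers of width 768 for base, 24 layers of width 1024 for large, a 32,005-entry vocabulary, 514 positions) are assumptions based on standard BERT/RoBERTa shapes and are not stated in this document, and `encoder_params` is a hypothetical helper, not part of any library.

```python
def encoder_params(layers, hidden, ffn, vocab=32005, positions=514, type_vocab=1):
    """Estimate the parameter count of a BERT/RoBERTa-style encoder.

    All default values are assumptions made for illustration.
    """
    # Embeddings: token, position, and segment tables plus one LayerNorm.
    embeddings = (vocab + positions + type_vocab) * hidden + 2 * hidden
    # Self-attention: Q, K, V, and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden
    # Feed-forward: two dense layers (weights + biases), plus LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + 2 * hidden
    # Pooler: one dense layer applied to the first token's hidden state.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


base = encoder_params(layers=12, hidden=768, ffn=3072)
large = encoder_params(layers=24, hidden=1024, ffn=4096)
print(f"base:  {base / 1e6:.1f}M")   # base:  110.6M
print(f"large: {large / 1e6:.1f}M")  # large: 336.7M
```

Under these assumptions the estimates land close to the 110 million and 335 million figures; the small gap for the large model depends on details such as whether the pooler is counted.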

Training

CamemBERT models are pretrained on large-scale French text. Depending on the version, the corpus comes from sources such as OSCAR or CCNet, with data volumes ranging from 4GB to 138GB.

Guide: Running Locally

  1. Load the Model and Tokenizer:

    from transformers import CamembertModel, CamembertTokenizer
    tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-large")
    camembert = CamembertModel.from_pretrained("camembert/camembert-large")
    camembert.eval()
    
  2. Use the Model for Mask Filling:

    from transformers import pipeline 
    camembert_fill_mask = pipeline("fill-mask", model="camembert/camembert-large", tokenizer="camembert/camembert-large")
    results = camembert_fill_mask("Le camembert est <mask> :)")
    
  3. Extract Contextual Embedding Features:

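    import torch
    # Split the sentence into subword pieces, then convert the pieces to ids
    tokenized_sentence = tokenizer.tokenize("J'aime le camembert !")
    encoded_sentence = tokenizer.encode(tokenized_sentence)
    # Add a batch dimension: shape (1, sequence_length)
    encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
    # Index the output instead of unpacking it: recent transformers versions
    # return a dict-like object, and unpacking that yields key names
    with torch.no_grad():
        embeddings = camembert(encoded_sentence)[0]  # last-layer hidden states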
    
  4. Extract Embeddings from All Layers:

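    from transformers import CamembertConfig
    # Ask the model to also return the hidden states of every layer
    config = CamembertConfig.from_pretrained("camembert/camembert-large", output_hidden_states=True)
    camembert = CamembertModel.from_pretrained("camembert/camembert-large", config=config)
    camembert.eval()
    outputs = camembert(encoded_sentence)
    # Index instead of unpacking: [0] is the last layer, [2] the tuple of all hidden states
    embeddings, all_layer_embeddings = outputs[0], outputs[2]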
    
  5. Cloud GPU Suggestion: For efficient processing, especially with larger models, consider using cloud-based GPUs from providers like AWS, Google Cloud, or Azure.
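
For step 2 above, the fill-mask pipeline returns a list of dictionaries, one per candidate, with keys such as `score`, `token`, `token_str`, and `sequence`. A minimal sketch of ranking those candidates follows. `top_predictions` is a hypothetical helper, and `mock_results` contains made-up placeholder values that only mimic the shape of the pipeline output; they are not real model predictions.

```python
from typing import Any, Dict, List, Tuple


def top_predictions(results: List[Dict[str, Any]], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (token_str, score) pairs."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    # strip() removes any surrounding whitespace from the token string
    return [(r["token_str"].strip(), float(r["score"])) for r in ranked[:k]]


# Placeholder entries (NOT real model output); token ids are dummy values.
mock_results = [
    {"score": 0.1, "token": 0, "token_str": "bon", "sequence": "Le camembert est bon :)"},
    {"score": 0.6, "token": 1, "token_str": "délicieux", "sequence": "Le camembert est délicieux :)"},
    {"score": 0.3, "token": 2, "token_str": "excellent", "sequence": "Le camembert est excellent :)"},
]

for token, score in top_predictions(mock_results, k=2):
    print(f"{token}: {score:.2f}")
```

With a real pipeline, `results` from step 2 would take the place of `mock_results`.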

License

The model and its usage are subject to licensing terms as provided by the authors and Hugging Face. Always refer to the model's documentation for specific licensing details.
