camembert base
almanach
Introduction
CamemBERT is a state-of-the-art language model for French, based on the RoBERTa architecture. It is available in multiple versions that differ in parameter count and pretraining data source. It was developed by the ALMANACH team at Inria; more details are available on the CamemBERT website.
Architecture
CamemBERT follows the RoBERTa model architecture. The released configurations range from 110M to 335M parameters and were pretrained on datasets such as OSCAR, CCNet, and Wikipedia, so the versions differ both in model size and in the amount and domain of their pretraining data.
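To make the parameter counts above more concrete, here is a rough sketch that estimates the size of a RoBERTa-style encoder from its hyperparameters. The shapes used (vocabulary of 32005, hidden sizes 768 and 1024, 12 and 24 layers, 514 positions) are assumptions based on typical base and large configurations, not values stated in this document, and `EncoderShape` and `count_parameters` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass
class EncoderShape:
    """Hyperparameters that determine the size of a RoBERTa-style encoder."""
    vocab_size: int
    hidden: int
    layers: int
    intermediate: int
    max_positions: int = 514  # assumed value
    type_vocab: int = 1       # assumed value


def count_parameters(s: EncoderShape) -> int:
    """Estimate the number of weights in the encoder, including the pooler."""
    h = s.hidden
    layer_norm = 2 * h  # scale and bias
    # Token, position, and token-type embeddings, followed by a LayerNorm
    embeddings = (s.vocab_size + s.max_positions + s.type_vocab) * h + layer_norm
    # Query, key, value, and output projections (weights + biases), plus LayerNorm
    attention = 4 * (h * h + h) + layer_norm
    # Two dense layers (weights + biases), plus LayerNorm
    feed_forward = (
        (h * s.intermediate + s.intermediate)
        + (s.intermediate * h + h)
        + layer_norm
    )
    pooler = h * h + h
    return embeddings + s.layers * (attention + feed_forward) + pooler


# Assumed shapes for illustration; check the published model config before relying on them.
base = EncoderShape(vocab_size=32005, hidden=768, layers=12, intermediate=3072)
large = EncoderShape(vocab_size=32005, hidden=1024, layers=24, intermediate=4096)

for name, shape in [("base", base), ("large", large)]:
    print(f"{name}: {count_parameters(shape) / 1e6:.1f}M parameters")
```

Under these assumptions the estimates land close to the 110M and 335M figures quoted above. Most of the difference between the two sizes comes from the per-layer attention and feed-forward weights rather than from the embeddings.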
Training
The models were pretrained on large French text corpora such as OSCAR (138 GB) and CCNet (135 GB). Some variants use subsets of these datasets or smaller sources such as Wikipedia (4 GB). The aim is broad coverage of written French.
Guide: Running Locally
-
Install the Transformers Package (the steps below also rely on PyTorch and SentencePiece):
pip install transformers torch sentencepiece
-
Load the Model and Tokenizer:
from transformers import CamembertModel, CamembertTokenizer

# Load the tokenizer and pretrained weights for the Wikipedia (4 GB) variant
tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base-wikipedia-4gb")
camembert = CamembertModel.from_pretrained("camembert/camembert-base-wikipedia-4gb")

# Switch to evaluation mode (disables dropout)
camembert.eval()
-
Perform Inference:
Use the fill-mask pipeline to predict masked tokens:
from transformers import pipeline

camembert_fill_mask = pipeline(
    "fill-mask",
    model="camembert/camembert-base-wikipedia-4gb",
    tokenizer="camembert/camembert-base-wikipedia-4gb",
)
# Returns the top candidate tokens for <mask>, each with a score
results = camembert_fill_mask("Le camembert est un fromage de <mask>!")
-
Extract Embeddings:
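import torch

# Split the sentence into sub-word pieces (useful for inspection)
tokenized_sentence = tokenizer.tokenize("J'aime le camembert !")
# Convert the sentence to token ids; encode() adds the special start and end tokens
encoded_sentence = tokenizer.encode("J'aime le camembert !")
# Add a batch dimension of size 1
encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
# Run without gradient tracking; index 0 is the last hidden state,
# shaped (batch, sequence_length, hidden_size)
with torch.no_grad():
    embeddings = camembert(encoded_sentence)[0]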
-
Cloud GPU Suggestion:
To run the model efficiently, consider a GPU instance from a cloud provider such as AWS, GCP, or Azure.
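The `results` value returned by the fill-mask step above is a list of dictionaries, one per candidate token, with keys including `sequence`, `score`, and `token_str`. The following is a minimal sketch of how those predictions could be ranked and printed. `summarize_predictions` is a hypothetical helper, and the sample entries are placeholders written for illustration, not real model outputs.

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[str, float, int]]


def summarize_predictions(results: List[Prediction], top_k: int = 3) -> List[str]:
    """Return one 'token: score' line per prediction, highest score first."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [f"{str(r['token_str']).strip()}: {r['score']:.3f}" for r in ranked[:top_k]]


# Placeholder data in the pipeline's output shape. The tokens and scores are
# made up for illustration; the real output also carries a `token` id field.
sample_results = [
    {"sequence": "Le camembert est un fromage de brebis!", "score": 0.15, "token_str": "brebis"},
    {"sequence": "Le camembert est un fromage de vache!", "score": 0.25, "token_str": "vache"},
    {"sequence": "Le camembert est un fromage de France!", "score": 0.10, "token_str": "France"},
]

for line in summarize_predictions(sample_results, top_k=2):
    print(line)
```

With a loaded pipeline, the same helper could be called as `summarize_predictions(results)` on the output of `camembert_fill_mask`.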
License
CamemBERT is released under the MIT License, which permits broad use, modification, and redistribution.