camembert-base-oscar-4gb
Introduction
CamemBERT is a state-of-the-art French language model built on the RoBERTa architecture and fine-tunable for a wide range of natural language processing tasks. It is available on Hugging Face in six versions, which differ in parameter count and in the source and size of their training data.
Architecture
CamemBERT is based on the RoBERTa model and comes in different configurations:
- camembert-base: 110M parameters, trained on OSCAR (138 GB of text).
- camembert-large: 335M parameters, trained on CCNet (135 GB of text).
- Other versions include models trained on 4 GB subsets of OSCAR and CCNet, as well as on French Wikipedia; this checkpoint, camembert-base-oscar-4gb, is a base-size model trained on a 4 GB subsample of OSCAR.
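The 110M figure for camembert-base can be sanity-checked from standard BERT/RoBERTa-base dimensions (12 layers, hidden size 768, feed-forward size 3072, ~32k SentencePiece vocabulary). The back-of-the-envelope estimate below is an approximation that ignores small bias and LayerNorm terms:

```python
# Rough parameter count for a BERT/RoBERTa-base sized encoder such as camembert-base.
# Dimensions are the standard base configuration; this is an estimate, not an exact count.
vocab, hidden, ffn, layers, max_pos = 32005, 768, 3072, 12, 514

embeddings = vocab * hidden + max_pos * hidden  # token + position embeddings
per_layer = (4 * hidden * hidden   # Q, K, V and attention output projections
             + 2 * hidden * ffn)   # feed-forward up/down projections
total = embeddings + layers * per_layer
print(f"{total / 1e6:.0f}M parameters")  # prints "110M parameters"
```

The dominant terms are the embedding matrix (~25M) and the twelve transformer layers (~85M), which together land at roughly 110M.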
Training
CamemBERT was pre-trained with a masked language modeling (MLM) objective on large French corpora such as OSCAR and CCNet: a fraction of the input tokens is hidden and the model learns to predict them from context, which yields strong general-purpose representations of French text.
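The core idea of the masked language modeling objective can be sketched in a few lines. The function below is illustrative only (the names and the simplified masking scheme are our own, not the actual CamemBERT pre-processing code, which also uses whole-word masking and token replacement):

```python
import random

def mask_for_mlm(tokens, mask_rate=0.15, seed=0):
    """Replace ~mask_rate of the tokens with <mask>; return the masked
    sequence and a dict of {position: original token} prediction targets.
    Illustrative sketch, not the real CamemBERT data pipeline."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]
        masked[i] = "<mask>"
    return masked, targets

tokens = "le camembert est un fromage français au lait cru".split()
masked, targets = mask_for_mlm(tokens)
print(masked)  # one token replaced by <mask>
```

During pre-training the model is scored only on the masked positions, so it must infer each hidden word from its surrounding context.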
Guide: Running Locally
- Setup Environment: Ensure Python and PyTorch are installed.
- Install Transformers: Use
pip install transformers sentencepiece
to install Hugging Face's transformers library (the CamemBERT tokenizer also requires sentencepiece).
- Load Model and Tokenizer:
from transformers import CamembertModel, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base-oscar-4gb")
camembert = CamembertModel.from_pretrained("camembert/camembert-base-oscar-4gb")
camembert.eval()  # Set to evaluation mode
- Run Inference: Use the model for tasks such as filling in masked words or extracting embeddings.
- Cloud GPUs: For intensive tasks, consider using cloud GPU services like AWS EC2, Google Cloud, or Azure for faster computation.
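For the embedding-extraction use case above, a common recipe is to mean-pool the model's last hidden states over non-padding tokens to get one vector per sentence. The numpy sketch below shows just the pooling step on toy arrays; mean_pool is a hypothetical helper name, not a transformers API, and in practice its inputs would come from camembert(...) and the tokenizer's attention mask:

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Average hidden states over non-padding positions.
    Hypothetical helper; inputs mirror what a transformers model returns:
    last_hidden_state (batch, seq, hidden), attention_mask (batch, seq)."""
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = mask.sum(axis=1)
    return summed / counts

# Toy input: batch of 1, 3 tokens (the last one is padding), hidden size 2.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(hidden, mask))  # [[2. 3.]]
```

Masking before averaging matters: without it, padding positions (here [9.0, 9.0]) would distort the sentence embedding.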
License
CamemBERT is released under the MIT license, which permits research, educational, and commercial use. Users are still encouraged to review the license terms, as well as those of the underlying training corpora, before redistributing the model or its outputs.