flaubert_large_cased
Introduction
FlauBERT is a language model for French, pre-trained without supervision on a large and heterogeneous French corpus using the CNRS Jean Zay supercomputer. It is released together with FLUE, an evaluation benchmark for French NLP systems modeled on GLUE, with the aim of making French NLP experiments easier to reproduce.
Architecture
FlauBERT models are available in various configurations:
- flaubert-small-cased: 6 layers, 8 attention heads, 512 embedding dimensions, 54 million parameters.
- flaubert-base-uncased: 12 layers, 12 attention heads, 768 embedding dimensions, 137 million parameters.
- flaubert-base-cased: 12 layers, 12 attention heads, 768 embedding dimensions, 138 million parameters.
- flaubert-large-cased: 24 layers, 16 attention heads, 1024 embedding dimensions, 373 million parameters.
The small variant is partially trained and recommended only for debugging.
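As a quick sanity check on the figures above, the following sketch records the four configurations and derives the per-head dimension, assuming the usual multi-head split where the embedding dimension is divided evenly across the attention heads. The `FlaubertVariant` class and the `largest_under` helper are hypothetical names introduced for illustration only; they are not part of any FlauBERT release.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlaubertVariant:
    """One row of the configuration list (illustrative container, not an official API)."""
    name: str
    layers: int
    heads: int
    embed_dim: int
    params_millions: int

    @property
    def head_dim(self) -> int:
        # Assumes the embedding dimension is split evenly across heads.
        if self.embed_dim % self.heads:
            raise ValueError(f"{self.name}: embed_dim is not divisible by heads")
        return self.embed_dim // self.heads


VARIANTS = [
    FlaubertVariant("flaubert-small-cased", 6, 8, 512, 54),
    FlaubertVariant("flaubert-base-uncased", 12, 12, 768, 137),
    FlaubertVariant("flaubert-base-cased", 12, 12, 768, 138),
    FlaubertVariant("flaubert-large-cased", 24, 16, 1024, 373),
]


def largest_under(budget_millions: int) -> Optional[FlaubertVariant]:
    """Return the largest variant whose parameter count fits the budget, if any."""
    candidates = [v for v in VARIANTS if v.params_millions <= budget_millions]
    return max(candidates, key=lambda v: v.params_millions) if candidates else None


for variant in VARIANTS:
    print(f"{variant.name}: {variant.head_dim} dimensions per head")
```

Under that assumption every variant works out to 64 dimensions per head, so the larger models add heads and layers rather than widening each head.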
Training
Models are pre-trained on a large-scale French corpus using the computational resources of the CNRS Jean Zay supercomputer. Pre-training is unsupervised and follows the BERT approach, applied to French text.
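To make "BERT-style unsupervised pre-training" concrete, here is a minimal sketch of the token-masking step from the original BERT recipe: a fraction of positions is selected for prediction, and each selected token is replaced by a mask symbol 80% of the time, by a random vocabulary token 10% of the time, and left unchanged otherwise. The 15% rate, the 80/10/10 split, the `MASK` string, and the function name are assumptions for illustration; this text does not state FlauBERT's exact masking settings.

```python
import random

MASK = "<mask>"  # placeholder symbol; hypothetical, not FlauBERT's actual special token


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (corrupted, targets) for one masked-language-modeling example.

    targets[i] holds the original token at positions the model must predict
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        targets[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # most selected tokens are hidden
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # some are swapped for a random token
        # otherwise the token stays as-is but must still be predicted
    return corrupted, targets


tokens = "le chat mange une pomme".split()
corrupted, targets = mask_tokens(tokens, vocab=tokens, seed=0)
print(corrupted)
print(targets)
```

The training loss is then computed only at positions where `targets` is not `None`, which is what lets the model learn from raw text without labels.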
Guide: Running Locally
To use FlauBERT with Hugging Face's Transformers library, follow these steps:
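- Install the Transformers Library:
    pip install transformers
  The snippets in the next steps also import PyTorch (torch), and the FlauBERT tokenizer depends on the sacremoses package, so install both if they are missing.
- Load the Model and Tokenizer:
    import torch
    from transformers import FlaubertModel, FlaubertTokenizer

    # Hub ID of the checkpoint; use 'flaubert/flaubert_large_cased' for the large model
    modelname = 'flaubert/flaubert_base_cased'

    # output_loading_info=True also returns a dict describing how the weights were loaded
    flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
    flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
- Encode and Process a Sample Sentence:
    sentence = "Le chat mange une pomme."
    token_ids = torch.tensor([flaubert_tokenizer.encode(sentence)])

    # First output element is the last hidden layer: (batch, sequence length, embedding dimension)
    last_layer = flaubert(token_ids)[0]

    # Vector at position 0, the special token the tokenizer prepends to the sentence
    cls_embedding = last_layer[:, 0, :]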
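The steps above encode one sentence and keep the vector at position 0. When several sentences of different lengths are encoded together, the shorter ones are padded, and a common alternative is to average the token vectors while ignoring the padding. The sketch below shows one way to do that; `masked_mean_pool` and `embed_sentences` are hypothetical helper names, and mean pooling is an illustrative choice rather than something the FlauBERT authors prescribe.

```python
import torch


def masked_mean_pool(last_layer: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_layer.dtype)  # (batch, seq, 1)
    summed = (last_layer * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                   # guard against empty rows
    return summed / counts


def embed_sentences(sentences, modelname="flaubert/flaubert_base_cased"):
    """Download the checkpoint and return one pooled vector per sentence."""
    # Imported lazily so the pooling helper can be used without transformers installed.
    from transformers import FlaubertModel, FlaubertTokenizer

    tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
    model = FlaubertModel.from_pretrained(modelname)
    model.eval()  # disable dropout for inference

    # padding=True pads to the longest sentence and returns an attention mask
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        last_layer = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
        )[0]
    return masked_mean_pool(last_layer, batch["attention_mask"])
```

Calling `embed_sentences(["Le chat mange une pomme.", "Il pleut."])` downloads the checkpoint on first use and returns a tensor with one row per sentence.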
For optimal performance, consider using cloud GPUs like those available on AWS, Google Cloud, or Azure.
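Whether the GPU is local or rented, the model and its inputs have to be placed on the same device before inference. A minimal sketch of that pattern follows, using a small stand-in module so it runs without downloading anything; `pick_device` is a hypothetical helper name.

```python
import torch


def pick_device() -> torch.device:
    """Return the CUDA device when one is visible, otherwise the CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


device = pick_device()

# Stand-in module and input; with FlauBERT these would be `flaubert` and `token_ids`.
model = torch.nn.Linear(4, 2).to(device)
inputs = torch.ones(1, 4).to(device)

with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(inputs)

print(outputs.device)
```

The same two `.to(device)` calls apply unchanged to the model and token tensor from the guide, and the code falls back to the CPU on machines without a GPU.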
License
FlauBERT is released under the MIT License, allowing for open use and modification with proper attribution.