flaubert_large_cased

flaubert

Introduction

FlauBERT is an unsupervised pre-trained language model specifically designed for the French language. Developed using the CNRS Jean Zay supercomputer, it is trained on a vast and diverse French corpus. FlauBERT aims to enhance reproducible experiments in French NLP and is accompanied by FLUE, an evaluation benchmark similar to GLUE, tailored for French NLP systems.

Architecture

FlauBERT models are available in various configurations:

  • flaubert-small-cased: 6 layers, 8 attention heads, 512 embedding dimensions, 54 million parameters.
  • flaubert-base-uncased: 12 layers, 12 attention heads, 768 embedding dimensions, 137 million parameters.
  • flaubert-base-cased: 12 layers, 12 attention heads, 768 embedding dimensions, 138 million parameters.
  • flaubert-large-cased: 24 layers, 16 attention heads, 1024 embedding dimensions, 373 million parameters.

The small variant is partially trained and recommended only for debugging.

Training

Models are pre-trained using a large-scale French corpus utilizing the computational resources of the CNRS Jean Zay supercomputer. The training process leverages state-of-the-art techniques in unsupervised learning for language models, similar to the BERT architecture but optimized for French language processing.

Guide: Running Locally

To use FlauBERT with Hugging Face's Transformers library, follow these steps:

  1. Install the Transformers Library:

    pip install transformers
    
  2. Load the Model and Tokenizer:

    import torch
    from transformers import FlaubertModel, FlaubertTokenizer
    
    modelname = 'flaubert/flaubert_base_cased'
    
    flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
    flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
    
  3. Encode and Process a Sample Sentence:

    sentence = "Le chat mange une pomme."
    token_ids = torch.tensor([flaubert_tokenizer.encode(sentence)])
    last_layer = flaubert(token_ids)[0]
    cls_embedding = last_layer[:, 0, :]
    

For optimal performance, consider using cloud GPUs like those available on AWS, Google Cloud, or Azure.

License

FlauBERT is released under the MIT License, allowing for open use and modification with proper attribution.

More Related APIs in Fill Mask