BERT Multilingual Base Model (Uncased)

google-bert

Introduction

The BERT Multilingual Base model (Uncased) is a pretrained model covering the top 102 languages with the largest Wikipedias. It was pretrained with a masked language modeling (MLM) objective and is uncased, meaning it does not distinguish between lowercase and uppercase letters. The model was introduced in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin et al.), the original code is available in the google-research/bert repository on GitHub, and it can be used for a variety of natural language processing tasks.

Architecture

BERT is a transformer model trained on a large multilingual dataset using a self-supervised approach. This involves pretraining on raw text without human labeling, allowing the model to learn bidirectional representations of sentences. The pretraining objectives include:

  • Masked Language Modeling (MLM): Randomly masking 15% of words in the input and predicting them to learn bidirectional context.
  • Next Sentence Prediction (NSP): Determining if two sentences were consecutive in the original text.

These objectives help BERT develop a comprehensive understanding of language, which can be fine-tuned for specific tasks.
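
To make the input format concrete, here is a minimal sketch of how the model's tokenizer builds the [CLS] sentence A [SEP] sentence B [SEP] pairs that the NSP objective operates on. It assumes the transformers library from the guide below is installed; the example sentences are illustrative.

    from transformers import AutoTokenizer

    # Load the WordPiece tokenizer shipped with the multilingual uncased checkpoint
    tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-uncased')

    # Encode a sentence pair the way the NSP objective sees it:
    # [CLS] sentence A [SEP] sentence B [SEP]
    encoded = tokenizer("The cat sat on the mat.", "It fell asleep soon after.")
    print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
    print(encoded['token_type_ids'])  # 0s for sentence A, 1s for sentence B

The token_type_ids tell the model which segment each token belongs to, which is how it distinguishes the two sentences during next sentence prediction.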

Training

BERT was pretrained on Wikipedia dumps covering 102 languages. The texts are lowercased and tokenized using WordPiece with a shared vocabulary of 110,000 tokens, and languages are sampled at different rates based on their Wikipedia size, under-sampling high-resource languages so that low-resource ones are better represented. During pretraining, 15% of the input tokens are selected for masking: 80% of these are replaced with the [MASK] token, 10% with a random token, and 10% are left unchanged. This allows BERT to learn rich language features applicable to downstream tasks.
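
The following is a minimal sketch of that 15% / 80-10-10 masking scheme, written in plain Python over an already-tokenized sequence. The token list and toy vocabulary are illustrative assumptions, not part of the actual pretraining code.

    import random

    def mask_tokens(tokens, vocab, mask_prob=0.15):
        """BERT-style masking: of the selected 15% of tokens,
        80% become [MASK], 10% a random token, 10% stay unchanged."""
        masked = list(tokens)
        labels = [None] * len(tokens)       # prediction targets for the MLM loss
        for i, tok in enumerate(tokens):
            if random.random() < mask_prob:
                labels[i] = tok             # the model must predict the original token
                roll = random.random()
                if roll < 0.8:
                    masked[i] = '[MASK]'
                elif roll < 0.9:
                    masked[i] = random.choice(vocab)
                # else: keep the original token unchanged
        return masked, labels

    tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']
    print(mask_tokens(tokens, vocab=['dog', 'ran', 'tree']))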

Guide: Running Locally

To run the BERT model locally, follow these steps:

  1. Install the Transformers Library (the pipeline also needs a backend such as PyTorch):

    pip install transformers torch
    
  2. Load the Model: Use the Transformers library to load the BERT model for masked language modeling or feature extraction (a feature-extraction sketch follows these steps).

    from transformers import pipeline
    unmasker = pipeline('fill-mask', model='bert-base-multilingual-uncased')
    
  3. Run Inference: Use the model for tasks like masked language modeling.

    result = unmasker("Hello I'm a [MASK] model.")
    print(result)
    
  4. Cloud GPUs: For more intensive tasks, consider using cloud GPU services like AWS, Google Cloud, or Azure for better performance.
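
For the feature-extraction use mentioned in step 2, here is a minimal sketch that pulls sentence representations from the base model. It assumes PyTorch is installed; the mean-pooling step is an illustrative choice, not part of the model itself.

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-uncased')
    model = AutoModel.from_pretrained('bert-base-multilingual-uncased')

    inputs = tokenizer("Hello, how are you?", return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (batch, sequence_length, hidden_size=768)
    token_embeddings = outputs.last_hidden_state
    sentence_embedding = token_embeddings.mean(dim=1)  # simple mean pooling
    print(sentence_embedding.shape)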

License

The BERT model is released under the Apache-2.0 license, allowing for wide use and modification with attribution.
