bert base multilingual cased
google-bertIntroduction
The BERT Multilingual Base Model (Cased) is a pretrained model designed for the top 104 languages with the largest Wikipedia presence. It was developed using a masked language modeling (MLM) objective and introduced in the paper titled "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". This model is sensitive to case differences, distinguishing between words like "english" and "English".
Architecture
BERT is a transformers model trained on a large corpus of multilingual data using self-supervised learning. It employs two main objectives during pretraining:
- Masked Language Modeling (MLM): Randomly masks 15% of words in a sentence and predicts them using the model's bidirectional representation.
- Next Sentence Prediction (NSP): Determines if two sentences are sequentially related by concatenating and analyzing them.
This architecture allows BERT to create a comprehensive representation of languages, which can be leveraged for various downstream tasks such as sequence classification and question answering.
Training
Training Data
The model is pretrained on the 104 languages with the largest Wikipedia datasets. The complete list of languages is available in the multilingual BERT documentation.
Training Procedure
- Preprocessing: Texts are lowercased, tokenized using WordPiece, and arranged into sequences with a maximum length of 512 tokens. Languages with larger corpora are under-sampled, while those with smaller corpora are oversampled.
- Masking: 15% of tokens are masked with 80% replaced by
[MASK]
, 10% by random tokens, and 10% left unchanged.
Guide: Running Locally
To use the BERT Multilingual Base Model locally, follow these steps:
-
Installation: Ensure you have the
transformers
library installed.pip install transformers
-
Load the Model:
-
PyTorch:
from transformers import BertTokenizer, BertModel tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased') model = BertModel.from_pretrained("bert-base-multilingual-cased") text = "Replace me by any text you'd like." encoded_input = tokenizer(text, return_tensors='pt') output = model(**encoded_input)
-
TensorFlow:
from transformers import BertTokenizer, TFBertModel tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased') model = TFBertModel.from_pretrained("bert-base-multilingual-cased") text = "Replace me by any text you'd like." encoded_input = tokenizer(text, return_tensors='tf') output = model(encoded_input)
-
For optimal performance, especially for large-scale tasks, consider using cloud GPUs such as those provided by AWS or Google Cloud Platform.
License
The BERT Multilingual Base Model is licensed under the Apache-2.0 License.