google-bert/bert-large-cased
Introduction
BERT (Bidirectional Encoder Representations from Transformers) is a model pretrained on English text with a masked language modeling objective. Developed by Google, it learns language representations from a large corpus of English data without human labeling, using two self-supervised objectives during pretraining: masked language modeling and next sentence prediction.
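As a quick illustration of the masked language modeling objective, the following sketch (assuming the Hugging Face transformers library is installed) uses the fill-mask pipeline to predict a masked token:

from transformers import pipeline

# Fill-mask pipeline: the model predicts the token hidden behind [MASK].
unmasker = pipeline("fill-mask", model="bert-large-cased")
for prediction in unmasker("Hello, I'm a [MASK] model."):
    print(prediction["token_str"], round(prediction["score"], 3))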
Architecture
The BERT-Large Cased model consists of:
- 24 layers
- 1024 hidden dimensions
- 16 attention heads
- 336 million parameters
The model distinguishes between uppercase and lowercase letters and aims to capture bidirectional representations of sentences.
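One way to confirm these numbers, assuming the transformers library, is to inspect the published configuration and count the parameters directly:

from transformers import BertConfig, BertModel

# The configuration records the architecture hyperparameters listed above.
config = BertConfig.from_pretrained("bert-large-cased")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)  # 24 1024 16

# Counting parameters of the base encoder gives roughly 330-340 million.
model = BertModel.from_pretrained("bert-large-cased")
print(sum(p.numel() for p in model.parameters()))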
Training
Training Data
BERT was pretrained on:
- BookCorpus: A dataset of 11,038 unpublished books
- English Wikipedia (excluding lists, tables, and headers)
Training Procedure
- Preprocessing: Texts were tokenized using WordPiece with a 30,000-token vocabulary; case is preserved, since this is the cased model. Inputs were arranged as [CLS] Sentence A [SEP] Sentence B [SEP].
- Masked Language Modeling: 15% of tokens were masked; of these, 80% were replaced by [MASK], 10% by random tokens, and 10% left unchanged.
- Pretraining: Conducted on 4 cloud TPUs in Pod configuration for one million steps. Training used the Adam optimizer with a learning rate of 1e-4, a weight decay of 0.01, and linear decay of the learning rate.
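The sketch below, assuming the transformers tokenizer API, shows how a sentence pair is packed into the [CLS] Sentence A [SEP] Sentence B [SEP] format described above:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-cased")

# Encoding a sentence pair reproduces the [CLS] A [SEP] B [SEP] layout used in pretraining.
encoded = tokenizer("The cat sat on the mat.", "It was very comfortable.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))

During pretraining the 80/10/10 masking scheme is applied on the fly; in transformers, DataCollatorForLanguageModeling implements the same scheme if you train a masked language model yourself.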
Evaluation Results
Fine-tuned on downstream tasks such as SQuAD 1.1 and MultiNLI, BERT-Large Cased achieves strong results, for example an F1/EM score of 91.5/84.8 on SQuAD 1.1.
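For downstream use such as extractive question answering you need a fine-tuned checkpoint; the sketch below assumes the SQuAD-fine-tuned whole-word-masking variant released alongside the original BERT models (not the exact model evaluated above):

from transformers import pipeline

# Question answering with a SQuAD-finetuned BERT-Large checkpoint (assumed name).
qa = pipeline("question-answering", model="bert-large-cased-whole-word-masking-finetuned-squad")
result = qa(
    question="What objectives are used during pretraining?",
    context="BERT is pretrained with masked language modeling and next sentence prediction.",
)
print(result["answer"], round(result["score"], 3))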
Guide: Running Locally
- Installation: Ensure you have the transformers library installed (plus PyTorch or TensorFlow as a backend):

  pip install transformers
- Using the Model: Load BERT for masked language modeling or feature extraction in PyTorch or TensorFlow.

  PyTorch:

  from transformers import BertTokenizer, BertModel
  tokenizer = BertTokenizer.from_pretrained('bert-large-cased')
  model = BertModel.from_pretrained('bert-large-cased')
  text = "Replace me by any text you'd like."
  encoded_input = tokenizer(text, return_tensors='pt')
  output = model(**encoded_input)

  TensorFlow:

  from transformers import BertTokenizer, TFBertModel
  tokenizer = BertTokenizer.from_pretrained('bert-large-cased')
  model = TFBertModel.from_pretrained('bert-large-cased')
  text = "Replace me by any text you'd like."
  encoded_input = tokenizer(text, return_tensors='tf')
  output = model(encoded_input)
- Hardware Suggestions: For efficient processing, consider using cloud GPUs available through platforms like AWS, Google Cloud, or Azure.
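As a minimal sketch of GPU use with PyTorch (assuming CUDA is available; the same code falls back to the CPU otherwise):

import torch
from transformers import BertTokenizer, BertModel

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = BertTokenizer.from_pretrained('bert-large-cased')
model = BertModel.from_pretrained('bert-large-cased').to(device)
model.eval()

encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors='pt').to(device)
with torch.no_grad():
    output = model(**encoded_input)

# One 1024-dimensional vector per token for BERT-Large.
print(output.last_hidden_state.shape)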
License
BERT-Large Cased is distributed under the Apache 2.0 License, allowing for open use, modification, and distribution.