BERT-Large Cased

google-bert

Introduction

BERT (Bidirectional Encoder Representations from Transformers) is a transformer model pretrained on a large corpus of English data in a self-supervised fashion, i.e., without human labeling. Developed by Google, it is pretrained with two objectives: masked language modeling (MLM) and next sentence prediction (NSP).
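
A quick way to see the masked language modeling objective in action is the fill-mask pipeline from transformers (a minimal sketch; the example sentence is arbitrary):

    from transformers import pipeline

    # Build a fill-mask pipeline backed by the bert-large-cased checkpoint
    unmasker = pipeline("fill-mask", model="bert-large-cased")

    # The model ranks candidate tokens for the [MASK] position
    print(unmasker("Paris is the [MASK] of France."))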

Architecture

The BERT-Large Cased model consists of:

  • 24 layers
  • 1024 hidden dimensions
  • 16 attention heads
  • 336 million parameters

The model distinguishes between uppercase and lowercase letters and aims to capture bidirectional representations of sentences.
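
These figures can be checked against the published configuration. The following sketch (assuming transformers and PyTorch are installed) loads the config and counts the parameters; the exact count depends on which heads are included:

    from transformers import BertConfig, BertModel

    # Inspect the architecture hyperparameters without downloading the weights
    config = BertConfig.from_pretrained("bert-large-cased")
    print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
    # Expected: 24 1024 16

    # Loading the weights lets you count the parameters yourself
    model = BertModel.from_pretrained("bert-large-cased")
    print(sum(p.numel() for p in model.parameters()))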

Training

Training Data

BERT was pretrained on:

  • BookCorpus: A dataset of 11,038 unpublished books
  • English Wikipedia (excluding lists, tables, and headers)

Training Procedure

  • Preprocessing: Texts are tokenized using WordPiece with a 30,000-token vocabulary; since this is the cased variant, text is not lowercased. Inputs are arranged as [CLS] Sentence A [SEP] Sentence B [SEP].
  • Masked Language Modeling: 15% of the tokens are masked; of those, 80% are replaced by [MASK], 10% by a random token, and 10% are left unchanged.
  • Pretraining: Conducted on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The optimizer was Adam with a learning rate of 1e-4 and a weight decay of 0.01, with learning rate warmup over the first 10,000 steps followed by linear decay. A sketch of the masking scheme follows this list.
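
The 80/10/10 masking split can be illustrated with a small standalone sketch (an illustration of the scheme described above, not the original BERT preprocessing code; the vocabulary and sentence are made up):

    import random

    def mask_tokens(tokens, vocab, mask_token="[MASK]", mlm_prob=0.15):
        """BERT-style masking: pick 15% of positions; of those,
        80% become [MASK], 10% a random token, 10% stay unchanged."""
        masked = list(tokens)
        labels = [None] * len(tokens)  # None = position not used in the MLM loss
        for i, tok in enumerate(tokens):
            if random.random() < mlm_prob:
                labels[i] = tok  # the model must predict the original token here
                r = random.random()
                if r < 0.8:
                    masked[i] = mask_token
                elif r < 0.9:
                    masked[i] = random.choice(vocab)
                # else: keep the original token unchanged
        return masked, labels

    tokens = "The cat sat on the mat".split()
    print(mask_tokens(tokens, vocab=["dog", "tree", "blue"]))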

Evaluation Results

When fine-tuned on downstream tasks such as SQuAD 1.1 and MultiNLI, BERT-Large Cased achieves strong results, for example an F1/EM score of 91.5/84.8 on SQuAD 1.1.
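
As a rough sketch of what fine-tuning looks like, a sequence-classification head can be placed on top of the pretrained encoder. The three labels and the example pair below are illustrative only (mirroring an NLI-style setup); dataset loading and the training loop are omitted:

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-large-cased")
    # A freshly initialized classification head is added on top of the encoder
    model = BertForSequenceClassification.from_pretrained("bert-large-cased", num_labels=3)

    # Sentence-pair input, as in NLI: premise and hypothesis
    inputs = tokenizer("A man is playing a guitar.", "A person makes music.", return_tensors="pt")
    labels = torch.tensor([0])  # e.g., 0 = entailment in this toy label scheme
    outputs = model(**inputs, labels=labels)
    print(outputs.loss, outputs.logits)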

Guide: Running Locally

  1. Installation: Ensure the transformers library is installed, along with PyTorch or TensorFlow.

    pip install transformers
    
  2. Using the Model: Load BERT for masked language modeling or feature extraction in PyTorch or TensorFlow. A short sketch of working with the returned embeddings follows this list.

    PyTorch:

    from transformers import BertTokenizer, BertModel
    # Load the cased WordPiece tokenizer and the pretrained encoder
    tokenizer = BertTokenizer.from_pretrained("bert-large-cased")
    model = BertModel.from_pretrained("bert-large-cased")
    # Tokenize the input and return PyTorch tensors
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors="pt")
    # Forward pass; output.last_hidden_state holds one vector per token
    output = model(**encoded_input)
    

    TensorFlow:

    from transformers import BertTokenizer, TFBertModel
    # Load the cased WordPiece tokenizer and the pretrained encoder
    tokenizer = BertTokenizer.from_pretrained("bert-large-cased")
    model = TFBertModel.from_pretrained("bert-large-cased")
    # Tokenize the input and return TensorFlow tensors
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors="tf")
    # Forward pass; output.last_hidden_state holds one vector per token
    output = model(encoded_input)
    
  3. Hardware Suggestions: For efficient processing, consider using cloud GPUs available through platforms like AWS, Google Cloud, or Azure.
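
As referenced in step 2, here is a small sketch of working with the returned embeddings, continuing from the PyTorch snippet above. Mean pooling is just one common way to get a sentence vector, not something this model card prescribes:

    # output.last_hidden_state has shape (batch_size, sequence_length, 1024)
    token_embeddings = output.last_hidden_state

    # Average the token embeddings to obtain a single sentence vector
    sentence_embedding = token_embeddings.mean(dim=1)
    print(sentence_embedding.shape)  # torch.Size([1, 1024])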

License

BERT-Large Cased is distributed under the Apache 2.0 License, allowing for open use, modification, and distribution.
