MosaicBERT-Base

MosaicML

Introduction

MosaicBERT-Base is a custom BERT architecture optimized for fast pretraining, achieving higher pretraining and finetuning accuracy than Hugging Face's bert-base-uncased. It incorporates architectural choices such as FlashAttention, ALiBi, and Gated Linear Units, and is pretrained on the C4 dataset (Colossal Cleaned Common Crawl), a curated collection of text documents scraped from the internet.

Architecture

MosaicBERT-Base includes several modifications to the traditional BERT architecture:

  1. FlashAttention: Reduces read/write operations between GPU high-bandwidth memory and on-chip SRAM, speeding up the attention computation.
  2. Attention with Linear Biases (ALiBi): Replaces position embeddings with a bias added to the attention scores, allowing the model to extrapolate to sequences longer than those seen during training.
  3. Unpadding: Concatenates the non-padded tokens of a batch into a single sequence so that no compute is wasted on padding tokens.
  4. Low Precision LayerNorm: Utilizes float16 or bfloat16 precision for LayerNorm modules.
  5. Gated Linear Units (GLU): Augments the feedforward layers with an additional gating matrix, improving accuracy (see the sketch after this list).
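
For illustration, here is a minimal PyTorch sketch of a GLU-style feedforward block of the kind described in item 5. The class name, layer sizes, and choice of GELU activation are assumptions for the example, not MosaicBERT's exact implementation.

```python
import torch
import torch.nn as nn

class GLUFeedForward(nn.Module):
    """Feedforward block with an extra gating matrix (a GLU variant) -- a sketch,
    not MosaicBERT's exact layer. Sizes follow the usual BERT-Base defaults."""

    def __init__(self, d_model: int = 768, d_ff: int = 3072):
        super().__init__()
        self.up_proj = nn.Linear(d_model, d_ff)    # standard up-projection
        self.gate_proj = nn.Linear(d_model, d_ff)  # additional gating matrix
        self.down_proj = nn.Linear(d_ff, d_model)  # back to the model dimension
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The activated path is modulated elementwise by the gate before projecting down
        return self.down_proj(self.act(self.up_proj(x)) * self.gate_proj(x))
```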

Training

MosaicBERT is trained with the standard Masked Language Modeling (MLM) objective, along with several training optimizations:

  • MosaicML Streaming Dataset: Utilizes the C4 dataset in a streaming format.
  • Higher Masking Ratio: Masks 30% of tokens rather than the usual 15%, which improves accuracy (a minimal illustration follows this list).
  • Bfloat16 Precision: Uses mixed precision training for stability.
  • Vocabulary Size: Adjusted to be a multiple of 64 for throughput speedup.
  • Hyperparameters: Includes Decoupled AdamW optimizer, specific learning rate schedules, and dropout configurations.
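
As a minimal illustration of the 30% masking ratio, the same setting can be expressed with Hugging Face's MLM data collator. This is a demonstration under assumed tooling (a standard BERT tokenizer and the Transformers collator), not MosaicML's actual training pipeline.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# A standard BERT tokenizer, used here purely for demonstration
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Mask 30% of input tokens instead of the usual 15%
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,
)
```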

Guide: Running Locally

  1. Install Dependencies: Ensure Python, PyTorch, and Transformers library are installed.
  2. Load Model: Use AutoModelForMaskedLM from the Transformers library with the configuration for MosaicBERT.
  3. Enable ALiBi: Adjust the alibi_starting_size in the configuration for longer sequence extrapolation.
  4. Run Inference: Use the model with the fill-mask pipeline for masked language tasks (a combined example follows these steps).
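
The following sketch combines steps 2 through 4. The Hub ID mosaicml/mosaic-bert-base, the use of trust_remote_code=True (needed when a model's architecture code ships with the repository rather than core Transformers), and the example sentence are assumptions for illustration; the alibi_starting_size parameter is the one referenced in step 3.

```python
from transformers import AutoModelForMaskedLM, BertConfig, BertTokenizer, pipeline

model_id = "mosaicml/mosaic-bert-base"  # assumed Hugging Face Hub ID

# MosaicBERT uses a standard BERT-style tokenizer
tokenizer = BertTokenizer.from_pretrained(model_id)

# Grow the ALiBi bias matrix to extrapolate to longer sequences (step 3)
config = BertConfig.from_pretrained(model_id)
config.alibi_starting_size = 1024

# Custom architecture code lives in the model repo, hence trust_remote_code
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    config=config,
    trust_remote_code=True,
)

# Masked language modeling inference via the fill-mask pipeline (step 4)
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("The capital of France is [MASK]."))
```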

For optimal performance, consider using cloud GPUs like those provided by AWS, Google Cloud, or Azure.

License

MosaicBERT-Base is released under the Apache-2.0 license, allowing free use and modification under specified conditions.
