mdeberta v3 base

microsoft

Introduction

DeBERTaV3 improves on DeBERTa by using ELECTRA-style pre-training with Gradient-Disentangled Embedding Sharing, which raises performance on downstream tasks. DeBERTa itself builds on disentangled attention and an enhanced mask decoder, and it outperforms earlier models such as RoBERTa on a majority of natural language understanding (NLU) tasks.

Architecture

mDeBERTa is the multilingual version of DeBERTa and uses the same architecture. The base model has 12 layers and a hidden size of 768. It has 86 million backbone parameters, and its vocabulary of 250,000 tokens adds about 190 million parameters in the embedding layer. The model was trained on the CC100 multilingual dataset.
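
The embedding figure follows from the vocabulary size and the hidden size. The sketch below redoes that arithmetic in the shell, assuming one 768-dimensional vector per token and using the rounded figures quoted above (250,000 tokens, 86 million backbone parameters), so the results are approximations rather than exact counts.

```shell
# Rounded figures from the description above.
vocab_size=250000
hidden_size=768
backbone_params=86000000

# One hidden_size-dimensional vector per vocabulary entry.
embedding_params=$((vocab_size * hidden_size))
total_params=$((backbone_params + embedding_params))

echo "Embedding parameters: ${embedding_params}"   # 192000000, roughly the quoted 190M
echo "Approximate total:    ${total_params}"       # 278000000
```

Under these assumptions the embedding layer holds more than twice as many parameters as the Transformer backbone, which is typical for models with a large multilingual vocabulary.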

Training

The model was fine-tuned and evaluated on the XNLI dataset in the zero-shot cross-lingual transfer setting: it is trained on English data only and then tested on the other languages without further training. In these evaluations, mDeBERTa-base showed notable improvements over XLM-R-base.
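
As a rough illustration of the zero-shot setting, the sketch below prints one evaluation command per XNLI language for a model fine-tuned on English only. It is a dry run that executes nothing. It reuses the `run_xnli.py` script and the `--language` flag from the guide later in this document. The checkpoint path and the per-language output directories are hypothetical.

```shell
# Hypothetical path to a checkpoint fine-tuned on English XNLI data.
checkpoint_dir="ds_results"
# The 15 language codes covered by XNLI.
eval_languages="ar bg de el en es fr hi ru sw th tr ur vi zh"

count=0
for lang in $eval_languages; do
  # Dry run: print the command instead of executing it.
  echo python run_xnli.py \
    --model_name_or_path "$checkpoint_dir" \
    --language "$lang" \
    --do_eval \
    --max_seq_length 256 \
    --output_dir "${checkpoint_dir}/eval_${lang}"
  count=$((count + 1))
done
echo "Prepared ${count} evaluation commands"
```

Removing the `echo` in front of `python` would run the evaluations for real, assuming the checkpoint exists and the script is in the current directory.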

Guide: Running Locally

  1. Install Prerequisites: Ensure Python, PyTorch, and the Transformers library are installed.
  2. Clone Transformers Repository:
    git clone https://github.com/huggingface/transformers
    cd transformers/examples/pytorch/text-classification/
    
  3. Install Datasets:
    pip install datasets
    
  4. Set up Training:
    • Configure distributed training with multiple GPUs.
    • Set shell variables for the output directory, the number of GPUs, and the per-device batch size.
    • Run the training script with specified parameters for model name, task, languages, and training strategy.
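    # Output directory, number of GPUs on this machine, and per-GPU batch size.
    output_dir="ds_results"
    num_gpus=8
    batch_size=4
    
    # TASK_NAME is not defined in this snippet; set it before running.
    # Note: newer PyTorch releases provide torchrun as the replacement
    # for torch.distributed.launch.
    python -m torch.distributed.launch --nproc_per_node="${num_gpus}" \
      run_xnli.py \
      --model_name_or_path microsoft/mdeberta-v3-base \
      --task_name "${TASK_NAME:?set TASK_NAME before running}" \
      --do_train \
      --do_eval \
      --train_language en \
      --language en \
      --evaluation_strategy steps \
      --max_seq_length 256 \
      --warmup_steps 3000 \
      --per_device_train_batch_size "${batch_size}" \
      --learning_rate 2e-5 \
      --num_train_epochs 6 \
      --output_dir "$output_dir" \
      --overwrite_output_dir \
      --logging_steps 1000 \
      --logging_dir "$output_dir"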
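
With 8 GPUs and a per-device batch size of 4, the command above trains with a global batch of 32 examples per step, assuming no gradient accumulation. The sketch below shows one way to keep that global batch on a smaller machine by computing a value for the Trainer's `--gradient_accumulation_steps` option. The `available_gpus` value is a hypothetical example, and the integer division assumes the global batch divides evenly.

```shell
# Settings from the training command above.
reference_gpus=8
per_device_batch=4
# Hypothetical: the number of GPUs actually available.
available_gpus=2

# Examples per optimizer step in the reference setup.
global_batch=$((reference_gpus * per_device_batch))
# Examples per forward/backward pass on the smaller machine.
per_step=$((available_gpus * per_device_batch))
# Accumulation steps needed to reach the same global batch.
accum_steps=$((global_batch / per_step))

echo "Global batch size: ${global_batch}"                              # 32
echo "Suggested --gradient_accumulation_steps: ${accum_steps}"         # 4
```

Matching the global batch size keeps the learning rate and warmup settings comparable to the reference run, although results may still differ slightly between hardware setups.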
    

Cloud GPUs Recommendation: Consider using cloud-based GPU services like AWS SageMaker, Google Cloud AI Platform, or Azure Machine Learning for efficient training.

License

The DeBERTaV3 model is released under the MIT License, making it free to use, modify, and distribute.
