dynamic_tinybert Model

Introduction

Dynamic-TinyBERT is a compact BERT-based model for question answering. It uses dynamic sequence length and hyperparameter optimization to improve inference efficiency while keeping accuracy close to that of larger BERT models. Developed by Intel, the model reports a speedup of up to 3.3x with minimal loss in accuracy.

Architecture

Dynamic-TinyBERT is based on the TinyBERT6L architecture, consisting of:

6 layers
Hidden size of 768
Feed-forward size of 3072
12 attention heads

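This configuration keeps the width of BERT-base (hidden size 768, feed-forward size 3072, 12 attention heads) while halving the depth from 12 layers to 6, which roughly halves the encoder's compute per token.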
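
To make the size of this configuration concrete, the following is a rough parameter-count sketch for a BERT-style encoder with the dimensions listed above. The vocabulary size (30,522), maximum position count (512), and number of segment types (2) are assumptions borrowed from BERT-base uncased and are not stated in this document; the function names are illustrative only. The task head and pooler are left out.

```python
# Rough parameter count for a BERT-style encoder.
# ASSUMPTIONS (not from this document): vocab=30522, positions=512, segments=2,
# as in BERT-base uncased. Pooler and question-answering head are not counted.

def encoder_layer_params(hidden: int, ffn: int) -> int:
    """Parameters in one transformer encoder layer (weights and biases)."""
    attention = 4 * (hidden * hidden + hidden)            # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)                        # two LayerNorms, scale and shift each
    return attention + feed_forward + layer_norms


def embedding_params(hidden: int, vocab: int = 30522,
                     positions: int = 512, segments: int = 2) -> int:
    """Parameters in the token, position and segment embeddings plus LayerNorm."""
    tables = (vocab + positions + segments) * hidden
    layer_norm = 2 * hidden
    return tables + layer_norm


LAYERS, HIDDEN, FFN = 6, 768, 3072                        # values listed above
per_layer = encoder_layer_params(HIDDEN, FFN)
total = LAYERS * per_layer + embedding_params(HIDDEN)
print(f"per layer: {per_layer:,}")
print(f"encoder + embeddings: {total:,}")
```

The number of attention heads does not appear in the count: the 12 heads split the 768-dimensional hidden state into 64-dimensional slices without adding parameters.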

Training

The model is fine-tuned on the SQuAD 1.1 dataset. Training involves:

Starting with a pre-trained general-TinyBERT student model.
Employing transformer distillation from a fine-tuned BERT teacher model.
Utilizing intermediate-layer distillation (ID) and prediction-layer distillation (PD) to capture knowledge from the teacher model.
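
The two distillation objectives named above can be sketched in plain Python. This is a simplified illustration, not the training code used for the model: intermediate-layer distillation is shown as a mean squared error between one student vector and one teacher vector (any learned projection and attention-matrix term are omitted), and prediction-layer distillation is shown as a soft cross-entropy between teacher and student logits. The function names and the `temperature` parameter are assumptions made for this sketch.

```python
import math

# Simplified sketch of the two distillation losses (illustrative names only).

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_sum for z in scaled]


def intermediate_loss(student_hidden, teacher_hidden):
    """ID term: mean squared error between student and teacher hidden vectors."""
    diffs = [(s - t) ** 2 for s, t in zip(student_hidden, teacher_hidden)]
    return sum(diffs) / len(diffs)


def prediction_loss(student_logits, teacher_logits, temperature=1.0):
    """PD term: cross-entropy of student log-probabilities against teacher probabilities."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    return -sum(p * lp for p, lp in zip(p_teacher, log_p_student))


# Toy values: the student matches the teacher on hidden states but not on logits.
print(intermediate_loss([0.5, -1.0], [0.5, -1.0]))
print(prediction_loss([0.0, 2.0], [2.0, 0.0]))
```

The prediction loss is smallest when the student's distribution matches the teacher's, which is what pushes the student toward the teacher's start and end predictions.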

The reported maximum F1 score on SQuAD 1.1 is 88.71, with a speedup of up to 3.3x over larger BERT models.

Guide: Running Locally

Follow these steps to use Dynamic-TinyBERT locally:

Install Dependencies: Ensure you have Python and PyTorch installed. Install the Hugging Face Transformers library:
```
pip install transformers
```

Import the Model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Download (or load from cache) the tokenizer and the question-answering model
tokenizer = AutoTokenizer.from_pretrained("Intel/dynamic_tinybert")
model = AutoModelForQuestionAnswering.from_pretrained("Intel/dynamic_tinybert")
```

Prepare Input Data:

```python
context = "remember the number 123456, I'll ask you later."
question = "What is the number I told you?"

# Encode the question/context pair; calling the tokenizer directly replaces the older encode_plus
tokens = tokenizer(question, context, return_tensors="pt", truncation=True)
input_ids = tokens["input_ids"]
attention_mask = tokens["attention_mask"]
```

Run Inference:

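```python
# Disable gradient tracking; it is not needed for inference
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
start_scores = outputs.start_logits
end_scores = outputs.end_logits

# Take the most likely start and end token positions (end is made exclusive for slicing)
answer_start = torch.argmax(start_scores)
answer_end = torch.argmax(end_scores) + 1
answer_ids = input_ids[0][answer_start:answer_end]
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(answer_ids))

print("Answer:", answer)
```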
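
The inference step picks the start and end positions independently, so the end can land before the start and produce an empty answer. A common remedy is to search for the highest-scoring pair with start <= end. The helper below is a minimal sketch of that idea in plain Python; `best_span` and `max_answer_len` are hypothetical names for this illustration and are not part of the Transformers API, and the sketch does not restrict the span to context tokens.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) token indices, end inclusive, that maximize
    start_scores[start] + end_scores[end] subject to start <= end
    and a span of at most max_answer_len tokens."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Toy scores where independent argmax would give start=2 and end=0 (an invalid span).
toy_start = [0.1, 0.2, 5.0, 0.3]
toy_end = [4.0, 0.1, 0.2, 3.0]
start, end = best_span(toy_start, toy_end)
print(start, end)
```

With the model outputs, the score lists could be obtained as `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`, and the answer tokens sliced as `input_ids[0][start:end + 1]`.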

Cloud GPUs

When processing large datasets or large batches, running the model on a GPU is usually faster than running it on a CPU. Cloud GPU instances from AWS, Google Cloud, or Azure are one option.

License

Dynamic-TinyBERT is distributed under the Apache 2.0 License, which allows for both commercial and non-commercial use, modification, and distribution.