DistilBERT Tagalog Base Cased

jcblaise

Introduction

DISTILBERT-TAGALOG-BASE-CASED is a distilled version of BERT tailored to the Tagalog language. The model was developed as part of a research initiative to improve Natural Language Processing (NLP) capabilities for the Filipino community. Although it is now deprecated, it remains available for use; the newer jcblaise/roberta-tagalog-base and jcblaise/roberta-tagalog-large models are recommended for better performance.

Architecture

The model is based on DistilBERT, a lighter variant of BERT designed to be more efficient while retaining most of the original's performance. It was distilled from the bert-tagalog-base-cased model and targets the Tagalog language specifically.

Training

Training details and methodologies are outlined in associated research papers. The model was trained using data relevant to the Tagalog language, with scripts and utilities available for reference and use in the Filipino-Text-Benchmarks GitHub repository.

Guide: Running Locally

To use the model locally, follow these basic steps:

  1. Install Hugging Face Transformers:

    pip install transformers
    
  2. Load the Model and Tokenizer:

    • For TensorFlow:

      from transformers import TFAutoModel, AutoTokenizer
      
      model = TFAutoModel.from_pretrained('jcblaise/distilbert-tagalog-base-cased', from_pt=True)
      tokenizer = AutoTokenizer.from_pretrained('jcblaise/distilbert-tagalog-base-cased', do_lower_case=False)
      
    • For PyTorch:

      from transformers import AutoModel, AutoTokenizer
      
      model = AutoModel.from_pretrained('jcblaise/distilbert-tagalog-base-cased')
      tokenizer = AutoTokenizer.from_pretrained('jcblaise/distilbert-tagalog-base-cased', do_lower_case=False)
      
  3. Consider Using Cloud GPUs: For more efficient training and inference, consider using cloud GPU services like AWS, Google Cloud, or Azure.
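Once the model and tokenizer are loaded as in step 2, a common use is turning sentences into fixed-size embeddings by mean-pooling the token representations. The sketch below keeps things offline by building a tiny randomly initialised DistilBERT (the config values shown are illustrative, not the checkpoint's real dimensions); swap in `AutoModel.from_pretrained('jcblaise/distilbert-tagalog-base-cased')` and the matching tokenizer to use the actual weights.

```python
import torch
from transformers import DistilBertConfig, DistilBertModel

# Tiny, randomly initialised DistilBERT so the sketch runs without downloads.
# For the real checkpoint, use instead:
#   model = AutoModel.from_pretrained('jcblaise/distilbert-tagalog-base-cased')
config = DistilBertConfig(vocab_size=100, dim=32, hidden_dim=64,
                          n_layers=2, n_heads=2)
model = DistilBertModel(config)
model.eval()

# Stand-in token ids; in practice these come from the tokenizer, e.g.
#   batch = tokenizer(sentences, padding=True, return_tensors='pt')
input_ids = torch.randint(0, 100, (2, 8))       # batch of 2, seq length 8
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    out = model(input_ids=input_ids, attention_mask=attention_mask)

# Mean-pool token embeddings into one vector per sentence, ignoring padding.
mask = attention_mask.unsqueeze(-1).float()
embeddings = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
print(embeddings.shape)  # torch.Size([2, 32])
```

The same pooling code works unchanged with the pretrained checkpoint; only the hidden size of the resulting vectors changes.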

License

This model is open-sourced under the GPL-3.0 license; it may be freely used and modified subject to the terms of that license.
