distilbert-tagalog-base-cased
Introduction
DISTILBERT-TAGALOG-BASE-CASED is a distilled version of BERT tailored for the Tagalog language. The model was developed as part of a research initiative to improve Natural Language Processing (NLP) capabilities for the Filipino community. Although it is now deprecated, it remains available for use; for better performance, the newer jcblaise/roberta-tagalog-base and jcblaise/roberta-tagalog-large models are recommended instead.
Architecture
The model is based on DistilBERT, a lighter version of BERT designed for efficiency while maintaining performance. It was distilled from the bert-tagalog-base-cased model and targets the Tagalog language specifically.
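As a quick sanity check, the distilled architecture can be inspected directly from the configuration shipped with the checkpoint. The snippet below is a minimal sketch; the attribute names assume a standard DistilBertConfig.

from transformers import AutoConfig

# Load the configuration published with the checkpoint
config = AutoConfig.from_pretrained('jcblaise/distilbert-tagalog-base-cased')

# DistilBertConfig exposes the layer count and hidden size as n_layers and dim
print(config.model_type)            # expected: "distilbert"
print(config.n_layers, config.dim)  # transformer layers and hidden dimension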
Training
Training details and methodologies are described in the associated research papers. The model was trained on Tagalog-language data, and the training scripts and utilities are available in the Filipino-Text-Benchmarks GitHub repository.
Guide: Running Locally
To use the model locally, follow these basic steps:
- Install Hugging Face Transformers:

  pip install transformers

- Load the model and tokenizer.

  For TensorFlow:

  from transformers import TFAutoModel, AutoTokenizer

  model = TFAutoModel.from_pretrained('jcblaise/distilbert-tagalog-base-cased', from_pt=True)
  tokenizer = AutoTokenizer.from_pretrained('jcblaise/distilbert-tagalog-base-cased', do_lower_case=False)

  For PyTorch:

  from transformers import AutoModel, AutoTokenizer

  model = AutoModel.from_pretrained('jcblaise/distilbert-tagalog-base-cased')
  tokenizer = AutoTokenizer.from_pretrained('jcblaise/distilbert-tagalog-base-cased', do_lower_case=False)

- Consider using cloud GPUs: services such as AWS, Google Cloud, or Azure can make inference and any further training more efficient. A usage sketch follows this list.
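Once the model and tokenizer are loaded, the checkpoint can be used as a feature extractor or fine-tuned for downstream Filipino NLP tasks. The PyTorch sketch below shows a minimal feature-extraction pass; the sample sentence is purely illustrative.

import torch
from transformers import AutoModel, AutoTokenizer

# Load the checkpoint and its cased tokenizer
model = AutoModel.from_pretrained('jcblaise/distilbert-tagalog-base-cased')
tokenizer = AutoTokenizer.from_pretrained('jcblaise/distilbert-tagalog-base-cased', do_lower_case=False)

# Tokenize an example Tagalog sentence (illustrative text)
inputs = tokenizer("Magandang umaga sa inyong lahat.", return_tensors="pt")

# Forward pass without gradient tracking to obtain contextual embeddings
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)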
License
This model is open-sourced under the GPL-3.0 license, which permits broad use and modification subject to the terms of that license.