HuBERT-BASE-CC (SZTAKI-HLT)
Introduction
The HuBERT-BASE-CC model is a cased BERT model specifically developed for the Hungarian language. It has been trained on a filtered and deduplicated subset of the Hungarian Common Crawl data and a snapshot of the Hungarian Wikipedia. The model is designed for tasks like chunking and named entity recognition (NER), achieving state-of-the-art results in these areas.
Architecture
HuBERT-BASE-CC leverages the BERT architecture, adapted for the Hungarian language. As a cased model, it retains capitalization, which is important for the tasks it is optimized for, such as NER.
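As a minimal sketch of what "cased" means in practice, assuming the `transformers` library is installed (the word "Budapest" is simply an illustrative Hungarian proper noun), the tokenizer distinguishes capitalized and lowercased forms:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("SZTAKI-HLT/hubert-base-cc")

# A cased tokenizer preserves capitalization, so the two spellings
# below map to different token sequences -- a useful cue for NER.
print(tokenizer.tokenize("Budapest"))
print(tokenizer.tokenize("budapest"))
```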
Training
The model was trained using a combination of the Hungarian Common Crawl and Wikipedia datasets. The training procedure and initial results are detailed in a PhD thesis, with comprehensive evaluation results to be published in future research. Fine-tuning the model with `BertForTokenClassification` on chunking and NER tasks has shown it to surpass multilingual BERT, setting a new standard for these tasks.
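A minimal fine-tuning setup might look like the following sketch; the BIO label set here is hypothetical and for illustration only, since the actual labels come from whatever chunking or NER corpus is used:

```python
from transformers import BertTokenizerFast, BertForTokenClassification

# Hypothetical NER label scheme; replace with the labels of your dataset.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = BertTokenizerFast.from_pretrained("SZTAKI-HLT/hubert-base-cc")
model = BertForTokenClassification.from_pretrained(
    "SZTAKI-HLT/hubert-base-cc",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# From here, training follows the standard token-classification recipe:
# align per-token label ids with the tokenized input (using -100 for
# positions that should be ignored by the loss) and train, e.g. with Trainer.
```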
Guide: Running Locally
- Prerequisites: Ensure you have Python and the necessary libraries, such as `transformers` and `torch`, installed.
- Installation: Use pip to install the Hugging Face Transformers library:

  ```
  pip install transformers
  ```

- Load the Model: Use the Transformers library to load the HuBERT-BASE-CC model:

  ```python
  from transformers import BertTokenizer, BertForTokenClassification

  tokenizer = BertTokenizer.from_pretrained("SZTAKI-HLT/hubert-base-cc")
  model = BertForTokenClassification.from_pretrained("SZTAKI-HLT/hubert-base-cc")
  ```

- Inference: Tokenize your input text and perform inference using the model, as shown in the sketch after this list.
- Cloud GPUs: For optimal performance and faster processing, consider using cloud-based GPU services such as AWS EC2 with NVIDIA GPUs, Google Cloud Platform, or Azure.
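As a minimal sketch of the inference step, reusing the tokenizer and model loaded above and assuming `torch` is installed. The Hungarian example sentence is illustrative, and note that the token-classification head is randomly initialized until the model is fine-tuned, so the predicted label ids are not yet meaningful:

```python
import torch

# Illustrative Hungarian sentence ("Budapest is the capital of Hungary").
text = "Budapest Magyarország fővárosa."

# Tokenize and run a forward pass without gradient tracking.
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, num_labels)

# Pick the highest-scoring label id for each token.
predicted_ids = logits.argmax(dim=-1)
print(predicted_ids)
```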
License
The HuBERT-BASE-CC model is distributed under the Apache 2.0 license, allowing for wide usage and modification. Users are encouraged to review the license terms to ensure compliance in their applications.