hubert base cc

SZTAKI-HLT

Introduction

The HuBERT-BASE-CC model is a cased BERT model developed for the Hungarian language. It was trained on a filtered and deduplicated subset of the Hungarian Common Crawl and a snapshot of the Hungarian Wikipedia. When fine-tuned for chunking and named entity recognition (NER), it is reported to achieve state-of-the-art results on both tasks.

Architecture

HuBERT-BASE-CC uses the BERT architecture, trained on Hungarian text. As a cased model, it preserves capitalization, which is an important signal for the tasks it targets, such as NER.

Training

The model was trained on a combination of the Hungarian Common Crawl and Wikipedia datasets. The training procedure and initial results are described in a PhD thesis; comprehensive evaluation results are to be published in future research. When fine-tuned with BertForTokenClassification on chunking and NER, the model outperforms multilingual BERT on both tasks.
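Fine-tuning for token classification needs one label per subword token, while chunking and NER corpora are annotated per word. The source does not say how labels were aligned, so the sketch below shows one common convention rather than the authors' procedure: the first subword of each word carries the word's label, and every other position gets -100, the value that PyTorch's cross-entropy loss ignores by default. The function name `align_labels` and its arguments are hypothetical.

```python
# Label value skipped by torch.nn.CrossEntropyLoss by default (ignore_index=-100).
IGNORE_INDEX = -100


def align_labels(word_ids, word_label_ids):
    """Spread word-level label ids over subword tokens.

    word_ids: for each token, the index of the word it came from, or None for
              special tokens such as [CLS] and [SEP].
    word_label_ids: one integer label id per word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: never contributes to the loss.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First subword of a word: carries the word's label.
            aligned.append(word_label_ids[word_id])
        else:
            # Continuation subword: masked out of the loss.
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned
```

With a fast tokenizer, the `word_ids` list can be obtained from the encoding's `word_ids()` method after tokenizing pre-split words with `is_split_into_words=True`.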

Guide: Running Locally

  1. Prerequisites: Ensure you have Python and the necessary libraries, such as transformers and torch.

  2. Installation: Use pip to install the Hugging Face Transformers library and PyTorch:

    pip install transformers torch
    
  3. Load the Model: Use the Transformers library to load the HuBERT-BASE-CC model:

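    from transformers import BertTokenizer, BertForTokenClassification

    # Downloads the tokenizer and weights from the Hugging Face Hub on first use.
    tokenizer = BertTokenizer.from_pretrained("SZTAKI-HLT/hubert-base-cc")
    # The token-classification head is newly initialized when loading this
    # checkpoint, so fine-tune the model before relying on its predictions.
    model = BertForTokenClassification.from_pretrained("SZTAKI-HLT/hubert-base-cc")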
    
  4. Inference: Tokenize your input text and perform inference using the model.

  5. Cloud GPUs: For optimal performance and faster processing, consider using cloud-based GPU services such as AWS EC2 with NVIDIA GPUs, Google Cloud Platform, or Azure.
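Steps 4 and 5 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `model_dir` is a hypothetical path to a checkpoint that has already been fine-tuned for token classification, the function names are made up for this example, and `BertTokenizerFast` is used in place of `BertTokenizer` because the `word_ids()` mapping is only available on fast tokenizers. The heavy imports sit inside the function so the helper can be used without them.

```python
def first_subword_labels(word_ids, token_labels):
    """Keep one predicted label per word: the label of its first subword."""
    word_labels = []
    previous = None
    for word_id, label in zip(word_ids, token_labels):
        # Skip special tokens (None) and continuation subwords.
        if word_id is not None and word_id != previous:
            word_labels.append(label)
        previous = word_id
    return word_labels


def predict_word_tags(words, model_dir):
    """Tag a list of pre-split words with a fine-tuned checkpoint.

    model_dir is a hypothetical path or Hub id of a fine-tuned model.
    """
    import torch
    from transformers import BertForTokenClassification, BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained(model_dir)
    model = BertForTokenClassification.from_pretrained(model_dir)
    model.eval()

    # Use a GPU when one is available, otherwise fall back to the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    enc = tokenizer(
        words, is_split_into_words=True, return_tensors="pt", truncation=True
    )
    word_ids = enc.word_ids(0)  # token-to-word mapping for the only sequence
    enc = enc.to(device)

    with torch.no_grad():
        logits = model(**enc).logits  # shape: (1, num_tokens, num_labels)

    pred_ids = logits.argmax(dim=-1)[0].tolist()
    token_labels = [model.config.id2label[i] for i in pred_ids]
    return first_subword_labels(word_ids, token_labels)
```

A call such as `predict_word_tags(["Budapest", "szép", "város"], model_dir)` would return one tag per input word; the tag names depend on the label set used during fine-tuning.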

License

The HuBERT-BASE-CC model is distributed under the Apache 2.0 license, allowing for wide usage and modification. Users are encouraged to review the license terms to ensure compliance in their applications.
