FERNET-C5
fav-kky/FERNET-C5
Introduction
FERNET-C5 (Flexible Embedding Representation NETwork) is a monolingual Czech BERT-base model pre-trained on 93 GB of the Czech Colossal Clean Crawled Corpus (C5). It is designed to provide robust text classification capabilities for the Czech language. More details can be found in the referenced paper.
Architecture
FERNET-C5 utilizes the BERT-base architecture, focusing on monolingual text processing for Czech. The model is tailored to handle various text classification tasks effectively, leveraging a large volume of pre-training data.
Training
The model is pre-trained on the Czech Colossal Clean Crawled Corpus (C5), a 93 GB web-crawled text dataset. This broad coverage of written Czech is what enables FERNET-C5 to perform well on downstream Czech text classification tasks.
Guide: Running Locally
To run FERNET-C5 locally, follow these steps:
- Install Python and dependencies: Ensure Python is installed on your system, then install the necessary packages using pip:

  pip install torch transformers

- Download the model: Access the FERNET-C5 model from the Hugging Face model hub and download it.

- Load the model: Use the Transformers library to load and initialize the model:

  from transformers import AutoModelForMaskedLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("fav-kky/FERNET-C5")
  model = AutoModelForMaskedLM.from_pretrained("fav-kky/FERNET-C5")

- Inference: Use the model for inference on your text data.
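The steps above can be sketched end to end with a masked-token prediction, since the checkpoint is loaded as a masked language model. This is a minimal sketch, assuming the fav-kky/FERNET-C5 weights can be fetched from the Hugging Face Hub; the example Czech sentence is illustrative, not from the original card.

```python
# Sketch: masked-token prediction with FERNET-C5.
# Assumes the fav-kky/FERNET-C5 checkpoint is downloadable from the Hub.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("fav-kky/FERNET-C5")
model = AutoModelForMaskedLM.from_pretrained("fav-kky/FERNET-C5")
model.eval()

# Czech sentence with one masked token:
# "The capital of the Czech Republic is [MASK]."
text = f"Hlavní město České republiky je {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and take the highest-scoring token.
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
prediction = tokenizer.decode(predicted_id)
print(prediction)
```

For fine-tuning on a classification task, the same checkpoint would instead be loaded with AutoModelForSequenceClassification, which replaces the masked-LM head with a classification head.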
Cloud GPUs: For improved performance, consider using cloud services such as AWS, GCP, or Azure, which provide access to powerful GPUs suitable for running deep learning models.
License
FERNET-C5 is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license. This allows for sharing and adapting the model for non-commercial purposes, provided appropriate credit is given and derivative works are licensed under identical terms.