roberta-urdu-small
Introduction
roberta-urdu-small is a language model built specifically for Urdu. Developed by urduhack and hosted on the Hugging Face Hub, it uses the RoBERTa architecture to perform fill-mask tasks in Urdu.
Architecture
The roberta-urdu-small model is based on the RoBERTa architecture and has roughly 125 million parameters. It is available through the Transformers library, with support for PyTorch and JAX, and is tailored to processing and understanding Urdu text.
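As a quick sanity check, the reported parameter count can be verified once the checkpoint is downloaded. The snippet below is a minimal sketch (not part of the model card) that loads the checkpoint with the generic Transformers masked-LM loader and counts its parameters:

```python
# Sketch: count the parameters of the downloaded checkpoint.
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("urduhack/roberta-urdu-small")
n_params = sum(p.numel() for p in model.parameters())
print(f"~{n_params / 1e6:.0f}M parameters")  # expected to be close to 125M
```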
Training
The model was trained on a corpus of Urdu news data collected from various Pakistani sources. Before training, the data was normalized with the normalization module of the Urduhack library to strip out characters from other languages, such as Arabic.
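The snippet below sketches that preprocessing step, assuming the `normalize()` function of the urduhack package's normalization module (`pip install urduhack`); the example string is illustrative only:

```python
# Sketch of the normalization step described above (assumed urduhack API).
from urduhack.normalization import normalize

raw = "اردو ٹيکسٹ"      # contains the Arabic yeh (ي) instead of the Urdu yeh (ی)
print(normalize(raw))   # non-Urdu code points mapped to their Urdu equivalents
```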
Guide: Running Locally
To run the roberta-urdu-small model locally, follow these steps:
- Install the Transformers library: ensure you have the Hugging Face Transformers library installed. You can do this with `pip install transformers`.
- Load the model: use the snippet below to build a fill-mask pipeline (see the usage sketch after this list):

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="urduhack/roberta-urdu-small",
    tokenizer="urduhack/roberta-urdu-small",
)
```

- Utilize cloud GPUs: for faster inference, consider using cloud-based GPUs available on platforms such as AWS, Google Cloud, or Azure.
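Once the pipeline is loaded, it can be called on a sentence containing the RoBERTa `<mask>` token. The example below is hypothetical; the sentence and mask placement are illustrative, not from the model card:

```python
# Hypothetical usage: predict the masked token in an Urdu sentence.
masked = "یہ <mask> زبان کا ماڈل ہے۔"   # "This is a <mask> language model."
for prediction in fill_mask(masked):
    print(prediction["token_str"], round(prediction["score"], 3))
```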
License
The roberta-urdu-small model is released under the MIT License, allowing for broad use and distribution. More details can be found in the license file.