umberto-commoncrawl-cased-v1

Musixmatch

Introduction
UmBERTo is a Roberta-based language model designed for processing the Italian language. It leverages large Italian corpora and employs innovative techniques such as SentencePiece and Whole Word Masking. The model is part of Musixmatch's research efforts and is available through the Hugging Face platform.
Architecture
The architecture of UmBERTo is based on the Roberta model, utilizing SentencePiece for tokenization and Whole Word Masking for training. This approach allows for effective handling of the Italian language, enhancing the model's ability to perform various natural language processing tasks.
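The SentencePiece tokenizer can be inspected directly through the Hugging Face `transformers` API. The following is a minimal sketch; the subword splits mentioned in the comment are illustrative rather than guaranteed.

```python
from transformers import AutoTokenizer

# Load the SentencePiece-based tokenizer shipped with the checkpoint
tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")

# Split an Italian sentence into SentencePiece subword units
tokens = tokenizer.tokenize("Umberto Eco è stato un grande scrittore")
print(tokens)  # a list of subword pieces, e.g. starting with '▁Umberto' (illustrative)
```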
Training
The model was trained on the Italian subcorpus of OSCAR, a large dataset comprising 70 GB of deduplicated text, corresponding to about 210 million sentences and 11 billion words. The data was filtered and shuffled to optimize its use for NLP research. The pre-trained checkpoint, `umberto-commoncrawl-cased-v1`, uses a vocabulary size of 32,000 and was trained for 125,000 steps.
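As a quick sanity check, the vocabulary size can be read from the published configuration on the Hugging Face Hub. A minimal sketch, assuming the checkpoint is reachable under `Musixmatch/umberto-commoncrawl-cased-v1`; the exact number printed may differ slightly due to added special tokens.

```python
from transformers import AutoConfig, AutoTokenizer

# Read the published configuration and tokenizer to check the reported vocabulary size
config = AutoConfig.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")

print(config.vocab_size)     # expected to be on the order of 32,000
print(tokenizer.vocab_size)
```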
Guide: Running Locally
Basic Steps
- Installation: Ensure that the `transformers` and `torch` libraries are installed in your Python environment.

  ```bash
  pip install transformers torch
  ```
- Load Model and Tokenizer:

  ```python
  import torch
  from transformers import AutoTokenizer, AutoModel

  tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
  umberto = AutoModel.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
  ```
- Use the Model: Encode text and obtain outputs.

  ```python
  encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
  input_ids = torch.tensor(encoded_input).unsqueeze(0)  # Batch size 1
  outputs = umberto(input_ids)
  last_hidden_states = outputs[0]
  ```
- Predict Masked Token:

  ```python
  from transformers import pipeline

  fill_mask = pipeline(
      "fill-mask",
      model="Musixmatch/umberto-commoncrawl-cased-v1",
      tokenizer="Musixmatch/umberto-commoncrawl-cased-v1",
  )
  result = fill_mask("Umberto Eco è <mask> un grande scrittore")
  ```
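The fill-mask pipeline returns a list of candidate completions, each a dictionary containing the predicted token string, its score, and the completed sequence. A minimal sketch of how the `result` from the last step might be inspected:

```python
# Each prediction is a dict with keys such as 'sequence', 'score', and 'token_str'
for prediction in result:
    print(f"{prediction['token_str']}\t{prediction['score']:.4f}\t{prediction['sequence']}")
```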
Suggestion for Cloud GPUs
To accelerate computations, consider using cloud-based GPU services like AWS EC2, Google Cloud Platform, or Azure, which provide scalable resources for handling large models efficiently.
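On such an instance, the only change to the steps above is moving the model and the encoded inputs onto the GPU. A minimal sketch, assuming CUDA is available and that `umberto` and `input_ids` were created as in the guide above:

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

umberto = umberto.to(device)
input_ids = input_ids.to(device)

with torch.no_grad():  # inference only, no gradients needed
    outputs = umberto(input_ids)
last_hidden_states = outputs[0]
```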
License
The original datasets used for training are publicly available or were released with permission under CC0 or CC-BY licenses. For further details, refer to the respective dataset repositories and documentation.