umberto-commoncrawl-cased-v1

Musixmatch

Introduction
UmBERTo is a Roberta-based language model designed for processing the Italian language. It leverages large Italian corpora and employs innovative techniques such as SentencePiece and Whole Word Masking. The model is part of Musixmatch's research efforts and is available through the Hugging Face platform.
Architecture
The architecture of UmBERTo is based on the Roberta model, utilizing SentencePiece for tokenization and Whole Word Masking for training. This approach allows for effective handling of the Italian language, enhancing the model's ability to perform various natural language processing tasks.
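The SentencePiece tokenizer can be inspected directly through the Hugging Face `transformers` API. The following is a minimal sketch; the subword splits mentioned in the comment are illustrative rather than guaranteed.

```python
from transformers import AutoTokenizer

# Load the SentencePiece-based tokenizer shipped with the checkpoint
tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")

# Split an Italian sentence into SentencePiece subword units
tokens = tokenizer.tokenize("Umberto Eco è stato un grande scrittore")
print(tokens)  # a list of subword pieces, e.g. starting with '▁Umberto' (illustrative)
```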
Training
The model was trained on the Italian subcorpus of OSCAR, a large dataset comprising 70 GB of deduplicated text, corresponding to about 210 million sentences and 11 billion words. The data was filtered and shuffled to optimize its use for NLP research. The pre-trained checkpoint, `umberto-commoncrawl-cased-v1`, uses a vocabulary size of 32,000 and was trained for 125,000 steps.
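As a quick sanity check, the vocabulary size can be read from the published configuration on the Hugging Face Hub. A minimal sketch, assuming the checkpoint is reachable under `Musixmatch/umberto-commoncrawl-cased-v1`; the exact number printed may differ slightly due to added special tokens.

```python
from transformers import AutoConfig, AutoTokenizer

# Read the published configuration and tokenizer to check the reported vocabulary size
config = AutoConfig.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")

print(config.vocab_size)     # expected to be on the order of 32,000
print(tokenizer.vocab_size)
```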
Guide: Running Locally
Basic Steps
- Installation: Ensure that the `transformers` and `torch` libraries are installed in your Python environment.

  ```bash
  pip install transformers torch
  ```
- Load Model and Tokenizer:

  ```python
  import torch
  from transformers import AutoTokenizer, AutoModel

  tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
  umberto = AutoModel.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
  ```
- Use the Model: Encode text and obtain outputs.

  ```python
  encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
  input_ids = torch.tensor(encoded_input).unsqueeze(0)  # Batch size 1
  outputs = umberto(input_ids)
  last_hidden_states = outputs[0]
  ```
- Predict Masked Token:

  ```python
  from transformers import pipeline

  fill_mask = pipeline(
      "fill-mask",
      model="Musixmatch/umberto-commoncrawl-cased-v1",
      tokenizer="Musixmatch/umberto-commoncrawl-cased-v1",
  )
  result = fill_mask("Umberto Eco è <mask> un grande scrittore")
  ```
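The fill-mask pipeline returns a list of candidate completions, each a dictionary containing the predicted token string, its score, and the completed sequence. A minimal sketch of how the `result` from the last step might be inspected:

```python
# Each prediction is a dict with keys such as 'sequence', 'score', and 'token_str'
for prediction in result:
    print(f"{prediction['token_str']}\t{prediction['score']:.4f}\t{prediction['sequence']}")
```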
Suggestion for Cloud GPUs
To accelerate computations, consider using cloud-based GPU services like AWS EC2, Google Cloud Platform, or Azure, which provide scalable resources for handling large models efficiently.
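On such an instance, the only change to the steps above is moving the model and the encoded inputs onto the GPU. A minimal sketch, assuming CUDA is available and that `umberto` and `input_ids` were created as in the guide above:

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

umberto = umberto.to(device)
input_ids = input_ids.to(device)

with torch.no_grad():  # inference only, no gradients needed
    outputs = umberto(input_ids)
last_hidden_states = outputs[0]
```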
License
The original datasets used for training are publicly available or were released with permission under CC0 or CC-BY licenses. For further details, refer to the respective dataset repositories and documentation.