vietnamese accent marker xlm roberta
peterhung
Introduction
This model is designed to insert Vietnamese accent marks (diacritics) into unaccented Vietnamese text. It transforms unaccented text into its accented form, improving text readability and accuracy.
Architecture
The model is a fine-tuned version of XLM-RoBERTa Large, using the token classification approach. Each token in the input text is assigned a tag that indicates how to convert it into its accented form.
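To make the tagging idea concrete, here is a minimal sketch of how a single tag might rewrite a single token. The tag format is an assumption made for illustration: each tag is taken to be a string of the form raw-accented (for example u-ữ), with a bare - meaning "leave the token unchanged". The helper name apply_tag is hypothetical and is not part of the model's code; the sketch also ignores upper-case handling.

```python
def apply_tag(token: str, tag: str) -> str:
    """Apply one accent tag to one unaccented token.

    Assumed tag format (hypothetical): "raw-accented", e.g. "u-ữ".
    A bare "-" or a tag whose raw part is absent leaves the token as is.
    """
    raw, sep, accented = tag.partition("-")
    if not sep or not raw or raw not in token:
        return token
    # Replace only the first occurrence of the unaccented substring.
    return token.replace(raw, accented, 1)


if __name__ == "__main__":
    for token, tag in [("nhung", "u-ữ"), ("mua", "ua-ùa"), ("thu", "-")]:
        print(token, "+", tag, "->", apply_tag(token, tag))
```

Under this assumption the classifier never generates characters itself; it only picks one rewrite rule per token from a fixed list, which is what makes token classification a natural fit for the task.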
Training
Accent insertion is framed as token classification: each input token receives a tag that describes how to transform it into its accented form. Training starts from the pretrained XLM-RoBERTa Large checkpoint and fine-tunes it on this tagging task, relying on the transformer architecture to model Vietnamese text.
Guide: Running Locally
- Installation: Install the necessary packages using pip install transformers torch numpy.
- Load Model: Use AutoTokenizer and AutoModelForTokenClassification from the transformers library to load the model.
- Process Text: Run text through the model to get token predictions.
- Apply Tags: Use a tags file (selected_tags_names.txt) to convert tokens to accented words.
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import numpy as np


def load_trained_transformer_model():
    model_path = "peterhung/vietnamese-accent-marker-xlm-roberta"
    tokenizer = AutoTokenizer.from_pretrained(model_path, add_prefix_space=True)
    model = AutoModelForTokenClassification.from_pretrained(model_path)
    return model, tokenizer


model, tokenizer = load_trained_transformer_model()

# Move model to GPU if available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
model.eval()
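The remaining steps (Process Text and Apply Tags) could be sketched roughly as follows. This is an illustration under stated assumptions, not the model author's code: the helper names load_tags, predict_tag_ids, and group_subwords are hypothetical, the tags file is assumed to hold one tag per line with line i corresponding to class id i, and subword pieces are assumed to carry the SentencePiece word-start marker (U+2581) that XLM-RoBERTa tokenizers emit.

```python
WORD_START = "\u2581"  # SentencePiece marker on the first piece of a word


def load_tags(path="selected_tags_names.txt"):
    """Read the tag list; assumes one tag per line, line i = class id i."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def predict_tag_ids(words, model, tokenizer, device):
    """Return (subword tokens, predicted class ids) for a list of words."""
    import torch  # imported here so the pure helpers work without torch

    inputs = tokenizer(words, is_split_into_words=True,
                       truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, seq_len, num_tags)
    tag_ids = logits.argmax(dim=-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    # Drop the <s> and </s> special tokens at both ends.
    return tokens[1:-1], tag_ids[1:-1]


def group_subwords(tokens, tag_ids):
    """Merge subword pieces into words, collecting each word's tag ids."""
    words = []
    for token, tag_id in zip(tokens, tag_ids):
        if token.startswith(WORD_START) or not words:
            # A new word begins: strip the marker if it is present.
            text = token[len(WORD_START):] if token.startswith(WORD_START) else token
            words.append([text, [tag_id]])
        else:
            # Continuation piece: extend the current word.
            words[-1][0] += token
            words[-1][1].append(tag_id)
    return [(text, ids) for text, ids in words]
```

With the model, tokenizer, and device loaded above, one would call predict_tag_ids on a whitespace-split sentence, pass the result to group_subwords, and then look up each class id in the list returned by load_tags to find the rewrite to apply to that word.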
Cloud GPUs: For faster inference, consider running the model on a cloud GPU instance, for example on AWS EC2, Google Cloud, or Azure.
License
The model is licensed under Apache 2.0.