vietnamese-accent-marker-xlm-roberta

peterhung

Introduction

This model inserts Vietnamese accent marks (diacritics) into unaccented Vietnamese text. Given input without diacritics, it produces the accented form, which makes the text easier to read and removes the ambiguity between words that differ only in their accents.

Architecture

The model is a fine-tuned version of XLM-RoBERTa Large, using the token classification approach. Each token in the input text is assigned a tag that indicates how to convert it into its accented form.
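To make the idea of a per-token tag concrete, here is a minimal sketch of how a predicted tag might be applied to a word. The "raw-accented" tag format (for example "e-ệ", with a bare "-" meaning "no change") and the helper name apply_tag are assumptions made for illustration; the actual tag set is whatever selected_tags_names.txt defines.

```python
def apply_tag(word: str, tag: str) -> str:
    """Apply one hypothetical 'raw-accented' tag to an unaccented word.

    Assumed tag format (not confirmed by the model card): the part before
    the hyphen is the unaccented substring, the part after it is the
    accented replacement. A bare "-" leaves the word unchanged.
    """
    raw, sep, accented = tag.partition("-")
    # No separator, an empty raw part, or no match: keep the word as-is.
    if not sep or not raw or raw not in word:
        return word
    # Replace only the first occurrence of the unaccented substring.
    return word.replace(raw, accented, 1)


print(apply_tag("viet", "e-ệ"))  # accented form of "viet"
print(apply_tag("nam", "-"))     # unchanged
```

Under this assumed format, the classifier never generates characters itself; it only picks one substitution rule per token from a fixed tag list.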

Training

Training frames accent restoration as token classification: for each input token, the model learns to predict the tag that transforms it into its accented form. The model is fine-tuned from XLM-RoBERTa Large, a multilingual transformer encoder.
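The source does not describe how the training data was prepared. One common way to build input/target pairs for this kind of task is to start from correctly accented text and strip the diacritics to obtain the model input. The sketch below shows only that stripping step; the function name strip_accents is hypothetical, and this is an assumption about a plausible pipeline, not the author's documented method.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove Vietnamese diacritics from text (illustrative helper)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn").
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # "đ" and "Đ" are standalone letters that NFD does not decompose.
    return stripped.replace("đ", "d").replace("Đ", "D")


accented = "Việt Nam"
print(strip_accents(accented))
```

Each accented word and its stripped counterpart could then be compared to derive the tag the classifier should predict for that token.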

Guide: Running Locally

  1. Installation: Install the necessary packages using pip install transformers torch numpy.
  2. Load Model: Use AutoTokenizer and AutoModelForTokenClassification from the transformers library to load the model.
  3. Process Text: Run text through the model to get token predictions.
  4. Apply Tags: Use a tags file (selected_tags_names.txt) to convert tokens to accented words.

Loading the model (step 2):

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import numpy as np

def load_trained_transformer_model():
    model_path = "peterhung/vietnamese-accent-marker-xlm-roberta"
    tokenizer = AutoTokenizer.from_pretrained(model_path, add_prefix_space=True)
    model = AutoModelForTokenClassification.from_pretrained(model_path)
    return model, tokenizer

model, tokenizer = load_trained_transformer_model()
# Move the model to the GPU if one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Switch to inference mode (disables dropout)
model.eval()
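
Steps 3 and 4 could be sketched roughly as follows, reusing the model, tokenizer, and device from the loading code above. The helper names (predict_tag_ids, merge_subwords, load_tags) are hypothetical. The sketch also assumes that selected_tags_names.txt holds one tag per line, with the line index equal to the label id, and that the tokenizer wraps the sequence in one leading and one trailing special token.

```python
def predict_tag_ids(words, model, tokenizer, device):
    """Return (subword tokens, predicted tag ids) for a list of unaccented words."""
    import torch  # imported lazily so the pure helpers below work without torch

    inputs = tokenizer(words, is_split_into_words=True, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, sequence_length, num_tags)
    tag_ids = logits.argmax(dim=-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    # Drop the special tokens assumed to surround the sequence.
    return tokens[1:-1], tag_ids[1:-1]


def merge_subwords(tokens, tag_ids):
    """Join SentencePiece pieces back into words, keeping every piece's tag id."""
    merged = []
    for token, tag_id in zip(tokens, tag_ids):
        # A leading "▁" marks the start of a new word in SentencePiece output.
        if token.startswith("▁") or not merged:
            merged.append((token.lstrip("▁"), [tag_id]))
        else:
            word, ids = merged[-1]
            merged[-1] = (word + token, ids + [tag_id])
    return merged


def load_tags(path="selected_tags_names.txt"):
    """Read the tag list; assumes one tag per line, line index == label id."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Usage, assuming `model`, `tokenizer`, and `device` from the loading code above:
# tokens, tag_ids = predict_tag_ids("viet nam".split(), model, tokenizer, device)
# tags = load_tags()
# for word, ids in merge_subwords(tokens, tag_ids):
#     print(word, [tags[i] for i in ids])
```

How each tag is then turned into an accented word depends on the tag format in selected_tags_names.txt, so that final substitution is left out here.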

Suggested Cloud GPUs: XLM-RoBERTa Large is a large model, so inference is considerably faster on a GPU. If none is available locally, consider GPU instances on cloud services such as AWS EC2, Google Cloud, or Azure.

License

The model is licensed under Apache 2.0.
