HUPD DistilRoBERTa Base

HUPD

Introduction

HUPD DistilRoBERTa is a DistilRoBERTa model fine-tuned on the Harvard USPTO Patent Dataset (HUPD) with a masked language modeling objective. It is designed to help understand and process patent text by predicting masked words in a sequence.

Architecture

The model is based on the DistilRoBERTa architecture, a distilled version of RoBERTa with fewer layers and parameters. It retains most of RoBERTa's accuracy while being noticeably smaller and faster at inference.
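
DistilRoBERTa's smaller size is easy to check directly. The sketch below is an illustration, assuming the public distilroberta-base and roberta-base checkpoints are available on the Hugging Face Hub; it simply compares their parameter counts.

    from transformers import AutoModelForMaskedLM

    # Compare the parameter counts of the distilled model and the original RoBERTa base model.
    for checkpoint in ("distilroberta-base", "roberta-base"):
        model = AutoModelForMaskedLM.from_pretrained(checkpoint)
        print(f"{checkpoint}: {model.num_parameters() / 1e6:.0f}M parameters")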

Training

The model was trained on the HUPD dataset with a masked language modeling objective, in which randomly masked tokens in a sentence must be predicted from their context. Because HUPD is a large corpus of US patent applications, the resulting model is particularly well suited to patent text.
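
The exact training recipe is not reproduced here, but the sketch below illustrates the masked language modeling objective using the standard transformers data collator. The example text, the base checkpoint, and the 15% masking probability are illustrative assumptions rather than details of the HUPD training run.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

    torch.manual_seed(0)  # make the random masking reproducible

    tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
    model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

    # A patent-style passage standing in for a HUPD training example (illustrative only).
    text = (
        "The present invention relates to a method and apparatus for wireless communication "
        "between implantable medical devices. The apparatus comprises a transmitter, a receiver, "
        "and a control unit configured to manage power consumption."
    )

    # The collator randomly masks tokens and builds the corresponding labels.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
    encoding = tokenizer(text, return_tensors="pt")
    batch = collator([{"input_ids": encoding["input_ids"][0]}])

    # The model computes a cross-entropy loss over the masked positions only.
    outputs = model(input_ids=batch["input_ids"], labels=batch["labels"])
    print(f"Masked language modeling loss: {outputs.loss.item():.4f}")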

Guide: Running Locally

To use the HUPD DistilRoBERTa model locally, you can follow these steps:

  1. Install Hugging Face Transformers: Ensure you have the transformers library installed, along with PyTorch, which the examples below use.

    pip install transformers torch
    
  2. Load the Model with a Pipeline:

    from transformers import pipeline

    # The fill-mask pipeline returns the top candidate tokens with their scores.
    fill_mask = pipeline(task="fill-mask", model="hupd/hupd-distilroberta-base")
    result = fill_mask("Improved <mask> for playing a game of thumb wrestling.")
    print(result)
    
  3. Load the Model Manually (a sketch that ranks several candidate tokens follows after this list):

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM
    
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    tokenizer = AutoTokenizer.from_pretrained("hupd/hupd-distilroberta-base")
    model = AutoModelForMaskedLM.from_pretrained("hupd/hupd-distilroberta-base").to(device)
    
    TEXT = "Improved <mask> for playing a game of thumb wrestling."
    inputs = tokenizer(TEXT, return_tensors="pt").to(device)
    
    with torch.no_grad():
        logits = model(**inputs).logits
    
    # Locate the position(s) of the <mask> token in the input.
    mask_token_indices = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
    for mask_idx in mask_token_indices:
        predicted_token_id = logits[0, mask_idx].argmax(dim=-1)
        output = tokenizer.decode(predicted_token_id)
        print(f'Prediction for the <mask> token at index {mask_idx}: "{output}"')
    
  4. GPU Support: The manual example above already moves the model and inputs to a GPU when one is available. For faster computation, consider using cloud GPUs provided by services like AWS, Google Cloud, or Azure.
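
If a single best guess is not enough, the manual example from step 3 can be extended to rank several candidates for the masked position. The snippet below is a small sketch that reuses the logits, mask_token_indices, and tokenizer variables defined in that step; the choice of five candidates is arbitrary.

    # Rank the five highest-scoring vocabulary entries for the first masked position.
    top_ids = logits[0, mask_token_indices[0]].topk(5).indices
    for rank, token_id in enumerate(top_ids, start=1):
        print(f"{rank}. {tokenizer.decode(token_id).strip()}")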

License

The HUPD DistilRoBERTa model is released under the Creative Commons Attribution-ShareAlike 4.0 International License (cc-by-sa-4.0), which allows sharing and adaptation with appropriate credit, provided derivative works are distributed under the same license.
