NuExtract-v1.5

numind

Introduction

NuExtract-v1.5 is a model developed by NuMind for structured information extraction from long documents in multiple languages, including English, French, Spanish, German, Portuguese, and Italian. It is a fine-tuned version of Microsoft's Phi-3.5-mini-instruct and is purely extractive: it is meant to return only information that is explicitly present in the input text.

Architecture

NuExtract-v1.5 inherits the architecture of Phi-3.5-mini-instruct, a 3.8-billion-parameter decoder-only Transformer with a context window of up to 128K tokens. It is loaded as a causal language model and produces its extraction as generated text, which makes it suited to long input documents.

Training

The model is trained on a proprietary, high-quality dataset to improve structured extraction. Training emphasizes fidelity to the original text, so that extracted values match the input content as closely as possible. The model can be used zero-shot, with only a template, or few-shot, with example outputs supplied alongside the input.
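
Because the model is meant to copy values from the input rather than invent them, one way to check fidelity is to confirm that each extracted string occurs verbatim in the source text. The following is a minimal sketch of such a check, not part of NuExtract itself; the function name unsupported_values is hypothetical, and the check assumes the prediction is valid JSON.

```python
import json


def unsupported_values(prediction, source_text):
    """Return extracted string values that do not occur verbatim in source_text.

    Hypothetical helper for illustration; assumes `prediction` is a JSON string.
    """
    missing = []

    def walk(node):
        # Recurse through nested objects and arrays, collecting leaf strings.
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
        elif isinstance(node, str) and node and node not in source_text:
            # Empty strings mean "not found" in the template, so they are skipped.
            missing.append(node)

    walk(json.loads(prediction))
    return missing


text = "Ada Lovelace was born in London in 1815."
pred = '{"Name": "Ada Lovelace", "Birthplace": "London", "Tags": ["Paris"], "Died": ""}'
print(unsupported_values(pred, text))
```

An exact substring test is strict: a value that the model normalized (for example, a reformatted date) would be flagged even if it is correct, so the result is best treated as a list of values to review rather than as errors.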

Guide: Running Locally

To run NuExtract-v1.5 locally, follow these steps:

  1. Install Dependencies: Ensure you have Python installed along with the transformers and torch libraries.

    pip install transformers torch
    
  2. Load the Model: Use the following Python code to load the model and tokenizer.

    import json
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    model_name = "numind/NuExtract-v1.5"
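    # Fall back to CPU when no CUDA GPU is available (much slower).
    device = "cuda" if torch.cuda.is_available() else "cpu"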
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).to(device).eval()
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    
  3. Prepare Input and Template: Define the input text and the JSON template for the extraction task.

    text = "Your input text here."
    template = """{
        "Key1": "",
        "Key2": ""
    }"""
    
  4. Run Prediction: Call the predict_NuExtract helper to extract information based on the template. This function is not part of transformers and must be defined before this step.

    prediction = predict_NuExtract(model, tokenizer, [text], template)[0]
    print(prediction)
    
  5. Hardware Recommendations: For optimal performance, use a GPU. Cloud GPUs from providers like AWS, Google Cloud, or Azure are recommended to handle large models and heavy workloads effectively.
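
Step 4 calls predict_NuExtract, which the steps above do not define. The sketch below shows one way such a helper could look. The prompt layout (the `<|input|>`, `### Template:`, `### Text:`, and `<|output|>` markers) and the default limits are assumptions based on the published model card and should be checked against it; build_prompt is a hypothetical name introduced here for illustration.

```python
import json


def build_prompt(text, template):
    """Format one extraction prompt (layout is an assumption; see the model card)."""
    # Re-serializing normalizes indentation and fails early on invalid JSON.
    template = json.dumps(json.loads(template), indent=4)
    return f"<|input|>\n### Template:\n{template}\n### Text:\n{text}\n\n<|output|>"


def predict_NuExtract(model, tokenizer, texts, template,
                      batch_size=1, max_length=10_000, max_new_tokens=4_000):
    """Run extraction over a list of texts and return one output string per text."""
    prompts = [build_prompt(text, template) for text in texts]
    outputs = []
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        # Tokenize the batch and move the tensors to the model's device.
        encodings = tokenizer(batch, return_tensors="pt", truncation=True,
                              padding=True, max_length=max_length).to(model.device)
        pred_ids = model.generate(**encodings, max_new_tokens=max_new_tokens)
        outputs += tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    # Keep only the text generated after the output marker.
    return [out.split("<|output|>", 1)[-1] for out in outputs]
```

With this helper defined after the model is loaded, the call in step 4 returns the generated text for each input, which can then be parsed with json.loads if a Python object is needed.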

License

NuExtract-v1.5 is released under the MIT License, which permits reuse, modification, and redistribution provided the copyright and license notice is retained.
