NuExtract 1.5
numind
Introduction
NuExtract-v1.5 is a model developed by NuMind for structured information extraction from long documents in multiple languages, including English, French, Spanish, German, Portuguese, and Italian. It is a fine-tuned version of Microsoft's Phi-3.5-mini-instruct model and focuses on extracting information that is explicitly present in the input text.
Architecture
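NuExtract-v1.5 builds on the architecture of Phi-3.5-mini-instruct, a 3.8-billion-parameter decoder-only transformer with a 128K-token context window. It is loaded as a causal language model, so extraction is performed as text generation: the model receives the template and the input text and generates the filled-in output. The long context window is what allows it to process long documents.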
Training
The model was trained on a proprietary, high-quality dataset to improve its ability to extract structured information. Special emphasis is placed on fidelity to the original text, so that the generated output matches the input content as closely as possible. The model can be used in both zero-shot and few-shot settings.
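The fidelity property described above can be spot-checked on your own outputs. The following is a minimal sketch, assuming the prediction is a JSON string and that "fidelity" is approximated by exact substring matching; the source does not say how fidelity is measured, and the helper names `iter_string_leaves` and `unsupported_values` as well as the sample text are hypothetical.

```python
import json


def iter_string_leaves(node):
    """Yield every non-empty string value found anywhere in a parsed JSON value."""
    if isinstance(node, str):
        if node:
            yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_string_leaves(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_string_leaves(item)


def unsupported_values(prediction_json, source_text):
    """Return extracted strings that do not occur verbatim in source_text.

    Assumption: exact substring matching stands in for "explicitly present".
    """
    parsed = json.loads(prediction_json)
    return [value for value in iter_string_leaves(parsed) if value not in source_text]


# Made-up example data, for illustration only.
source = "Ada Lovelace was born in London in 1815."
prediction = '{"name": "Ada Lovelace", "birthplace": "Paris", "year": "1815"}'
print(unsupported_values(prediction, source))  # ['Paris']
```

An empty list means every extracted string appears verbatim in the input. A stricter or looser check (for example, case-insensitive matching) may suit some templates better.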
Guide: Running Locally
To run NuExtract-v1.5 locally, follow these steps:
- Install Dependencies: Ensure you have Python installed along with the transformers and torch libraries.

    pip install transformers torch
- Load the Model: Use the following Python code to load the model and tokenizer.

    import json
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "numind/NuExtract-v1.5"
    device = "cuda"  # requires a CUDA-capable GPU

    # Load the weights in bfloat16, move them to the GPU, and switch to inference mode.
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).to(device).eval()
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
- Prepare Input and Template: Define the input text and the JSON template for the extraction task.

    text = "Your input text here."
    template = """{
        "Key1": "",
        "Key2": ""
    }"""
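- Run Prediction: Use the predict_NuExtract function to extract information based on the template. This helper is not part of the transformers library; it is defined on the model card and must be defined before this call. It takes a list of texts and returns one prediction per text.

    prediction = predict_NuExtract(model, tokenizer, [text], template)[0]
    print(prediction)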
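- Hardware Recommendations: For optimal performance, use a GPU. The loading code above sets device = "cuda", so it requires a CUDA-capable GPU as written. If you do not have one locally, cloud GPUs from providers such as AWS, Google Cloud, or Azure are an option for large models and heavy workloads.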
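The Run Prediction step calls predict_NuExtract without showing its body. The sketch below is modelled on the helper published on the model card, but it is written from memory and should be treated as an assumption: the prompt markers (`<|input|>`, `### Template:`, `### Text:`, `<|output|>`) and the default limits may differ from the current card, and `build_prompt` is a hypothetical helper name introduced here for clarity.

```python
import json


def build_prompt(text, template):
    """Build one extraction prompt (marker layout is an assumption, see above)."""
    # Re-serialize the template so malformed JSON fails early
    # and the indentation is consistent.
    template = json.dumps(json.loads(template), indent=4)
    return f"<|input|>\n### Template:\n{template}\n### Text:\n{text}\n\n<|output|>"


def predict_NuExtract(model, tokenizer, texts, template,
                      batch_size=1, max_length=10_000, max_new_tokens=4_000):
    """Return one generated extraction string per input text."""
    prompts = [build_prompt(text, template) for text in texts]
    outputs = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        # Tokenize the batch and move the tensors to the model's device.
        encodings = tokenizer(batch, return_tensors="pt", truncation=True,
                              padding=True, max_length=max_length).to(model.device)
        pred_ids = model.generate(**encodings, max_new_tokens=max_new_tokens)
        outputs += tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    # The decoded text repeats the prompt; keep only what follows the output marker.
    return [output.split("<|output|>")[1] for output in outputs]


# Inspect the prompt for the placeholder template from the steps above.
print(build_prompt("Your input text here.", '{"Key1": "", "Key2": ""}'))
```

With this definition in scope, the call from the Run Prediction step works unchanged. The sketch does not import torch, so it can be exercised with stand-in objects; you may additionally wrap the generation loop in torch.no_grad() if you prefer that to be explicit.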
License
NuExtract-v1.5 is released under the MIT License, which permits reuse, modification, and redistribution provided the copyright and license notice is retained.