reader lm 1.5b

jinaai

Introduction

Jina Reader-LM is a series of models developed by Jina AI that convert HTML content to Markdown. The models are trained on a curated dataset of HTML and corresponding Markdown content, which makes them suitable for content conversion tasks.

Architecture

The Jina Reader-LM series includes models of different sizes, such as reader-lm-0.5b and reader-lm-1.5b, each supporting a context length of up to 256K tokens. The models are multilingual and can be loaded through the transformers library.
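Because large HTML pages can produce very long prompts, it may help to check the token budget before generating. The following is a minimal sketch, not part of the model or of transformers: the function name `fits_context` is hypothetical, and the default limit of 256,000 tokens is an assumption based on the "256K" figure above (the exact value may differ and should be taken from the model configuration).

```python
def fits_context(prompt_tokens, max_new_tokens, context_limit=256_000):
    """Return True if the prompt plus the generation budget fits the window.

    context_limit defaults to 256_000 as an assumed reading of "256K";
    replace it with the value from the model's configuration if it differs.
    """
    if prompt_tokens < 0 or max_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + max_new_tokens <= context_limit


# Intended use with a loaded tokenizer (not run here):
#   prompt_tokens = len(tokenizer.encode(input_text))
#   if not fits_context(prompt_tokens, max_new_tokens=1024):
#       ...  # shorten or split the HTML before calling generate
```

The check counts both the prompt and the tokens reserved for output, since the generated Markdown shares the same context window as the input.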

Training

The models are specifically trained on HTML and Markdown content pairs to ensure effective conversion from HTML to Markdown. The training process involves fine-tuning on structured datasets to capture the nuances of both content formats.

Guide: Running Locally

  1. Install Requirements
    Ensure you have the transformers library installed:

    pip install "transformers<=4.43.4"
    
  2. Load the Model
    Use the following Python code to load and execute the model:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    checkpoint = "jinaai/reader-lm-1.5b"
    device = "cuda"  # Use "cpu" if a GPU is unavailable
    
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
    
    html_content = "<html><body><h1>Hello, world!</h1></body></html>"
    messages = [{"role": "user", "content": html_content}]
    input_text = tokenizer.apply_chat_template(messages, tokenize=False)
    
    inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
    # Greedy decoding: with do_sample=False the temperature setting is not used, so it is omitted
    outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08)
    
    print(tokenizer.decode(outputs[0]))
    
  3. Cloud GPUs
    For faster inference, consider cloud-based GPUs. Both the 0.5b and 1.5b versions are available on AWS SageMaker and Azure Marketplace.
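As a small sketch building on step 2, the two helpers below choose a device automatically and keep only the newly generated tokens. For a decoder-only model, `outputs[0]` from `generate` contains the prompt tokens followed by the completion, so printing it directly echoes the input HTML as well. The names `pick_device` and `new_tokens_only` are hypothetical and not part of transformers.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU backend to query.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def new_tokens_only(output_ids, prompt_length):
    """Drop the first prompt_length token ids (the echoed prompt)."""
    return output_ids[prompt_length:]


# Intended use with the objects from step 2 (not run here):
#   device = pick_device()
#   markdown_ids = new_tokens_only(outputs[0], inputs.shape[1])
#   print(tokenizer.decode(markdown_ids, skip_special_tokens=True))
```

Passing `skip_special_tokens=True` to `decode` also removes chat-template markers from the printed Markdown.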

License

The Jina Reader-LM models are distributed under the cc-by-nc-4.0 license, which allows for non-commercial use with attribution.
