reader-lm-0.5b

jinaai

Introduction

Jina Reader-LM is a series of models that convert HTML content into Markdown. The models are trained on a curated dataset of HTML and its corresponding Markdown.

Architecture

The Reader-LM models are text-generation (causal language) models that load through the Transformers library. They accept multilingual input and take raw HTML directly as the user message, with no prefix instruction required.

Training

The models are trained on a large collection of HTML and Markdown pairs and are intended for applications that need HTML-to-Markdown conversion.

Guide: Running Locally

To run the Reader-LM model locally, follow these steps:

  1. Install Dependencies:
    Ensure you have the correct version of the Transformers library:

    pip install "transformers<=4.43.4"
    
  2. Load the Model:
    Use the following Python code to load and run the model:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    checkpoint = "jinaai/reader-lm-0.5b"
    
    device = "cuda" # Use "cpu" if GPU is unavailable
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
    
    # Example HTML content
    html_content = "<html><body><h1>Hello, world!</h1></body></html>"
    messages = [{"role": "user", "content": html_content}]
    input_text = tokenizer.apply_chat_template(messages, tokenize=False)
    
    inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
    # Greedy decoding: with do_sample=False the temperature setting is not used, so it is omitted
    outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08)
    
    print(tokenizer.decode(outputs[0]))
    
  3. Cloud GPUs:
    For faster inference, consider a cloud GPU service such as Google Colab's free T4 GPU tier, AWS SageMaker, or Azure Marketplace.
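
The snippet in step 2 decodes outputs[0] in full. For a causal language model that sequence holds the prompt tokens followed by the generated ones, so the printed text contains the chat-templated HTML as well as the Markdown. The sketch below shows one way to keep only the generated part. The helper name new_tokens_only is hypothetical, and the toy integer ids stand in for real token ids so the example runs without downloading the model.

```python
from typing import List, Sequence


def new_tokens_only(output_ids: Sequence[int], prompt_length: int) -> List[int]:
    """Return the ids that follow the first `prompt_length` ids.

    Hypothetical helper for illustration; it is not part of Transformers.
    """
    if prompt_length < 0 or prompt_length > len(output_ids):
        raise ValueError("prompt_length must be between 0 and len(output_ids)")
    return list(output_ids[prompt_length:])


# Toy ids (assumed values) standing in for a tokenized prompt and a model output.
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 2]  # generate() returns prompt + new tokens

generated = new_tokens_only(output_ids, len(prompt_ids))
print(generated)
```

With the variables from step 2, the same idea would likely be written as tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True), where inputs.shape[-1] is the prompt length in tokens and skip_special_tokens drops the chat-template markers from the decoded text.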

License

The Reader-LM models are released under the Creative Commons Attribution-NonCommercial 4.0 International License (cc-by-nc-4.0).
