Reader-LM 0.5B (jinaai)

Introduction
The Jina Reader-LM models convert HTML content into Markdown. They are trained on a curated dataset of HTML documents paired with their Markdown equivalents.
Architecture
The Reader-LM models are designed for text generation, leveraging the Transformers library. They support multilingual inputs and are particularly useful for processing raw HTML without requiring prefix instructions.
Training
The models are trained on a large collection of HTML–Markdown pairs and optimized for efficient, accurate HTML-to-Markdown conversion in downstream applications.
Guide: Running Locally
To run the Reader-LM model locally, follow these steps:
- Install dependencies: ensure you have a compatible version of the Transformers library. Quote the requirement so the shell does not interpret `<=` as redirection:

  ```shell
  pip install "transformers<=4.43.4"
  ```
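If you want to confirm the pin at runtime, dotted version strings can be compared as integer tuples. This is a minimal sketch (the `version_tuple` helper is illustrative, not part of any library; `packaging.version` is the more robust choice when available):

```python
def version_tuple(v: str) -> tuple:
    """Parse a dotted version string like '4.43.4' into a comparable tuple."""
    return tuple(int(part) for part in v.split(".")[:3])

# The guide pins transformers to <= 4.43.4:
PIN = version_tuple("4.43.4")

print(version_tuple("4.43.4") <= PIN)  # True: exactly the pinned version
print(version_tuple("4.44.0") <= PIN)  # False: newer than the pin
```

In practice you would pass `importlib.metadata.version("transformers")` to the helper instead of a literal string.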
- Load the model: use the following Python code to load and run the model:

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

  checkpoint = "jinaai/reader-lm-0.5b"
  device = "cuda"  # use "cpu" if a GPU is unavailable

  tokenizer = AutoTokenizer.from_pretrained(checkpoint)
  model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

  # Example HTML content
  html_content = "<html><body><h1>Hello, world!</h1></body></html>"

  # Wrap the raw HTML in a chat turn; no prefix instruction is needed.
  messages = [{"role": "user", "content": html_content}]
  input_text = tokenizer.apply_chat_template(
      messages, tokenize=False, add_generation_prompt=True
  )
  inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)

  outputs = model.generate(
      inputs,
      max_new_tokens=1024,
      temperature=0,
      do_sample=False,
      repetition_penalty=1.08,
  )
  print(tokenizer.decode(outputs[0]))
  ```
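One detail worth noting about the example above: for causal LMs, `generate` returns the prompt ids followed by the newly generated ids, so decoding `outputs[0]` prints the HTML prompt along with the Markdown. To keep only the generated text, slice the output row by the prompt length (`inputs.shape[1]` in the real snippet). The token ids below are made-up placeholders that illustrate the slicing:

```python
# Hypothetical token ids standing in for real tokenizer output.
prompt_ids = [15, 27, 902, 11]          # ids of the HTML prompt
output_row = prompt_ids + [44, 380, 7]  # generate() echoes the prompt, then new tokens

# Keep only the newly generated ids.
new_ids = output_row[len(prompt_ids):]
print(new_ids)  # [44, 380, 7]
```

In the real example this becomes `tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)`.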
- Cloud GPUs: for better performance, consider cloud GPU services such as Google Colab's free T4 tier, AWS SageMaker, or Azure Marketplace.
License
The Reader-LM models are released under the Creative Commons Attribution-NonCommercial 4.0 International License (cc-by-nc-4.0).