reader-lm-1.5b
Introduction
Jina Reader-LM is a series of models developed by Jina AI to convert HTML content to Markdown. The models are trained on a curated dataset of HTML documents paired with their Markdown equivalents, making them well suited to content-conversion tasks.
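For illustration, the conversion the model targets looks roughly like the following hand-written pair (not actual model output; the exact Markdown produced depends on the input page and generation settings):

```python
# Illustrative only: a hand-written example of the HTML-to-Markdown mapping
# that Reader-LM is trained for; this is not output from the model itself.
html_in = "<h1>Hello</h1><p>Visit <a href='https://jina.ai'>Jina AI</a>.</p>"
markdown_out = "# Hello\n\nVisit [Jina AI](https://jina.ai)."
```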
Architecture
The Jina Reader-LM series includes models of varying capacity, such as reader-lm-0.5b and reader-lm-1.5b, each supporting a context length of up to 256K tokens. The models are multilingual and are used through the transformers library.
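To check the reported context window on a given checkpoint, the published configuration can be inspected with transformers. This is a minimal sketch; the max_position_embeddings field is an assumption based on common decoder-only configs and may be named differently for this model.

```python
from transformers import AutoConfig

# Minimal sketch: inspect the published config of reader-lm-1.5b.
# `max_position_embeddings` is an assumed field name (common in decoder-only
# configs); fall back to "n/a" if this checkpoint names it differently.
config = AutoConfig.from_pretrained("jinaai/reader-lm-1.5b")
print(config.model_type)
print(getattr(config, "max_position_embeddings", "n/a"))
```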
Training
The models are specifically trained on HTML and Markdown content pairs to ensure effective conversion from HTML to Markdown. The training process involves fine-tuning on structured datasets to capture the nuances of both content formats.
Guide: Running Locally
- Install Requirements

  Ensure you have the transformers library installed (quoting the requirement keeps the shell from treating `<` as a redirection):

  ```bash
  pip install "transformers<=4.43.4"
  ```
- Load the Model

  Use the following Python code to load and run the model (the final decode prints the prompt together with the generated Markdown; see the note after this list for trimming the output):

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

  checkpoint = "jinaai/reader-lm-1.5b"
  device = "cuda"  # Use "cpu" if a GPU is unavailable

  tokenizer = AutoTokenizer.from_pretrained(checkpoint)
  model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

  html_content = "<html><body><h1>Hello, world!</h1></body></html>"
  messages = [{"role": "user", "content": html_content}]
  input_text = tokenizer.apply_chat_template(messages, tokenize=False)

  inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
  outputs = model.generate(
      inputs,
      max_new_tokens=1024,
      temperature=0,
      do_sample=False,
      repetition_penalty=1.08,
  )
  print(tokenizer.decode(outputs[0]))
  ```
- Cloud GPUs

  For enhanced performance, consider using cloud GPUs on platforms such as AWS SageMaker or Azure Marketplace, both of which offer the 0.5b and 1.5b versions.
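Note that model.generate returns the prompt tokens followed by the newly generated ones, so decoding outputs[0] directly also echoes the input HTML. A minimal sketch, assuming the inputs, outputs, and tokenizer variables from the Load the Model step, for printing only the generated Markdown:

```python
# Slice off the prompt tokens so only the generated Markdown is decoded.
# Assumes `inputs`, `outputs`, and `tokenizer` from the "Load the Model" step.
generated_tokens = outputs[0][inputs.shape[-1]:]
print(tokenizer.decode(generated_tokens, skip_special_tokens=True))
```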
License
The Jina Reader-LM models are distributed under the cc-by-nc-4.0 license, which allows non-commercial use with attribution.