FinBERT: Pretrained Financial Language Model

Introduction

FinBERT is a BERT-based language model pre-trained on financial communication texts, intended to advance research and practical applications in financial Natural Language Processing (NLP). Its pre-training corpus comprises 4.9 billion tokens of financial text.

Architecture

FinBERT follows the standard BERT architecture, a transformer encoder designed for a broad range of NLP tasks. Rather than modifying the architecture, FinBERT adapts to the financial domain by pre-training on domain-specific financial documents.
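
Because the checkpoint reuses the BERT architecture, its configuration can be inspected with standard transformers APIs. A minimal sketch, assuming only the transformers library and access to the Hugging Face Hub:

    from transformers import AutoConfig

    # Load the configuration without downloading the weights; since
    # FinBERT reuses the BERT architecture, these are standard BERT fields.
    config = AutoConfig.from_pretrained("yiyanghkust/finbert-pretrain")
    print(config.model_type)           # expected: "bert"
    print(config.num_hidden_layers)    # transformer encoder layers
    print(config.hidden_size)          # hidden dimension per token
    print(config.num_attention_heads)  # attention heads per layer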

Training

The model was trained on three main types of financial documents:

  • Corporate Reports (10-K & 10-Q): 2.5 billion tokens
  • Earnings Call Transcripts: 1.3 billion tokens
  • Analyst Reports: 1.1 billion tokens

The model can be fine-tuned for specific tasks such as financial sentiment analysis, ESG classification, and forward-looking statement classification.
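
As an illustration of fine-tuning, the sketch below attaches a classification head to the pre-trained checkpoint and takes a single gradient step. The 3-class sentiment label scheme and the example sentence are assumptions for demonstration, not the authors' published recipe:

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("yiyanghkust/finbert-pretrain")
    # A new, randomly initialized classification head is attached on top of
    # the pre-trained encoder; transformers will warn about the fresh weights.
    model = AutoModelForSequenceClassification.from_pretrained(
        "yiyanghkust/finbert-pretrain", num_labels=3
    )

    # Toy labeled example (0 = negative, 1 = neutral, 2 = positive; assumed scheme).
    inputs = tokenizer(
        "Quarterly revenue grew 12% year over year.", return_tensors="pt"
    )
    labels = torch.tensor([2])

    # One optimization step; a real run would iterate over a labeled dataset.
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()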

Guide: Running Locally

To run FinBERT locally, follow these basic steps:

  1. Install Dependencies: Ensure Python and PyTorch are installed. Use pip to install the Transformers library.
    pip install transformers
    
  2. Load the Model:
    from transformers import AutoModelForMaskedLM, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("yiyanghkust/finbert-pretrain")
    model = AutoModelForMaskedLM.from_pretrained("yiyanghkust/finbert-pretrain")
    
  3. Use the Model: Tokenize your input and run inference; see the fill-mask sketch after this list.
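
Since the checkpoint is a masked language model, a quick way to exercise it is the fill-mask pipeline. A minimal sketch; the example sentence is an assumption for illustration:

    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="yiyanghkust/finbert-pretrain")

    # Print the most likely tokens for the masked position.
    for pred in fill_mask("Net income for the quarter [MASK] analyst expectations."):
        print(f"{pred['token_str']!r}: {pred['score']:.3f}")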

For large datasets or compute-intensive tasks such as fine-tuning, consider a cloud GPU service such as AWS, Google Cloud, or Azure.
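
If a local GPU is available, inference only requires explicit device placement. A minimal sketch of manual masked-token prediction that falls back to CPU; the input sentence is illustrative:

    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained("yiyanghkust/finbert-pretrain")
    model = AutoModelForMaskedLM.from_pretrained("yiyanghkust/finbert-pretrain").to(device)
    model.eval()

    inputs = tokenizer("Revenue [MASK] in the third quarter.", return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**inputs).logits

    # Locate the [MASK] position and decode the highest-scoring token.
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    print(tokenizer.decode(logits[0, mask_pos].argmax(-1)))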

License

The license for FinBERT is not specified in this model card. Refer to the official repository or contact the authors for licensing information.
