CryptoBERT
Introduction
CryptoBERT is a pre-trained natural language processing (NLP) model designed for analyzing the language and sentiment of social media posts related to cryptocurrencies. It builds on vinai's BERTweet-base model and is fine-tuned on a large corpus of cryptocurrency-related social media posts.
Architecture
CryptoBERT is based on the BERT architecture and is optimized for sentiment classification in the cryptocurrency domain. It accepts sequences of up to 514 tokens, although a maximum sequence length of 128 is recommended for best performance.
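The 128-token recommendation can be enforced before inference by clipping token-id sequences. A minimal sketch; the helper name is illustrative, not part of the model's API:

```python
# MAX_LEN follows the model card's recommended maximum sequence length.
MAX_LEN = 128

def truncate_ids(token_ids, max_len=MAX_LEN):
    """Return at most the first max_len token ids (illustrative helper)."""
    return token_ids[:max_len]

# A 200-token sequence is clipped to the first 128 ids; shorter ones pass through.
clipped = truncate_ids(list(range(200)))
```

In practice the same effect is achieved by passing `truncation=True` and `max_length` to the tokenizer or pipeline, as shown in the guide below.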
Training
The model was fine-tuned on a balanced dataset of 2 million labeled StockTwits posts, categorized into "Bearish" (0), "Neutral" (1), and "Bullish" (2) sentiments. The training corpus consisted of 3.2 million unique posts longer than four words, sourced from StockTwits, Telegram, Reddit, and Twitter.
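The numeric class ids above can be mapped back to readable sentiment names. A minimal sketch; the mapping follows the ids listed above, and the function name is illustrative:

```python
# Class ids as documented above: 0 = Bearish, 1 = Neutral, 2 = Bullish.
ID2LABEL = {0: "Bearish", 1: "Neutral", 2: "Bullish"}

def label_name(pred_id: int) -> str:
    """Translate a predicted class id into its sentiment name."""
    return ID2LABEL[pred_id]
```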
Guide: Running Locally
- Dependencies: Ensure you have Python and the `transformers` library installed.
- Model Setup: Load the CryptoBERT model and tokenizer:

```python
from transformers import TextClassificationPipeline, AutoModelForSequenceClassification, AutoTokenizer

model_name = "ElKulako/cryptobert"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
```
- Pipeline Creation: Set up the text classification pipeline:

```python
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer,
                                  max_length=64, truncation=True,
                                  padding='max_length')
```
- Inference Example: Analyze sentiment for a list of posts:

```python
posts = ["post_1 content", "post_2 content", "post_3 content"]
preds = pipe(posts)
print(preds)
```
- Cloud GPU Suggestion: For efficient processing, consider using cloud services with GPU support, such as AWS EC2 with NVIDIA GPUs or Google Cloud's AI Platform.
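On a GPU-equipped machine, the pipeline can be moved onto the GPU via the `device` argument accepted by `transformers` pipelines. A minimal sketch, assuming PyTorch is installed:

```python
import torch

# device=0 selects the first CUDA GPU; -1 keeps the pipeline on the CPU.
device = 0 if torch.cuda.is_available() else -1

# Pass it when building the pipeline, e.g.:
# pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer,
#                                   max_length=64, truncation=True,
#                                   padding='max_length', device=device)
```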
License
CryptoBERT is released under the MIT License, allowing for flexible use, modification, and distribution.