Industry Classification Model (sampathkethineedi/industry-classification)
Introduction
The Industry Classification model uses a DistilBERT architecture to classify business descriptions into one of 62 industry tags. It has been trained on 7,000 samples of business descriptions from companies in India.
Architecture
The model is based on the DistilBERT architecture, which is a smaller, faster, and lighter version of BERT, designed for efficient performance while maintaining accuracy. It is compatible with both PyTorch and TensorFlow frameworks.
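To make the architecture concrete, the sketch below instantiates DistilBERT's sequence-classification head from a bare config with 62 labels (matching the tag count stated above) without downloading any weights. This is an illustration of the model class only; the real checkpoint's weights come from the Hub as shown in the guide below.

```python
# Sketch: DistilBERT with a sequence-classification head, built from a bare
# config (randomly initialised, no download). num_labels=62 mirrors the
# 62 industry tags this model card describes.
from transformers import DistilBertConfig, DistilBertForSequenceClassification

config = DistilBertConfig(num_labels=62)
model = DistilBertForSequenceClassification(config)
print(model.config.num_labels)  # 62
```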
Training
The dataset used for training consists of business descriptions and their associated industry labels, specifically from Indian companies. The training data is not diverse in terms of geographic representation, which may introduce bias and limit the model's applicability to non-Indian companies.
Guide: Running Locally
To use this model locally, follow these steps:
- Install the Transformers library:
  pip install transformers
- Load the model and tokenizer:
  from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
  tokenizer = AutoTokenizer.from_pretrained("sampathkethineedi/industry-classification")
  model = AutoModelForSequenceClassification.from_pretrained("sampathkethineedi/industry-classification")
- Create a classification pipeline (Transformers treats 'sentiment-analysis' as an alias for generic text classification, so it serves for industry classification here):
  industry_tags = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
- Classify a business description:
  industry_tags("Stellar Capital Services Limited is an India-based non-banking financial company ... loan against property, management consultancy, personal loans and unsecured loans.")
The output is a list containing a dictionary with the predicted industry label and its confidence score, for example:
[{'label': 'Consumer Finance', 'score': 0.9841355681419373}]
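Since the pipeline returns a list of predictions like the sample above, a small post-processing helper can pick out the top tag and discard low-confidence results. The function and threshold below are illustrative assumptions, not part of the model's API:

```python
# Illustrative helper (not part of the model card's API): select the
# highest-confidence industry tag from pipeline output, or None if no
# prediction clears the chosen confidence threshold.
def top_industry(results, threshold=0.5):
    best = max(results, key=lambda r: r["score"])
    if best["score"] < threshold:
        return None
    return best["label"], best["score"]

# Sample output copied from the example above
sample = [{"label": "Consumer Finance", "score": 0.9841355681419373}]
print(top_industry(sample))  # ('Consumer Finance', 0.9841355681419373)
```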
For optimal performance, consider using cloud GPUs such as those available on AWS, Google Cloud Platform, or Azure.
License
The model is licensed under the MIT license, allowing for broad usage and modification with attribution.