multilingual cpv sector classifier
MKaan
Introduction
The multilingual-cpv-sector-classifier model is a fine-tuned version of bert-base-multilingual-cased designed to classify procurement descriptions into 45 sector classes based on CPV (Common Procurement Vocabulary) code descriptions. It accepts text input in the 104 languages covered by the base model, is tailored for the European Union's public procurement domain, and achieves an F1 score of 0.686 on the evaluation set.
Architecture
The model is built upon the bert-base-multilingual-cased architecture, which is capable of processing text in multiple languages. It classifies procurement descriptions into categories such as administrative services, agricultural products, construction work, and various other sectors defined by CPV codes.
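As a rough illustration of the final classification step, assuming the usual sequence-classification setup in which the model emits one raw score (logit) per class and a softmax turns those scores into probabilities, the sketch below picks the most likely of the 45 sectors. The logits are toy values, not real model outputs, and the class index used is arbitrary.

```python
import math

NUM_SECTORS = 45  # number of sector classes stated in this document


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_sector(logits):
    """Return (class index, probability) of the highest-scoring sector."""
    if len(logits) != NUM_SECTORS:
        raise ValueError(f"expected {NUM_SECTORS} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(NUM_SECTORS), key=probs.__getitem__)
    return best, probs[best]


# Toy logits: every class scores 0.0 except one arbitrary index.
toy_logits = [0.0] * NUM_SECTORS
toy_logits[7] = 5.0
index, prob = predict_sector(toy_logits)
```

Mapping the winning index back to a sector name depends on the label table shipped with the model, which is not reproduced here.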
Training
Training and Evaluation Data
- The dataset comprises 744,360 entries, split into training and validation sets in an 80%/20% ratio.
- The data includes contract notice descriptions awarded between 2011 and 2018, written in 22 European languages.
- Maltese and Irish were excluded because too little data was available in those languages.
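The data preparation described above can be sketched as a language filter followed by an 80%/20% split. This is a minimal illustration, not the original preprocessing: the record fields (lang, text), the ISO 639-1 codes, the random shuffle, and the seed are all assumptions made for the example.

```python
import random

# Assumed ISO 639-1 codes for the two excluded languages (Maltese, Irish).
EXCLUDED_LANGUAGES = {"mt", "ga"}


def filter_and_split(records, train_fraction=0.8, seed=42):
    """Drop excluded languages, shuffle, and split into train/validation."""
    kept = [r for r in records if r["lang"] not in EXCLUDED_LANGUAGES]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(kept)
    cut = round(len(kept) * train_fraction)
    return kept[:cut], kept[cut:]


# Toy records standing in for contract notice descriptions.
languages = ["en", "pl", "de", "mt", "fr", "ga", "es", "it", "nl", "pt", "sv", "da"]
toy_records = [{"lang": lang, "text": f"notice {i}"} for i, lang in enumerate(languages)]

train, valid = filter_and_split(toy_records)
```

With the twelve toy records, two are filtered out and the remaining ten are divided eight to two.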
Training Procedure
- The model was trained on a Google Cloud TPU v3-8.
- Hyperparameters included a learning rate of 2e-05, 3 epochs, 8 gradient accumulation steps, and a total batch size of 32.
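To make the batch arithmetic concrete, the sketch below derives two figures from the stated hyperparameters. It assumes that "total batch size" means the number of examples per optimizer update, and that the training set is exactly 80% of the 744,360 entries. How each accumulated step was distributed across TPU cores is not stated in the source, so the per-step figure is a total, not a per-device value.

```python
# Values stated in this document.
LEARNING_RATE = 2e-05
NUM_EPOCHS = 3
GRADIENT_ACCUMULATION_STEPS = 8
TOTAL_BATCH_SIZE = 32
DATASET_SIZE = 744_360


def examples_per_accumulation_step(total_batch, accumulation_steps):
    """Examples processed in each forward/backward pass before an update."""
    if total_batch % accumulation_steps:
        raise ValueError("total batch must be divisible by accumulation steps")
    return total_batch // accumulation_steps


def optimizer_steps(num_examples, total_batch, epochs):
    """Number of weight updates over the whole run (ceiling per epoch)."""
    steps_per_epoch = -(-num_examples // total_batch)  # ceiling division
    return steps_per_epoch * epochs


# Assumption: the training split is exactly 80% of the dataset.
train_examples = DATASET_SIZE * 80 // 100
per_step = examples_per_accumulation_step(TOTAL_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS)
total_updates = optimizer_steps(train_examples, TOTAL_BATCH_SIZE, NUM_EPOCHS)
```

Under these assumptions each accumulation step covers 4 examples, and three epochs over 595,488 training examples amount to 55,827 optimizer updates.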
Training Results
The model's performance was evaluated across different languages, with Polish achieving the highest F1 score of 0.759.
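A per-language breakdown like this can be computed by grouping predictions by language and scoring each group separately. The sketch below uses macro-averaged F1 as an assumption, since the source does not say which F1 averaging was reported. The rows and sector names are toy values, not evaluation data.

```python
from collections import defaultdict


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in the group."""
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def f1_by_language(rows):
    """rows: iterable of (language, true_label, predicted_label) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for lang, true_label, pred_label in rows:
        grouped[lang][0].append(true_label)
        grouped[lang][1].append(pred_label)
    return {lang: macro_f1(t, p) for lang, (t, p) in grouped.items()}


# Toy rows with hypothetical sector names.
toy_rows = [
    ("pl", "construction", "construction"),
    ("pl", "agriculture", "agriculture"),
    ("en", "construction", "construction"),
    ("en", "construction", "agriculture"),
]
scores = f1_by_language(toy_rows)
```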
Guide: Running Locally
To run the model locally, follow these steps:
- Install Dependencies: Ensure you have Python and PyTorch installed, along with the Transformers library from Hugging Face.
- Download the Model: Access the model via Hugging Face's Model Hub and download it using the transformers library.
- Prepare Input Data: Format your input procurement descriptions in one of the 104 supported languages.
- Run Inference: Use the model to classify your procurement descriptions into CPV sectors.
Cloud GPUs such as those offered by Google Cloud or AWS can be used to speed up inference if processing large volumes of data.
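The steps above might look like the following minimal sketch, assuming the dependencies were installed with pip install transformers torch. The Hub identifier MKaan/multilingual-cpv-sector-classifier is an assumption inferred from the model and author names, so check it on the Model Hub before use. The helper names and the sample descriptions are hypothetical.

```python
# Assumed Hub identifier, inferred from the model and author names.
MODEL_ID = "MKaan/multilingual-cpv-sector-classifier"


def load_classifier(model_id=MODEL_ID):
    """Download the model (on first use) and build a classification pipeline."""
    from transformers import pipeline  # imported lazily so helpers stay light

    return pipeline("text-classification", model=model_id)


def classify(classifier, descriptions):
    """Return (description, predicted label, score) for each input text."""
    results = classifier(descriptions)
    return [(text, r["label"], r["score"]) for text, r in zip(descriptions, results)]


def main():
    classifier = load_classifier()
    samples = [
        "Construction of a two-lane road bridge including foundations.",
        "Dostawa świeżych warzyw i owoców dla szkół podstawowych.",
    ]
    for text, label, score in classify(classifier, samples):
        print(f"{score:.3f}  {label}  <- {text}")
```

Calling main() downloads the weights on first use and prints one predicted sector per description. Long descriptions may exceed the base model's input length, so truncating them beforehand is a sensible precaution.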
License
The model is released under the Apache-2.0 license, allowing for wide usage and distribution with proper attribution.