bloom-deepspeed-inference-fp16
Introduction
This project provides fp16 weights for the BLOOM model, optimized for efficient inference with DeepSpeed-MII and DeepSpeed-Inference. The weights are pre-sharded for tensor parallelism, allowing the model to run across multiple GPUs.
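As a concrete starting point, here is a minimal sketch of serving this checkpoint through the legacy DeepSpeed-MII `deploy`/`mii_query_handle` API. The deployment name, config values, and prompt are illustrative assumptions, not taken from this repository.

```python
# Minimal sketch: serving the checkpoint with the legacy DeepSpeed-MII API.
# Deployment name, config values, and prompt are assumptions.
import mii

mii.deploy(
    task="text-generation",
    model="microsoft/bloom-deepspeed-inference-fp16",
    deployment_name="bloom-fp16-deployment",            # hypothetical name
    mii_config={"dtype": "fp16", "tensor_parallel": 8},
)

# Query the running deployment.
generator = mii.mii_query_handle("bloom-fp16-deployment")
result = generator.query({"query": ["DeepSpeed is"]})
print(result)
```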
Architecture
The original BLOOM weights are pre-split into 8 tensor-parallel shards, one per GPU, so the model is designed to run on 8 GPUs. This setup leverages DeepSpeed's tensor parallelism to improve inference latency and memory efficiency.
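To make the sharding concrete, the toy snippet below (not DeepSpeed code) column-splits a single weight matrix across 8 ranks the way tensor parallelism does; the dimensions follow BLOOM-176B's hidden size, and meta tensors are used so no memory is actually allocated.

```python
# Toy illustration of 8-way tensor parallelism on one weight matrix.
# DeepSpeed performs this split (plus the matching all-reduce of partial
# results) inside its inference engine; this is only a shape demo.
import torch

hidden = 14336                                      # BLOOM-176B hidden size
w = torch.empty(hidden, 4 * hidden, device="meta")  # one MLP projection; meta = no allocation
shards = torch.chunk(w, 8, dim=1)                   # one column shard per GPU
print([tuple(s.shape) for s in shards])             # each rank holds 1/8 of the columns
```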
Training
While this repository focuses on inference, the original BLOOM model was trained to perform a wide range of natural language processing tasks. Detailed training information is available in the original BLOOM model card on Hugging Face.
Guide: Running Locally
- Installation: Ensure you have DeepSpeed installed. Follow the instructions in the DeepSpeed-MII GitHub repository for setup.
- Download Model: Clone this repository and download the model shards.
- Configuration: Set up your environment to utilize 8 GPUs for parallel processing.
- Run Inference: Follow the DeepSpeed-Inference tutorial to execute the model; a minimal end-to-end sketch is shown after this list.
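Assuming 8 GPUs and the standard Hugging Face loading path, a run might look like the sketch below. The script name, prompt, and generation settings are assumptions, and the DeepSpeed-Inference tutorial remains the authoritative reference, in particular for pointing DeepSpeed directly at the pre-sharded checkpoint files.

```python
# run_bloom.py — illustrative end-to-end sketch; launch with:
#   deepspeed --num_gpus 8 run_bloom.py
# Loading via from_pretrained is a simplification; see the tutorial for
# loading this repository's pre-sharded shards through DeepSpeed directly.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "microsoft/bloom-deepspeed-inference-fp16"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

# Wrap the model in DeepSpeed's inference engine: mp_size=8 shards the
# weights across the 8 GPUs, and kernel injection swaps in fused kernels.
model = deepspeed.init_inference(
    model,
    mp_size=8,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```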
Cloud GPUs
Consider using cloud GPU services such as AWS, Google Cloud Platform, or Azure for running the model efficiently, especially if you do not have access to 8 local GPUs.
License
This project is licensed under the bigscience-bloom-rail-1.0 license. For more information, refer to the license details provided in the repository.