Introduction

Sapiens is a family of models developed by Meta for human-centric vision tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. The models are pretrained on a large dataset of over 300 million human images and are designed for high-resolution inference, with fine-tuning available for each downstream task. They generalize well to real-world data even when labeled data is scarce, and performance improves consistently as model size grows.

Architecture

Sapiens models are Vision Transformers released at four scales, from 0.3 billion to 2 billion parameters (Sapiens-0.3B, -0.6B, -1B, and -2B). The architecture is designed to scale, and the larger variants consistently outperform prior methods on human-centric vision benchmarks.

Training

The models are pretrained on this large corpus of human images and then fine-tuned per task. Pose estimation supports several keypoint configurations, and body-part segmentation supports multiple class vocabularies. Because the pretrained backbone is shared across tasks, adapting to a new task amounts to fine-tuning with a task-specific head, as sketched below.
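
The following is a minimal, self-contained PyTorch sketch of that fine-tuning pattern, not the official Sapiens training code: DummyEncoder stands in for the pretrained ViT backbone, and the embedding dimension, keypoint count, and 1024x768 input size are assumptions rather than the official configuration.

    import torch
    import torch.nn as nn

    class DummyEncoder(nn.Module):
        """Placeholder for the pretrained ViT backbone; emits a feature map."""
        def __init__(self, embed_dim: int = 1024):
            super().__init__()
            # A patchify-style conv stands in for the real transformer encoder.
            self.patchify = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.patchify(x)  # (B, embed_dim, H/16, W/16)

    class KeypointHead(nn.Module):
        """Task-specific head: 1x1 conv from features to per-keypoint heatmaps."""
        def __init__(self, embed_dim: int, num_keypoints: int):
            super().__init__()
            self.proj = nn.Conv2d(embed_dim, num_keypoints, kernel_size=1)

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            return self.proj(feats)

    encoder = DummyEncoder()                               # would be the pretrained backbone
    head = KeypointHead(embed_dim=1024, num_keypoints=17)  # keypoint count varies by config
    params = list(encoder.parameters()) + list(head.parameters())
    optimizer = torch.optim.AdamW(params, lr=1e-4)

    images = torch.randn(2, 3, 1024, 768)   # dummy batch; real input size is task-dependent
    targets = torch.randn(2, 17, 64, 48)    # dummy heatmap targets matching head output
    loss = nn.functional.mse_loss(head(encoder(images)), targets)
    loss.backward()
    optimizer.step()

Swapping KeypointHead for a segmentation, depth, or normal-prediction head follows the same pattern; only the output channels and loss change.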

Guide: Running Locally

  1. Install Dependencies: Ensure you have Python and PyTorch installed on your system.
  2. Clone Repository: Clone the Sapiens repository from GitHub using git clone https://github.com/facebookresearch/sapiens.
  3. Download Model Checkpoints: Access the model checkpoints through the provided links and download the desired model format (Original, TorchScript, or BFloat16).
  4. Set Up Environment: Configure your Python environment to include the necessary libraries, such as torch and transformers.
  5. Run Inference: Load the downloaded checkpoint and run inference on your images (a minimal sketch follows this list).
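
The snippet below sketches step 5 for a TorchScript checkpoint. The checkpoint filename, the 1024x768 input size, and the ImageNet normalization statistics are assumptions; substitute the file you downloaded and the preprocessing the repository specifies for your task.

    import torch
    from PIL import Image
    from torchvision import transforms

    # Load a TorchScript checkpoint (filename is illustrative).
    model = torch.jit.load("sapiens_1b_pose_torchscript.pt", map_location="cpu")
    model.eval()

    preprocess = transforms.Compose([
        transforms.Resize((1024, 768)),          # input size is an assumption
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet stats; an assumption
                             std=[0.229, 0.224, 0.225]),
    ])

    image = Image.open("person.jpg").convert("RGB")
    batch = preprocess(image).unsqueeze(0)       # (1, 3, 1024, 768)

    with torch.inference_mode():
        output = model(batch)  # task-dependent: heatmaps, segmentation logits, depth, or normals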

For the best throughput, especially with the BFloat16 checkpoints, use a GPU with native bfloat16 support, such as a cloud NVIDIA A100.
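
As a sketch, a loaded checkpoint can be run under bfloat16 autocast. The filename is again illustrative; note that autocast coverage for TorchScript modules has caveats, so explicitly casting the inputs (and a BFloat16 checkpoint's weights) is an alternative.

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.jit.load("sapiens_1b_pose_torchscript.pt", map_location=device)
    model.eval()

    batch = torch.randn(1, 3, 1024, 768, device=device)  # 1024x768 input is an assumption

    # Run ops in bfloat16 where supported; pre-Ampere GPUs may be slow or unsupported.
    with torch.inference_mode(), torch.autocast(device_type=device, dtype=torch.bfloat16):
        output = model(batch)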

License

Sapiens models are released under the Creative Commons Attribution-NonCommercial 4.0 license (CC BY-NC 4.0), which permits sharing and adaptation with attribution but prohibits commercial use.