We provide all the original ViTPose models, converted for inference, each producing output in a single dataset format.
In addition, we provide a COCO-25 model, trained on the original COCO dataset plus the feet keypoints from https://cmu-perceptual-computing-lab.github.io/foot_keypoint_dataset/
Finetuning is not currently supported; check commit de43d54cad87404cf0ad4a7b5da6bacf4240248b and earlier commits for a working state of train.py.
Warning
Ultralytics yolov8 has an issue with wrong bounding boxes when using mps; upgrade to the latest version! (Works correctly on 8.2.48)
people_out.mp4
zebra_out.mp4
(Credits dance: https://www.youtube.com/watch?v=p-rSdt0aFuw )
(Credits zebras: https://www.youtube.com/watch?v=y-vELRYS8Yk )
- Image / Video / Webcam support
- Video support using SORT algorithm to track bboxes between frames
- Torch / ONNX / Tensorrt inference
- Runs the original VitPose checkpoints from ViTAE-Transformer/ViTPose
- 4 ViTPose architectures with different sizes and performances (s: small, b: base, l: large, h: huge)
- Multi skeleton and dataset: (AIC / MPII / COCO / COCO FEET / COCO WHOLEBODY / APT36k / AP10k)
- Human / Animal pose estimation
- cpu / gpu / metal support
- show and save images / videos and output to json
We run YOLOv8 for detection, which does not provide complete animal detection. You can finetune a custom yolo model to detect the animal you are interested in; if you do, please open an issue, as we might want to integrate other models for detection.
You can expect realtime performance (>30 fps) with modern nvidia gpus and apple silicon (using metal!).
There are multiple skeletons for the different datasets. Check their definitions in visualization.py.
Important
Install torch>2.0 with cuda / mps support by yourself. Also check requirements_gpu.txt.
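To confirm which backend your torch build exposes, a quick check like the one below can help. It is a minimal sketch and not part of this repository: `pick_device` and `detect_device` are hypothetical helpers that mirror the cuda -> mps -> cpu preference described for `VitInference` when `device` is None.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device name in cuda -> mps -> cpu order."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query the installed torch build; fall back to 'cpu' if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is absent on old builds, so look it up defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


if __name__ == "__main__":
    print("Selected device:", detect_device())
```

If this prints `cpu` on a machine with a GPU, the installed torch wheel was most likely built without cuda / mps support.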
git clone git@github.com:JunkyByte/easy_ViTPose.git
cd easy_ViTPose/
pip install -e .
pip install -r requirements.txt
- Download the models from Huggingface
We provide torch models for every dataset and architecture.
If you want to run onnx / tensorrt inference, download the appropriate torch ckpt and use export.py to convert it.
You can use the ultralytics yolo export command to export yolo to onnx and tensorrt as well.
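The help texts of export.py and inference.py state that the dataset is extracted from the file name when `--dataset` is not given. The sketch below illustrates the idea only; it is not the repository's code. It assumes a `vitpose-<size>-<dataset>` naming pattern, inferred from the checkpoint names used in this README (`vitpose-s-coco_25.pth`, `vitpose-l-ap10k.onnx`), and `parse_ckpt_name` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional, Tuple

# Dataset names and model sizes as listed in the CLI help text.
KNOWN_DATASETS = ("coco", "coco_25", "wholebody", "mpii", "ap10k", "apt36k", "aic")
MODEL_SIZES = ("s", "b", "l", "h")


def parse_ckpt_name(path: str) -> Tuple[Optional[str], Optional[str]]:
    """Guess (model size, dataset) from a name such as 'vitpose-s-coco_25.pth'.

    ASSUMPTION: names follow 'vitpose-<size>-<dataset>.<ext>'. Any part that
    does not match is returned as None.
    """
    parts = Path(path).stem.lower().split("-")
    if len(parts) < 3 or parts[0] != "vitpose":
        return None, None
    size = parts[1] if parts[1] in MODEL_SIZES else None
    dataset = parts[2] if parts[2] in KNOWN_DATASETS else None
    return size, dataset


print(parse_ckpt_name("./ckpts/vitpose-s-coco_25.pth"))  # ('s', 'coco_25')
```

If you rename a checkpoint so that it no longer carries this information, pass `--model-name` and `--dataset` explicitly.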
$ python export.py --help
usage: export.py [-h] --model-ckpt MODEL_CKPT --model-name {s,b,l,h} [--output OUTPUT] [--dataset DATASET]
optional arguments:
-h, --help show this help message and exit
--model-ckpt MODEL_CKPT
The torch model that shall be used for conversion
--model-name {s,b,l,h}
[s: ViT-S, b: ViT-B, l: ViT-L, h: ViT-H]
--output OUTPUT File (without extension) or dir path for checkpoint output
--dataset DATASET Name of the dataset. If None it"s extracted from the file name. ["coco", "coco_25",
"wholebody", "mpii", "ap10k", "apt36k", "aic"]
To run inference from the command line, use the inference.py script as follows:
$ python inference.py --help
usage: inference.py [-h] [--input INPUT] [--output-path OUTPUT_PATH] --model MODEL [--yolo YOLO] [--dataset DATASET]
[--det-class DET_CLASS] [--model-name {s,b,l,h}] [--yolo-size YOLO_SIZE]
[--conf-threshold CONF_THRESHOLD] [--rotate {0,90,180,270}] [--yolo-step YOLO_STEP]
[--single-pose] [--show] [--show-yolo] [--show-raw-yolo] [--save-img] [--save-json]
optional arguments:
-h, --help show this help message and exit
--input INPUT path to image / video or webcam ID (=cv2)
--output-path OUTPUT_PATH
output path, if the path provided is a directory output files are "input_name
_result{extension}".
--model MODEL checkpoint path of the model
--yolo YOLO checkpoint path of the yolo model
--dataset DATASET Name of the dataset. If None it"s extracted from the file name. ["coco", "coco_25",
"wholebody", "mpii", "ap10k", "apt36k", "aic"]
--det-class DET_CLASS
["human", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe",
"animals"]
--model-name {s,b,l,h}
[s: ViT-S, b: ViT-B, l: ViT-L, h: ViT-H]
--yolo-size YOLO_SIZE
YOLOv8 image size during inference
--conf-threshold CONF_THRESHOLD
Minimum confidence for keypoints to be drawn. [0, 1] range
--rotate {0,90,180,270}
Rotate the image of [90, 180, 270] degress counterclockwise
--yolo-step YOLO_STEP
The tracker can be used to predict the bboxes instead of yolo for performance, this flag
specifies how often yolo is applied (e.g. 1 applies yolo every frame). This does not have any
effect when is_video is False
--single-pose Do not use SORT tracker because single pose is expected in the video
--show preview result during inference
--show-yolo draw yolo results
--show-raw-yolo draw yolo result before that SORT is applied for tracking (only valid during video inference)
--save-img save image results
--save-json save json results
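The `--yolo-step` flag trades detection accuracy for speed: the detector runs only on some frames and the tracker predicts the boxes for the rest. As a rough illustration (an assumption about the schedule, not the repository's actual logic), the hypothetical helper below lists the frames on which the detector would run if it fired on frame 0 and then every `yolo_step` frames.

```python
from typing import List


def detector_frames(num_frames: int, yolo_step: int) -> List[int]:
    """Indices of frames on which the detector would run.

    ASSUMPTION: detection runs on frame 0 and then every `yolo_step` frames;
    the tracker is assumed to supply the boxes for the frames in between.
    """
    if yolo_step < 1:
        raise ValueError("yolo_step must be >= 1")
    return list(range(0, num_frames, yolo_step))


print(detector_frames(10, 1))  # every frame: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(detector_frames(10, 3))  # [0, 3, 6, 9]
```

Larger steps reduce the number of yolo calls but rely more heavily on the tracker, so fast-moving subjects may need a smaller value.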
You can run inference from code as follows:
import cv2
from easy_ViTPose import VitInference
# Image to run inference RGB format
img = cv2.imread('./examples/img1.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
# Set is_video=True to enable tracking in video inference
# Be sure to call VitInference.reset() to reset the tracker after each video
# A few flags allow you to customize VitInference; be sure to check the class definition
model_path = './ckpts/vitpose-s-coco_25.pth'
yolo_path = './yolov8s.pt'
# If you want to use MPS (on new macbooks) use the torch checkpoints for both ViTPose and Yolo
# If device is None will try to use cuda -> mps -> cpu (otherwise specify 'cpu', 'mps' or 'cuda')
# dataset and det_class parameters can be inferred from the ckpt name, but you can specify them.
model = VitInference(model_path, yolo_path, model_name='s', yolo_size=320, is_video=False, device=None)
# Infer keypoints, output is a dict where keys are person ids and values are keypoints (np.ndarray (25, 3): (y, x, score))
# If is_video=True the IDs will be consistent among the ordered video frames.
keypoints = model.inference(img)
# call model.reset() after each video
img = model.draw(show_yolo=True) # Returns RGB image with drawings
cv2.imshow('image', cv2.cvtColor(img, cv2.COLOR_RGB2BGR))
cv2.waitKey(0)
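The returned keypoints are stored as (y, x, score), while most drawing calls expect (x, y). The following sketch shows one way to post-process the result; `confident_points` is a hypothetical helper, the 0.5 default threshold is an arbitrary choice, and the sample data is hand-written rather than real model output. It iterates row by row, so it should work with both nested lists and `np.ndarray` values.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def confident_points(
    keypoints: Dict[int, Iterable[Sequence[float]]],
    conf_threshold: float = 0.5,
) -> Dict[int, List[Tuple[int, float, float]]]:
    """Keep (index, x, y) for keypoints whose score reaches the threshold.

    Input rows are (y, x, score); the output swaps them to (x, y).
    """
    kept = {}
    for person_id, rows in keypoints.items():
        kept[person_id] = [
            (idx, float(x), float(y))
            for idx, (y, x, score) in enumerate(rows)
            if score >= conf_threshold
        ]
    return kept


# Hand-written sample in the documented layout (not real model output).
sample = {0: [[121.19, 458.15, 0.99], [110.02, 469.43, 0.30]]}
print(confident_points(sample))  # {0: [(0, 458.15, 121.19)]}
```

Keeping the keypoint index in the output lets you look up the joint name in the skeleton definition after filtering.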
Note
If the input file is a video, SORT is used to track people IDs and output consistent identifications.
The output format of the json files:
{
"keypoints":
[ # The list of frames, len(json['keypoints']) == len(video)
{ # For each frame a dict
"0": [ # keys are id to track people and value the keypoints
[121.19, 458.15, 0.99], # Each keypoint is (y, x, score)
[110.02, 469.43, 0.98],
[110.86, 445.04, 0.99],
],
"1": [
...
],
},
{
"0": [
[122.19, 458.15, 0.91],
[105.02, 469.43, 0.95],
[122.86, 445.04, 0.99],
],
"1": [
...
]
}
],
"skeleton":
{ # Skeleton reference, key the idx, value the name
"0": "nose",
"1": "left_eye",
"2": "right_eye",
"3": "left_ear",
"4": "right_ear",
"5": "neck",
...
}
}
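A saved file in this layout can be read back with the standard library. The sketch below is illustrative only: `named_keypoints` is a hypothetical helper, and the embedded document is a tiny hand-written sample (without the explanatory comments and `...` placeholders shown above, which are not valid JSON).

```python
import json
from typing import Dict, Tuple


def named_keypoints(
    result: dict, frame_idx: int, person_id: str
) -> Dict[str, Tuple[float, ...]]:
    """Map skeleton names to (y, x, score) for one tracked id in one frame."""
    frame = result["keypoints"][frame_idx]
    skeleton = result["skeleton"]
    return {
        skeleton[str(idx)]: tuple(kpt)
        for idx, kpt in enumerate(frame[person_id])
    }


# Tiny hand-written document in the layout described above (not real output).
raw = """
{
  "keypoints": [
    {"0": [[121.19, 458.15, 0.99], [110.02, 469.43, 0.98]]}
  ],
  "skeleton": {"0": "nose", "1": "left_eye"}
}
"""
result = json.loads(raw)
print(named_keypoints(result, frame_idx=0, person_id="0"))
```

For a file on disk, replace `json.loads(raw)` with `json.load(f)` on an opened file handle. Note that person ids are JSON object keys and therefore strings.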
Finetuning is possible but not officially supported right now. If you would like to finetune and need help, open an issue.
You can check train.py, datasets/COCO.py and config.yaml for details.
- Download COCO dataset images and labels
  - 2017 Val images [5K/1GB]: http://images.cocodataset.org/zips/val2017.zip
    The extracted directory looks like this:
    val2017/
    ├── 000000000139.jpg
    ├── 000000000285.jpg
    ├── 000000000632.jpg
    └── ...
  - 2017 Train/Val annotations [241MB]: http://images.cocodataset.org/annotations/annotations_trainval2017.zip
    The extracted directory looks like this:
    annotations/
    ├── person_keypoints_val2017.json
    ├── person_keypoints_train2017.json
    └── ...
- Run the following command:
  $ python evaluation_on_coco.py
  Command line arguments:
  --model_path: Path to the pretrained ViT Pose model
  --yolo_path: Path to the YOLOv8 model
  --img_folder_path: Path to the directory containing COCO val images (/val2017 extracted in step 1)
  --annFile: Path to json file for COCO keypoints for val set (annotations/person_keypoints_val2017.json extracted in step 1)
The system may be built in a container using Docker. This is intended to demonstrate container-wise inference; adapt it to your own needs by changing models and skeletons:
docker build . -t easy_vitpose
The image is based on NVIDIA's PyTorch image, which is 20GB large. If you have a compatible GPU set up with NVIDIA Container Toolkit, ViTPose will run with hardware acceleration.
To test an example, create a folder called cats with a picture of a cat as image.jpg.
Run ./models/download.sh to fetch the large yolov8 and ap10k ViTPose models. Then run inference using the following command (replace with the correct cats and models paths):
docker run --gpus all --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v ./models:/models -v ~/cats:/cats easy_vitpose python inference.py --det-class cat --input /cats/image.jpg --output-path /cats --save-img --model /models/vitpose-l-ap10k.onnx --yolo /models/yolov8l.pt
The result image may be viewed in your cats folder.
- refactor finetuning (currently not available)
- benchmark and check bottlenecks of inference pipeline
- parallel batched inference
- other minor fixes
- yolo version for animal pose, check #18
- solve cuda exceptions on script exit when using tensorrt (no idea how)
- add information about what is inferred during inference, better output of inference status (device etc)
- check if it is possible to make colab work without runtime restart
Feel free to open issues, pull requests and contribute on these TODOs.
Thanks to the VitPose authors and their official implementation ViTAE-Transformer/ViTPose.
The SORT code is taken from abewley/sort