The suggested approach is to use a masked autoencoder (MAE) pre-trained on a facial pose dataset. There are existing implementations we can build on:

* [MAE](https://github.com/facebookresearch/mae)
* [VideoMAE](https://github.com/MCG-NJU/VideoMAE)
* [VideoPose3D](https://github.com/facebookresearch/VideoPose3D)

- [x] Implement a time-dilated CNN model (TDCNN)
- [x] Implement an MAE model with a TDCNN embedding and a ViT backbone
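As a rough sketch of how a TDCNN embedding could feed MAE-style masking, the snippet below runs dilated temporal convolutions over a pose sequence to produce per-frame tokens, then keeps a random subset of them for a ViT encoder. All array shapes, kernel sizes, dilation schedule, and the mask ratio are illustrative assumptions, not taken from the linked repositories.

```python
import numpy as np

rng = np.random.default_rng(0)

def dilated_temporal_conv(x, w, dilation):
    """Causal 1D convolution over time with the given dilation.
    x: (T, C_in) pose sequence; w: (K, C_in, C_out) kernel."""
    T, _ = x.shape
    K, _, C_out = w.shape
    pad = (K - 1) * dilation  # left-pad so output keeps length T
    xp = np.concatenate([np.zeros((pad, x.shape[1])), x], axis=0)
    out = np.zeros((T, C_out))
    for t in range(T):
        for k in range(K):
            out[t] += xp[t + pad - k * dilation] @ w[k]
    return out

def tdcnn_embed(x, kernels):
    """Stack of dilated conv layers (dilations 1, 2, 4, ...) with ReLU,
    doubling the temporal receptive field at each layer."""
    h = x
    for i, w in enumerate(kernels):
        h = np.maximum(dilated_temporal_conv(h, w, dilation=2 ** i), 0.0)
    return h  # (T, C) per-frame tokens for the ViT encoder

def random_mask(tokens, mask_ratio=0.75):
    """MAE-style masking: keep a random subset of tokens; the encoder
    sees only the visible ones, the decoder reconstructs the rest."""
    T = tokens.shape[0]
    n_keep = max(1, int(T * (1 - mask_ratio)))
    keep = np.sort(rng.permutation(T)[:n_keep])
    return tokens[keep], keep

# Toy example: 16 frames of 34-dim 2D pose (17 joints x 2 coordinates)
x = rng.normal(size=(16, 34))
kernels = [rng.normal(scale=0.1, size=(3, 34, 64)),
           rng.normal(scale=0.1, size=(3, 64, 64))]
tokens = tdcnn_embed(x, kernels)
visible, keep = random_mask(tokens, mask_ratio=0.75)
print(tokens.shape, visible.shape)  # (16, 64) (4, 64)
```

In a full model the visible tokens (plus positional embeddings) would go through the ViT backbone, and a lightweight decoder would reconstruct the masked frames, mirroring the image-patch setup in the MAE repository above.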