Recently, a plethora of pipelines have emerged to generate 3D clothed human avatars from single, in-the-wild images. However, all of them are limited to full-body, front-facing human images with simple poses and minimal occlusions or objects. To address these limitations, we propose a two-part pipeline consisting of inpainting and body fitting. The inpainting component combines keypoint detection with a novel keypoint estimation technique, LaMa for removing occluding objects, Stable Diffusion with ControlNets for generating missing regions, and a GAN inversion step to produce a seamless, plausible human reconstruction. The body fitting component uses an improved regressor and adds further losses to the iterative fitting stage to achieve a better human mesh fit in dynamic poses. In qualitative comparisons, our pipeline shows improvements in mesh textures and SMPL-X fit over previous methods.
Key failure cases for SIFU: human-object interactions, large occlusions, and highly dynamic poses.
Additional outputs from the inpainting process and the resulting improvements to the final normal maps.
Results from the improvements to the body fitting pipeline.