Before joining HUST, I worked in the Advanced Technology and Projects (ATAP) division at Google in Mountain View, USA. As a proud member of the Te'veren team, I collaborated with Rick Marks on advanced sensing and on-device intelligence using computer vision. Prior to Google, I served as a Principal Scientist at DGene, US, where I conducted research on real-time volumetric human capture systems.
I graduated from the University of Delaware in 2017, where I majored in Computer Science. At UDel, I worked with Professor Jingyi Yu on research problems in computational photography and scene understanding. In 2015, during my PhD, I interned at Adobe with the ACR team.
I am actively seeking creative and highly motivated MS and PhD students who are passionate about research.
We propose the Adaptable Motion Diffusion (AMD) model, which leverages a Large Language Model (LLM) to parse the input text into a sequence of concise and interpretable anatomical scripts that correspond to the target motion.
We propose to harness the capabilities of a Large Language Model (LLM) to decompose text descriptions into coherent directives adhering to stringent formats and progressively generate the target image.
We present a label-free white-box attack approach for ViT-based models that exhibits strong transferability to various black-box models by accelerating feature collapse.
We propose a Feature Pruning and Consolidation (FPC) framework to circumvent explicit human structure parsing, which consists of a sparse encoder, a global and local feature ranking module, and a feature consolidation decoder.
We formulate object detection and association jointly as a consistent denoising diffusion process from paired noisy boxes to paired ground-truth boxes.
We propose to conduct inverse volume rendering by representing a scene using a microflake volume, which assumes the space is filled with infinitely small flakes and light reflects or scatters at each spatial location according to microflake distributions.
We demonstrate that the basic vision transformer (ViT) architecture is sufficient for visual tracking when enhanced with correlative masked modeling for information aggregation.
We propose an Uncertainty Regulated Dual Memory Units (UR-DMU) model to learn both the representations of normal data and discriminative features of abnormal data.
CSWinTT is a new transformer architecture with multi-scale cyclic shifting window attention for visual object tracking, elevating the attention from pixel to window level.
ARTEMIS, the core of which is a neural-generated (NGI) animal engine, enables interactive motion control, real-time animation and photo-realistic rendering of furry animals.
SportsCap -- the first approach for simultaneously capturing 3D human motions and understanding fine-grained actions from challenging monocular sports video input.
We generate a global full-body template by registering all poses in the acquired motion sequence, and then construct a deformable graph by utilizing the rigid components in the global template.
We present a comprehensive theory on ray geometry transforms under light field pose variations, and derive the transforms of three typical ray manifolds.