Yifan Zhao is currently pursuing the Ph.D. degree supervised by Prof. Jia Li with the State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University. He received the B.E. degree from Harbin Institute of Technology in Jul. 2016. His research interests include semantic part segmentation, image/video synthesis, few-shot learning, and fine-grained visual classification.
Music-inspired Dancing Video Synthesis. Close your eyes and listen to music, and one can easily imagine a dancer moving rhythmically along with it. These dance movements are usually composed of movements one has seen before. In this paper, we propose to reproduce such an inherent capability of human beings within a computer vision system. The proposed system consists of three modules. To explore the relationship between music and dance movements, we propose a cross-modal alignment module that learns, from dance video clips accompanied by their music, to judge the consistency between the visual features of pose sequences and the acoustic features of music. The learnt model is then used in the imagination module to select a pose sequence for the given music. The pose sequence selected in this way, however, is usually discontinuous. To solve this problem, in the spatial-temporal alignment module we develop a spatial alignment algorithm that exploits the tendency and periodicity of dance movements to predict the dance movements between discontinuous fragments. In addition, the selected pose sequence is often misaligned with the music beat. To solve this problem, we further develop a temporal alignment algorithm to align the rhythm of the music and the dance. Finally, the processed pose sequence is used to synthesize realistic dance videos in the imagination module. The generated dance videos match both the content and the rhythm of the music. Experimental results and a user study show that the proposed approach can generate promising dance videos from input music.
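The temporal alignment idea admits a simple illustration: movement onsets that fall slightly off-beat can be snapped to the nearest detected music beat. The sketch below (the helper name `align_to_beats` is hypothetical; the paper's actual algorithm is more involved) shows this in Python with NumPy:

```python
import numpy as np

def align_to_beats(move_onsets, beat_times):
    """Snap each dance-movement onset to its nearest music beat.

    move_onsets: times (s) at which a dance movement begins.
    beat_times: detected beat times (s) of the music track.
    Returns the warped onset times, each moved onto the closest beat.
    """
    move_onsets = np.asarray(move_onsets, dtype=float)
    beat_times = np.asarray(beat_times, dtype=float)
    # For each onset, find the index of the nearest beat.
    idx = np.abs(move_onsets[:, None] - beat_times[None, :]).argmin(axis=1)
    return beat_times[idx]

beats = np.arange(0.0, 8.0, 0.5)           # beats every 0.5 s
onsets = np.array([0.12, 1.06, 2.4, 3.9])  # slightly off-beat movements
print(align_to_beats(onsets, beats))       # -> [0.  1.  2.5 4. ]
```

In a full system the beat times would come from a beat tracker applied to the audio; here they are given directly to keep the sketch self-contained.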
Achieves state-of-the-art performance with a ResNet-50 backbone on 4 representative benchmarks. Fine-grained object recognition aims to learn effective features that can identify the subtle differences between visually similar objects. Most existing works tend to amplify discriminative part regions with attention mechanisms. Besides their unstable performance under complex backgrounds, the intrinsic interrelationship between different semantic features is less explored. Toward this end, we propose an effective graph-based relation discovery approach to build a contextual understanding of high-order relationships. In our approach, a high-dimensional feature bank is first formed and jointly regularized with semantic- and positional-aware high-order constraints, endowing the feature representations with rich attributes. Second, to overcome the curse of dimensionality, we propose a graph-based semantic grouping strategy to embed this high-order tensor bank into a low-dimensional space. Meanwhile, a group-wise learning strategy is proposed to regularize the features around each cluster's embedding center. With the collaborative learning of the three modules, our approach is able to grasp stronger contextual details of fine-grained objects. Experimental evidence demonstrates that our approach achieves new state-of-the-art results on 4 widely used fine-grained object recognition benchmarks.
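To make the grouping idea concrete, the toy sketch below stands in for the graph-based semantic grouping: channels of a high-dimensional feature bank are clustered (here with plain k-means, not the paper's graph formulation) and each group is pooled into one output dimension, yielding a low-dimensional embedding. All names and shapes are illustrative assumptions:

```python
import numpy as np

def semantic_grouping(feature_bank, n_groups, n_iter=10, seed=0):
    """Toy stand-in for graph-based semantic grouping: cluster the
    channels of a feature bank with k-means and pool each cluster's
    channels into a single output dimension.

    feature_bank: (N, D) array of N spatial features with D channels.
    Returns an (N, n_groups) grouped representation.
    """
    rng = np.random.default_rng(seed)
    channels = feature_bank.T  # each channel is a point in N-dim space
    centers = channels[rng.choice(len(channels), n_groups, replace=False)]
    for _ in range(n_iter):
        # Assign every channel to its nearest cluster center.
        assign = np.linalg.norm(
            channels[:, None] - centers[None], axis=-1).argmin(axis=1)
        for g in range(n_groups):
            if np.any(assign == g):
                centers[g] = channels[assign == g].mean(axis=0)
    # Average the channels of each group into one output dimension.
    grouped = np.stack(
        [feature_bank[:, assign == g].mean(axis=1) if np.any(assign == g)
         else np.zeros(len(feature_bank)) for g in range(n_groups)], axis=1)
    return grouped

bank = np.random.default_rng(1).normal(size=(64, 256))  # 64 features, 256-d
low = semantic_grouping(bank, n_groups=8)
print(low.shape)  # -> (64, 8)
```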
Explores cross-camera generalization with a new metric and proposes a method to address this issue. To appear.
3D talking head generation. To appear.
Jointly learning spatial and semantic relationships in multi-label classification. To appear.
State-of-the-art algorithm for occluded person Re-ID. To appear.
Jointly exploiting the visual and linguistic modalities for multi-label classification. To appear.
A new fine-grained few-shot recognition algorithm. Few-shot learning aims to recognize novel categories from a few examples. However, most existing approaches focus on general image classification and fail to handle the subtle differences between images. To alleviate this issue, we propose a trilinear spatial-awareness network for few-shot fine-grained visual recognition, called S3Net, which is composed of a spatial selection module, a structural pyramid descriptor, and a subtle difference mining module. Specifically, we first build global relations to strengthen the features via the spatial selection module. The structural pyramid descriptor then constructs a multi-scale representation that enriches contextual information by exploiting different receptive fields in the same feature layer. Furthermore, a similarity loss based on local descriptors and a global classification loss are designed to help the network learn discrimination capability by handling subtle differences in confusing or near-duplicated samples. Extensive experiments on 4 few-shot fine-grained benchmarks demonstrate that our proposed approach is effective and outperforms state-of-the-art models by large margins.
A new RGB-D segmentation model without depth input during test! Salient object detection (SOD) is a crucial and preliminary task for many computer vision applications, and has made progress with deep CNNs. Most existing methods mainly rely on RGB information to distinguish salient objects, which runs into difficulties in some complex scenarios. To solve this, many recent RGB-D based networks have been proposed, which adopt the depth map as an independent input and fuse its features with the RGB information. Taking advantage of both RGB and RGB-D methods, we propose a novel depth-aware salient object detection framework with the following superior designs: 1) it does not rely on depth data in the testing phase; 2) it comprehensively optimizes SOD features with multi-level depth-aware regularizations; 3) the depth information also serves as an error-weighted map to correct the segmentation process. With these insightful designs combined, we make the first attempt at realizing a unified depth-aware framework that requires only RGB information as input for inference, which not only surpasses the state-of-the-art performance on five public RGB SOD benchmarks, but also surpasses RGB-D based methods on five benchmarks by a large margin, while using less information and a lightweight implementation.
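The error-weighted idea can be illustrated with a per-pixel re-weighted cross-entropy: pixels that a depth-derived error map flags as likely mis-segmented receive a larger weight, so training focuses on correcting them. The construction of the map is not specified in this abstract, so the sketch below simply accepts any non-negative array (all names hypothetical):

```python
import numpy as np

def depth_weighted_bce(pred, target, depth_error, eps=1e-7):
    """Binary cross-entropy over a saliency map, re-weighted per pixel
    by a depth-derived error map. Weights lie in [1, 2]: an all-zero
    error map recovers the plain (unweighted) BCE."""
    pred = np.clip(pred, eps, 1.0 - eps)
    weight = 1.0 + depth_error / (depth_error.max() + eps)
    bce = -(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    return float((weight * bce).mean())

pred = np.array([[0.9, 0.2], [0.1, 0.8]])
target = np.array([[1.0, 0.0], [1.0, 1.0]])        # bottom-left pixel is wrong
err_on_wrong = np.array([[0.0, 0.0], [1.0, 0.0]])  # depth flags that pixel
plain = depth_weighted_bce(pred, target, np.zeros_like(pred))
focused = depth_weighted_bce(pred, target, err_on_wrong)
print(focused > plain)  # -> True: the error pixel is penalized more
```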
Largest benchmark for cartoon face detection and recognition! Recent years have witnessed increasing attention to cartoon media, powered by the strong demands of industrial applications. As the first step toward understanding this media, cartoon face recognition is a crucial but less-explored task with few datasets proposed. In this work, we first present a new challenging benchmark dataset, consisting of 389,678 images of 5,013 cartoon characters annotated with identity, bounding box, pose, and other auxiliary attributes. The dataset, named iCartoonFace, is currently the largest-scale, high-quality, richly annotated dataset in the field of image recognition, spanning multiple challenging occurrences including near-duplications, occlusions, and appearance changes. In addition, we provide two types of annotations for cartoon media, i.e., face recognition and face detection, with the help of a semi-automatic labeling algorithm. To further investigate this challenging dataset, we propose a multi-task domain adaptation approach that jointly utilizes human and cartoon domain knowledge with three discriminative regularizations. We then perform a benchmark analysis of the proposed dataset and verify the superiority of the proposed approach on the cartoon face recognition task.
A strong baseline for few-shot learning. Feel free to try. Given base classes with sufficient labeled samples, the target of few-shot classification is to recognize unlabeled samples of novel classes with only a few labeled samples. Most existing methods only pay attention to the relationship between labeled and unlabeled samples of the novel classes, and do not make full use of the information within the base classes. In this paper, we make two contributions to the investigation of the few-shot classification problem. First, we report a simple and effective baseline trained on the base classes in the traditional supervised-learning way, which achieves results comparable to the state of the art. Second, building on this baseline, we propose a cooperative bi-path metric for classification, which leverages the correlations between base classes and novel classes to further improve accuracy. Experiments on two widely used benchmarks show that our method is a simple and effective framework, and a new state of the art is established in the few-shot classification field.
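A minimal sketch of the bi-path idea, under assumptions not stated in the abstract: one path compares a query directly to novel-class prototypes, the other describes both query and prototypes by their similarities to base-class prototypes and compares those descriptions, reusing base-class knowledge. The weighting `alpha` and all names are hypothetical:

```python
import numpy as np

def bi_path_scores(query, protos, base_protos, alpha=0.5):
    """Cooperative bi-path scoring sketch for few-shot classification.

    query:       (Q, d) query embeddings.
    protos:      (N, d) novel-class prototypes (support means).
    base_protos: (B, d) base-class prototypes.
    Returns (Q, N) class scores mixing the two paths with `alpha`.
    """
    def cos(a, b):
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T

    direct = cos(query, protos)        # path 1: query vs. novel prototypes
    q_desc = cos(query, base_protos)   # path 2: describe via base classes...
    p_desc = cos(protos, base_protos)
    via_base = cos(q_desc, p_desc)     # ...then compare the descriptions
    return alpha * direct + (1 - alpha) * via_base

protos = np.array([[1.0, 0.0], [0.0, 1.0]])  # two novel-class prototypes
base = np.array([[1.0, 1.0], [1.0, -1.0]])   # two base-class prototypes
query = np.array([[0.9, 0.1]])               # close to novel class 0
print(bi_path_scores(query, protos, base).argmax(axis=1))  # -> [0]
```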
First deep stitching model for 360° omnidirectional images. 360° omnidirectional images are very helpful for creating immersive multimedia content, which creates a huge demand for their efficient generation and effective assessment. In this paper, we leverage an attentive idea to meet this demand by addressing two concerns: how to generate a good omnidirectional image in a fast and robust way, and what a good omnidirectional image is for humans. To this end, we propose an attentive deep stitching approach to facilitate the efficient generation of omnidirectional images, which is composed of two modules. The low-resolution deformation module learns the deformation rules from dual-fisheye to omnidirectional images with joint implicit and explicit attention mechanisms, while the high-resolution recurrence module enhances the resolution of the stitching results with high-resolution guidance in a recurrent manner. In this way, the stitching approach can efficiently generate high-resolution omnidirectional images that are highly consistent with human immersive experiences.
First to introduce the part-level 3D model reconstruction setting. Understanding an image with 3D representations has been an increasingly attractive topic in computer vision. State-of-the-art 3D reconstruction methods usually focus on reconstructing the holistic object while missing important part information, which is crucial in robotic interaction and virtual reality applications. To solve this issue, we make the first attempt to reconstruct 3D models with part-level representations in a unified framework. Given a single-view image as input, we first develop a feature enhancement encoder to incorporate discriminative local features into the feature representation. The local features are selected adaptively by a learnable local awareness module. The enhanced local features are then fused with the global branch to form the 3D representations. We further develop a 3D part generator that decodes the image priors into 3D parts with a 3D focal loss, which enables the representation of small parts. Experimental results indicate that our model generates reliable part-level structures while achieving state-of-the-art performance in object-level recovery.
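The motivation for a focal loss here is that occupancy grids are dominated by easy empty voxels, while small parts occupy only a few voxels. The sketch below applies the standard focal-loss form to a voxel grid; the `gamma`/`alpha` values are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def focal_loss_3d(pred_occ, gt_occ, gamma=2.0, alpha=0.75, eps=1e-7):
    """Focal loss over voxel occupancy probabilities (sketch).

    The (1 - p_t)^gamma factor down-weights the many easy empty voxels,
    so the sparse, hard part voxels dominate the loss."""
    p = np.clip(pred_occ, eps, 1.0 - eps)
    pt = np.where(gt_occ == 1, p, 1.0 - p)          # prob. of the true label
    at = np.where(gt_occ == 1, alpha, 1.0 - alpha)  # occupied/empty balance
    return float((-at * (1.0 - pt) ** gamma * np.log(pt)).mean())

gt = np.zeros((4, 4, 4)); gt[1, 1, 1] = 1.0          # one occupied part voxel
good = np.full_like(gt, 0.05); good[1, 1, 1] = 0.95  # recovers the part
bad = np.full_like(gt, 0.05)                         # misses the part voxel
print(focal_loss_3d(good, gt) < focal_loss_3d(bad, gt))  # -> True
```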
Makes the first attempt at the multi-class part parsing setting. Object part parsing in the wild, which requires simultaneously detecting multiple object classes in the scene and accurately segmenting the semantic parts within each class, is challenging due to the joint presence of class-level and part-level ambiguities. Despite its importance, however, this problem is not sufficiently explored in existing works. In this paper, we propose a joint parsing framework with boundary and semantic awareness to address this challenging problem. To handle part-level ambiguity, a boundary awareness module is proposed to make mid-level features at multiple scales attend to part boundaries for accurate part localization; these are then fused with high-level features for effective part recognition. For class-level ambiguity, we further present a semantic awareness module that selects discriminative part features relevant to a category to prevent irrelevant features from being merged together. The proposed modules are lightweight and implementation-friendly, improving performance substantially when plugged into various baseline architectures. Our full model sets new state-of-the-art results on the Pascal-Part dataset, in both the multi-class and the conventional single-class setting, while running substantially faster than recent high-performance approaches.
Ordinal multi-task relation construction for part segmentation. Semantic object part segmentation is a fundamental task in object understanding and geometric analysis. A clear understanding of part relationships can be of great use to the segmentation process. In this work, we propose a novel Ordinal Multi-task Part Segmentation (OMPS) approach which explicitly models the ordinal relationship between parts to guide the segmentation process in a recurrent manner. Quantitative and qualitative experiments are first conducted to explore the mutual impacts among object parts, and an ordinal part inference algorithm is then formulated from the experimental observations. Specifically, our framework is mainly composed of two modules: a forward module that segments multiple parts as individual subtasks with prior knowledge, and a recurrent module that generates appropriate part priors with the ordinal inference algorithm. These two modules work iteratively to optimize the segmentation performance and the network parameters. Experimental results show that our approach outperforms state-of-the-art models on human and vehicle part parsing benchmarks. Comprehensive evaluations are conducted to demonstrate the effectiveness of our approach in object part segmentation.
Surpasses the state-of-the-art Re-ID methods by a large margin. Vehicle re-identification (Re-ID) has been attracting increasing interest in computer vision owing to its great contributions to urban surveillance and intelligent transportation. Despite the development of deep learning approaches, vehicle Re-ID still faces a near-duplicate challenge: distinguishing different instances with nearly identical appearances. Previous methods simply rely on global visual features to handle this problem. In this paper, we propose a simple but efficient part-regularized discriminative feature preserving method that enhances the perceptive ability of subtle discrepancies. We further develop a novel framework that integrates part constraints with the global Re-ID modules by introducing a detection branch. Our framework is trained end-to-end with combined local and global constraints. Notably, even without the part-regularized local constraints in the inference step, our Re-ID network outperforms the state-of-the-art methods by a large margin on the large benchmark datasets VehicleID and VeRi-776.
A new cross-reference dataset and software for 360° omnidirectional images. Along with the development of virtual reality (VR), omnidirectional images play an important role in producing media content with an immersive experience. However, despite various existing approaches for omnidirectional image stitching, how to assess the quality of stitched images is still insufficiently explored. To address this problem, we first establish a novel omnidirectional image dataset containing stitched images as well as dual-fisheye images captured at the standard angles of 0°, 90°, 180°, and 270°. In this manner, when evaluating the quality of a stitched image, there always exist corresponding fisheye images from at least two angles (called cross-reference images) that provide ground-truth observations of the stitching region. Based on this dataset, we propose a novel Omnidirectional Stitching Image Quality Assessment (OS-IQA) algorithm, in which we design histogram-, perceptual-hashing-, and sparse-reconstruction-based quality measurements of the local stitching region by exploring the relationships between the stitched image and its cross-references. We further propose two global quality indicators that assess the visual color difference and the fitness of blind zones. To the best of our knowledge, this is the first attempt that mainly focuses on assessing the stitching quality of omnidirectional images. Qualitative and quantitative experiments show that our method outperforms the state-of-the-art methods and is highly consistent with human subjective evaluation.