Abstract:
Image understanding using deep convolutional networks has reached human-level performance, yet the closely related problem of video understanding, and in particular action recognition, has not reached the same level of maturity. We propose two independent architectures for action recognition using meta-classifiers: the first is based on combining kernels of support vector machines (SVMs) and the second on distributed Gaussian processes. Both receive features computed by a multi-stream deep convolutional neural network, enabling us to achieve state-of-the-art performance on 51-class and 101-class activity recognition problems (the HMDB-51 and UCF-101 datasets). We name the resulting architecture pillar networks, as each (very) deep neural network acts as a pillar for the meta-classifiers. In addition, we show that hand-crafted features such as improved dense trajectories (iDT) and Multi-skip Feature Stacking (MIFS), used as additional pillars, can further improve performance.
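To illustrate the kernel-combination idea in the SVM branch (a minimal sketch, not the authors' exact implementation: the toy features, uniform weights and RBF kernel choice below are assumptions), one could sum per-pillar kernels and train a precomputed-kernel SVM, for example with scikit-learn:

# Hedged sketch: per-pillar features are toy arrays here; in the paper they would
# come from the multi-stream CNNs, iDT and MIFS. Weights and kernel type are
# assumptions made purely for illustration.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def combined_kernel(streams_a, streams_b, weights):
    # Weighted sum of per-stream RBF kernels (one kernel matrix per pillar).
    K = np.zeros((streams_a[0].shape[0], streams_b[0].shape[0]))
    for Xa, Xb, w in zip(streams_a, streams_b, weights):
        K += w * rbf_kernel(Xa, Xb)
    return K

rng = np.random.default_rng(0)
train_streams = [rng.normal(size=(60, 128)) for _ in range(4)]  # toy per-pillar features
test_streams = [rng.normal(size=(20, 128)) for _ in range(4)]
y_train = rng.integers(0, 5, size=60)                           # toy action labels
weights = [0.25, 0.25, 0.25, 0.25]                               # assumed uniform weighting

svm = SVC(kernel='precomputed')
svm.fit(combined_kernel(train_streams, train_streams, weights), y_train)
predictions = svm.predict(combined_kernel(test_streams, train_streams, weights))

The same per-pillar features could equally be routed to the distributed Gaussian-process meta-classifier; the kernel-summation step is simply the easiest part to show compactly.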
Bio:
Dr. Yu Qian is a senior research scientist at Cortexica Vision System Ltd, UK. She has an educational background in electrical and electronics engineering and computer science, with bachelor's and master's degrees from Hefei University of Technology, China, and a PhD from the School of Computer Science at Middlesex University. After completing her PhD she was appointed as a research officer/fellow, working on sketch-based video retrieval (an EPSRC project) at the Media Technology Research Centre (MTRC) of the University of Bath and at CVSSP, University of Surrey. She then joined Middlesex University as a research fellow and worked on an EU project on medical image analysis. Her research interests focus on computer vision and machine learning, especially visual feature representation for image and video analysis.
Abstract:
The proliferation of affordable smart devices capable of capturing, processing and rendering audio-visual media content creates a need for coordination and orchestration between these devices and their capabilities, and of the content flowing to and from them. The upcoming MPEG Media Orchestration standard (“MORE”, ISO/IEC 23001-13) enables the temporal and spatial orchestration of multiple media and metadata streams. Temporal orchestration concerns the time synchronisation of media and sensor capture, processing and rendering, for which the MORE standard uses and extends a DVB standard. Spatial orchestration concerns the alignment of (global) position, altitude and orientation, for which the MORE standard provides dedicated timed metadata. Other types of orchestration involve timed metadata for regions of interest, perceptual quality of media, audio-feature extraction and media timeline correlation.
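As a purely conceptual sketch of the timeline-correlation idea behind temporal orchestration (the class and field names below are invented for this example and do not reflect the standard's syntax or data model), two device clocks can be mapped onto a shared reference timeline via a known correlation point:

# Toy sketch only: maps timestamps from two device clocks onto a shared reference
# timeline, assuming each clock differs from the reference by a constant offset.
from dataclasses import dataclass

@dataclass
class TimelineCorrelation:
    device_time: float   # an instant on the capturing device's clock (seconds)
    common_time: float   # the same instant on the shared reference clock (seconds)

    def to_common(self, t_device: float) -> float:
        return t_device + (self.common_time - self.device_time)

cam_a = TimelineCorrelation(device_time=10.000, common_time=1000.000)
cam_b = TimelineCorrelation(device_time=55.300, common_time=1000.040)

# A frame captured at 12.5 s on camera A and one at 57.84 s on camera B
# can now be compared on the common timeline:
offset_ms = abs(cam_a.to_common(12.5) - cam_b.to_common(57.84)) * 1000
print(f"common-timeline difference: {offset_ms:.1f} ms")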