"We may hope that machines will eventually compete with men in all purely intellectual fields. But which are the best ones to start with? [...] It can [also] be maintained that it is best to provide the machine with the best sense organs that money can buy, and then teach it to understand and speak English."
Following Alan M. Turing's inspiring quote above, I believe that this conjecture of machine understanding is worth exploring.
This topic poses interesting and challenging questions. The first concerns the form of intelligent behaviour to be investigated, i.e., on what basis one can assess that a robot understands what is happening in its environment. To me, a reasonable way is to test the ability to produce a natural language description of generic visual sequences. The description can be seen as a manifestation of what the agent learned from the visual and textual data it processed during training, and of what it learned to be worth describing. In addition, a natural language description is a good basis for natural language question answering about the events the agent saw. Hence, it also offers a friendly interface for non-expert users, who would then be able to interact effectively with their home robots in the near future.
This seminar focuses on my recent work on natural language video description for service robotics applications. My proposed approach consists of a Deep Recurrent Neural Network (D-RNN) architecture based entirely on the Gated Recurrent Unit (GRU) paradigm. The robot generates complete sentences describing the scene while dealing with the hierarchical nature of the temporal information contained in image sequences. The proposed approach has fewer parameters than previous state-of-the-art architectures, so it is faster to train and has a smaller memory footprint, without sacrificing prediction performance.
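To give a concrete flavour of this kind of architecture, the Python/PyTorch fragment below is a minimal sketch of a GRU-based encoder-decoder for video description; the module names, layer sizes and single-layer structure are illustrative assumptions and do not reproduce the speaker's actual model.

# Minimal sketch of a GRU-based encoder-decoder for video description.
# All names and dimensions are illustrative assumptions, not the speaker's code.
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab=10000, emb=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)  # encodes per-frame CNN features
        self.embed = nn.Embedding(vocab, emb)                      # word embeddings
        self.decoder = nn.GRU(emb, hidden, batch_first=True)       # generates the sentence
        self.out = nn.Linear(hidden, vocab)                        # projects to vocabulary logits

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim) CNN features of the video
        # captions:    (batch, seq_len) token ids of the target sentence
        _, h = self.encoder(frame_feats)    # summarise the video into a single hidden state
        emb = self.embed(captions)
        dec_out, _ = self.decoder(emb, h)   # condition the language model on the video state
        return self.out(dec_out)            # per-step vocabulary scores

In this sketch the visual sequence is compressed into the encoder's final hidden state, which initialises the decoder; richer hierarchical or attention-based variants follow the same encoder-decoder pattern.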
Silvia Cascianelli received the B.Sc. degree in Electronic and Information Engineering from the University of Perugia in 2013, with a thesis on System Fault Detection and Accommodation for UAV anemometers. Since then she has collaborated with the Intelligent Systems, Automation and Robotics Laboratory (ISARLab). In 2015, she received the M.Sc. degree magna cum laude in Information and Automation Engineering from the University of Perugia, with a thesis on Nuclear Image-based Computer Aided Diagnosis systems for Alzheimer's Disease, and joined the ISARLab as a Ph.D. student. Her research interests are mainly Machine Learning and Computer Vision for Autonomous Robotics applications.
Image understanding using deep convolutional networks has reached human-level performance, yet the closely related problem of video understanding, especially action recognition, has not reached the same level of maturity. We propose two independent architectures for action recognition using meta-classifiers: the first is based on combining kernels of support vector machines (SVMs), and the second is based on distributed Gaussian Processes. Both receive features computed by a multi-stream deep convolutional neural network, enabling us to achieve state-of-the-art performance on 51- and 101-class activity recognition problems (the HMDB-51 and UCF-101 datasets). The resulting architecture is named pillar networks, as each (very) deep neural network acts as a pillar for the meta-classifiers. In addition, we show that hand-crafted features such as improved dense trajectories (iDT) and Multi-skip Feature Stacking (MIFS), used as additional pillars, can further improve performance.
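As a rough illustration of the SVM-based meta-classifier, the Python sketch below averages one RBF kernel per feature stream ("pillar") into a single precomputed kernel for an SVM; the choice of RBF kernels and the uniform weighting are assumptions made for clarity, not necessarily the exact combination scheme of the paper.

# Toy sketch of kernel combination over several feature streams ("pillars").
# Feature extraction and kernel weights are placeholders, not the paper's exact setup.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def combined_kernel(streams_a, streams_b):
    # streams_*: list of (n_samples, dim) feature matrices, one per pillar
    kernels = [rbf_kernel(a, b) for a, b in zip(streams_a, streams_b)]
    return np.mean(kernels, axis=0)  # uniform combination of per-stream kernels

def fit_and_predict(train_streams, y_train, test_streams):
    # train_streams / test_streams: per-pillar features (e.g. spatial, temporal, iDT, MIFS)
    K_train = combined_kernel(train_streams, train_streams)
    K_test = combined_kernel(test_streams, train_streams)
    clf = SVC(kernel="precomputed").fit(K_train, y_train)
    return clf.predict(K_test)

The second meta-classifier mentioned in the abstract, based on distributed Gaussian Processes, would consume the same per-pillar features in place of the SVM shown here.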
Dr. Yu Qian is a senior research scientist at Cortexica Vision System Ltd, UK. She has an educational background in electrical and electronics engineering and computer science, with bachelor's and master's degrees from Hefei University of Technology, China, and a PhD from the School of Computer Science of Middlesex University. After completing her PhD, she was appointed as a research officer/fellow working on sketch-based video retrieval (an EPSRC project) at the Media Technology Research Centre (MTRC) of the University of Bath and at CVSSP, University of Surrey. She then joined Middlesex University as a research fellow, working on an EU project on medical image analysis. Her research interests focus on computer vision and machine learning, especially visual feature representation for image/video analysis.
The proliferation of affordable smart devices capable of capturing, processing and rendering audio-visual media content triggers a need for coordination and orchestration between these devices and their capabilities, and of the content flowing from and to such devices. The upcoming MPEG Media Orchestration standard (“MORE”, ISO/IEC 23001-13) enables the temporal and spatial orchestration of multiple media and metadata streams. Temporal orchestration concerns the time synchronisation of media and sensor captures, processing and rendering, for which the MORE standard uses and extends a DVB standard. Spatial orchestration concerns the alignment of (global) position, altitude and orientation, for which the MORE standard provides dedicated timed metadata. Other types of orchestration involve timed metadata for regions of interest, perceptual quality of media, audio-feature extraction and media timeline correlation.
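As a toy illustration of temporal orchestration, the Python sketch below maps samples captured on different device clocks onto a shared timeline using per-device clock offsets (e.g., as could be obtained from a DVB-style wall-clock synchronisation exchange); the data model and function names are hypothetical and are not part of the MORE specification.

# Illustrative sketch: aligning per-device capture timestamps to a shared timeline.
# The data model is hypothetical and only illustrates the kind of alignment involved.
from dataclasses import dataclass

@dataclass
class TimedSample:
    device_id: str
    local_ts: float   # seconds on the device's own clock
    payload: bytes

def to_shared_timeline(samples, clock_offsets):
    # clock_offsets: device_id -> offset such that shared_time = local_time + offset
    aligned = [(s.local_ts + clock_offsets[s.device_id], s) for s in samples]
    return sorted(aligned, key=lambda pair: pair[0])  # interleave streams by shared time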