Much of our communication is embodied in our interaction with the world through gaze and gesture, and in our interpretation of our situation in terms of experienced and recognized emotions. We must therefore pay close attention to the communication of emotions through the colouring of our speech and the livening of our face with appropriate expression.
In addition, we need to be sensitive to the deictic and interpretive cues that are conveyed through our eye gaze, facial expressions, hand movements and body language.
In recent years, Artificial Intelligence and Image Processing research has developed a focus on the recognition of emotions as expressed through facial gestures or expressions, both conscious and unconscious.
Humans have the ability to detect and interpret such facial movements and adapt their response in seconds or even milliseconds. In the context of our Teaching Head experiments, just providing appropriate rather than neutral or inappropriate facial expressions, during an otherwise identically delivered lesson, can make a whole grade point of difference in the students' results.
Recognition and Synthesis of Facial Gestures

The problem of recognizing emotion from the facial expression of a single image has been turned into a straightforward Machine Learning problem by the availability of a number of databases, or corpora, consisting of multiple images for a range of subjects, for each of 6 putative basic emotions, plus an additional neutral case.
Whilst we work with these databases, and have achieved promising results and new optimizations for the Image Processing and Machine Learning task that is involved, the databases themselves, and the paradigm they represent, have a number of limitations.
Some of the most common techniques are holistic and somewhat simplistic, applying Principal Component Analysis to whole images. In collaboration with associate investigators at Beijing University of Technology, we have been exploring the limits of this technique, applying it to smaller components of the image and then fusing the results, with good success, as well as exploring the use of appropriate image processing and dimension reduction techniques.
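The component-wise variant of this idea can be sketched briefly: run PCA separately on each facial region and fuse the resulting low-dimensional features. The region names, component counts, and random data below are purely illustrative, not our actual pipeline.

```python
import numpy as np

def pca_features(images, n_components=4):
    """Project flattened image regions onto their top principal components.

    images: (n_samples, n_pixels) array, one flattened region per row.
    Returns (projections, mean, basis) so new images can be projected too.
    """
    mean = images.mean(axis=0)
    centred = images - mean
    # SVD of the centred data gives the principal axes in Vt's rows.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    basis = vt[:n_components]
    return centred @ basis.T, mean, basis

# Toy data: 10 "images", each split into an eye region and a mouth region.
rng = np.random.default_rng(0)
eyes = rng.normal(size=(10, 64))
mouths = rng.normal(size=(10, 64))

eye_feats, _, _ = pca_features(eyes)
mouth_feats, _, _ = pca_features(mouths)

# Fuse by concatenating the per-region projections into one feature vector,
# which a classifier can then consume in place of the whole-image projection.
fused = np.concatenate([eye_feats, mouth_feats], axis=1)
print(fused.shape)  # (10, 8)
```

Fusing could equally be done later, e.g. by combining per-region classifier scores rather than features; the concatenation above is the simplest option.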
Another approach to recognizing expressions is to recognize the individual facial gestures using Active Appearance Models (AAM), which are designed to track individual points on the face, in particular around the mouth and eye areas.
Theoretically, an AAM is capable of modelling and reconstructing any human face, including any facial expression or gestures displayed by the subject. The AAM may also be used to model the speech gestures for purposes of lip-reading, or for synthesis of the visemes we use in the Thinking Head, Head X.
The 6 basic emotions may also be used for synthesis, often based on Active Appearance Models, which is again what we use for emotion expression in Head X.
From the perspective of accurately identifying and tracking these keypoints, and using these tracks or deviations from the home (neutral) position, recognizing the expression in an individual photo of an unknown person remains very difficult, and even for a known person can be subject to significant error.
However, the average over many images and many subjects delivers a standardized emotion signature that can be used to allow our Thinking Heads to express not only the 6 basic emotions illustrated below, but also arbitrary mixes of them, the so-called hybrid emotions.
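The blending of basic emotions into hybrids can be illustrated as a convex combination of per-emotion signature vectors (e.g. average keypoint displacements from the neutral face). The signature values below are made up for illustration, not measured data.

```python
import numpy as np

# Hypothetical "emotion signatures": average keypoint displacements from the
# neutral face, one vector per basic emotion (the numbers here are invented).
signatures = {
    "happy":    np.array([0.8, 0.1, 0.0]),
    "surprise": np.array([0.2, 0.9, 0.3]),
}

def hybrid_signature(weights):
    """Blend basic-emotion signatures into a hybrid expression.

    weights: dict mapping emotion name -> non-negative weight; weights are
    normalised so the result is a convex combination of the signatures.
    """
    total = sum(weights.values())
    return sum(w / total * signatures[name] for name, w in weights.items())

# A 3:1 happy/surprise mix, i.e. 0.75*happy + 0.25*surprise.
mix = hybrid_signature({"happy": 3, "surprise": 1})
print(mix)
```

In a real head, the blended signature would be applied as offsets to the model's keypoints or AAM parameters rather than printed.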
For both recognition and synthesis purposes, better accuracy may be obtained with models, including both PCA and AAM, that are based on movie snippets, that is, sequences of images, rather than single images. A more general and powerful technique for moving images is optical flow, in which the movement of small patches from one frame to the next is represented by a motion vector. This technique also allows depth to be estimated given known motion, most typically where either the camera or the object is stationary.
This is very similar to the way stereo disparity is found between two simultaneous images taken a known distance apart. The points that move fastest or slowest in a particular direction are typically those that are useful for an articulation or gesture model.
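The core of the optical-flow idea, finding where a small patch moved between frames, can be sketched with exhaustive block matching. Production flow algorithms (Lucas-Kanade, Farnebäck) are far more sophisticated; this is only a minimal illustration of the motion-vector concept.

```python
import numpy as np

def block_motion(prev, curr, y, x, size=4, search=3):
    """Estimate the motion vector of one small patch by exhaustive search.

    Compares the patch at (y, x) in `prev` with nearby patches in `curr`
    and returns the (dy, dx) displacement with the smallest SSD error.
    """
    patch = prev[y:y + size, x:x + size]
    best, best_err = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = curr[y + dy:y + dy + size, x + dx:x + dx + size]
            if cand.shape != patch.shape:
                continue  # candidate window falls outside the frame
            err = np.sum((patch - cand) ** 2)
            if err < best_err:
                best, best_err = (dy, dx), err
    return best

# Toy frames: a bright square that moves 2 pixels right between frames.
prev = np.zeros((16, 16)); prev[4:8, 4:8] = 1.0
curr = np.zeros((16, 16)); curr[4:8, 6:10] = 1.0
print(block_motion(prev, curr, 4, 4))  # (0, 2)
```

Applied densely over a grid of patches, such vectors form a flow field; the same matching idea along one axis yields the stereo disparity mentioned above.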
A related approach is to look for points of interest in an individual image, or in a sequence of images. However, many interest point detector approaches are ill suited for use in 3D environments because they were originally designed for 2D applications.
Interest point detectors have mainly been developed to take advantage of lines, corners, ridges and blobs, but this biases their effectiveness to environments and objects that have these distinguishing features in them.
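The corner-seeking behaviour described above can be made concrete with the classic Harris detector, whose response is large only where the image gradient is strong in two directions at once. This is a minimal NumPy sketch: real detectors add Gaussian smoothing, scale selection, and non-maximum suppression.

```python
import numpy as np

def harris_response(img, k=0.05):
    """Harris corner response: high where both gradient directions are strong."""
    # Image gradients via finite differences.
    iy, ix = np.gradient(img.astype(float))

    # Structure tensor entries, box-averaged over a 3x3 window.
    def box(a):
        p = np.pad(a, 1)
        return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
                   for i in range(3) for j in range(3)) / 9.0

    sxx, syy, sxy = box(ix * ix), box(iy * iy), box(ix * iy)
    det = sxx * syy - sxy ** 2        # product of tensor eigenvalues
    trace = sxx + syy                 # sum of tensor eigenvalues
    return det - k * trace ** 2       # edges score low, corners high

# Toy image: a single bright square; the response peaks near its corners.
img = np.zeros((12, 12)); img[4:8, 4:8] = 1.0
r = harris_response(img)
print(np.unravel_index(np.argmax(r), r.shape))
```

On a scene without corners, blobs or strong edges, this response is flat everywhere, which is precisely the bias in conventional detectors that the passage above points out.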
Hand Gestures and Body Language

The visual techniques we have discussed are not limited to the face; indeed, the latter techniques are borrowed from a more general application of image and video processing. Emotions are not expressed only by the face, and applying similar techniques to the hand in particular, and the body in general, is also important.
Genetic programming techniques can derive novel interest points in an image that do not depend on conventional features or hand identified points of interest. These interest points are robust against various lighting conditions, as well as distortion or rotation.
They are also repeatable and can be found when the same scene is shown with angular distortions. We have developed a new genetic interest point detector algorithm that combines a grammar-guided search process with intermediate caching of results to minimize the total number of required detector evaluations.
The fitness function uniquely uses depth information within a virtual 3D environment to measure the effectiveness of repeatable feature detections as a scene changes and leverages other aspects of 3D environments to better gauge interest point repeatability.
This facilitates evolutionary exploration of the search space and produces interest point detectors that are more robust when handling 3D environments, even when depth data is not directly provided to the interest point detector.
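The repeatability measure at the heart of such a fitness function can be illustrated simply: in a virtual 3D environment the true correspondence between two views is known, so we can score a detector by the fraction of its interest points that are re-detected at the matching location. The function below is an illustrative sketch of that idea, not our actual fitness implementation.

```python
def repeatability(detections_a, detections_b, correspondence, tol=1.5):
    """Fraction of interest points in view A re-detected at the matching
    location in view B.

    correspondence: maps a point in view A to its true position in view B,
    available in a virtual environment where depth and camera pose are known.
    """
    if not detections_a:
        return 0.0
    hits = 0
    for p in detections_a:
        qx, qy = correspondence(p)
        if any(abs(qx - x) <= tol and abs(qy - y) <= tol
               for x, y in detections_b):
            hits += 1
    return hits / len(detections_a)

# Toy example: the scene shifts 5 pixels right between views; a detector
# that fires at the shifted positions is perfectly repeatable.
view_a = [(10, 10), (20, 15)]
view_b = [(15, 10), (25, 15), (40, 40)]  # one spurious extra detection
shift = lambda p: (p[0] + 5, p[1])
print(repeatability(view_a, view_b, shift))  # 1.0
```

In the evolutionary setting, scores like this one (averaged over many scene changes) rank candidate detectors, and caching the per-scene detections avoids re-evaluating unchanged candidates.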
Another technique that is useful for looking at the motion of the human body is to project an array of laser dots on the scene, or to actually fix dots to parts of the human body for motion capture.
These dots may be visible or infrared. We have been experimenting with visible laser dots, and more recently with the Microsoft Kinect, which projects a dense matrix of infrared dots. The dots are viewed by an infrared camera a known distance away from the laser, and distance can be calculated from the disparity of each dot from its "at infinity" position, as with stereo cameras.
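The triangulation behind this is a one-line formula: depth is the baseline times the focal length divided by the disparity. The calibration numbers below are illustrative values in the right ballpark for a Kinect-class sensor, not the device's actual constants.

```python
def depth_from_disparity(disparity_px, baseline_m, focal_px):
    """Triangulate depth from dot disparity, as in stereo or structured light.

    disparity_px: horizontal shift of the dot from its reference
    ("at infinity") position, in pixels.
    baseline_m:   projector-to-camera (or camera-to-camera) separation.
    focal_px:     focal length expressed in pixels.
    """
    if disparity_px <= 0:
        return float("inf")  # at or beyond the reference plane
    return baseline_m * focal_px / disparity_px

# Illustrative numbers: 7.5 cm baseline, ~580 px focal length,
# a dot shifted by 29 px from its reference position.
print(depth_from_disparity(29.0, 0.075, 580.0))  # 1.5 (metres)
```

The inverse relationship also explains why such sensors lose depth resolution quickly with range: at large distances a small disparity change corresponds to a large depth change.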
Eye Gaze and Pointing

In fact, it is not just the hand that points: we can point with our nose, or, most typically, we can indicate something just by looking at it.
Of course this, like most of our gestures, is usually largely unconscious.
The eyes are actually somewhat easier to find than the mouth, and locating a face and mouth often involves locating the eyes in the process.
Locating the pupil and iris in relation to the eyeball allows relatively accurate identification of gaze.
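A first approximation of this idea is to express the pupil's position relative to the eye opening. The sketch below handles only the horizontal axis with hypothetical landmark coordinates; a real gaze tracker also needs the vertical axis, head pose, and calibration.

```python
def gaze_ratio(pupil_x, eye_left_x, eye_right_x):
    """Horizontal gaze estimate from pupil position within the eye opening.

    Returns a value in [-1, 1]: -1 = looking hard left, 0 = centre,
    +1 = looking hard right (from the camera's point of view).
    """
    centre = (eye_left_x + eye_right_x) / 2.0
    half_width = (eye_right_x - eye_left_x) / 2.0
    # Clamp so noisy landmarks cannot push the estimate out of range.
    return max(-1.0, min(1.0, (pupil_x - centre) / half_width))

# Hypothetical eye-corner landmarks at x=100 and x=140, pupil at x=130:
# the pupil sits halfway toward the right corner.
print(gaze_ratio(130, 100, 140))  # 0.5
```

Combined with a head-pose estimate, such a ratio is enough to tell roughly which object in the scene is being looked at, which is what makes gaze usable as a deictic cue.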