Multi-scale Analysis of Local Phase and Local Orientation for Dynamic Facial Expression Recognition

Automated facial expression analysis is an active research area for human-computer interaction, as it enables computers to understand and interact with humans in more natural ways. In this work, a novel local descriptor is proposed for facial expression analysis in a video sequence. The proposed descriptor is based on histograms of local phase and local orientation of gradients obtained from a sequence of face images, and describes both the spatial and the temporal information of the face images. The descriptor effectively represents the temporal local information and its spatial locations, which are important cues for facial expression recognition. It is further extended to multiple scales to achieve better performance in natural settings where the image resolution varies. Experiments on the Cohn-Kanade (CK+) database to detect six basic emotions achieved an accuracy of 94.58%. For the AVEC 2011 video sub-challenge, the detection of four emotion dimensions obtained accuracy comparable to the highest reported average accuracy in the test evaluation. The advantages of our method include local feature extraction that incorporates the temporal domain, high accuracy, and robustness to illumination changes. The proposed descriptor is therefore suited to continuous facial expression analysis in human-computer interaction.


Introduction
Facial expression plays an important role in human-human interaction and may be considered the most important communication link for conveying information about human emotional states. This has motivated many researchers to develop automatic solutions that enable computers to recognize facial expressions and derive the associated emotional states, in order to build better human-computer interaction systems. If successful, such solutions will also allow computers to generate human-like responses commensurate with the recognized emotion.
Although much progress has been reported in this field, difficult research challenges remain before a fully automated facial expression recognition system can be achieved. The majority of previous works focused on facial expression processing via static images and ignored the dynamic properties of a face while expressing an emotion [1][2][3][4][5]. However, it has been shown that the human visual system detects an expression more accurately when its temporal information is taken into account [6].
A few techniques have been developed that take the temporal information of a dynamic event into account for facial expression recognition. These include geometrical displacement [7], Hidden Markov Models [8,9], dynamic texture descriptors [10], and Dynamic Bayesian Networks (DBN) [11].
Deriving a proper facial representation from a sequence of images is crucial to the success of a facial expression recognition system, especially if the application requires continuous processing of the video stream, as in human-computer interaction. Local energy and local phase analysis methods have been successfully used to detect various interest points in images such as edges, corners, valleys, and lines. Physiological evidence suggests that the human visual system detects patterns in an image where the phase information is highly ordered [12]. Several works have used local energy or local phase information to detect low-level facial image features. In addition, phase-based feature detection is widely used in medical image analysis [13].
Chi Ho et al. [14] proposed a multi-scale local phase quantization histogram for face detection to mitigate the challenges of blur and varying illumination. Their descriptor is formed by projecting the multi-scale local phase information into an LDA space.
Sarfraz and Hellwich [15] presented a robust pose estimation procedure for face recognition using the local energy model. Their feature descriptor is claimed to be insensitive to illumination and to some other common variations in facial appearance such as skin color. Gundimada and Asari [12] also proposed a novel feature selection on phase congruency images to recognize faces with extreme variations such as partial occlusion, non-uniform occlusion, and varying expressions.
The work by Buciu [16] is the only study we found that uses phase congruency to extract facial features for facial expression classification. The author measured the similarity of sample phases in the frequency domain and used them to construct reliable features.
In this paper, we focus on facial emotion recognition from video sequences, considering the spatio-temporal information of such expressions. In our approach, a novel phase-based descriptor is proposed to process the local structures of sequenced images. We also extend our feature set by considering the complementary information of local orientations. The final descriptor is the concatenation of the spatio-temporal histogram of local phase and the spatio-temporal histogram of local orientation. The novelty of our proposed method includes: (1) Extending the phase-based descriptor to spatio-temporal event analysis. (2) Formulating the 3D local orientation of features as additional information to represent all local structures. (3) Combining the spatio-temporal histograms of local phase and local orientation to extract both static and dynamic information, as well as spatial information, of a face from a video sequence. (4) Incorporating multi-scale analysis for better performance in natural settings with varying image resolution.

Background
Local phase and local energy are two important concepts used to describe the local structural information of an image. Local phase effectively depicts useful image structures such as transitions or discontinuities, providing the type of structure and its location while preserving the image structures. Another advantage of phase-based features is that they are not sensitive to intensity variation. Local energy, as a complementary descriptor to local phase, captures the strength or sharpness of the feature [17].
The concept of local phase and energy was originally proposed for one-dimensional (1D) signal analysis. For a 1D signal, orientation is trivial and carries no additional information. Therefore, the local structure is completely described by local phase and energy [13]. However, for a higher-dimensional signal (3D in our case), local orientation is needed in addition to local phase and energy to completely describe the features in the signal.
To extend the concept of local analysis to 3D, a quadrature pair of oriented bandpass filters can be used. However, since the oriented filter bank is discrete, proper filter selection is required to cover the different orientations and scales present in an image. This motivated researchers to use the monogenic signal concept [17].
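The monogenic signal provides an isotropic extension of the analytic signal to 3D by introducing a vector-valued odd filter (the Riesz filter) with Fourier representation [17]:
$$H(u, v, w) = \frac{i\,(u, v, w)}{\sqrt{u^2 + v^2 + w^2}}$$
where u, v, and w are the Fourier domain coordinates and i is the imaginary unit.
The bandpass filter is a log-Gabor function:
$$G(\omega) = \exp\left(-\frac{\left(\log(\omega/\omega_0)\right)^2}{2\left(\log(k/\omega_0)\right)^2}\right) \qquad (7)$$
where $\omega_0$ is the filter's centre frequency and the parameter k controls the bandwidth of the filter. Fig. 1 illustrates the monogenic components of a sample video frame.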
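As an illustration of how the monogenic components of a video volume could be computed, the following minimal NumPy sketch applies a radial log-Gabor bandpass and the three Riesz filters in the Fourier domain. The function names, the default wavelength, the reading of the bandwidth parameter as the ratio k/ω0, and the sign convention of the odd filter are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def log_gabor_radial(radius, wavelength, sigma_on_f):
    """Radial log-Gabor transfer function (assumed form); zero at DC."""
    f0 = 1.0 / wavelength                       # centre frequency
    safe = np.where(radius > 0, radius, 1.0)    # avoid log(0) at DC
    g = np.exp(-(np.log(safe / f0) ** 2) / (2.0 * np.log(sigma_on_f) ** 2))
    g[radius == 0] = 0.0                        # a bandpass has no DC response
    return g

def monogenic_3d(volume, wavelength=8.0, sigma_on_f=0.75):
    """Return the even part, the three odd (Riesz) parts, energy and phase."""
    vol = np.asarray(volume, dtype=float)
    # Frequency coordinates (u, v, w), one grid per axis of the volume.
    grids = np.meshgrid(*[np.fft.fftfreq(n) for n in vol.shape], indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grids))
    safe = np.where(radius > 0, radius, 1.0)
    spectrum = np.fft.fftn(vol) * log_gabor_radial(radius, wavelength, sigma_on_f)
    even = np.real(np.fft.ifftn(spectrum))
    # Riesz filter i * (u, v, w) / |(u, v, w)|, applied one component at a time.
    odd = [np.real(np.fft.ifftn(spectrum * (1j * g / safe))) for g in grids]
    odd_mag = np.sqrt(sum(o ** 2 for o in odd))
    energy = np.sqrt(even ** 2 + odd_mag ** 2)  # local energy
    phase = np.arctan2(odd_mag, even)           # local phase in [0, pi]
    return even, odd, energy, phase
```

For a pure cosine whose frequency equals the filter's centre frequency, the even part reproduces the cosine, the odd part along the same axis is the corresponding sine (up to sign), and the energy is constant, which is the expected quadrature behaviour.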

Methodology
To recognize facial expressions from video sequences, a set of features that best describe the facial muscle changes during an expression is required.
In this paper, a novel local volumetric feature is proposed to describe the local regions of a video sequence. Each pixel of a video frame is encoded by a phase angle. To construct a weighted histogram of local phase, the phase is quantized into equally sized bins. The energy of each pixel is then used as its vote for the corresponding phase bin. Fig. 2a shows the energy image of a sample video frame. The votes belonging to the same bin are accumulated over local spatio-temporal regions that we call cells. To reduce noise, only pixels whose energy exceeds a predefined threshold participate in the voting. The next step is grouping cells into larger spatial segments named blocks. Each block descriptor is formed by concatenating all the cell histograms within that block. Block histogram contrast normalization is then applied to obtain a coherent description. The final descriptor of the frame is obtained by concatenating all the normalized block histograms of that frame.
It is worth noting that our proposed spatio-temporal descriptor can handle video sequences and video segments of varying lengths.
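The voting, cell accumulation, and block normalization steps described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the helper names, the 8 phase bins, the [0, π] phase range, and the use of the L2 norm for block contrast normalization are all assumptions.

```python
import numpy as np

def split(arr, parts):
    """Split a 3D array into a flat list of sub-volumes, parts[i] along axis i."""
    pieces = [arr]
    for axis, n in enumerate(parts):
        pieces = [sub for p in pieces for sub in np.array_split(p, n, axis=axis)]
    return pieces

def weighted_phase_histogram(phase, energy, n_bins, threshold):
    """Energy-weighted phase histogram of one cell; weak pixels do not vote."""
    keep = energy > threshold
    hist, _ = np.histogram(phase[keep], bins=n_bins, range=(0.0, np.pi),
                           weights=energy[keep])
    return hist

def sthlp_descriptor(phase, energy, blocks=(4, 4, 2), cells=(3, 3, 1),
                     n_bins=8, threshold=0.0, eps=1e-12):
    """Concatenate normalized block histograms built from cell histograms."""
    features = []
    for b_phase, b_energy in zip(split(phase, blocks), split(energy, blocks)):
        block = np.concatenate([
            weighted_phase_histogram(c_phase, c_energy, n_bins, threshold)
            for c_phase, c_energy in zip(split(b_phase, cells),
                                         split(b_energy, cells))
        ])
        # Assumed contrast normalization: divide each block by its L2 norm.
        features.append(block / (np.linalg.norm(block) + eps))
    return np.concatenate(features)
```

With these defaults the descriptor length is blocks × cells × bins; because the partition is defined by counts rather than by fixed sizes, the same code applies to volumes of different lengths.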

Spatio-Temporal Histogram of Local Orientation (STHLO)
To construct a weighted histogram of orientations similar to STHLP, the two local orientation angles are quantized into equally sized bins. Each pixel's vote is weighted by the magnitude of the orientation vector; the magnitude in Eq. (15) is the square root of the sum of the squared second and third components at position (x, y, t). One magnitude is used to vote in the bins of the first angle and the other to vote in the bins of the second angle. These voting components are depicted in Fig. 2b and Fig. 2c, respectively, for a sample video frame. The proposed method is efficient because it reuses the magnitude information already computed. This differs from the common approach, which computes the local orientation from the outputs of ensembles of oriented filters such as Prewitt, Sobel, and Laplace at differing orientations [19][20][21][22].
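A minimal sketch of magnitude-weighted orientation voting is given below. The angle definitions (an in-plane angle and a temporal elevation), the pairing of each angle with a magnitude, the bin count, and all names are assumptions chosen for illustration; only the idea of quantizing two angles and voting with already-computed magnitudes comes from the text.

```python
import numpy as np

def orientation_votes(rx, ry, rt):
    """Two orientation angles and vote weights from three odd filter responses."""
    in_plane = np.hypot(rx, ry)                       # magnitude in the image plane
    full = np.sqrt(rx ** 2 + ry ** 2 + rt ** 2)       # magnitude of the full vector
    theta = np.mod(np.arctan2(ry, rx), np.pi)         # spatial angle, folded to [0, pi]
    phi = np.arctan2(rt, in_plane)                    # temporal elevation in [-pi/2, pi/2]
    return theta, phi, in_plane, full

def orientation_histograms(rx, ry, rt, n_bins=8):
    """Concatenate the two magnitude-weighted angle histograms."""
    theta, phi, w_theta, w_phi = orientation_votes(rx, ry, rt)
    h_theta, _ = np.histogram(theta, bins=n_bins, range=(0.0, np.pi),
                              weights=w_theta)
    h_phi, _ = np.histogram(phi, bins=n_bins, range=(-np.pi / 2, np.pi / 2),
                            weights=w_phi)
    return np.concatenate([h_theta, h_phi])
```

Because every vote is weighted by a magnitude, the sum of each histogram equals the total magnitude of the pixels that voted, so flat regions with near-zero response contribute almost nothing.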

Multi-scale Analysis
Multi-scale or multi-resolution analysis of facial features has been used in many research works [23,24]. Since some facial features are detectable at a certain scale and may not be as distinguishable at other scales, it is more reliable to extract the features at several scales. The multi-resolution representation can be achieved by varying the wavelength of the bandpass filter in Eq. (7). Finally, the LPLO descriptor is obtained by combining the features at the different scales. Fig. 3 illustrates how the energy component defined by Eq. (10) varies as the filter scale changes.

Results and Discussion
We evaluated our proposed method on two publicly available databases for facial expression recognition. The first is the Cohn-Kanade (CK+) [25] dataset, which consists of acted emotional states recorded in a controlled environment. The second is the Audio Visual Emotion Challenge (AVEC 2011) [26] dataset, which contains spontaneous emotional states in natural settings.

Cohn-Kanade Database
CK+ includes 593 video sequences recorded from 123 university students. The subjects were asked to express a series of 23 facial displays including single or combined action units. Six of the displays were based on descriptions of prototype basic emotions (joy, surprise, anger, fear, disgust, and sadness). For our experiment, we used the 309 sequences from the dataset that are labeled with one of the six basic emotions. The other sequences are either labeled as "contempt", which is outside our objective, or have no label. Fig. 4 shows sample images of a subject expressing the six basic emotions.
To evaluate our proposed approach, leave-one-subject-out (LOO) cross validation is used. With this method, no subject appears in both the training and test samples, so our experiments are subject independent. To classify the samples into the six basic emotions, a Support Vector Machine (SVM) with a polynomial kernel function is used. SVM was originally proposed as a binary classifier and was subsequently extended to multi-class problems [27]. For our database, we used the one-against-all technique, which constructs 6 binary SVM classifiers, each separating one emotion from all the others. The final outcome is obtained using majority voting.
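The subject-independent protocol can be illustrated with a short sketch that generates leave-one-subject-out folds from a list of per-sequence subject identifiers. The function name and the example identifiers are hypothetical; the classifier itself is omitted.

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_mask = subject_ids == subject
        # All sequences of the held-out subject go to the test fold,
        # so no subject ever appears on both sides of a split.
        yield subject, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Hypothetical example: six sequences recorded from three subjects.
folds = list(leave_one_subject_out(["s1", "s1", "s2", "s3", "s3", "s3"]))
```

Each fold would train the six one-against-all classifiers on the training indices and evaluate them on the held-out subject's sequences.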
We also applied a preprocessing stage before feature extraction. The images were aligned to have a constant distance between the two eyes, and then rotated to line up the eye coordinates horizontally. Finally, the faces were cropped to a square of size 100×100. Since our method is robust to illumination variation, we do not require illumination normalization at the preprocessing step. We then carried out an experiment with different log-Gabor parameters to check the effect of filter wavelength and bandwidth on the classification performance. Our experimental results show that a bandwidth of 0.75 and 3 scales of the bandpass filter with wavelengths of {4, 8, 12} outperform the other settings in terms of detection rate. The results are shown in Table 1 and Table 2, respectively. These parameters are fixed in our subsequent experiments.
The next experiment was conducted on different numbers of blocks and cells to compare the feature length and the detection rate. Table 3 tabulates the results obtained. Based on this experiment, partitioning the data into 32 blocks (4×4×2) and 9 cells (3×3×1) outperforms the other settings in terms of classification accuracy.
We also evaluated both feature sets (STHLP and STHLO) individually and in combination to validate whether they are indeed complementary descriptors. The results of our evaluation are summarized in Table 4, where the combined feature is named LPLO. The detection rate of the combined features is better than that of each feature set individually.
We also present the results of the LOO cross validation as a confusion matrix in Table 5 to analyze the performance of our proposed method on each expression. It can be seen that the detection rates of the two expressions "fear" and "sadness" are not as high as those of the other expressions.
In the CK+ database, there are limited samples with illumination and skin colour variations. In addition, the variation is only minor and not sufficient to test the effect of illumination variation on our proposed descriptor. We therefore recorded an additional 15 video sequences of happy and surprise expressions under different illumination conditions. We used the recorded samples to evaluate our classifier trained on the CK+ database (which has only minor illumination variation). Our proposed method detects the true label for 14 of the 15 sequences (93.33% accuracy). Fig. 5 shows the recorded samples for the happy expression, and a sample face of each sequence together with the recognition results. One sequence is misclassified because of the strong contrast at the mouth region, which prevents the features in the much darker part of the mouth from being properly extracted.
Table 6 presents a comparison of our proposed approach with other methods reported in the literature that also used the CK+ database. Brief information on each method is included: the number of subjects, dynamic or static processing, the evaluation measurement, and the classification accuracy. Since the experimental settings used in these approaches are not identical, it is difficult to compare their performance quantitatively. However, it is still useful to compare them in a relative sense to understand the strengths and weaknesses of each method. Our experiment used the largest number of subjects, yet the result is comparable to the best approach, LBPTOP [10]. Moreover, our approach is about an order of magnitude faster than LBPTOP: the computational time of our proposed feature extraction for one subject is around 0.50 sec, while LBPTOP requires 5.5 sec.

AVEC 2011 Database
We also evaluated our proposed approach using the AVEC 2011 database, which is more challenging as it is captured in a natural setting. This database consists of 95 videos recorded at 49.979 frames per second at a spatial resolution of 780×580 pixels. Binary labels along the four affective dimensions (activation, expectation, power and valence) are provided for each video frame. The data is divided into 3 subsets: training, development, and testing. The training set consists of 31 records, the development set (for validation of the model parameters) consists of 32 records, and the test set consists of 11 video sequences.
As described in the challenge baselines [26], because of the large amount of data (more than 1.3 million frames), we sampled the videos using a constant sampling rate. We segmented each video into 60-frame windows with 20% overlap and then downsampled each window by a factor of 6, so each data volume includes 10 frames. Due to memory constraints, we processed only 1550 frames of each video for the training and development sets (a total of 48050 frames for training and 49600 frames for development).
The information describing the position of the face and eyes is provided in the database. The preprocessing stage includes only normalizing the faces to have a constant distance between the two eyes.
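The windowing scheme can be made concrete with a small sketch that returns the frame indices of each volume. The function name is hypothetical, and dropping a trailing partial window is an assumption; the window length, overlap, and downsampling factor follow the text.

```python
def segment_indices(n_frames, window=60, overlap=0.2, step_down=6):
    """Frame indices of each volume: overlapping windows, then temporal downsampling."""
    hop = int(round(window * (1 - overlap)))   # 48 frames between window starts
    return [
        list(range(start, start + window, step_down))
        # Assumption: a trailing partial window is dropped.
        for start in range(0, n_frames - window + 1, hop)
    ]

# The 1550 frames processed per video yield 10-frame volumes.
volumes = segment_indices(1550)
```

With a 20% overlap, consecutive windows start 48 frames apart, and keeping every sixth frame of a 60-frame window gives the 10 frames per volume stated above.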
Table 7 shows the recognition results of our approach compared with the other reported results. We report only the weighted accuracy of the methods, which is the number of correctly detected samples divided by the total number of samples. For the development subset, the average results obtained by our method are above the baseline results and those of [34] and [35]. For the test subset, we achieved the best average accuracy among all competitors. This indicates that the proposed descriptor can be effective for detecting spontaneous emotions in natural settings.

Conclusion
In this paper, we proposed a novel local descriptor to analyze dynamic facial expressions from video sequences. Our descriptor is composed of two feature sets, STHLP and STHLO, which describe the local phase and orientation information of the structures in the images. The proposed phase-based descriptor provides a measure that is independent of the signal magnitude, making it robust to illumination variations. Our experimental results indicate that the combined local phase and local energy model is a useful approach that improves the reliability of emotion recognition systems in real-world applications in the presence of scale and illumination variations.
Two complementary feature sets are proposed in this study: Spatio-Temporal Histogram of Local Phase (STHLP) and Spatio-Temporal Histogram of Local Orientation (STHLO).The final feature set is formed by concatenating both STHLP and STHLO which we named Local Phase-Local Orientation (LPLO).

Figure 2. Illustration of the components used to vote the phase and orientation bins for one sample image of a video sequence; (a) Energy signal for voting the phase bins; (b) Magnitude for voting the bins of the first orientation angle; (c) Magnitude for voting the bins of the second orientation angle.

Figure 3. Illustration of the energy component at different scales. The scale of the bandpass filter is increased from (a) to (d).

Figure 4. Example of basic emotions from the CK+ database. Each image is the last frame of a video clip that shows the most expressive face.

Figure 5. Variation of illumination; (a) Recorded samples of happy expression under different illumination conditions; (b) Sample face of each sequence; (c) Classification results.

Table 3. Comparison of the effect of the number of blocks and the number of cells on classification accuracy.

Table 5. Confusion matrix for six basic emotions.

Table 6. Comparison with other approaches on the CK+ database.

Table 7. Comparison of the detection rate for the AVEC 2011 database. A stands for activation, E for expectancy, P for power, and V for valence.