Blind image quality assessment via probabilistic latent semantic analysis

We propose a blind image quality assessment method that is highly unsupervised and training-free. The method is based on the hypothesis that the effects caused by distortion can be expressed by certain latent characteristics. Through probabilistic latent semantic analysis, these latent characteristics can be discovered by applying a topic model over a visual word dictionary. Four distortion-affected features are extracted to form the visual words in the dictionary: (1) the block-based local histogram; (2) the block-based local mean value; (3) the mean value of contrast within a block; and (4) the variance of contrast within a block. Based on the dictionary, the latent topics in images can be discovered. The discrepancy between the topic frequencies of an unfamiliar image and those of a large number of pristine images is used to measure image quality. Experimental results on four open databases show that the newly proposed method correlates well with human subjective judgments of diversely distorted images.

Distortion-specific NR IQAs are made under the assumption that image quality is affected by one or several particular kinds of distortion, such as blockiness (Wang et al. 2000; Pan et al. 2004), ringing (Liu et al. 2010), blurring (Ferzli and Karam 2009; Varadarajan and Karam 2008), and compression (Sheikh et al. 2005; Babu et al. 2007; Sazzad et al. 2008; Liang et al. 2010). The application domain of these approaches is limited, as they are only suitable for the presumed distortion types. In contrast, universally purposed models (Li et al. 2011; Mittal et al. 2013; Zhang et al. 2015; Moorthy and Bovik 2011; Saad et al. 2012; Moorthy and Bovik 2010; Saad et al. 2010) are intended to handle multiple, possibly unknown distortions and typically involve machine learning techniques. GRNN (Li et al. 2011) deployed a generalized regression neural network combined with perceptually relevant image features to train an IQA model. DIIVINE (Moorthy and Bovik 2011) is an extended version of BIQI (Moorthy and Bovik 2010), and both are based on a two-step framework for quality estimation. The BRISQUE model deploys a space-domain natural scene statistic (NSS) model to quantify possible losses of naturalness in an image due to the presence of distortions. However, the abovementioned algorithms require auxiliary information in the form of human opinion scores, and most use machine learning to train a regression model. These NR IQAs are thus only suitable for images whose distortion types have already been trained for, resulting in weak generalization capability. Collecting enough training samples of all such manifold distortion types, and then obtaining their human opinion scores, is an expensive and time-consuming procedure.
Given the shortcomings mentioned above, it is meaningful to develop a highly unsupervised IQA method that requires no training on human opinion scores and does not need training samples of distortions. In earlier work, a blind IQA method based on latent quality factors (LQF) was proposed that conducts probabilistic latent semantic analysis (PLSA) on the NSS-based features of the image patches of the training set.
Motivated by the success of PLSA for image latent topic discovery, we propose a blind IQA method based on PLSA. The framework of our method mainly comprises (1) feature extraction based on grayscale fluctuation (GF) (Yang et al. 2014) analysis; (2) construction of a dictionary of visual words; (3) discovery of latent distortion-affected topics in images via PLSA; and (4) measurement of image quality. The main benefits of our method are (1) that it is highly unsupervised and requires no a priori information such as human opinion scores; (2) that its distortion-affected feature selection reveals image structural information in terms of both intensity and distribution; and (3) that it is not distortion-specific and is effective across multiple distortion types. In this paper, we compare the performance of the proposed method with that of established methods on four open databases: LIVE2 (Sheikh et al. 2006), CSIQ (Larson and Chandler 2010), TID2008 (Ponomarenko et al. 2009) and LIVE Multiply Distorted (Jayaraman et al. 2012). The experimental results on the four open databases show that the newly proposed method accords closely with human subjective judgments of diversely distorted images.
The remainder of this paper is organized as follows. In the next section, we discuss the grayscale fluctuation analysis. In "Proposed approach" section, we introduce our method in detail. The experimental results of the proposed method are presented in "Experiments and results" section, followed by the conclusions in "Conclusion" section.

Grayscale fluctuation analysis
Image degradation refers to imaging that fails to fully reflect the true content of the scene, and it affects image features in terms of, for example, smoothing, sparsity and regularity. A direct effect on an image is a change to the image texture (Liu and Yang 2008; Acharya and Ray 2005). Meanwhile, the image GF reflects changes in image texture. We can thus gauge the degree of image degradation by analyzing the image GF.

Grayscale fluctuation primitive
Using the GF primitive (Yang et al. 2014), we can analyze the GF relationships between a certain pixel and its neighbors. As shown in Fig. 1, the primitive is a 3 × 3 square centered at V5. To obtain the GF relationship between the center pixel and all its neighbors, the detection directions are set to 0°, 45°, 90°, and 135°.
The neighboring-pixel grayscale vector angle (GVA) and the neighboring-pixel grayscale mutually exclusive value (MEV) are two variables used to represent image GF. These two variables are denoted Ga_x and Dt_x (x = 1, 2, 3, 4), respectively. The method of calculating Ga_x is shown in Fig. 2. The central pixel of the primitive is defined as the origin of the coordinate axes (o); o1 and o2 are the neighboring pixels in the current detection direction. d1 and d2 are the absolute grayscale differences between the central pixel and its two neighbors, and they define the two grayscale vectors (−1, d1) and (1, d2). The angle θ ∈ [0°, 180°] between these vectors is the GVA and reflects the GF of the center pixel: the GF increases with decreasing θ. cos θ is thus used to represent Ga_x, which is calculated according to formula (1) and ranges over [−1, 1]. Dt_x represents the change trend of the GF in the current detection direction and is assigned 1 or −1 according to formula (2), where c1 and c2 are the signed grayscale differences between the central pixel and its two neighbors. The GF has a relatively large change trend when the signs of c1 and c2 differ.
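As a concrete illustration, the per-direction quantities can be sketched as follows. Note that the construction of the two grayscale vectors as (−1, d1) and (1, d2), and the sign convention for Dt_x, are assumptions made for this sketch, since formulas (1) and (2) are not reproduced here:

```python
import math

def gf_primitive_values(center, n1, n2):
    """Compute Ga_x (cosine of the GVA) and the trend value Dt_x for one
    detection direction, given the grayscales of the center pixel and its
    two neighbors n1, n2.

    Assumed reconstruction: the two grayscale vectors are (-1, d1) and
    (1, d2), so a flat profile (d1 = d2 = 0) gives theta = 180 deg
    (Ga_x = -1), while a sharp fluctuation drives theta toward 0 deg
    (Ga_x -> 1), matching "GF increases with decreasing theta".
    """
    d1 = abs(center - n1)
    d2 = abs(center - n2)
    # cos(theta) between (-1, d1) and (1, d2)
    ga = (-1.0 + d1 * d2) / (math.sqrt(1 + d1 * d1) * math.sqrt(1 + d2 * d2))
    # Signed differences; Dt_x flags a change of trend when their signs differ.
    c1, c2 = center - n1, center - n2
    dt = 1 if c1 * c2 < 0 else -1
    return ga, dt

ga, dt = gf_primitive_values(10, 10, 10)  # flat profile: ga == -1.0, dt == -1
```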

Grayscale fluctuation primitive map
The GF primitive is employed to analyze an image pixel by pixel. The GF of the central pixel is calculated via formula (3), where B_x is the GF value in one detection direction and B is the overall GF value of the central pixel. The initial values of B_x and B are set to 0; B_x and B are thus both integers. In addition, φ is the threshold of Ga_x. Because x ∈ {1, 2, 3, 4}, the ranges of B_x and B are [0, 2] and [0, 8], respectively. Note that B is an integer; hence, there are 9 possible values of B. By replacing the value of the central pixel at the corresponding location of the image with B, we obtain the GF map. The values of the pixels in the GF map are proportional to the grayscale fluctuations between the pixels and their neighbors at the corresponding locations of the original image. A flowchart of the calculation procedure of the GF map is shown in Fig. 1. We obtain the first GF map through the first primitive analysis; then, by analyzing the first GF map, we obtain the second GF map.
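A rough sketch of the pixel-by-pixel primitive analysis is given below. The exact increment rule for B_x is an assumption of this sketch (here: +1 when Ga_x > cos φ, i.e. the GVA is below the threshold, and +1 more when Dt_x = 1), chosen only so that B_x ∈ [0, 2] and B ∈ [0, 8] as stated above:

```python
import numpy as np

def gf_map(img, phi_deg=90.0):
    """Hypothetical sketch of the primitive-map computation: for each
    interior pixel, accumulate B over the four detection directions and
    write B into the output map."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    out = np.zeros((h, w), dtype=int)
    cos_phi = np.cos(np.radians(phi_deg))
    # neighbor offset pairs for the 0, 45, 90 and 135 degree directions
    dirs = [((0, -1), (0, 1)), ((-1, 1), (1, -1)),
            ((-1, 0), (1, 0)), ((-1, -1), (1, 1))]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            b = 0
            v = img[i, j]
            for (di1, dj1), (di2, dj2) in dirs:
                o1, o2 = img[i + di1, j + dj1], img[i + di2, j + dj2]
                d1, d2 = abs(v - o1), abs(v - o2)
                # cos(theta) between the assumed vectors (-1, d1), (1, d2)
                ga = (-1 + d1 * d2) / np.sqrt((1 + d1 ** 2) * (1 + d2 ** 2))
                dt = 1 if (v - o1) * (v - o2) < 0 else -1
                b += int(ga > cos_phi) + int(dt == 1)   # assumed rule
            out[i, j] = b
    return out
```

Applying the function twice (with thresholds φ1 and φ2) yields the first and second GF maps described in the text.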
The GF map represents the relationship between pixels and their neighbors in the original image. It can thus reflect the degree of change in the image texture and can be used to further analyze the degree of image distortion. To demonstrate this, we select one reference image from the LIVE2 image database along with versions of the image having different degrees of distortion. Figure 3 shows the reference image, named "parrots", and its first and second primitive maps. Figure 3a is the reference image, whose difference mean opinion score (DMOS) is 100; Fig. 3b is the first primitive map of (a); and Fig. 3c is the second primitive map of (b). The first primitive analysis threshold is φ1 = 90° and the second primitive analysis threshold is φ2 = 90°. Corresponding to Fig. 3, Fig. 4 shows JPEG2000-distorted versions of the reference image, sorted in descending order of the degree of distortion from (a) to (c). Using φ1 and φ2, we obtain the primitive maps shown in (d-i). In the same way, we obtain the primitive maps of images distorted by white noise (WN), as shown in Fig. 5. We draw the following conclusions from Figs. 3, 4 and 5. First, there is a clear connection between primitive maps and the degree of distortion. Second, the first primitive map is the more visually representative of the two. Third, using primitive maps to analyze distortion may be applicable to different distortion types. Accordingly, we use primitive maps to analyze image quality.

Proposed approach
This section discusses the proposed approach in detail. The approach can be divided into four steps: (1) the GF primitive is employed to extract distortion-affected features; (2) a visual word dictionary is constructed from the extracted features; (3) the PLSA algorithm is used to discover the latent topics in images; and (4) the differences between the probabilities over latent topics found in the test image and in the pristine images are used as the measurement of image quality. The flow chart of the proposed approach is shown in Fig. 6.

Feature extraction
It has been shown that the image GF primitive map and the degree of image distortion are closely correlated. Hence, we take the image GF primitive map as a rich descriptor of image quality. The values of the pixels in the primitive map reflect the GF situation, so the histogram is the most direct representation of the distribution of values in a primitive map. Figure 7 presents histograms of the first and second primitive maps of WN-distorted images; the histogram curves correspond to the first and second primitive maps shown in Fig. 5. Figure 7 provides interesting findings. First, distortions affect the distributions of the pixel values in primitive maps. Second, the first and second primitive maps have histogram curves with different shapes and properties. We have found that these phenomena are broadly observed in natural images beyond those reported here. Image texture has regional characteristics; by analyzing image primitive maps block by block, we can learn the texture situation of each local area. We thus use the block-based local histogram of the primitive map as a quality-aware feature with which to measure image quality. Additionally, the block-based local mean value of the pixel values in the image primitive map is chosen to represent the intensity of the GF situation in different image regions. We define the block-based local mean value as M_b, calculated according to formula (4), in which the dimensions of the image block are n × n and I(i, j) is the value of the pixel at the designated location.
M_b = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} I(i, j).  (4)

The use of contrast is an effective way of analyzing the correlation between pixels in a block. We divide the primitive map into overlapping blocks of dimensions n × n, with an overlap of n_x × n_x between neighboring blocks. In each block, the local contrast is computed over each patch of size n_p × n_p. The calculation method used in this paper is the root-mean-square (RMS) contrast method (Peli 1990), where the RMS contrast is defined as the standard deviation of the pixel intensities:

C = sqrt( (1/n_p²) Σ_{i=1}^{n_p} Σ_{j=1}^{n_p} (I(i, j) − Ī)² ).  (5)
Here, I(i, j) is the intensity at the designated location of the patch, and Ī is the average intensity of all pixel values in the patch. Each block then has a two-dimensional feature array of size m_x × m_x, with m_x = ⌊n/n_p⌋. We calculate the mean and variance of the feature array, denoted C_mean and C_var respectively.
In summary, this paper uses the block-based local histogram, the block-based local mean value, the mean value of contrast within a block, and the variance of contrast within a block of the primitive map as four quality-aware features with which to construct an image visual word dictionary.
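The four block features might be assembled along the following lines. This is a sketch with non-overlapping blocks and illustrative sizes n and n_p, not the paper's settings:

```python
import numpy as np

def block_features(pmap, n=32, n_p=8):
    """For each n x n block of a primitive map, extract: the 9-bin local
    histogram of B values (B ranges over [0, 8]), the local mean M_b
    (formula (4)), and the mean and variance (C_mean, C_var) of the RMS
    contrast computed over n_p x n_p patches inside the block."""
    feats = []
    h, w = pmap.shape
    for i in range(0, h - n + 1, n):       # non-overlapping for simplicity
        for j in range(0, w - n + 1, n):
            blk = pmap[i:i + n, j:j + n].astype(float)
            # normalized 9-bin histogram of the integer B values
            hist = np.bincount(blk.astype(int).ravel(), minlength=9)[:9] / blk.size
            m_b = blk.mean()                       # formula (4)
            rms = [blk[a:a + n_p, b:b + n_p].std() # RMS contrast per patch
                   for a in range(0, n, n_p) for b in range(0, n, n_p)]
            feats.append(np.concatenate([hist, [m_b, np.mean(rms), np.var(rms)]]))
    return np.array(feats)   # shape: (num_blocks, 12)
```

Each row is one 12-dimensional block descriptor (9 histogram bins + M_b + C_mean + C_var), matching X = 12 in the dictionary-construction step.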

Construction of a visual word dictionary
The approach we take to build the visual word dictionary is similar to that described in reference (Hartigan and Wong 1979). The visual words are formed by clustering features computed from multiple blocks across all the primitive maps in the training set, whose scale is assumed to be N_tr. Each primitive map is divided into overlapping blocks of size n × n, with an overlap of n_x × n_x between neighboring blocks. Following the discussion in the previous section, we calculate the local histogram, M_b, C_mean and C_var over each block. The range of B is [0, 8], so the length of the local histogram is 9. Therefore, for the i-th primitive map in the training set, sized M_i × N_i, we obtain a two-dimensional feature matrix sized X × n_b^i, where X = 12 and n_b^i = ⌊M_i/n⌋ × ⌊N_i/n⌋. Feature vectors from blocks across all images are gathered into a large feature matrix sized 12 × n_b, where n_b = Σ_{i=1}^{N_tr} n_b^i. Because this feature matrix is very large, the k-means clustering algorithm (Sivic 2005) with the squared Euclidean distance metric is used to condense it: by assigning each block to the nearest cluster center, the feature matrix is reduced to size 12 × n_w. We define each cluster center as one visual word in the visual word dictionary, and use n_w to denote the scale of the dictionary.
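Dictionary construction could be sketched as follows; the plain k-means below is a minimal stand-in for the cited implementation:

```python
import numpy as np

def build_dictionary(features, n_w=400, iters=20, seed=0):
    """Cluster the 12-dimensional block features (rows of `features`) into
    n_w visual words with plain k-means under the squared Euclidean
    distance; each cluster center is one visual word."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=n_w, replace=False)]
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        # assign each block feature to its nearest center
        d = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # move each center to the mean of its assigned features
        for k in range(n_w):
            pts = features[labels == k]
            if len(pts):
                centers[k] = pts.mean(0)
    return centers, labels
```

The returned `labels` also give each block's visual-word assignment, from which the per-image word-count histograms used by PLSA can be accumulated.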

Probabilistic latent semantic analysis
The present paper uses the probabilistic latent semantic analysis (PLSA) model of Hofmann (2001) to find certain latent characteristic differences between distorted images and pristine images. The PLSA model was first employed to discover latent topics embedded within a collection of text documents in a corpus. In the present paper, the corpus is an assortment of pristine and distorted images; the scale of the corpus is the scale of the training set, denoted N. The images in the corpus can be described as empirical distributions over visual words from the visual word dictionary. The visual words in the dictionary are one-dimensional vectors, and the total number of visual words in the dictionary is n_w. Let I_i be the i-th image in the corpus; I_i comprises n_w^i words, with the j-th word denoted w_ij. We then assume that there are K latent topics pervading the collection of images in the corpus, with the k-th topic denoted by the indicator variable z_k. All images in the corpus can be represented as distributions over the K topics, with a latent topic z_k associated with each word w_ij in the image I_i. The conditional probability of the word w_ij occurring in an image I_i is obtained by marginalizing over the latent topics z_k, with k = 1…K:

P(w_ij | I_i) = Σ_{k=1}^{K} P(w_ij | z_k) P(z_k | I_i).  (6)
Here, P(z_k | I_i) is the probability of the k-th topic occurring in the i-th image and P(w_ij | z_k) is the probability of the j-th visual word occurring in the k-th topic. Thus, the k-th topic can be represented by the n_w-dimensional vector P(w_j | z_k), with j = 1…n_w, and the image I_i can be described by a K-dimensional vector of loadings P(z_k | I_i), with k = 1…K. Under the above assumptions, there are topics that pervade the collection of images, and their loadings for a given image can be inferred by finding the model that best explains the probability distribution of the visual words in the images. The present paper uses the expectation-maximization (EM) algorithm (Hofmann 2001) to obtain the maximum-likelihood estimates of the model parameters. Note that the PLSA framework follows the "bag of words" approach, as the spatial arrangement of word occurrences is not taken into account.
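The EM fitting can be sketched compactly as follows; `counts` is the image-by-visual-word occurrence matrix, and the updates are the standard (unoptimized) PLSA ones:

```python
import numpy as np

def plsa(counts, K=4, iters=50, seed=0):
    """Fit PLSA by EM on a word-count matrix of shape (num_images, n_w).
    Returns P(w|z), shape (K, n_w), and the loadings P(z|image),
    shape (num_images, K)."""
    rng = np.random.default_rng(seed)
    n_img, n_w = counts.shape
    p_w_z = rng.random((K, n_w)); p_w_z /= p_w_z.sum(1, keepdims=True)
    p_z_d = rng.random((n_img, K)); p_z_d /= p_z_d.sum(1, keepdims=True)
    for _ in range(iters):
        # E-step: responsibilities P(z | image, w), shape (n_img, K, n_w)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        resp = joint / joint.sum(1, keepdims=True).clip(1e-12)
        # M-step: re-estimate both distributions from expected counts
        weighted = counts[:, None, :] * resp
        p_w_z = weighted.sum(0)
        p_w_z /= p_w_z.sum(1, keepdims=True).clip(1e-12)
        p_z_d = weighted.sum(2)
        p_z_d /= p_z_d.sum(1, keepdims=True).clip(1e-12)
    return p_w_z, p_z_d
```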

Image quality inference
P(w | z), learned via EM model fitting from the training set that comprises both pristine and distorted images, is used to infer the latent quality factors in a new image outside the training set. We denote the new image I_new, and P(z | I_new) can be calculated using the fold-in heuristic described in reference (Hofmann 2001). For the new image I_new, the empirical visual word distribution P(w | I_new) is first calculated. P(z | I_new) is then sought such that the Kullback-Leibler divergence between P(w | I_new) and the model distribution Σ_{k=1}^{K} P(z_k | I_new) P(w | z_k) is minimized. EM is again employed to estimate P(z | I_new), but only the loadings are updated, while P(w | z) learned from the training set is held fixed. The frequency of the different topics in I_new (i.e., P(z | I_new)) is compared with the frequency of the different topics for each pristine image in the training set. We denote the topic frequency of each pristine image by P(z | I_p^i), where I_p^i is the i-th pristine image in the training set; P(z | I_p^i) is calculated during the model-fitting procedure that learns the topic-specific word distribution P(w | z). Following earlier work, we make the comparison by computing the dot product between P(z | I_new) and P(z | I_p^i). After the dot product has been calculated across all pristine images in the training set, the average dot product is used as the indicative index of image quality, and the quality of I_new is denoted Q(I_new). The calculation is given as formula (7):

Q(I_new) = (1/N_p) Σ_{i=1}^{N_p} P(z | I_new)′ P(z | I_p^i).  (7)
Here, the symbol ′ is the transpose operator, and N_p is the number of pristine images in the training set. Owing to the linearity of the dot product, formula (7) can be written as formula (8):

Q(I_new) = P(z | I_new)′ [ (1/N_p) Σ_{i=1}^{N_p} P(z | I_p^i) ].  (8)
Formula (8) shows that the quality assessment in this paper can be seen as a measurement of the change in the topic distribution due to distortion, where the topic distribution of an undistorted image is taken as the mean of the topic distributions of the pristine images in the training set, (1/N_p) Σ_{i=1}^{N_p} P(z | I_p^i).
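The fold-in step and the quality score of formula (8) might look like this. In this sketch, `counts_new` is the visual-word count vector of I_new and `p_z_pristine` stacks the pristine loadings P(z | I_p^i) row-wise; both names are illustrative:

```python
import numpy as np

def fold_in(counts_new, p_w_z, iters=50):
    """Estimate P(z | I_new) by EM with P(w|z) held fixed
    (the fold-in heuristic): only the topic loadings are updated."""
    K = p_w_z.shape[0]
    p_z = np.full(K, 1.0 / K)
    for _ in range(iters):
        joint = p_z[:, None] * p_w_z                       # (K, n_w)
        resp = joint / joint.sum(0, keepdims=True).clip(1e-12)
        p_z = (counts_new[None, :] * resp).sum(1)
        p_z /= max(p_z.sum(), 1e-12)
    return p_z

def quality(p_z_new, p_z_pristine):
    """Formula (8): dot product of the new image's topic loading with the
    mean topic loading of the pristine images."""
    return float(p_z_new @ p_z_pristine.mean(0))
```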

Experiments and results

Databases and metrics
To determine the performance of the newly proposed IQA method, we conduct experiments on four publicly available image databases in the IQA community, namely LIVE2 (Sheikh et al. 2006), CSIQ (Larson and Chandler 2010), TID2008 (Ponomarenko et al. 2009) and LIVE Multiply Distorted (Jayaraman et al. 2012). It is worth noting that the LIVE Multiply Distorted database includes images with multiple distortions. Images were constructed under two scenarios of multiple distortion: (1) image storage where images are first blurred and then compressed by a JPEG encoder and (2) a camera image acquisition process where images are first blurred by the narrow depth of field or another defocusing mechanism and then corrupted by white Gaussian noise to simulate sensor noise. This paper refers to the two distortion scenarios as LiveMD1 and LiveMD2, respectively.
Two commonly used performance metrics are employed to evaluate the IQA methods. The first is the Spearman rank-order correlation coefficient (SROCC), which measures the prediction monotonicity of an IQA metric. The second is the Pearson linear correlation coefficient (LCC) between the DMOS values and the objective scores after nonlinear regression; we use the logistic function (Group 2003) to fit the results of the newly proposed method to the subjective data. On each image database, we perform 100 repetitions of a train-test validation experiment and take the median value as the final result. In each run, we randomly select the same number of reference images and their distorted versions for learning the latent quality factors, and the remaining reference images and their associated distorted versions are used for performance evaluation. In this way, we minimize the interference caused by the choice of training set.
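The two correlation metrics can be computed as follows (a minimal sketch; tie handling in the Spearman ranks and the nonlinear logistic-fitting step are omitted):

```python
import numpy as np

def lcc(x, y):
    """Pearson linear correlation coefficient between two score vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    return float((xm * ym).sum() / np.sqrt((xm ** 2).sum() * (ym ** 2).sum()))

def srocc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks
    (average ranks for ties are not handled in this sketch)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return lcc(rank(np.asarray(x)), rank(np.asarray(y)))
```

SROCC is insensitive to any monotone mapping of the scores, which is why the logistic regression step is needed only for the LCC.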

Performance evaluation of the grayscale vector angle threshold
φ is an important factor in image GF analysis, and it inevitably affects IQA results. For an unseen image, the first primitive analysis threshold φ1 is used to obtain the first primitive map of the image. The second primitive map is then obtained by analyzing the first primitive map using the second primitive analysis threshold φ2. To analyze the effect of φ and to select the optimal threshold value, we set the groups of threshold combinations given in Table 1.
We perform the comparative experiment on the LIVE2 database, and the predicted quality results for the different threshold combinations given in Table 1 are illustrated in Fig. 8. In each run of the experiment, we randomly select 80% of the reference images and their distorted versions for training, and the other 20% of the reference images and their associated distorted versions for performance evaluation. We draw the following conclusions from the results presented in Fig. 8. First, the chosen threshold values affect the quality assessment results. Second, the SROCC value is the more likely of the two metrics to be affected by the threshold values. Third, each distortion type achieves high SROCC and LCC values for some threshold combination, but the best combinations differ across distortion types, and the differences between the best and worst results are around 0.1; the prediction performance of the newly proposed IQA method thus depends to some extent on the chosen thresholds. According to the above findings, we select threshold combinations that achieve good results for all types of distortion: the 5th, 6th and 30th combinations in Table 1. We use the chosen threshold combinations to extract image features and then combine the features to construct the visual word dictionary. Figure 9 shows the comparison of results for the mixed threshold combinations and the single threshold combinations; the result for the single threshold combinations is the optimal result among the 30 groups of threshold combinations for each distortion type. We draw the following conclusions from the results presented in Fig. 9. First, the differences in SROCC and LCC values between the mixed threshold and the best-performing single-group threshold are less than 0.03 for all distortion types. Second, for most distortion types encountered in the LIVE2 database, the newly proposed IQA method delivers promising results. Third, the new assessment shows a good overall result. Therefore, by choosing suitable threshold combinations, the newly proposed method can be universally purposed without requiring auxiliary information outside the image itself, such as human opinion scores.

Performance depending on the training set proportion
Since the newly proposed method requires distorted and pristine images to train the model, the size of the training set may affect the quality prediction results. We partition the LIVE2 database into a training set and a test set, and conduct a comparative experiment for seven partition proportions; i.e., 10, 20, 30, 40, 50, 60, and 80% of the reference images and their associated distorted images are used for the training set, while the remaining reference and distorted images are used for testing. The SROCC and LCC scores for the different distortions are presented in Fig. 10. The figure shows that the proportion of the training set affects the performance of the newly proposed method. First, the performance for most distortion types tends to fall as the proportion of the training set decreases. Second, the quality assessment for the whole database gradually improves as the proportion of the training set increases. Third, the prediction results do not fluctuate greatly as the proportion changes; as the proportion changes from 20 to 80%, the fluctuations of the prediction are no more than 0.05. Fourth, when the proportion is 10%, only the SROCC value of WN is under 0.65, while the SROCC and LCC values of the other distortion types exceed 0.75. The newly proposed IQA method is clearly robust against variations in the training proportion. In practical applications, the quantity of images that need to be evaluated is tremendous compared with the scale of the training set. The newly proposed method should thus be suitable for practical application, as its performance is not strongly affected by variations in the proportion of the training set.

Performance comparison on publicly available image databases
We compare the newly proposed IQA method with three blind IQA methods: the Natural Image Quality Evaluator (NIQE) (Mittal et al. 2013), the Integrated Local NIQE (ILNIQE) (Zhang et al. 2015), and LQF (Mittal et al. 2012). In the proposed method, the number of visual words is 400 and the number of topics is set to 4. For each database, 80% of the reference images and their corresponding distorted images are selected to constitute the training set, and the remaining images constitute the test set. We repeat the train-test procedure 100 times on each database. For fairness, instead of using their own training sets, in each train-test procedure we train NIQE and ILNIQE on the reference images in the randomly partitioned training set. Table 3 compares the performances of the IQA methods on individual distortion types. For LIVE2, TID2008 and CSIQ, we choose four distortion types: JPEG2000 compression (JPEG2000), JPEG compression (JPEG), WN and Gaussian blur (GBLUR), since these are the only distortion types included in all three databases. In Tables 2 and 3, the best two results are highlighted in italics. We draw the following conclusions from Tables 2 and 3. First, the new IQA method correlates well with human opinion scores on all four publicly available image databases; the overall results of the proposed method are among the best two except for the SROCC on TID2008. Second, in the first average case the proposed method performs better than its competitors; in the second average case, the proposed method performs better than its competitors in terms of LCC, while its average SROCC is only 0.0422 below that of ILNIQE. Third, the SROCC values of the proposed method are among the best two for JPEG and GBLUR on LIVE2, WN and GBLUR on CSIQ, and JPEG2000 on TID2008.
The LCC values of the proposed method are among the best two for JPEG and GBLUR on LIVE2, WN on CSIQ, and JPEG2000 on TID2008. Even in the other cases, where the prediction results are not among the best two, the results of the proposed method remain competitive. Fourth, the newly proposed method outperforms LQF for almost all distortion types; hence, the features we choose for the construction of the visual word dictionary are more quality-aware. Fifth, when aimed at multiple distortion types, the newly proposed method is adaptable and has an obvious advantage: the SROCC and LCC metrics for the proposed method are among the best two results, and both metrics are approximately 0.9 in the comparison experiments conducted on the LiveMD1 database. On the basis of the above discussion, the proposed method is promising for the following reasons. First, its performance is competitive with that of the state-of-the-art IQA methods NIQE and ILNIQE. Second, unavoidable problems must be considered, i.e., the high expense of collecting human opinion scores, the difficulty of obtaining reference images in practical applications, and the fact that most real-life distortions are multiple distortions; the proposed method does not need human opinion scores for training and performs well in predicting the quality of images that contain multiple distortions. Third, the image is a direct expression of information, and the decay of image quality can be modeled by latent semantics associated with that information expression; therefore, by choosing appropriate quality-aware or information-aware features to analyze image latent semantics, we may find an effective way of predicting image quality.

Sensitivity to the training set
To measure the robustness of the proposed method, we perform comparison experiments on the LIVE2 database in two cases. In the first case, for each set of experiments, we construct the training set from all but one distortion type in the LIVE2 database. In the second case, the proposed method is trained on TID2008 and CSIQ and then tested on LIVE2. The results are shown in Tables 4 and 5. It is seen that the newly proposed assessment method correlates well with human opinions; the SROCC and LCC metrics exceed 0.8 for most distortion types, demonstrating the good robustness of the proposed method. In Table 4, only when WN distortion is removed from the training set is the quality of WN-affected images not correctly predicted. One possible reason is that the characterization model of WN differs from those of the other distortion types. In future work, we may need to investigate how to develop a similarity characterization topic model across different distortion types, and thereby decrease the sensitivity to the training distortion types.

Conclusion
We proposed a blind image quality assessment method that is highly unsupervised. PLSA is used to discover latent quality semantics in a set of pristine and distorted images. The feature extraction of the newly proposed method is based on image GF analysis: quality-aware features are taken from the obtained GF primitive maps and are used to construct a visual word dictionary. Using the PLSA algorithm and the visual word dictionary, we discover the latent characteristics of the pristine and distorted images in the training set and construct a topic model. The discrepancy in topic distribution between an unseen image and a set of pristine images is used to measure image quality. It is worth noting that the new method removes the need for human opinion scores. Our future work will focus on obtaining more distortion-affected features that are suitable for image topic-model construction. According to the experimental results, the newly proposed method is somewhat sensitive to the distortion types in the training set; we may need to investigate the interplay between the distortion types and the image topics. In this way, we may obtain a more robust model with which to infer image quality.