A dataset can be clustered by merging the bucket indices that come from the random projection of locality sensitive hashing functions. It should be noted that for this to work the merging interval must be calculated first. To improve the feasibility of large scale data clustering in high dimensional space we propose an enhanced Locality Sensitive Hashing Clustering Method. Firstly, multiple hashing functions are generated. Secondly, data points are projected to bucket indices. Thirdly, bucket indices are clustered to get class labels. Experimental results showed that on synthetic datasets this method achieves high accuracy at much improved cluster speeds. These attributes make it well suited to clustering data in high dimensional space.
Data clustering is an important task in many areas, as such many clustering algorithms have been developed. However, many of these are neither effective nor efficient for data points in high dimensional spaces. There are several main reasons for this. Firstly, the inherent sparsity of high dimensional data hinders conventional clustering algorithms. Secondly, the distance between any two points becomes almost the same in high dimensional space
, therefore it is difficult to differentiate similar data points from dissimilar ones. Thirdly, clusters are often embedded in subspaces of high dimensional space, and different clusters may exist in different subspaces
Image clustering is a typical application of high dimensional clustering algorithms, this is because for image features nearly all the dimensions are high. So, the problem of high dimensional clustering is very apparent in image clustering applications. What’s more, the scale of an image dataset is generally large, and this scale may also change in some online applications. The performance of conventional clustering algorithms deteriorates significantly when used on this kind of dataset. For example, k-means is the mainstream cluster method in the image visual bag of words model for visual dictionary construction. However, when the number of images is large, the time taken to cluster is very long and may even be unacceptable for practical purposes. On top of this, when the image data is ever increasing, re-clustering is needed for k-means as it is not a dynamic cluster method. These limitations of k-means seriously damage its feasibility for use with large incremental image datasets. Some improvements to k-means, such as hierarchal k-means
and approximate k-means
, have been presented, however, these have been shown not to support dynamic clustering
. Currently, widely used clustering methods such as spectrum clustering
and Affinity Propagation
methods incur high memory and computation costs due to the matrix factorization that they require. Therefore, a more efficient, new cluster method is needed.
Random projection is used in many areas including fast approximate nearest-neighbor applications
, signal processing
, anomaly detection
and dimension reduction
. This is largely due to the fact that distances are preserved under such transformations in certain circumstances
. Moreover, random projections have also been applied to classifications for a variety of purposes
LSH(Exact Euclidean Locality Sensitive Hashing) is a special case of random projection, and it was first introduced as approximate near neighbor algorithm
LSH has attracted much attention recently, and has mainly been used in retrieval
. In fact, the data points projected into the same buckets were found to be much more similar than those projected into different buckets. So, if we take a dataset and divide into groups according to its bucket indices, the task of data clustering has already be achieved to a certain approximation. What’s more, E
LSH is a data independent method, so it can create a dynamic index for an incremental dataset. In this way, E
LSH can be used as a dynamic clustering method. In fact, the introducer of LSH has pointed out that LSH can serve as a fast clustering algorithm, but he did this without going on to validate the claim. In fact, E
LSH has even been applied to noun clustering by Ravichandran D
. However, clustering image data is more difficult than text word data because of its added complexity.
Therefore, we proposed an enhanced Locality Sensitive Clustering (LSC) method for use in high dimensional space based on E
LSH. This method takes advantage of Locality Sensitive Hashing (LSH) and can cluster high dimension data at a high speed.
2. ENHANCED LOCALITY SENSITIVE CLUSTERING BASED ON RANDOM PROJECTION
The main property of random projection is dimension reduction. Compared with the classical dimension reduction algorithm PCA (Principal Component Analysis), random projection offers many benefits
. For example, generally PCA can’t be used to reduce the dimension of a mixture of
Gaussians to below (Ω
), whereas random projection can reduce the dimension to just Ω(log
). Random Projection has another tremendous benefit, even if the original Gaussians are highly skewed, their projected counterparts will be more spherical. This is a great advantage as it is much easier to design algorithms for spherical clusters than ellipsoidal ones.
As for data clustering, the Johnson-Lindenstrauss Lemma shows that the distances to data points are preserved after projection. This makes it capable for use in approximate nearest neighbor search when performing information retrieval.
- 2.1 The distance preservation of random projection
The Johnson-Lindenstrauss Lemma is famous for its distance preservation property. It can be described like this: Given 0<ε<1 and any set
, for a positive integer
, there exists a map
, such that for all
This lemma says that all pairwise distances are preserved, with high probability, up to 1±γ after mapping.
(0,1) denote the standard normal distribution with mean 0 and variance 1, and
(-1,1) denote the distribution that has the probability 1/2 on -1 and probability 1/2 on 1.
is the random matrix, whose entries are chosen independently from either or . Then
Imagine a set
of data in some high-dimensional space
, and suppose that we randomly project the data down to
. By the Johnson-Lindenstrauss Lemma,
|) is sufficient so that, with high probability, all angles between points change by at most ±γ/2
. In particular, consider projecting all points in
and the target vector ω, if initially data was separable by margin γ, then after projection, since angles with ω have changed by at most γ/2, the data will still be separable.
- 2.2 Locality sensitive clustering
LSH is a special case of LSH (Locality Sensitive Hashing), and it is a random projection based method. This can be illustrated by the definition of the hashing function. The
hashing functions are generated by random methods, and their innerproduct performs the data projection. This, however, is different from general random projection, because each data point is projected by
hashing functions, and as such results in
bucket indices that represent an original point. The
hashing functions also indicate a difference from general random projection in two ways. The first is the projection itself, the projection was not performed on the whole axis in one direction, but on parts of the axis. The second is that general projection is done by a matrix operation (a data point multiplies a
random projection matrix A), but in E
LSH, a data point multiplies a single hashing vector, as such the matrix operation is omitted in E
LSH. This is of vital importance for large-scale data processing, because the multiple matrix operations needed by traditional algorithms are impractical due to high computation and memory costs.
LSH can also be used for data clustering. Based on the separability description above, we can say a dataset’s distance, margin and angle could be preserved after projection. These properties make E
LSH feasible for data clustering. In fact, the bucket indices found after projection can be used to group data points. This comes from the fact that similar data are arranged into a single bucket or the adjacent several buckets. So if we group these into a cluster, and further group all similar groups into corresponding clusters, the goal of data clustering can be achieved. This is the main idea behind Locality Sensitive Clustering.
LSH is based on the
- stable distribution function, its single hashing function is defined as:
where a is a n-dimension vector generated by the
- stable distribution function, and the inner-product (
) works as a single channel random projection,
is the offset added to the random projection, and the module operation ensure the projected value (bucket index) is in a specific range.
The projection function is similar to LSH and projects points in ℝ
The enhanced Locality Sensitive Clustering (LSC) includes several main steps. Firstly, optimal parameters
are computed. Secondly, the hashing function for point v
is constructed. Thirdly, all points are projected to bucket index (
), and the bucket indices of all points are clustered to get the cluster labels. The procedure for enhanced Locality Sensitive Clustering (LSC) is shown below:
Step 1, the optimal parameterskandLare calculated for a datasetS.
Step 2,kdimension vectorAis generated from the Gaussian distribution,bandware also generated according the definition of the LSH function.
Step 3, all the pointsvi∈Sare projected to akdimension bucket indexBi, and these bucket indices constitute a matrixB.
Step 4, randomly selected one columnBjfrom matrixB, and cluster bucket indicesBjto getnclusters wherej∈[l,m], m=|S| andnis the number of clusters.
denotes the cluster labels,
. Step 5, assign points in
denotes the points whose bucket index is
- 3.1 Experiments on an image dataset
To verify the effect of the new clustering method on real data, we constructed an image set from the TRECVID image dataset, the set contains four categories and 75 images total. The 4 categories are ‘compere’, ‘singer’, ‘rice’ and ‘sports’.
To compare the performance of our new algorithm, we first ran k-means on this image set. The results of k-means on the image dataset are showed in
Clustering results of k-means for image dataset.
The result of LSC on the image dataset is shown in
. Most of the cluster labels in cluster 2 and 4 are correct, the clustering labels of cluster 3 are correct while several cluster labels of cluster 1 are wrong. We can see that the accuracy of clustering labels are less than in the synthetic data, this is because the distinctness of inter-clusters are less than in the synthetic data. However, results on real data are more meaningful.
The clustering results of LSC for image dataset.
As LSC is a randomized algorithm, it is understandable that its clustering results are less accurate than k-means. On the other hand, the advantage of LSC lies in its low computation cost, fast running speed and the dynamic clustering which comes from E
To compare the accuracy of the two methods, we computed a MAP of the clustering results for the 4 classes using LSC, k-means, Affinity Propagation (AP) cluster and Spectrum Cluster (SC) methods. The results are showed in
. It can be seen that the accuracy of LSC is about 0.9, which is less than k-means, AP and SC.
Accuracy of clustering results for four clustering methods on the image dataset.
To compare clustering times, we ran the four cluster methods (LSC, k-means, AP cluster and SC) on the image dataset. The results are showed in
. It can be seen that LSC consumes the least time while AP and SC cost significantly more time than LSC and k-means.
Clustering times of the four clustering methods on the image dataset
Clustering times of the four clustering methods on the image dataset
- 3.2 Experiments on an incremental dataset
To conveniently see the running time of our new clustering method, we construct a synthetic dataset 1 and synthetic dataset 2. Synthetic dataset 1 is an incremental dataset in scale, which contains 5 clusters of 100 dimension pieces of data, the number of data points increases from 1,000 to 10,000. Synthetic dataset 2 is an incremental dataset in dimension, which contains 5 clusters of 1,000 data points, the dimension of the points increases from 10 dimension data to 100 dimension data.
The running time of LSC and k-means for these two datasets is shown in
Figure 4 (a)
indicates that the running time of k-means increases much more quickly than LSC when the number of data points increases from 1,000 to 10,000. When the number of data points is at 1,000, the time cost of k-means is at least 100 times more than that of LSC, and when the number of data points increases to 1,000, the time cost of k-means is at least 1,000 times more than that of LSC. Therefore, the advantage of LSC becomes more notable in larger scale datasets. In
Fig. 4 (b)
, we can see the higher the dimensions of the data used is, the more no-table the advantage of LSC is. Even in with low dimension data of 10, the time cost of k-means is at least 10 times more than LSC. To compare the cluster accuracy, MAPs of the two methods were calculated for these two datasets. The results are shown in
. The figure indicates that the cluster accuracy of LSC is similar to k-means, and may be better than k-means on some datasets. Therefore, LSC is more suitable for incremental datasets, and it can be a good cluster method for both high dimension and large-scale datasets.
Running times of the two methods for the two datasets. (a) running times for synthetic dataset 1 and (b) running times for synthetic dataset 2.
MAP of the two methods for the two datasets. (a) MAP of the two methods for synthetic dataset 1 and (b) MAP of the two methods for synthetic dataset 2.
To improve the feasibility of high dimensional data clustering, especially for use in image clustering, an enhanced Locality Sensitive Clustering method based on E
LSH is presented. This method first generates multiple hashing functions, then it projects each point using these hashing functions to get bucket indices. The bucket indices are then clustered to get class labels. In terms of clustering accuracy, experimental results show that LSC’s performance is close to k-means, AP and SC. The advantage of LSC is its fast running speed that makes it suitable for incremental clustering. This property makes it a very valuable method for large dataset clustering and especially for incremental dataset clustering.
This work was supported by Nature Science Foundation of China No. 60872142.
Projective ART for Clustering Datasets in High Dimensional Spaces (Neural Networks,15, 2002)
Proc. of SIGMOD Record ACM Special Interest Group on Management of Data
Proc. of IEEE Conference on Computer Vision and Pattern Recognition[C]
Proc. of IEEE Conference on Computer Vision and Pattern Recognition
Proc. of ACM SIGMM International Conference on Multimedia Information Retrieval Philadelphia
IEEE Transactions on Pattern Analysis and Machine Intelligence
Frey B. J.
Proc. of the Symposium on Theory of Computing
Randomized Partition Trees for Exact Nearest Neighbor Search JMLR: Workshop and Conference Proc
Schulman L. J.
Clustering for edge-cost minimization
Proc. Annual ACM Symp. Theory of Computing
Donoho D. L.
IEEE Trans. Information Theory
Fowler J. E.
IEEE Transaction on Image Processing
Ensembles of Classifiers Based on Dimensionality Reduction
Random Structures & Algorithms
Balcan M. F.
Smola A. J.
J. Mach. Learn. Res.
Communications of the ACM
Liang Y. Y.
Li J. M.
Proc. of International Conference on Multimedia
International Journal of Computer Vision
Proc. of the 43rd Annual Meeting on Association for Computational Linguistics
Experiments with Random Projection
Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence
Proc. of the 2005 International Conference on Subspace, Latent Structure and Feature Selection
International Conference on Machine Learning