my $clusterer = Algorithm::KMeans->new(datafile => $datafile, mask => $mask, K => $K, cluster_seeding => random, # also try smart use_mahalanobis_metric => 1, # also try 0 terminal_output => 1, write_clusters_to_files => 1, );
A call to new() constructs a new instance of the Algorithm::KMeans class. When $K is a non-zero positive integer, the module will construct exactly that many clusters. However, when $K is 0, the module will find the best number of clusters to partition the data into. As explained in the Description, setting cluster_seeding to smart causes PCA (principal components analysis) to be used for discovering the best choices for the initial cluster centers. If you want purely random decisions to be made for the initial choices for the cluster centers, set cluster_seeding to random.
The data file is expected to contain entries in the following format
where the first column contains the symbolic ID tag for each data record and the rest of the columns the numerical information. As to which columns are actually used for clustering is decided by the string value of the mask. For example, if we wanted to cluster on the basis of the entries in just the 3rd, the 4th, and the 5th columns above, the mask value would be N0111 where the character N indicates that the ID tag is in the first column, the character 0 that the second column is to be ignored, and the 1s that follow that the 3rd, the 4th, and the 5th columns are to be used for clustering.
If you wish for the clusterer to search through a (Kmin,Kmax) range of values for K, the constructor should be called in the following fashion:
where obviously you can choose any reasonable values for Kmin and Kmax. If you choose a value for Kmax that is statistically too large, the module will let you know. Again, you may choose random for cluster_seeding, the default value being smart.
If you believe that the variability of the data is very different along the different dimensions of the data space, you may get better clustering by first normalizing the data coordinates by the standard-deviations along those directions. When you set the constructor option do_variance_normalization as shown below, the module uses the overall data standard-deviation along a direction for the normalization in that direction. (As mentioned elsewhere in the documentation, such a normalization could backfire on you if the data variability along a dimension is more a result of the separation between the means than a consequence of the intra-cluster variability.):
datafile: This parameter names the data file that contains the multidimensional data records you want the module to cluster. mask: This parameter supplies the mask to be applied to the columns of your data file. See the explanation in Synopsis for what this mask looks like. K: This parameter supplies the number of clusters you are looking for. If you set this option to 0, that means that you want the module to search for the best value for K. (Keep in mind the fact that searching for the best K may take a long time for large data files.) Kmin: If you supply an integer value for Kmin, the search for the best K will begin with that value. Kmax: If you supply an integer value for Kmax, the search for the best K will end at that value. cluster_seeding: This parameter must be set to either random or smart. Depending on your data, you may get superior clustering with the random option. The choice smart means that the clusterer will (1) subject the data to principal components analysis to determine the maximum variance direction; (2) project the data onto this direction; (3) find peaks in a smoothed histogram of the projected points; and (4) use the locations of the highest peaks as seeds for cluster centers. If the smart option produces bizarre results, try random. use_mahalanobis_metric: When set to 1, this option causes Mahalanobis distances to be used for clustering. The default is 0 for this parameter. By default, the module uses the Euclidean distances for clustering. In general, Mahalanobis distance based clustering will fail if your data resides on a lower-dimensional hyperplane in the data space, if you seek too many clusters, and if you do not have a sufficient number of samples in your data file. A necessary requirement for the module to be able to compute Mahalanobis distances is that the cluster covariance matrices be non-singular. (Lets say your data dimensionality is D and the module is considering a cluster that has only d samples in it where d is less than D. In this case, the covariance matrix will be singular since its rank will not exceed d. For the covariance matrix to be non-singular, it must be of full rank, that is, its rank must be D.) do_variance_normalization: When set, the module will first normalize the data variance along the different dimensions of the data space before attempting clustering. Depending on your data, this option may or may not result in better clustering. terminal_output: This boolean parameter, when not supplied in the call to new(), defaults to 0. When set, you will see in your terminal window the different clusters as lists of the symbolic IDs and their cluster centers. You will also see the QoC (Quality of Clustering) values for the clusters displayed. write_clusters_to_files: This parameter is also boolean. When set to 1, the clusters are written out to files that are named in the following manner:
cluster0.txt cluster1.txt cluster2.txt ... ...
Before the clusters are written to these files, the module destroys all files with such names in the directory in which you call the module.
<B>B>read_data_from_file()<B>B> $clusterer->read_data_from_file() <B>B>kmeans()<B>B> $clusterer->kmeans(); or my ($clusters_hash, $cluster_centers_hash) = $clusterer->kmeans();
The first call above works solely by side-effect. The second call also returns the clusters and the cluster centers. See the cluster_and_visualize.pl script in the examples directory for how you can in your own code extract the clusters and the cluster centers from the variables $clusters_hash and $cluster_centers_hash.
This call makes sense only if you supply either the K=0 option to the constructor, or if you specify values for the Kmin and Kmax options. The K=0 and the (Kmin,Kmax) options cause the module to determine the best value for K. Remember, K is the number of clusters the data is partitioned into.
presents a table with K values in the left column and the corresponding QoC (Quality-of-Clustering) values in the right column. Note that this call makes sense only if you either supply the K=0 option to the constructor, or if you specify values for the Kmin and Kmax options.
<B>B>visualize_clusters()<B>B> $clusterer->visualize_clusters( $visualization_mask )
The visualization mask here does not have to be identical to the one used for clustering, but must be a subset of that mask. This is convenient for visualizing the clusters in two- or three-dimensional subspaces of the original space.
<B>B>visualize_data()<B>B> $clusterer->visualize_data($visualization_mask, original); $clusterer->visualize_data($visualization_mask, normed);
This method requires a second argument and, as shown, it must be either the string original or the string normed, the former for the visualization of the raw data and the latter for the visualization of the data after its different dimensions are normalized by the standard-deviations along those directions. If you call the method with the second argument set to normed, but do so without turning on the do_variance_normalization option in the KMeans constructor, it will let you know.
<B>B>which_cluster_for_new_data_element()<B>B> If you wish to determine the cluster membership of a new data sample after you have created the clusters with the existing data samples, you would need to call this method. The which_cluster_for_new_data.pl script in the examples directory shows how to use this method. <B>B>which_cluster_for_new_data_element_mahalanobis()<B>B> This does the same thing as the previous method, except that it determines the cluster membership using the Mahalanobis distance metric. As for the previous method, see the which_cluster_for_new_data.pl script in the examples directory for how to use this method. <B>B>cluster_data_generator()<B>B> Algorithm::KMeans->cluster_data_generator( input_parameter_file => $parameter_file, output_datafile => $out_datafile, number_data_points_per_cluster => 20 );
for generating multivariate data for clustering if you wish to play with synthetic data for clustering. The input parameter file contains the means and the variances for the different Gaussians you wish to use for the synthetic data. See the file param.txt provided in the examples directory. It will be easiest for you to just edit this file for your data generation needs. In addition to the format of the parameter file, the main constraint you need to observe in specifying the parameters is that the dimensionality of the covariance matrix must correspond to the dimensionality of the mean vectors. The multivariate random numbers are generated by calling the Math::Random module. As you would expect, this module requires that the covariance matrices you specify in your parameter file be symmetric and positive definite. Should the covariances in your parameter file not obey this condition, the Math::Random module will let you know.
When the option terminal_output is set in the call to the constructor, the clusters are displayed on the terminal screen.
When the option write_clusters_to_files is set in the call to the constructor, the module dumps the clusters in files named
cluster0.txt cluster1.txt cluster2.txt ... ...
in the directory in which you execute the module. The number of such files will equal the number of clusters formed. All such existing files in the directory are destroyed before any fresh ones are created. Each cluster file contains the symbolic ID tags of the data samples in that cluster.
The module also leaves in your directory a printable .png file that is a point plot of the different clusters. The name of this file is always clustering_results.png.
Also, as mentioned previously, a call to kmeans() in your own code will return the clusters and the cluster centers.
This module requires the following three modules:
With regard to the third item above, what is actually required is the Math::GSL::Matrix module. However, that module is a part of the Math::GSL distribution. The acronym GSL stands for the GNU Scientific Library. Math::GSL is a Perl interface to the GSL C-based library.
The examples directory contains several scripts to help you become familiar with this module. The following script is an example of how the module can be expected to be used most of the time. It calls for clustering to be carried out with a fixed K:
The more time you spend with this script, the more comfortable you will become with the use of this module. The script file contains a large comment block that mentions six locations in the script where you have to make decisions about how to use the module.
See the following script if you do not know what value to use for K for clustering your data:
This script uses the K=0 option in the constructor that causes the module to search for the best K for your data. Since this search is virtually unbounded --- limited only by the number of samples in your data file --- the script may take a long time to run for a large data file. Hence the next script.
If your datafile is too large, you may need to range limit the values of K that are searched through, as in the following script:
If you also want to include data normalization (it may reduce the performance of the clusterer in some cases), see the following script:
When you include the data normalization step and you would like to visualize the data before and after normalization, see the following script:
After you are done clustering, lets say you want to find the cluster membership of a new data sample. To see how you can do that, see the script:
This script returns two answers for which cluster a new data sample belongs to: one using the Euclidean metric to calculate the distances between the new data sample and the cluster centers, and the other using the Mahalanobis metric. If the clusters are strongly elliptical in shape, you are likely to get better results with the Mahalanobis metric. (To see that you can get two different answers using the two different distance metrics, run the which_cluster_for_new_data.pl script on the data in the file mydatafile3.dat. To make this run, note that you have to comment out and uncomment the lines at four different locations in the script.)
The examples directory also contains the following support scripts:
For generating the data for experiments with clustering:
For cleaning up the examples directory:
The examples directory also includes a parameter file, param.txt, for generating synthetic data for clustering. Just edit this file if you would like to generate your own multivariate data for clustering. The parameter file is for the 3D case, but you can generate data with any dimensionality through appropriate entries in the parameter file.
None by design.
K-Means based clustering usually does not work well when the clusters are strongly overlapping and when the extent of variability along the different dimensions is different for the different clusters. The module does give you the ability to normalize the variability in your data with the constructor option do_variance_normalization. However, as described elsewhere, this may actually reduce the performance of the clusterer if the data variability along a direction is more a result of the separation between the means than because of intra-cluster variability. For better clustering with difficult-to-cluster data, you could try using the authors Algorithm::ExpectationMaximization module.
Please notify the author if you encounter any bugs. When sending email, please place the string KMeans in the subject line.
Download the archive from CPAN in any directory of your choice. Unpack the archive with a command that on a Linux machine would look like:
tar zxvf Algorithm-KMeans-2.05.tar.gz
This will create an installation directory for you whose name will be Algorithm-KMeans-2.05. Enter this directory and execute the following commands for a standard install of the module if you have root privileges:
perl Makefile.PL make make test sudo make install
If you do not have root privileges, you can carry out a non-standard install the module in any directory of your choice by:
perl Makefile.PL prefix=/some/other/directory/ make make test make install
With a non-standard install, you may also have to set your PERL5LIB environment variable so that this module can find the required other modules. How you do that would depend on what platform you are working on. In order to install this module in a Linux machine on which I use tcsh for the shell, I set the PERL5LIB environment variable by
setenv PERL5LIB /some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/
If I used bash, Id need to declare:
I thank Slaven for pointing out that I needed to downshift the required version of Perl for this module. Fortunately, I had access to an old machine still running Perl 5.10.1. The current version, 2.02, is based on my testing the module on that machine.
I added two which_cluster methods in Version 2.0 as a result of an email from Jerome White who expressed a need for such methods in order to determine the best cluster for a new data record after you have successfully clustered your existing data. Thanks Jerome for your feedback!
It was an email from Nadeem Bulsara that prompted me to create Version 1.40 of this module. Working with Version 1.30, Nadeem noticed that occasionally the module would produce variable clustering results on the same dataset. I believe that this variability was caused (at least partly) by the purely random mode that was used in Version 1.30 for the seeding of the cluster centers. Version 1.40 now includes a smart mode. With the new mode the clusterer uses a PCA (Principal Components Analysis) of the data to make good guesses for the cluster centers. However, depending on how the data is jumbled up, it is possible that the new mode will not produce uniformly good results in all cases. So you can still use the old mode by setting cluster_seeding to random in the constructor. Thanks Nadeem for your feedback!
Version 1.30 resulted from Martin Kalin reporting problems with a very small data set. Thanks Martin!
Version 1.21 came about in response to the problems encountered by Luis Fernando DHaro with version 1.20. Although the module would yield the clusters for some of its runs, more frequently than not the module would abort with an empty cluster message for his data. Luis Fernando has also suggested other improvements (such as clustering directly from the contents of a hash) that I intend to make in future versions of this module. Thanks Luis Fernando.
Chad Aeschliman was kind enough to test out the interface of this module and to give suggestions for its improvement. His key slogan: If you cannot figure out how to use a module in under 10 minutes, its not going to be used. That should explain the longish Synopsis included here.
Avinash Kak, email@example.com
If you send email, please place the string KMeans in your subject line to get past my spam filter.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Copyright 2014 Avinash Kak
|perl v5.20.3||ALGORITHM::KMEANS (3)||2014-12-12|